Unit 2.3 Extracting Information from Data, Pandas
Data connections, trends, and correlation. Pandas is introduced as it could be valuable for PBL, data validation, as well as understanding College Board Topics.
**All the writing in this blog that is BLUE is my notes (besides some links)**.
Pandas and DataFrames
In this lesson we will be exploring data analysis using Pandas.
- College Board talks about ideas like
- Tools. "the ability to process data depends on users capabilities and their tools"
- Combining Data. "combine county data sets"
- Status on Data"determining the artist with the greatest attendance during a particular month"
- Data poses challenge. "the need to clean data", "incomplete data"
- From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd
import pandas as pd
df = pd.read_json('files/grade.json')
print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?
This Dataframe is missing data points, has invalid data, and inaccurate Data. For index 9, the student ID is missing and it labeled 'nil'. In index 4 and 8, the Year in school doesn't match with the format of the other values in Year in School. In index 5, the year in school is 20 but it can only be 9-12.
print(df[['GPA']])
print()
#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
print(df.sort_values(by=['GPA']))
print()
#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
print(df[df.GPA > 3.00])
print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
import pandas as pd
#the data can be stored as a python dictionary
dict = {
"calories": [420, 380, 390, 999],
"duration": [50, 40, 45, 999]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(dict)
print(df)
print("----------Dict_to_DF_labels--------------")
#or with the index argument, you can label rows.
df = pd.DataFrame(dict, index = ["day1", "day2", "day3", "day4"])
print(df)
print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])
#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
print(df.info())
import pandas as pd
#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)
print("--Duration Top 10---------")
print(df.head(10))
print("--Duration Bottom 10------")
print(df.tail(10))
'''Pandas can be used to analyze data'''
import pandas as pd
import requests
def fetch():
'''Obtain data from an endpoint'''
url = "https://flask.nighthawkcodingsociety.com/api/covid/"
fetch = requests.get(url)
json = fetch.json()
# filter data for requirement
df = pd.DataFrame(json['countries_stat']) # filter endpoint for country stats
print(df.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths
fetch()
- Does it have a good sample size?
The sample size is 9 but is consided a good sample of the code because it uses ALL the 9 episodes of star wars to determine which episode was the most loved and most hated. Doesn't including another other movies and doesn't excluded any of the episodes.
- Is there bias in the data?
Most of the values in the subcategories are determined by how much people liked or disliked a given star wars movie, which involves personal bias. However, the computer doesn't have bias because it is using correct and clean data and is not including and/or excluded pieces of infomation when running.
- Does the data set need to be cleaned?
The data set below doesn't need to be cleaned. It was created with the intent to have clean and correct data.
- What is the purpose of the data set?
The purpose of the data set is to find out which star wars movie is loved the most and loved the least by both Audience score and Rotten Tomatoes. It is to help people know what both parties think of a moive and determine which one is favored by both parties.
import pandas as pd
#the data is stored in a python dictionary
dict = {
"Audience": [59, 56, 66, 96, 97, 94, 85, 42, 86], # Audience Score
"Rotten Tomatoes": [51, 65, 79, 93, 94, 83, 93, 91, 52], # Rotten Tomatoes Score
"Duration": [133, 143, 140, 121, 124, 133, 136, 152, 142], # Movie's Runtime
"money made(millions)":[431, 302, 380, 307, 209, 252, 936, 620, 515] # Money the movie made in the millions
}
#or with the index argument, you can label rows.
df = pd.DataFrame(dict, index = ["The Phantom Menace", "Attack of the Clones", "Revenge of the Sith", "A New Hope", "Empire Strikes Back", "Return of the Jedi", "The Force Awakens", "The Last Jedi", "The Rise of Skywalker"])
df['Loved by all'] = 100 * ((df['Rotten Tomatoes'] + df['Audience']) / 200) # finds the ratings total and divides it by the highest possible rating and *100 to make it a %.
dfLovedbyAll = (df.sort_values(by='Loved by all', ascending=False)) # sets the base list to the variable dfLovedbyAll
print(dfLovedbyAll[['Audience','Rotten Tomatoes','Loved by all']].to_string(index=True)) # prints the data based on % that all rated the moive (Highest to lowest).
print() #spaces out the data
print(dfLovedbyAll[['Audience','Rotten Tomatoes','Loved by all']].head(1)) # prints the Highest Loved by All %.
print() #spaces out the data
print(dfLovedbyAll[['Audience','Rotten Tomatoes','Loved by all']].tail(1)) # prints the Lowest Loved by All %.