# Python for Data Analysis

### Research Computing Services

Instructor: Brian Gregor
Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/)  
Tutorial materials: [http://rcs.bu.edu/examples/python/DataAnalysis](http://rcs.bu.edu/examples/python/DataAnalysis)  
Contact us: help@scc.bu.edu

## Course Content
1. Python packages for data scientists
2. Data manipulation with Pandas
3. Basic data plotting
4. Descriptive statistics
5. Inferential statistics


## Tutorial Evaluation
After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)

## Python packages for data scientists
- [NumPy](https://numpy.org)
    - Introduces objects for handling n-dimensional arrays such as vectors (1-D) and matrices (2-D).
    - Introduces functions to perform advanced mathematical and statistical operations on these objects.
    - Provides vectorization of mathematical operations on arrays and matrices which significantly improves performance.
    - Many other Python libraries are built on NumPy
- [SciPy](https://scipy.org)
    - An enormous collection of algorithms for statistics, linear algebra, optimization, differential equations, numerical integration, and more.
    - Built on NumPy.
- [Pandas](https://pandas.pydata.org)
    - Adds data structures and tools designed to work with table-like data (similar to Vectors and Data Frames in R)
    - Provides tools for data manipulation: *reshaping*, *merging*, *sorting*, *slicing*, *aggregation*, etc.
    - Makes it easy to handle missing data
      
- [SciKit-Learn](https://scikit-learn.org/stable/)
    - Provides machine learning algorithms: classification, regression, clustering, model validation, etc.
    - Built on NumPy, SciPy, and matplotlib.

- Machine Learning libraries
    - [Pytorch](https://pytorch.org/)
    - [Tensorflow](https://www.tensorflow.org/)
    - [Jax](https://github.com/jax-ml/jax)
    - For more info on using these on the SCC see [this page](https://www.bu.edu/tech/support/research/software-and-programming/common-languages/python/python-ml/).

- Pandas alternatives
  - Pandas is very popular, but several alternatives exist
  - [Dask](https://www.dask.org/) - process large scale data in parallel, built on Pandas.
  - [Modin](https://github.com/modin-project/modin) - another library for scaling up Pandas to large datasets.
  - [Polars](https://pola.rs/) - Similar functionality (but not built on Pandas), fast, parallel processing, gaining in popularity.
    
### Visualization
More in-depth look at visualization in the `Data Visualization in Python` course.
- [matplotlib](https://matplotlib.org/)
    - Python 2-D plotting library for publication quality figures in a variety of hardcopy formats
    - Functionalities similar to MATLAB
    - Line plots, scatter plots, bar charts, histograms, pie charts, etc.
    - Effort needed to create advanced visualizations
- [seaborn](https://seaborn.pydata.org/)
    - Based on matplotlib
    - Provides a high-level interface for drawing attractive statistical graphs
    - Similar to the ggplot2 library in R
- [plotly](https://plotly.com/python/)
    - over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.
    - Built on top of the Plotly JavaScript library
    - Can create beautiful interactive web-based visualizations
- [Datashader](https://datashader.org/)
    - Used to create visualizations and plots from very large datasets.

## Loading Python libraries

In [None]:
# Press shift-enter to execute a Jupyter notebook cell
# Import the Python Libraries used in the tutorial
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Pandas
The main focus of this tutorial is using the Pandas library to manipulate and analyze data.

Pandas is a Python package that deals mostly with:
- **Series**  (1-D homogeneous array)
    - the array has 1 data type (int, floating point, etc)
- **DataFrame** (2-D labeled heterogeneous array)
    - each column has a specific data type
- **MultiIndex** (for hierarchical data)
- **Xarray** (a separate package built on top of Pandas for n-D arrays)

The Pandas content of this tutorial will cover:
* Creating and understanding Series and DataFrames
* Importing/Reading data
* Data selection and filtering
* Data manipulation via sorting, grouping, and rearranging
* Handling missing data


In addition we will also provide information on the following.
* Basic data plotting
* Descriptive statistics (time permitting)
* Inferential statistics (time permitting)

### Pandas Series

A Pandas *Series* is a 1-dimensional labeled array containing data of the same type (integers, strings, floating point numbers, Python objects, etc.). It is a generalized NumPy array with an explicit axis called the *index*.

In [None]:
# Example of creating Pandas series :
# Order all S1 together
s1 = pd.Series([-3, -1, 1, 3, 5])
print(s1)

![image.png](attachment:e141912c-cf3c-4599-a21b-4c3c3f2f6785.png)

We did not pass any index, so by default Pandas assigned index values ranging from `0` to `len(data)-1`. Contrast this with a Python list, which always has an implicit index that counts from 0:
```
x = [10,20,30]
y = x[1]  # y --> 20
```
and also with a Python dictionary, where the keys act as an index:
```
x = {'a':10, 'b':20, 'c':30}
y = x['b']  # y --> 20
```
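
A Series supports both styles of lookup. The following is a minimal sketch, assuming an illustrative Series named `s` with made-up labels (it is not part of the tutorial data):

```python
import pandas as pd

# Illustrative Series with an explicit string index (labels are made up)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s['b']        # label-based lookup, like a dictionary key
by_position = s.iloc[1]  # position-based lookup, like a list index

print(by_label, by_position)
```

Both lookups return the same element here because the label `'b'` sits at position 1.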

In [None]:
# View index values
print(s1.index)

In [None]:
s1[:2] # First 2 elements

In [None]:
print(s1[[2,1,0]])  # Elements out of order

In [None]:
type(s1)

In [None]:
# Can place filtering conditions on series
s1[s1 > 0]

In [None]:
# Creating Pandas series with index:
# fetch a random number generator object.
rng = np.random.default_rng()
# select 5 points from a normal (Gaussian) distribution.
s2 = pd.Series(rng.normal(size=5), index=['a', 'b', 'c', 'd', 'e'])
print(s2)

In [None]:
# View index values
print(s2.index)

In [None]:
# Create a Series from dictionary
data = {'pi': 3.14159, 'e': 2.71828}  # dictionary
print(data)
s3 = pd.Series(data)
print(s3)

In [None]:
# Create a new series from a dictionary and reorder the elements
s4 = pd.Series(data, index=['e', 'pi', 'tau'])
print(s4)

NaN (Not a Number) is used to mark a missing value in Pandas. Here `tau` is not a key in the dictionary, so its value is NaN.
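
As a small sketch of how missing values can be detected, the example below rebuilds the same kind of Series under the assumed name `s` and counts its NaN entries. Most Pandas reductions, such as `sum()`, skip NaN by default.

```python
import pandas as pd

data = {'pi': 3.14159, 'e': 2.71828}
# 'tau' is not a key in the dictionary, so its value is filled in as NaN
s = pd.Series(data, index=['e', 'pi', 'tau'])

missing = s.isna()              # boolean Series: True where the value is missing
n_missing = int(missing.sum())  # number of missing values
total = s.sum()                 # NaN values are skipped by default (skipna=True)

print(missing)
print(n_missing, total)
```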

In [None]:
# Series can be treated as a 1-D array and you can apply functions to them:
print("Median:", s4.median())

In [None]:
# Methods can be used to filter series:
s4[s4 > s4.median()]

### Attributes and Methods:
An attribute is a variable stored in the object, e.g., index or size with Series.
A method is a function stored in the object, e.g., head() or median() with Series.

|  Attribute/Method | Description |
|-----|-----|
| dtype | data type of values in series |
| empty | True if series is empty |
| size | number of elements |
| values | Returns values as ndarray |
| head() | First n elements |
| tail() | Last n elements |

Execute *dir(s1)* to see all attributes and methods. 

I recommend using the online documentation as well. It is in a much easier format to read and comes with examples.
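
The difference between attributes and methods can be sketched as follows, using an illustrative Series named `s` (the variable names are assumptions made for this example):

```python
import pandas as pd

s = pd.Series([-3, -1, 1, 3, 5])

# Attributes: no parentheses, they simply return stored information
n_elements = s.size
kind = s.dtype

# Methods: called with parentheses and may take arguments
first_two = s.head(2)
middle = s.median()

print(n_elements, kind, middle)
print(first_two)
```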



In [None]:
# For more information on a particular method or attribute use the help() function
help(s4.head)

In [None]:
help(s4.index)

In [None]:
# You can also add a question mark to get help information
s4.head?

In [None]:
s4.index?

Pandas Series can also [plot](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) themselves:

In [None]:
# Going back to s2
s2.plot()

One final way to get help is to press shift-tab when you are in the parentheses of a method or after an attribute. Try this in the exercise below.

### Exercise - Create your own Series

In [None]:
# Create a series with 10 elements containing both positive and negative integers
# Examine the series with the head() method
# mys = pd.Series(  ...  )

A Series can also be created from a NumPy array or from another Series. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series)

In [None]:
# create a Series from a numpy array
a=np.array(range(6))  # numbers 0 thru 5
print(f'numpy: {a}')
ser = pd.Series(a)
print(f'series:\n{ser}')
# Change an element of the numpy array
a[0] = -100
# Print the series again
print(f'numpy: {a}')
print(f'series:\n{ser}')

In [None]:
# Tell pandas to make a copy of the numpy array.
ser2 = pd.Series(a, copy=True)
print(f'series2:\n{ser2}')
# change the array again
a[-1]=1000
# ser2 was built with a copy of "a", so no changes.
print(f'series2:\n{ser2}')

### Pandas DataFrames

A Pandas *DataFrame* is a 2-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns. You can think of it as a dictionary-like container to store Pandas Series objects.

In [None]:
d = pd.DataFrame({'Name': pd.Series(['Alice', 'Bob', 'Chris']), 
                  'Age': pd.Series([21, 25, 23])})
print(d)

In [None]:
d2 = pd.DataFrame(np.array([['Alice','Bob','Chris'], [21, 25, 23]]).T, columns=['Name','Age'])

In [None]:
# Use the head() method to print the first 5 records in the dataframe (same as with series)
d2.head()

In [None]:
# Add a new column to d2:
d2['Height'] = pd.Series([5.2, 6.0, 5.6])
d2.head()

In [None]:
# Add your own index:
d3 = d2.copy()
d3['my_index'] = ['person1','person2','person3']
# assign the values in my_index as the new index
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
d3 = d3.set_index('my_index', drop=True) # remove my_index afterwards
d3

In [None]:
# If you don't like an index, you can remove it and reset it to the usual one 0...len(df)-1
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index
d4 = d3.reset_index(drop=True)  # What do you get if drop=False?
d4

In [None]:
# Combine dataframes. There's a bunch of ways. Here let's stack d2 onto d4:
#https://pandas.pydata.org/docs/reference/api/pandas.concat.html
d5 = pd.concat([d2,d4], axis=0)
d5
# See here for a discussion of a bunch of ways: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [None]:
# The index has gotten weird...fix it.
d5 = d5.reset_index(drop=True)
d5

### Reading data using Pandas
You can read CSV (comma separated values) files using Pandas. The command shown below reads a CSV file into the Pandas dataframe df.

In [None]:
# Read a csv file into Pandas Dataframe
df = pd.read_csv("http://rcs.bu.edu/examples/python/DataAnalysis/Salaries.csv")

The above command has many optional arguments that you can find in the Pandas documentation online.
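
As a hedged illustration of a few of those optional arguments, the sketch below reads a small made-up CSV from an in-memory string instead of a file. The data and the name `df_small` are assumptions for this example only.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a file on disk (contents are made up)
csv_text = """name;age;height
Alice;21;5.2
Bob;NA;6.0
Chris;23;missing
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                            # field delimiter (the default is ',')
    na_values=['missing'],              # extra strings to treat as missing values
    usecols=['name', 'age', 'height'],  # columns to keep
)

print(df_small)
print(df_small.dtypes)
```

The string `NA` is recognized as missing by default, while `missing` is only treated that way because it was listed in `na_values`.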

You can also read many other formats, for instance:
* Excel - pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
* Stata - pd.read_stata('myfile.dta')
* SAS - pd.read_sas('myfile.sas7bdat')
* HDF - pd.read_hdf('myfile.h5', 'df')

Before we can perform any analysis on the data we need to


*   Check if the data is correctly imported 
*   Check the types of each column
*   Determine how many missing values each column has

We can then carefully prepare the data:

*   Remove columns that are not needed in our analysis
*   Rename the columns (if necessary)
*   Possibly rearrange the columns to make it easier to work with them
*   Create new or modify existing columns (e.g., convert into different units) if necessary
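
The checks and preparation steps listed above can be sketched on a tiny made-up table. The frame `raw`, its column names, and the unit conversion are all assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a freshly imported table
raw = pd.DataFrame({
    'Height (in)': [62.0, 72.0, np.nan],
    'Name': ['Alice', 'Bob', 'Chris'],
    'unused': [0, 0, 0],
})

# Checks: column types and number of missing values per column
print(raw.dtypes)
n_missing = raw.isnull().sum()
print(n_missing)

# Preparation: drop, rename, convert units, and reorder columns
clean = raw.drop(columns=['unused'])
clean = clean.rename(columns={'Height (in)': 'height_in', 'Name': 'name'})
clean['height_cm'] = clean['height_in'] * 2.54
clean = clean[['name', 'height_in', 'height_cm']]
print(clean)
```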

In [None]:
# Display the first 10 records
df.head(10)

In [None]:
# Display structure of the data frame
df.info()

### More details on DataFrame data types

|Pandas Type | Native Python Type | Description |
|------------|--------------------|-------------|
| object | string | The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings).|
| int64  | int | Integer numbers. 64 refers to the number of bits allocated to hold each value. |
| float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, because NaN is itself a floating point value. |
| datetime64, timedelta\[ns\]| N/A (but see the datetime module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments. |
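
A short sketch of how these dtypes are assigned, using three illustrative Series (the names `ints`, `with_gap`, and `mixed` are assumptions for this example):

```python
import pandas as pd

ints = pd.Series([1, 2, 3])         # integers only
with_gap = pd.Series([1, 2, None])  # integers plus a missing value
mixed = pd.Series([1, 'two', 3.0])  # numbers and strings mixed

print(ints.dtype)      # an integer dtype
print(with_gap.dtype)  # a float dtype, because the missing value is stored as NaN
print(mixed.dtype)     # object
```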


### DataFrame attributes
|df.attribute | Description |
|-------------|-------------|
| dtypes | list the types of the columns |
| columns | list the column names |
| axes | list the row labels and column names |
| ndim | number of dimensions |
| size | number of elements |
| shape | return a tuple representing the dimensionality |
| values | numpy representation of the data |

### Dataframe methods
|df.method() | Description |
|-------------|-------------|
| head(\[n\]), tail(\[n\]) | first/last n rows |
| describe() | generate descriptive statistics (for numeric columns only) |
| max(), min() | return max/min values for all numeric columns |
| mean(), median() | return mean/median values for all numeric columns |
| std() | standard deviation |
| sample(\[n\]) | returns a random sample of n elements from the data frame |
| dropna() | drop all the records with missing values |

Sometimes the column names in the input file are too long or contain special characters. In such cases we rename them to make it easier to work with these columns.

In [None]:
# Let's create a copy of this dataframe with new column names
# If we do not want to create a new data frame, we can add inplace=True argument
df_new = df.rename(columns={'sex': 'gender', 'phd': 'yearsAfterPhD', 'service': 'yearsOfService'})
df_new.head()

### DataFrame Exploration

In [None]:
# Identify the type of df_new object
type(df_new)

In [None]:
# Check the data type of the column "salary"
# We access columns using the brackets, e.g., df['column_name']
df_new['salary'].dtype

In [None]:
# If the column name has no spaces, complex symbols, and is not the name of an attribute/method
# you can use the syntax df.column_name
df_new.salary.dtype

In [None]:
# List the types of all columns
df_new.dtypes

In [None]:
# List the column names
df_new.columns

In [None]:
# List the row labels and the column names
df_new.axes

In [None]:
# Number of rows and columns
df_new.shape

In [None]:
# Total number of elements in the Data Frame (78 x 6)
df_new.size

In [None]:
# Output some descriptive statistics for the numeric columns
# On a large dataframe this can take a long time to calculate
df_new.describe()

In [None]:
# Remember we can use the ? to get help about the function
df_new.describe?

#### Adding columns

Here are two ways to add a column:

In [None]:
# Create a new column using bracket assignment
# temporarily make a copy for demonstration purposes.
df_copy = df_new.copy()

# vectorized computation. This syntax is used to MODIFY a dataframe to
# contain a new column.
df_copy['salary_k'] = df_copy['salary'] / 1000.0
df_copy.head(10)

In [None]:
# Create a new column using the assign method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
# Assign returns a copy of df_new with a new column attached. df_copy2 is a brand new dataframe.
df_copy2 = df_new.assign(salary_k=df_new['salary']/1000.0)
df_copy2.head(10)

In [None]:
# List the unique values in a column
# rank is also the name of a DataFrame method, so we access the column using df['rank']
df_new['rank'].unique()

In [None]:
# Get the frequency table for a categorical or binary column
df_new['rank'].value_counts()

In [None]:
# Get a proportion table
df_new['rank'].value_counts()/sum(df_new['rank'].value_counts())

In [None]:
# Alternatively we can use the pandas function crosstab() to calculate a frequency table
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
pd.crosstab(index=df_new['rank'], columns="count")

In [None]:
# Two-way tables
pd.crosstab(index=df_new['rank'], columns=df_new['discipline'], margins=True)

### Data slicing and grouping

In [None]:
#Extract a column by name 
df_new['gender'].head()

In [None]:
# If the column name does not contain spaces or other special characters and does not collide with data frame methods, we can use a dot notation
df_new.gender.head()

In [None]:
# Calculate median number of service years
df_new.yearsOfService.median()

### Exercise - Working with a single column

In [None]:
# Calculate the descriptive statistics for only the salary column in df_new
# <your code goes here>

In [None]:
# Get a count for each of the values in the salary column in df_new
# <your code goes here>

In [None]:
# Calculate the average (mean) salary in df_new
# <your code goes here>

### Grouping data

In [None]:
# Group data using rank
df_rank = df_new.groupby('rank')
df_rank.head()

In [None]:
# Calculate mean of all numeric columns for the grouped object
df_rank.mean(numeric_only=True)
# What happens with df_rank.mean() ?

In [None]:
# Most of the time, the "grouping" object is not stored, but is used as a step in getting a summary:
#   df_new.groupby('gender')
# Calculate the mean salary for men and women. The following produce Pandas Series (single brackets around salary)
df_new.groupby('gender')['salary'].mean()

In [None]:
# If we use double brackets Pandas will produce a DataFrame
df_new.groupby('gender')[['salary']].mean()

In [None]:
# Group using 2 variables - gender and rank:
df_new.groupby(['rank','gender'], sort=True)[['salary']].mean()

### Exercise - Grouping data

In [None]:
# Group df_copy (which has the salary_k column) by rank and discipline and find the average yearsOfService and salary_k for each group.
# <your code goes here>

### Filtering

In [None]:
# Select observation with the value in the salary column > 120K
df_filter = df_new[df_new.salary > 120000]
df_filter.head()

In [None]:
df_filter.axes

In [None]:
# Select data for female professors
df_w = df_new[df_new.gender == 'Female']
df_w.head()

In [None]:
# To subset one column using a condition on another column, use the "where" method
df_new.salary.where(df_new.gender=='Female').dropna().head(6)

### Exercise - Filtering data 

In [None]:
# Using filtering, find the mean value of the salary for the discipline A
# <your code goes here>

In [None]:
# Challenge:
# Determine how many female and male professors earned more than 100K
# <your code goes here>

### Slicing a dataframe

In [None]:
# Select column salary
salary = df_new['salary']

In [None]:
# Check data type of the result
type(salary)

In [None]:
# Look at the first few elements of the output
salary.head()

In [None]:
# Select column salary and return the output as a data frame
df_salary = df_new[['salary']]

In [None]:
# Check the type
type(df_salary)

In [None]:
# Select a subset of rows (based on their position):
# Note 1: The location of the first row is 0
# Note 2: The last value in the range is not included
df_new[0:10]

In [None]:
# If we want to select both rows and columns we can use method .loc
df_new.loc[10:20, ['rank', 'gender','salary']]

In [None]:
# Recall our filtered dataset with salaries over 120K
df_filter.head(25)

In [None]:
# Let's see what we get for our df_filter data frame
# Method .loc subsets the data frame based on the index values:
# loc = location
df_filter.loc[10:20,['rank','gender','salary']]

In [None]:
# Unlike method .loc, method iloc selects rows (and columns) by absolute position:
# iloc = integer location
df_filter.iloc[10:20, [0,3,4,5]]
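
The difference between `.loc` and `.iloc` on a filtered frame can be sketched with a tiny made-up example. The names `df_demo` and `high` and the salary values are assumptions for illustration:

```python
import pandas as pd

# Made-up frame; filtering keeps the original index labels (1, 3, 4)
df_demo = pd.DataFrame({'salary': [90, 130, 110, 150, 125]})
high = df_demo[df_demo['salary'] > 120]

by_label = high.loc[1:3]      # labels 1 through 3, end label INCLUDED
by_position = high.iloc[1:3]  # positions 1 and 2, end position excluded

print(list(by_label.index))
print(list(by_position.index))
```

After filtering, index labels no longer match row positions, which is why the two slices return different rows.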

### Exercise - Slicing a dataframe

In [None]:
# Create a new dataframe where you filter out salaries below 100K from df_copy (it has the salary_k column)
# Call this data frame df_sub100

In [None]:
# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] of df_sub100 using the .loc method
# How many rows are in the output?

In [None]:
# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] from df_sub100 using the iloc method
# What are the values of the indices in the output?

In [None]:
# Extract rows with index values [6, 12, 20, 22] and columns ['yearsOfService','salary_k'] from df_sub100
# Hint: Use the loc method

### Sorting the Data

In [None]:
# Sort the data frame df_new by yearsOfService and create a new data frame
df_sorted = df_new.sort_values(by = 'yearsOfService')
df_sorted.head()

In [None]:
# Sort the data frame df_new by yearsOfService and overwrite the original dataset
df_new.sort_values(by = 'yearsOfService', ascending = False, inplace = True)
df_new.head()

In [None]:
# Restore the original order using the sort_index method
df_new.sort_index(axis=0, ascending = True, inplace = True)
df_new.head()

In [None]:
# Sort the data frame using 2 or more columns:
df_sorted2 = df_new.sort_values(by = ['yearsOfService', 'salary'], ascending = [True,False])
df_sorted2.head(15)

### Exercise - Sorting 

In [None]:
# Sort the data frame df_new by the salary (in descending order)
# Store the output of this sorting in a dataframe called df_desc
# Display the first 10 records of the output
# <your code goes here>

### Looping and DataFrames

You can iterate over rows in a loop:
```
# use the iterrows() method
sum_sal = 0
for index, row in df_sorted2.iterrows():
   sum_sal += row['salary']
```
or using .loc:
```
sum_sal = 0
for i in range(df_sorted2.shape[0]):
    sum_sal += df_sorted2.loc[i,'salary']
```
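However, the performance of row-by-row loops is generally VERY POOR, so they are to be *avoided* where a better alternative exists. Built-in vectorized operations (e.g., `df_sorted2['salary'].sum()`) are the fastest option. The [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply) function is a better-performing fallback than an explicit loop if you must run a custom operation on every element in a column.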
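
A minimal sketch comparing the loop with vectorized alternatives, assuming a made-up frame `df_demo` with random salaries (not the tutorial dataset):

```python
import numpy as np
import pandas as pd

# Made-up salary data for the sketch
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({'salary': rng.integers(50_000, 200_000, size=1_000)})

# Row-by-row loop: simple to read, but slow on large frames
loop_sum = 0
for _, row in df_demo.iterrows():
    loop_sum += row['salary']

# Vectorized equivalents: the work happens in compiled code
vec_sum = df_demo['salary'].sum()
salary_k = df_demo['salary'] / 1000.0  # element-wise, no explicit loop

print(loop_sum, vec_sum)
```

Both approaches give the same total. In a notebook, the `%timeit` magic can be used to compare their run times.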

### Missing Values
To discuss how to handle missing values we will import the flights data set.

In [None]:
# Read a dataset with missing values
flights = pd.read_csv("http://rcs.bu.edu/examples/python/DataAnalysis/flights.csv")
flights.head()

In [None]:
flights.info()

In [None]:
# Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()

In [None]:
# Filter out all the rows where the arr_delay value is missing:
flights1 = flights[flights['arr_delay'].notnull()]
flights1.head()

In [None]:
# Remove all the observations with missing values
flights2 = flights.dropna()

In [None]:
# Fill missing values with zeros
nomiss = flights['dep_delay'].fillna(0)
nomiss.isnull().any()

### Exercise - Count missing data

In [None]:
# Count how many missing pieces of data there are in the dep_delay and arr_delay columns


### Common Aggregation Functions:

The following functions are commonly used functions to aggregate data.

|Function|Description
|-------|--------
|min   | minimum
|max   | maximum
|count   | number of non-null observations
|sum   | sum of values
|mean  | arithmetic mean of values
|median | median
|mad | mean absolute deviation (removed in pandas 2.0)
|mode | mode
|prod   | product of values
|std  | standard deviation
|var | unbiased variance
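
As a hedged sketch, several of these functions can be applied at once with `agg()`. The mean absolute deviation is computed from its definition here, because the `mad` method is not available in pandas 2.0 and later. The Series `s` is made up for the example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Mean absolute deviation computed from its definition:
# the average absolute distance of each value from the mean
mad = (s - s.mean()).abs().mean()

# Several aggregations at once with agg()
summary = s.agg(['min', 'max', 'mean', 'std'])

print(mad)
print(summary)
```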



In [None]:
# Find the number of non-missing values in each column
flights.describe()

In [None]:
flights.info()

In [None]:
# Get mean values
flights.mean(numeric_only=True)

In [None]:
# Let's compute summary statistic per group:
flights.groupby('carrier')['dep_delay'].mean()

In [None]:
# We can use agg() methods for aggregation:
flights[['dep_delay','arr_delay']].agg(['min','mean','max'])

In [None]:
# The value returned is a dataframe:
agg_vals = flights[['dep_delay','arr_delay']].agg(['min','mean','max'])
agg_vals.info()

In [None]:
# An example of computing different statistics for different columns
flights.agg({'dep_delay':['min','mean','max'], 'carrier':['nunique']})

## Exploring data using graphics

### Graphics with the Salaries dataset

In [None]:
# Use matplotlib to draw a histogram of the salary data
plt.hist(df_new['salary'],bins=8, density=True)

In [None]:
# Use seaborn package to draw a histogram
sns.displot(df_new['salary']);

In [None]:
# Use the pandas plot method (which calls matplotlib) to display a barplot
df_new.groupby(['rank'])['salary'].count().plot(kind='bar')

In [None]:
# Use seaborn package to display a barplot
sns.set_style("whitegrid")
ax = sns.barplot(x='rank',y ='salary', data=df_new, estimator=len)

In [None]:
# Split into 2 groups:
ax = sns.barplot(x='rank',y ='salary', hue='gender', data=df_new, estimator=len)

In [None]:
# Violinplot
sns.violinplot(x = "salary", data=df_new)

In [None]:
# Scatterplot in seaborn
sns.jointplot(x='yearsOfService', y='salary', data=df_new)

In [None]:
# If we are interested in linear regression plot for 2 numeric variables we can use regplot
sns.regplot(x='yearsOfService', y='salary', data=df_new)

In [None]:
# Box plot
sns.boxplot(x='rank',y='salary', data=df_new)

In [None]:
# Side-by-side box plot
sns.boxplot(x='rank', y='salary', data=df_new, hue='gender')

In [None]:
# Swarm plot
sns.swarmplot(x='rank', y='salary', data=df_new)

In [None]:
# Catplot (called factorplot in older seaborn versions)
sns.catplot(x='rank', y='salary', data=df_new, kind='bar')
print(df_new.groupby('rank').mean(numeric_only=True))

In [None]:
# Pairplot 
sns.pairplot(df_new)

### Exercise - Graphing data

In [None]:
# Use the seaborn package to explore the dependency of arr_delay on dep_delay 
# in the flights dataset. You can use a scatterplot or regplot using flights.
# <your code goes here>

## Descriptive statistics
Statistics that are used to describe data. We have seen methods that calculate descriptive statistics before with the DataFrame describe() method. 

Descriptive statistics summarize attributes of a sample, such as the min/max values, and the mean (average) of the data. Below is a summary of some additional methods that calculate descriptive statistics.

|Function|Description
|-------|--------
|min   | minimum
|max   | maximum
|mean  | arithmetic mean of values
|median | median
|mad | mean absolute deviation (removed in pandas 2.0)
|mode | mode
|std  | standard deviation
|var | unbiased variance
|sem | standard error of the mean
|skew| sample skewness
|kurt|kurtosis
|quantile| value at a given quantile (e.g., 0.5 for the median)
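
A short sketch of a few of the less familiar functions in the table, applied to a made-up Series `s`. The standard error of the mean is also computed by hand for comparison.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 20.0])  # made-up values

sem = s.sem()                              # standard error of the mean
sem_manual = s.std() / np.sqrt(s.count())  # same quantity from its definition
q25, q75 = s.quantile([0.25, 0.75])        # first and third quartiles
skewness = s.skew()                        # positive here: long tail to the right

print(sem, sem_manual)
print(q25, q75, skewness)
```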


In [None]:
# Recall the describe() function which computes a subset of the above listed statistics
flights.dep_delay.describe()

In [None]:
# find the index of the maximum or minimum value
# if there are multiple values matching idxmin() and idxmax() will return the first match
flights['dep_delay'].idxmin()  # index of the minimum value

In [None]:
# Count the number of records for each different value in a column
flights['carrier'].value_counts()

## Inferential Statistics
Use data analysis on a sample of data to infer properties and make predictions that cannot be derived from descriptive statistics. Examples of this could be predicting a new unknown value based on previous data (linear regression, machine learning) or hypothesis testing (such as T-tests).

### Linear Regression
A linear approach to modeling the relationship between a scalar output and one (or more) input variables. With one input and one output variable you are finding a line of *best fit*. You calculate the slope and y-intercept of the line that minimizes the sum of squared vertical distances between the line and the existing data points. You can then use this line to make predictions on unknown data.
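
For a single input variable the least-squares slope and intercept have a closed form. The sketch below applies it with NumPy to made-up points that lie exactly on the line y = 2x + 1, so the recovered coefficients are easy to check. The variable names are assumptions for this example.

```python
import numpy as np

# Made-up data that lies exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates for a single input variable
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Use the fitted line to predict an unseen value
prediction = slope * 10.0 + intercept
print(slope, intercept, prediction)
```

With real, noisy data the points will not fall exactly on the line, and libraries such as statsmodels or scikit-learn also report fit diagnostics.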

In [None]:
# Import statsmodels functions:
import statsmodels.api as sm

In [None]:
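# Create a fitted model: salary as a function of yearsOfService
# OLS takes the dependent variable first; add_constant adds the intercept term
X = sm.add_constant(df_new.yearsOfService)
lm = sm.OLS(df_new.salary, X).fit()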

# Print model summary
print(lm.summary())

In [None]:
# Print the coefficients
lm.params

In [None]:
# Using scikit-learn:
from sklearn import linear_model
est = linear_model.LinearRegression(fit_intercept = True)   # create estimator object
# When you use dataframe columns this fits your model with feature names
est.fit(df_new[['yearsOfService']], df_new[['salary']])

# If you pass the values, you don't have to use feature names
# est.fit(df_new[['yearsOfService']].values, df_new[['salary']].values)

# Print result
print("Coef:", est.coef_, "\nIntercept:", est.intercept_)

# Predict 
# When you predict you have to use the feature names, otherwise you get a warning
pred = est.predict(pd.DataFrame(np.array([21]), columns=['yearsOfService']))

# If you created a model based on values, then you predict with only a value,
# though you have to pass it as a 2-D array with shape (n_samples, n_features)
# pred = est.predict([[21]])
print("Predicted salary: ", pred)


### Exercise - Build a linear model

In [None]:
# Build a linear model for arr_delay ~ dep_delay

# Print model summary

# Predict a value


### Student's T-test
Used to compare the means of two groups. In this case you have a null hypothesis that the two group means are equal. The T-test then tells you whether you have statistically significant evidence to reject the null hypothesis. 
The T-test has two output results, the statistic, and the p-value. The statistic (or T-value) quantifies the difference between the two mean values. The p-value is the probability of obtaining test results at least as extreme as the result that is observed assuming that the null hypothesis is correct.

More succinctly, the p-value tells you how likely it is that data at least as extreme as yours could have occurred under the null hypothesis. Small p-values indicate there is a small probability of observing such a difference in the means assuming the null hypothesis is true. Small p-values therefore provide evidence to reject the null hypothesis (i.e., the group means are different).

Conversely, larger p-values indicate there is a large probability of observing the calculated statistic under the null hypothesis. Large p-values mean that you do not have sufficient evidence to reject the null hypothesis; they do not prove that the group means are equal.
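
To make the statistic and p-value concrete, here is a minimal sketch on two made-up samples. It assumes the default equal-variance form of `scipy.stats.ttest_ind` and reproduces the statistic from the pooled-variance formula. The names `group_a` and `group_b` are illustrative.

```python
import numpy as np
from scipy import stats

# Made-up samples for two groups
group_a = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3])
group_b = np.array([6.2, 6.8, 6.1, 7.0, 6.5, 6.9])

result = stats.ttest_ind(group_a, group_b)  # assumes equal variances by default

# The same statistic from the pooled-variance formula
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

print(result.statistic, result.pvalue, t_manual)
```

The sign of the statistic only reflects which group was passed first. Its magnitude and the p-value carry the evidence.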

One area where the T-test is important is clinical trials. Consider an example where you are looking at whether a drug reduces cholesterol levels. You have 2 groups, one that is administered the drug and another that is not. You can use the T-test to determine whether there is evidence of a statistically significant difference in cholesterol level between the two groups.

Below we calculate whether there is a meaningful difference between male and female salaries. Generally a p-value below 0.05 is considered statistically significant.

In [None]:
# Using scipy package:
from scipy import stats
df_w = df[df['sex'] == 'Female']['salary']
df_m = df[df['sex'] == 'Male']['salary']
stats.ttest_ind(df_w, df_m)   

## Tutorial Evaluation
After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)