More Python (Stat/ML/Viz)

We continue our journey with Python. At the end of this week, you will be able to fit statistical models with statsmodels, fit and evaluate machine learning models with scikit-learn, and create plots with matplotlib and seaborn.

Statistical Models in Python

statsmodels is a Python package that provides functions for fitting statistical models, conducting statistical tests, and exploring data statistically.

Let’s read a data set from the list provided in this link. We use the mtcars data set from the R package datasets.

# gives easy access to most of the statsmodels functions
import statsmodels.api as stat 
# allows fitting models with R-style formulas
import statsmodels.formula.api as statf 
import pandas as pd
import numpy as np
# matplotlib for plots
import matplotlib.pyplot as plt

# load data "mtcars" from the R package 'datasets' 
mtcars_python = stat.datasets.get_rdataset("mtcars", "datasets").data 

# print data
print(mtcars_python.info())

<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 3.0+ KB
None

# fit linear regression
fit_olsregression = statf.ols("mpg ~ wt + cyl",data=mtcars_python).fit()
# print linear regression results
print(fit_olsregression.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.830
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     70.91
Date:                Fri, 30 Jun 2023   Prob (F-statistic):           6.81e-12
Time:                        10:44:30   Log-Likelihood:                -74.005
No. Observations:                  32   AIC:                             154.0
Df Residuals:                      29   BIC:                             158.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     39.6863      1.715     23.141      0.000      36.179      43.194
wt            -3.1910      0.757     -4.216      0.000      -4.739      -1.643
cyl           -1.5078      0.415     -3.636      0.001      -2.356      -0.660
==============================================================================
Omnibus:                        4.628   Durbin-Watson:                   1.671
Prob(Omnibus):                  0.099   Jarque-Bera (JB):                3.426
Skew:                           0.789   Prob(JB):                        0.180
Kurtosis:                       3.287   Cond. No.                         27.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# predict using linear regression
pred_ols = fit_olsregression.get_prediction()

# print prediction results
pred_ols.summary_frame().head()
                        mean   mean_se  ...  obs_ci_lower  obs_ci_upper
Mazda RX4          22.279145  0.601167  ...     16.885964     27.672326
Mazda RX4 Wag      21.465447  0.497629  ...     16.116566     26.814327
Datsun 710         26.252026  0.725244  ...     20.795395     31.708658
Hornet 4 Drive     20.380516  0.460267  ...     15.045648     25.715384
Hornet Sportabout  16.646958  0.775271  ...     11.161630     22.132285

[5 rows x 6 columns]

# Convert 'cyl' variable to categorical variable type
mtcars_python["cyl"] = mtcars_python["cyl"].astype("category") # make cyl a categorical variable

# refit with a categorical variable 
fit_olsregression = statf.ols("mpg ~ wt + cyl",data=mtcars_python).fit() 

# print the results
print(fit_olsregression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                     48.08
Date:                Fri, 30 Jun 2023   Prob (F-statistic):           3.59e-11
Time:                        10:44:30   Log-Likelihood:                -73.311
No. Observations:                  32   AIC:                             154.6
Df Residuals:                      28   BIC:                             160.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     33.9908      1.888     18.006      0.000      30.124      37.858
cyl[T.6]      -4.2556      1.386     -3.070      0.005      -7.095      -1.416
cyl[T.8]      -6.0709      1.652     -3.674      0.001      -9.455      -2.686
wt            -3.2056      0.754     -4.252      0.000      -4.750      -1.661
==============================================================================
Omnibus:                        2.709   Durbin-Watson:                   1.806
Prob(Omnibus):                  0.258   Jarque-Bera (JB):                1.735
Skew:                           0.559   Prob(JB):                        0.420
Kurtosis:                       3.222   Cond. No.                         18.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
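
To see how much each term contributes once cyl is treated as categorical, statsmodels can also produce an ANOVA table from the fitted model. This is a minimal sketch, not part of the original example; it uses stat.stats.anova_lm with Type II sums of squares and assumes fit_olsregression from the chunk above.

# ANOVA table (Type II) for the model with categorical 'cyl' -- sketch
anova_table = stat.stats.anova_lm(fit_olsregression, typ=2)
print(anova_table)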

For more statistics with Python, consult the following links (a short sketch of a test and a GLM follows this list):

  • Statistical tests
  • Generalized Linear Models
  • Contingency Tables
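
As a small, hedged taste of those topics, the sketch below runs a two-sample t-test (using scipy rather than statsmodels) comparing mpg between automatic and manual cars, and fits a logistic regression of am on wt as a GLM with a binomial family. It reuses mtcars_python loaded earlier.

# Two-sample Welch t-test: mpg for automatic (am=0) vs. manual (am=1) cars -- sketch
from scipy import stats as scipy_stats
mpg_auto = mtcars_python.loc[mtcars_python["am"] == 0, "mpg"]
mpg_manual = mtcars_python.loc[mtcars_python["am"] == 1, "mpg"]
print(scipy_stats.ttest_ind(mpg_auto, mpg_manual, equal_var=False))

# Logistic regression (a GLM with a binomial family) of am on wt -- sketch
fit_logit = statf.glm("am ~ wt", data=mtcars_python, family=stat.families.Binomial()).fit()
print(fit_logit.summary())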

Machine Learning

The scikit-learn package provides functions that support machine learning workflows, including model fitting, prediction, and cross-validation. It also provides a range of supervised and unsupervised methods. The website of the package is https://scikit-learn.org

Linear models

Fitting linear regression models is appropriate when the target value, or response variable, is assumed to be a linear combination of the predictors. The following code fits a linear model with the sklearn module and evaluates it on held-out data.

# import libraries
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

# Load data (get_rdataset comes from the statsmodels.api import above, aliased as stat)
df = stat.datasets.get_rdataset("mtcars", "datasets").data 

# split data
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)

# Create X and Y from training
Y = training_data["mpg"] # response variable / outcome
X = training_data.drop(columns=["mpg"]) #predictors / features
reg =  linear_model.LinearRegression().fit(X,Y)

# Create X and Y from testing
Y_test = testing_data["mpg"] # response variable / outcome
X_test = testing_data.drop(columns=["mpg"]) #predictors / features
mpg_y_pred = reg.predict(X_test) # predictions

print(reg.coef_)
[-0.52365917  0.01668511 -0.02157865  0.12362249 -3.46329267  0.70433502
  1.1100029   3.90616473  0.28676536 -0.16588934]

# Compute the MAPE
mean_absolute_percentage_error(y_true=Y_test,y_pred=mpg_y_pred)
0.12447286077371422
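
Cross-validation, mentioned above, is just as short with scikit-learn. The following is a minimal sketch that scores a linear model with 5-fold cross-validation via cross_val_score on the full mtcars data frame df; the default scoring for a regressor is R².

# 5-fold cross-validation of a linear model on the full data -- sketch
from sklearn.model_selection import cross_val_score

X_all = df.drop(columns=["mpg"])   # all predictors
Y_all = df["mpg"]                  # response
cv_scores = cross_val_score(linear_model.LinearRegression(), X_all, Y_all, cv=5)
print(cv_scores)         # one R^2 score per fold
print(cv_scores.mean())  # average cross-validated R^2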

Visualization with Python

Python offers multiple tools and libraries with many features for data visualization and plotting. Among the popular libraries are matplotlib and seaborn, both of which we use below.

The matplotlib.pyplot module is a collection of command style functions that make matplotlib work like MATLAB.

A few plots!

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
#matplotlib.use('Agg') # To plot with Markdown

x = np.linspace(0, 10, 100)
plt.figure();
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()

plt.close()
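
When rendering to a static document (the commented-out matplotlib.use('Agg') line above hints at that workflow), you can write the figure to a file instead of showing it. A minimal sketch; the file name sine_cosine.png is just an example.

# Save the sine/cosine figure to a PNG file instead of opening a window -- sketch
x = np.linspace(0, 10, 100)
plt.figure()
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.savefig("sine_cosine.png", dpi=150, bbox_inches="tight")  # example file name
plt.close()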

Read data from sklearn and visualize

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris 
import matplotlib
#matplotlib.use('Agg') # To plot with Markdown

# load iris data
iris = load_iris()
# make iris data a Pandas data frame
df_iris = pd.DataFrame(iris.data)
# features names as columns name
df_iris.columns = iris.feature_names

# Boxplot
plt.figure();
plt.boxplot(df_iris)
plt.xticks([1, 2, 3, 4], iris.feature_names)
plt.grid()
plt.show()

plt.close()

# histogram
plt.figure();
plt.hist(df_iris.iloc[:,0])
plt.show()

plt.close()
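
A scatter plot can reveal structure that the boxplot and histogram hide. The sketch below plots sepal length against petal length colored by the integer species codes in iris.target; the added 'species' column is just for illustration.

# Scatter plot of sepal length vs. petal length, colored by species codes -- sketch
df_iris["species"] = iris.target   # 0, 1, 2 codes for setosa, versicolor, virginica
plt.figure()
plt.scatter(df_iris["sepal length (cm)"], df_iris["petal length (cm)"], c=df_iris["species"])
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.show()

plt.close()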

Pilot Certification Data

Data was obtained from the Federal Aviation Administration (FAA) in June 2023 on pilot certification records and contains the following variables:

  • Pilot ID,

  • CertLevel: the certification level (Airline, Commercial, Student, Sport, Private, and Recreational),

  • STATE: the USA state,

  • MedClass: the medical class,

  • MedExpMonth: the medical expiration month, and

  • MedExpYear: the medical expiration year.

Number of Pilots per State

import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
from collections import Counter
import seaborn as sns

df = pd.read_csv("../datasets/pilotsCertFAA2023.csv")
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450697 entries, 0 to 450696
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   ID           450697 non-null  object 
 1   STATE        430111 non-null  object 
 2   MedClass     305162 non-null  float64
 3   MedExpMonth  305162 non-null  float64
 4   MedExpYear   305162 non-null  float64
 5   CertLevel    450697 non-null  object 
dtypes: float64(3), object(3)
memory usage: 20.6+ MB
None

# Select the FL, CA, NY, and TX states
st = ['FL', 'CA', 'NY', 'TX']
df_reduced = df[df['STATE'].isin(st)]

# Counts how many in each state
counts = Counter(df_reduced.STATE)
# Convert to data frame
df_reduced = pd.DataFrame.from_dict(counts, orient='index').reset_index()
# Rename columns
df_reduced = df_reduced.rename(columns={'index':'state', 0:'count'})

# bar plot with seaborn
axx = sns.barplot(data=df_reduced, x="state", y="count")
# add x and y labels
axx.set(xlabel='State', ylabel='Number of Pilots', title="Data update 2023")
plt.show()

Number of Pilots per Certification Level

# Counts how many in each certification level
countsCert = Counter(df.CertLevel)
# Convert to data frame
df_reduced_cert = pd.DataFrame.from_dict(countsCert, orient='index').reset_index()
# Rename columns
df_reduced_cert = df_reduced_cert.rename(columns={'index':'certlevel', 0:'count'})

# bar plot with seaborn
ax = sns.barplot(data=df_reduced_cert, x="certlevel", y="count")
# add x and y labels
ax.set(xlabel='Certification Level', ylabel='Number of Pilots', title="Data update 2023")
plt.show()
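
As a design note, the Counter-plus-rename steps above can be collapsed with pandas' value_counts, which produces the same counts in one call. A minimal sketch, assuming the pilots data frame df from above:

# Same certification-level counts with pandas value_counts instead of collections.Counter -- sketch
cert_counts = df["CertLevel"].value_counts().reset_index()
cert_counts.columns = ["certlevel", "count"]   # normalize column names across pandas versions
ax2 = sns.barplot(data=cert_counts, x="certlevel", y="count")
ax2.set(xlabel='Certification Level', ylabel='Number of Pilots', title="Data update 2023")
plt.show()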

🛎 🎙️ Recordings on Canvas will cover more details and examples! Have fun learning and coding 😃! Let me know how I can help!

📚 👈 Assignment - Python Stat/ML/Viz

Instructions are posted on Canvas.