
Regression #

(Maybe) Useful resources #

Regression with statsmodels #

Doc: statsmodels 0.14.0

R-style formulas (e.g. smf.ols) add the constant automatically, while the plain API (e.g. sm.OLS) requires adding it manually with sm.add_constant.
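
A minimal sketch of the difference (df_toy is a toy frame, purely for illustration):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df_toy = pd.DataFrame({"y": [1.0, 2.0, 3.5, 5.0], "x1": [1, 2, 3, 4]})

# formula API: the intercept is added automatically
res_formula = smf.ols("y ~ x1", data=df_toy).fit()

# array API: add the constant yourself
X = sm.add_constant(df_toy[["x1"]])
res_plain = sm.OLS(df_toy["y"], X).fit()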

OLS basics #

Doc: statsmodels.regression.linear_model.OLSResults - statsmodels 0.14.0

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# formula
formula_1 = "y ~ x1 + x2 + ..."

# fit
results = smf.ols(formula_1, data=df).fit()

# results: two types of tables
print(results.summary())
print(results.summary2())

Formula #

Doc: Fitting models using R-style formulas - statsmodels 0.14.0

Use many columns of df as regressors (assuming y is the first column):

formula = "y ~ " + " + ".join(df.columns[1:])

Log-transform a variable (here the last column):

formula = "y ~ " + " + ".join(df.columns[1:-1]) + " + np.log(last_col)"

Interaction: (ref)

# The following two formulas are equivalent:
formula = "y ~ x1 * x2"
formula = "y ~ x1 + x2 + x1:x2"

Standard error types #

Doc: statsmodels.regression.linear_model.RegressionResults.HC3_se - statsmodels 0.15.0

Get heteroskedasticity-robust standard errors:

results = smf.ols(formula_1, data=df).fit(cov_type="HC3")
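
With cov_type="HC3", the robust standard errors replace the default ones everywhere (summary, results.bse, ...). They are also exposed as an attribute on a plain fit (see the HC3_se doc above), so refitting is not strictly needed:

plain = smf.ols(formula_1, data=df).fit()
print(plain.HC3_se)  # HC3-robust standard errors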

Debugging #

ValueError: endog has evaluated to an array with multiple columns that has shape (...). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).

Check the column dtypes with df.info(). Convert any object-dtype columns you need into numeric types.
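
A minimal sketch, with hypothetical columns bad_col (numbers stored as strings) and flag (bool):

df.info()  # look for object / bool dtypes

df["bad_col"] = pd.to_numeric(df["bad_col"])
df["flag"] = df["flag"].astype(int)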

ValueWarning: covariance of constraints does not have full rank. The number of constraints is 20, but rank is 19

This usually indicates multicollinearity. If you are sure the data itself is fine, one quick fix to try is to rescale or np.log-transform some variables.
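
For example (hypothetical column income, measured in raw units):

# rescale...
df["income_k"] = df["income"] / 1_000

# ...or log-transform directly in the formula
formula = "y ~ np.log(income) + x2"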

Detect multicollinearity #

Doc: statsmodels.stats.outliers_influence.variance_inflation_factor - statsmodels 0.14.0

Ref: Detecting Multicollinearity with VIF - Python - GeeksforGeeks

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df should contain only the explanatory variables here

# make VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = df.columns

# calculate VIF for each feature
vif_data["VIF"] = [
    variance_inflation_factor(df.values, i) for i in range(len(df.columns))
]

print(vif_data)
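
A common rule of thumb: a VIF above roughly 5–10 indicates problematic multicollinearity.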

Get results in tables #

The summary is built from SimpleTable objects, which can be exported individually: (doc)

# statsmodels.iolib.table.SimpleTable
results.summary().tables[0]
results.summary().tables[1].as_html()
results.summary().tables[2].as_latex_tabular(center=False)
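
To write a table to a file (a sketch; the whole Summary object also has as_latex() / as_html()):

with open("reg_table.tex", "w") as f:
    f.write(results.summary().tables[1].as_latex_tabular())

with open("reg_summary.tex", "w") as f:
    f.write(results.summary().as_latex())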

Get results programmatically #

formula = "y ~ x1 + x2"
results = smf.ols(formula, data=df).fit()

r2 = results.rsquared
r2_adj = results.rsquared_adj
coef = results.params["x1"]
p_val = results.pvalues["x1"]
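
Other commonly used attributes (all on OLSResults):

se = results.bse["x1"]              # standard error
ci = results.conf_int().loc["x1"]   # 95% confidence interval
n = results.nobs                    # number of observations
y_hat = results.fittedvalues        # in-sample fitted values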

Loop of regressions #

results_list = []

for col in df.columns[1:]:
    # x1 is df.columns[0]
    formula = f"y ~ x1 + {col}"
    results = smf.ols(formula, data=df).fit()

    coef = results.params[col]
    p_val = results.pvalues[col]

    row = {"index": col, "coef": coef, "p_val": p_val}
    results_list.append(row)

results_df = pd.DataFrame(results_list).set_index("index")
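
The frame can then be filtered or sorted as usual, e.g.:

print(results_df.sort_values("p_val").head())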

Tables with stargazer #

Doc (?): mwburke/stargazer: Python implementation of the R stargazer multiple regression model creation tool
Main examples: stargazer/examples.ipynb

from stargazer.stargazer import Stargazer, LineLocation

stargazer = Stargazer([results])  # can have multiple specifications

# Settings
stargazer.significant_digits(3)   # show 3 digits

# Preview table
stargazer

# Output latex code
print(stargazer.render_latex())
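
render_latex() can be written to a file directly; there is also an HTML renderer (a sketch):

with open("table.tex", "w") as f:
    f.write(stargazer.render_latex())

html = stargazer.render_html()  # for notebooks / web pages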

Change variable order #

By default, the variables are ordered alphabetically.

coef_order = df.columns[1:-1].tolist()
coef_order.append("np.log(last_col)")
coef_order.insert(0, "Intercept")

stargazer.covariate_order(coef_order)

Rename variables #

stargazer.rename_covariates({"np.log(last_col)": r"$\log(last\_col)$"})

Regression plots #

Ref: Predicting Housing Prices with Linear Regression using Python, pandas, and statsmodels – LearnDataSci
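
statsmodels also ships diagnostic plot helpers; a minimal sketch (assuming results from a fit above and matplotlib installed):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# fit, residuals, partial regression, and CCPR against one regressor
fig = sm.graphics.plot_regress_exog(results, "x1")
fig.tight_layout()
plt.show()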

Regression with sklearn #

TBE