Fitting a regression model is only the first step. Diagnostics tell you whether assumptions are reasonable and whether a few unusual points are driving your conclusions. In this tutorial, you will run a practical diagnostics workflow in `statsmodels`.

## Checking regression assumptions with a baseline model
```python
import requests
import pandas as pd
import statsmodels.api as sm

# Download once so all diagnostic steps use the same dataset
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Build and fit the baseline model for downstream diagnostics
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
print(model.summary())
```

This block downloads and saves `mtcars.csv` once and fits the [baseline OLS model](/tutorials/statsmodels-linear-regression) that all later diagnostics inspect. Using one shared baseline keeps each diagnostic result comparable and easier to interpret.
## Residual analysis and residual plots

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot

# Refit the baseline model and compute fitted values and residuals
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
fitted = model.fittedvalues
residuals = model.resid

# Residuals vs fitted values: look for curvature or a funnel shape
plt.figure(figsize=(8, 5))
plt.scatter(fitted, residuals, alpha=0.8)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.tight_layout()
plt.show()

# Q-Q plot: points should track the reference line if residuals are normal
qqplot(residuals, line="s")
plt.title("Q-Q Plot of Residuals")
plt.tight_layout()
plt.show()
```

The residuals-vs-fitted plot helps detect nonlinearity or non-constant variance, while the Q-Q plot checks how close the residuals are to normal. Together, these two plots give a fast visual check of key OLS assumptions before you trust inference.
## Heteroskedasticity tests

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Run the Breusch-Pagan test on the model residuals
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM stat:", bp_stat)
print("Breusch-Pagan LM p-value:", bp_pvalue)
print("Breusch-Pagan F stat:", f_stat)
print("Breusch-Pagan F p-value:", f_pvalue)
```

The Breusch-Pagan test evaluates whether the residual variance depends on the predictors; a small p-value signals heteroskedasticity, which makes the usual standard errors unreliable. If heteroskedasticity is present, you may need robust standard errors or a different model specification.
## Variance Inflation Factor (VIF)

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute a VIF for each predictor to quantify collinearity
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
vif_df = pd.DataFrame(
    {
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }
)
print(vif_df)
```

VIF measures multicollinearity, helping you detect predictors that may inflate coefficient uncertainty; the VIF reported for the constant term is not meaningful and can be ignored. High VIF values are a warning sign that coefficient magnitudes and p-values can become unstable.
## Influence and outlier detection

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Use the model's influence diagnostics to identify high-impact points
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

influence = model.get_influence()
cooks_d = influence.cooks_distance[0]

# The Rdatasets CSV stores car names in the "rownames" column
result = pd.DataFrame({"car": df["rownames"], "cooks_distance": cooks_d})
print(result.sort_values("cooks_distance", ascending=False).head(5))

plt.figure(figsize=(9, 4))
plt.stem(range(len(cooks_d)), cooks_d, basefmt=" ")
plt.xlabel("Observation index")
plt.ylabel("Cook's distance")
plt.title("Influence by observation")
plt.tight_layout()
plt.show()
```

Cook's distance surfaces influential observations, and the stem plot makes large-impact points easy to inspect. Reviewing the top points helps you decide whether unusual rows are valid data, errors, or cases requiring separate treatment.