Linear relationships appear everywhere: pricing, forecasting, operations, and experimentation. If you want interpretable models with statistical diagnostics, `statsmodels` is one of the best tools in Python. This tutorial focuses on **statsmodels linear regression** using **statsmodels OLS**.

## What OLS regression is

Ordinary Least Squares (OLS) estimates a line by minimizing the sum of squared residuals, where each residual is the difference between an observed value and the model prediction. In practice, OLS gives you coefficients, [confidence intervals](/tutorials/bootstrap-confidence-intervals-with-scipy), p-values, and R-squared in one workflow.

## Creating the dataset

```python
import requests
import pandas as pd

# Download the source CSV once for reuse in later code blocks
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Save a local copy so later examples can start from the same file
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Load the dataset and preview the two columns used in this tutorial
df = pd.read_csv("mtcars.csv")
print(df[["mpg", "hp"]].head())
```

This code downloads the dataset once, saves it locally as `mtcars.csv`, and loads it into a DataFrame for the rest of the tutorial. Saving the file first keeps later code blocks simple and repeatable: every block can start from the same local data source instead of re-downloading.

## Adding the intercept (`add_constant`)
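Before running `add_constant` on the real data, it helps to see what the intercept column actually is: conceptually, the function prepends a column of ones to the predictors. A minimal sketch with made-up numbers (not the mtcars data), using plain NumPy:

```python
import numpy as np

# Toy predictor values (illustrative only, not the mtcars data)
x = np.array([90.0, 110.0, 150.0, 220.0])

# add_constant effectively prepends a column of ones, turning the model
# y = b*x into y = a*1 + b*x so an intercept "a" can be estimated
X = np.column_stack([np.ones_like(x), x])
print(X)
```

The ones column plays the same role as the `const` column that `sm.add_constant` adds in the block below.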
```python
import pandas as pd
import statsmodels.api as sm

# Reload the shared local dataset
df = pd.read_csv("mtcars.csv")

# Build the predictor matrix and add an intercept column
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
print(X.head())
```

`sm.add_constant()` adds an intercept column (`const`) so the fitted model can learn both the slope and the baseline level. Without this step, the regression line would be forced through the origin, which often produces biased coefficients and a worse fit.

## Fitting the OLS model
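The "forced through the origin" effect described above is easy to see on toy numbers. The sketch below (made-up data, not mtcars) uses NumPy's least-squares solver, which minimizes the same sum of squared residuals that `sm.OLS` does:

```python
import numpy as np

# Noise-free toy data with a true intercept of 3 and slope of 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

# Fit with an intercept column: recovers intercept 3 and slope 2
X = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fit through the origin (no intercept column): the slope is biased
# upward because it must absorb the missing intercept
coef_without, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

print(coef_with)     # ≈ [3. 2.]
print(coef_without)  # ≈ [3.]
```

Even on perfectly linear data, dropping the intercept column changes the estimated slope from 2 to 3, which is why the tutorial always calls `add_constant` before fitting.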
```python
import pandas as pd
import statsmodels.api as sm

# Prepare inputs in the shape expected by statsmodels
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

# Fit the ordinary least squares model
model = sm.OLS(y, X).fit()
print(model.summary())
```

This block fits `sm.OLS` and prints the full regression summary with fit-quality and inference metrics. The summary is useful because it combines coefficient estimates, standard errors, [confidence intervals](/tutorials/bootstrap-confidence-intervals-with-scipy), and global fit statistics in one place for quick model review.

## Understanding coefficients, p-values, and R²
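Before reading R² off the summary, it can help to compute it by hand. A minimal sketch on made-up numbers (not the mtcars data): R² is one minus the ratio of residual variance to total variance, and for simple regression with one predictor it equals the squared correlation between the two variables:

```python
import numpy as np

# Toy data (illustrative only): y is roughly 1 + 2x with small deviations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Fit intercept and slope by least squares
X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = b0 + b1 * x

# R-squared: share of the variance in y explained by the fitted line
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

The value printed here matches `np.corrcoef(x, y)[0, 1] ** 2`, the same identity that links `model.rsquared` to the mpg–hp correlation in the block below.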
```python
import pandas as pd
import statsmodels.api as sm

# Recreate the same fitted model used above
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

print("Intercept:", model.params["const"])
print("Slope for hp:", model.params["hp"])
print("P-value for hp:", model.pvalues["hp"])
print("R-squared:", model.rsquared)
```

The intercept gives the baseline prediction, the slope shows the change in `mpg` per unit of `hp`, the p-value tests whether that slope is statistically distinguishable from zero, and R² summarizes explained variance. These are the core outputs you use to decide whether the relationship is meaningful and whether the model is strong enough for forecasting or decision support.

## Predicting new values
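Mechanically, predicting a new value is just evaluating the fitted line: intercept plus slope times the input. A quick sketch with toy coefficients (made-up numbers, not estimates from mtcars):

```python
import numpy as np

# Toy fitted coefficients (illustrative only, not estimated from mtcars)
intercept, slope = 30.0, -0.07

# Predicting new values is evaluating the fitted line at those inputs
new_hp = np.array([90.0, 150.0, 220.0])
predicted_mpg = intercept + slope * new_hp
print(np.round(predicted_mpg, 1))  # → [23.7 19.5 14.6]
```

`model.predict(new_points)` in the block below does exactly this with the estimated coefficients, while `get_prediction(...).summary_frame()` also reports `obs_ci_lower`/`obs_ci_upper` columns if you need intervals for individual observations rather than for the mean.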
```python
import pandas as pd
import statsmodels.api as sm

# Train the model on the historical rows
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

# New input values where we want predictions
new_points = pd.DataFrame({"hp": [90, 150, 220]})
new_points = sm.add_constant(new_points, has_constant="add")

# Return the prediction mean and confidence bounds
pred = model.get_prediction(new_points).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

This creates predictions for unseen values and includes [confidence intervals](/tutorials/bootstrap-confidence-intervals-with-scipy) so you can report uncertainty with each estimate. Reporting intervals matters because point predictions alone can hide how much variation is plausible around each forecast.

## Visualizing regression results
```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit the model on the original points
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

# Build a sorted x-grid so the regression line is drawn smoothly
x_grid = pd.DataFrame({"hp": sorted(df["hp"].tolist())})
x_grid_const = sm.add_constant(x_grid, has_constant="add")
y_hat = model.predict(x_grid_const)

# Plot the observed points plus the fitted line
plt.figure(figsize=(8, 5))
plt.scatter(df["hp"], y, alpha=0.7, label="Observed")
plt.plot(x_grid["hp"], y_hat, color="red", linewidth=2, label="OLS fit")
plt.xlabel("Horsepower")
plt.ylabel("Miles per gallon")
plt.title("Statsmodels OLS: mpg vs hp")
plt.legend()
plt.tight_layout()
plt.show()
```

This plot overlays the fitted OLS line on the observed data so you can visually assess trend direction and fit quality. Visual checks help confirm whether a straight-line assumption is reasonable and can reveal patterns that summary metrics alone may miss.
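One common follow-up visual check is a residual plot: residuals from a well-specified linear model should scatter randomly around zero. The sketch below uses synthetic quadratic data (made-up, not mtcars) to show the tell-tale U-shape that appears when a straight line is fit to curved data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a truly quadratic relationship (illustrative only)
x = np.linspace(0, 10, 50)
y = 0.5 * x ** 2

# Fit a straight line by least squares, then compute the leftovers
X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - (b0 + b1 * x)

# A U-shaped residual pattern like this one signals that the
# straight-line assumption is missing curvature in the data
plt.scatter(x, residuals, alpha=0.7)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Residuals vs predictor")
plt.tight_layout()
plt.show()
```

The same check applies to the mtcars fit above: plot `model.resid` against `df["hp"]` and look for any systematic pattern around zero.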