Most real outcomes depend on several factors at once. **[Multiple linear regression](/tutorials/statsmodels-linear-regression)** helps you estimate how each predictor relates to the target while controlling for the other predictors. In this tutorial, you will build a model with multiple predictors and interpret partial effects.

## Preparing multiple predictors

```python
import requests
import pandas as pd

# Download once so every later example can reuse the same data file
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Save a local CSV copy for the rest of this tutorial
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")
features = ["hp", "wt", "qsec"]

# X contains multiple predictors; y is the target to predict
X = df[features]
y = df["mpg"]
print(X.head())
print(y.head())
```

This code downloads and saves `mtcars.csv` once, then creates a feature matrix (`X`) with multiple columns and a target vector (`y`). Splitting predictors and target clearly at the start makes later model code easier to reason about and helps prevent accidental leakage of target information into the features.

## Building the feature matrix and adding an intercept
```python
import pandas as pd
import statsmodels.api as sm

# Load the source data and build the model matrix with an intercept
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
print(X.head())
```

The added constant gives the model an intercept term; with several features, the fitted surface is a plane (more generally, a hyperplane) rather than a single line. With the intercept in place, OLS can estimate each coefficient as a partial effect while accounting for the other predictors.

## Fitting the regression model
```python
import pandas as pd
import statsmodels.api as sm

# Fit OLS with three predictors at once
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
print(model.summary())
```

This fits the full multiple linear regression model and prints inference statistics for each predictor. Looking at the full summary lets you evaluate both overall model quality and whether individual variables contribute useful signal.

## Interpreting coefficients and significance
```python
import pandas as pd
import statsmodels.api as sm

# Refit the model and extract key inference fields into one table
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

summary_table = pd.DataFrame(
    {
        "coefficient": model.params,
        "p_value": model.pvalues,
        "ci_lower": model.conf_int()[0],
        "ci_upper": model.conf_int()[1],
    }
)
print(summary_table)
```

Each coefficient is a **partial effect**: the expected change in `mpg` for a one-unit increase in that feature while the other predictors are held fixed. This interpretation is the main reason to use multiple regression, since it separates the contributions of correlated real-world factors.

## Detecting multicollinearity with VIF

When predictors are strongly correlated with each other, coefficient estimates can become unstable. The variance inflation factor (VIF) helps detect this.
```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute VIF on the same design matrix used for modeling;
# the 'const' row is not meaningful on its own and can be ignored
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])

vif = pd.DataFrame(
    {
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }
)
print(vif)
```

This block calculates a VIF value for each predictor so you can identify potential multicollinearity before interpreting coefficients. Checking VIF first helps you avoid overconfident conclusions from unstable coefficient estimates when predictors overlap heavily.