As models grow, manually assembling design matrices becomes harder to read and maintain. `statsmodels.formula.api` keeps model definitions compact and expressive. In this tutorial, you will use formula syntax for [OLS](/tutorials/statsmodels-linear-regression) and [Logit workflows](/tutorials/statsmodels-logistic-regression).

## Formula syntax with `statsmodels.formula.api`
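Before downloading any data, it helps to see what a formula string actually produces. statsmodels parses formulas with the patsy library (the backend it has traditionally used), and you can call patsy directly to inspect the resulting design matrix. A minimal sketch with made-up toy data (the `x1`/`x2` columns are placeholders, not part of mtcars):

```python
import pandas as pd
from patsy import dmatrix

# Toy data (hypothetical), just to inspect the design matrix a formula builds
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 1.5, 2.5]})

# Patsy adds an intercept column automatically
m = dmatrix("x1 + x2", df)
print(m.design_info.column_names)  # ['Intercept', 'x1', 'x2']

# Appending `- 1` to the formula removes the intercept
m0 = dmatrix("x1 + x2 - 1", df)
print(m0.design_info.column_names)  # ['x1', 'x2']
```

The same `- 1` trick works inside `smf.ols` and `smf.logit` formulas.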
```python
import requests
import pandas as pd
import statsmodels.formula.api as smf

# Download once and keep a local copy for all later formula examples
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Fit OLS using formula notation instead of manual matrix building
df = pd.read_csv("mtcars.csv")
model = smf.ols("mpg ~ hp + wt + qsec", data=df).fit()
print(model.summary())
```

The `y ~ x1 + x2` formula style is concise and automatically includes an intercept unless you explicitly remove it. This reduces boilerplate and makes the model definition easy to understand at a glance.

## Categorical predictors with `C()`
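To see what `C()` expands to before fitting anything, you can build the design matrix directly with patsy, the formula backend statsmodels has traditionally used. A sketch with toy data (the values are made up):

```python
import pandas as pd
from patsy import dmatrix

# Toy data (hypothetical): a numeric column with a few repeated levels
df = pd.DataFrame({"cyl": [4, 4, 6, 8, 8]})

# C() turns the column into indicator variables, dropping one reference level
m = dmatrix("C(cyl)", df)
print(m.design_info.column_names)  # ['Intercept', 'C(cyl)[T.6]', 'C(cyl)[T.8]']
```

The lowest level (here `4`) becomes the baseline; each remaining coefficient is then a shift relative to it.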
```python
import pandas as pd
import statsmodels.formula.api as smf

# Treat `cyl` as categorical so statsmodels creates indicator variables
df = pd.read_csv("mtcars.csv")
cat_model = smf.ols("mpg ~ hp + wt + C(cyl)", data=df).fit()
print(cat_model.summary())
```

`C(cyl)` encodes cylinder count as categorical levels instead of treating it as one continuous numeric trend. This matters when you expect group-level differences rather than a single linear effect across the numeric values.

## Interaction terms
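The `*` shorthand is easy to verify before fitting: building the model object exposes the expanded column names without needing data from disk. A sketch with toy data (values made up, mirroring the mtcars column names):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical), only used to inspect the expanded design columns
df = pd.DataFrame({
    "mpg": [21.0, 22.0, 18.0, 15.0],
    "wt": [2.6, 2.9, 3.2, 3.4],
    "hp": [110.0, 93.0, 150.0, 245.0],
})

# `wt * hp` is shorthand for `wt + hp + wt:hp`; no .fit() needed to see this
model = smf.ols("mpg ~ wt * hp", data=df)
print(model.exog_names)  # ['Intercept', 'wt', 'hp', 'wt:hp']
```

If you want only the product term, write `wt:hp` explicitly, though dropping main effects is rarely advisable.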
```python
import pandas as pd
import statsmodels.formula.api as smf

# `*` includes both main effects and their interaction term
df = pd.read_csv("mtcars.csv")
interaction_model = smf.ols("mpg ~ wt * hp", data=df).fit()
print(interaction_model.summary())
```

`wt * hp` expands to the main effects plus `wt:hp`, so the effect of one variable can depend on the value of the other. Interaction terms are useful when relationships are not additive and one feature changes the impact of another.

## Formula API with logistic regression
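One practical note before the example: `smf.logit` expects a numeric 0/1 response, which mtcars' `am` column already is. If your outcome is stored as text, a quick conversion gets you there. A sketch with hypothetical data (the `trans` column name and values are made up):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data where the outcome is stored as text rather than 0/1
df = pd.DataFrame({
    "trans": ["manual", "auto", "auto", "manual",
              "auto", "manual", "auto", "manual"],
    "mpg": [26.0, 15.0, 18.0, 20.0, 22.0, 24.0, 25.0, 30.0],
})

# Convert the text label to a 0/1 integer column, then fit as usual
df["am"] = (df["trans"] == "manual").astype(int)
model = smf.logit("am ~ mpg", data=df).fit(disp=False)
print(model.params)
```

`disp=False` suppresses the optimizer's convergence printout, which keeps notebook output tidy.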
```python
import pandas as pd
import statsmodels.formula.api as smf

# The same formula workflow works for binary outcomes with logit
df = pd.read_csv("mtcars.csv")
logit_model = smf.logit("am ~ mpg + hp + wt", data=df).fit(disp=False)
print(logit_model.summary())
```

The same formula style works for `logit`, making it easy to switch between regression types while keeping readable model definitions. That consistency is helpful when you compare linear and classification models in the same project.
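To close the loop, a fitted logit model returns probabilities (not class labels) through `predict`, and with the formula interface you pass a DataFrame with the same column names. A sketch on toy data (the values are made up, standing in for mtcars):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical), mirroring mtcars' 0/1 `am` column
df = pd.DataFrame({
    "am":  [0, 0, 1, 0, 1, 1, 0, 1],
    "mpg": [15.0, 18.0, 20.0, 22.0, 24.0, 28.0, 25.0, 30.0],
})

model = smf.logit("am ~ mpg", data=df).fit(disp=False)

# `predict` builds the design matrix from the stored formula, so new data
# only needs the raw predictor columns; the result is P(am = 1)
probs = model.predict(pd.DataFrame({"mpg": [16.0, 29.0]}))
print(probs.round(3))
```

Because the formula travels with the fitted results, any `C()` or interaction terms are re-applied to new data automatically.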