As models grow, manually assembling design matrices becomes harder to read and maintain. `statsmodels.formula.api` keeps model definitions compact and expressive. In this tutorial, you will use formula syntax for [OLS](/tutorials/statsmodels-linear-regression) and [Logit workflows](/tutorials/statsmodels-logistic-regression).

## Formula syntax with `statsmodels.formula.api`
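Before downloading any data, it helps to see what a formula string actually produces. statsmodels parses formulas with the patsy library (the backend it has traditionally used), and you can call patsy directly to inspect the resulting design matrix. A minimal sketch with made-up toy data (the `x1`/`x2` columns are placeholders, not part of mtcars):

```python
import pandas as pd
from patsy import dmatrix

# Toy data (hypothetical), just to inspect the design matrix a formula builds
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 1.5, 2.5]})

# Patsy adds an intercept column automatically
m = dmatrix("x1 + x2", df)
print(m.design_info.column_names)  # ['Intercept', 'x1', 'x2']

# Appending `- 1` to the formula removes the intercept
m0 = dmatrix("x1 + x2 - 1", df)
print(m0.design_info.column_names)  # ['x1', 'x2']
```

The same `- 1` trick works inside `smf.ols` and `smf.logit` formulas.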
```python
import requests
import pandas as pd
import statsmodels.formula.api as smf

# Download once and keep a local copy for all later formula examples
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Fit OLS using formula notation instead of manual matrix building
df = pd.read_csv("mtcars.csv")
model = smf.ols("mpg ~ hp + wt + qsec", data=df).fit()
print(model.summary())
```

The `y ~ x1 + x2` formula style is concise and automatically includes an intercept unless you explicitly remove it. This reduces boilerplate and makes the model definition easy to understand at a glance.

## Categorical predictors with `C()`
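To see what `C()` expands to before fitting anything, you can build the design matrix directly with patsy, the formula backend statsmodels has traditionally used. A sketch with toy data (the values are made up):

```python
import pandas as pd
from patsy import dmatrix

# Toy data (hypothetical): a numeric column with a few repeated levels
df = pd.DataFrame({"cyl": [4, 4, 6, 8, 8]})

# C() turns the column into indicator variables, dropping one reference level
m = dmatrix("C(cyl)", df)
print(m.design_info.column_names)  # ['Intercept', 'C(cyl)[T.6]', 'C(cyl)[T.8]']
```

The lowest level (here `4`) becomes the baseline; each remaining coefficient is then a shift relative to it.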
```python
import pandas as pd
import statsmodels.formula.api as smf

# Treat `cyl` as categorical so statsmodels creates indicator variables
df = pd.read_csv("mtcars.csv")
cat_model = smf.ols("mpg ~ hp + wt + C(cyl)", data=df).fit()
print(cat_model.summary())
```

`C(cyl)` encodes cylinder count as categorical levels instead of treating it as one continuous numeric trend. This matters when you expect group-level differences rather than a single linear effect across the numeric values.

## Interaction terms
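The `*` shorthand is easy to verify before fitting: building the model object exposes the expanded column names without needing data from disk. A sketch with toy data (values made up, mirroring the mtcars column names):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical), only used to inspect the expanded design columns
df = pd.DataFrame({
    "mpg": [21.0, 22.0, 18.0, 15.0],
    "wt": [2.6, 2.9, 3.2, 3.4],
    "hp": [110.0, 93.0, 150.0, 245.0],
})

# `wt * hp` is shorthand for `wt + hp + wt:hp`; no .fit() needed to see this
model = smf.ols("mpg ~ wt * hp", data=df)
print(model.exog_names)  # ['Intercept', 'wt', 'hp', 'wt:hp']
```

If you want only the product term, write `wt:hp` explicitly, though dropping main effects is rarely advisable.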
```python
import pandas as pd
import statsmodels.formula.api as smf

# `*` includes both main effects and their interaction term
df = pd.read_csv("mtcars.csv")
interaction_model = smf.ols("mpg ~ wt * hp", data=df).fit()
print(interaction_model.summary())
```

`wt * hp` expands to the main effects plus `wt:hp`, so the effect of one variable can depend on the value of the other. Interaction terms are useful when relationships are not additive and one feature changes the impact of another.

## Formula API with logistic regression
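One practical note before the example: `smf.logit` expects a numeric 0/1 response, which mtcars' `am` column already is. If your outcome is stored as text, a quick conversion gets you there. A sketch with hypothetical data (the `trans` column name and values are made up):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data where the outcome is stored as text rather than 0/1
df = pd.DataFrame({
    "trans": ["manual", "auto", "auto", "manual",
              "auto", "manual", "auto", "manual"],
    "mpg": [26.0, 15.0, 18.0, 20.0, 22.0, 24.0, 25.0, 30.0],
})

# Convert the text label to a 0/1 integer column, then fit as usual
df["am"] = (df["trans"] == "manual").astype(int)
model = smf.logit("am ~ mpg", data=df).fit(disp=False)
print(model.params)
```

`disp=False` suppresses the optimizer's convergence printout, which keeps notebook output tidy.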
```python
import pandas as pd
import statsmodels.formula.api as smf

# The same formula workflow works for binary outcomes with logit
df = pd.read_csv("mtcars.csv")
logit_model = smf.logit("am ~ mpg + hp + wt", data=df).fit(disp=False)
print(logit_model.summary())
```

The same formula style works for `logit`, making it easy to switch between regression types while keeping readable model definitions. That consistency is helpful when you compare linear and classification models in the same project.
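To close the loop, a fitted logit model returns probabilities (not class labels) through `predict`, and with the formula interface you pass a DataFrame with the same column names. A sketch on toy data (the values are made up, standing in for mtcars):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical), mirroring mtcars' 0/1 `am` column
df = pd.DataFrame({
    "am":  [0, 0, 1, 0, 1, 1, 0, 1],
    "mpg": [15.0, 18.0, 20.0, 22.0, 24.0, 28.0, 25.0, 30.0],
})

model = smf.logit("am ~ mpg", data=df).fit(disp=False)

# `predict` builds the design matrix from the stored formula, so new data
# only needs the raw predictor columns; the result is P(am = 1)
probs = model.predict(pd.DataFrame({"mpg": [16.0, 29.0]}))
print(probs.round(3))
```

Because the formula travels with the fitted results, any `C()` or interaction terms are re-applied to new data automatically.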