Many practical modeling tasks are binary: convert or not, churn or stay, default or repay. Logistic regression is built for these outcomes and keeps interpretation straightforward. This tutorial focuses on **statsmodels logistic regression** with **statsmodels Logit**.

## Logistic regression vs [linear regression](/tutorials/statsmodels-linear-regression)

Linear regression predicts unbounded numeric values. Logistic regression predicts probabilities between `0` and `1` using a logistic curve, then maps those probabilities to classes using a threshold.

## Preparing binary target data

```python
import requests
import pandas as pd

# Download once and persist locally for later blocks
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Binary target and feature matrix
df = pd.read_csv("mtcars.csv")
y = df["am"]
X = df[["mpg", "hp", "wt"]]
print(X.head())
print(y.head())
```

This block downloads and saves `mtcars.csv` once, then prepares the binary target and predictor matrix for classification. Defining `y` and `X` explicitly up front makes the modeling flow clearer and ensures the same dataset is reused consistently in later blocks.

## Adding intercept
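Before adding the constant column, it helps to see what the intercept means: it is the log-odds of the positive class when every predictor is zero, and the logistic transform turns it into a baseline probability. A quick sketch with a made-up intercept value (not fitted from the data):

```python
import numpy as np

# Hypothetical intercept: the log-odds when all predictors are zero
intercept = -1.5

# Logistic transform maps log-odds to a probability in (0, 1)
baseline_prob = 1 / (1 + np.exp(-intercept))
print(baseline_prob)  # ~0.18
```

A negative intercept means the baseline probability sits below 0.5 before any predictor contributes.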
```python
import pandas as pd
import statsmodels.api as sm

# Add intercept for baseline log-odds
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
print(X.head())
```

`add_constant()` adds the intercept term required for a baseline log-odds estimate. Including an intercept lets the model represent the base probability level instead of forcing all predictors to explain it.

## Fitting the Logit model
```python
import pandas as pd
import statsmodels.api as sm

# Fit logistic regression model
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)
print(logit_model.summary())
```

This fits `Logit` and prints coefficient significance, confidence intervals, and fit statistics. The summary helps you decide which predictors carry useful classification signal before you move to prediction.

## Interpreting coefficients and odds ratios
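The key relationship in this section is that exponentiating a log-odds coefficient gives an odds ratio: the multiplicative change in odds per one-unit increase in that feature. A sketch with a hypothetical coefficient value:

```python
import numpy as np

# Hypothetical log-odds coefficient (not fitted from the data)
coef = 0.7

# exp(coef) is the multiplicative change in odds per one-unit increase
odds_ratio = np.exp(coef)
print(odds_ratio)  # ~2.01, i.e. each unit increase roughly doubles the odds
```

An odds ratio above 1 means the feature pushes toward the positive class; below 1, away from it.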
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Train model then convert log-odds coefficients to odds ratios
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)
coef_df = pd.DataFrame(
    {
        "coef": logit_model.params,
        "odds_ratio": np.exp(logit_model.params),
        "p_value": logit_model.pvalues,
    }
)
print(coef_df)
```

Logit coefficients are in log-odds space; exponentiating them gives odds ratios, which are easier to interpret operationally. Odds ratios let you explain model behavior in practical terms, such as how much odds change when a feature increases.

## Predicting probabilities and classification thresholds
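Under the hood, `predict()` computes the linear predictor (log-odds) for each row and passes it through the logistic function. A sketch of that arithmetic with invented coefficients, not the fitted ones:

```python
import numpy as np

# Hypothetical coefficients: intercept, mpg, hp, wt (values invented)
params = np.array([-8.0, 0.5, -0.02, -1.0])

# One new observation: constant, mpg, hp, wt
x_new = np.array([1.0, 25.0, 100.0, 2.5])

# Linear predictor (log-odds), then logistic transform to a probability
log_odds = x_new @ params
prob = 1 / (1 + np.exp(-log_odds))
print(log_odds, prob)  # 0.0 0.5
```

A log-odds of exactly zero corresponds to a probability of 0.5, which is why 0.5 is the natural default threshold.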
```python
import pandas as pd
import statsmodels.api as sm

# Fit model on training rows
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)
new_cars = pd.DataFrame(
    {
        "mpg": [18.0, 28.0],
        "hp": [150, 95],
        "wt": [3.4, 2.1],
    }
)
new_cars = sm.add_constant(new_cars, has_constant="add")

# Predict probabilities, then map to classes at two thresholds
prob = logit_model.predict(new_cars)
class_at_05 = (prob >= 0.5).astype(int)
class_at_07 = (prob >= 0.7).astype(int)
print("Probabilities:")
print(prob)
print("Classes at threshold 0.5:")
print(class_at_05)
print("Classes at threshold 0.7:")
print(class_at_07)
```

This computes probabilities first and then shows how changing thresholds changes class assignments. Comparing thresholds demonstrates the precision/recall tradeoff you make when turning probabilities into hard yes/no decisions.
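To make that tradeoff concrete, here is a small sketch with synthetic labels and probabilities (values invented): raising the threshold accepts fewer positives, which lifts precision but costs recall.

```python
import numpy as np

# Synthetic true labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs = np.array([0.2, 0.4, 0.6, 0.8, 0.55, 0.65, 0.9, 0.1])

for threshold in (0.5, 0.7):
    pred = (probs >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

At 0.5 every true positive is caught but one false positive slips through; at 0.7 the false positive disappears while half the true positives are missed. Which threshold is right depends on the relative cost of each error in your application.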