Mini Project on Predicting House Prices Using Linear Regression with Data Cleaning, Feature Engineering, and Deployment
1. Introduction
- Importance of predicting house prices in real estate.
- Why machine learning is useful in pricing.
- Linear Regression as a baseline model.
- Goals of this mini-project:
  - Clean the dataset
  - Perform feature engineering
  - Build a Linear Regression model
  - Evaluate performance
  - Deploy the model for real-world use
2. Dataset Overview
- Dataset used: House Prices: Advanced Regression Techniques (Kaggle) or any open dataset (Ames Housing).
- Features include:
  - Location
  - Size (square feet)
  - Bedrooms, Bathrooms
  - Year built
  - Garage, Lot size, etc.
- Target: SalePrice
👉 Provide dataset link + small table preview.
3. Step 1: Data Collection
- Download the dataset (CSV format).
- Import libraries: pandas, numpy, matplotlib, seaborn, scikit-learn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("house_prices.csv")
print(data.head())
4. Step 2: Exploratory Data Analysis (EDA)
- Understanding dataset shape, missing values, data types.
- Summary statistics (data.describe()).
- Correlation heatmap (sns.heatmap(data.corr())).
- Distribution of Sale Prices (sns.histplot).
👉 Add visualizations (a minimal sketch follows the list):
- Histogram of prices
- Boxplots (price vs. number of rooms, location)
- Correlation matrix
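A minimal sketch of these visualizations, assuming the Kaggle/Ames column names (SalePrice, BedroomAbvGr); adjust the names to your dataset:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of the target distribution
sns.histplot(data['SalePrice'], kde=True)
plt.title("Distribution of Sale Prices")
plt.show()

# Boxplot: price vs. number of bedrooms (column name assumed)
sns.boxplot(x='BedroomAbvGr', y='SalePrice', data=data)
plt.title("Sale Price by Number of Bedrooms")
plt.show()

# Correlation matrix of numeric features (numeric_only requires pandas >= 1.5)
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()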
5. Step 3: Data Cleaning
- Handle missing values:
  - Numerical → impute with mean/median.
  - Categorical → fill with mode or "Unknown".
- Handle outliers (log-transform SalePrice, remove extreme values; see the sketch below).
- Convert categorical columns into consistent formats.
data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage'].median())
data['GarageType'] = data['GarageType'].fillna("None")
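A minimal sketch of the log-transform and outlier removal mentioned above, assuming the Kaggle/Ames columns SalePrice and GrLivArea; the 4000 sq ft cutoff is an illustrative choice, not a fixed rule:
import numpy as np

# Log-transform the skewed target (invert later with np.expm1)
data['SalePrice_log'] = np.log1p(data['SalePrice'])

# Drop extreme outliers (illustrative threshold; inspect a scatterplot first)
data = data[data['GrLivArea'] < 4000]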
6. Step 4: Feature Engineering
- Encoding categorical variables: one-hot encoding (pd.get_dummies); see the sketch below.
- Feature scaling: standardization / normalization.
- Feature selection: keep only important predictors (using correlation, feature importance).
- Create new features:
  - HouseAge = YrSold - YearBuilt
  - Remodeled = (YearRemodAdd != YearBuilt)
data['HouseAge'] = data['YrSold'] - data['YearBuilt']
data['Remodeled'] = (data['YearRemodAdd'] != data['YearBuilt']).astype(int)
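A minimal sketch of the encoding and scaling steps listed above; in a full workflow, fit the scaler on the training split only to avoid leakage:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical columns
data = pd.get_dummies(data, drop_first=True)

# Standardize the predictors (dummy columns are scaled too, which is harmless for linear models)
num_cols = data.drop(columns=['SalePrice']).columns
scaler = StandardScaler()
data[num_cols] = scaler.fit_transform(data[num_cols])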
7. Step 5: Train-Test Split
- Separate input (X) and output (y).
- Train-test split (80-20).
from sklearn.model_selection import train_test_split
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
8. Step 6: Build Linear Regression Model
- Import and fit the model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
- View coefficients:
coef_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coef_df)
9. Step 7: Model Evaluation
- Predictions:
y_pred = model.predict(X_test)
- Metrics:
  - MAE (Mean Absolute Error)
  - MSE (Mean Squared Error)
  - RMSE (Root Mean Squared Error)
  - R² Score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score:", r2_score(y_test, y_pred))
- Plot Actual vs Predicted Prices (see the sketch below).
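A minimal sketch of the actual-vs-predicted plot; the dashed line marks perfect predictions:
import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, alpha=0.5)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'r--')  # perfect-prediction reference line
plt.xlabel("Actual Sale Price")
plt.ylabel("Predicted Sale Price")
plt.title("Actual vs Predicted Prices")
plt.show()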
10. Step 8: Model Improvement
- Feature engineering (interaction features, polynomial regression).
- Regularization: Ridge & Lasso Regression.
- Cross-validation to avoid overfitting (see the sketch below).
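A minimal sketch of regularized models scored with cross-validation; the alpha values are illustrative, not tuned:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

for name, reg in [("Ridge", Ridge(alpha=1.0)), ("Lasso", Lasso(alpha=0.001))]:
    # 5-fold cross-validated R² on the training split
    scores = cross_val_score(reg, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: mean R² = {scores.mean():.3f} (+/- {scores.std():.3f})")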
11. Step 9: Deployment
- Save the model using joblib or pickle.
import joblib
joblib.dump(model, "house_price_model.pkl")
- Create a Flask API to serve predictions (a minimal sketch follows).
- Simple web interface using Streamlit or Flask + HTML.
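A minimal Flask sketch, assuming the model was saved as house_price_model.pkl and that the request JSON supplies the same feature columns used in training (both assumptions; adjust to your setup):
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load("house_price_model.pkl")  # filename assumed from the step above

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()          # e.g. {"GrLivArea": 1500, ...}
    X_new = pd.DataFrame([payload])       # columns must match the training features
    price = model.predict(X_new)[0]
    return jsonify({"predicted_price": float(price)})

if __name__ == "__main__":
    app.run(port=5000)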
12. Step 10: Real-World Application
- Deploy on Heroku / Vercel / AWS.
- User inputs features → model predicts the house price.
- Show example deployment screenshots.
13. Conclusion
- What we learned:
  - Data preprocessing is key.
  - Linear Regression works as a good baseline.
  - Regularization helps improve generalization.
  - Deployment makes ML models useful in practice.
14. Points to Remember
- Always clean & preprocess data before modeling.
- Linear Regression assumes linear relationships.
- Evaluate with multiple metrics, not just accuracy.
- Use regularization when the dataset is large with many features.
- Deploy models for real-world use.
Mini Project 2: A Comprehensive Project on Predicting House Prices Using Linear Regression with Data Cleaning, Feature Engineering, and Deployment
1. Introduction
Predicting house prices is one of the most common and practical applications of machine learning. Real estate is influenced by multiple factors such as location, area (square footage), number of bedrooms, proximity to schools/markets, and economic conditions. Buyers, sellers, and real estate companies rely on predictive models to estimate fair house prices.
In this mini-project, we will walk through end-to-end steps of building a house price prediction model using Linear Regression.
We will cover:
- Data collection & understanding
- Data cleaning
- Exploratory Data Analysis (EDA)
- Feature engineering & preprocessing
- Model building (Linear Regression)
- Model evaluation
- Deployment (making the model accessible through a simple app or API)
By the end, you will have a working project that you can showcase in your portfolio or use as a resume project.
2. Dataset Overview
For this project, we will use the House Prices dataset (similar to the Kaggle dataset). It typically contains columns like:
- Id – Unique identifier for each house
- LotArea – Size of the lot in square feet
- OverallQual – Overall material and finish quality
- YearBuilt – Year the house was built
- TotalBsmtSF – Total square feet of basement area
- GrLivArea – Above-ground living area in square feet
- FullBath – Number of full bathrooms
- BedroomAbvGr – Number of bedrooms above ground
- GarageCars – Size of garage in car capacity
- SalePrice – Target variable (house price in $)
You can download the dataset from Kaggle or use a smaller sample dataset for demonstration.
3. Step 1: Data Collection
We will use Python’s pandas library to load the dataset.
import pandas as pd
# Load dataset
data = pd.read_csv("house_prices.csv")
# View first 5 rows
print(data.head())
✅ Output: First few rows of the dataset
4. Step 2: Data Cleaning
Real-world datasets are messy. Cleaning is essential.
Common cleaning steps:
- Handling Missing Values
  - Drop columns with too many missing values.
  - Fill missing numeric values with the median.
  - Fill categorical values with the mode.

# Drop columns with more than 30% missing data
data = data.dropna(thresh=0.7*len(data), axis=1)

# Fill missing numeric values with the median
for col in data.select_dtypes(include=['int64', 'float64']).columns:
    data[col] = data[col].fillna(data[col].median())

# Fill categorical values with the mode
for col in data.select_dtypes(include=['object']).columns:
    data[col] = data[col].fillna(data[col].mode()[0])

- Removing Duplicates
data.drop_duplicates(inplace=True)

- Checking Outliers (boxplot method for SalePrice and GrLivArea)
import seaborn as sns
sns.boxplot(x=data['SalePrice'])
5. Step 3: Exploratory Data Analysis (EDA)
EDA helps us understand relationships between the features and the target (SalePrice).
- Distribution of House Prices
import matplotlib.pyplot as plt
sns.histplot(data['SalePrice'], kde=True)
plt.show()
- Correlation Heatmap
plt.figure(figsize=(12,8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm")  # numeric_only skips text columns
plt.show()
- Scatterplot: Living Area vs Sale Price
sns.scatterplot(x='GrLivArea', y='SalePrice', data=data)
plt.show()
✅ Insight: Larger houses tend to have higher prices.
6. Step 4: Feature Engineering & Preprocessing
- Feature Selection (choosing the most important columns)
features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
X = data[features]
y = data['SalePrice']
- Feature Scaling (optional for Linear Regression)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
- Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
7. Step 5: Model Building (Linear Regression)
from sklearn.linear_model import LinearRegression
# Train Model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
8. Step 6: Model Evaluation
We evaluate with R² Score, MAE, RMSE.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
print("R² Score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
✅ Example Output:
- R² Score: 0.83
- MAE: $22,000
- RMSE: $35,000
9. Step 7: Deployment
We can deploy our model using Flask or Streamlit.
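Before building the app, save the fitted model and scaler so the app can load them; a minimal sketch with illustrative filenames:
import joblib

# Persist the fitted model and scaler (filenames are illustrative)
joblib.dump(model, "house_price_model.pkl")
joblib.dump(scaler, "scaler.pkl")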
Example: Streamlit Deployment
import streamlit as st
import numpy as np
import joblib

# Load the fitted model and scaler saved after training (filenames assumed above)
model = joblib.load("house_price_model.pkl")
scaler = joblib.load("scaler.pkl")

st.title("🏡 House Price Prediction App")

# Inputs (same order as the training features)
overallQual = st.slider("Overall Quality (1-10)", 1, 10, 5)
grLivArea = st.number_input("Living Area (sq ft)", 500, 5000, 1500)
garageCars = st.slider("Garage Cars", 0, 4, 2)
totalBsmtSF = st.number_input("Basement Area (sq ft)", 0, 3000, 800)
fullBath = st.slider("Full Bathrooms", 0, 3, 2)
yearBuilt = st.number_input("Year Built", 1900, 2023, 2000)

# Prediction: apply the same scaling used during training
features = np.array([[overallQual, grLivArea, garageCars, totalBsmtSF, fullBath, yearBuilt]])
prediction = model.predict(scaler.transform(features))[0]
st.write(f"💰 Estimated House Price: ${prediction:,.2f}")
Run the app:
streamlit run app.py
10. Conclusion
In this mini-project, we:
- Collected and cleaned real-world house price data
- Conducted exploratory data analysis (EDA)
- Engineered and preprocessed features
- Built and evaluated a Linear Regression model
- Deployed the model using Streamlit
This project demonstrates an end-to-end ML workflow and is portfolio-ready. You can expand it by:
- Trying advanced models (Random Forest, XGBoost)
- Using cross-validation for robustness
- Adding interactive dashboards
11. Points to Remember 📝
- Always clean & preprocess data before modeling.
- Feature engineering is as important as the algorithm itself.
- Start simple (Linear Regression), then move to complex models.
- Deploying your project makes it impactful for resumes.
Mini Project 3: Predicting Diabetes Using Logistic Regression
Heading for Blog Post:
“Step-by-Step Mini Project on Predicting Diabetes Using Logistic Regression with Data Preprocessing, Model Evaluation, and Insights”
Blog Structure (Detailed Outline):
1. Introduction
- Why healthcare predictions matter.
- Importance of early diabetes detection.
- Why logistic regression is a good starting model for classification problems.
2. Dataset Overview
- Dataset: Pima Indians Diabetes Dataset (from Kaggle/UCI Repository).
- Features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.
- Target: Outcome (1 = Diabetes, 0 = No Diabetes).
3. Data Preprocessing
- Import the dataset with pandas.
- Check for missing values and outliers.
- Handle zeros in features like Blood Pressure, BMI, Insulin (replace with the median or impute).
- Normalize/scale features using StandardScaler (see the sketch below).
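A minimal preprocessing sketch, assuming the file is named diabetes.csv and uses the standard Pima column names:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")  # assumed filename

# Zeros in these physiological columns really mean "missing"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())

# Scale the features (not the target)
X = df.drop(columns=["Outcome"])
y = df["Outcome"]
X_scaled = StandardScaler().fit_transform(X)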
4. Exploratory Data Analysis (EDA)
- Show distribution plots for glucose, age, and BMI.
- Correlation heatmap between features.
- Class balance check (diabetic vs non-diabetic).
5. Splitting the Dataset
- Train-test split (e.g., 80–20).
6. Model Building — Logistic Regression
- Train the logistic regression model.
- Predict outcomes on the test set.
- Interpret coefficients to understand feature importance.
7. Model Evaluation
- Confusion Matrix.
- Accuracy Score.
- Precision, Recall, F1-score.
- ROC Curve & AUC Score (see the sketch below).
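A minimal evaluation sketch, assuming a fitted LogisticRegression named model and a held-out X_test / y_test from the split above:
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))       # precision, recall, F1
print("ROC-AUC:", roc_auc_score(y_test, y_proba))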
8. Improving the Model
- Feature scaling.
- Regularization (L1/L2).
- Hyperparameter tuning (C, solver).
9. Results & Insights
- Which features contribute most to diabetes risk?
- How accurate is the model in detecting diabetes?
- Discuss real-world application in preventive healthcare.
10. Deployment (Optional)
- Export the model with joblib.
- Create a simple Flask API or Streamlit app where users can input values to predict diabetes risk.
11. Conclusion
- Logistic regression is a solid baseline for binary classification.
- Highlights the importance of preprocessing in healthcare datasets.
- Foundation for more advanced models (Random Forest, XGBoost, Deep Learning).
Mini Project 4 — Predicting Diabetes Using Logistic Regression (End-to-End)
What you’ll build An end-to-end, production-style pipeline that:
1. cleans and validates a healthcare dataset,
2. explores the data visually,
3. trains a regularized logistic-regression classifier with cross-validation,
4. evaluates it with clinically meaningful metrics (ROC-AUC, PR-AUC, recall at fixed precision, etc.),
5. tunes the probability threshold for your use case,
6. packages the entire preprocessing+model into a reusable Pipeline, and
7. exposes predictions via a minimal API / Streamlit app.
> Dataset: Pima Indians Diabetes (UCI/Kaggle).
Target: Outcome (1 = diabetes, 0 = no diabetes).
Features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.
1) Project setup
# recommended env
python -m venv .venv && source .venv/bin/activate # (Windows: .venv\Scripts\activate)
pip install -U numpy pandas scikit-learn matplotlib seaborn imbalanced-learn shap joblib flask streamlit
Directory layout (simple and portable):
diabetesproject/
├── data/diabetes.csv
├── notebooks/ (optional)
├── src/
│ └── utils.py
├── models/
├── app/
│ ├── api.py # Flask API
│ └── streamlit_app.py
└── train.py
2) Load data & quick sanity checks
# train.py (top)
import numpy as np, pandas as pd
df = pd.read_csv("data/diabetes.csv")
print(df.shape)
print(df.head())
print(df.Outcome.value_counts(normalize=True)) # class balance
print(df.isna().sum())
A healthcare-specific gotcha: zeros that mean “missing”
In this dataset, several physiological measures can’t realistically be zero; zeros are actually missing values.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[ZERO_AS_MISSING] = df[ZERO_AS_MISSING].replace(0, np.nan)
df.isna().sum()
3) Train/Validation/Test split (stratified)
from sklearn.model_selection import train_test_split
X = df.drop(columns=["Outcome"])
y = df["Outcome"].astype(int)
X_train, X_temp, y_train, y_temp = train_test_split(
X, y, test_size=0.30, random_state=42, stratify=y
)
X_valid, X_test, y_valid, y_test = train_test_split(
X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)
print(len(X_train), len(X_valid), len(X_test))
4) Exploratory Data Analysis (EDA)
Class balance
import matplotlib.pyplot as plt
import seaborn as sns
sns.countplot(x=y)
plt.title("Class distribution (0 = no diabetes, 1 = diabetes)")
plt.show()
Distributions & relationships
X_train.hist(figsize=(12,8), bins=20); plt.tight_layout(); plt.show()
plt.figure(figsize=(10,8))
sns.heatmap(pd.concat([X_train, y_train], axis=1).corr(), annot=False, cmap="coolwarm")
plt.title("Feature correlation with Outcome"); plt.show()
sns.scatterplot(x=X_train["Glucose"], y=y_train, alpha=0.4)
plt.title("Glucose vs Outcome"); plt.show()
Typical insights:
- Glucose and BMI usually correlate positively with diabetes risk.
- Age and DiabetesPedigreeFunction also carry signal.
- Several features need imputation and scaling.
5) Preprocessing pipeline (impute + scale)
Since all columns here are numeric, we’ll:
1. impute missing values with median,
2. standardize to zero-mean unit-variance.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
numeric_features = X_train.columns.tolist()
preprocess = ColumnTransformer(
transformers=[
("num", Pipeline(steps=[
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler())
]), numeric_features)
]
)
6) Establish a baseline (DummyClassifier)
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
baseline = Pipeline([("prep", preprocess), ("clf", DummyClassifier(strategy="most_frequent"))])
baseline.fit(X_train, y_train)
proba_valid = baseline.predict_proba(X_valid)[:,1]
print("Baseline ROC-AUC:", roc_auc_score(y_valid, proba_valid))
This gives you a sanity-check floor. Your real model should beat it comfortably.
7) Train regularized Logistic Regression (with class imbalance in mind)
We’ll try L2 (Ridge) and L1 (Lasso) penalties, use class_weight="balanced" to counter imbalance, and perform CV on C (inverse regularization).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV
logreg = LogisticRegression(max_iter=1000, solver="liblinear", class_weight="balanced")
pipe = Pipeline([
("prep", preprocess),
("clf", logreg)
])
param_grid = {
"clf__penalty": ["l1", "l2"],
"clf__C": [0.01, 0.03, 0.1, 0.3, 1, 3, 10]
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1, refit=True)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV ROC-AUC:", grid.best_score_)
best_model = grid.best_estimator_
8) Evaluation: ROC, PR, confusion matrix, and calibration
from sklearn.metrics import (roc_auc_score, roc_curve, precision_recall_curve,
average_precision_score, confusion_matrix,
classification_report)
proba_valid = best_model.predict_proba(X_valid)[:,1]
pred_valid_default = (proba_valid >= 0.5).astype(int)
print("Valid ROC-AUC:", roc_auc_score(y_valid, proba_valid))
print("Valid PR-AUC:", average_precision_score(y_valid, proba_valid))
print(confusion_matrix(y_valid, pred_valid_default))
print(classification_report(y_valid, pred_valid_default, digits=3))
Curves
# ROC
fpr, tpr, _ = roc_curve(y_valid, proba_valid)
plt.plot(fpr, tpr, label="LogReg")
plt.plot([0,1],[0,1],"--"); plt.xlabel("FPR"); plt.ylabel("TPR"); plt.title("ROC"); plt.legend(); plt.show()
# Precision-Recall (useful for imbalance)
prec, rec, thr = precision_recall_curve(y_valid, proba_valid)
plt.plot(rec, prec); plt.xlabel("Recall"); plt.ylabel("Precision"); plt.title("PR Curve"); plt.show()
Probability calibration (optional but valuable)
If you need calibrated probabilities (for risk scoring), wrap with CalibratedClassifierCV.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
calibrated = CalibratedClassifierCV(best_model.named_steps["clf"], method="isotonic", cv=5)
cal_pipe = Pipeline([("prep", preprocess), ("cal", calibrated)])
cal_pipe.fit(X_train, y_train)
proba_valid_cal = cal_pipe.predict_proba(X_valid)[:,1]
roc_auc_score(y_valid, proba_valid_cal)
Plot reliability:
prob_true, prob_pred = calibration_curve(y_valid, proba_valid_cal, n_bins=10)
plt.plot(prob_pred, prob_true, marker="o"); plt.plot([0,1],[0,1],"--")
plt.xlabel("Predicted probability"); plt.ylabel("Observed frequency"); plt.title("Calibration"); plt.show()
9) Threshold tuning (optimize for your clinical goal)
Default 0.5 often isn’t optimal. For screening, you may prefer high recall (catch most true diabetes cases).
def threshold_at_recall(y_true, proba, target_recall=0.85):
    """Return the largest threshold whose recall still meets target_recall."""
    prec, rec, thr = precision_recall_curve(y_true, proba)
    # rec has len(thr)+1 entries and falls as the threshold rises;
    # drop the last entry to align with thr, then scan from high thresholds down
    for r, t in zip(rec[:-1][::-1], thr[::-1]):
        if r >= target_recall:
            return t
    return 0.5
best_thr = threshold_at_recall(y_valid, proba_valid, target_recall=0.85)
pred_valid = (proba_valid >= best_thr).astype(int)
print("Chosen threshold:", best_thr)
print(confusion_matrix(y_valid, pred_valid))
print(classification_report(y_valid, pred_valid, digits=3))
If false positives are very costly in your context, flip the optimization to high precision instead.
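A minimal sketch of the opposite trade-off, picking the smallest threshold whose precision meets a target (mirrors threshold_at_recall above; the 0.80 target is illustrative):
from sklearn.metrics import precision_recall_curve

def threshold_at_precision(y_true, proba, target_precision=0.80):
    """Return the smallest threshold whose precision is >= target_precision."""
    prec, rec, thr = precision_recall_curve(y_true, proba)
    for p, t in zip(prec[:-1], thr):  # prec[:-1] aligns with thr
        if p >= target_precision:
            return t
    return 0.5

high_prec_thr = threshold_at_precision(y_valid, proba_valid, target_precision=0.80)
print("High-precision threshold:", high_prec_thr)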
10) Handling class imbalance (alternative to class_weight)
SMOTE can synthetically up-sample the minority class. Use within a pipeline to avoid leakage.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
imb_pipe = ImbPipeline([
("prep", preprocess),
("smote", SMOTE(random_state=42)),
("clf", LogisticRegression(max_iter=1000, penalty="l2", C=1.0))
])
imb_pipe.fit(X_train, y_train)
proba_valid_sm = imb_pipe.predict_proba(X_valid)[:,1]
print("ROC-AUC with SMOTE:", roc_auc_score(y_valid, proba_valid_sm))
Compare with/without SMOTE; pick what generalizes best to test.
11) Interpretability: coefficients → odds ratios (+ sanity checks)
final_lr = best_model.named_steps["clf"]
coef = pd.Series(final_lr.coef_.ravel(), index=numeric_features)
odds = np.exp(coef).sort_values(ascending=False)
print("Odds ratios (per 1 SD increase):")
print(odds)
Odds ratio > 1 → higher risk as the feature increases (after scaling).
Cross-check clinical plausibility (e.g., higher glucose should increase risk).
Optional: SHAP for local & global explanations (use after training on raw arrays).
import shap
explainer = shap.LinearExplainer(final_lr, best_model.named_steps["prep"].transform(X_train), feature_perturbation="interventional")
shap_values = explainer.shap_values(best_model.named_steps["prep"].transform(X_valid))
shap.summary_plot(shap_values, features=best_model.named_steps["prep"].transform(X_valid), feature_names=numeric_features)
(If you run this in notebooks, SHAP draws beautiful plots.)
12) Final evaluation on the held-out test set
Always lock your decisions (model + threshold) using validation, then report one final unbiased score on the test set.
# Fit on train+valid using chosen hyperparams
X_trval = pd.concat([X_train, X_valid], axis=0)
y_trval = pd.concat([y_train, y_valid], axis=0)
final_model = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1, refit=True)
final_model.fit(X_trval, y_trval)
proba_test = final_model.best_estimator_.predict_proba(X_test)[:,1]
print("TEST ROC-AUC:", roc_auc_score(y_test, proba_test))
# apply the same threshold picked on validation
pred_test = (proba_test >= best_thr).astype(int)
print(confusion_matrix(y_test, pred_test))
print(classification_report(y_test, pred_test, digits=3))
13) Save a single, production-ready Pipeline
import joblib
joblib.dump(final_model.best_estimator_, "models/diabetes_pipeline.joblib")
print("Saved → models/diabetes_pipeline.joblib")
This artifact includes: imputation, scaling, and the classifier — no separate preprocessing needed at inference time.
14) Minimal Flask API for deployment
app/api.py:
from flask import Flask, request, jsonify
import numpy as np, joblib
app = Flask(__name__)
pipe = joblib.load("models/diabetes_pipeline.joblib")
FEATURES = ["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin",
"BMI","DiabetesPedigreeFunction","Age"]
THRESHOLD = 0.5 # (or use the tuned best_thr you picked)
@app.route("/predict", methods=["POST"])
def predict():
payload = request.get_json()
x = np.array([[payload.get(f, None) for f in FEATURES]], dtype=float)
proba = pipe.predict_proba(x)[0,1].item()
pred = int(proba >= THRESHOLD)
return jsonify({"probability": proba, "prediction": pred})
if __name__ == "__main__":
app.run(host="0.0.0.0", port=8000)
Run locally:
export FLASK_APP=app/api.py
flask run -p 8000
Sample request:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"Pregnancies":2,"Glucose":148,"BloodPressure":72,"SkinThickness":35,"Insulin":0,
"BMI":33.6,"DiabetesPedigreeFunction":0.627,"Age":50}'
15) Simple Streamlit app (nice for demos)
app/streamlit_app.py:
import streamlit as st, numpy as np, joblib
st.set_page_config(page_title="Diabetes Risk Predictor", page_icon="🩺", layout="centered")
pipe = joblib.load("models/diabetes_pipeline.joblib")
st.title("🩺 Diabetes Risk Prediction (Logistic Regression)")
st.caption("Educational demo. Not a medical device.")
cols = st.columns(2)
Pregnancies = cols[0].number_input("Pregnancies", 0, 20, 1)
Glucose = cols[1].number_input("Glucose", 0, 250, 120)
BloodPressure = cols[0].number_input("BloodPressure", 0, 140, 70)
SkinThickness = cols[1].number_input("SkinThickness", 0, 100, 20)
Insulin = cols[0].number_input("Insulin", 0, 900, 80)
BMI = cols[1].number_input("BMI", 0.0, 70.0, 28.5)
DiabetesPedigreeFunction = cols[0].number_input("DiabetesPedigreeFunction", 0.0, 3.0, 0.5, step=0.01)
Age = cols[1].number_input("Age", 10, 100, 35)
if st.button("Predict"):
x = np.array([[Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin,
BMI, DiabetesPedigreeFunction, Age]], dtype=float)
proba = pipe.predict_proba(x)[0,1]
st.metric("Estimated probability of diabetes", f"{proba:.2%}")
st.progress(min(1.0, proba))
Run:
streamlit run app/streamlit_app.py
16) Validation & governance checklist (production-minded)
- Data quality gates: refuse rows with missing critical inputs (or impute consistently).
- Monitoring: track drift in feature distributions and output probabilities.
- Bias/fairness: if you add sensitive attributes (age groups, sex), evaluate subgroup performance.
- Security & privacy: no PII; log only hashed request IDs; encrypt model artifacts at rest.
- Clinical caution: include a disclaimer; do not present as a diagnostic device.
17) Common troubleshooting
- ROC-AUC looks okay but precision is low → adjust the threshold for higher precision; consider PR-AUC optimization.
- Model unstable across CV folds → more data, stronger regularization (smaller C), or simpler preprocessing.
- SMOTE helps on validation but hurts test → avoid; prefer class_weight="balanced" or better features.
- Weird coefficients → confirm scaling; check multicollinearity; try L1 to zero out noisy features.
18) Summary
You built a clean, interpretable, regularized logistic-regression classifier for diabetes risk:
- robust preprocessing (impute + scale),
- principled model selection with CV,
- clinically relevant evaluation (ROC/PR, threshold tuning),
- interpretation via odds ratios / SHAP,
- and deployable artifacts (single Pipeline + API/UI).
This exact scaffolding scales to many healthcare classification problems (CKD risk, readmission prediction, etc.).