Mini Project on Predicting House Prices Using Linear Regression with Data Cleaning, Feature Engineering, and Deployment


1. Introduction

  • Importance of predicting house prices in real estate.

  • Why machine learning is useful in pricing.

  • Linear Regression as a baseline model.

  • Goals of this mini-project:

    • Clean dataset

    • Perform feature engineering

    • Build Linear Regression model

    • Evaluate performance

    • Deploy model for real-world use


2. Dataset Overview

  • Dataset used: House Prices: Advanced Regression Techniques (Kaggle), which is based on the Ames Housing data; any similar open housing dataset also works.

  • Features include:

    • Location

    • Size (square feet)

    • Bedrooms, Bathrooms

    • Year built

    • Garage, Lot size, etc.

  • Target: SalePrice

👉 Provide dataset link + small table preview.


3. Step 1: Data Collection

  • Download dataset (CSV format).

  • Import libraries: pandas, numpy, matplotlib, seaborn, scikit-learn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("house_prices.csv")
print(data.head())

4. Step 2: Exploratory Data Analysis (EDA)

  • Understanding dataset shape, missing values, data types.

  • Summary statistics (data.describe()).

  • Correlation heatmap (sns.heatmap(data.corr())).

  • Distribution of Sale Prices (sns.histplot).

👉 Add visualizations (a quick code sketch follows this list):

  • Histogram of prices

  • Boxplots (price vs. number of rooms, location)

  • Correlation matrix
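A minimal EDA sketch covering these checks and plots; it assumes the `data` DataFrame loaded in Step 1 and the Kaggle/Ames column names such as SalePrice and BedroomAbvGr:

import matplotlib.pyplot as plt
import seaborn as sns

# Shape, missing values, and summary statistics
print(data.shape)
print(data.isnull().sum().sort_values(ascending=False).head(10))
print(data.describe())

# Histogram of sale prices
sns.histplot(data['SalePrice'], kde=True)
plt.show()

# Boxplot: price vs. number of bedrooms
sns.boxplot(x='BedroomAbvGr', y='SalePrice', data=data)
plt.show()

# Correlation matrix (numeric columns only)
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), cmap="coolwarm")
plt.show()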


5. Step 3: Data Cleaning

  • Handle missing values:

    • Numerical → Impute with mean/median.

    • Categorical → Fill with mode or "Unknown".

  • Handle outliers (log-transform SalePrice, remove extreme values); a short sketch follows the code snippet below.

  • Convert categorical columns into consistent formats.

data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage'].median())
data['GarageType'] = data['GarageType'].fillna("None")
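A small sketch of the outlier handling mentioned above; the living-area threshold is illustrative and should be chosen after inspecting your own boxplots:

import numpy as np

# Log-transform the skewed target so extreme prices have less influence
data['SalePrice_log'] = np.log1p(data['SalePrice'])

# Drop extreme living-area outliers (threshold chosen by inspecting the data)
data = data[data['GrLivArea'] < 4500]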

6. Step 4: Feature Engineering

  • Encoding categorical variables: one-hot encoding with pd.get_dummies (see the sketch after the code below).

  • Feature scaling: Standardization / normalization.

  • Feature selection: Keep only important predictors (using correlation, feature importance).

  • Create new features:

    • HouseAge = YrSold - YearBuilt

    • Remodeled = (YearRemodAdd != YearBuilt)

data['HouseAge'] = data['YrSold'] - data['YearBuilt']
data['Remodeled'] = (data['YearRemodAdd'] != data['YearBuilt']).astype(int)
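A short sketch of the encoding and scaling steps listed above. It is illustrative only; in a real pipeline you would fit the scaler on the training split alone:

from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical columns
data_encoded = pd.get_dummies(data, drop_first=True)

# Standardize the numeric predictors, keeping the target (and target-derived columns) out of the scaler
numeric_cols = data.select_dtypes(include=['int64', 'float64']).columns.drop(
    ['SalePrice', 'SalePrice_log'], errors='ignore')
scaler = StandardScaler()
data_encoded[numeric_cols] = scaler.fit_transform(data_encoded[numeric_cols])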

7. Step 5: Train-Test Split

  • Separate input (X) and output (y).

  • Train-test split (80-20).

from sklearn.model_selection import train_test_split

X = data.drop('SalePrice', axis=1)
y = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

8. Step 6: Build Linear Regression Model

  • Import model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

  • View coefficients:

coef_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coef_df)

9. Step 7: Model Evaluation

  • Predictions: y_pred = model.predict(X_test)

  • Metrics:

    • MAE (Mean Absolute Error)

    • MSE (Mean Squared Error)

    • RMSE (Root Mean Squared Error)

    • R² Score

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score:", r2_score(y_test, y_pred))

  • Plot Actual vs Predicted Prices.


10. Step 8: Model Improvement

  • Feature engineering (interaction features, polynomial regression).

  • Regularization: Ridge & Lasso Regression.

  • Cross-validation to avoid overfitting.
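A brief sketch of the regularized variants with cross-validation. It assumes X_train and y_train from Step 5, that the features are already numerically encoded, and uses illustrative alpha values:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Compare Ridge (L2) and Lasso (L1) with 5-fold cross-validation
for name, reg in [("Ridge", Ridge(alpha=1.0)), ("Lasso", Lasso(alpha=1.0))]:
    scores = cross_val_score(reg, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")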




11. Step 9: Deployment

  • Save model using joblib or pickle.

import joblib
joblib.dump(model, "house_price_model.pkl")

  • Create a Flask API to serve predictions.

  • Simple web interface using Streamlit or Flask + HTML.
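A minimal Flask API sketch. It assumes the saved model above was trained on exactly these six numeric features, in this order; adapt the feature list to your own pipeline:

# app.py
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load("house_price_model.pkl")
FEATURES = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Build a one-row frame in the training feature order
    row = pd.DataFrame([[payload.get(f) for f in FEATURES]], columns=FEATURES).astype(float)
    price = float(model.predict(row)[0])
    return jsonify({"predicted_price": price})

if __name__ == "__main__":
    app.run(debug=True)

Test it with a POST request containing JSON values for each feature (for example via curl or Postman).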


12. Step 10: Real-World Application

  • Deploy on Heroku / Vercel / AWS.

  • User inputs features → Model predicts house price.

  • Show example deployment screenshots.


13. Conclusion

  • What we learned:

    • Data preprocessing is key.

    • Linear Regression works as a good baseline.

    • Regularization helps improve generalization.

    • Deployment makes ML models useful in practice.


14. Points to Remember

  • Always clean & preprocess data before modeling.

  • Linear Regression assumes linear relationships.

  • Evaluate with multiple metrics, not just accuracy.

  • Use regularization when dataset is large with many features.

  • Deploy models for real-world use.


Mini Project 2: A Comprehensive Walkthrough of Predicting House Prices Using Linear Regression with Data Cleaning, Feature Engineering, and Deployment

1. Introduction

Predicting house prices is one of the most common and practical applications of machine learning. Real estate is influenced by multiple factors such as location, area (square footage), number of bedrooms, proximity to schools/markets, and economic conditions. Buyers, sellers, and real estate companies rely on predictive models to estimate fair house prices.

In this mini-project, we will walk through the end-to-end steps of building a house price prediction model using Linear Regression.

We will cover:

  • Data collection & understanding

  • Data cleaning

  • Exploratory Data Analysis (EDA)

  • Feature engineering & preprocessing

  • Model building (Linear Regression)

  • Model evaluation

  • Deployment (making the model accessible through a simple app or API)

By the end, you will have a working project that you can showcase in your portfolio or use as a resume project.


2. Dataset Overview

For this project, we will use the House Prices dataset (similar to the Kaggle dataset). It typically contains columns like:

  • Id – Unique identifier for each house

  • LotArea – Size of the lot in square feet

  • OverallQual – Overall material and finish quality

  • YearBuilt – Year the house was built

  • TotalBsmtSF – Total square feet of basement area

  • GrLivArea – Above-ground living area in square feet

  • FullBath – Number of full bathrooms

  • BedroomAbvGr – Number of bedrooms above ground

  • GarageCars – Size of garage in car capacity

  • SalePrice – Target variable (House price in $)

You can download the dataset from Kaggle or use a smaller sample dataset for demonstration.


3. Step 1: Data Collection

We will use Python’s pandas library to load the dataset.

import pandas as pd

# Load dataset
data = pd.read_csv("house_prices.csv")

# View first 5 rows
print(data.head())

✅ Output: First few rows of the dataset




4. Step 2: Data Cleaning

Real-world datasets are messy. Cleaning is essential.

Common cleaning steps:

  1. Handling Missing Values

    • Drop columns with too many missing values.

    • Fill missing numeric values with median.

    • Fill categorical values with mode.

# Drop columns with more than 30% missing data
data = data.dropna(thresh=int(0.7 * len(data)), axis=1)

# Fill missing numeric columns with the median
for col in data.select_dtypes(include=['int64', 'float64']).columns:
    data[col] = data[col].fillna(data[col].median())

# Fill missing categorical columns with the mode
for col in data.select_dtypes(include=['object']).columns:
    data[col] = data[col].fillna(data[col].mode()[0])

  2. Removing Duplicates

data.drop_duplicates(inplace=True)

  3. Checking Outliers (boxplot method for SalePrice and GrLivArea)

import seaborn as sns
sns.boxplot(data['SalePrice'])

5. Step 3: Exploratory Data Analysis (EDA)

EDA helps us understand relationships between features and target (SalePrice).

  1. Distribution of House Prices

import matplotlib.pyplot as plt
sns.histplot(data['SalePrice'], kde=True)
plt.show()

  2. Correlation Heatmap

plt.figure(figsize=(12,8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

  3. Scatterplot: Living Area vs Sale Price

sns.scatterplot(x='GrLivArea', y='SalePrice', data=data)
plt.show()

✅ Insight: Larger houses tend to have higher prices.


6. Step 4: Feature Engineering & Preprocessing

  1. Feature Selection (Choosing most important columns)

features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
X = data[features]
y = data['SalePrice']

  2. Feature Scaling (optional for Linear Regression)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

  3. Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

7. Step 5: Model Building (Linear Regression)

from sklearn.linear_model import LinearRegression

# Train Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

8. Step 6: Model Evaluation

We evaluate with R² Score, MAE, RMSE.

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

print("R² Score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

✅ Example Output:

  • R² Score: 0.83

  • MAE: $22,000

  • RMSE: $35,000


9. Step 7: Deployment

We can deploy our model using Flask or Streamlit.
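Because the model in Step 5 was fitted on scaled features, first save both the trained model and the fitted scaler from Step 4 so the app can reuse them. A short sketch using joblib:

import joblib

# Persist the trained model and the fitted StandardScaler for the app below
joblib.dump(model, "house_price_model.pkl")
joblib.dump(scaler, "scaler.pkl")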

Example: Streamlit Deployment

import streamlit as st
import pandas as pd
import joblib

st.title("🏡 House Price Prediction App")

# Load the trained model and the fitted scaler saved earlier
model = joblib.load("house_price_model.pkl")
scaler = joblib.load("scaler.pkl")

# Inputs
overallQual = st.slider("Overall Quality (1-10)", 1, 10, 5)
grLivArea = st.number_input("Living Area (sq ft)", 500, 5000, 1500)
garageCars = st.slider("Garage Cars", 0, 4, 2)
totalBsmtSF = st.number_input("Basement Area (sq ft)", 0, 3000, 800)
fullBath = st.slider("Full Bathrooms", 0, 3, 2)
yearBuilt = st.number_input("Year Built", 1900, 2023, 2000)

# Prediction: keep the training feature order and apply the same scaling used in training
features = pd.DataFrame(
    [[overallQual, grLivArea, garageCars, totalBsmtSF, fullBath, yearBuilt]],
    columns=['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt'])
prediction = model.predict(scaler.transform(features))[0]

st.write(f"💰 Estimated House Price: ${prediction:,.2f}")

Run the app:

streamlit run app.py

10. Conclusion

In this mini-project, we:

  • Collected and cleaned real-world house price data

  • Conducted exploratory data analysis (EDA)

  • Engineered and preprocessed features

  • Built and evaluated a Linear Regression model

  • Deployed the model using Streamlit

This project demonstrates end-to-end ML workflow and is a portfolio-ready project. You can expand it by:

  • Trying Advanced Models (Random Forest, XGBoost)

  • Using Cross-Validation for robustness

  • Adding interactive dashboards




11. Points to Remember 📝

  • Always clean & preprocess data before modeling.

  • Feature engineering is as important as the algorithm itself.

  • Start simple (Linear Regression), then move to complex models.

  • Deploying your project makes it impactful for resumes.


Mini Project 3: Predicting Diabetes Using Logistic Regression

Heading for Blog Post:

“Step-by-Step Mini Project on Predicting Diabetes Using Logistic Regression with Data Preprocessing, Model Evaluation, and Insights”


Blog Structure (Detailed Outline):

1. Introduction

  • Why healthcare predictions matter.

  • Importance of early diabetes detection.

  • Why logistic regression is a good starting model for classification problems.


2. Dataset Overview

  • Dataset: Pima Indians Diabetes Dataset (from Kaggle/UCI Repository).

  • Features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.

  • Target: Outcome (1 = Diabetes, 0 = No Diabetes).


3. Data Preprocessing

  • Import dataset with pandas.

  • Check for missing values, outliers.

  • Handle zeros in features like Blood Pressure, BMI, Insulin (replace with median or impute).

  • Normalize/scale features using StandardScaler.
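A minimal sketch of these preprocessing steps, assuming the standard Pima column names and a local diabetes.csv (Mini Project 4 below builds the full pipeline version):

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")

# Zeros in these physiological columns really mean "missing": replace, then impute with the median
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())

# Separate features/target and scale the features
# (in a real workflow, fit the scaler on the training split only)
X = df.drop(columns=["Outcome"])
y = df["Outcome"]
X_scaled = StandardScaler().fit_transform(X)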


4. Exploratory Data Analysis (EDA)

  • Show distribution plots for glucose, age, and BMI.

  • Correlation heatmap between features.

  • Class balance check (diabetic vs non-diabetic).


5. Splitting the Dataset

  • Train-test split (e.g., 80–20).


6. Model Building — Logistic Regression

  • Train logistic regression model.

  • Predict outcomes on test set.

  • Interpret coefficients to understand feature importance.


7. Model Evaluation

  • Confusion Matrix.

  • Accuracy Score.

  • Precision, Recall, F1-score.

  • ROC Curve & AUC Score.
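A compact sketch of the split, model fit, and these evaluation metrics; it reuses X_scaled and y from the preprocessing sketch above:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score, classification_report,
                             roc_auc_score, RocCurveDisplay)
import matplotlib.pyplot as plt

# Stratified 80-20 split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1
print("ROC-AUC:", roc_auc_score(y_test, y_proba))

RocCurveDisplay.from_predictions(y_test, y_proba)
plt.show()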


8. Improving the Model

  • Feature scaling.

  • Regularization (L1/L2).

  • Hyperparameter tuning (C, solver).
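A short sketch of tuning C and the penalty (assuming X_train and y_train from the sketch above; the liblinear solver supports both L1 and L2):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000, solver="liblinear"),
                    param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV ROC-AUC:", grid.best_score_)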


9. Results & Insights

  • Which features contribute most to diabetes risk?

  • How accurate is the model in detecting diabetes?

  • Discuss real-world application in preventive healthcare.


10. Deployment (Optional)

  • Export model with joblib.

  • Create a simple Flask API or Streamlit app where users can input values to predict diabetes risk.


11. Conclusion

  • Logistic regression is a solid baseline for binary classification.

  • Highlights the importance of preprocessing in healthcare datasets.

  • Foundation for more advanced models (Random Forest, XGBoost, Deep Learning).




Mini Project 4: Predicting Diabetes Using Logistic Regression (End-to-End)

What you’ll build: an end-to-end, production-style pipeline that:

1. cleans and validates a healthcare dataset,

2. explores the data visually,

3. trains a regularized logistic-regression classifier with cross-validation,

4. evaluates it with clinically meaningful metrics (ROC-AUC, PR-AUC, recall at fixed precision, etc.),

5. tunes the probability threshold for your use case,

6. packages the entire preprocessing+model into a reusable Pipeline, and

7. exposes predictions via a minimal API / Streamlit app.

> Dataset: Pima Indians Diabetes (UCI/Kaggle).

Target: Outcome (1 = diabetes, 0 = no diabetes).
Features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.

1) Project setup

# recommended env

python -m venv .venv && source .venv/bin/activate # (Windows: .venv\Scripts\activate)

pip install -U numpy pandas scikit-learn matplotlib seaborn imbalanced-learn shap joblib flask streamlit

Directory layout (simple and portable):

diabetesproject/

  ├── data/diabetes.csv

  ├── notebooks/ (optional)

  ├── src/

  │ └── utils.py

  ├── models/

  ├── app/

  │ ├── api.py # Flask API

  │ └── streamlit_app.py

  └── train.py

2) Load data & quick sanity checks

# train.py (top)

import numpy as np, pandas as pd

df = pd.read_csv("data/diabetes.csv")

print(df.shape)

print(df.head())

print(df.Outcome.value_counts(normalize=True)) # class balance

print(df.isna().sum())

A healthcare-specific gotcha: zeros that mean “missing”

In this dataset, several physiological measures can’t realistically be zero; zeros are actually missing values.

ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

df[ZERO_AS_MISSING] = df[ZERO_AS_MISSING].replace(0, np.nan)

df.isna().sum()

3) Train/Validation/Test split (stratified)

from sklearn.model_selection import train_test_split

X = df.drop(columns=["Outcome"])

y = df["Outcome"].astype(int)

X_train, X_temp, y_train, y_temp = train_test_split(

    X, y, test_size=0.30, random_state=42, stratify=y

)

X_valid, X_test, y_valid, y_test = train_test_split(

    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp

)

print(len(X_train), len(X_valid), len(X_test))


4) Exploratory Data Analysis (EDA)

Class balance

import matplotlib.pyplot as plt

import seaborn as sns

sns.countplot(x=y)

plt.title("Class distribution (0 = no diabetes, 1 = diabetes)")

plt.show()

Distributions & relationships

X_train.hist(figsize=(12,8), bins=20); plt.tight_layout(); plt.show()

plt.figure(figsize=(10,8))

sns.heatmap(pd.concat([X_train, y_train], axis=1).corr(), annot=False, cmap="coolwarm")

plt.title("Feature correlation with Outcome"); plt.show()

sns.scatterplot(x=X_train["Glucose"], y=y_train, alpha=0.4)

plt.title("Glucose vs Outcome"); plt.show()

Typical insights

Glucose and BMI usually correlate positively with diabetes risk.

Age and DiabetesPedigreeFunction also carry signal.

Several features need imputation and scaling.


5) Preprocessing pipeline (impute + scale)

Since all columns here are numeric, we’ll:

1. impute missing values with median,

2. standardize to zero-mean unit-variance.

from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler

numeric_features = X_train.columns.tolist()

preprocess = ColumnTransformer(

    transformers=[

        ("num", Pipeline(steps=[

            ("imputer", SimpleImputer(strategy="median")),

            ("scaler", StandardScaler())

        ]), numeric_features)

    ]

)


6) Establish a baseline (DummyClassifier)


from sklearn.dummy import DummyClassifier

from sklearn.metrics import roc_auc_score

baseline = Pipeline([("prep", preprocess), ("clf", DummyClassifier(strategy="most_frequent"))])

baseline.fit(X_train, y_train)

proba_valid = baseline.predict_proba(X_valid)[:,1]

print("Baseline ROC-AUC:", roc_auc_score(y_valid, proba_valid))

This gives you a sanity-check floor. Your real model should beat it comfortably.


7) Train regularized Logistic Regression (with class imbalance in mind)


We’ll try L2 (Ridge) and L1 (Lasso) penalties, use class_weight="balanced" to counter imbalance, and perform CV on C (inverse regularization).


from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import StratifiedKFold, GridSearchCV


logreg = LogisticRegression(max_iter=1000, solver="liblinear", class_weight="balanced")


pipe = Pipeline([

    ("prep", preprocess),

    ("clf", logreg)

])


param_grid = {

    "clf__penalty": ["l1", "l2"],

    "clf__C": [0.01, 0.03, 0.1, 0.3, 1, 3, 10]

}


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1, refit=True)

grid.fit(X_train, y_train)


print("Best params:", grid.best_params_)

print("Best CV ROC-AUC:", grid.best_score_)

best_model = grid.best_estimator_


8) Evaluation: ROC, PR, confusion matrix, and calibration


from sklearn.metrics import (roc_auc_score, roc_curve, precision_recall_curve,

                             average_precision_score, confusion_matrix,

                             classification_report)


proba_valid = best_model.predict_proba(X_valid)[:,1]

pred_valid_default = (proba_valid >= 0.5).astype(int)


print("Valid ROC-AUC:", roc_auc_score(y_valid, proba_valid))

print("Valid PR-AUC:", average_precision_score(y_valid, proba_valid))

print(confusion_matrix(y_valid, pred_valid_default))

print(classification_report(y_valid, pred_valid_default, digits=3))


Curves


# ROC

fpr, tpr, _ = roc_curve(y_valid, proba_valid)

plt.plot(fpr, tpr, label="LogReg")

plt.plot([0,1],[0,1],"--"); plt.xlabel("FPR"); plt.ylabel("TPR"); plt.title("ROC"); plt.legend(); plt.show()


# Precision-Recall (useful for imbalance)

prec, rec, thr = precision_recall_curve(y_valid, proba_valid)

plt.plot(rec, prec); plt.xlabel("Recall"); plt.ylabel("Precision"); plt.title("PR Curve"); plt.show()


Probability calibration (optional but valuable)


If you need calibrated probabilities (for risk scoring), wrap with CalibratedClassifierCV.


from sklearn.calibration import CalibratedClassifierCV, calibration_curve


calibrated = CalibratedClassifierCV(best_model.named_steps["clf"], method="isotonic", cv=5)

cal_pipe = Pipeline([("prep", preprocess), ("cal", calibrated)])

cal_pipe.fit(X_train, y_train)


proba_valid_cal = cal_pipe.predict_proba(X_valid)[:,1]

roc_auc_score(y_valid, proba_valid_cal)


Plot reliability:


prob_true, prob_pred = calibration_curve(y_valid, proba_valid_cal, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o"); plt.plot([0,1],[0,1],"--")

plt.xlabel("Predicted probability"); plt.ylabel("Observed frequency"); plt.title("Calibration"); plt.show()


9) Threshold tuning (optimize for your clinical goal)


Default 0.5 often isn’t optimal. For screening, you may prefer high recall (catch most true diabetes cases).


def threshold_at_recall(y_true, proba, target_recall=0.85):
    prec, rec, thr = precision_recall_curve(y_true, proba)
    # rec[i] is the recall when classifying proba >= thr[i]; recall decreases as the
    # threshold increases, so pick the largest threshold that still meets the target
    idx = np.where(rec[:-1] >= target_recall)[0]
    if len(idx) == 0:
        return 0.5  # fall back to the default threshold
    return thr[idx[-1]]


best_thr = threshold_at_recall(y_valid, proba_valid, target_recall=0.85)

pred_valid = (proba_valid >= best_thr).astype(int)


print("Chosen threshold:", best_thr)

print(confusion_matrix(y_valid, pred_valid))

print(classification_report(y_valid, pred_valid, digits=3))


If false positives are very costly in your context, flip the optimization to high precision instead.


10) Handling class imbalance (alternative to class_weight)


SMOTE can synthetically up-sample the minority class. Use within a pipeline to avoid leakage.


from imblearn.pipeline import Pipeline as ImbPipeline

from imblearn.over_sampling import SMOTE


imb_pipe = ImbPipeline([

    ("prep", preprocess),

    ("smote", SMOTE(random_state=42)),

    ("clf", LogisticRegression(max_iter=1000, penalty="l2", C=1.0))

])


imb_pipe.fit(X_train, y_train)

proba_valid_sm = imb_pipe.predict_proba(X_valid)[:,1]

print("ROC-AUC with SMOTE:", roc_auc_score(y_valid, proba_valid_sm))


Compare with/without SMOTE; pick what generalizes best to test.


11) Interpretability: coefficients → odds ratios (+ sanity checks)


final_lr = best_model.named_steps["clf"]

coef = pd.Series(final_lr.coef_.ravel(), index=numeric_features)

odds = np.exp(coef).sort_values(ascending=False)

print("Odds ratios (per 1 SD increase):")

print(odds)


Odds ratio > 1 → higher risk as the feature increases (after scaling).


Cross-check clinical plausibility (e.g., higher glucose should increase risk).


Optional: SHAP for local & global explanations (use after training on raw arrays).



import shap

explainer = shap.LinearExplainer(final_lr, best_model.named_steps["prep"].transform(X_train), feature_perturbation="interventional")

shap_values = explainer.shap_values(best_model.named_steps["prep"].transform(X_valid))

shap.summary_plot(shap_values, features=best_model.named_steps["prep"].transform(X_valid), feature_names=numeric_features)


(If you run this in notebooks, SHAP draws beautiful plots.)


12) Final evaluation on the held-out test set


Always lock your decisions (model + threshold) using validation, then report one final unbiased score on the test set.


# Fit on train+valid using chosen hyperparams

X_trval = pd.concat([X_train, X_valid], axis=0)

y_trval = pd.concat([y_train, y_valid], axis=0)

final_model = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1, refit=True)

final_model.fit(X_trval, y_trval)


proba_test = final_model.best_estimator_.predict_proba(X_test)[:,1]

print("TEST ROC-AUC:", roc_auc_score(y_test, proba_test))


# apply the same threshold picked on validation

pred_test = (proba_test >= best_thr).astype(int)

print(confusion_matrix(y_test, pred_test))

print(classification_report(y_test, pred_test, digits=3))


13) Save a single, production-ready Pipeline


import joblib

joblib.dump(final_model.best_estimator_, "models/diabetes_pipeline.joblib")

print("Saved → models/diabetes_pipeline.joblib")


This artifact includes: imputation, scaling, and the classifier — no separate preprocessing needed at inference time.


14) Minimal Flask API for deployment

app/api.py:


from flask import Flask, request, jsonify

import numpy as np, pandas as pd, joblib


app = Flask(__name__)

pipe = joblib.load("models/diabetes_pipeline.joblib")


FEATURES = ["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin",

            "BMI","DiabetesPedigreeFunction","Age"]


THRESHOLD = 0.5 # (or use the tuned best_thr you picked)


@app.route("/predict", methods=["POST"])

def predict():

    payload = request.get_json()

    # One-row DataFrame (not an ndarray) so the pipeline's ColumnTransformer can select columns by name
    x = pd.DataFrame([[payload.get(f) for f in FEATURES]], columns=FEATURES).astype(float)

    proba = pipe.predict_proba(x)[0,1].item()

    pred = int(proba >= THRESHOLD)

    return jsonify({"probability": proba, "prediction": pred})


if __name__ == "__main__":

    app.run(host="0.0.0.0", port=8000)


Run locally:


export FLASK_APP=app/api.py

flask run -p 8000


Sample request:


curl -X POST http://localhost:8000/predict \

  -H "Content-Type: application/json" \

  -d '{"Pregnancies":2,"Glucose":148,"BloodPressure":72,"SkinThickness":35,"Insulin":0,

       "BMI":33.6,"DiabetesPedigreeFunction":0.627,"Age":50}'


15) Simple Streamlit app (nice for demos)


app/streamlit_app.py:


import streamlit as st, numpy as np, pandas as pd, joblib


st.set_page_config(page_title="Diabetes Risk Predictor", page_icon="🩺", layout="centered")

pipe = joblib.load("models/diabetes_pipeline.joblib")


st.title("🩺 Diabetes Risk Prediction (Logistic Regression)")

st.caption("Educational demo. Not a medical device.")


cols = st.columns(2)

Pregnancies = cols[0].number_input("Pregnancies", 0, 20, 1)

Glucose = cols[1].number_input("Glucose", 0, 250, 120)

BloodPressure = cols[0].number_input("BloodPressure", 0, 140, 70)

SkinThickness = cols[1].number_input("SkinThickness", 0, 100, 20)

Insulin = cols[0].number_input("Insulin", 0, 900, 80)

BMI = cols[1].number_input("BMI", 0.0, 70.0, 28.5)

DiabetesPedigreeFunction = cols[0].number_input("DiabetesPedigreeFunction", 0.0, 3.0, 0.5, step=0.01)

Age = cols[1].number_input("Age", 10, 100, 35)


if st.button("Predict"):

    # Pass a one-row DataFrame (not an ndarray) so the pipeline's ColumnTransformer
    # can select the columns by name
    x = pd.DataFrame([[Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin,
                       BMI, DiabetesPedigreeFunction, Age]],
                     columns=["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                              "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"])

    proba = pipe.predict_proba(x)[0, 1]

    st.metric("Estimated probability of diabetes", f"{proba:.2%}")

    st.progress(min(1.0, proba))


Run:


streamlit run app/streamlit_app.py



16) Validation & governance checklist (production-minded)

Data quality gates: refuse rows with missing critical inputs (or impute consistently).

Monitoring: track drift in feature distributions and output probabilities.

Bias/fairness: if you add sensitive attributes (age groups, sex), evaluate subgroup performance.

Security & privacy: no PII; log only hashed request IDs; encrypt model artifacts at rest.

Clinical caution: include a disclaimer; do not present as diagnostic device.


17) Common troubleshooting

ROC-AUC looks okay but precision is low → adjust threshold for higher precision; consider PR-AUC optimization.

Model unstable across CV folds → more data, stronger regularization (smaller C), or simpler preprocessing.

SMOTE helps on validation but hurts test → avoid; prefer class_weight="balanced" or better features.


Weird coefficients → confirm scaling; check multicollinearity; try L1 to zero-out noisy features.


18) Summary


You built a clean, interpretable, regularized logistic-regression classifier for diabetes risk:

robust preprocessing (impute + scale),

principled model selection with CV,

clinically relevant evaluation (ROC/PR, threshold tuning),

interpretation via odds ratios / SHAP,

and deployable artifacts (single Pipeline + API/UI).

This exact scaffolding scales to many healthcare classification problems (CKD risk, readmission prediction, etc.).






