Day 4 – Linear Regression and Classification (Hands-on in Python)
1. Introduction to Linear Regression
When it comes to predicting continuous values based on existing data, Linear Regression is one of the simplest and most widely used algorithms in the field of machine learning and statistics.
In plain English, linear regression tries to find a straight line (or plane, in the case of multiple variables) that best describes the relationship between input variables (independent variables) and the output variable (dependent variable).
Imagine you own a shop and want to predict your future sales based on advertising spending — this is exactly where linear regression can help.
2. What is Linear Regression?
Mathematically, the formula for simple linear regression is:

y = mx + c

Where:
- y = Dependent variable (the value we want to predict)
- x = Independent variable (input)
- m = Slope (how much y changes for every unit change in x)
- c = Intercept (value of y when x is 0)
The goal of linear regression is to minimize the error between predicted values and actual values, often using the least squares method.
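As a quick illustration (a minimal sketch with made-up numbers, not a real dataset), the least-squares slope and intercept can be computed directly from their closed-form definitions with NumPy:

import numpy as np

# Hypothetical data: advertising spend (x) vs. sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Closed-form least squares:
# m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2), c = y_mean - m * x_mean
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean

print(f"Slope (m): {m:.3f}, Intercept (c): {c:.3f}")
# np.polyfit(x, y, deg=1) should return the same pair of coefficients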
3. Real-World Applications of Linear Regression
Linear regression is not just an academic concept — it’s used everywhere in business, science, and technology. Here are practical examples:
A. Business and Finance
- Predicting Sales Revenue: Companies predict sales based on advertising budget, seasonal trends, and customer footfall.
- Stock Price Prediction (short-term trend analysis): Using historical prices to identify short-term price movement trends.
B. Real Estate
- House Price Prediction: Estimating the selling price of a house based on square footage, location, number of bedrooms, etc.
C. Healthcare
- Medical Cost Prediction: Insurance companies predict healthcare costs based on patient age, lifestyle, and medical history.
D. Agriculture
- Crop Yield Prediction: Estimating crop output based on rainfall, temperature, and fertilizer usage.
E. Sports Analytics
- Performance Prediction: Predicting a cricket player’s future score based on past performance and match conditions.
F. Marketing
- Customer Lifetime Value (CLV): Estimating how much a customer will spend in their lifetime based on purchase history.
4. Simple vs. Multiple Linear Regression
A. Simple Linear Regression
This involves one independent variable and one dependent variable.
Example: Predicting a person’s weight based on height.
Scenario:
- x = Height (cm)
- y = Weight (kg)
- Goal: Find the line that best predicts weight from height.
B. Multiple Linear Regression
This involves two or more independent variables to predict the dependent variable.
Example: Predicting a person’s salary based on years of experience, education level, and location.
Formula:

y = b0 + b1x1 + b2x2 + ... + bnxn

Where:
- b0 = Intercept
- b1, b2, ..., bn = Coefficients for each independent variable
- x1, x2, ..., xn = Independent variables
5. Assumptions of Linear Regression
For linear regression to work well, certain assumptions should be met:
- Linearity: The relationship between the dependent and independent variable(s) is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: Equal variance of errors.
- Normality: Residuals should be normally distributed.
- No Multicollinearity: In multiple regression, independent variables should not be highly correlated.
Key Concepts in Linear Regression
Before we jump into more coding and applications, it’s important to understand the core concepts that form the foundation of linear regression.
1. Dependent vs. Independent Variables
- Dependent Variable (Target)
  This is the variable we want to predict or explain.
  In regression, the dependent variable is continuous (e.g., price, salary, height).
  Example: Predicting house price based on other features — here, Price is the dependent variable.
- Independent Variable(s) (Features)
  These are the variables that influence or predict the dependent variable.
  There can be one (in simple linear regression) or multiple (in multiple linear regression).
  Example: Square Footage, Number of Bedrooms, and Age of the House are independent variables for predicting Price.

💡 Real-life analogy:
Think of baking a cake:
- Independent variables = amount of flour, sugar, butter.
- Dependent variable = size or taste score of the cake.
2. Line of Best Fit & Regression Equation
The line of best fit is the straight line that best represents the relationship between the independent and dependent variables in a scatter plot.
Equation:

y = mx + c

Where:
- y = Predicted value
- x = Independent variable
- m = Slope (rate of change in y per unit change in x)
- c = Intercept (value of y when x = 0)

Example:
If y = 5x + 20:
- m = 5 means for every 1 unit increase in x, y increases by 5.
- c = 20 means when x = 0, y is 20.

📊 Why “best fit”?
The algorithm chooses the line that minimizes the total error (difference between predicted and actual values), often using the least squares method.
3. Assumptions of Linear Regression
For linear regression to produce reliable results, certain statistical assumptions must hold true:
- Linearity
  The relationship between the independent and dependent variable(s) should be linear.
  Example: Hours studied vs. score is often linear, but age vs. income may not be.
- Independence
  The observations should be independent of each other.
  Example: Data collected from different students, not repeated measures from the same student without adjustment.
- Homoscedasticity
  The variance of residuals (errors) should be constant across all values of x.
  If residuals spread wider at higher values, that’s heteroscedasticity — it makes the model’s standard errors and confidence intervals unreliable.
- Normality of Residuals
  The residuals (difference between actual and predicted values) should follow a normal distribution.
  This is important for hypothesis testing and confidence intervals.
📌 Tip for practitioners:
In real-world projects, these assumptions are often violated — so it’s always good to check them using statistical tests and visualizations before relying on results.
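For example, here is a minimal sketch of two quick visual checks — a residuals-vs-fitted plot (linearity and homoscedasticity) and a residual histogram (normality). The data here is synthetic; substitute your own X and y:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration; replace with your own X (2-D) and y (1-D)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 5 + rng.normal(0, 2, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted: look for a random cloud around 0;
# a funnel shape suggests heteroscedasticity.
ax1.scatter(model.predict(X), residuals)
ax1.axhline(0, color='red')
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')

# Histogram of residuals: should look roughly bell-shaped (normality).
ax2.hist(residuals, bins=15)
ax2.set_xlabel('Residual')
plt.show()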
Evaluation Metrics for Regression
How do we know if our linear regression model is good? That’s where evaluation metrics come in.
1. Mean Absolute Error (MAE)
Measures the average magnitude of errors between predicted and actual values.
Formula:

MAE = (1/n) × Σ |actual_i − predicted_i|

- Pros: Easy to understand, less sensitive to outliers.
- Cons: Doesn’t penalize large errors as much as squared metrics.

Example:
If predicted sales = [100, 150, 200] and actual sales = [110, 140, 210], the absolute errors are 10, 10, and 10, so MAE = (10 + 10 + 10) / 3 = 10.
2. Mean Squared Error (MSE)
Squares the errors before averaging — penalizes larger errors more heavily.
Formula:

MSE = (1/n) × Σ (actual_i − predicted_i)²

- Pros: Penalizes big errors, useful when large deviations are undesirable.
- Cons: Squared unit (e.g., “dollars squared”) makes it less interpretable.
3. Root Mean Squared Error (RMSE)
Square root of MSE — brings error back to the original unit of measurement.
Formula:

RMSE = √MSE

- Pros: Same unit as target variable, good for interpretability.
- Cons: Like MSE, sensitive to outliers.
4. R² Score (Coefficient of Determination)
Represents the proportion of variance in the dependent variable that can be explained by the model.
Formula:

R² = 1 − (SS_res / SS_tot)

Where:
- SS_res = Sum of squared residuals
- SS_tot = Total sum of squares
- Value typically ranges from 0 to 1:
  - R² = 1 → Perfect fit
  - R² = 0 → Model does no better than the mean
- Example: R² = 0.85 means 85% of the variation in y is explained by the model.
💡 Practical Tip:
In real projects, use multiple metrics together — relying on just one (like R²) can be misleading.
The hands-on sections below put these ideas into practice: calculating MAE, MSE, RMSE, and R² in code, and plotting the line of best fit against real data points.
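First, a small self-contained sketch (reusing the hypothetical sales numbers from the MAE example) that computes all four metrics by hand with NumPy and cross-checks them against scikit-learn:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([110, 140, 210])
predicted = np.array([100, 150, 200])

errors = actual - predicted
mae = np.mean(np.abs(errors))      # (10 + 10 + 10) / 3 = 10.0
mse = np.mean(errors ** 2)         # 100.0
rmse = np.sqrt(mse)                # 10.0
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Each pair should match
print(mae, mean_absolute_error(actual, predicted))
print(mse, mean_squared_error(actual, predicted))
print(rmse, np.sqrt(mean_squared_error(actual, predicted)))
print(r2, r2_score(actual, predicted))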
6. Hands-On in Python – Simple Linear Regression
We’ll predict student scores based on study hours using Python.
Step 1 – Install Required Libraries
pip install pandas numpy matplotlib scikit-learn
Step 2 – Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Step 3 – Create Dataset
# Sample data
data = {
'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Scores': [35, 40, 50, 55, 60, 65, 70, 78, 85, 90]
}
df = pd.DataFrame(data)
print(df)
Step 4 – Visualize Data
plt.scatter(df['Hours_Studied'], df['Scores'], color='blue')
plt.xlabel('Hours Studied')
plt.ylabel('Score')
plt.title('Hours Studied vs Score')
plt.show()
Step 5 – Train the Model
X = df[['Hours_Studied']]
y = df['Scores']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
Step 6 – Model Prediction
y_pred = model.predict(X_test)
Step 7 – Evaluate Model
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))
print("Slope (m):", model.coef_[0])
print("Intercept (c):", model.intercept_)
Step 8 – Plot Regression Line
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel('Hours Studied')
plt.ylabel('Score')
plt.title('Linear Regression Line')
plt.show()
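Once trained, the model can also score unseen inputs. For example, for a hypothetical new student (7.5 hours is not in the dataset):

# Predict the score for a student who studied 7.5 hours
new_hours = pd.DataFrame({'Hours_Studied': [7.5]})
print("Predicted score for 7.5 hours:", model.predict(new_hours)[0])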
7. Hands-On in Python – Multiple Linear Regression
Let’s predict house prices based on square footage, number of bedrooms, and age of house.
Dataset Example
data = {
'Square_Feet': [1500, 1800, 2400, 3000, 3500],
'Bedrooms': [3, 4, 3, 5, 4],
'Age': [10, 15, 20, 8, 12],
'Price': [400000, 500000, 600000, 650000, 700000]
}
df = pd.DataFrame(data)
Training the Model
X = df[['Square_Feet', 'Bedrooms', 'Age']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("R² Score:", r2_score(y_test, y_pred))
8. Advantages of Linear Regression
- Easy to understand and implement.
- Works well when variables have a linear relationship.
- Fast and computationally inexpensive.
9. Limitations of Linear Regression
- Assumes linearity, which may not hold in real-world data.
- Sensitive to outliers.
- Not great for complex relationships (non-linear data).
10. Conclusion
Linear Regression is a fundamental building block in machine learning.
By understanding its concepts, assumptions, and limitations, and by practicing with Python, you can easily apply it to business, science, and everyday problems.
✅ For bloggers – You can make it more engaging by adding:
- Code snippets for each step.
- Visual explanations of regression lines & decision boundaries.
- Real datasets from Kaggle to make examples relatable.
- Downloadable Jupyter Notebook for readers.
- Comparison tables for algorithms and metrics.
Day 5 – Classification (Hands-on in Python)
1. Introduction to Classification
In the world of machine learning, classification is one of the most important and widely used problem types.
While regression predicts continuous numerical values, classification predicts categories. In simple words — given some input data, the model tries to decide which “bucket” or “class” the data belongs to.
What is Classification?
Classification is a supervised learning technique where the model learns from labeled data (data with known outcomes) and then predicts the class of new, unseen data.
The main goal is to assign labels to observations based on patterns in the data.
Example:
- Email filtering: Predict if an email is “Spam” or “Not Spam.”
- Medical diagnosis: Predict if a tumor is “Benign” or “Malignant.”
- Sentiment analysis: Predict if a review is “Positive,” “Neutral,” or “Negative.”
Real-World Applications of Classification
- Spam Detection: Gmail automatically marks spam emails by learning patterns in spam content.
- Medical Diagnosis: Predicting diseases based on patient symptoms, reports, and scans.
- Credit Scoring: Banks use classification to approve or reject loans based on past repayment history.
- Image Recognition: Classifying photos as “cat,” “dog,” or “car.”
- Sentiment Analysis: Businesses analyze customer reviews to classify sentiment as positive, negative, or neutral.
- Fraud Detection: Classifying transactions as fraudulent or genuine.
Difference Between Classification and Regression
| Feature | Classification | Regression |
|---|---|---|
| Output type | Categories (discrete values) | Continuous values |
| Examples | Spam/Not Spam, Yes/No, Class A/B/C | Predicting price, temperature, salary |
| Evaluation metrics | Accuracy, Precision, Recall, F1-score | MAE, MSE, RMSE, R² score |
| Algorithms | Logistic Regression, Decision Trees, KNN | Linear Regression, Polynomial Regression |
💡 Quick rule:
If your target variable has labels, it’s classification. If it’s numbers, it’s regression.
2. Key Concepts in Classification
1. Binary vs. Multiclass Classification
- Binary Classification: Two possible classes.
  Example: Spam (1) vs. Not Spam (0).
- Multiclass Classification: More than two classes.
  Example: Classifying animals into cat, dog, rabbit, horse.
2. Decision Boundary
A decision boundary is the dividing line (or curve) that separates different classes in a dataset.
- In binary classification, it’s the line that splits the space into two regions — one for each class.
- In multiclass classification, multiple boundaries separate the different classes.
📊 Visualization Tip:
In Python, decision boundaries can be plotted using matplotlib and numpy after fitting a model like Logistic Regression or KNN (see Step 6 of the hands-on example below).
3. Confusion Matrix Terms (TP, FP, TN, FN)
A confusion matrix is a table that summarizes how well a classification model performs.
| Term | Meaning |
|---|---|
| TP | True Positive — correctly predicted positive cases |
| FP | False Positive — predicted positive but actually negative |
| TN | True Negative — correctly predicted negative cases |
| FN | False Negative — predicted negative but actually positive |
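As a small illustration (with hypothetical binary labels), scikit-learn’s confusion_matrix returns these four counts directly:

from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=3, FP=1, TN=3, FN=1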
4. Evaluation Metrics for Classification
Accuracy
Percentage of correct predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Out of all predicted positives, how many were actually positive?
Precision = TP / (TP + FP)
Recall (Sensitivity)
Out of all actual positives, how many did we predict correctly?
Recall = TP / (TP + FN)
F1-Score
The harmonic mean of Precision and Recall — it balances the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC Score
-
ROC curve plots the True Positive Rate (Recall) against the False Positive Rate.
-
AUC measures the area under the ROC curve — closer to 1 means better performance.
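Here is a minimal sketch (reusing the hypothetical labels from the confusion matrix example above, plus made-up predicted probabilities) that computes all of these with scikit-learn:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Hypothetical predicted probabilities for the positive class (for ROC-AUC)
y_proba = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4
print("Recall:   ", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4
print("F1:       ", f1_score(y_true, y_pred))         # 0.75
print("ROC-AUC:  ", roc_auc_score(y_true, y_proba))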
Hands-On in Python: Classification Example
We’ll use the Iris dataset to classify flowers into species.
Step 1 – Install and Import Libraries
pip install pandas numpy matplotlib scikit-learn seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
Step 2 – Load Dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
df.head()
Step 3 – Split Data
X = df.iloc[:, :-1]
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4 – Train Model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
Step 5 – Predictions & Evaluation
y_pred = model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Step 6 – Decision Boundary (Binary Example)
If you want to visualize decision boundaries, use a binary subset of the dataset and plot with matplotlib.
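Here is one way to sketch this, continuing from the iris object loaded in Step 2. Note the assumptions: we keep only classes 0 and 1 and just the first two features so the boundary can be drawn in 2-D.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Binary subset: classes 0 and 1, first two features (sepal length & width)
mask = iris.target < 2
X2 = iris.data[mask][:, :2]
y2 = iris.target[mask]

clf = LogisticRegression(max_iter=1000).fit(X2, y2)

# Score the classifier on a grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5, 200),
    np.linspace(X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5, 200),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shaded regions show the predicted class on each side of the boundary
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X2[:, 0], X2[:, 1], c=y2, cmap='coolwarm', edgecolor='k')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Decision Boundary: class 0 vs. class 1')
plt.show()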
Step 7 – ROC-AUC (One-vs-Rest Example)
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

# Iris has three classes, so we score one class against the rest:
# column 0 of the binarized labels means "is this sample class 0?"
y_binary = label_binarize(y_test, classes=[0, 1, 2])[:, 0]
y_scores = model.decision_function(X_test)[:, 0]
roc_auc = roc_auc_score(y_binary, y_scores)
print("ROC-AUC Score:", roc_auc)
For Bloggers – Make It More Engaging
- Code Snippets for Each Step: Include the dataset loading, preprocessing, training, prediction, and evaluation code clearly.
- Visual Explanations: Use diagrams to explain decision boundaries and confusion matrices.
- Real Datasets from Kaggle: Example: Titanic dataset (predict survival) or Credit Card Fraud dataset.
- Downloadable Jupyter Notebook: Give readers a .ipynb file so they can run everything.
- Comparison Tables: Create a table comparing Logistic Regression, Decision Trees, and Random Forest for the same dataset.
Two complete mini-projects:
- Spam Detection using Naive Bayes.
- Titanic Survival Prediction with Logistic Regression.