Building Your First Image Classifier with PyTorch: Loss Function and Optimizer, Training, Evaluation, and Model Improvement

Building Your First Image Classifier with PyTorch: A Step-by-Step Guide Using the MNIST Dataset - II

Contents:

5. Defining the Loss Function and Optimizer
6. Training the Neural Network
7. Evaluating Model Performance on Test Data
8. Improving the Model — Regularization, Dropout, and Batch Normalization

🧮 Section 5: Defining the Loss Function and Optimizer

Now that we’ve built the neural network model, it’s time to teach it how to learn — and that’s where loss functions and optimizers come into play.

In this section, we’ll discuss how to measure errors, optimize weights, and prepare the model for training on the MNIST dataset.


🔹 5.1 What Is a Loss Function?

A loss function (also called a cost function) quantifies how well or poorly the model is performing.
It measures the difference between the model’s predictions and the true target values.

During training:

  • The model makes predictions.

  • The loss function calculates the error.

  • The optimizer adjusts the model’s weights to minimize that loss.

Mathematically:

[
\text{Loss} = f(y_{\text{true}}, y_{\text{pred}})
]

The smaller the loss, the better the model’s predictions.


🔹 5.2 Choosing a Loss Function for Classification

Since MNIST is a multi-class classification problem (digits 0–9), the standard choice of loss function is:

[
\text{Cross-Entropy Loss}
]

In PyTorch, this is implemented as:

nn.CrossEntropyLoss()

Note that nn.CrossEntropyLoss expects raw logits (it applies log-softmax internally), so the model's final layer should not apply a softmax itself.

Cross-entropy measures the distance between the predicted probability distribution and the true labels.

[
L = -\sum_{i} y_i \log(\hat{y}_i)
]

Where:

  • ( y_i ) = 1 if the true class is i, else 0

  • ( \hat{y}_i ) = predicted probability for class i
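The formula above can be checked directly against PyTorch's built-in loss; here is a minimal sketch with made-up logits and a made-up target:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# One sample, three classes; the logits and target below are made up
logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw model outputs (no softmax)
target = torch.tensor([0])                  # true class is 0

loss = criterion(logits, target)

# Manual cross-entropy: -log of the softmax probability of the true class
probs = torch.softmax(logits, dim=1)
manual = -torch.log(probs[0, target[0]])

print(f"{loss.item():.4f} vs {manual.item():.4f}")  # identical values
```

Both numbers match because nn.CrossEntropyLoss is exactly log-softmax followed by negative log-likelihood.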


🔹 5.3 What Is an Optimizer?

An optimizer updates the weights of the network to reduce the loss.
It uses the gradients computed during backpropagation to make small adjustments in the direction that minimizes error.

Common optimizers include:

  • SGD (Stochastic Gradient Descent)

  • Adam (Adaptive Moment Estimation)

  • RMSprop

For most tasks (including MNIST), Adam performs exceptionally well because it adapts the learning rate for each parameter automatically.


🔹 5.4 Setting Up the Loss and Optimizer in Code

Let’s add these to our PyTorch setup.

import torch.optim as optim

# Define the loss function
criterion = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

Here’s what happens:

  • criterion calculates how far off the model’s predictions are from the true labels.

  • optimizer updates the parameters of model to reduce that error over time.

  • lr (learning rate) controls how big each update step should be — too high and the model might overshoot, too low and training becomes very slow.


🔹 5.5 A Quick Peek Under the Hood: How Optimizers Work

During training, these steps repeat for each batch:

  1. Forward Pass: Model predicts output.

  2. Loss Computation: Compute the loss between prediction and target.

  3. Backward Pass: Calculate gradients with respect to loss.

  4. Weight Update: Optimizer updates weights.

Mathematically, a simple weight update (SGD) looks like:

[
w := w - \eta \frac{\partial L}{\partial w}
]

Where:

  • ( w ) = weight

  • ( \eta ) = learning rate

  • ( \frac{\partial L}{\partial w} ) = gradient of the loss with respect to weight
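A toy, hand-rolled version of that update rule (all numbers here are made up for illustration) shows what the optimizer does for us on every batch:

```python
import torch

# Toy problem: fit w in y = w * x using the rule w := w - lr * dL/dw
w = torch.tensor([1.0], requires_grad=True)
lr = 0.1

x = torch.tensor([2.0])
y_true = torch.tensor([8.0])        # true relationship is y = 4x

y_pred = w * x
loss = ((y_pred - y_true) ** 2).mean()
loss.backward()                     # dL/dw is now stored in w.grad

with torch.no_grad():
    w -= lr * w.grad                # the SGD update rule from above
    w.grad.zero_()                  # reset for the next step

print(w)  # w moved from 1.0 toward the true value 4.0
```

optim.SGD, Adam, and the rest perform this same step for every parameter in the model, which is why optimizer.step() is a single call.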


🔹 5.6 Summary

We defined:

  • Loss Function: nn.CrossEntropyLoss() → measures prediction error

  • Optimizer: optim.Adam(model.parameters(), lr=0.001) → updates model weights

Next Step:
Now that our model knows how to learn, we’ll start the training process — iterating over batches of data, computing losses, and optimizing weights.


⚙️ Section 6: Training the Neural Network

Now that we’ve built the model and defined both the loss function and optimizer, it’s time to bring our neural network to life — through training.
This is where the model learns from data, adjusts its weights, and gradually improves its ability to recognize handwritten digits from the MNIST dataset.


🔹 6.1 What Happens During Training?

Training a neural network involves several iterative steps, typically repeated over epochs (full passes through the dataset).
Let’s break this process down:

🧩 The Training Cycle (Per Epoch)

  1. Forward Pass:
    The model processes a batch of input images and produces predictions.

  2. Compute Loss:
    The loss function measures how far off the predictions are from the true labels.

  3. Backward Pass:
    Using backpropagation, the model computes gradients — the direction and magnitude of changes needed to reduce the loss.

  4. Update Weights:
    The optimizer adjusts model parameters (weights) using the computed gradients.

  5. Repeat:
    Continue for all batches → then for all epochs → until the loss stops decreasing.



🔹 6.2 Setting Up the Training Loop

We’ll now define our training loop in PyTorch, which involves:

  • Iterating through train_loader

  • Zeroing the gradients

  • Performing forward and backward passes

  • Updating weights

  • Tracking loss and accuracy


🧠 Full Training Loop Code

import torch

# Set device (GPU if available)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Training on: {device}")

# Move model to device
model.to(device)

# Training parameters
epochs = 5  # You can increase this for better accuracy

for epoch in range(epochs):
    running_loss = 0.0
    correct = 0
    total = 0
    
    # Set model to training mode
    model.train()
    
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        
        # 1️⃣ Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # 2️⃣ Backward pass
        optimizer.zero_grad()      # Reset gradients
        loss.backward()            # Compute gradients
        optimizer.step()           # Update weights
        
        # 3️⃣ Track statistics
        running_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    
    # Calculate average loss and accuracy for the epoch
    epoch_loss = running_loss / len(train_loader)
    accuracy = 100 * correct / total
    
    print(f"Epoch [{epoch+1}/{epochs}] - Loss: {epoch_loss:.4f}, Accuracy: {accuracy:.2f}%")

🧾 Example Output

Training on: cuda
Epoch [1/5] - Loss: 0.3567, Accuracy: 89.23%
Epoch [2/5] - Loss: 0.1805, Accuracy: 94.21%
Epoch [3/5] - Loss: 0.1312, Accuracy: 96.11%
Epoch [4/5] - Loss: 0.1057, Accuracy: 96.89%
Epoch [5/5] - Loss: 0.0894, Accuracy: 97.35%

As training progresses:

  • Loss decreases (model predictions improve)

  • Accuracy increases (model classifies digits correctly)


🔹 6.3 Visualizing the Loss Curve

Visualizing how the loss changes over time helps you understand whether your model is learning efficiently or overfitting.

import matplotlib.pyplot as plt

# Example: storing losses across epochs
train_losses = []

for epoch in range(epochs):
    model.train()  # ensure dropout/batchnorm are in training mode
    running_loss = 0.0
    
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    train_losses.append(running_loss / len(train_loader))

# Plot
plt.plot(range(1, epochs+1), train_losses, marker='o')
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss Curve")
plt.show()

This curve should decline smoothly as training progresses — a healthy indicator that the model is learning effectively.


🔹 6.4 Understanding Overfitting and Underfitting

  • Underfitting:
    Model is too simple or hasn’t trained enough → both training and validation accuracy are low.

  • Overfitting:
    Model performs well on training data but poorly on unseen data → it memorizes instead of generalizing.

Solution Tips:

  • Increase training data (data augmentation)

  • Add regularization (dropout, weight decay)

  • Use early stopping
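Early stopping is simple enough to sketch in a few lines. This is an illustrative fragment only: the validation-loss values and the patience setting below are made up.

```python
# Stop when validation loss hasn't improved for `patience` consecutive epochs.
# These loss values are fabricated for demonstration.
val_losses = [0.40, 0.31, 0.27, 0.26, 0.27, 0.28, 0.29]

patience = 2
best_loss = float('inf')
epochs_without_improvement = 0
stopped_at = None

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        best_loss = loss                  # new best: reset the counter
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1   # no improvement this epoch
        if epochs_without_improvement >= patience:
            stopped_at = epoch            # give up: the model has stopped improving
            break

print(f"Stopped at epoch {stopped_at}, best validation loss {best_loss}")
```

In a real training loop you would compute the validation loss each epoch and typically also save a checkpoint whenever best_loss improves.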


🔹 6.5 Saving the Trained Model

Once your model achieves good accuracy, save it for reuse without retraining:

torch.save(model.state_dict(), 'mnist_model.pth')
print("Model saved successfully!")

To load it later:

model.load_state_dict(torch.load('mnist_model.pth', map_location=device))
model.eval()  # Set model to evaluation mode

🔹 6.6 Summary

In this section, we covered:

  • How to implement a full training loop in PyTorch

  • How to track loss and accuracy during training

  • How to visualize the loss curve for learning analysis

  • How to save and reload models for future use

Your model can now recognize handwritten digits! 🎉



🧾 Section 7: Evaluating Model Performance on Test Data

After successfully training our neural network, the next step is to evaluate how well it performs on unseen data.
Training accuracy alone isn’t enough — our goal is to ensure the model can generalize to new, unseen images.

In this section, we’ll:

  • Evaluate the model on the test dataset

  • Measure accuracy, precision, recall, and F1-score

  • Visualize a confusion matrix

  • Display sample predictions


🔹 7.1 Why Evaluation Matters

When a model performs well on training data but poorly on test data, it’s overfitting — meaning it memorized patterns rather than learning general ones.

Evaluating on a separate test set helps verify:

  • How well the model generalizes

  • Which digits are misclassified

  • Whether further tuning is needed (architecture, epochs, learning rate, etc.)


🔹 7.2 Switching to Evaluation Mode

Before testing, we must set the model to evaluation mode using:

model.eval()

This disables dropout and makes batch normalization use its running statistics instead of per-batch statistics, ensuring stable, deterministic inference.

We’ll also disable gradient computation using:

with torch.no_grad():

This saves memory and speeds up the evaluation since gradients aren’t needed during inference.
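A quick illustration of what torch.no_grad() actually changes (the tensor here is just for this demo):

```python
import torch

w = torch.ones(3, requires_grad=True)

y = (w * 2).sum()
print(y.requires_grad)       # True: this result is part of a computation graph

with torch.no_grad():
    y2 = (w * 2).sum()
print(y2.requires_grad)      # False: no graph was recorded, saving memory
```

Because no graph is built inside the context, calling backward() on y2 would raise an error, which is exactly what we want during evaluation.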


🔹 7.3 Evaluating Model Accuracy on the Test Set

Let’s compute the overall accuracy on the MNIST test data.

# Evaluation mode
model.eval()

# Initialize counters
correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")

Example Output:

Test Accuracy: 97.85%

That’s a strong performance — indicating the model generalizes well to unseen data.


🔹 7.4 Generating a Confusion Matrix

A confusion matrix gives a detailed breakdown of how the model performs across each class.
It shows which digits the model confuses with others.

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_true = []
y_pred = []

# Collect true and predicted labels
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        
        y_true.extend(labels.cpu().numpy())
        y_pred.extend(predicted.cpu().numpy())

# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Plot confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix for MNIST Classifier")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

This visualization clearly shows which digits are misclassified.
For example, the model may confuse ‘4’ with ‘9’ or ‘3’ with ‘8’ due to their similar shapes.


🔹 7.5 Classification Report (Precision, Recall, F1-Score)

Let’s generate a more detailed report:

from sklearn.metrics import classification_report

print("Classification Report:")
print(classification_report(y_true, y_pred))

Sample Output:

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       980
           1       0.99      0.99      0.99      1135
           2       0.98      0.98      0.98      1032
           3       0.97      0.97      0.97      1010
           4       0.98      0.98      0.98       982
           5       0.97      0.97      0.97       892
           6       0.98      0.98      0.98       958
           7       0.98      0.98      0.98      1028
           8       0.97      0.97      0.97       974
           9       0.97      0.97      0.97      1009

    accuracy                           0.98     10000
   macro avg       0.98      0.98      0.98     10000
weighted avg       0.98      0.98      0.98     10000

Interpretation:

  • Precision: Of all samples predicted as a given class, the fraction that truly belong to it

  • Recall: Of all samples that truly belong to a class, the fraction the model identified

  • F1-score: Harmonic mean of precision and recall, balancing the two into one measure
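These three metrics are easy to compute by hand. Here is a worked example for a single class, using hypothetical counts:

```python
# Made-up counts for one class: true positives, false positives, false negatives
tp, fp, fn = 90, 10, 5

precision = tp / (tp + fp)   # of everything predicted as this class, how much was right
recall = tp / (tp + fn)      # of everything truly in this class, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.9 0.947 0.923
```

classification_report does exactly this per class, then averages across classes to produce the macro and weighted rows.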


🔹 7.6 Visualizing Predictions on Sample Images

Let’s visualize some model predictions on the test set to understand its behavior better.

import numpy as np

# Get a batch of test images
dataiter = iter(test_loader)
images, labels = next(dataiter)

# Predict
model.eval()
with torch.no_grad():
    images = images.to(device)
    outputs = model(images)
    _, preds = torch.max(outputs, 1)

# Plot first 8 test images with predictions
fig, axes = plt.subplots(1, 8, figsize=(15, 2))
for i in range(8):
    ax = axes[i]
    ax.imshow(images[i].cpu().squeeze(), cmap='gray')
    ax.set_title(f"Pred: {preds[i].item()}\nTrue: {labels[i].item()}")
    ax.axis('off')

plt.show()

This visualization helps you quickly see where the model succeeds — and where it might misclassify certain digits.


🔹 7.7 Summary

In this section, we:

  • Evaluated the model on test data

  • Computed overall accuracy

  • Visualized a confusion matrix

  • Generated a classification report

  • Displayed sample predictions

Outcome:
Our model achieves over 97% test accuracy, with only a few confusions between similar-looking digits.
This indicates a strong, generalizable classifier built from scratch using PyTorch.



Section 8: Improving the Model — Regularization, Dropout, and Batch Normalization

Now that our MNIST classifier achieves high accuracy (~97–98%), the next step is to make it more robust and generalizable.
Even though our model performs well on the test data, it could still overfit — meaning it memorizes training patterns rather than learning true underlying features.

In this section, we’ll enhance our model using three key regularization techniques widely used in deep learning:

  1. Regularization (L2 weight decay)

  2. Dropout

  3. Batch Normalization

These techniques help models generalize better and avoid overfitting, especially when scaling to more complex datasets beyond MNIST.


🔹 8.1 What Is Overfitting?

Overfitting occurs when the model performs exceptionally well on the training data but poorly on unseen data.

Symptoms:

  • High training accuracy but low test accuracy

  • Model memorizes patterns rather than learning general ones

Example Analogy:
Imagine studying for an exam by memorizing the exact questions — you might do great on a practice test, but fail the real one.

Goal:
Encourage the model to learn general features, not memorize specific patterns.


🔹 8.2 Regularization via Weight Decay (L2 Regularization)

Regularization adds a penalty for large weight values to the loss function, preventing the model from relying too heavily on any single feature.

Mathematically, the new loss function becomes:

[
L' = L + \lambda \sum_{i} w_i^2
]

Where:

  • ( L ) = original loss

  • ( \lambda ) = regularization strength (hyperparameter)

  • ( w_i ) = model weights

In PyTorch, you can add L2 regularization directly through the optimizer:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

Here, weight_decay acts as the ( \lambda ) term.
It encourages smaller weight magnitudes → smoother decision boundaries → less overfitting.


🔹 8.3 Dropout: Randomly Turning Off Neurons

Dropout is one of the most effective and widely used regularization techniques.
During training, dropout randomly disables a fraction of neurons in each layer.

Mathematically, for each neuron output ( y_i ):

[
y_i' =
\begin{cases}
0 & \text{with probability } p \
\frac{y_i}{1-p} & \text{otherwise}
\end{cases}
]

Where ( p ) is the dropout probability (commonly 0.2–0.5).

This forces the network to not depend on specific neurons, encouraging redundancy and improving generalization.
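You can see both cases of that formula, including the 1/(1-p) rescaling of surviving neurons, in a tiny sketch (the input tensor and seed are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()                 # training mode: neurons are zeroed at random
train_out = drop(x)
print(train_out)             # every value is either 0.0 or 1/(1-p) = 2.0

drop.eval()                  # evaluation mode: dropout does nothing
eval_out = drop(x)
print(eval_out)              # all ones, unchanged
```

The rescaling keeps the expected activation the same in training and evaluation, so no extra correction is needed at inference time.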


🧠 Updated Model with Dropout

Let’s modify our previous model to include dropout layers.

import torch.nn as nn
import torch.nn.functional as F

class MNISTModel_Improved(nn.Module):
    def __init__(self):
        super(MNISTModel_Improved, self).__init__()
        
        self.fc1 = nn.Linear(28 * 28, 256)
        self.dropout1 = nn.Dropout(0.3)
        self.fc2 = nn.Linear(256, 128)
        self.dropout2 = nn.Dropout(0.3)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

🔹 8.4 Batch Normalization

Batch Normalization (BatchNorm) standardizes the activations of a layer for each mini-batch, stabilizing learning and improving convergence.

[
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
]

Where:

  • ( \mu_B ), ( \sigma_B^2 ) = mean and variance of the batch

  • ( \epsilon ) = small constant for numerical stability

Benefits:

  • Faster training

  • Higher learning rates possible

  • Reduced dependence on initialization

  • Acts as a mild regularizer
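The normalization formula above can be verified against nn.BatchNorm1d directly. A small sketch (batch shape and seed are arbitrary; at initialization the learnable scale gamma is 1 and shift beta is 0, so the built-in layer reduces to the bare formula):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 4)              # a mini-batch: 32 samples, 4 features

bn = nn.BatchNorm1d(4)
bn.train()                          # training mode: normalize with batch statistics
out = bn(x)

# Manual version of the formula above
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)  # batch variance (biased, as BatchNorm uses)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-5))  # True
```

In eval() mode the layer switches to the running mean and variance accumulated during training, which is why model.eval() matters before testing.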



🧠 Model with Dropout + Batch Normalization

Here’s how to integrate both dropout and batch normalization:

class MNISTModel_Advanced(nn.Module):
    def __init__(self):
        super(MNISTModel_Advanced, self).__init__()
        
        self.fc1 = nn.Linear(28 * 28, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.dropout1 = nn.Dropout(0.3)
        
        self.fc2 = nn.Linear(256, 128)
        self.bn2 = nn.BatchNorm1d(128)
        self.dropout2 = nn.Dropout(0.3)
        
        self.fc3 = nn.Linear(128, 64)
        self.bn3 = nn.BatchNorm1d(64)
        self.fc4 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout2(x)
        x = F.relu(self.bn3(self.fc3(x)))
        x = self.fc4(x)
        return x

🔹 8.5 Comparing Models: Before vs After Regularization

Feature         | Baseline Model      | Improved Model
----------------|---------------------|---------------------------
Layers          | 3 (128 → 64 → 10)   | 4 (256 → 128 → 64 → 10)
Regularization  | None                | Dropout (0.3), BatchNorm
Weight Decay    | 0                   | 1e-5
Accuracy        | ~97%                | ~98.3%
Overfitting     | Moderate            | Significantly reduced

This demonstrates how simple regularization methods can make your model more robust and generalizable — a crucial aspect when working with larger, more complex datasets.


🔹 8.6 Practical Tips for Regularization

  1. Start simple — Add dropout only where needed.

  2. Avoid overdoing dropout — Too much can underfit the model.

  3. Use BatchNorm early — It helps stabilize deeper models.

  4. Experiment with weight_decay — Common values: 1e-4 to 1e-6.

  5. Monitor validation accuracy — Use early stopping if loss stops improving.


🔹 8.7 Summary

What We Learned:

  • Regularization reduces overfitting by penalizing complex models.

  • Dropout randomly disables neurons during training to improve generalization.

  • Batch Normalization stabilizes and accelerates training.

  • Combining all three techniques leads to faster convergence and better generalization.



