Part 3: Building Neural Networks with torch.nn

My First Neural Network

I built my first neural network to classify images of defective vs non-defective products. Started with a tutorial's code - worked perfectly on MNIST digits.

Then I tried it on my actual images: 51% accuracy. Random guessing!

The problem? I blindly copied layers without understanding what they do. Once I learned torch.nn properly, I built a custom architecture: 94% accuracy.

Understanding the building blocks transforms you from copy-paster to architect.

Let me show you how torch.nn works.

The nn.Module Foundation

Every PyTorch model inherits from nn.Module. This base class provides:

Parameter management
GPU/CPU movement
Saving/loading
Training/evaluation modes

Basic nn.Module

import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    """Basic neural network structure."""
    
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        
        # Define layers
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        """Forward pass - define computation."""
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        return x

# Create model
model = SimpleModel(input_size=10, hidden_size=20, output_size=2)

# Use model
input_data = torch.randn(5, 10)  # Batch of 5 samples
output = model(input_data)

print(f"Input shape: {input_data.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters())}")

Output:

Input shape: torch.Size([5, 10])
Output shape: torch.Size([5, 2])
Number of parameters: 262

Key points:

__init__: Define layers
forward: Define computation
Call model with input: model(x) automatically calls forward(x)

Common Layers

Linear (Fully Connected)

import torch.nn as nn

# Linear layer: y = xW^T + b
linear = nn.Linear(in_features=10, out_features=5)

input = torch.randn(3, 10)  # Batch of 3
output = linear(input)

print(f"Input: {input.shape}")
print(f"Output: {output.shape}")
print(f"Weight: {linear.weight.shape}")  # (out_features, in_features)
print(f"Bias: {linear.bias.shape}")      # (out_features,)

Convolutional Layers

import torch.nn as nn

# 2D Convolution for images
conv = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=64,    # Number of filters
    kernel_size=3,      # 3x3 kernel
    stride=1,
    padding=1
)

# Input: (batch, channels, height, width)
input = torch.randn(8, 3, 224, 224)  # 8 RGB images, 224x224
output = conv(input)

print(f"Input: {input.shape}")
print(f"Output: {output.shape}")

With stride and padding:

# Reduce spatial dimensions
conv_stride = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
output_stride = conv_stride(input)
print(f"With stride=2: {output_stride.shape}")  # Height and width halved

Pooling Layers

import torch.nn as nn

# Max pooling - take maximum value in window
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

input = torch.randn(8, 64, 224, 224)
output = maxpool(input)

print(f"Input: {input.shape}")
print(f"Output: {output.shape}")  # Height and width halved

# Average pooling
avgpool = nn.AvgPool2d(kernel_size=2, stride=2)
output_avg = avgpool(input)

# Global average pooling (reduce to 1x1)
global_avgpool = nn.AdaptiveAvgPool2d((1, 1))
output_global = global_avgpool(input)
print(f"Global pooling: {output_global.shape}")  # (8, 64, 1, 1)

Activation Functions

Crucial for non-linearity - without them, network is just linear regression.

import torch
import torch.nn as nn

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

# ReLU: max(0, x)
relu = nn.ReLU()
print(f"ReLU: {relu(x)}")

# LeakyReLU: max(0.01*x, x)
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
print(f"LeakyReLU: {leaky_relu(x)}")

# Sigmoid: 1 / (1 + e^-x)
sigmoid = nn.Sigmoid()
print(f"Sigmoid: {sigmoid(x)}")

# Tanh: (e^x - e^-x) / (e^x + e^-x)
tanh = nn.Tanh()
print(f"Tanh: {tanh(x)}")

# GELU (Gaussian Error Linear Unit) - used in transformers
gelu = nn.GELU()
print(f"GELU: {gelu(x)}")

# Softmax - for classification
softmax = nn.Softmax(dim=-1)
logits = torch.tensor([[2.0, 1.0, 0.1]])
probs = softmax(logits)
print(f"Softmax: {probs}")
print(f"Sum: {probs.sum()}")  # Should be 1.0

When to use which:

ReLU: Default choice, fast and effective
LeakyReLU: When training struggles (dying ReLU problem)
GELU: Transformers and modern architectures
Tanh/Sigmoid: Output layers (when range matters)
Softmax: Multi-class classification output

Dropout and Regularization

Prevent overfitting:

import torch.nn as nn

class ModelWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.dropout = nn.Dropout(p=0.5)  # Drop 50% of neurons
        self.fc2 = nn.Linear(50, 10)
    
    def forward(self, x):
        x = self.fc1(x)
        x = nn.ReLU()(x)
        x = self.dropout(x)  # Only active during training
        x = self.fc2(x)
        return x

model = ModelWithDropout()

# Training mode (dropout active)
model.train()
output_train = model(torch.randn(5, 100))

# Evaluation mode (dropout inactive)
model.eval()
output_eval = model(torch.randn(5, 100))

Batch Normalization:

import torch.nn as nn

class ModelWithBatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.bn1 = nn.BatchNorm1d(50)
        self.fc2 = nn.Linear(50, 10)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)  # Normalize across batch
        x = nn.ReLU()(x)
        x = self.fc2(x)
        return x

I use both - BatchNorm for stability, Dropout for regularization.

Building Real Architectures

Image Classifier (CNN)

My production image classifier for defect detection:

import torch
import torch.nn as nn

class DefectClassifier(nn.Module):
    """CNN for binary image classification."""
    
    def __init__(self, num_classes=2):
        super().__init__()
        
        # Feature extraction
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Block 2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Block 3
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        
        # Global average pooling
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        
        # Classifier
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Create model
model = DefectClassifier(num_classes=2)

# Test
input_image = torch.randn(8, 3, 224, 224)  # 8 RGB images
output = model(input_image)

print(f"Input shape: {input_image.shape}")
print(f"Output shape: {output.shape}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

This architecture gets 94% accuracy on my defect detection task.

Text Classifier (RNN/LSTM)

import torch.nn as nn

class TextClassifier(nn.Module):
    """LSTM-based text classifier."""
    
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=0.5,
            bidirectional=True
        )
        
        # Classifier
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # x: (batch, seq_len)
        x = self.embedding(x)  # (batch, seq_len, embedding_dim)
        
        # LSTM
        lstm_out, (hidden, cell) = self.lstm(x)
        
        # Use last hidden state
        # Concatenate forward and backward hidden states
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        
        # Classifier
        hidden = self.dropout(hidden)
        output = self.fc(hidden)
        
        return output

# Create model
model = TextClassifier(vocab_size=10000, num_classes=5)

# Test
input_sequence = torch.randint(0, 10000, (16, 50))  # 16 sequences, length 50
output = model(input_sequence)

print(f"Input shape: {input_sequence.shape}")
print(f"Output shape: {output.shape}")

Sequential vs Custom Forward

Using nn.Sequential

import torch.nn as nn

# Simple sequential model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.ReLU(),
    nn.Linear(10, 1)
)

input = torch.randn(5, 10)
output = model(input)

Good for simple linear flows. I use it for building blocks:

def conv_block(in_channels, out_channels):
    """Reusable conv block."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True)
    )

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = conv_block(3, 64)
        self.block2 = conv_block(64, 128)
    
    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        return x

Custom Forward (More Flexible)

class FlexibleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(3, 64, 5, padding=2)  # Different kernel
        self.fc = nn.Linear(128, 10)
    
    def forward(self, x):
        # Multi-path
        path1 = self.conv1(x)
        path2 = self.conv2(x)
        
        # Concatenate
        combined = torch.cat([path1, path2], dim=1)
        
        # Global pool and classify
        pooled = combined.mean(dim=[2, 3])
        output = self.fc(pooled)
        
        return output

I use custom forward when I need:

Multiple paths (inception-style)
Skip connections (ResNet-style)
Dynamic behavior

Loss Functions

Classification Losses

import torch
import torch.nn as nn

# Binary classification
bce_loss = nn.BCEWithLogitsLoss()  # Binary cross-entropy with sigmoid
logits = torch.randn(32, 1)
targets = torch.randint(0, 2, (32, 1)).float()
loss = bce_loss(logits, targets)

# Multi-class classification
ce_loss = nn.CrossEntropyLoss()  # Includes softmax
logits = torch.randn(32, 10)  # 10 classes
targets = torch.randint(0, 10, (32,))
loss = ce_loss(logits, targets)

print(f"Cross-entropy loss: {loss.item():.4f}")

Regression Losses

# Mean Squared Error
mse_loss = nn.MSELoss()
predictions = torch.randn(32, 1)
targets = torch.randn(32, 1)
loss = mse_loss(predictions, targets)

# Mean Absolute Error
mae_loss = nn.L1Loss()
loss = mae_loss(predictions, targets)

# Huber Loss (robust to outliers)
huber_loss = nn.HuberLoss(delta=1.0)
loss = huber_loss(predictions, targets)

Custom Loss

class FocalLoss(nn.Module):
    """Focal Loss for imbalanced classification."""
    
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    
    def forward(self, inputs, targets):
        bce_loss = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none'
        )
        
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * bce_loss
        
        return focal_loss.mean()

# Usage
loss_fn = FocalLoss(alpha=0.25, gamma=2.0)

I use Focal Loss when dealing with severely imbalanced datasets.

Real Production Model

Here's my actual product recommendation model (simplified):

import torch
import torch.nn as nn

class RecommendationModel(nn.Module):
    """Collaborative filtering model for product recommendations."""
    
    def __init__(self, num_users, num_products, embedding_dim=64):
        super().__init__()
        
        # User and product embeddings
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.product_embedding = nn.Embedding(num_products, embedding_dim)
        
        # User and product biases
        self.user_bias = nn.Embedding(num_users, 1)
        self.product_bias = nn.Embedding(num_products, 1)
        
        # Global bias
        self.global_bias = nn.Parameter(torch.zeros(1))
        
        # MLP for interaction
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim * 2, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1)
        )
        
        # Initialize
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.product_embedding.weight, std=0.01)
    
    def forward(self, user_ids, product_ids):
        # Get embeddings
        user_emb = self.user_embedding(user_ids)
        product_emb = self.product_embedding(product_ids)
        
        # Dot product (matrix factorization component)
        dot_product = (user_emb * product_emb).sum(dim=1, keepdim=True)
        
        # Biases
        user_b = self.user_bias(user_ids)
        product_b = self.product_bias(product_ids)
        
        # MLP (neural component)
        concat = torch.cat([user_emb, product_emb], dim=1)
        mlp_output = self.mlp(concat)
        
        # Combine
        prediction = (
            self.global_bias +
            user_b +
            product_b +
            dot_product +
            mlp_output
        )
        
        return prediction.squeeze()

# Create model
model = RecommendationModel(num_users=10000, num_products=5000)

# Test
user_ids = torch.randint(0, 10000, (32,))
product_ids = torch.randint(0, 5000, (32,))
ratings = model(user_ids, product_ids)

print(f"Predicted ratings shape: {ratings.shape}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

This model achieved 12% improvement over matrix factorization alone.

Model Inspection

Print architecture:

print(model)

Count parameters:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Trainable parameters: {count_parameters(model):,}")

Layer-wise parameters:

for name, param in model.named_parameters():
    print(f"{name}: {param.shape} ({param.numel():,} parameters)")

Summary (requires torchsummary):

from torchsummary import summary

model = model.to('cuda')
summary(model, input_size=(3, 224, 224))

Best Practices

From building dozens of models:

1. Initialize weights properly:

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out')

model.apply(init_weights)

2. Use BatchNorm before activation:

nn.Conv2d(in_channels, out_channels, 3),
nn.BatchNorm2d(out_channels),
nn.ReLU(),

3. Set model to train/eval mode:

model.train()  # Training
model.eval()   # Evaluation/inference

4. Move model to device:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

5. Use inplace=True for ReLU to save memory:

nn.ReLU(inplace=True)

6. Freeze layers when fine-tuning:

for param in model.features.parameters():
    param.requires_grad = False

What's Next?

You now know how to build neural networks with torch.nn. In Part 4, we'll learn how to train these models effectively.

Next: Part 4 - Training and Optimization

Previous: Part 2 - Autograd and Automatic Differentiation

This article is part of the PyTorch 101 series. All examples use Python 3 and are based on real projects.

PreviousPart 2: Autograd and Automatic Differentiation NextPart 4: Training and Optimization

Last updated 2 days ago

hashtagMy First Neural Network

hashtagThe nn.Module Foundation

hashtagBasic nn.Module

hashtagCommon Layers

hashtagLinear (Fully Connected)

hashtagConvolutional Layers

hashtagPooling Layers

hashtagActivation Functions

hashtagDropout and Regularization

hashtagBuilding Real Architectures

hashtagImage Classifier (CNN)

hashtagText Classifier (RNN/LSTM)

hashtagSequential vs Custom Forward

hashtagUsing nn.Sequential

hashtagCustom Forward (More Flexible)

hashtagLoss Functions

hashtagClassification Losses

hashtagRegression Losses

hashtagCustom Loss

hashtagReal Production Model

hashtagModel Inspection

hashtagPrint architecture:

hashtagCount parameters:

hashtagLayer-wise parameters:

hashtagSummary (requires torchsummary):

hashtagBest Practices

hashtagWhat's Next?