Part 1: Introduction to PyTorch and Tensors

Part of the PyTorch 101 Series

Why I Switched to PyTorch

I was building an image recognition system using NumPy. The gradient calculation code alone was 200+ lines. Every time I changed the model architecture, I had to rewrite the backward pass.

Bugs were everywhere. A minus sign in the wrong place could ruin days of training.

Then I rewrote it in PyTorch: 15 lines. Gradients calculated automatically. No bugs in backpropagation.

That's when I understood: PyTorch isn't just a library, it's freedom from gradient hell.

Let me show you what makes PyTorch special.

What is PyTorch?

PyTorch is a tensor computation library with automatic differentiation, optimized for deep learning.

Key features:

  • Tensors: N-dimensional arrays on CPU or GPU

  • Autograd: Automatic differentiation (calculates gradients for you)

  • Neural Networks: Pre-built layers and models (torch.nn)

  • Dynamic graphs: Change architecture during runtime

  • Production-ready: TorchScript, ONNX, mobile deployment

Think of it as: NumPy with GPU acceleration + automatic gradients + neural network building blocks.

Installation
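
For most setups, installation is a single pip command. The commands below are a sketch - the exact one depends on your OS and CUDA version, and the selector on pytorch.org generates the right command for your machine:

    # CPU-only build
    pip install torch torchvision

    # Example CUDA build (check pytorch.org for the command matching your CUDA version)
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121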

Check installation:
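
A quick sanity check looks like this (the version string and device name depend on your install):

    import torch

    print(torch.__version__)            # e.g. 2.x.x
    print(torch.cuda.is_available())    # True if a CUDA GPU is visible
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))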


Tensors: The Foundation

Tensors are multi-dimensional arrays - the foundation of PyTorch.

Creating Tensors
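
Here's a minimal sketch of the constructors I use most often (the values are illustrative, and random tensors will differ on every run):

    import torch

    a = torch.tensor([1, 2, 3])      # from a Python list
    b = torch.zeros(2, 3)            # all zeros
    c = torch.ones(2, 3)             # all ones
    d = torch.rand(2, 2)             # uniform random in [0, 1)
    e = torch.arange(0, 10, 2)       # 0, 2, 4, 6, 8

    print(a)   # tensor([1, 2, 3])
    print(b)   # tensor([[0., 0., 0.],
               #         [0., 0., 0.]])
    print(d)   # tensor([[0.4387, 0.7920],   <- different every run
               #         [0.1568, 0.9034]])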


Tensor Attributes
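
Every tensor knows its shape, data type, and device. A small sketch:

    import torch

    x = torch.rand(2, 3)

    print(x.shape)     # torch.Size([2, 3])
    print(x.dtype)     # torch.float32
    print(x.device)    # cpu
    print(x.ndim)      # 2
    print(x.numel())   # 6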


Data Types

I use float32 for most tasks - a good balance of precision and performance.
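
A short sketch of the defaults and how to convert between dtypes:

    import torch

    x = torch.tensor([1.0, 2.0, 3.0])    # floats default to float32
    i = torch.tensor([1, 2, 3])          # ints default to int64

    half = x.to(torch.float16)           # half the memory, lower precision
    dbl = x.to(torch.float64)            # double precision, rarely needed

    print(x.dtype, i.dtype, half.dtype)  # torch.float32 torch.int64 torch.float16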

Tensor Operations

Basic Math
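
An illustrative example with made-up inputs - all of these operate element-wise:

    import torch

    a = torch.tensor([1.0, 2.0, 3.0])
    b = torch.tensor([4.0, 5.0, 6.0])

    print(a + b)    # tensor([5., 7., 9.])
    print(a - b)    # tensor([-3., -3., -3.])
    print(a * b)    # tensor([ 4., 10., 18.])   <- element-wise, not matrix multiply
    print(b / a)    # tensor([4.0000, 2.5000, 2.0000])
    print(a ** 2)   # tensor([1., 4., 9.])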


Matrix Operations
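
A small sketch contrasting matrix multiplication with element-wise multiplication:

    import torch

    A = torch.tensor([[1., 2.], [3., 4.]])
    B = torch.tensor([[5., 6.], [7., 8.]])

    print(A @ B)               # matrix product: [[19., 22.], [43., 50.]]
    print(torch.matmul(A, B))  # same thing, function form
    print(A * B)               # element-wise:   [[ 5., 12.], [21., 32.]]
    print(A.T)                 # transpose:      [[1., 3.], [2., 4.]]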

Reduction Operations
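
The reductions I reach for most, including the dim argument for reducing along one axis:

    import torch

    x = torch.tensor([[1., 2., 3.],
                      [4., 5., 6.]])

    print(x.sum())        # tensor(21.)
    print(x.mean())       # tensor(3.5000)
    print(x.max())        # tensor(6.)
    print(x.argmax())     # tensor(5)   <- index into the flattened tensor
    print(x.sum(dim=0))   # tensor([5., 7., 9.])   column sums
    print(x.sum(dim=1))   # tensor([ 6., 15.])     row sums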

Indexing and Slicing
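
Indexing works just like NumPy. A quick illustrative example:

    import torch

    x = torch.arange(12).reshape(3, 4)
    # tensor([[ 0,  1,  2,  3],
    #         [ 4,  5,  6,  7],
    #         [ 8,  9, 10, 11]])

    print(x[0])        # first row:      tensor([0, 1, 2, 3])
    print(x[:, 1])     # second column:  tensor([1, 5, 9])
    print(x[1, 2])     # single element: tensor(6)
    print(x[1:, :2])   # sub-matrix:     tensor([[4, 5], [8, 9]])
    print(x[x > 6])    # boolean mask:   tensor([ 7,  8,  9, 10, 11])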


GPU Acceleration

This is where PyTorch shines - seamless GPU computation.

Moving Tensors to GPU
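
A minimal sketch that falls back to the CPU when no GPU is present:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    x = torch.randn(3, 3)
    x = x.to(device)                        # move an existing tensor
    y = torch.randn(3, 3, device=device)    # or create it on the device directly

    z = x + y                               # both operands live on the same device
    print(z.device)                         # cuda:0 (or cpu if no GPU is available)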

Best practice:
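
The habit I'd suggest here (one reasonable convention, not the only one): pick the device once at the top of the script and allocate tensors directly on it, instead of creating them on the CPU and copying them over:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Slower: allocate on the CPU, then copy to the GPU
    a = torch.randn(1000, 1000).to(device)

    # Better: allocate on the target device in one step
    b = torch.randn(1000, 1000, device=device)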

Performance Comparison
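
A rough sketch of how to time the comparison yourself. The torch.cuda.synchronize() calls matter because GPU kernels run asynchronously, and the exact numbers depend entirely on your hardware:

    import time
    import torch

    size = 5000
    a = torch.randn(size, size)
    b = torch.randn(size, size)

    start = time.time()
    _ = a @ b
    cpu_time = time.time() - start

    if torch.cuda.is_available():
        a_gpu, b_gpu = a.cuda(), b.cuda()
        _ = a_gpu @ b_gpu              # warm-up run, excludes one-time CUDA init cost
        torch.cuda.synchronize()
        start = time.time()
        _ = a_gpu @ b_gpu
        torch.cuda.synchronize()       # wait for the kernel to finish before stopping the clock
        gpu_time = time.time() - start
        print(f"CPU: {cpu_time:.3f}s  GPU: {gpu_time:.3f}s  speedup: {cpu_time / gpu_time:.0f}x")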

My result for a 5000x5000 matrix multiplication: the GPU run was roughly 100x faster than the CPU run. The exact numbers depend on your hardware, but for large matrix operations the speedup is dramatic.

From NumPy to PyTorch

When I migrated my image processing pipeline from NumPy to PyTorch, performance improved dramatically.

NumPy vs PyTorch
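
The APIs line up almost one-to-one, and conversion between the two is cheap. A sketch (the equivalences in the comments are only a small sample):

    import numpy as np
    import torch

    # np.array([1, 2, 3])  ->  torch.tensor([1, 2, 3])
    # np.zeros((2, 3))     ->  torch.zeros(2, 3)
    # np.dot(a, b)         ->  a @ b
    # arr.reshape(3, 2)    ->  t.reshape(3, 2)

    arr = np.array([1.0, 2.0, 3.0])
    t = torch.from_numpy(arr)    # NumPy -> PyTorch, no copy
    back = t.numpy()             # PyTorch -> NumPy, no copy (CPU tensors only)

    t[0] = 99.0
    print(arr)                   # [99.  2.  3.]  <- they share the same memory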

Real Migration Example

My image preprocessing pipeline before and after:

Before (NumPy):
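
I'll use a simplified stand-in for that kind of pipeline - scaling, normalization, and channel reordering. The function name, shapes, and steps below are illustrative, not the original code:

    import numpy as np

    def preprocess_numpy(images):
        # images: (N, H, W, 3) uint8 array
        x = images.astype(np.float32) / 255.0   # scale to [0, 1]
        x = (x - 0.5) / 0.5                     # normalize to [-1, 1]
        x = x.transpose(0, 3, 1, 2)             # NHWC -> NCHW
        return x

    images = np.random.randint(0, 256, (8, 64, 64, 3), dtype=np.uint8)
    batch = preprocess_numpy(images)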

After (PyTorch):
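
And the equivalent stand-in in PyTorch, running the same steps on the GPU when one is available (again illustrative):

    import torch

    def preprocess_torch(images, device):
        # images: (N, H, W, 3) uint8 tensor
        x = images.to(device).float() / 255.0   # scale to [0, 1], on the GPU
        x = (x - 0.5) / 0.5                     # normalize to [-1, 1]
        x = x.permute(0, 3, 1, 2)               # NHWC -> NCHW
        return x

    device = "cuda" if torch.cuda.is_available() else "cpu"
    images = torch.randint(0, 256, (8, 64, 64, 3), dtype=torch.uint8)
    batch = preprocess_torch(images, device)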

Performance gain: 15x faster with GPU!

Common Patterns

Patterns I use in every PyTorch project:

1. Device-Agnostic Code
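
A sketch of the pattern - the nn.Linear model is just a placeholder:

    import torch
    import torch.nn as nn

    # Pick the device once, at the top of the script
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(10, 2).to(device)       # placeholder model
    x = torch.randn(32, 10, device=device)    # inputs on the same device as the model
    output = model(x)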

2. Seed for Reproducibility
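
A sketch - the set_seed helper is my own wrapper, not a PyTorch built-in:

    import random
    import numpy as np
    import torch

    def set_seed(seed=42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)   # no-op when CUDA is unavailable

    set_seed(42)
    print(torch.rand(2))   # the same values on every run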

3. Memory Management
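
A sketch of the calls involved (it assumes a CUDA GPU is present):

    import torch

    if torch.cuda.is_available():
        x = torch.randn(10000, 10000, device="cuda")    # ~400 MB of float32 data
        print(torch.cuda.memory_allocated() / 1e6, "MB")

        del x                         # drop the Python reference
        torch.cuda.empty_cache()      # hand cached blocks back to the driver
        print(torch.cuda.memory_allocated() / 1e6, "MB")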

4. Inference Mode
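
A sketch with a placeholder model - eval mode plus no_grad is what I use for prediction:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)    # placeholder model
    model.eval()                # switch layers like dropout/batchnorm to eval behaviour

    with torch.no_grad():       # no gradient tracking: less memory, faster
        x = torch.randn(1, 10)
        prediction = model(x)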

Quick Reference

Create tensors:
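
A condensed, non-exhaustive sketch:

    import numpy as np
    import torch

    torch.tensor([1, 2, 3])         # from a Python list
    torch.zeros(2, 3)               # all zeros
    torch.ones(2, 3)                # all ones
    torch.rand(2, 3)                # uniform random in [0, 1)
    torch.randn(2, 3)               # standard normal
    torch.arange(0, 10, 2)          # 0, 2, 4, 6, 8
    torch.from_numpy(np.ones(3))    # shares memory with the NumPy array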

Operations:
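
The everyday operations, equally abbreviated:

    import torch

    a = torch.rand(2, 3)
    b = torch.rand(2, 3)

    a + b, a - b, a * b, a / b      # element-wise arithmetic
    a @ b.T                         # matrix multiplication (2x3 @ 3x2 -> 2x2)
    a.sum(), a.mean(), a.max()      # reductions
    a.reshape(3, 2)                 # change shape
    a.T                             # transpose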

GPU:
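
And device handling:

    import torch

    torch.cuda.is_available()       # True if a CUDA GPU is usable
    if torch.cuda.is_available():
        x = torch.randn(3, device="cuda")   # create directly on the GPU
        y = torch.randn(3).to("cuda")       # move an existing tensor
        x = x.cpu()                         # move back to the CPU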

Best Practices

From my experience:

1. Use GPU when available - 10-100x speedup for large operations.

2. Keep tensors on the same device - mixing CPU and GPU tensors in one operation raises a runtime error.

3. Use in-place operations sparingly - they can interfere with autograd.

4. Batch operations - process multiple samples together for efficiency.

5. Use appropriate dtype - float32 for most tasks, float16 for memory savings.

6. Set seeds for reproducibility - crucial for debugging.

7. Profile your code - find bottlenecks before optimizing.

What's Next?

You now understand PyTorch tensors - the foundation of deep learning. In Part 2, we'll explore autograd: automatic differentiation that calculates gradients for you.

Next: Part 2 - Autograd and Automatic Differentiation


Previous: Series Overview

This article is part of the PyTorch 101 series. All examples use Python 3 and are based on real projects.
