Part 2: Understanding Models, Tokenizers, and Preprocessing

Part of the Hugging Face Transformers 101 Series

When Pipelines Weren't Enough

Pipelines are amazing for getting started, but I hit their limits quickly. I needed to:

  • Customize tokenization for domain-specific text

  • Access raw model outputs for custom post-processing

  • Implement batching with variable-length sequences

  • Understand why models failed on certain inputs

That's when I dove into the internals: models, tokenizers, and preprocessing.

Understanding these components transformed me from a pipeline user to someone who could build custom ML solutions. Let me share what I learned.

The Three-Component Architecture

Every Transformers pipeline uses three core components:

Text/Data β†’ Tokenizer β†’ Model β†’ Post-processor β†’ Output

  • Tokenizer: converts text to numbers (token IDs)

  • Model: processes the tokens and generates predictions

  • Post-processor: converts model outputs into human-readable results

Let's understand each deeply.

Tokenizers: From Text to Numbers

The fundamental problem: Neural networks work with numbers, not text. Tokenizers bridge this gap.

How Tokenization Works
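Here's a minimal example (bert-base-uncased is my illustrative checkpoint; any model works):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(tokens)
# e.g. ['token', '##ization', 'is', 'fascinating', '!']

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # one integer vocabulary ID per token
```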


What happened?

  1. Split text into tokens (words, subwords, punctuation)

  2. Map each token to a unique ID

  3. These IDs are what the model actually processes

Tokenization Strategies

Different models use different strategies:

Word-based (older, rarely used now):
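A rough sketch of the idea:

```python
# Word-level tokenization: one vocabulary ID per whole word
text = "Tokenization isn't trivial!"
print(text.split())  # ['Tokenization', "isn't", 'trivial!']
```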

Problem: Huge vocabulary (every word needs an ID)

Subword-based (modern, most common):
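BERT's WordPiece tokenizer, for instance, breaks rare words into known pieces:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("tokenization"))
# e.g. ['token', '##ization'] -- rare words split into known subwords
```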

Advantages:

  • Smaller vocabulary

  • Handles unknown words (break into known subwords)

  • Efficient across languages (subwords are shared among related words)

Character-based:
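The idea, in plain Python:

```python
# Character-level tokenization: one token per character
print(list("Hello world"))
# ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
```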

Rare in practice - one token per character means very long sequences.

The Full Tokenization Pipeline
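Calling the tokenizer object directly runs every step at once - splitting, ID mapping, special tokens, padding, and truncation. A sketch (the padding length is my choice):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "Hello world!",
    padding="max_length",   # pad up to max_length
    max_length=8,
    truncation=True,        # cut off anything longer
    return_tensors="pt",    # return PyTorch tensors
)
print(encoded["input_ids"])       # tensor([[101, 7592, 2088, 999, 102, 0, 0, 0]])
print(encoded["attention_mask"])  # tensor([[1, 1, 1, 1, 1, 0, 0, 0]])
# BERT-style tokenizers also return token_type_ids for sentence pairs
```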


Components returned:

  • input_ids: Token IDs for the text

  • attention_mask: 1 for real tokens, 0 for padding tokens

Special Tokens

Every model uses special tokens:
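You can see them by converting the encoded IDs back to tokens (bert-base-uncased again):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello world!")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'hello', 'world', '!', '[SEP]']
```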


Special token meanings:

  • [CLS]: Start of sequence (used for classification)

  • [SEP]: Separator between sequences

  • [PAD]: Padding to make sequences same length

  • [MASK]: Masked token (for training)

Handling Multiple Sequences

Processing pairs (question-answering, sentence similarity):
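Pass both sequences to the tokenizer and it joins them with the right special tokens (the question and context here are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What is tokenization?"
context = "Tokenization converts raw text into IDs the model can process."
encoded = tokenizer(question, context)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'what', 'is', 'token', '##ization', '?', '[SEP]', 'token', ..., '[SEP]']
# token_type_ids mark which tokens belong to the question vs. the context
```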


Format: [CLS] question [SEP] context [SEP]

Batch Tokenization

Efficient processing of multiple texts:
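Pass a list of texts and let the tokenizer pad them together (the texts are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = [
    "Short text.",
    "A somewhat longer sentence that needs quite a few more tokens.",
    "Medium-length input here.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (3, length_of_longest_sequence)
print(batch["attention_mask"])   # shorter rows end in 0s where padding was added
```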


Notice:

  • All sequences padded to the length of the longest one in the batch

  • Padding tokens have attention mask = 0

  • Real tokens have attention mask = 1

I use batch tokenization everywhere - much faster than one-by-one.

Tokenizer Types

Different models use different tokenizers:
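A quick comparison (the checkpoints are my picks):

```python
from transformers import AutoTokenizer

# BERT uses WordPiece, GPT-2 uses byte-level BPE, T5 uses SentencePiece
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenizers differ!"
print(bert_tok.tokenize(text))  # WordPiece pieces, lowercased
print(gpt2_tok.tokenize(text))  # BPE pieces; 'Ġ' marks a leading space
```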


Use AutoTokenizer - it automatically selects the right tokenizer class for whatever checkpoint you load.

Models: The Heart of Transformers

Now that we have tokens, let's see how models process them.

Loading Models
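Loading mirrors the tokenizer side (the checkpoints are illustrative):

```python
from transformers import AutoModel, AutoModelForSequenceClassification

# Base model: outputs hidden states, no task-specific head
base = AutoModel.from_pretrained("bert-base-uncased")

# Task model: same kind of backbone plus a classification head
clf = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
print(type(base).__name__, type(clf).__name__)
# BertModel DistilBertForSequenceClassification
```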


Model Architectures

AutoModel classes for different tasks:
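A sketch of the common ones (all real classes in transformers):

```python
from transformers import (
    AutoModel,                           # hidden states only (embeddings, features)
    AutoModelForSequenceClassification,  # text classification, sentiment
    AutoModelForTokenClassification,     # NER, part-of-speech tagging
    AutoModelForQuestionAnswering,       # extractive question answering
    AutoModelForCausalLM,                # text generation (GPT-style)
    AutoModelForMaskedLM,                # masked-token prediction (BERT-style)
)
```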

Choose based on your task.

Using Models Manually

Without pipelines:
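Here's the full tokenize-predict-post-process loop, sketched with an SST-2 sentiment checkpoint:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Tokenize
inputs = tokenizer("I love this library!", return_tensors="pt")

# 2. Run the model
with torch.no_grad():
    outputs = model(**inputs)

# 3. Post-process: logits -> probabilities -> label
probs = torch.softmax(outputs.logits, dim=-1)
label_id = int(probs.argmax(dim=-1))
print(model.config.id2label[label_id], float(probs[0, label_id]))
# POSITIVE, with a score close to 1.0
```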


This is what pipelines do automatically.

Understanding Model Outputs
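Inspecting a base model's output (bert-base-uncased for illustration):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 5, 768]) -- one vector per token
print(outputs.pooler_output.shape)      # torch.Size([1, 768])    -- one vector per sequence
```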


Components:

  • last_hidden_state: Token embeddings (batch_size, sequence_length, hidden_size)

  • pooler_output: Sentence embedding (batch_size, hidden_size)

I use last_hidden_state for token-level tasks (NER), pooler_output for sentence-level tasks (classification).

Extracting Embeddings

Get vector representations of text:
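A sketch using mean pooling over last_hidden_state (the pooling strategy is my choice; pooler_output works too):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text):
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token embeddings into one sentence vector
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

ml_1 = embed("Machine learning is fascinating.")
ml_2 = embed("Deep learning models are powerful.")
pizza = embed("I had pizza for dinner.")

print(F.cosine_similarity(ml_1, ml_2, dim=0))   # higher
print(F.cosine_similarity(ml_1, pizza, dim=0))  # lower
```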


Makes sense: ML texts are more similar to each other than to pizza.

I use this for semantic search, duplicate detection, clustering.

Complete Workflow Example

Let me show you a complete example: custom sentiment analysis with manual control.
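A sketch of that workflow - the checkpoint, batch size, and output format are my assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

def predict_sentiment(texts, batch_size=16):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)
        for text, p in zip(batch, probs):
            label_id = int(p.argmax())
            results.append({"text": text,
                            "label": model.config.id2label[label_id],
                            "score": round(float(p[label_id]), 4)})
    return results

print(predict_sentiment(["Great product!", "Terrible experience."]))
```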


This is the pattern I've used in production systems.

Preprocessing Different Data Types

Text Preprocessing
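For text, the tokenizer is the preprocessor; often the only extra step is light cleanup (a minimal sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_text(text):
    text = " ".join(text.split())  # collapse stray whitespace and newlines
    return tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

inputs = preprocess_text("  Some   messy\n  input text.  ")
```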

Image Preprocessing
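Images go through an image processor rather than a tokenizer (the ViT checkpoint and file path below are illustrative):

```python
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # hypothetical local file
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # (1, 3, 224, 224): resized and normalized
```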

Audio Preprocessing
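Audio uses a feature extractor that expects a waveform at the model's sampling rate (wav2vec2 here is my example):

```python
import numpy as np
from transformers import AutoFeatureExtractor

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

# Placeholder waveform: 1 second of silence at the 16 kHz rate wav2vec2 expects
audio = np.zeros(16000, dtype=np.float32)
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)  # (1, 16000)
```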

Performance Optimization

Memory Optimization
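Two easy wins: half-precision weights and gradient-free inference (a sketch; fp16 assumes a GPU):

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 needs a GPU

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", torch_dtype=dtype).to(device)
model.eval()

inputs = tokenizer("Hello world!", return_tensors="pt").to(device)
with torch.no_grad():  # no gradient buffers allocated during inference
    outputs = model(**inputs)
```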

Batch Size Tuning
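Benchmark a few sizes and keep the fastest one that fits in memory (a sketch):

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = ["A sample sentence."] * 64

def time_inference(batch_size):
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        inputs = tokenizer(texts[i:i + batch_size], padding=True,
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            model(**inputs)
    return time.perf_counter() - start

for bs in (1, 8, 32):
    print(f"batch_size={bs}: {time_inference(bs):.2f}s")
```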

Common Issues and Solutions

Issue 1: Token Length Mismatch

Problem: Input longer than model's max length

Solution:
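Enable truncation and cap inputs at the model's maximum (a sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "word " * 2000  # far beyond BERT's 512-token limit

inputs = tokenizer(long_text, truncation=True,
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")
print(inputs["input_ids"].shape)  # capped at (1, 512)
```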

Issue 2: Padding Issues

Problem: Different sequence lengths in batch

Solution:
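Pick a padding strategy explicitly (a sketch of both options):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["Short.", "A noticeably longer sentence in the same batch."]

# Dynamic padding: pad to the longest sequence in this batch
dynamic = tokenizer(texts, padding=True, return_tensors="pt")

# Fixed padding: pad every batch to the same length for consistent shapes
fixed = tokenizer(texts, padding="max_length", max_length=128,
                  truncation=True, return_tensors="pt")
```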

Issue 3: Special Tokens Issues

Problem: Need to preserve special characters

Solution:
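One common fix is registering the strings as new tokens so the tokenizer keeps them whole (the token names below are hypothetical):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific markers we want kept intact
num_added = tokenizer.add_tokens(["<user>", "<url>"])
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table to match
print(tokenizer.tokenize("<user> shared <url>"))  # the markers stay whole
```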

Best Practices

From my experience:

1. Always use AutoTokenizer and AutoModel - they handle model-specific details automatically.

2. Batch process when possible - much faster than one-by-one.

3. Cache tokenizers and models - load once, reuse many times.

4. Use appropriate padding - dynamic padding for variable lengths, fixed for consistent shapes.

5. Handle long texts - truncate or chunk, don't ignore.

6. Monitor memory usage - especially with large batches or models.

7. Use half precision when possible - faster, less memory.

8. Validate inputs - check for edge cases (empty strings, special characters).

What's Next?

We've covered the fundamentals of models and tokenizers. In Part 3, we'll learn how to fine-tune models on your own data using the Trainer API.

Next: Part 3 - Fine-tuning and Training with Trainer


Previous: Part 1 - Introduction to Transformers and Pipelines

This article is part of the Hugging Face Transformers 101 series. Check out the series overview for more content.
