Part 2: Understanding Models, Tokenizers, and Preprocessing

Part of the Hugging Face Transformers 101 Series

When Pipelines Weren't Enough

Pipelines are amazing for getting started, but I hit their limits quickly. I needed to:

  • Customize tokenization for domain-specific text

  • Access raw model outputs for custom post-processing

  • Implement batching with variable-length sequences

  • Understand why models failed on certain inputs

That's when I dove into the internals: models, tokenizers, and preprocessing.

Understanding these components transformed me from a pipeline user to someone who could build custom ML solutions. Let me share what I learned.

The Three-Component Architecture

Every Transformers pipeline uses three core components:

Text/Data β†’ Tokenizer β†’ Model β†’ Post-processor β†’ Output

  • Tokenizer: converts text to numbers (token IDs)

  • Model: processes the tokens and generates predictions

  • Post-processor: converts model outputs into human-readable results

Let's understand each deeply.

Tokenizers: From Text to Numbers

The fundamental problem: Neural networks work with numbers, not text. Tokenizers bridge this gap.

How Tokenization Works
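Here's a minimal example (bert-base-uncased is my illustrative checkpoint; any model works):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is fascinating!"
tokens = tokenizer.tokenize(text)
print(tokens)
# e.g. ['token', '##ization', 'is', 'fascinating', '!']

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # one integer vocabulary ID per token
```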


What happened?

  1. Split text into tokens (words, subwords, punctuation)

  2. Map each token to a unique ID

  3. These IDs are what the model actually processes

Tokenization Strategies

Different models use different strategies:

Word-based (older, rarely used now):
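A rough sketch of the idea:

```python
# Word-level tokenization: one vocabulary ID per whole word
text = "Tokenization isn't trivial!"
print(text.split())  # ['Tokenization', "isn't", 'trivial!']
```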

Problem: Huge vocabulary (every word needs an ID)

Subword-based (modern, most common):
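BERT's WordPiece tokenizer, for instance, breaks rare words into known pieces:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("tokenization"))
# e.g. ['token', '##ization'] -- rare words split into known subwords
```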

Advantages:

  • Smaller vocabulary

  • Handles unknown words (break into known subwords)

  • Efficient across languages (subwords are shared among related words)

Character-based:
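The idea, in plain Python:

```python
# Character-level tokenization: one token per character
print(list("Hello world"))
# ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
```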

Rare in practice - one token per character means very long sequences.

The Full Tokenization Pipeline
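Calling the tokenizer object directly runs every step at once - splitting, ID mapping, special tokens, padding, and truncation. A sketch (the padding length is my choice):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "Hello world!",
    padding="max_length",   # pad up to max_length
    max_length=8,
    truncation=True,        # cut off anything longer
    return_tensors="pt",    # return PyTorch tensors
)
print(encoded["input_ids"])       # tensor([[101, 7592, 2088, 999, 102, 0, 0, 0]])
print(encoded["attention_mask"])  # tensor([[1, 1, 1, 1, 1, 0, 0, 0]])
# BERT-style tokenizers also return token_type_ids for sentence pairs
```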


Components returned:

  • input_ids: Token IDs for the text

  • attention_mask: 1 for real tokens, 0 for padding tokens

Special Tokens

Every model uses special tokens:
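You can see them by converting the encoded IDs back to tokens (bert-base-uncased again):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello world!")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'hello', 'world', '!', '[SEP]']
```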


Special token meanings:

  • [CLS]: Start of sequence (used for classification)

  • [SEP]: Separator between sequences

  • [PAD]: Padding to make sequences same length

  • [MASK]: Masked token (for training)

Handling Multiple Sequences

Processing pairs (question-answering, sentence similarity):
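Pass both sequences to the tokenizer and it joins them with the right special tokens (the question and context here are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What is tokenization?"
context = "Tokenization converts raw text into IDs the model can process."
encoded = tokenizer(question, context)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'what', 'is', 'token', '##ization', '?', '[SEP]', 'token', ..., '[SEP]']
# token_type_ids mark which tokens belong to the question vs. the context
```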


Format: [CLS] question [SEP] context [SEP]

Batch Tokenization

Efficient processing of multiple texts:
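Pass a list of texts and let the tokenizer pad them together (the texts are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = [
    "Short text.",
    "A somewhat longer sentence that needs quite a few more tokens.",
    "Medium-length input here.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (3, length_of_longest_sequence)
print(batch["attention_mask"])   # shorter rows end in 0s where padding was added
```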


Notice:

  • All sequences padded to the length of the longest one in the batch

  • Padding tokens have attention mask = 0

  • Real tokens have attention mask = 1

I use batch tokenization everywhere - much faster than one-by-one.

Tokenizer Types

Different models use different tokenizers:
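A quick comparison (the checkpoints are my picks):

```python
from transformers import AutoTokenizer

# BERT uses WordPiece, GPT-2 uses byte-level BPE, T5 uses SentencePiece
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenizers differ!"
print(bert_tok.tokenize(text))  # WordPiece pieces, lowercased
print(gpt2_tok.tokenize(text))  # BPE pieces; 'Ġ' marks a leading space
```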


Use AutoTokenizer - it automatically selects the right tokenizer class for whatever checkpoint you load.

Models: The Heart of Transformers

Now that we have tokens, let's see how models process them.

Loading Models
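Loading mirrors the tokenizer side (the checkpoints are illustrative):

```python
from transformers import AutoModel, AutoModelForSequenceClassification

# Base model: outputs hidden states, no task-specific head
base = AutoModel.from_pretrained("bert-base-uncased")

# Task model: same kind of backbone plus a classification head
clf = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
print(type(base).__name__, type(clf).__name__)
# BertModel DistilBertForSequenceClassification
```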


Model Architectures

AutoModel classes for different tasks:
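A sketch of the common ones (all real classes in transformers):

```python
from transformers import (
    AutoModel,                           # hidden states only (embeddings, features)
    AutoModelForSequenceClassification,  # text classification, sentiment
    AutoModelForTokenClassification,     # NER, part-of-speech tagging
    AutoModelForQuestionAnswering,       # extractive question answering
    AutoModelForCausalLM,                # text generation (GPT-style)
    AutoModelForMaskedLM,                # masked-token prediction (BERT-style)
)
```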

Choose based on your task.

Using Models Manually

Without pipelines:
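Here's the full tokenize-predict-post-process loop, sketched with an SST-2 sentiment checkpoint:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Tokenize
inputs = tokenizer("I love this library!", return_tensors="pt")

# 2. Run the model
with torch.no_grad():
    outputs = model(**inputs)

# 3. Post-process: logits -> probabilities -> label
probs = torch.softmax(outputs.logits, dim=-1)
label_id = int(probs.argmax(dim=-1))
print(model.config.id2label[label_id], float(probs[0, label_id]))
# POSITIVE, with a score close to 1.0
```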


This is what pipelines do automatically.

Understanding Model Outputs
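Inspecting a base model's output (bert-base-uncased for illustration):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 5, 768]) -- one vector per token
print(outputs.pooler_output.shape)      # torch.Size([1, 768])    -- one vector per sequence
```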


Components:

  • last_hidden_state: Token embeddings (batch_size, sequence_length, hidden_size)

  • pooler_output: Sentence embedding (batch_size, hidden_size)

I use last_hidden_state for token-level tasks (NER), pooler_output for sentence-level tasks (classification).

Extracting Embeddings

Get vector representations of text:
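A sketch using mean pooling over last_hidden_state (the pooling strategy is my choice; pooler_output works too):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text):
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token embeddings into one sentence vector
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

ml_1 = embed("Machine learning is fascinating.")
ml_2 = embed("Deep learning models are powerful.")
pizza = embed("I had pizza for dinner.")

print(F.cosine_similarity(ml_1, ml_2, dim=0))   # higher
print(F.cosine_similarity(ml_1, pizza, dim=0))  # lower
```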


Makes sense: ML texts are more similar to each other than to pizza.

I use this for semantic search, duplicate detection, clustering.

Complete Workflow Example

Let me show you a complete example: custom sentiment analysis with manual control.
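A sketch of that workflow - the checkpoint, batch size, and output format are my assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

def predict_sentiment(texts, batch_size=16):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)
        for text, p in zip(batch, probs):
            label_id = int(p.argmax())
            results.append({"text": text,
                            "label": model.config.id2label[label_id],
                            "score": round(float(p[label_id]), 4)})
    return results

print(predict_sentiment(["Great product!", "Terrible experience."]))
```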


This is the pattern I've used in production systems.

Preprocessing Different Data Types

Text Preprocessing
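For text, the tokenizer is the preprocessor; often the only extra step is light cleanup (a minimal sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_text(text):
    text = " ".join(text.split())  # collapse stray whitespace and newlines
    return tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

inputs = preprocess_text("  Some   messy\n  input text.  ")
```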

Image Preprocessing
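Images go through an image processor rather than a tokenizer (the ViT checkpoint and file path below are illustrative):

```python
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # hypothetical local file
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # (1, 3, 224, 224): resized and normalized
```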

Audio Preprocessing
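Audio uses a feature extractor that expects a waveform at the model's sampling rate (wav2vec2 here is my example):

```python
import numpy as np
from transformers import AutoFeatureExtractor

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

# Placeholder waveform: 1 second of silence at the 16 kHz rate wav2vec2 expects
audio = np.zeros(16000, dtype=np.float32)
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)  # (1, 16000)
```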

Performance Optimization

Memory Optimization
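Two easy wins: half-precision weights and gradient-free inference (a sketch; fp16 assumes a GPU):

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 needs a GPU

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", torch_dtype=dtype).to(device)
model.eval()

inputs = tokenizer("Hello world!", return_tensors="pt").to(device)
with torch.no_grad():  # no gradient buffers allocated during inference
    outputs = model(**inputs)
```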

Batch Size Tuning
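Benchmark a few sizes and keep the fastest one that fits in memory (a sketch):

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = ["A sample sentence."] * 64

def time_inference(batch_size):
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        inputs = tokenizer(texts[i:i + batch_size], padding=True,
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            model(**inputs)
    return time.perf_counter() - start

for bs in (1, 8, 32):
    print(f"batch_size={bs}: {time_inference(bs):.2f}s")
```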

Common Issues and Solutions

Issue 1: Token Length Mismatch

Problem: Input longer than model's max length

Solution:
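Enable truncation and cap inputs at the model's maximum (a sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "word " * 2000  # far beyond BERT's 512-token limit

inputs = tokenizer(long_text, truncation=True,
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")
print(inputs["input_ids"].shape)  # capped at (1, 512)
```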

Issue 2: Padding Issues

Problem: Different sequence lengths in batch

Solution:
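Pick a padding strategy explicitly (a sketch of both options):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["Short.", "A noticeably longer sentence in the same batch."]

# Dynamic padding: pad to the longest sequence in this batch
dynamic = tokenizer(texts, padding=True, return_tensors="pt")

# Fixed padding: pad every batch to the same length for consistent shapes
fixed = tokenizer(texts, padding="max_length", max_length=128,
                  truncation=True, return_tensors="pt")
```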

Issue 3: Special Tokens Issues

Problem: Need to preserve special characters

Solution:
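One common fix is registering the strings as new tokens so the tokenizer keeps them whole (the token names below are hypothetical):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific markers we want kept intact
num_added = tokenizer.add_tokens(["<user>", "<url>"])
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table to match
print(tokenizer.tokenize("<user> shared <url>"))  # the markers stay whole
```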

Best Practices

From my experience:

1. Always use AutoTokenizer and AutoModel - they handle model-specific details automatically.

2. Batch process when possible - much faster than one-by-one.

3. Cache tokenizers and models - load once, reuse many times.

4. Use appropriate padding - dynamic padding for variable lengths, fixed for consistent shapes.

5. Handle long texts - truncate or chunk, don't ignore.

6. Monitor memory usage - especially with large batches or models.

7. Use half precision when possible - faster, less memory.

8. Validate inputs - check for edge cases (empty strings, special characters).

What's Next?

We've covered the fundamentals of models and tokenizers. In Part 3, we'll learn how to fine-tune models on your own data using the Trainer API.

Next: Part 3 - Fine-tuning and Training with Trainer


Previous: Part 1 - Introduction to Transformers and Pipelines

This article is part of the Hugging Face Transformers 101 series. Check out the series overview for more content.
