Part 3: Logstash - Data Processing Pipeline

Part of the ELK Stack 101 Series

The Day Logstash Saved My Sanity

Picture this: 15 different microservices, each logging in its own creative format:

Service A: [2025-01-15 10:30:45] ERROR - Payment failed
Service B: ERROR|2025-01-15T10:30:45Z|user_service|Authentication timeout
Service C: {"timestamp":1705318245,"level":"error","msg":"Database connection lost"}

Searching was a nightmare. Correlating errors across services? Forget it.

Then I discovered Logstash. Within a day, I had:

  • Unified JSON format across all services

  • Parsed timestamps into proper date fields

  • Extracted user IDs, trace IDs, error codes

  • Enriched logs with environment and region data

  • Routed everything to Elasticsearch

Logstash is the unsung hero of the ELK stack. It's the data janitor that cleans up your mess.

In this article, I'll share everything I've learned about building Logstash pipelines - from basic parsing to advanced transformations.

What is Logstash?

Logstash is a server-side data processing pipeline that:

  1. Ingests data from multiple sources simultaneously

  2. Transforms it (parse, filter, enrich)

  3. Sends it to multiple destinations (Elasticsearch, S3, etc.)

Think of it as an ETL tool for logs (Extract, Transform, Load).

Logstash is written in JRuby, runs on the JVM, and is configured with its own domain-specific language - the pipeline config files you'll see throughout this article.

Logstash Architecture

Every Logstash pipeline has three stages:

Inputs

Where data comes from:

  • file: Read from log files

  • beats: Receive from Filebeat, Metricbeat

  • tcp/udp: Listen on network sockets

  • http: HTTP endpoint

  • kafka: Consume from Kafka

  • jdbc: Query databases

  • redis: Read from Redis

Filters

Transform and enrich data:

  • grok: Parse unstructured text

  • mutate: Modify fields

  • date: Parse timestamps

  • geoip: Add geographic data

  • json: Parse JSON

  • csv: Parse CSV

  • ruby: Custom Ruby code

Outputs

Where data goes:

  • elasticsearch: Index to Elasticsearch

  • file: Write to files

  • kafka: Produce to Kafka

  • s3: Upload to S3

  • stdout: Print to console (debugging)

  • email: Send alerts

One pipeline can have multiple inputs, filters, and outputs.

Installing Logstash

Method 1: Docker (Quick Start)
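One command gets you a running instance. The version tag below is just an example - match it to your Elasticsearch version:

```bash
# Mounts a local pipeline/ directory of .conf files into the container
docker run --rm -it \
  -v "$(pwd)/pipeline/:/usr/share/logstash/pipeline/" \
  docker.elastic.co/logstash/logstash:8.12.0
```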

Method 2: Linux Installation

On Ubuntu/Debian:
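The standard Elastic APT repository setup (8.x shown as an example):

```bash
# Add the Elastic GPG key and APT repository
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | \
  sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | \
  sudo tee /etc/apt/sources.list.d/elastic-8.x.list

sudo apt-get update && sudo apt-get install logstash
sudo systemctl enable --now logstash
```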

Configuration directory: /etc/logstash
Pipeline configs: /etc/logstash/conf.d/

My First Logstash Pipeline

Let's start with a simple "hello world" pipeline.

simple-pipeline.conf:
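It reads lines from stdin and prints them back as structured events:

```conf
input {
  stdin { }
}

output {
  stdout {
    codec => rubydebug
  }
}
```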

Run it:
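```bash
bin/logstash -f simple-pipeline.conf
# on a package install: /usr/share/logstash/bin/logstash -f simple-pipeline.conf
```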

Type something:
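```
hello logstash
```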

Output:
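Roughly this - the exact fields vary a little by Logstash version:

```
{
       "message" => "hello logstash",
    "@timestamp" => 2025-01-15T10:30:45.123Z,
      "@version" => "1",
          "host" => "my-laptop"
}
```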

Congratulations! Your first pipeline.

Real-World Pipeline: Apache Access Logs

Let me show you a real pipeline I use for Apache logs.

Sample Apache log:
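A typical combined-format line (IP and URL are made up):

```
203.0.113.42 - - [15/Jan/2025:10:30:45 +0000] "GET /api/products HTTP/1.1" 200 1043 "https://example.com/" "Mozilla/5.0"
```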

Goal: Parse this into structured JSON.

apache-pipeline.conf:
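Something close to this - the paths and index name are examples, and the field names assume the classic (non-ECS) grok patterns:

```conf
input {
  file {
    path => "/var/log/apache2/access.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match  => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
  geoip {
    source => "clientip"
  }
  mutate {
    convert => {
      "response" => "integer"
      "bytes"    => "integer"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "apache-logs-%{+YYYY.MM.dd}"
  }
}
```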

Result in Elasticsearch:
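Roughly this, plus geoip fields when the client IP resolves in the GeoIP database:

```json
{
  "@timestamp": "2025-01-15T10:30:45.000Z",
  "clientip": "203.0.113.42",
  "verb": "GET",
  "request": "/api/products",
  "httpversion": "1.1",
  "response": 200,
  "bytes": 1043,
  "referrer": "https://example.com/",
  "agent": "Mozilla/5.0"
}
```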

Beautiful structured data from a messy log line.

Grok Patterns - The Heart of Logstash

Grok is how you parse unstructured text. It uses regex patterns with names.

Basic Grok Syntax
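Every grok expression is built from %{SYNTAX:SEMANTIC} pairs:

```
%{SYNTAX:SEMANTIC}

# SYNTAX   = the name of a built-in or custom pattern (IP, NUMBER, WORD, ...)
# SEMANTIC = the field name to store the matched value in

# Example: parses "55.3.244.1 GET /index.html 15824"
%{IP:client_ip} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes}
```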

Common Built-In Patterns
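A few you'll use constantly:

```
%{IP}                 # IPv4 or IPv6 address
%{NUMBER}             # integer or float
%{WORD}               # a single word
%{TIMESTAMP_ISO8601}  # 2025-01-15T10:30:45Z and similar
%{LOGLEVEL}           # DEBUG, INFO, WARN, ERROR, FATAL, ...
%{GREEDYDATA}         # everything remaining on the line
%{COMBINEDAPACHELOG}  # a full Apache/Nginx combined access log line
```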

My Custom Application Log Pattern

Log format:
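A representative line - the timestamp, level, service, and trace ID layout is what matters here:

```
2025-01-15 10:30:45,123 ERROR [payment-service] [trace-9f8a7b] Payment failed for order 42
```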

Grok pattern:
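```conf
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:service}\] \[%{DATA:trace_id}\] %{GREEDYDATA:log_message}"
    }
  }
}
```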

Result:
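```json
{
  "timestamp": "2025-01-15 10:30:45,123",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "trace-9f8a7b",
  "log_message": "Payment failed for order 42"
}
```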

Testing Grok Patterns

Use the Grok Debugger in Kibana:

  1. Open Kibana

  2. Navigate to Dev Tools → Grok Debugger

  3. Paste your log line

  4. Test patterns

Or use online tools: https://grokdebugger.com

Custom Grok Patterns

Define custom patterns in /etc/logstash/patterns/custom-patterns:
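Each line is a pattern name followed by a regex (these two are illustrative):

```
ORDER_ID ORD-[0-9]{8}
TRACE_ID [a-f0-9]{32}
```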

Use in pipeline:
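Point grok at the directory with patterns_dir and use your patterns like any built-in:

```conf
filter {
  grok {
    patterns_dir => ["/etc/logstash/patterns"]
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{ORDER_ID:order_id} %{GREEDYDATA:log_message}" }
  }
}
```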

Common Filters

Mutate Filter

Add, remove, replace, convert fields:
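Field names and values here are examples:

```conf
filter {
  mutate {
    add_field    => { "environment" => "production" }
    remove_field => [ "host", "path" ]
    rename       => { "msg" => "log_message" }
    replace      => { "type" => "application" }
    convert      => { "response_time" => "float" }
    lowercase    => [ "level" ]
  }
}
```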

Date Filter

Parse timestamps into @timestamp:
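```conf
filter {
  date {
    match  => [ "timestamp", "yyyy-MM-dd HH:mm:ss,SSS" ]
    target => "@timestamp"
  }
}
```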

Multiple date formats:
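List every format your sources emit; the first one that matches wins. ISO8601 and UNIX are special built-in values:

```conf
filter {
  date {
    match => [
      "timestamp",
      "ISO8601",
      "UNIX",
      "yyyy-MM-dd HH:mm:ss,SSS",
      "dd/MMM/yyyy:HH:mm:ss Z"
    ]
  }
}
```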

JSON Filter

Parse JSON logs:
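```conf
filter {
  json {
    source => "message"
  }
}
```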

Input:
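```
{"timestamp":1705318245,"level":"error","msg":"Database connection lost"}
```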

Output:
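```json
{
  "timestamp": 1705318245,
  "level": "error",
  "msg": "Database connection lost"
}
```

The original message string is kept alongside the parsed fields unless you drop it, for example with mutate's remove_field.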

GeoIP Filter

Add geographic data from IP addresses:
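The source field name is an example; point it at whatever field holds the client IP:

```conf
filter {
  geoip {
    source => "client_ip"
    target => "geoip"
  }
}
```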

Result:
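Something like this - the city and coordinates depend on your GeoIP database:

```json
{
  "client_ip": "8.8.8.8",
  "geoip": {
    "country_name": "United States",
    "city_name": "Mountain View",
    "location": { "lat": 37.386, "lon": -122.084 },
    "timezone": "America/Los_Angeles"
  }
}
```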

Drop Filter

Discard events:
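```conf
filter {
  if [level] == "DEBUG" {
    drop { }
  }
}
```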

Ruby Filter

Execute custom Ruby code:
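A small sketch - the field names are illustrative, but event.get and event.set are the real event API:

```conf
filter {
  ruby {
    code => "
      # compute a duration in ms from two numeric fields, if both are present
      if event.get('end_time') && event.get('start_time')
        event.set('duration_ms', (event.get('end_time') - event.get('start_time')) * 1000)
      end
    "
  }
}
```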

Conditional Logic

Logstash supports if/else conditionals.

Syntax
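```conf
filter {
  if [level] == "ERROR" {
    # filters applied only to error events
  } else if [level] == "WARN" {
    # filters for warnings
  } else {
    # everything else
  }
}
```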

Operators
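  • Comparison: ==, !=, <, >, <=, >=

  • Regex: =~, !~

  • Inclusion: in, not in

  • Boolean: and, or, nand, xor, !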

Examples

Route by log level:
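Index names and hosts below are examples:

```conf
output {
  if [level] == "ERROR" or [level] == "FATAL" {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "errors-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "logs-%{+YYYY.MM.dd}"
    }
  }
}
```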

Parse different log formats:
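A sketch using the two formats from the intro - one bracketed plain-text format, one JSON (service names are examples):

```conf
filter {
  if [service] == "payment-service" {
    grok {
      match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{LOGLEVEL:level} - %{GREEDYDATA:log_message}" }
    }
  } else if [service] == "user-service" {
    json {
      source => "message"
    }
  }
}
```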

Check field existence:
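```conf
filter {
  if [trace_id] {
    mutate { add_tag => [ "traced" ] }
  }
  if ![user_id] {
    mutate { add_field => { "user_id" => "anonymous" } }
  }
}
```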

Multiple Pipelines

Problem: One Logstash instance handling different log types.

Solution: Multiple pipeline configurations.

pipelines.yml:
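Pipeline IDs, paths, and worker counts are examples:

```yaml
# /etc/logstash/pipelines.yml
- pipeline.id: apache
  path.config: "/etc/logstash/conf.d/apache-pipeline.conf"
  pipeline.workers: 2

- pipeline.id: microservices
  path.config: "/etc/logstash/conf.d/microservices-pipeline.conf"
  pipeline.workers: 4
```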

Each pipeline runs independently with dedicated workers.

Production Pipeline Example

Here's a complete pipeline I use in production for microservices logs.

microservices-pipeline.conf:
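A representative version - hosts, bucket names, addresses, index names, and the severity threshold are placeholders, and the email and s3 outputs assume those plugins are available:

```conf
input {
  beats {
    port => 5044                              # receive from Filebeat
  }
}

filter {
  json {
    source => "message"                       # parse the JSON log line
  }

  date {
    match  => [ "timestamp", "ISO8601", "UNIX" ]
    target => "@timestamp"
  }

  if [trace_id] {
    mutate { add_tag => [ "traced" ] }        # keep trace IDs easy to filter on
  }

  mutate {
    add_field => {                            # environment metadata (example values)
      "environment" => "production"
      "region"      => "us-east-1"
    }
  }

  ruby {
    code => "
      # map log level to a numeric severity score (illustrative scale)
      scores = { 'DEBUG' => 1, 'INFO' => 2, 'WARN' => 3, 'ERROR' => 4, 'FATAL' => 5 }
      event.set('severity', scores.fetch(event.get('level').to_s.upcase, 0))
    "
  }

  if [client_ip] {
    geoip {
      source => "client_ip"                   # GeoIP enrichment
      target => "geoip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]        # add user/password/ssl if security is enabled
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
  }

  if [severity] >= 4 {
    email {                                   # alert on ERROR/FATAL
      to      => "oncall@example.com"
      subject => "Critical error in %{[service]}"
      body    => "%{message}"
    }
  }

  s3 {                                        # raw backup
    bucket => "my-log-archive"
    region => "us-east-1"
    codec  => "json_lines"
  }
}
```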

This pipeline:

  1. Receives logs from Filebeat

  2. Parses JSON

  3. Extracts timestamps and trace IDs

  4. Adds environment metadata

  5. Calculates severity scores

  6. Enriches with GeoIP data

  7. Sends to Elasticsearch with dynamic index names

  8. Alerts on critical errors

  9. Backs up to S3

Performance Tuning

1. Pipeline Workers

Configure worker threads:
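```yaml
# logstash.yml (or per pipeline in pipelines.yml)
pipeline.workers: 8    # defaults to the number of CPU cores
```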

Rule of thumb: Workers = CPU cores

2. Batch Processing

Larger batches = better throughput, higher latency:
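```yaml
pipeline.batch.size: 1000
pipeline.batch.delay: 50
```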

Smaller batches = lower latency, lower throughput:
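```yaml
pipeline.batch.size: 125    # the default
pipeline.batch.delay: 5
```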

3. JVM Heap

Set heap size (50% of RAM, max 31GB):

jvm.options:
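```
# /etc/logstash/jvm.options - keep min and max equal
-Xms4g
-Xmx4g
```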

4. Persistent Queue

Enable persistent queue for reliability:
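```yaml
# logstash.yml
queue.type: persisted
queue.max_bytes: 4gb
```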

Monitoring Logstash

Monitoring API
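Logstash exposes a monitoring API on port 9600:

```bash
# Node stats: events, pipeline, JVM, and process metrics
curl -s 'http://localhost:9600/_node/stats?pretty'

# Per-pipeline stats only
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
```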

Key Metrics

From stats API:
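The events section is the one I check first (numbers are illustrative):

```json
{
  "events": {
    "in": 1250000,
    "filtered": 1250000,
    "out": 1249800,
    "queue_push_duration_in_millis": 523
  }
}
```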

Watch for:

  • Events in/out ratio (should stay close to 1:1)

  • Queue push duration (should be low)

  • CPU and memory usage

Debugging Pipelines

Use stdout Output
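Temporarily add a stdout output so you can see every event with all its fields:

```conf
output {
  stdout {
    codec => rubydebug
  }
}
```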

Add Debug Logging
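One approach that helps: have filters tag events when they succeed, so you can see which branch actually handled each event (the tag name is arbitrary):

```conf
filter {
  grok {
    match   => { "message" => "%{COMBINEDAPACHELOG}" }
    add_tag => [ "grok_ok" ]    # added only on a successful match; failures get _grokparsefailure
  }
}
```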

Enable Logstash Debugging
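```bash
# Raise the log level temporarily (or set log.level: debug in logstash.yml)
bin/logstash -f pipeline.conf --log.level=debug
```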

Test Configurations
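```bash
# Validate syntax without starting the pipeline
bin/logstash -f pipeline.conf --config.test_and_exit

# Or keep it running and reload automatically when the config changes
bin/logstash -f pipeline.conf --config.reload.automatic
```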

Common Patterns

Pattern 1: Multiline Logs

Java stack traces:
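The multiline codec merges continuation lines into the previous event (path and pattern are examples):

```conf
input {
  file {
    path => "/var/log/app/app.log"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601}"   # a new event starts with a timestamp
      negate  => true
      what    => "previous"                # anything else belongs to the previous line
    }
  }
}
```

If Filebeat ships the logs, do the multiline merging on the Filebeat side instead so events arrive already assembled.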

Pattern 2: Dead Letter Queue

Handle parse failures:
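Two layers here: grok failures are tagged _grokparsefailure and can be routed to their own index, while the dead letter queue (enabled in logstash.yml) captures events that Elasticsearch rejects outright. Index names are examples:

```conf
output {
  if "_grokparsefailure" in [tags] {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "parse-failures-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "logs-%{+YYYY.MM.dd}"
    }
  }
}
```

```yaml
# logstash.yml
dead_letter_queue.enable: true
```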

Pattern 3: Dynamic Routing

Route to different indices:
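Field references in the index name do the routing; the naming scheme here is an example:

```conf
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "logs-%{[service]}-%{[environment]}-%{+YYYY.MM.dd}"
  }
}
```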

Logstash vs. Beats

When to use Logstash:

  • Complex parsing (grok patterns)

  • Data transformation and enrichment

  • Multiple outputs

  • Aggregation and filtering

When to use Beats only:

  • Simple log shipping

  • Low resource overhead

  • No parsing needed (already JSON)

My approach: Use Filebeat to ship logs, Logstash to process them.

Conclusion

Logstash is the data processing powerhouse of the ELK stack. Key takeaways:

Architecture:

  • Input → Filter → Output

  • Multiple inputs, filters, outputs supported

  • Conditional logic and routing

Parsing:

  • Grok patterns for unstructured logs

  • JSON filter for structured logs

  • Custom patterns for application logs

Transformation:

  • Mutate fields (add, remove, convert)

  • Date parsing for timestamps

  • GeoIP enrichment

  • Custom Ruby code

Production:

  • Multiple pipelines for different log types

  • Performance tuning (workers, batch size, heap)

  • Persistent queues for reliability

  • Monitoring and debugging tools

In the next article, we'll explore Kibana - the visualization layer that brings everything together.

Previous: Part 2 - Elasticsearch Deep Dive
Next: Part 4 - Kibana Visualization


This article is part of the ELK Stack 101 series. Check out the series overview for more content.
