Part 3: Logstash - Data Processing Pipeline

Part of the ELK Stack 101 Series

The Day Logstash Saved My Sanity

Picture this: 15 different microservices, each logging in its own creative format:

Service A: [2025-01-15 10:30:45] ERROR - Payment failed
Service B: ERROR|2025-01-15T10:30:45Z|user_service|Authentication timeout
Service C: {"timestamp":1705318245,"level":"error","msg":"Database connection lost"}

Searching was a nightmare. Correlating errors across services? Forget it.

Then I discovered Logstash. Within a day, I had:

  • Unified JSON format across all services

  • Parsed timestamps into proper date fields

  • Extracted user IDs, trace IDs, error codes

  • Enriched logs with environment and region data

  • Routed everything to Elasticsearch

Logstash is the unsung hero of the ELK stack. It's the data janitor that cleans up your mess.

In this article, I'll share everything I've learned about building Logstash pipelines - from basic parsing to advanced transformations.

What is Logstash?

Logstash is a server-side data processing pipeline that:

  1. Ingests data from multiple sources simultaneously

  2. Transforms it (parse, filter, enrich)

  3. Sends it to multiple destinations (Elasticsearch, S3, etc.)

Think of it as an ETL tool for logs (Extract, Transform, Load).

Logstash is written in JRuby, runs on the JVM, and is configured with its own domain-specific language - the pipeline config files you'll see throughout this article.

Logstash Architecture

Every Logstash pipeline has three stages:

Inputs

Where data comes from:

  • file: Read from log files

  • beats: Receive from Filebeat, Metricbeat

  • tcp/udp: Listen on network sockets

  • http: HTTP endpoint

  • kafka: Consume from Kafka

  • jdbc: Query databases

  • redis: Read from Redis

Filters

Transform and enrich data:

  • grok: Parse unstructured text

  • mutate: Modify fields

  • date: Parse timestamps

  • geoip: Add geographic data

  • json: Parse JSON

  • csv: Parse CSV

  • ruby: Custom Ruby code

Outputs

Where data goes:

  • elasticsearch: Index to Elasticsearch

  • file: Write to files

  • kafka: Produce to Kafka

  • s3: Upload to S3

  • stdout: Print to console (debugging)

  • email: Send alerts

One pipeline can have multiple inputs, filters, and outputs.

Installing Logstash

Method 1: Docker (Quick Start)
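One command gets you a running instance. The version tag below is just an example - match it to your Elasticsearch version:

```bash
# Mounts a local pipeline/ directory of .conf files into the container
docker run --rm -it \
  -v "$(pwd)/pipeline/:/usr/share/logstash/pipeline/" \
  docker.elastic.co/logstash/logstash:8.12.0
```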

Method 2: Linux Installation

On Ubuntu/Debian:
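The standard Elastic APT repository setup (8.x shown as an example):

```bash
# Add the Elastic GPG key and APT repository
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | \
  sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | \
  sudo tee /etc/apt/sources.list.d/elastic-8.x.list

sudo apt-get update && sudo apt-get install logstash
sudo systemctl enable --now logstash
```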

Configuration directory: /etc/logstash
Pipeline configs: /etc/logstash/conf.d/

My First Logstash Pipeline

Let's start with a simple "hello world" pipeline.

simple-pipeline.conf:
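It reads lines from stdin and prints them back as structured events:

```conf
input {
  stdin { }
}

output {
  stdout {
    codec => rubydebug
  }
}
```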

Run it:
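```bash
bin/logstash -f simple-pipeline.conf
# on a package install: /usr/share/logstash/bin/logstash -f simple-pipeline.conf
```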

Type something:
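```
hello logstash
```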

Output:
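Roughly this - the exact fields vary a little by Logstash version:

```
{
       "message" => "hello logstash",
    "@timestamp" => 2025-01-15T10:30:45.123Z,
      "@version" => "1",
          "host" => "my-laptop"
}
```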

Congratulations! Your first pipeline.

Real-World Pipeline: Apache Access Logs

Let me show you a real pipeline I use for Apache logs.

Sample Apache log:
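A typical combined-format line (IP and URL are made up):

```
203.0.113.42 - - [15/Jan/2025:10:30:45 +0000] "GET /api/products HTTP/1.1" 200 1043 "https://example.com/" "Mozilla/5.0"
```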

Goal: Parse this into structured JSON.

apache-pipeline.conf:
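Something close to this - the paths and index name are examples, and the field names assume the classic (non-ECS) grok patterns:

```conf
input {
  file {
    path => "/var/log/apache2/access.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match  => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
  geoip {
    source => "clientip"
  }
  mutate {
    convert => {
      "response" => "integer"
      "bytes"    => "integer"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "apache-logs-%{+YYYY.MM.dd}"
  }
}
```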

Result in Elasticsearch:
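Roughly this, plus geoip fields when the client IP resolves in the GeoIP database:

```json
{
  "@timestamp": "2025-01-15T10:30:45.000Z",
  "clientip": "203.0.113.42",
  "verb": "GET",
  "request": "/api/products",
  "httpversion": "1.1",
  "response": 200,
  "bytes": 1043,
  "referrer": "https://example.com/",
  "agent": "Mozilla/5.0"
}
```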

Beautiful structured data from a messy log line.

Grok Patterns - The Heart of Logstash

Grok is how you parse unstructured text. It uses regex patterns with names.

Basic Grok Syntax
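Every grok expression is built from %{SYNTAX:SEMANTIC} pairs:

```
%{SYNTAX:SEMANTIC}

# SYNTAX   = the name of a built-in or custom pattern (IP, NUMBER, WORD, ...)
# SEMANTIC = the field name to store the matched value in

# Example: parses "55.3.244.1 GET /index.html 15824"
%{IP:client_ip} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes}
```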

Common Built-In Patterns
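A few you'll use constantly:

```
%{IP}                 # IPv4 or IPv6 address
%{NUMBER}             # integer or float
%{WORD}               # a single word
%{TIMESTAMP_ISO8601}  # 2025-01-15T10:30:45Z and similar
%{LOGLEVEL}           # DEBUG, INFO, WARN, ERROR, FATAL, ...
%{GREEDYDATA}         # everything remaining on the line
%{COMBINEDAPACHELOG}  # a full Apache/Nginx combined access log line
```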

My Custom Application Log Pattern

Log format:
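A representative line - the timestamp, level, service, and trace ID layout is what matters here:

```
2025-01-15 10:30:45,123 ERROR [payment-service] [trace-9f8a7b] Payment failed for order 42
```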

Grok pattern:
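```conf
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:service}\] \[%{DATA:trace_id}\] %{GREEDYDATA:log_message}"
    }
  }
}
```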

Result:
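```json
{
  "timestamp": "2025-01-15 10:30:45,123",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "trace-9f8a7b",
  "log_message": "Payment failed for order 42"
}
```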

Testing Grok Patterns

Use the Grok Debugger in Kibana:

  1. Open Kibana

  2. Navigate to Dev Tools → Grok Debugger

  3. Paste your log line

  4. Test patterns

Or use online tools: https://grokdebugger.com

Custom Grok Patterns

Define custom patterns in /etc/logstash/patterns/custom-patterns:
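Each line is a pattern name followed by a regex (these two are illustrative):

```
ORDER_ID ORD-[0-9]{8}
TRACE_ID [a-f0-9]{32}
```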

Use in pipeline:
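Point grok at the directory with patterns_dir and use your patterns like any built-in:

```conf
filter {
  grok {
    patterns_dir => ["/etc/logstash/patterns"]
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{ORDER_ID:order_id} %{GREEDYDATA:log_message}" }
  }
}
```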

Common Filters

Mutate Filter

Add, remove, replace, convert fields:
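Field names and values here are examples:

```conf
filter {
  mutate {
    add_field    => { "environment" => "production" }
    remove_field => [ "host", "path" ]
    rename       => { "msg" => "log_message" }
    replace      => { "type" => "application" }
    convert      => { "response_time" => "float" }
    lowercase    => [ "level" ]
  }
}
```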

Date Filter

Parse timestamps into @timestamp:
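```conf
filter {
  date {
    match  => [ "timestamp", "yyyy-MM-dd HH:mm:ss,SSS" ]
    target => "@timestamp"
  }
}
```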

Multiple date formats:
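List every format your sources emit; the first one that matches wins. ISO8601 and UNIX are special built-in values:

```conf
filter {
  date {
    match => [
      "timestamp",
      "ISO8601",
      "UNIX",
      "yyyy-MM-dd HH:mm:ss,SSS",
      "dd/MMM/yyyy:HH:mm:ss Z"
    ]
  }
}
```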

JSON Filter

Parse JSON logs:
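```conf
filter {
  json {
    source => "message"
  }
}
```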

Input:
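```
{"timestamp":1705318245,"level":"error","msg":"Database connection lost"}
```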

Output:
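```json
{
  "timestamp": 1705318245,
  "level": "error",
  "msg": "Database connection lost"
}
```

The original message string is kept alongside the parsed fields unless you drop it, for example with mutate's remove_field.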

GeoIP Filter

Add geographic data from IP addresses:
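The source field name is an example; point it at whatever field holds the client IP:

```conf
filter {
  geoip {
    source => "client_ip"
    target => "geoip"
  }
}
```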

Result:
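Something like this - the city and coordinates depend on your GeoIP database:

```json
{
  "client_ip": "8.8.8.8",
  "geoip": {
    "country_name": "United States",
    "city_name": "Mountain View",
    "location": { "lat": 37.386, "lon": -122.084 },
    "timezone": "America/Los_Angeles"
  }
}
```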

Drop Filter

Discard events:
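```conf
filter {
  if [level] == "DEBUG" {
    drop { }
  }
}
```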

Ruby Filter

Execute custom Ruby code:
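A small sketch - the field names are illustrative, but event.get and event.set are the real event API:

```conf
filter {
  ruby {
    code => "
      # compute a duration in ms from two numeric fields, if both are present
      if event.get('end_time') && event.get('start_time')
        event.set('duration_ms', (event.get('end_time') - event.get('start_time')) * 1000)
      end
    "
  }
}
```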

Conditional Logic

Logstash supports if/else conditionals.

Syntax
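```conf
filter {
  if [level] == "ERROR" {
    # filters applied only to error events
  } else if [level] == "WARN" {
    # filters for warnings
  } else {
    # everything else
  }
}
```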

Operators
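  • Comparison: ==, !=, <, >, <=, >=

  • Regex: =~, !~

  • Inclusion: in, not in

  • Boolean: and, or, nand, xor, !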

Examples

Route by log level:
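Index names and hosts below are examples:

```conf
output {
  if [level] == "ERROR" or [level] == "FATAL" {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "errors-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "logs-%{+YYYY.MM.dd}"
    }
  }
}
```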

Parse different log formats:
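A sketch using the two formats from the intro - one bracketed plain-text format, one JSON (service names are examples):

```conf
filter {
  if [service] == "payment-service" {
    grok {
      match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{LOGLEVEL:level} - %{GREEDYDATA:log_message}" }
    }
  } else if [service] == "user-service" {
    json {
      source => "message"
    }
  }
}
```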

Check field existence:
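```conf
filter {
  if [trace_id] {
    mutate { add_tag => [ "traced" ] }
  }
  if ![user_id] {
    mutate { add_field => { "user_id" => "anonymous" } }
  }
}
```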

Multiple Pipelines

Problem: One Logstash instance handling different log types.

Solution: Multiple pipeline configurations.

pipelines.yml:
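Pipeline IDs, paths, and worker counts are examples:

```yaml
# /etc/logstash/pipelines.yml
- pipeline.id: apache
  path.config: "/etc/logstash/conf.d/apache-pipeline.conf"
  pipeline.workers: 2

- pipeline.id: microservices
  path.config: "/etc/logstash/conf.d/microservices-pipeline.conf"
  pipeline.workers: 4
```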

Each pipeline runs independently with dedicated workers.

Production Pipeline Example

Here's a complete pipeline I use in production for microservices logs.

microservices-pipeline.conf:
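A representative version - hosts, bucket names, addresses, index names, and the severity threshold are placeholders, and the email and s3 outputs assume those plugins are available:

```conf
input {
  beats {
    port => 5044                              # receive from Filebeat
  }
}

filter {
  json {
    source => "message"                       # parse the JSON log line
  }

  date {
    match  => [ "timestamp", "ISO8601", "UNIX" ]
    target => "@timestamp"
  }

  if [trace_id] {
    mutate { add_tag => [ "traced" ] }        # keep trace IDs easy to filter on
  }

  mutate {
    add_field => {                            # environment metadata (example values)
      "environment" => "production"
      "region"      => "us-east-1"
    }
  }

  ruby {
    code => "
      # map log level to a numeric severity score (illustrative scale)
      scores = { 'DEBUG' => 1, 'INFO' => 2, 'WARN' => 3, 'ERROR' => 4, 'FATAL' => 5 }
      event.set('severity', scores.fetch(event.get('level').to_s.upcase, 0))
    "
  }

  if [client_ip] {
    geoip {
      source => "client_ip"                   # GeoIP enrichment
      target => "geoip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]        # add user/password/ssl if security is enabled
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
  }

  if [severity] >= 4 {
    email {                                   # alert on ERROR/FATAL
      to      => "oncall@example.com"
      subject => "Critical error in %{[service]}"
      body    => "%{message}"
    }
  }

  s3 {                                        # raw backup
    bucket => "my-log-archive"
    region => "us-east-1"
    codec  => "json_lines"
  }
}
```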

This pipeline:

  1. Receives logs from Filebeat

  2. Parses JSON

  3. Extracts timestamps and trace IDs

  4. Adds environment metadata

  5. Calculates severity scores

  6. Enriches with GeoIP data

  7. Sends to Elasticsearch with dynamic index names

  8. Alerts on critical errors

  9. Backs up to S3

Performance Tuning

1. Pipeline Workers

Configure worker threads:
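```yaml
# logstash.yml (or per pipeline in pipelines.yml)
pipeline.workers: 8    # defaults to the number of CPU cores
```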

Rule of thumb: Workers = CPU cores

2. Batch Processing

Larger batches = better throughput, higher latency:
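```yaml
pipeline.batch.size: 1000
pipeline.batch.delay: 50
```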

Smaller batches = lower latency, lower throughput:
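```yaml
pipeline.batch.size: 125    # the default
pipeline.batch.delay: 5
```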

3. JVM Heap

Set heap size (50% of RAM, max 31GB):

jvm.options:
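```
# /etc/logstash/jvm.options - keep min and max equal
-Xms4g
-Xmx4g
```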

4. Persistent Queue

Enable persistent queue for reliability:
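```yaml
# logstash.yml
queue.type: persisted
queue.max_bytes: 4gb
```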

Monitoring Logstash

Monitoring API
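Logstash exposes a monitoring API on port 9600:

```bash
# Node stats: events, pipeline, JVM, and process metrics
curl -s 'http://localhost:9600/_node/stats?pretty'

# Per-pipeline stats only
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
```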

Key Metrics

From stats API:
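The events section is the one I check first (numbers are illustrative):

```json
{
  "events": {
    "in": 1250000,
    "filtered": 1250000,
    "out": 1249800,
    "queue_push_duration_in_millis": 523
  }
}
```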

Watch for:

  • Events in/out ratio (should stay close to 1:1)

  • Queue push duration (should be low)

  • CPU and memory usage

Debugging Pipelines

Use stdout Output
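Temporarily add a stdout output so you can see every event with all its fields:

```conf
output {
  stdout {
    codec => rubydebug
  }
}
```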

Add Debug Logging
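One approach that helps: have filters tag events when they succeed, so you can see which branch actually handled each event (the tag name is arbitrary):

```conf
filter {
  grok {
    match   => { "message" => "%{COMBINEDAPACHELOG}" }
    add_tag => [ "grok_ok" ]    # added only on a successful match; failures get _grokparsefailure
  }
}
```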

Enable Logstash Debugging
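```bash
# Raise the log level temporarily (or set log.level: debug in logstash.yml)
bin/logstash -f pipeline.conf --log.level=debug
```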

Test Configurations
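```bash
# Validate syntax without starting the pipeline
bin/logstash -f pipeline.conf --config.test_and_exit

# Or keep it running and reload automatically when the config changes
bin/logstash -f pipeline.conf --config.reload.automatic
```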

Common Patterns

Pattern 1: Multiline Logs

Java stack traces:
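The multiline codec merges continuation lines into the previous event (path and pattern are examples):

```conf
input {
  file {
    path => "/var/log/app/app.log"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601}"   # a new event starts with a timestamp
      negate  => true
      what    => "previous"                # anything else belongs to the previous line
    }
  }
}
```

If Filebeat ships the logs, do the multiline merging on the Filebeat side instead so events arrive already assembled.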

Pattern 2: Dead Letter Queue

Handle parse failures:
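Two layers here: grok failures are tagged _grokparsefailure and can be routed to their own index, while the dead letter queue (enabled in logstash.yml) captures events that Elasticsearch rejects outright. Index names are examples:

```conf
output {
  if "_grokparsefailure" in [tags] {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "parse-failures-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "logs-%{+YYYY.MM.dd}"
    }
  }
}
```

```yaml
# logstash.yml
dead_letter_queue.enable: true
```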

Pattern 3: Dynamic Routing

Route to different indices:
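Field references in the index name do the routing; the naming scheme here is an example:

```conf
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "logs-%{[service]}-%{[environment]}-%{+YYYY.MM.dd}"
  }
}
```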

Logstash vs. Beats

When to use Logstash:

  • Complex parsing (grok patterns)

  • Data transformation and enrichment

  • Multiple outputs

  • Aggregation and filtering

When to use Beats only:

  • Simple log shipping

  • Low resource overhead

  • No parsing needed (already JSON)

My approach: Use Filebeat to ship logs, Logstash to process them.

Conclusion

Logstash is the data processing powerhouse of the ELK stack. Key takeaways:

Architecture:

  • Input → Filter → Output

  • Multiple inputs, filters, outputs supported

  • Conditional logic and routing

Parsing:

  • Grok patterns for unstructured logs

  • JSON filter for structured logs

  • Custom patterns for application logs

Transformation:

  • Mutate fields (add, remove, convert)

  • Date parsing for timestamps

  • GeoIP enrichment

  • Custom Ruby code

Production:

  • Multiple pipelines for different log types

  • Performance tuning (workers, batch size, heap)

  • Persistent queues for reliability

  • Monitoring and debugging tools

In the next article, we'll explore Kibana - the visualization layer that brings everything together.

Previous: Part 2 - Elasticsearch Deep Dive
Next: Part 4 - Kibana Visualization


This article is part of the ELK Stack 101 series. Check out the series overview for more content.
