Part 1: Introduction to ELK Stack
Part of the ELK Stack 101 Series
My Logging Nightmare
It was 2 AM, and production was down. Customers couldn't check out on our e-commerce platform. I was SSH-ed into five different servers, running variations of:
```bash
ssh user@app-server-1
tail -f /var/log/app/application.log | grep ERROR
ssh user@app-server-2
tail -f /var/log/app/application.log | grep ERROR
# ... repeat for 3 more servers
```
Each microservice was logging to its own file. To understand what happened, I needed to:
Check the API gateway logs
Check the user service logs
Check the payment service logs
Check the inventory service logs
Check the order service logs
And somehow correlate them all by timestamp. Hours later, I found the issue: a payment service timeout that cascaded through the system. The root cause was buried in a log file on app-server-3, 2,000 lines before I started looking.
That night, I decided to implement centralized logging. Enter ELK Stack.
What is ELK Stack?
ELK Stack is a collection of three open-source tools:
E - Elasticsearch: Search and analytics engine
L - Logstash: Data processing pipeline
K - Kibana: Visualization and exploration interface
Together, they provide a complete solution for:
Collecting logs from multiple sources
Processing and transforming log data
Storing logs centrally
Searching through logs efficiently
Visualizing log data and metrics
Alerting on specific patterns
The Problem ELK Solves
Before ELK:
SSH into each server and tail its logs separately
grep through scattered files, service by service
Correlate events across services by timestamp, by hand
Hours to trace a single incident

After ELK:
All logs in one searchable place
One query spans every service and server
Dashboards show errors and trends at a glance
Minutes, often seconds, to trace an incident
ELK Stack Architecture
Here's how the components work together:
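At its simplest, the flow looks like this:

```
Applications --> Logstash --> Elasticsearch <-- Kibana
 (log files)     (collect,    (store and       (search and
                  parse,       index)           visualize)
                  enrich)
```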
The Flow I Implemented
Collection: Applications send logs to Logstash
Processing: Logstash parses, filters, and enriches logs
Storage: Elasticsearch stores and indexes the logs
Visualization: Kibana queries Elasticsearch and displays results
Understanding Each Component
Let me break down what I learned about each component.
Elasticsearch: The Heart of ELK
What it is: A distributed, RESTful search and analytics engine built on Apache Lucene.
What it does:
Stores log data as JSON documents
Indexes data for fast searching
Provides near real-time search capabilities
Scales horizontally across multiple nodes
My first encounter:
I started with a single Elasticsearch node on a VM. I sent it a sample log:
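It was a request along these lines (the index name and fields here are illustrative, not the exact log I sent):

```bash
# Index a single log document into a daily index
curl -X POST "localhost:9200/logs-2025-01-15/_doc" \
  -H 'Content-Type: application/json' \
  -d '{
    "timestamp": "2025-01-15T02:03:11Z",
    "level": "ERROR",
    "service": "payment-service",
    "message": "Payment gateway timeout after 30s"
  }'
```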
And searched for it:
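Something like this, matching the document above:

```bash
# Full-text search across the message field
curl -X GET "localhost:9200/logs-2025-01-15/_search" \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match": { "message": "timeout" } } }'
```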
Response in milliseconds. I was hooked.
Key concepts I learned:
Index: Like a database, holds related documents (e.g., logs-2025-01-15)
Document: A single log entry stored as JSON
Shard: Index is split into shards for distributed storage
Replica: Backup copy of a shard for redundancy
Logstash: The Data Pipeline
What it is: A server-side data processing pipeline that ingests data, transforms it, and sends it to a "stash" (Elasticsearch).
What it does:
Collects logs from multiple sources (files, syslog, APIs)
Parses unstructured logs into structured data
Enriches data (add geolocation, lookup values)
Filters out unnecessary data
Sends processed data to Elasticsearch
My Logstash pipeline:
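A reconstruction of that pipeline, matching the description below. The path and Grok pattern assume single-line logs shaped like `2025-01-15T02:03:11Z ERROR something went wrong`:

```
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Split each line into timestamp, level, and the rest of the message
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:log_message}" }
  }
  # Use the parsed timestamp as the event time
  date {
    match => ["timestamp", "ISO8601"]
  }
  # Copy the host name into a top-level field (the source field path
  # varies by Logstash version and ECS compatibility mode)
  mutate {
    add_field => { "hostname" => "%{[host][name]}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"   # daily indices
  }
}
```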
What this does:
Reads log files from /var/log/app/
Parses each line to extract timestamp, log level, and message
Converts timestamp to proper date format
Adds hostname field
Sends to Elasticsearch with daily indices
Kibana: The Window to Your Data
What it is: A web-based UI for visualizing and exploring Elasticsearch data.
What it does:
Search and filter logs in real-time
Create visualizations (line charts, bar charts, pie charts)
Build interactive dashboards
Set up alerts and monitors
Explore data with Discover interface
My first Kibana dashboard:
I created visualizations for:
Error rate over time: Line chart showing ERROR logs per minute
Errors by service: Pie chart breaking down which microservice had most errors
Response time percentiles: Line chart with P50, P95, P99 response times
Geographic distribution: Map showing user locations
Log volume: Bar chart of logs per hour
All in one dashboard. I could finally see what was happening across all my services at a glance.
Why I Chose ELK Stack
When I evaluated logging solutions, I considered:
Alternatives I Looked At
1. Splunk
Pros: Powerful, enterprise-grade, great support
Cons: Expensive licensing, cost scales with data volume
My decision: Too expensive for my budget
2. Graylog
Pros: Open source, built on Elasticsearch, simpler than ELK
Cons: Smaller community, fewer integrations
My decision: Good alternative, but ELK had more resources
3. Loki (Grafana)
Pros: Designed for Kubernetes, integrates with Grafana
Cons: Newer, indexes only labels (not full log content), so querying is less flexible than Elasticsearch
My decision: Considered for future, went with ELK for maturity
4. Cloud solutions (AWS CloudWatch, Datadog, New Relic)
Pros: Managed, easy setup
Cons: Vendor lock-in, ongoing costs, data retention limits
My decision: Wanted control and no per-GB pricing
Why ELK Won
Open Source: No licensing costs, full control
Mature: Battle-tested in production environments
Scalable: Can start small, scale to petabytes
Flexible: Handles any type of log or data
Community: Massive community, tons of resources
Ecosystem: Beats (Filebeat, Metricbeat) extend functionality
ELK Stack Use Cases
Beyond logging, I've used ELK Stack for:
1. Application Performance Monitoring (APM)
What I track:
API response times
Database query duration
Cache hit rates
Error rates by endpoint
Example visualization: Dashboard showing which API endpoints are slow, helping me prioritize optimization.
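The query behind that kind of view is a terms aggregation with nested percentiles; a sketch, with illustrative field names:

```json
GET logs-*/_search
{
  "size": 0,
  "aggs": {
    "by_endpoint": {
      "terms": { "field": "endpoint" },
      "aggs": {
        "latency": {
          "percentiles": { "field": "response_time_ms", "percents": [50, 95, 99] }
        }
      }
    }
  }
}
```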
2. Security and Audit Logging
What I track:
Failed login attempts
Unauthorized access attempts
Privilege escalation events
Configuration changes
Example alert: Email notification when > 5 failed logins in 1 minute (potential brute force attack).
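Under the hood, that alert is just a count over a one-minute window; a sketch (the event field name is illustrative), firing when the count exceeds 5:

```json
GET logs-*/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "event": "login_failed" } },
        { "range": { "timestamp": { "gte": "now-1m" } } }
      ]
    }
  }
}
```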
3. Business Metrics
What I track:
Orders per hour
Revenue trends
User signups
Feature usage
Example dashboard: Real-time revenue dashboard for stakeholders.
4. Infrastructure Monitoring
What I track:
CPU and memory usage
Disk space
Network traffic
Container health
Example alert: Slack notification when disk usage > 85%.
5. Debugging and Troubleshooting
What I do:
Search for specific error messages
Trace requests across microservices
Investigate production incidents
Analyze user behavior
Example: Customer reports checkout failure. I search Kibana for their user ID, see the entire request flow, find the payment timeout, fix the issue.
ELK Stack vs. "The Elastic Stack"
Note on terminology:
Elastic (the company) now calls it the Elastic Stack, which includes:
Elasticsearch: Search and analytics
Logstash: Data processing
Kibana: Visualization
Beats: Lightweight data shippers (Filebeat, Metricbeat, etc.)
ELK traditionally means just Elasticsearch + Logstash + Kibana.
Elastic Stack = ELK + Beats + more tools
In practice, I use the terms interchangeably, and my stack includes Beats (especially Filebeat for log shipping).
Getting Started: My First ELK Setup
When I first started, I ran everything on a single Docker Compose setup for development.
My Docker Compose Configuration
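A minimal single-node setup, roughly what I ran. Image versions are examples, and security is disabled for local development only:

```yaml
# docker-compose.yml - dev only: single node, no TLS/auth
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false    # never do this in production
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
```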
To start:
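```bash
# Start Elasticsearch and Kibana in the background
docker compose up -d
```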
Access Kibana: http://localhost:5601
That's it. ELK running locally in minutes.
The Modern Alternative: ELK with Beats
Over time, I evolved my architecture to use Filebeat instead of Logstash for log shipping:
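In config terms, that's a minimal filebeat.yml like this (paths are examples):

```yaml
# filebeat.yml - ship app logs straight to Elasticsearch
filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]
```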
Why Filebeat:
Lighter: Lower resource consumption than Logstash
Simpler: Just ship logs, no heavy processing
Resilient: Handles backpressure, retries, etc.
Fast: Written in Go, efficient
When I still use Logstash:
Complex log parsing (Grok patterns)
Data enrichment (lookups, geo IP)
Multiple input sources
Heavy transformation
My current architecture:
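```
App servers --> Filebeat --> Logstash --> Elasticsearch <-- Kibana
                (ship logs)  (parse and    (store and      (explore,
                              enrich where  index)          dashboard,
                              needed)                       alert)
```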
Key Concepts to Understand
Before diving deeper, here are concepts I wish I'd understood earlier:
1. Indexing Strategy
Time-based indices: Instead of one giant logs index, create daily indices:
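```
logs-2025-01-14
logs-2025-01-15
logs-2025-01-16
```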
Benefits:
Easy to delete old data (drop entire index)
Query performance (search specific date range)
Manageable shard sizes
2. Index Lifecycle Management (ILM)
Automatic retention:
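A sketch of a policy in that spirit (rollover thresholds and the 30-day retention are examples):

```json
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```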
Saves storage costs and maintains performance.
3. Document Mapping
Mapping = schema in Elasticsearch terms.
Example:
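For a log index like mine, a mapping might look like this (field names are illustrative):

```json
PUT logs-2025-01-15
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "level":     { "type": "keyword" },
      "service":   { "type": "keyword" },
      "message":   { "type": "text" }
    }
  }
}
```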
Key difference:
keyword: Exact match, aggregations (e.g., log level)
text: Full-text search (e.g., error messages)
4. Search Query Language
Kibana Query Language (KQL) - simple and intuitive:
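For example, using the field names from the mapping above:

```
level: ERROR and service: "payment-service"
message: "timeout" and not level: DEBUG
```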
Elasticsearch Query DSL - more powerful, JSON-based:
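The same kind of search expressed in the DSL:

```json
GET logs-*/_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "message": "timeout" } } ],
      "filter": [
        { "term":  { "level": "ERROR" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```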
I use KQL for quick searches, Query DSL for complex queries and automation.
Common Pitfalls I Encountered
Pitfall 1: No Index Template
Mistake: Let Elasticsearch auto-create indices with default settings.
Problem: Inconsistent mappings, poor performance.
Solution: Create index templates defining mappings and settings upfront.
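A sketch using the composable index template API (the pattern and settings are examples):

```json
PUT _index_template/app-logs
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "level":     { "type": "keyword" }
      }
    }
  }
}
```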
Pitfall 2: Storing Everything Forever
Mistake: Keep all logs indefinitely.
Problem: Storage costs explode, cluster performance degrades.
Solution: Implement ILM, delete logs after retention period (e.g., 30 days).
Pitfall 3: Single Node in Production
Mistake: Run Elasticsearch on a single node.
Problem: No redundancy, data loss if node fails.
Solution: Minimum 3-node cluster with replication.
Pitfall 4: No Monitoring
Mistake: Deploy ELK and forget about it.
Problem: Don't notice when disk fills up, cluster degrades.
Solution: Monitor Elasticsearch health, disk usage, query performance.
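Even a scheduled health check beats nothing; for example:

```bash
# Cluster status (green/yellow/red) plus node and shard counts
curl -s "localhost:9200/_cluster/health?pretty"

# Disk usage per node
curl -s "localhost:9200/_cat/allocation?v"
```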
Pitfall 5: Over-parsing in Logstash
Mistake: Complex Grok patterns for every field.
Problem: Logstash becomes bottleneck, high CPU usage.
Solution: Parse only what you need, use Filebeat when possible.
My ELK Learning Path
Week 1: Basics
Set up Docker Compose ELK
Send sample logs to Elasticsearch
Explore data in Kibana
Create first visualization
Week 2: Logstash
Write Logstash pipeline
Parse application logs
Filter and transform data
Send to Elasticsearch
Week 3: Elasticsearch
Understand indices and documents
Create index templates
Learn KQL and Query DSL
Optimize mappings
Week 4: Kibana
Build dashboards
Create visualizations
Set up alerts
Share dashboards with team
Week 5: Production
Deploy multi-node cluster
Implement ILM
Configure security
Monitor and optimize
Tools and Resources I Used
Official Documentation
Elastic documentation: https://www.elastic.co/guide
Community Resources
Stack Overflow (elasticsearch tag)
Books
"Elasticsearch: The Definitive Guide" (free online)
"Learning Elastic Stack 7.0"
My Tooling
Docker & Docker Compose: Local development
Postman: Testing Elasticsearch APIs
curl: Quick Elasticsearch queries
Grafana: Additional visualization (works with Elasticsearch)
Conclusion
ELK Stack transformed how I debug, monitor, and understand my applications. What used to take hours of SSH-ing and grepping now takes seconds of searching in Kibana.
Key takeaways:
Centralized logging is essential for microservices and distributed systems
ELK Stack provides a complete solution for log management
Elasticsearch stores and searches data efficiently
Logstash processes and transforms logs
Kibana visualizes and explores data
Start simple, scale as you grow
In the next article, we'll dive deep into Elasticsearch - understanding how it works, how to index data efficiently, and how to query it effectively.
What's Next
In Part 2, I'll share:
Installing and configuring Elasticsearch
Index management and mappings
Writing search queries
Aggregations and analytics
Performance optimization techniques
Next: Part 2 - Elasticsearch Deep Dive
This article is part of the ELK Stack 101 series. Check out the series overview for more content.