Part 1: Introduction to ELK Stack

Part of the ELK Stack 101 Series

My Logging Nightmare

It was 2 AM, and production was down. Customers couldn't check out on our e-commerce platform. I was SSH-ed into five different servers, running variations of:

ssh user@app-server-1
tail -f /var/log/app/application.log | grep ERROR

ssh user@app-server-2
tail -f /var/log/app/application.log | grep ERROR
# ... repeat for 3 more servers

Each microservice was logging to its own file. To understand what happened, I needed to:

  1. Check the API gateway logs

  2. Check the user service logs

  3. Check the payment service logs

  4. Check the inventory service logs

  5. Check the order service logs

And somehow correlate them all by timestamp. Hours later, I found the issue: a payment service timeout that cascaded through the system. The root cause was buried in a log file on app-server-3, 2,000 lines above where I had started looking.

That night, I decided to implement centralized logging. Enter ELK Stack.

What is ELK Stack?

ELK Stack is a collection of three open-source tools:

  • E - Elasticsearch: Search and analytics engine

  • L - Logstash: Data processing pipeline

  • K - Kibana: Visualization and exploration interface

Together, they provide a complete solution for:

  • Collecting logs from multiple sources

  • Processing and transforming log data

  • Storing logs centrally

  • Searching through logs efficiently

  • Visualizing log data and metrics

  • Alerting on specific patterns

The Problem ELK Solves

Before ELK: SSH into each server, tail and grep separate log files, and correlate events across services by timestamp, by hand.

After ELK: one search in Kibana across every service's logs, with answers in seconds.

ELK Stack Architecture

Here's how the components work together:

Applications -> Logstash -> Elasticsearch -> Kibana

The Flow I Implemented

  1. Collection: Applications send logs to Logstash

  2. Processing: Logstash parses, filters, and enriches logs

  3. Storage: Elasticsearch stores and indexes the logs

  4. Visualization: Kibana queries Elasticsearch and displays results

Understanding Each Component

Let me break down what I learned about each component.

Elasticsearch: The Heart of ELK

What it is: A distributed, RESTful search and analytics engine built on Apache Lucene.

What it does:

  • Stores log data as JSON documents

  • Indexes data for fast searching

  • Provides near real-time search capabilities

  • Scales horizontally across multiple nodes

My first encounter:

I started with a single Elasticsearch node on a VM. I sent it a sample log:

And searched for it:
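
The search looked something like this (again illustrative, matching on the message field):

# Full-text search for "timeout" in the message field
curl -X GET "localhost:9200/logs-2025-01-15/_search" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "match": { "message": "timeout" }
    }
  }'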

Response in milliseconds. I was hooked.

Key concepts I learned:

  • Index: Like a database, holds related documents (e.g., logs-2025-01-15)

  • Document: A single log entry stored as JSON

  • Shard: Index is split into shards for distributed storage

  • Replica: Backup copy of a shard for redundancy
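
A quick way to see these pieces on a running cluster is the _cat API, which lists each index along with its primary shard count, replica count, and document total:

# Show indices with shard/replica counts and document totals
curl "localhost:9200/_cat/indices?v"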

Logstash: The Data Pipeline

What it is: A server-side data processing pipeline that ingests data, transforms it, and sends it to a "stash" (Elasticsearch).

What it does:

  • Collects logs from multiple sources (files, syslog, APIs)

  • Parses unstructured logs into structured data

  • Enriches data (add geolocation, lookup values)

  • Filters out unnecessary data

  • Sends processed data to Elasticsearch

My Logstash pipeline:
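
The pipeline was along these lines (a sketch rather than my exact file; the grok pattern assumes log lines like "2025-01-15 02:13:45 ERROR Payment timeout"):

input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Extract timestamp, level, and the rest of the line
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:log_message}" }
  }
  # Use the parsed timestamp as the event time
  date {
    match => [ "timestamp", "yyyy-MM-dd HH:mm:ss", "ISO8601" ]
  }
  # Copy the host field the file input adds (exact field name depends on Logstash version / ECS mode)
  mutate {
    add_field => { "hostname" => "%{host}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY-MM-dd}"
  }
}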

What this does:

  1. Reads log files from /var/log/app/

  2. Parses each line to extract timestamp, log level, and message

  3. Converts timestamp to proper date format

  4. Adds hostname field

  5. Sends to Elasticsearch with daily indices

Kibana: The Window to Your Data

What it is: A web-based UI for visualizing and exploring Elasticsearch data.

What it does:

  • Search and filter logs in real-time

  • Create visualizations (line charts, bar charts, pie charts)

  • Build interactive dashboards

  • Set up alerts and monitors

  • Explore data with Discover interface

My first Kibana dashboard:

I created visualizations for:

  • Error rate over time: Line chart showing ERROR logs per minute

  • Errors by service: Pie chart breaking down which microservice had most errors

  • Response time percentiles: Line chart with P50, P95, P99 response times

  • Geographic distribution: Map showing user locations

  • Log volume: Bar chart of logs per hour

All in one dashboard. I could finally see what was happening across all my services at a glance.

Why I Chose ELK Stack

When I evaluated logging solutions, I considered:

Alternatives I Looked At

1. Splunk

  • Pros: Powerful, enterprise-grade, great support

  • Cons: Expensive licensing, cost scales with data volume

  • My decision: Too expensive for my budget

2. Graylog

  • Pros: Open source, built on Elasticsearch, simpler than ELK

  • Cons: Smaller community, fewer integrations

  • My decision: Good alternative, but ELK had more resources

3. Loki (Grafana)

  • Pros: Designed for Kubernetes, integrates with Grafana

  • Cons: Newer, fewer features than Elasticsearch

  • My decision: Considered for future, went with ELK for maturity

4. Cloud solutions (AWS CloudWatch, Datadog, New Relic)

  • Pros: Managed, easy setup

  • Cons: Vendor lock-in, ongoing costs, data retention limits

  • My decision: Wanted control and no per-GB pricing

Why ELK Won

  • Open Source: No licensing costs, full control

  • Mature: Battle-tested in production environments

  • Scalable: Can start small, scale to petabytes

  • Flexible: Handles any type of log or data

  • Community: Massive community, tons of resources

  • Ecosystem: Beats (Filebeat, Metricbeat) extend functionality

ELK Stack Use Cases

Beyond logging, I've used ELK Stack for:

1. Application Performance Monitoring (APM)

What I track:

  • API response times

  • Database query duration

  • Cache hit rates

  • Error rates by endpoint

Example visualization: Dashboard showing which API endpoints are slow, helping me prioritize optimization.

2. Security and Audit Logging

What I track:

  • Failed login attempts

  • Unauthorized access attempts

  • Privilege escalation events

  • Configuration changes

Example alert: Email notification when > 5 failed logins in 1 minute (potential brute force attack).

3. Business Metrics

What I track:

  • Orders per hour

  • Revenue trends

  • User signups

  • Feature usage

Example dashboard: Real-time revenue dashboard for stakeholders.

4. Infrastructure Monitoring

What I track:

  • CPU and memory usage

  • Disk space

  • Network traffic

  • Container health

Example alert: Slack notification when disk usage > 85%.

5. Debugging and Troubleshooting

What I do:

  • Search for specific error messages

  • Trace requests across microservices

  • Investigate production incidents

  • Analyze user behavior

Example: Customer reports checkout failure. I search Kibana for their user ID, see the entire request flow, find the payment timeout, fix the issue.

ELK Stack vs. "The Elastic Stack"

Note on terminology:

Elastic (the company) now calls it the Elastic Stack, which includes:

  • Elasticsearch: Search and analytics

  • Logstash: Data processing

  • Kibana: Visualization

  • Beats: Lightweight data shippers (Filebeat, Metricbeat, etc.)

ELK traditionally means just Elasticsearch + Logstash + Kibana.

Elastic Stack = ELK + Beats + more tools

In practice, I use the terms interchangeably, and my stack includes Beats (especially Filebeat for log shipping).

Getting Started: My First ELK Setup

When I first started, I ran everything on a single Docker Compose setup for development.

My Docker Compose Configuration
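
A minimal single-node setup looks something like this (image versions and memory settings are illustrative; security is disabled here for local development only):

version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch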

To start:
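
With the file above saved as docker-compose.yml, one command brings everything up:

# Start the stack in the background
docker compose up -d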

Access Kibana: http://localhost:5601

That's it. ELK running locally in minutes.

The Modern Alternative: ELK with Beats

Over time, I evolved my architecture to use Filebeat instead of Logstash for log shipping:

Applications -> Filebeat -> Elasticsearch -> Kibana

Why Filebeat:

  • Lighter: Lower resource consumption than Logstash

  • Simpler: Just ship logs, no heavy processing

  • Resilient: Handles backpressure, retries, etc.

  • Fast: Written in Go, efficient
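
For this shipping-only role, a minimal Filebeat configuration looks roughly like this (paths and hosts are illustrative):

# filebeat.yml (illustrative)
filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]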

When I still use Logstash:

  • Complex log parsing (Grok patterns)

  • Data enrichment (lookups, geo IP)

  • Multiple input sources

  • Heavy transformation

My current architecture, roughly: applications write to local log files, Filebeat ships them, Logstash steps in only where parsing or enrichment is needed, and Elasticsearch and Kibana handle storage, search, and dashboards.

Key Concepts to Understand

Before diving deeper, here are concepts I wish I'd understood earlier:

1. Indexing Strategy

Time-based indices: Instead of one giant logs index, create daily indices:
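
For example, the index names end up looking like this:

logs-2025-01-14
logs-2025-01-15
logs-2025-01-16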

Benefits:

  • Easy to delete old data (drop entire index)

  • Query performance (search specific date range)

  • Manageable shard sizes

2. Index Lifecycle Management (ILM)

Automatic retention:
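
A policy of this kind can be sketched as follows (the policy name, rollover thresholds, and 30-day retention are illustrative):

# Create an ILM policy: roll over daily or at 50 GB, delete after 30 days
curl -X PUT "localhost:9200/_ilm/policy/logs-policy" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
          }
        },
        "delete": {
          "min_age": "30d",
          "actions": { "delete": {} }
        }
      }
    }
  }'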

Saves storage costs and maintains performance.

3. Document Mapping

Mapping = schema in Elasticsearch terms.

Example:
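
A minimal mapping for log documents might look like this (field names are illustrative):

{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "level":      { "type": "keyword" },
      "service":    { "type": "keyword" },
      "message":    { "type": "text" }
    }
  }
}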

Key difference:

  • keyword: Exact match, aggregations (e.g., log level)

  • text: Full-text search (e.g., error messages)

4. Search Query Language

Kibana Query Language (KQL) - simple and intuitive:
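
For example, filtering errors from one service in the Kibana search bar (field names are illustrative):

level : "ERROR" and service : "payment-service"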

Elasticsearch Query DSL - more powerful, JSON-based:
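
The same idea in Query DSL, plus a time filter (again illustrative):

{
  "query": {
    "bool": {
      "must": [
        { "term": { "level": "ERROR" } },
        { "term": { "service": "payment-service" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}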

I use KQL for quick searches, Query DSL for complex queries and automation.

Common Pitfalls I Encountered

Pitfall 1: No Index Template

Mistake: Let Elasticsearch auto-create indices with default settings.

Problem: Inconsistent mappings, poor performance.

Solution: Create index templates defining mappings and settings upfront.
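
A sketch of such a template, reusing the mapping idea from earlier (the template name and index pattern are illustrative):

# Composable index template applied to all logs-* indices
curl -X PUT "localhost:9200/_index_template/logs-template" \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["logs-*"],
    "template": {
      "settings": { "number_of_shards": 1, "number_of_replicas": 1 },
      "mappings": {
        "properties": {
          "@timestamp": { "type": "date" },
          "level":      { "type": "keyword" },
          "message":    { "type": "text" }
        }
      }
    }
  }'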

Pitfall 2: Storing Everything Forever

Mistake: Keep all logs indefinitely.

Problem: Storage costs explode, cluster performance degrades.

Solution: Implement ILM, delete logs after retention period (e.g., 30 days).

Pitfall 3: Single Node in Production

Mistake: Run Elasticsearch on a single node.

Problem: No redundancy, data loss if node fails.

Solution: Minimum 3-node cluster with replication.

Pitfall 4: No Monitoring

Mistake: Deploy ELK and forget about it.

Problem: Don't notice when disk fills up, cluster degrades.

Solution: Monitor Elasticsearch health, disk usage, query performance.

Pitfall 5: Over-parsing in Logstash

Mistake: Complex Grok patterns for every field.

Problem: Logstash becomes bottleneck, high CPU usage.

Solution: Parse only what you need, use Filebeat when possible.

My ELK Learning Path

Week 1: Basics

  • Set up Docker Compose ELK

  • Send sample logs to Elasticsearch

  • Explore data in Kibana

  • Create first visualization

Week 2: Logstash

  • Write Logstash pipeline

  • Parse application logs

  • Filter and transform data

  • Send to Elasticsearch

Week 3: Elasticsearch

  • Understand indices and documents

  • Create index templates

  • Learn KQL and Query DSL

  • Optimize mappings

Week 4: Kibana

  • Build dashboards

  • Create visualizations

  • Set up alerts

  • Share dashboards with team

Week 5: Production

  • Deploy multi-node cluster

  • Implement ILM

  • Configure security

  • Monitor and optimize

Tools and Resources I Used

Official Documentation

Community Resources

Books

  • "Elasticsearch: The Definitive Guide" (free online)

  • "Learning Elastic Stack 7.0"

My Tooling

  • Docker & Docker Compose: Local development

  • Postman: Testing Elasticsearch APIs

  • curl: Quick Elasticsearch queries

  • Grafana: Additional visualization (works with Elasticsearch)

Conclusion

ELK Stack transformed how I debug, monitor, and understand my applications. What used to take hours of SSH-ing and grepping now takes seconds of searching in Kibana.

Key takeaways:

  1. Centralized logging is essential for microservices and distributed systems

  2. ELK Stack provides a complete solution for log management

  3. Elasticsearch stores and searches data efficiently

  4. Logstash processes and transforms logs

  5. Kibana visualizes and explores data

  6. Start simple, scale as you grow

In the next article, we'll dive deep into Elasticsearch - understanding how it works, how to index data efficiently, and how to query it effectively.

What's Next

In Part 2, I'll share:

  • Installing and configuring Elasticsearch

  • Index management and mappings

  • Writing search queries

  • Aggregations and analytics

  • Performance optimization techniques

Next: Part 2 - Elasticsearch Deep Dive


This article is part of the ELK Stack 101 series. Check out the series overview for more content.
