Part 1: Introduction to ELK Stack
Part of the ELK Stack 101 Series
My Logging Nightmare
It was 2 AM, and production was down. Customers couldn't check out on our e-commerce platform. I was SSH-ed into five different servers, running variations of:
```bash
ssh user@app-server-1
tail -f /var/log/app/application.log | grep ERROR
ssh user@app-server-2
tail -f /var/log/app/application.log | grep ERROR
# ... repeat for 3 more servers
```
Each microservice was logging to its own file. To understand what happened, I needed to:
Check the API gateway logs
Check the user service logs
Check the payment service logs
Check the inventory service logs
Check the order service logs
And somehow correlate them all by timestamp. Hours later, I found the issue: a payment service timeout that cascaded through the system. The root cause was buried in a log file on app-server-3, 2,000 lines before I started looking.
That night, I decided to implement centralized logging. Enter ELK Stack.
What is ELK Stack?
ELK Stack is a collection of three open-source tools:
E - Elasticsearch: Search and analytics engine
L - Logstash: Data processing pipeline
K - Kibana: Visualization and exploration interface
Together, they provide a complete solution for:
Collecting logs from multiple sources
Processing and transforming log data
Storing logs centrally
Searching through logs efficiently
Visualizing log data and metrics
Alerting on specific patterns
The Problem ELK Solves
Before ELK:
SSH into each server and tail its logs separately
grep through scattered files, service by service
Correlate events across services by timestamp, by hand
Hours to trace a single incident

After ELK:
All logs in one searchable place
One query spans every service and server
Dashboards show errors and trends at a glance
Minutes, often seconds, to trace an incident
ELK Stack Architecture
Here's how the components work together:
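At its simplest, the flow looks like this:

```
Applications --> Logstash --> Elasticsearch <-- Kibana
 (log files)     (collect,    (store and       (search and
                  parse,       index)           visualize)
                  enrich)
```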
The Flow I Implemented
Collection: Applications send logs to Logstash
Processing: Logstash parses, filters, and enriches logs
Storage: Elasticsearch stores and indexes the logs
Visualization: Kibana queries Elasticsearch and displays results
Understanding Each Component
Let me break down what I learned about each component.
Elasticsearch: The Heart of ELK
What it is: A distributed, RESTful search and analytics engine built on Apache Lucene.
What it does:
Stores log data as JSON documents
Indexes data for fast searching
Provides near real-time search capabilities
Scales horizontally across multiple nodes
My first encounter:
I started with a single Elasticsearch node on a VM. I sent it a sample log:
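It was a request along these lines (the index name and fields here are illustrative, not the exact log I sent):

```bash
# Index a single log document into a daily index
curl -X POST "localhost:9200/logs-2025-01-15/_doc" \
  -H 'Content-Type: application/json' \
  -d '{
    "timestamp": "2025-01-15T02:03:11Z",
    "level": "ERROR",
    "service": "payment-service",
    "message": "Payment gateway timeout after 30s"
  }'
```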
And searched for it:
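Something like this, matching the document above:

```bash
# Full-text search across the message field
curl -X GET "localhost:9200/logs-2025-01-15/_search" \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match": { "message": "timeout" } } }'
```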
Response in milliseconds. I was hooked.
Key concepts I learned:
Index: Like a database, holds related documents (e.g., logs-2025-01-15)
Document: A single log entry stored as JSON
Shard: Index is split into shards for distributed storage
Replica: Backup copy of a shard for redundancy
Logstash: The Data Pipeline
What it is: A server-side data processing pipeline that ingests data, transforms it, and sends it to a "stash" (Elasticsearch).
What it does:
Collects logs from multiple sources (files, syslog, APIs)
Parses unstructured logs into structured data
Enriches data (add geolocation, lookup values)
Filters out unnecessary data
Sends processed data to Elasticsearch
My Logstash pipeline:
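A reconstruction of that pipeline, matching the description below. The path and Grok pattern assume single-line logs shaped like `2025-01-15T02:03:11Z ERROR something went wrong`:

```
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Split each line into timestamp, level, and the rest of the message
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:log_message}" }
  }
  # Use the parsed timestamp as the event time
  date {
    match => ["timestamp", "ISO8601"]
  }
  # Copy the host name into a top-level field (the source field path
  # varies by Logstash version and ECS compatibility mode)
  mutate {
    add_field => { "hostname" => "%{[host][name]}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"   # daily indices
  }
}
```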
What this does:
Reads log files from /var/log/app/
Parses each line to extract timestamp, log level, and message
Converts timestamp to proper date format
Adds hostname field
Sends to Elasticsearch with daily indices
Kibana: The Window to Your Data
What it is: A web-based UI for visualizing and exploring Elasticsearch data.
What it does:
Search and filter logs in real-time
Create visualizations (line charts, bar charts, pie charts)
Build interactive dashboards
Set up alerts and monitors
Explore data with Discover interface
My first Kibana dashboard:
I created visualizations for:
Error rate over time: Line chart showing ERROR logs per minute
Errors by service: Pie chart breaking down which microservice had most errors
Response time percentiles: Line chart with P50, P95, P99 response times
Geographic distribution: Map showing user locations
Log volume: Bar chart of logs per hour
All in one dashboard. I could finally see what was happening across all my services at a glance.
Why I Chose ELK Stack
When I evaluated logging solutions, I considered:
Alternatives I Looked At
1. Splunk
Pros: Powerful, enterprise-grade, great support
Cons: Expensive licensing, cost scales with data volume
My decision: Too expensive for my budget
2. Graylog
Pros: Open source, built on Elasticsearch, simpler than ELK
Cons: Smaller community, fewer integrations
My decision: Good alternative, but ELK had more resources
3. Loki (Grafana)
Pros: Designed for Kubernetes, integrates with Grafana
Cons: Newer, indexes only labels (not full log content), so querying is less flexible than Elasticsearch
My decision: Considered for future, went with ELK for maturity
4. Cloud solutions (AWS CloudWatch, Datadog, New Relic)
Pros: Managed, easy setup
Cons: Vendor lock-in, ongoing costs, data retention limits
My decision: Wanted control and no per-GB pricing
Why ELK Won
Open Source: No licensing costs, full control
Mature: Battle-tested in production environments
Scalable: Can start small, scale to petabytes
Flexible: Handles any type of log or data
Community: Massive community, tons of resources
Ecosystem: Beats (Filebeat, Metricbeat) extend functionality
ELK Stack Use Cases
Beyond logging, I've used ELK Stack for:
1. Application Performance Monitoring (APM)
What I track:
API response times
Database query duration
Cache hit rates
Error rates by endpoint
Example visualization: Dashboard showing which API endpoints are slow, helping me prioritize optimization.
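The query behind that kind of view is a terms aggregation with nested percentiles; a sketch, with illustrative field names:

```json
GET logs-*/_search
{
  "size": 0,
  "aggs": {
    "by_endpoint": {
      "terms": { "field": "endpoint" },
      "aggs": {
        "latency": {
          "percentiles": { "field": "response_time_ms", "percents": [50, 95, 99] }
        }
      }
    }
  }
}
```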
2. Security and Audit Logging
What I track:
Failed login attempts
Unauthorized access attempts
Privilege escalation events
Configuration changes
Example alert: Email notification when > 5 failed logins in 1 minute (potential brute force attack).
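Under the hood, that alert is just a count over a one-minute window; a sketch (the event field name is illustrative), firing when the count exceeds 5:

```json
GET logs-*/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "event": "login_failed" } },
        { "range": { "timestamp": { "gte": "now-1m" } } }
      ]
    }
  }
}
```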
3. Business Metrics
What I track:
Orders per hour
Revenue trends
User signups
Feature usage
Example dashboard: Real-time revenue dashboard for stakeholders.
4. Infrastructure Monitoring
What I track:
CPU and memory usage
Disk space
Network traffic
Container health
Example alert: Slack notification when disk usage > 85%.
5. Debugging and Troubleshooting
What I do:
Search for specific error messages
Trace requests across microservices
Investigate production incidents
Analyze user behavior
Example: Customer reports checkout failure. I search Kibana for their user ID, see the entire request flow, find the payment timeout, fix the issue.
ELK Stack vs. "The Elastic Stack"
Note on terminology:
Elastic (the company) now calls it the Elastic Stack, which includes:
Elasticsearch: Search and analytics
Logstash: Data processing
Kibana: Visualization
Beats: Lightweight data shippers (Filebeat, Metricbeat, etc.)
ELK traditionally means just Elasticsearch + Logstash + Kibana.
Elastic Stack = ELK + Beats + more tools
In practice, I use the terms interchangeably, and my stack includes Beats (especially Filebeat for log shipping).
Getting Started: My First ELK Setup
When I first started, I ran everything on a single Docker Compose setup for development.
My Docker Compose Configuration
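A minimal single-node setup, roughly what I ran. Image versions are examples, and security is disabled for local development only:

```yaml
# docker-compose.yml - dev only: single node, no TLS/auth
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false    # never do this in production
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
```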
To start:
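```bash
# Start Elasticsearch and Kibana in the background
docker compose up -d
```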
Access Kibana: http://localhost:5601
That's it. ELK running locally in minutes.
The Modern Alternative: ELK with Beats
Over time, I evolved my architecture to use Filebeat instead of Logstash for log shipping:
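In config terms, that's a minimal filebeat.yml like this (paths are examples):

```yaml
# filebeat.yml - ship app logs straight to Elasticsearch
filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]
```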
Why Filebeat:
Lighter: Lower resource consumption than Logstash
Simpler: Just ship logs, no heavy processing
Resilient: Handles backpressure, retries, etc.
Fast: Written in Go, efficient
When I still use Logstash:
Complex log parsing (Grok patterns)
Data enrichment (lookups, geo IP)
Multiple input sources
Heavy transformation
My current architecture:
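```
App servers --> Filebeat --> Logstash --> Elasticsearch <-- Kibana
                (ship logs)  (parse and    (store and      (explore,
                              enrich where  index)          dashboard,
                              needed)                       alert)
```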
Key Concepts to Understand
Before diving deeper, here are concepts I wish I'd understood earlier:
1. Indexing Strategy
Time-based indices: Instead of one giant logs index, create daily indices:
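```
logs-2025-01-14
logs-2025-01-15
logs-2025-01-16
```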
Benefits:
Easy to delete old data (drop entire index)
Query performance (search specific date range)
Manageable shard sizes
2. Index Lifecycle Management (ILM)
Automatic retention:
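A sketch of a policy in that spirit (rollover thresholds and the 30-day retention are examples):

```json
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```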
Saves storage costs and maintains performance.
3. Document Mapping
Mapping = schema in Elasticsearch terms.
Example:
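For a log index like mine, a mapping might look like this (field names are illustrative):

```json
PUT logs-2025-01-15
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "level":     { "type": "keyword" },
      "service":   { "type": "keyword" },
      "message":   { "type": "text" }
    }
  }
}
```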
Key difference:
keyword: Exact match, aggregations (e.g., log level)
text: Full-text search (e.g., error messages)
4. Search Query Language
Kibana Query Language (KQL) - simple and intuitive:
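For example, using the field names from the mapping above:

```
level: ERROR and service: "payment-service"
message: "timeout" and not level: DEBUG
```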
Elasticsearch Query DSL - more powerful, JSON-based:
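The same kind of search expressed in the DSL:

```json
GET logs-*/_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "message": "timeout" } } ],
      "filter": [
        { "term":  { "level": "ERROR" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```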
I use KQL for quick searches, Query DSL for complex queries and automation.
Common Pitfalls I Encountered
Pitfall 1: No Index Template
Mistake: Let Elasticsearch auto-create indices with default settings.
Problem: Inconsistent mappings, poor performance.
Solution: Create index templates defining mappings and settings upfront.
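A sketch using the composable index template API (the pattern and settings are examples):

```json
PUT _index_template/app-logs
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "level":     { "type": "keyword" }
      }
    }
  }
}
```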
Pitfall 2: Storing Everything Forever
Mistake: Keep all logs indefinitely.
Problem: Storage costs explode, cluster performance degrades.
Solution: Implement ILM, delete logs after retention period (e.g., 30 days).
Pitfall 3: Single Node in Production
Mistake: Run Elasticsearch on a single node.
Problem: No redundancy, data loss if node fails.
Solution: Minimum 3-node cluster with replication.
Pitfall 4: No Monitoring
Mistake: Deploy ELK and forget about it.
Problem: Don't notice when disk fills up, cluster degrades.
Solution: Monitor Elasticsearch health, disk usage, query performance.
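Even a scheduled health check beats nothing; for example:

```bash
# Cluster status (green/yellow/red) plus node and shard counts
curl -s "localhost:9200/_cluster/health?pretty"

# Disk usage per node
curl -s "localhost:9200/_cat/allocation?v"
```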
Pitfall 5: Over-parsing in Logstash
Mistake: Complex Grok patterns for every field.
Problem: Logstash becomes bottleneck, high CPU usage.
Solution: Parse only what you need, use Filebeat when possible.
My ELK Learning Path
Week 1: Basics
Set up Docker Compose ELK
Send sample logs to Elasticsearch
Explore data in Kibana
Create first visualization
Week 2: Logstash
Write Logstash pipeline
Parse application logs
Filter and transform data
Send to Elasticsearch
Week 3: Elasticsearch
Understand indices and documents
Create index templates
Learn KQL and Query DSL
Optimize mappings
Week 4: Kibana
Build dashboards
Create visualizations
Set up alerts
Share dashboards with team
Week 5: Production
Deploy multi-node cluster
Implement ILM
Configure security
Monitor and optimize
Tools and Resources I Used
Official Documentation
Elastic documentation: https://www.elastic.co/guide
Community Resources
Stack Overflow (elasticsearch tag)
Books
"Elasticsearch: The Definitive Guide" (free online)
"Learning Elastic Stack 7.0"
My Tooling
Docker & Docker Compose: Local development
Postman: Testing Elasticsearch APIs
curl: Quick Elasticsearch queries
Grafana: Additional visualization (works with Elasticsearch)
Conclusion
ELK Stack transformed how I debug, monitor, and understand my applications. What used to take hours of SSH-ing and grepping now takes seconds of searching in Kibana.
Key takeaways:
Centralized logging is essential for microservices and distributed systems
ELK Stack provides a complete solution for log management
Elasticsearch stores and searches data efficiently
Logstash processes and transforms logs
Kibana visualizes and explores data
Start simple, scale as you grow
In the next article, we'll dive deep into Elasticsearch - understanding how it works, how to index data efficiently, and how to query it effectively.
What's Next
In Part 2, I'll share:
Installing and configuring Elasticsearch
Index management and mappings
Writing search queries
Aggregations and analytics
Performance optimization techniques
Next: Part 2 - Elasticsearch Deep Dive
This article is part of the ELK Stack 101 series. Check out the series overview for more content.