Part 5: Production Best Practices

Part of the ELK Stack 101 Series

The Day Our ELK Cluster Went Down

It was a Friday afternoon (of course). Our ELK cluster - running for 18 months without incident - suddenly stopped accepting logs. Disk space? 100% full. We had never implemented retention policies.

Result: two hours of scrambling to free up space, manually deleting indices, a stressed on-call engineer, and angry stakeholders.

Lesson: Production ELK requires planning, monitoring, and maintenance. It's not "set and forget."

In this article, I'll share everything I learned running ELK in production - the hard way and the smart way.

Production Architecture

Let me show you how I architect production ELK clusters.

Small Environment (< 50GB/day)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Application   β”‚
β”‚     Servers     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ (Filebeat)
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Logstash     │────▢│  Elasticsearch   β”‚
β”‚   (2 nodes)     β”‚     β”‚    (3 nodes)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚     Kibana      β”‚
                        β”‚    (1 node)     β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Configuration:

  • Logstash: 2 nodes (HA)

  • Elasticsearch: 3 nodes (all master-eligible and data, so the cluster keeps quorum if one node fails)

  • Kibana: 1 node

Total: 6 servers

Medium Environment (50-500GB/day)

Configuration:

  • Logstash: 4 nodes behind load balancer

  • Elasticsearch: 9 nodes (3 dedicated master, 6 data)

  • Kibana: 2 nodes behind load balancer

Total: 15 servers

Large Environment (> 500GB/day)

Configuration:

  • Message queue (Kafka/Redis) for buffering

  • Logstash: 8+ nodes for parallel processing

  • Elasticsearch: 20+ nodes (3 master, 2 coordinating, 15+ data)

  • Kibana: 3+ nodes (HA)

This is what I run for high-scale environments.

Hardware Sizing

Based on real production experience.

Elasticsearch Nodes

Data nodes (most critical):

Small (< 1TB data):

  • CPU: 4-8 cores

  • RAM: 16-32 GB (50% for heap, max 31GB)

  • Disk: 1-2 TB SSD

  • Network: 1 Gbps

Medium (1-5TB data):

  • CPU: 8-16 cores

  • RAM: 64 GB (31GB heap)

  • Disk: 2-4 TB SSD

  • Network: 10 Gbps

Large (> 5TB data):

  • CPU: 16-32 cores

  • RAM: 128 GB (31GB heap)

  • Disk: 4-8 TB NVMe SSD

  • Network: 10+ Gbps

Master nodes (lightweight):

  • CPU: 2-4 cores

  • RAM: 8 GB

  • Disk: 50-100 GB

  • Network: 1 Gbps

Key rules:

  • Heap: 50% of RAM, max 31GB (compressed pointers limit)

  • Disk: SSDs mandatory for production

  • Avoid over-provisioning heap - more != better

Logstash Nodes

Typical:

  • CPU: 4-8 cores (for worker threads)

  • RAM: 16-32 GB (4-8GB heap)

  • Disk: 100 GB (for persistent queue)

  • Network: 1-10 Gbps

Scale horizontally - add more nodes rather than bigger nodes.

Kibana Nodes

Typical:

  • CPU: 2-4 cores

  • RAM: 4-8 GB

  • Disk: 50 GB

  • Network: 1 Gbps

Lightweight - most work happens in Elasticsearch.

Data Retention and ILM

Problem: Elasticsearch fills up with old logs.

Solution: Index Lifecycle Management (ILM)

My Standard ILM Policy

logs-policy:
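
A sketch of such a policy, created from Kibana Dev Tools. The name logs-policy and the exact thresholds are illustrative and mirror the phases described below:

    PUT _ilm/policy/logs-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_age": "1d", "max_size": "50gb" },
              "set_priority": { "priority": 100 }
            }
          },
          "warm": {
            "min_age": "3d",
            "actions": {
              "readonly": {},
              "set_priority": { "priority": 50 }
            }
          },
          "cold": {
            "min_age": "7d",
            "actions": {
              "allocate": { "number_of_replicas": 0 },
              "set_priority": { "priority": 0 }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }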

What this does:

Hot phase (0-3 days):

  • Actively indexing

  • Rollover daily or at 50GB

  • High priority for recovery

  • Full replicas

Warm phase (3-7 days):

  • Read-only

  • Lower priority

  • Keep replicas for queries

Cold phase (7-30 days):

  • Frozen (minimal resources)

  • No replicas (saves space)

  • Slow to query (acceptable for old data)

Delete phase (> 30 days):

  • Permanently delete

Customize retention based on your needs - we keep production logs 90 days, dev logs 7 days.

Applying ILM Policy

Create index template:
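
A minimal composable template that attaches the policy to new log indices. The logs-* pattern and shard counts are assumptions; _index_template requires Elasticsearch 7.8+:

    PUT _index_template/logs-template
    {
      "index_patterns": ["logs-*"],
      "template": {
        "settings": {
          "index.lifecycle.name": "logs-policy",
          "index.lifecycle.rollover_alias": "logs",
          "number_of_shards": 1,
          "number_of_replicas": 1
        }
      }
    }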

Create bootstrap index:
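
Then create the first index with the write alias pointing at it (the alias name matches the rollover_alias above):

    PUT logs-000001
    {
      "aliases": {
        "logs": { "is_write_index": true }
      }
    }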

Now Elasticsearch automatically manages lifecycle.

Security

Production ELK must be secured. Period.

Enable X-Pack Security

elasticsearch.yml:
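
A minimal sketch, assuming PKCS#12 certificates generated with elasticsearch-certutil (see the TLS section below):

    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.verification_mode: certificate
    xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
    xpack.security.transport.ssl.truststore.path: elastic-certificates.p12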

Setup passwords:
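
On 6.x/7.x the bundled tool walks you through all built-in users (8.x generates an elastic password at first startup and uses elasticsearch-reset-password instead):

    bin/elasticsearch-setup-passwords interactive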

Set passwords for:

  • elastic (superuser)

  • kibana_system (Kibana to Elasticsearch)

  • logstash_system (Logstash to Elasticsearch)

  • beats_system (Beats to Elasticsearch)

  • apm_system (APM to Elasticsearch)

  • remote_monitoring_user

TLS/SSL Encryption

Generate certificates:
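
Using the bundled certutil tool (default output file names shown):

    # Create a certificate authority (produces elastic-stack-ca.p12)
    bin/elasticsearch-certutil ca

    # Create node certificates signed by that CA (produces elastic-certificates.p12)
    bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12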

Enable HTTPS for Kibana:

kibana.yml:
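
A sketch, assuming certificates live under /etc/kibana/certs and the kibana_system password is stored in the Kibana keystore rather than in the file:

    server.ssl.enabled: true
    server.ssl.certificate: /etc/kibana/certs/kibana.crt
    server.ssl.key: /etc/kibana/certs/kibana.key

    elasticsearch.hosts: ["https://elasticsearch:9200"]
    elasticsearch.username: "kibana_system"
    elasticsearch.ssl.certificateAuthorities: ["/etc/kibana/certs/ca.crt"]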

Access Kibana via https://kibana:5601

Role-Based Access Control (RBAC)

Create roles for different teams:

DevOps role (full access):
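
For example, via the role API (the role name is illustrative):

    POST _security/role/devops
    {
      "cluster": ["all"],
      "indices": [
        { "names": ["*"], "privileges": ["all"] }
      ]
    }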

Developer role (read-only to app logs):
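
Read-only access scoped to application log indices (the logs-app-* pattern is an assumption):

    POST _security/role/developer_readonly
    {
      "cluster": ["monitor"],
      "indices": [
        { "names": ["logs-app-*"], "privileges": ["read", "view_index_metadata"] }
      ]
    }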

Create users and assign roles:
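
Then create a user and attach the role (username and password are placeholders):

    POST _security/user/jane
    {
      "password": "CHANGE_ME",
      "roles": ["developer_readonly"],
      "full_name": "Jane Developer"
    }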

Network Security

Firewall rules:
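
A sketch with ufw; the subnets are examples and the same rules translate directly to security groups:

    # Elasticsearch REST API (9200) only from the Logstash/Kibana subnet
    ufw allow from 10.0.1.0/24 to any port 9200 proto tcp
    # Elasticsearch transport (9300) only between cluster nodes
    ufw allow from 10.0.2.0/24 to any port 9300 proto tcp
    # Kibana (5601) only from the office/VPN range
    ufw allow from 10.0.0.0/16 to any port 5601 proto tcp
    # Logstash Beats input (5044) only from application servers
    ufw allow from 10.0.1.0/24 to any port 5044 proto tcp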

Use AWS Security Groups / Azure NSGs / firewall rules.

Backup and Disaster Recovery

Snapshot and Restore

Configure snapshot repository:

S3 repository (AWS):
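
Assuming the repository-s3 plugin is installed (bundled in 8.x); bucket and region are yours to change:

    PUT _snapshot/s3_backup
    {
      "type": "s3",
      "settings": {
        "bucket": "my-elk-snapshots",
        "region": "us-east-1"
      }
    }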

Filesystem repository (local/NFS):
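
Requires the mount to be listed under path.repo in elasticsearch.yml on every node; the path below is an example:

    PUT _snapshot/fs_backup
    {
      "type": "fs",
      "settings": {
        "location": "/mnt/es-backups"
      }
    }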

Create snapshot policy:
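
A sketch using snapshot lifecycle management (SLM), pointing at the S3 repository above:

    PUT _slm/policy/daily-snapshots
    {
      "schedule": "0 0 2 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "s3_backup",
      "config": { "indices": ["logs-*"] },
      "retention": {
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }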

This creates a daily backup at 2 AM and keeps each snapshot for 30 days.

Manual snapshot:
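
For example (the snapshot name is arbitrary):

    PUT _snapshot/s3_backup/manual-snapshot-1?wait_for_completion=true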

Restore snapshot:
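
Restoring into renamed indices avoids clobbering live ones (the rename pattern is illustrative):

    POST _snapshot/s3_backup/manual-snapshot-1/_restore
    {
      "indices": "logs-*",
      "rename_pattern": "logs-(.+)",
      "rename_replacement": "restored-logs-$1"
    }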

Testing Restores

Don't assume backups work - test them!

My monthly DR drill:

  1. Spin up new Elasticsearch cluster

  2. Restore latest snapshot

  3. Verify data integrity

  4. Test queries in Kibana

  5. Document any issues

  6. Destroy test cluster

This drill saved us when our primary cluster failed.

Monitoring ELK Itself

You must monitor your monitoring system.

Approach 1: Elasticsearch Monitoring (Built-in)

Enable monitoring:

elasticsearch.yml:
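
In 7.x a single setting turns on self-monitoring collection (newer versions steer you toward Metricbeat or Elastic Agent instead):

    xpack.monitoring.collection.enabled: true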

View in Kibana:

  • Stack Monitoring β†’ Elasticsearch

  • See cluster health, node stats, index stats

Approach 2: External Monitoring

Use Prometheus + Grafana:

Deploy Elasticsearch exporter:
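
A sketch using the community elasticsearch_exporter; the image tag and Elasticsearch URL are assumptions:

    docker run -d --name es-exporter -p 9114:9114 \
      quay.io/prometheuscommunity/elasticsearch-exporter:latest \
      --es.uri=http://elasticsearch:9200 --es.all --es.indices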

Prometheus scrape config:
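
The matching scrape job (the exporter hostname is an example):

    scrape_configs:
      - job_name: 'elasticsearch'
        scrape_interval: 30s
        static_configs:
          - targets: ['es-exporter:9114']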

Key metrics to alert on:

  • Cluster status != green

  • Disk usage > 85%

  • JVM heap usage > 85%

  • Node unavailable

  • Slow queries

Approach 3: Watcher (Self-Monitoring)

Create watch to alert on cluster issues:
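
A minimal sketch that checks cluster health every minute and emails when it is not green. It assumes an email account is configured in elasticsearch.yml; with security enabled the HTTP input also needs credentials, and the address is a placeholder:

    PUT _watcher/watch/cluster_health_watch
    {
      "trigger": { "schedule": { "interval": "1m" } },
      "input": {
        "http": {
          "request": { "host": "localhost", "port": 9200, "path": "/_cluster/health" }
        }
      },
      "condition": {
        "compare": { "ctx.payload.status": { "not_eq": "green" } }
      },
      "actions": {
        "notify_ops": {
          "email": {
            "to": "oncall@example.com",
            "subject": "Elasticsearch cluster status is {{ctx.payload.status}}"
          }
        }
      }
    }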

Performance Optimization

Elasticsearch Tuning

Heap size:

jvm.options:
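
For example, on a 32 GB data node (on 7.x+ put this in a file under jvm.options.d/ rather than editing jvm.options directly):

    -Xms16g
    -Xmx16g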

Rule: Set to 50% of RAM, max 31GB.

Disable swapping:

elasticsearch.yml:
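
One line in elasticsearch.yml, plus an unlimited memlock ulimit for the service user (or swapoff -a at the OS level):

    bootstrap.memory_lock: true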

Thread pools:
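
The defaults are usually fine; if you do tune, queue sizes are the usual knobs. The values below are examples, set in elasticsearch.yml:

    thread_pool.write.queue_size: 1000
    thread_pool.search.queue_size: 2000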

Refresh interval:

Default is 1s. Increase for higher indexing throughput.
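
A sketch that relaxes it to 30 seconds on log indices; the index pattern is an example, and the setting can also live in the index template:

    PUT logs-*/_settings
    {
      "index": { "refresh_interval": "30s" }
    }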

Translog settings:
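
Switching the translog to async fsyncs trades durability for indexing throughput (index pattern is an example):

    PUT logs-*/_settings
    {
      "index": {
        "translog.durability": "async",
        "translog.sync_interval": "30s"
      }
    }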

Warning: Risk of data loss if node crashes.

Logstash Tuning

logstash.yml:
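
Typical starting points on an 8-core node; tune workers and batch size against your own throughput:

    pipeline.workers: 8
    pipeline.batch.size: 250
    pipeline.batch.delay: 50

    queue.type: persisted
    queue.max_bytes: 4gb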

Use persistent queues to prevent data loss during restarts.

Kibana Tuning

kibana.yml:
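
There is little to tune in Kibana itself; longer timeouts for heavy dashboards are the usual change (values in milliseconds, examples only):

    elasticsearch.requestTimeout: 60000
    elasticsearch.shardTimeout: 30000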

High Availability

Elasticsearch HA

Minimum 3 master-eligible nodes:

elasticsearch.yml:
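
On 7.x+ the voting configuration is bootstrapped from the master-eligible nodes (hostnames are examples):

    cluster.name: production-elk
    discovery.seed_hosts: ["es-master-1", "es-master-2", "es-master-3"]
    cluster.initial_master_nodes: ["es-master-1", "es-master-2", "es-master-3"]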

Set replicas:
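
At least one replica per shard so any single data node can fail (index pattern is an example):

    PUT logs-*/_settings
    {
      "index": { "number_of_replicas": 1 }
    }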

Shard allocation:
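
If you run more than one Elasticsearch node per physical host, this elasticsearch.yml setting keeps shard copies apart:

    cluster.routing.allocation.same_shard.host: true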

Prevents primary and replica on same host.

Logstash HA

Multiple Logstash nodes + load balancer:

Filebeat config (multiple Logstash hosts):
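
filebeat.yml with client-side load balancing across the Logstash nodes (hostnames are examples):

    output.logstash:
      hosts: ["logstash-1:5044", "logstash-2:5044", "logstash-3:5044"]
      loadbalance: true
      worker: 2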

Or use external load balancer (HAProxy, ELB, etc.).

Kibana HA

Multiple Kibana nodes behind load balancer:

HAProxy config:
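
A sketch; the backend IPs are examples and /api/status is Kibana's health endpoint:

    frontend kibana_front
        bind *:5601
        default_backend kibana_back

    backend kibana_back
        balance roundrobin
        option httpchk GET /api/status
        server kibana1 10.0.3.11:5601 check
        server kibana2 10.0.3.12:5601 check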

AWS ELB / ALB works great too.

Scaling Strategies

Vertical Scaling

Add resources to existing nodes:

  • Increase CPU

  • Add RAM (increase heap)

  • Add disk

Limitations: heap is capped at 31GB, and a single machine only scales so far.

Horizontal Scaling

Add more nodes to cluster:

Add data node:
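
elasticsearch.yml on the new node only needs the cluster name, a role, and seed hosts (names are examples; node.roles requires 7.9+):

    cluster.name: production-elk
    node.name: es-data-7
    node.roles: [ data ]
    discovery.seed_hosts: ["es-master-1", "es-master-2", "es-master-3"]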

Start the node - it joins the cluster automatically.

Elasticsearch rebalances shards across all nodes.

This is how I scale production clusters.

Dedicated Node Roles

Large clusters benefit from role separation:

Master nodes (cluster management only):

Data nodes (indexing and search):

Coordinating nodes (query routing):
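
Each flavor is just a different node.roles value in elasticsearch.yml (7.9+ syntax):

    # Dedicated master: cluster state and elections only
    node.roles: [ master ]

    # Data node: holds shards, does the indexing and search work
    node.roles: [ data ]

    # Coordinating-only node: empty roles list, routes and merges queries
    node.roles: [ ]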

Typical large cluster:

  • 3 dedicated master nodes

  • 2 coordinating nodes

  • 15+ data nodes

Common Production Issues

Issue 1: Out of Disk Space

Symptoms:

  • Cluster status red/yellow

  • Can't index new data

  • Error: "flood stage disk watermark exceeded"

Solution:

Immediate:
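
A sketch of the usual emergency sequence; index names and the temporary watermark are examples:

    # Free space fast: delete the oldest indices
    DELETE logs-2024.01.01

    # Optionally raise the flood-stage watermark while cleaning up
    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
      }
    }

    # Remove the read-only block Elasticsearch added to protect itself
    PUT logs-*/_settings
    {
      "index.blocks.read_only_allow_delete": null
    }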

Long-term:

  • Implement ILM retention

  • Add more disk/nodes

  • Archive old data to S3

Issue 2: Unassigned Shards

Symptoms:

  • Yellow/red cluster status

  • Shards stuck in UNASSIGNED state

Diagnose:
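
These three calls usually reveal the cause:

    GET _cluster/health
    GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
    GET _cluster/allocation/explain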

Common causes:

  • Not enough nodes for replicas

  • Disk space issues

  • Shard allocation rules preventing assignment

Solution:
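
Once the cause is fixed, retry allocation; if there simply aren't enough nodes for the replicas, drop the replica count (index pattern is an example):

    POST _cluster/reroute?retry_failed=true

    PUT logs-*/_settings
    {
      "index": { "number_of_replicas": 0 }
    }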

Issue 3: High JVM Heap Usage

Symptoms:

  • Slow searches

  • Frequent garbage collection

  • OutOfMemoryError

Diagnose:
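
Check heap usage per node:

    GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu
    GET _nodes/stats/jvm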

Solutions:

  • Increase heap (up to 31GB max)

  • Add more nodes

  • Reduce shard count

  • Optimize queries

  • Close unused indices

Issue 4: Slow Queries

Diagnose:
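
Turn on the search slow log and look at in-flight search tasks (thresholds and index pattern are examples):

    PUT logs-*/_settings
    {
      "index.search.slowlog.threshold.query.warn": "10s",
      "index.search.slowlog.threshold.query.info": "2s"
    }

    GET _tasks?actions=*search&detailed=true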

Solutions:

  • Limit time range

  • Use filters instead of queries

  • Reduce aggregation complexity

  • Add more nodes

  • Optimize mappings

Cost Optimization

AWS Cost Optimization

Use instance reservations (30-70% savings):

  • Reserved instances for stable workload

  • Spot for dev/test

Tiered storage:

  • Hot tier: i3 instances (NVMe SSD)

  • Warm tier: r5 instances (memory optimized)

  • Cold tier: d2 instances (HDD)

S3 for old data:

  • Archive old indices to S3

  • Searchable snapshots (7.10+, requires an Enterprise subscription)

General Cost Optimization

Reduce retention:

  • Store only what you need

  • 30 days in ELK, rest in S3

Optimize shard size:

  • Target 20-50GB per shard

  • Fewer, larger shards = less overhead

Use ILM:

  • Move old data to cheaper tier

  • Reduce replicas for old data

Right-size nodes:

  • Don't over-provision

  • Monitor utilization, adjust

My Production Checklist

Before going live, verify:

Infrastructure:

  • At least 3 master-eligible Elasticsearch nodes, SSD-backed data nodes

  • Load balancers in front of Logstash and Kibana

Configuration:

  • Heap at 50% of RAM (max 31GB), swapping disabled

  • ILM policies attached to all log indices

  • X-Pack security, TLS, and RBAC enabled

  • Persistent queues enabled in Logstash

Monitoring:

  • Stack Monitoring or a Prometheus exporter collecting cluster metrics

  • Alerts on cluster status, disk > 85%, heap > 85%, and node availability

Operational:

  • Snapshot repository and SLM policy in place

  • Restore tested (monthly DR drill on the calendar)

  • Capacity plan for expected log growth

Documentation:

  • Architecture diagram, retention conventions, and runbooks for the common issues above

Conclusion

Running ELK in production requires careful planning and ongoing maintenance. Key takeaways:

Architecture:

  • Plan for scale (horizontal scaling)

  • Separate node roles (master, data, coordinating)

  • Use load balancers for HA

Data Management:

  • Implement ILM for retention

  • Regular snapshots to S3/backup

  • Test restores monthly

Security:

  • Enable X-Pack security

  • Use TLS/SSL encryption

  • Implement RBAC

  • Network isolation

Monitoring:

  • Monitor the monitoring system

  • Alert on cluster health, disk, heap

  • Use external monitoring (Prometheus)

Performance:

  • Optimize heap (50% RAM, max 31GB)

  • Use persistent queues in Logstash

  • Tune refresh intervals

Operations:

  • Document everything

  • Test DR procedures

  • Plan capacity growth

  • Optimize costs

ELK is powerful but requires discipline. Follow these practices and you'll have a reliable, scalable logging platform.


This concludes the ELK Stack 101 series. You now have the knowledge to build, deploy, and maintain production-grade ELK clusters.

Previous: Part 4 - Kibana Visualization
Back to: Series Overview


This article is part of the ELK Stack 101 series. Check out the series overview for more content.
