Part 5: Production Best Practices

Part of the ELK Stack 101 Series

The Day Our ELK Cluster Went Down

It was a Friday afternoon (of course). Our ELK cluster - running for 18 months without incident - suddenly stopped accepting logs. Disk space? 100% full. We had never implemented retention policies.

Result: two hours of scrambling to free up space, manually deleting indices, a stressed on-call engineer, and angry stakeholders.

Lesson: Production ELK requires planning, monitoring, and maintenance. It's not "set and forget."

In this article, I'll share everything I learned running ELK in production - the hard way and the smart way.

Production Architecture

Let me show you how I architect production ELK clusters.

Small Environment (< 50GB/day)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Application   β”‚
β”‚     Servers     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ (Filebeat)
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Logstash     │────▢│  Elasticsearch   β”‚
β”‚   (2 nodes)     β”‚     β”‚    (3 nodes)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚     Kibana      β”‚
                        β”‚    (1 node)     β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Configuration:

  • Logstash: 2 nodes (HA)

  • Elasticsearch: 3 nodes (all master-eligible and data, so the cluster keeps quorum if one node fails)

  • Kibana: 1 node

Total: 6 servers

Medium Environment (50-500GB/day)

Configuration:

  • Logstash: 4 nodes behind load balancer

  • Elasticsearch: 9 nodes (3 dedicated master, 6 data)

  • Kibana: 2 nodes behind load balancer

Total: 15 servers

Large Environment (> 500GB/day)

Configuration:

  • Message queue (Kafka/Redis) for buffering

  • Logstash: 8+ nodes for parallel processing

  • Elasticsearch: 20+ nodes (3 master, 2 coordinating, 15+ data)

  • Kibana: 3+ nodes (HA)

This is what I run for high-scale environments.

Hardware Sizing

Based on real production experience.

Elasticsearch Nodes

Data nodes (most critical):

Small (< 1TB data):

  • CPU: 4-8 cores

  • RAM: 16-32 GB (50% for heap, max 31GB)

  • Disk: 1-2 TB SSD

  • Network: 1 Gbps

Medium (1-5TB data):

  • CPU: 8-16 cores

  • RAM: 64 GB (31GB heap)

  • Disk: 2-4 TB SSD

  • Network: 10 Gbps

Large (> 5TB data):

  • CPU: 16-32 cores

  • RAM: 128 GB (31GB heap)

  • Disk: 4-8 TB NVMe SSD

  • Network: 10+ Gbps

Master nodes (lightweight):

  • CPU: 2-4 cores

  • RAM: 8 GB

  • Disk: 50-100 GB

  • Network: 1 Gbps

Key rules:

  • Heap: 50% of RAM, max 31GB (compressed pointers limit)

  • Disk: SSDs mandatory for production

  • Avoid over-provisioning heap - more != better

Logstash Nodes

Typical:

  • CPU: 4-8 cores (for worker threads)

  • RAM: 16-32 GB (4-8GB heap)

  • Disk: 100 GB (for persistent queue)

  • Network: 1-10 Gbps

Scale horizontally - add more nodes rather than bigger nodes.

Kibana Nodes

Typical:

  • CPU: 2-4 cores

  • RAM: 4-8 GB

  • Disk: 50 GB

  • Network: 1 Gbps

Lightweight - most work happens in Elasticsearch.

Data Retention and ILM

Problem: Elasticsearch fills up with old logs.

Solution: Index Lifecycle Management (ILM)

My Standard ILM Policy

logs-policy:
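
A sketch of such a policy, created from Kibana Dev Tools. The name logs-policy and the exact thresholds are illustrative and mirror the phases described below:

    PUT _ilm/policy/logs-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_age": "1d", "max_size": "50gb" },
              "set_priority": { "priority": 100 }
            }
          },
          "warm": {
            "min_age": "3d",
            "actions": {
              "readonly": {},
              "set_priority": { "priority": 50 }
            }
          },
          "cold": {
            "min_age": "7d",
            "actions": {
              "allocate": { "number_of_replicas": 0 },
              "set_priority": { "priority": 0 }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }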

What this does:

Hot phase (0-3 days):

  • Actively indexing

  • Rollover daily or at 50GB

  • High priority for recovery

  • Full replicas

Warm phase (3-7 days):

  • Read-only

  • Lower priority

  • Keep replicas for queries

Cold phase (7-30 days):

  • Frozen (minimal resources)

  • No replicas (saves space)

  • Slow to query (acceptable for old data)

Delete phase (> 30 days):

  • Permanently delete

Customize retention based on your needs - we keep production logs 90 days, dev logs 7 days.

Applying ILM Policy

Create index template:
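
A minimal composable template that attaches the policy to new log indices. The logs-* pattern and shard counts are assumptions; _index_template requires Elasticsearch 7.8+:

    PUT _index_template/logs-template
    {
      "index_patterns": ["logs-*"],
      "template": {
        "settings": {
          "index.lifecycle.name": "logs-policy",
          "index.lifecycle.rollover_alias": "logs",
          "number_of_shards": 1,
          "number_of_replicas": 1
        }
      }
    }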

Create bootstrap index:
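
Then create the first index with the write alias pointing at it (the alias name matches the rollover_alias above):

    PUT logs-000001
    {
      "aliases": {
        "logs": { "is_write_index": true }
      }
    }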

Now Elasticsearch automatically manages lifecycle.

Security

Production ELK must be secured. Period.

Enable X-Pack Security

elasticsearch.yml:
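
A minimal sketch, assuming PKCS#12 certificates generated with elasticsearch-certutil (see the TLS section below):

    xpack.security.enabled: true
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.verification_mode: certificate
    xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
    xpack.security.transport.ssl.truststore.path: elastic-certificates.p12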

Setup passwords:
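
On 6.x/7.x the bundled tool walks you through all built-in users (8.x generates an elastic password at first startup and uses elasticsearch-reset-password instead):

    bin/elasticsearch-setup-passwords interactive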

Set passwords for:

  • elastic (superuser)

  • kibana_system (Kibana to Elasticsearch)

  • logstash_system (Logstash to Elasticsearch)

  • beats_system (Beats to Elasticsearch)

  • apm_system (APM to Elasticsearch)

  • remote_monitoring_user

TLS/SSL Encryption

Generate certificates:
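
Using the bundled certutil tool (default output file names shown):

    # Create a certificate authority (produces elastic-stack-ca.p12)
    bin/elasticsearch-certutil ca

    # Create node certificates signed by that CA (produces elastic-certificates.p12)
    bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12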

Enable HTTPS for Kibana:

kibana.yml:
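
A sketch, assuming certificates live under /etc/kibana/certs and the kibana_system password is stored in the Kibana keystore rather than in the file:

    server.ssl.enabled: true
    server.ssl.certificate: /etc/kibana/certs/kibana.crt
    server.ssl.key: /etc/kibana/certs/kibana.key

    elasticsearch.hosts: ["https://elasticsearch:9200"]
    elasticsearch.username: "kibana_system"
    elasticsearch.ssl.certificateAuthorities: ["/etc/kibana/certs/ca.crt"]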

Access Kibana via https://kibana:5601

Role-Based Access Control (RBAC)

Create roles for different teams:

DevOps role (full access):
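
For example, via the role API (the role name is illustrative):

    POST _security/role/devops
    {
      "cluster": ["all"],
      "indices": [
        { "names": ["*"], "privileges": ["all"] }
      ]
    }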

Developer role (read-only to app logs):
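
Read-only access scoped to application log indices (the logs-app-* pattern is an assumption):

    POST _security/role/developer_readonly
    {
      "cluster": ["monitor"],
      "indices": [
        { "names": ["logs-app-*"], "privileges": ["read", "view_index_metadata"] }
      ]
    }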

Create users and assign roles:
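
Then create a user and attach the role (username and password are placeholders):

    POST _security/user/jane
    {
      "password": "CHANGE_ME",
      "roles": ["developer_readonly"],
      "full_name": "Jane Developer"
    }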

Network Security

Firewall rules:
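
A sketch with ufw; the subnets are examples and the same rules translate directly to security groups:

    # Elasticsearch REST API (9200) only from the Logstash/Kibana subnet
    ufw allow from 10.0.1.0/24 to any port 9200 proto tcp
    # Elasticsearch transport (9300) only between cluster nodes
    ufw allow from 10.0.2.0/24 to any port 9300 proto tcp
    # Kibana (5601) only from the office/VPN range
    ufw allow from 10.0.0.0/16 to any port 5601 proto tcp
    # Logstash Beats input (5044) only from application servers
    ufw allow from 10.0.1.0/24 to any port 5044 proto tcp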

Use AWS Security Groups / Azure NSGs / firewall rules.

Backup and Disaster Recovery

Snapshot and Restore

Configure snapshot repository:

S3 repository (AWS):
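
Assuming the repository-s3 plugin is installed (bundled in 8.x); bucket and region are yours to change:

    PUT _snapshot/s3_backup
    {
      "type": "s3",
      "settings": {
        "bucket": "my-elk-snapshots",
        "region": "us-east-1"
      }
    }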

Filesystem repository (local/NFS):
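
Requires the mount to be listed under path.repo in elasticsearch.yml on every node; the path below is an example:

    PUT _snapshot/fs_backup
    {
      "type": "fs",
      "settings": {
        "location": "/mnt/es-backups"
      }
    }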

Create snapshot policy:
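
A sketch using snapshot lifecycle management (SLM), pointing at the S3 repository above:

    PUT _slm/policy/daily-snapshots
    {
      "schedule": "0 0 2 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "s3_backup",
      "config": { "indices": ["logs-*"] },
      "retention": {
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }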

This creates a daily backup at 2 AM and keeps each snapshot for 30 days.

Manual snapshot:
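
For example (the snapshot name is arbitrary):

    PUT _snapshot/s3_backup/manual-snapshot-1?wait_for_completion=true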

Restore snapshot:
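
Restoring into renamed indices avoids clobbering live ones (the rename pattern is illustrative):

    POST _snapshot/s3_backup/manual-snapshot-1/_restore
    {
      "indices": "logs-*",
      "rename_pattern": "logs-(.+)",
      "rename_replacement": "restored-logs-$1"
    }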

Testing Restores

Don't assume backups work - test them!

My monthly DR drill:

  1. Spin up new Elasticsearch cluster

  2. Restore latest snapshot

  3. Verify data integrity

  4. Test queries in Kibana

  5. Document any issues

  6. Destroy test cluster

This drill saved us when our primary cluster failed.

Monitoring ELK Itself

You must monitor your monitoring system.

Approach 1: Elasticsearch Monitoring (Built-in)

Enable monitoring:

elasticsearch.yml:
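
In 7.x a single setting turns on self-monitoring collection (newer versions steer you toward Metricbeat or Elastic Agent instead):

    xpack.monitoring.collection.enabled: true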

View in Kibana:

  • Stack Monitoring β†’ Elasticsearch

  • See cluster health, node stats, index stats

Approach 2: External Monitoring

Use Prometheus + Grafana:

Deploy Elasticsearch exporter:
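
A sketch using the community elasticsearch_exporter; the image tag and Elasticsearch URL are assumptions:

    docker run -d --name es-exporter -p 9114:9114 \
      quay.io/prometheuscommunity/elasticsearch-exporter:latest \
      --es.uri=http://elasticsearch:9200 --es.all --es.indices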

Prometheus scrape config:
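
The matching scrape job (the exporter hostname is an example):

    scrape_configs:
      - job_name: 'elasticsearch'
        scrape_interval: 30s
        static_configs:
          - targets: ['es-exporter:9114']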

Key metrics to alert on:

  • Cluster status != green

  • Disk usage > 85%

  • JVM heap usage > 85%

  • Node unavailable

  • Slow queries

Approach 3: Watcher (Self-Monitoring)

Create watch to alert on cluster issues:
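
A minimal sketch that checks cluster health every minute and emails when it is not green. It assumes an email account is configured in elasticsearch.yml; with security enabled the HTTP input also needs credentials, and the address is a placeholder:

    PUT _watcher/watch/cluster_health_watch
    {
      "trigger": { "schedule": { "interval": "1m" } },
      "input": {
        "http": {
          "request": { "host": "localhost", "port": 9200, "path": "/_cluster/health" }
        }
      },
      "condition": {
        "compare": { "ctx.payload.status": { "not_eq": "green" } }
      },
      "actions": {
        "notify_ops": {
          "email": {
            "to": "oncall@example.com",
            "subject": "Elasticsearch cluster status is {{ctx.payload.status}}"
          }
        }
      }
    }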

Performance Optimization

Elasticsearch Tuning

Heap size:

jvm.options:
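
For example, on a 32 GB data node (on 7.x+ put this in a file under jvm.options.d/ rather than editing jvm.options directly):

    -Xms16g
    -Xmx16g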

Rule: Set to 50% of RAM, max 31GB.

Disable swapping:

elasticsearch.yml:
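
One line in elasticsearch.yml, plus an unlimited memlock ulimit for the service user (or swapoff -a at the OS level):

    bootstrap.memory_lock: true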

Thread pools:
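
The defaults are usually fine; if you do tune, queue sizes are the usual knobs. The values below are examples, set in elasticsearch.yml:

    thread_pool.write.queue_size: 1000
    thread_pool.search.queue_size: 2000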

Refresh interval:

Default is 1s. Increase for higher indexing throughput.
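
A sketch that relaxes it to 30 seconds on log indices; the index pattern is an example, and the setting can also live in the index template:

    PUT logs-*/_settings
    {
      "index": { "refresh_interval": "30s" }
    }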

Translog settings:
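
Switching the translog to async fsyncs trades durability for indexing throughput (index pattern is an example):

    PUT logs-*/_settings
    {
      "index": {
        "translog.durability": "async",
        "translog.sync_interval": "30s"
      }
    }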

Warning: Risk of data loss if node crashes.

Logstash Tuning

logstash.yml:
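
Typical starting points on an 8-core node; tune workers and batch size against your own throughput:

    pipeline.workers: 8
    pipeline.batch.size: 250
    pipeline.batch.delay: 50

    queue.type: persisted
    queue.max_bytes: 4gb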

Use persistent queues to prevent data loss during restarts.

Kibana Tuning

kibana.yml:
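
There is little to tune in Kibana itself; longer timeouts for heavy dashboards are the usual change (values in milliseconds, examples only):

    elasticsearch.requestTimeout: 60000
    elasticsearch.shardTimeout: 30000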

High Availability

Elasticsearch HA

Minimum 3 master-eligible nodes:

elasticsearch.yml:
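
On 7.x+ the voting configuration is bootstrapped from the master-eligible nodes (hostnames are examples):

    cluster.name: production-elk
    discovery.seed_hosts: ["es-master-1", "es-master-2", "es-master-3"]
    cluster.initial_master_nodes: ["es-master-1", "es-master-2", "es-master-3"]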

Set replicas:
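
At least one replica per shard so any single data node can fail (index pattern is an example):

    PUT logs-*/_settings
    {
      "index": { "number_of_replicas": 1 }
    }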

Shard allocation:
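
If you run more than one Elasticsearch node per physical host, this elasticsearch.yml setting keeps shard copies apart:

    cluster.routing.allocation.same_shard.host: true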

Prevents primary and replica on same host.

Logstash HA

Multiple Logstash nodes + load balancer:

Filebeat config (multiple Logstash hosts):
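
filebeat.yml with client-side load balancing across the Logstash nodes (hostnames are examples):

    output.logstash:
      hosts: ["logstash-1:5044", "logstash-2:5044", "logstash-3:5044"]
      loadbalance: true
      worker: 2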

Or use external load balancer (HAProxy, ELB, etc.).

Kibana HA

Multiple Kibana nodes behind load balancer:

HAProxy config:
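
A sketch; the backend IPs are examples and /api/status is Kibana's health endpoint:

    frontend kibana_front
        bind *:5601
        default_backend kibana_back

    backend kibana_back
        balance roundrobin
        option httpchk GET /api/status
        server kibana1 10.0.3.11:5601 check
        server kibana2 10.0.3.12:5601 check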

AWS ELB / ALB works great too.

Scaling Strategies

Vertical Scaling

Add resources to existing nodes:

  • Increase CPU

  • Add RAM (increase heap)

  • Add disk

Limitations: heap is capped at 31GB, and a single machine only scales so far.

Horizontal Scaling

Add more nodes to cluster:

Add data node:
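
elasticsearch.yml on the new node only needs the cluster name, a role, and seed hosts (names are examples; node.roles requires 7.9+):

    cluster.name: production-elk
    node.name: es-data-7
    node.roles: [ data ]
    discovery.seed_hosts: ["es-master-1", "es-master-2", "es-master-3"]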

Start the node - it joins the cluster automatically.

Elasticsearch rebalances shards across all nodes.

This is how I scale production clusters.

Dedicated Node Roles

Large clusters benefit from role separation:

Master nodes (cluster management only):

Data nodes (indexing and search):

Coordinating nodes (query routing):
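
Each flavor is just a different node.roles value in elasticsearch.yml (7.9+ syntax):

    # Dedicated master: cluster state and elections only
    node.roles: [ master ]

    # Data node: holds shards, does the indexing and search work
    node.roles: [ data ]

    # Coordinating-only node: empty roles list, routes and merges queries
    node.roles: [ ]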

Typical large cluster:

  • 3 dedicated master nodes

  • 2 coordinating nodes

  • 15+ data nodes

Common Production Issues

Issue 1: Out of Disk Space

Symptoms:

  • Cluster status red/yellow

  • Can't index new data

  • Error: "flood stage disk watermark exceeded"

Solution:

Immediate:
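
A sketch of the usual emergency sequence; index names and the temporary watermark are examples:

    # Free space fast: delete the oldest indices
    DELETE logs-2024.01.01

    # Optionally raise the flood-stage watermark while cleaning up
    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
      }
    }

    # Remove the read-only block Elasticsearch added to protect itself
    PUT logs-*/_settings
    {
      "index.blocks.read_only_allow_delete": null
    }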

Long-term:

  • Implement ILM retention

  • Add more disk/nodes

  • Archive old data to S3

Issue 2: Unassigned Shards

Symptoms:

  • Yellow/red cluster status

  • Shards stuck in UNASSIGNED state

Diagnose:
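
These three calls usually reveal the cause:

    GET _cluster/health
    GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
    GET _cluster/allocation/explain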

Common causes:

  • Not enough nodes for replicas

  • Disk space issues

  • Shard allocation rules preventing assignment

Solution:
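
Once the cause is fixed, retry allocation; if there simply aren't enough nodes for the replicas, drop the replica count (index pattern is an example):

    POST _cluster/reroute?retry_failed=true

    PUT logs-*/_settings
    {
      "index": { "number_of_replicas": 0 }
    }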

Issue 3: High JVM Heap Usage

Symptoms:

  • Slow searches

  • Frequent garbage collection

  • OutOfMemoryError

Diagnose:
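
Check heap usage per node:

    GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu
    GET _nodes/stats/jvm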

Solutions:

  • Increase heap (up to 31GB max)

  • Add more nodes

  • Reduce shard count

  • Optimize queries

  • Close unused indices

Issue 4: Slow Queries

Diagnose:
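
Turn on the search slow log and look at in-flight search tasks (thresholds and index pattern are examples):

    PUT logs-*/_settings
    {
      "index.search.slowlog.threshold.query.warn": "10s",
      "index.search.slowlog.threshold.query.info": "2s"
    }

    GET _tasks?actions=*search&detailed=true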

Solutions:

  • Limit time range

  • Use filters instead of queries

  • Reduce aggregation complexity

  • Add more nodes

  • Optimize mappings

Cost Optimization

AWS Cost Optimization

Use instance reservations (30-70% savings):

  • Reserved instances for stable workload

  • Spot for dev/test

Tiered storage:

  • Hot tier: i3 instances (NVMe SSD)

  • Warm tier: r5 instances (memory optimized)

  • Cold tier: d2 instances (HDD)

S3 for old data:

  • Archive old indices to S3

  • Searchable snapshots (7.10+, requires an Enterprise subscription)

General Cost Optimization

Reduce retention:

  • Store only what you need

  • 30 days in ELK, rest in S3

Optimize shard size:

  • Target 20-50GB per shard

  • Fewer, larger shards = less overhead

Use ILM:

  • Move old data to cheaper tier

  • Reduce replicas for old data

Right-size nodes:

  • Don't over-provision

  • Monitor utilization, adjust

My Production Checklist

Before going live, verify:

Infrastructure:

  • At least 3 master-eligible Elasticsearch nodes, SSD-backed data nodes

  • Load balancers in front of Logstash and Kibana

Configuration:

  • Heap at 50% of RAM (max 31GB), swapping disabled

  • ILM policies attached to all log indices

  • X-Pack security, TLS, and RBAC enabled

  • Persistent queues enabled in Logstash

Monitoring:

  • Stack Monitoring or a Prometheus exporter collecting cluster metrics

  • Alerts on cluster status, disk > 85%, heap > 85%, and node availability

Operational:

  • Snapshot repository and SLM policy in place

  • Restore tested (monthly DR drill on the calendar)

  • Capacity plan for expected log growth

Documentation:

  • Architecture diagram, retention conventions, and runbooks for the common issues above

Conclusion

Running ELK in production requires careful planning and ongoing maintenance. Key takeaways:

Architecture:

  • Plan for scale (horizontal scaling)

  • Separate node roles (master, data, coordinating)

  • Use load balancers for HA

Data Management:

  • Implement ILM for retention

  • Regular snapshots to S3/backup

  • Test restores monthly

Security:

  • Enable X-Pack security

  • Use TLS/SSL encryption

  • Implement RBAC

  • Network isolation

Monitoring:

  • Monitor the monitoring system

  • Alert on cluster health, disk, heap

  • Use external monitoring (Prometheus)

Performance:

  • Optimize heap (50% RAM, max 31GB)

  • Use persistent queues in Logstash

  • Tune refresh intervals

Operations:

  • Document everything

  • Test DR procedures

  • Plan capacity growth

  • Optimize costs

ELK is powerful but requires discipline. Follow these practices and you'll have a reliable, scalable logging platform.


This concludes the ELK Stack 101 series. You now have the knowledge to build, deploy, and maintain production-grade ELK clusters.

Previous: Part 4 - Kibana Visualization
Back to: Series Overview


This article is part of the ELK Stack 101 series. Check out the series overview for more content.
