Part 5: Production Best Practices
The Day Our ELK Cluster Went Down
Production Architecture
Small Environment (< 50GB/day)
┌─────────────────┐
│   Application   │
│     Servers     │
└────────┬────────┘
         │ (Filebeat)
         ▼
┌─────────────────┐       ┌──────────────────┐
│    Logstash     │──────▶│  Elasticsearch   │
│    (2 nodes)    │       │    (3 nodes)     │
└─────────────────┘       └────────┬─────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                          │     Kibana      │
                          │    (1 node)     │
                          └─────────────────┘
Medium Environment (50-500GB/day)
Large Environment (> 500GB/day)
Hardware Sizing
Elasticsearch Nodes
Logstash Nodes
Kibana Nodes
Data Retention and ILM
My Standard ILM Policy
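As a reference point, here is a minimal hot-warm-delete sketch assuming daily rollover and 30-day retention; the policy name logs-policy and every threshold are illustrative and should be tuned to your own retention requirements.

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}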
Applying ILM Policy
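A policy only takes effect once indices reference it. One common way, sketched below with illustrative names (logs-template, logs-*, logs), is to attach it through an index template so every newly created index picks it up automatically.

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs",
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}

GET logs-*/_ilm/explain then shows which lifecycle phase each index is currently in.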
Security
Enable X-Pack Security
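On Elasticsearch 7.x the basic security features ship with the default distribution but stay off until you enable them (8.x turns them on automatically at install). A minimal sketch:

# elasticsearch.yml on every node
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true

# Then set passwords for the built-in users (elastic, kibana_system, ...)
bin/elasticsearch-setup-passwords interactive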
TLS/SSL Encryption
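For transport and HTTP encryption, elasticsearch-certutil can generate a CA and node certificates; the file names below are examples, not required values.

bin/elasticsearch-certutil ca --out config/elastic-stack-ca.p12
bin/elasticsearch-certutil cert --ca config/elastic-stack-ca.p12 --out config/elastic-certificates.p12

# elasticsearch.yml
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: elastic-certificates.p12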
Role-Based Access Control (RBAC)
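Roles and users can be managed through the security API (or in Kibana); the role and user names below are placeholders, and in practice you would also scope Kibana access with feature privileges.

PUT _security/role/logs_reader
{
  "indices": [
    {
      "names": ["logs-*"],
      "privileges": ["read", "view_index_metadata"]
    }
  ]
}

PUT _security/user/oncall_viewer
{
  "password": "use-a-strong-generated-secret",
  "roles": ["logs_reader"]
}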
Network Security
Backup and Disaster Recovery
Snapshot and Restore
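A sketch using a shared-filesystem repository (the path must also be listed under path.repo in elasticsearch.yml; repository and policy names are illustrative), with snapshot lifecycle management handling the daily schedule and retention:

PUT _snapshot/logs_backup
{
  "type": "fs",
  "settings": { "location": "/mnt/elk-snapshots" }
}

PUT _slm/policy/daily-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "logs_backup",
  "config": { "indices": ["logs-*"] },
  "retention": { "expire_after": "30d", "min_count": 7, "max_count": 60 }
}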
Testing Restores
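A backup you have never restored is not a backup. One low-risk way to test, assuming the repository from above (snapshot and index names here are placeholders), is to restore into renamed indices and compare document counts against the source:

POST _snapshot/logs_backup/daily-snap-2024.03.01/_restore
{
  "indices": "logs-2024.02.29",
  "rename_pattern": "logs-(.+)",
  "rename_replacement": "restored-logs-$1"
}

GET restored-logs-2024.02.29/_count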
Monitoring ELK Itself
Approach 1: Elasticsearch Monitoring (Built-in)
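Self-collected monitoring can be switched on with a single cluster setting, though newer releases steer you toward shipping these metrics with Metricbeat or Elastic Agent instead:

PUT _cluster/settings
{
  "persistent": { "xpack.monitoring.collection.enabled": true }
}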
Approach 2: External Monitoring
Approach 3: Watcher (Self-Monitoring)
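Watcher (a paid feature) runs checks from inside the cluster; the sketch below polls cluster health every minute and emails when it goes red. The email account still has to be configured in elasticsearch.yml, and a watch cannot fire if the cluster itself is down, which is why it complements rather than replaces external monitoring.

PUT _watcher/watch/cluster_health_watch
{
  "trigger": { "schedule": { "interval": "1m" } },
  "input": {
    "http": {
      "request": { "host": "localhost", "port": 9200, "path": "/_cluster/health" }
    }
  },
  "condition": {
    "compare": { "ctx.payload.status": { "eq": "red" } }
  },
  "actions": {
    "notify_ops": {
      "email": {
        "to": "ops@example.com",
        "subject": "Elasticsearch cluster health is RED"
      }
    }
  }
}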
Performance Optimization
Elasticsearch Tuning
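Typical starting points rather than definitive values: keep the JVM heap at roughly half of RAM and below ~32 GB so compressed object pointers stay enabled, and relax refresh and translog settings on write-heavy log indices.

# config/jvm.options.d/heap.options — ~50% of RAM, below 32 GB
-Xms16g
-Xmx16g

# Log indices rarely need near-real-time refresh; async translog trades a small
# durability window for indexing throughput
PUT logs-*/_settings
{
  "index.refresh_interval": "30s",
  "index.translog.durability": "async"
}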
Logstash Tuning
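The equivalent knobs on the Logstash side live in logstash.yml; the numbers below are starting points for an 8-core node, not recommendations for every workload.

# logstash.yml
pipeline.workers: 8         # usually the number of CPU cores
pipeline.batch.size: 1000   # larger batches improve Elasticsearch bulk throughput
pipeline.batch.delay: 50
queue.type: persisted       # disk-backed queue absorbs Elasticsearch backpressure
queue.max_bytes: 4gb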
Kibana Tuning
High Availability
Elasticsearch HA
Logstash HA
Kibana HA
Scaling Strategies
Vertical Scaling
Horizontal Scaling
Dedicated Node Roles
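In larger clusters, responsibilities are usually split with node.roles in elasticsearch.yml; a sketch of the common split:

# Dedicated master-eligible node
node.roles: [ master ]

# Hot data node that also runs ingest pipelines
node.roles: [ data_hot, data_content, ingest ]

# Warm data node for older, less-queried indices
node.roles: [ data_warm ]

# Coordinating-only node (empty list) to front search traffic
node.roles: [ ]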
Common Production Issues
Issue 1: Out of Disk Space
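When disk fills up, Elasticsearch first stops allocating new shards to the node (high watermark) and then marks indices read-only (flood stage). A sketch of the checks and the cleanup, with the default watermark values shown:

GET _cat/allocation?v

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

# After freeing space (deleting old indices, adding nodes), clear the read-only block
PUT logs-*/_settings
{ "index.blocks.read_only_allow_delete": null }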
Issue 2: Unassigned Shards
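The allocation explain API tells you exactly why a shard is not being placed, which beats guessing; the follow-up commands cover the two most common fixes.

GET _cluster/allocation/explain

# Retry allocations that hit the max-retries limit after a transient failure
POST _cluster/reroute?retry_failed=true

# If a node is permanently gone, reducing the replica count can also clear them
PUT logs-*/_settings
{ "index.number_of_replicas": 1 }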
Issue 3: High JVM Heap Usage
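Sustained heap usage above roughly 85% usually shows up as long GC pauses and circuit-breaker errors; these read-only checks help narrow down whether fielddata, segment memory, or simply too many shards per node are responsible.

GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent
GET _cat/fielddata?v
GET _nodes/stats/indices/fielddata,segments
GET _cat/shards?v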
Issue 4: Slow Queries
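The search slow log identifies which queries are hurting, and the profile API shows where the time goes inside one of them; the thresholds and the example query below are illustrative.

PUT logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

GET logs-*/_search
{
  "profile": true,
  "query": { "match": { "message": "timeout" } }
}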
Cost Optimization
AWS Cost Optimization
General Cost Optimization
My Production Checklist
Conclusion