Production Best Practices
Series: Elasticsearch 101 | Article: 08
Overview
Getting Elasticsearch working locally is straightforward. Getting it working reliably in production under real load, with growing data, schema changes, security requirements, and operational visibility, requires deliberate choices. This article covers the practices I apply before and after deploying Elasticsearch to production.
Authentication: API Keys over Username/Password
For application authentication, use API keys rather than the elastic superuser or a named user account.
API keys:
Can be scoped to specific indexes and operations (read-only, write-only, etc.)
Can be rotated without changing application credentials if compromised
Are audited individually in the security log
Do not require creating a persistent user account per application
Create an API key via the REST API (or Kibana → Stack Management → API Keys):
POST /_security/api_key
{
  "name": "articles-api-prod",
  "role_descriptors": {
    "articles-writer": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["articles", "articles_*"],
          "privileges": ["read", "write", "create_index", "manage"]
        }
      ]
    }
  }
}

Response:
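The create call returns the key material exactly once; the encoded field is what clients send (values below are illustrative):

```
{
  "id": "VuaCfGcBCdbkQm-e5aOx",
  "name": "articles-api-prod",
  "api_key": "ui2lp2axTNmsyakw9tvNnw",
  "encoded": "VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw=="
}
```

encoded is the base64 of id:api_key, ready to use in an Authorization: ApiKey header.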
Use encoded directly in the Go client config:
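A minimal sketch with the official go-elasticsearch client; the ES_URL and ES_API_KEY environment variable names are assumptions:

```go
package main

import (
	"log"
	"os"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	// APIKey takes the base64 "encoded" value from the create-key response.
	cfg := elasticsearch.Config{
		Addresses: []string{os.Getenv("ES_URL")},
		APIKey:    os.Getenv("ES_API_KEY"),
	}
	es, err := elasticsearch.NewClient(cfg)
	if err != nil {
		log.Fatalf("creating client: %s", err)
	}
	_ = es // use es for requests as in earlier articles
}
```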
Store the encoded key in a secrets manager (Vault, AWS Secrets Manager, Kubernetes Secret) and inject it via environment variable. Never commit it.
Index Aliases in Production
All queries and writes should use an alias, never the actual index name. This was introduced in Article 03 and it becomes critical in production.
Set up two aliases at index creation time:
articles: used by the application for reads and writes.
articles_write: points to the active write index; useful during blue-green reindexing.
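One way to set this up at creation time; the versioned index name articles_v1 is an assumption from the reindexing scheme below:

```
PUT /articles_v1
{
  "aliases": {
    "articles": {},
    "articles_write": { "is_write_index": true }
  }
}
```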
When you need to reindex (mapping change, shard count change):
1. Create articles_v2 with the new mapping.
2. Run _reindex from articles_v1 to articles_v2. For large indexes, add "slices": "auto" to parallelize.
3. While the reindex runs, continue writing to articles_v1 via the alias.
4. After the reindex completes, sync any documents written during the reindex (using the updated_at field as a cursor).
5. Atomically swap the alias.
6. Verify queries work against articles_v2, then delete articles_v1.
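The swap itself is one atomic _aliases call using the v1/v2 names from the steps above; both actions take effect together, so readers never see a missing alias:

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "articles_v1", "alias": "articles" } },
    { "add":    { "index": "articles_v2", "alias": "articles" } }
  ]
}
```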
Zero downtime reindexing requires this alias discipline from day one.
Index Lifecycle Management (ILM)
For time-series data (logs, events, metrics), use ILM to automate the index lifecycle. ILM moves indexes through phases as they age:
Hot: active writes; high-performance hardware.
Warm: read-only; can reduce replica count.
Cold: searchable but on cheaper storage.
Delete: remove the index entirely.
Example ILM policy for a log index:
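A sketch of such a policy; the policy name, sizes, and ages are illustrative, not recommendations:

```
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "40gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "number_of_replicas": 0 },
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```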
Attach the policy to an index template and create a data stream. Elasticsearch handles rollover automatically. Without ILM on log-type data, a single index can grow unbounded and become impossibly expensive to shrink.
Mapping Discipline: Preventing Mapping Explosion
Mapping explosion occurs when an index accumulates thousands or tens of thousands of fields, typically from dynamic mapping enabled on JSON blobs with arbitrary keys. Every mapped field consumes heap memory on every node.
Preventive measures:
1. Use "dynamic": "strict" for known schemas.
If fields are known at design time, lock down the mapping:
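For example, for the articles index used throughout this series (the field names here are assumptions):

```
PUT /articles_v1
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title":      { "type": "text" },
      "updated_at": { "type": "date" }
    }
  }
}
```

With "dynamic": "strict", indexing a document that contains an unmapped field is rejected instead of silently widening the mapping.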
2. Use flattened for blobs with arbitrary keys.
If you need to store and search dynamic key-value data (e.g., user-defined metadata), use the flattened field type instead of a nested object with dynamic mapping:
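A sketch, assuming a user-defined metadata field on the articles index:

```
PUT /articles/_mapping
{
  "properties": {
    "metadata": { "type": "flattened" }
  }
}
```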
flattened indexes the leaf values of the entire object under a single backing field. You lose per-field type control but gain protection against field count explosion.
3. Set index.mapping.total_fields.limit.
Elastic's default is 1000 fields per index. Raise it only deliberately, never just to silence an error:
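Raising it is a one-line dynamic settings change (the value 2000 here is illustrative):

```
PUT /articles/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```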
Shard Sizing
Shard sizing errors are the most common cause of cluster performance problems I have seen.
Rules of thumb:
Target 10–50 GB per shard for search-heavy workloads.
Target 20–40 GB per shard for time-series/log indexes.
Avoid shards smaller than 1 GB; they generate unnecessary overhead.
Do not set primary shard count to match node count automatically; plan based on data volume.
For an index that will grow to ~100 GB with equal read/write distribution, 3 primary shards + 1 replica is a sensible starting point on a 3-node cluster.
Check shard sizes:
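For example, to list shards for the articles indexes sorted by on-disk size (the index pattern is an assumption):

```
GET _cat/shards/articles*?v&h=index,shard,prirep,store&s=store:desc
```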
Replica Count by Environment
Local dev: 0 replicas. Avoids yellow status on a single node.
Staging: 0 or 1. Mirrors the prod topology cheaply.
Production: 1 (minimum). Zero replicas means data loss if a node dies.
index.number_of_replicas can be changed live, without reindexing:
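For example:

```
PUT /articles/_settings
{
  "index": { "number_of_replicas": 1 }
}
```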
Monitoring
Cluster health is the first thing to check:
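A quick check from curl or Kibana Dev Tools:

```
GET _cluster/health
```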
Green = all shards assigned and replicated. Yellow = all primaries assigned but some replicas are not (acceptable on a single node). Red = at least one primary shard is unassigned; data may be missing.
Key metrics to track:
JVM heap used %: GET _nodes/stats/jvm
Query latency (p50, p99): GET _nodes/stats/indices/search
Indexing rate: GET _nodes/stats/indices/indexing
Unassigned shards: GET _cluster/health
Disk usage per node: GET _cat/allocation?v
JVM heap sustained above 75% is a warning sign. Disk above 85% crosses the low disk watermark (Elasticsearch stops allocating new shards to that node); above the 95% flood-stage watermark, affected indexes are marked read-only and writes are blocked.
For production monitoring, I use the Kibana Stack Monitoring view (Stack Management → Stack Monitoring) alongside alerting rules on the metrics above.
Slow Query Log
Enable the slow query log to identify expensive queries in production:
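One way to set per-index thresholds (the values here are illustrative):

```
PUT /articles/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "500ms"
}
```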
Slow log entries surface in the Elasticsearch logs and in Kibana. Any query consistently hitting the warn threshold needs to be investigated; usually it is missing a filter, running a large terms aggregation without size control, or using a leading-wildcard query.
Backup: Snapshot and Restore
Elasticsearch does not write through to a relational backend: if data is lost, it may not be recoverable from another source. Configure snapshots.
Register a repository (S3 example):
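A sketch; the repository and bucket names are assumptions, and the cluster needs S3 credentials configured (the S3 repository support ships with recent Elasticsearch versions):

```
PUT _snapshot/articles_backups
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots",
    "base_path": "prod-cluster"
  }
}
```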
Create a snapshot lifecycle policy to automate this:
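A sketch of a nightly policy against the repository registered above; schedule and retention values are illustrative:

```
PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 2 * * ?",
  "name": "<nightly-{now/d}>",
  "repository": "articles_backups",
  "config": { "indices": ["articles*"] },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```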
Test restore into a non-production cluster before you need it.
Go Application Health Check
Include an Elasticsearch reachability check in the Go service's /healthz endpoint:
A 503 from this endpoint should block traffic at the load balancer level and trigger an alert.
Summary
Use scoped API keys for application authentication; rotate on compromise; store in a secrets manager.
Use aliases for all index access. Enable zero-downtime reindexing from day one.
Use ILM for time-series data to automate retention without manual cleanup.
Lock mappings with "dynamic": "strict" and use flattened for arbitrary key-value data.
Size shards between 10–50 GB; set replicas to at least 1 in production.
Monitor JVM heap, shard assignment, disk watermarks, and query latency.
Enable snapshot lifecycle policies before data grows; test restore.
Expose Elasticsearch health through the application's healthcheck endpoint.