Production Best Practices

Series: Elasticsearch 101 | Article: 08


Overview

Getting Elasticsearch working locally is straightforward. Getting it working reliably in production under real load — with growing data, schema changes, security requirements, and operational visibility — requires deliberate choices. This article covers the practices I apply before and after deploying Elasticsearch to production.


Authentication: API Keys over Username/Password

For application authentication, use API keys rather than the elastic superuser or a named user account.

API keys:

  • Can be scoped to specific indexes and operations (read-only, write-only, etc.)

  • Can be rotated without changing application credentials if compromised

  • Are audited individually in the security log

  • Do not require creating a persistent user account per application

Create an API key via the REST API (or Kibana → Stack Management → API Keys):

POST /_security/api_key
{
  "name": "articles-api-prod",
  "role_descriptors": {
    "articles-writer": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["articles", "articles_*"],
          "privileges": ["read", "write", "create_index", "manage"]
        }
      ]
    }
  }
}

Response:
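The response returns the key in several forms (the values below are illustrative placeholders); encoded is the base64 of id:api_key and is what the client sends:

```
{
  "id": "VuaCfGcBCdbkQm-e5aOx",
  "name": "articles-api-prod",
  "api_key": "ui2lp2axTNmsyakw9tvNnw",
  "encoded": "VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw=="
}
```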

Use the encoded value directly in the Go client config:
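A minimal sketch using the official go-elasticsearch client; the cluster address is a placeholder, and ES_API_KEY is an assumed environment variable name:

```go
package main

import (
	"log"
	"os"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	// APIKey takes the "encoded" value from the create-API-key response.
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses: []string{"https://es.example.internal:9200"}, // placeholder address
		APIKey:    os.Getenv("ES_API_KEY"),                      // injected from a secrets manager
	})
	if err != nil {
		log.Fatalf("creating elasticsearch client: %s", err)
	}

	// Smoke-check the credentials against the cluster.
	res, err := es.Info()
	if err != nil {
		log.Fatalf("pinging cluster: %s", err)
	}
	defer res.Body.Close()
}
```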

Store the encoded key in a secrets manager (Vault, AWS Secrets Manager, Kubernetes Secret) and inject it via environment variable. Never commit it.


Index Aliases in Production

All queries and writes should use an alias, never the actual index name. This was introduced in Article 03 and it becomes critical in production.

Set up two aliases at index creation time:

Alias            Purpose
articles         Used by the application for reads and writes
articles_write   Points to the active write index; useful during blue-green reindexing
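One way to attach both aliases at index creation time (articles_v1 follows this article's versioned-index naming; is_write_index marks the index that receives writes through the alias):

```
PUT /articles_v1
{
  "aliases": {
    "articles": {},
    "articles_write": { "is_write_index": true }
  }
}
```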

When you need to reindex (mapping change, shard count change):

  1. Create articles_v2 with the new mapping.

  2. Run _reindex from articles_v1 to articles_v2. For large indexes, pass slices=auto to parallelize the copy.

  3. While reindex runs, continue writing to articles_v1 via the alias.

  4. After reindex completes, sync any documents written during the reindex (using the updated_at field as a cursor).

  5. Atomically swap the alias with a single _aliases request; the remove and add actions apply together, so clients never see a gap.

  6. Verify queries work against articles_v2, then delete articles_v1.
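Steps 2 and 5 can be sketched as follows; wait_for_completion=false returns a task ID to poll instead of blocking, and the _aliases action list executes atomically:

```
POST /_reindex?slices=auto&wait_for_completion=false
{
  "source": { "index": "articles_v1" },
  "dest":   { "index": "articles_v2" }
}

POST /_aliases
{
  "actions": [
    { "remove": { "index": "articles_v1", "alias": "articles" } },
    { "add":    { "index": "articles_v2", "alias": "articles" } }
  ]
}
```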

Zero-downtime reindexing requires this alias discipline from day one.


Index Lifecycle Management (ILM)

For time-series data (logs, events, metrics), use ILM to automate the index lifecycle. ILM moves indexes through phases as they age:

Phase    Typical Action
Hot      Active writes; high-performance hardware
Warm     Read-only; can reduce replica count
Cold     Searchable but on cheaper storage
Delete   Remove the index entirely

Example ILM policy for a log index:
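A sketch of such a policy; the thresholds are illustrative and should be tuned to your retention requirements:

```
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "number_of_replicas": 0 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```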

Attach the policy to an index template and create a data stream; Elasticsearch then handles rollover automatically. Without ILM on log-type data, a single index can grow unbounded and become prohibitively expensive to manage or shrink later.
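Attaching the policy might look like this; logs-app-*, the template name, and the logs-policy policy name are assumed examples, and the data_stream object makes matching indexes data streams:

```
PUT _index_template/logs-template
{
  "index_patterns": ["logs-app-*"],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "logs-policy" }
  }
}
```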


Mapping Discipline: Preventing Mapping Explosion

Mapping explosion occurs when an index accumulates thousands or tens of thousands of fields — typically from dynamic mapping enabled on JSON blobs with arbitrary keys. Every mapped field consumes heap memory on every node.

Preventive measures:

1. Use "dynamic": "strict" for known schemas.

If fields are known at design time, lock down the mapping:
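For example, for this series' articles index (field list abbreviated; names assumed from earlier articles):

```
PUT /articles
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title":      { "type": "text" },
      "body":       { "type": "text" },
      "updated_at": { "type": "date" }
    }
  }
}
```

With "dynamic": "strict", indexing a document that contains an unmapped field is rejected instead of silently widening the mapping.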

2. Use flattened for blobs with arbitrary keys.

If you need to store and search dynamic key-value data (e.g., user-defined metadata), use the flattened field type instead of a nested object with dynamic mapping:
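A sketch, using an assumed demo index with a user-defined metadata field:

```
PUT /articles-metadata-demo
{
  "mappings": {
    "properties": {
      "metadata": { "type": "flattened" }
    }
  }
}
```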

flattened indexes the leaf values of the entire object under a single backing field. You lose per-field type control but gain protection against field count explosion.

3. Set index.mapping.total_fields.limit.

Elastic's default is 1000 fields per index. Raise it only deliberately, never just to silence an error:
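For example, to raise the limit on an existing index after a reviewed mapping change:

```
PUT /articles/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```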


Shard Sizing

Shard sizing errors are the most common cause of cluster performance problems I have seen.

Rules of thumb:

  • Target 10–50 GB per shard for search-heavy workloads.

  • Target 20–40 GB per shard for time-series/log indexes.

  • Avoid shards smaller than 1 GB — they generate unnecessary overhead.

  • Do not set primary shard count to match node count automatically; plan based on data volume.

For an index that will grow to ~100 GB with equal read/write distribution, 3 primary shards + 1 replica is a sensible starting point on a 3-node cluster.

Check shard sizes:
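One way to list per-shard store sizes, largest first (the articles* pattern matches this series' indexes):

```
GET _cat/shards/articles*?v&h=index,shard,prirep,store&s=store:desc
```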


Replica Count by Environment

Environment   Replicas      Reason
Local dev     0             Avoids yellow status on single-node
Staging       0 or 1        Mirrors prod topology cheaply
Production    1 (minimum)   Zero replicas means data loss if a node dies

The index.number_of_replicas setting can be changed live, without reindexing:
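For example:

```
PUT /articles/_settings
{
  "index": { "number_of_replicas": 1 }
}
```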


Monitoring

Cluster health is the first thing to check:
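A single request reports overall status:

```
GET _cluster/health
```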

Green = all shards assigned and replicated. Yellow = all primaries assigned, but some replicas are not (acceptable on single-node). Red = at least one primary shard unassigned — data may be missing.

Key metrics to track:

Metric                     Source
JVM heap used %            GET _nodes/stats/jvm
Query latency (p50, p99)   GET _nodes/stats/indices/search
Indexing rate              GET _nodes/stats/indices/indexing
Unassigned shards          GET _cluster/health
Disk usage per node        GET _cat/allocation?v

JVM heap above 75% sustained is a warning sign. For disk, Elasticsearch enforces allocation watermarks: at 85% (low) no new shards are allocated to the node, at 90% (high) shards are relocated away, and at 95% (flood stage) indexes with a shard on the node are switched to read-only, blocking writes.

For production monitoring, I use the Kibana Stack Monitoring view (Stack Management → Stack Monitoring) alongside alerting rules on the metrics above.


Slow Query Log

Enable the slow query log to identify expensive queries in production:
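One possible set of thresholds (tune to your own latency budget); these are dynamic index settings, so no restart is required:

```
PUT /articles/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
```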

Slow log entries surface in the Elasticsearch logs and in Kibana. Any query consistently hitting the warn threshold needs to be investigated — usually it is missing a filter, running a large terms aggregation without size control, or using a wildcard pattern the index cannot serve efficiently.


Backup: Snapshot and Restore

Elasticsearch does not write to a relational backend — if data is lost, it may not be recoverable from another source. Configure snapshots.

Register a repository (S3 example):
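A sketch; the bucket name and base path are placeholders, and the cluster needs S3 credentials configured separately (via the keystore or an instance role):

```
PUT _snapshot/nightly_backups
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots",
    "base_path": "prod-cluster"
  }
}
```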

Create a snapshot lifecycle policy to automate this:
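A sketch of a daily policy; nightly_backups is an assumed repository name, and the schedule and retention values are illustrative:

```
PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-{now/d}>",
  "repository": "nightly_backups",
  "config": { "indices": ["articles*"] },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```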

Test restore into a non-production cluster before you need it.


Go Application Health Check

Include an Elasticsearch reachability check in the Go service's /healthz endpoint:

A 503 from this endpoint should block traffic at the load balancer level and trigger an alert.


Summary

  • Use scoped API keys for application authentication; rotate on compromise; store in a secrets manager.

  • Use aliases for all index access. Enable zero-downtime reindexing from day one.

  • Use ILM for time-series data to automate retention without manual cleanup.

  • Lock mappings with "dynamic": "strict" and use flattened for arbitrary key-value data.

  • Size shards between 10–50 GB; set replicas to at least 1 in production.

  • Monitor JVM heap, shard assignment, disk watermarks, and query latency.

  • Enable snapshot lifecycle policies before data grows; test restore.

  • Expose Elasticsearch health through the application's healthcheck endpoint.


Previous: React Frontend Integration


Series Complete
