What is Elasticsearch and When to Use It

Series: Elasticsearch 101 | Article: 01


Background

Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene. It was initially released in 2010, and what made it stand out from day one was not just the speed: it was the fact that you could scale it horizontally without changing your application code, and query it over a straightforward HTTP/JSON API.

I reached for Elasticsearch after hitting a wall with full-text search in a relational database. The query worked fine at 100k rows. At 2 million rows with fuzzy matching enabled, it was unusable. Elasticsearch solved that problem, but it also introduced a different mental model that is worth understanding before writing a single line of code.


How Elasticsearch Stores Data

The Inverted Index

Elasticsearch does not run LIKE '%keyword%' across rows. Instead, it builds an inverted index at write time.

When you index a document like this:

{
  "title": "Getting started with distributed systems"
}

Elasticsearch tokenizes the text and builds a map:

  Token          Document IDs
  getting        [1]
  started        [1]
  distributed    [1]
  systems        [1]

When you search for "distributed", Elasticsearch does a direct lookup in this map rather than scanning every document. This is why full-text search is fast regardless of how many documents you have: the cost is paid at write time, not read time.
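The structure itself is simple. Here is a minimal Python sketch, with a toy lowercase-and-split tokenizer standing in for Elasticsearch's real analyzers (which also handle punctuation, stemming, stop words, and more):

```python
from collections import defaultdict

def tokenize(text):
    # Toy analyzer: lowercase + whitespace split only
    return text.lower().split()

def build_inverted_index(docs):
    # Map each token to the sorted list of document IDs containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

docs = {
    1: "Getting started with distributed systems",
    2: "Distributed tracing in practice",
}
index = build_inverted_index(docs)
print(index["distributed"])  # -> [1, 2], found by lookup, not by scanning
```

Searching is now a dictionary lookup whose cost does not grow with the number of documents, which is exactly the trade the real engine makes.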

Segments and Immutability

Elasticsearch writes data into segments: small, immutable Lucene indexes on disk. Once a segment is written, it is never modified. Updates are modeled as a delete of the old document plus a write of the new one. Background merge operations periodically combine small segments into larger ones.

This immutability is why Elasticsearch is excellent for append-heavy workloads and why you need to think differently about high-update-rate data.
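A toy model of this write path, with plain Python dicts standing in for Lucene segments (the real mechanics involve per-segment delete bitmaps, but the shape of the idea is the same):

```python
segments = []  # each segment: dict of doc_id -> doc, never mutated after flush

def flush(batch):
    # Freeze a batch of writes as a new immutable segment
    segments.append(dict(batch))

def update(doc_id, new_doc):
    # No in-place edit: the new version simply lands in a newer segment
    flush({doc_id: new_doc})

def get(doc_id):
    # Newest segment wins; older copies are effectively dead
    for seg in reversed(segments):
        if doc_id in seg:
            return seg[doc_id]
    return None

def merge():
    # Background merges rewrite small segments into one,
    # discarding superseded document versions along the way
    merged = {}
    for seg in segments:
        merged.update(seg)
    segments[:] = [merged]

flush({1: {"title": "v1"}})
update(1, {"title": "v2"})   # two segments now exist; reads see only v2
```

This is also why a document that changes many times per second is expensive: every change produces new segment data that merges must later clean up.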


Core Concepts

Cluster

A cluster is one or more nodes working together, identified by a cluster name. The nodes collectively hold the cluster's data and expose it through a single API endpoint.

Node

A node is a single running instance of Elasticsearch. A node can hold data, coordinate requests, or do both. In local development you typically run a single node. In production you run at least three master-eligible nodes so the cluster can maintain a quorum and avoid split-brain scenarios.

Index

An index in Elasticsearch is roughly analogous to a table in a relational database. It has a name, a mapping (schema), and contains documents. Unlike a database table, an index is actually a logical grouping of one or more shards.

Document

A document is a JSON object stored in an index. Every document has:

  • _index: which index it belongs to

  • _id: its unique identifier within the index

  • _source: the original JSON body

Shard

A shard is a single Lucene index. Elasticsearch distributes shards across nodes. When you create an index, you configure how many primary shards it has (this cannot be changed later without reindexing). Each primary shard can have one or more replica shards for redundancy and read scalability.

A common starting configuration for a production index is 1 primary shard + 1 replica if the data fits one node, or 3–5 primary shards + 1 replica for larger datasets.
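The reason the primary shard count is fixed follows from routing: Elasticsearch assigns a document to a shard by hashing its routing value (the _id by default) modulo the number of primary shards. The sketch below uses CRC32 as a stand-in for the internal hash; the exact formula differs in detail, but the consequence is the same, since changing the shard count would send lookups for existing IDs to the wrong shard.

```python
import zlib

def route(doc_id, num_primary_shards):
    # zlib.crc32 stands in for Elasticsearch's internal hash function
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# The same document lands on a different shard if the primary count changes,
# which is why resizing requires a full reindex
print(route("article-42", 3), route("article-42", 5))
```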

Mapping

A mapping defines the schema of documents in an index: what fields exist, their data types, and how they should be analyzed. Getting mappings right up front avoids painful reindexing jobs later.
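As a sketch, the mapping body for a hypothetical blog-article index might look like the dict below. The field names are illustrative; the types (text, keyword, date, integer) are standard Elasticsearch field types, and the text/keyword split is the classic decision point:

```python
# Request body you would send when creating the index (shown as a Python dict)
article_mapping = {
    "mappings": {
        "properties": {
            "title":        {"type": "text"},     # analyzed for full-text search
            "published_at": {"type": "date"},
            "tags":         {"type": "keyword"},  # exact match, filters, aggregations
            "views":        {"type": "integer"},
        }
    }
}
```

Choosing text versus keyword is the decision you will revisit most often: text is tokenized and searchable, keyword is stored verbatim for filtering and aggregation.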


Elasticsearch versus a Relational Database

  Concern                 PostgreSQL / MySQL                          Elasticsearch
  Full-text search        Slow at scale, limited relevance scoring    Purpose-built, fast, relevance-ranked
  Structured queries      Strong support with SQL, including JOINs    Aggregations are powerful, but no JOIN
  ACID transactions       Yes                                         No (document-level atomicity only)
  Schema changes          Migrations, ALTER TABLE                     Add fields freely; type changes require reindex
  Primary data store      Yes                                         Avoid; treat as a secondary derived store
  Writes per second       High, highly concurrent                     High for bulk; single-doc updates have overhead

The most important line in the table: treat Elasticsearch as a secondary derived store. I write to PostgreSQL first and sync to Elasticsearch asynchronously (via outbox pattern, CDC, or periodic job depending on the latency requirement). This keeps your source of truth clean and your search index optimized for reads.
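The outbox variant of that sync can be sketched with in-memory stand-ins. Here the outbox list plays the role of an outbox table written in the same transaction as the primary write, and sync_once is a hypothetical worker that would run asynchronously in practice:

```python
outbox = []        # stand-in for an outbox table in the relational database
search_index = {}  # stand-in for the Elasticsearch index

def save_article(db, article):
    # 1. Write the source of truth and the outbox row together
    #    (in a real system: one database transaction)
    db[article["id"]] = article
    outbox.append(("upsert", article))

def sync_once():
    # 2. An async worker drains the outbox into the search index;
    #    search lags the database by the drain interval, which is fine
    while outbox:
        op, article = outbox.pop(0)
        if op == "upsert":
            search_index[article["id"]] = article

db = {}
save_article(db, {"id": 1, "title": "hello"})
sync_once()
```

The key property is that a failed sync never corrupts the source of truth; the worst case is a stale search index that catches up on the next drain.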


When Elasticsearch Is the Right Tool

Full-Text Search with Relevance

If users need to search natural language text (blog posts, product descriptions, documentation), Elasticsearch is the right choice. It handles stemming, synonyms, multi-language analyzers, fuzziness, and relevance scoring out of the box.

Autocomplete and Typeahead

With the completion field type or edge_ngram tokenizer, autocomplete queries on large datasets are very fast. I have used this for a local knowledge base project and the response time was under 10ms at p99.
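A toy version of edge n-gram tokenization shows why typeahead is cheap: every prefix of each token is indexed up front, so a prefix query becomes a plain term lookup instead of a scan. The parameters mirror the edge_ngram tokenizer's min_gram and max_gram settings:

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    # All leading prefixes of the token between min_gram and max_gram long
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

print(edge_ngrams("dist"))  # -> ['di', 'dis', 'dist']
```

As with the inverted index itself, the trade is extra work and storage at write time (many prefix terms per token) for constant-cost lookups at read time.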

Log and Event Analytics

The ELK Stack (Elasticsearch + Logstash + Kibana) became the de facto standard for log aggregation for good reason. Data streams and index lifecycle management (ILM) make it practical to store and search months of log data.

Faceted Search and Filtering

E-commerce style filtering (by category, price range, brand, rating) maps naturally to Elasticsearch aggregations combined with filter queries. A single request runs the query over the full corpus and counts the filtered subsets at the same time, all in one round trip.
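As an illustrative sketch, a combined filter-plus-facets request body might look like the dict below. The field names (category, price, brand) are hypothetical; the query DSL constructs (bool/filter, term, range, terms and avg aggregations) are standard Elasticsearch:

```python
# Search request body (shown as a Python dict): filter the corpus,
# then count brands and average prices over the filtered subset
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"category": "laptops"}},
                {"range": {"price": {"gte": 500, "lte": 1500}}},
            ]
        }
    },
    "aggs": {
        "by_brand":  {"terms": {"field": "brand"}},  # brand would be a keyword field
        "avg_price": {"avg": {"field": "price"}},
    },
}
```

Filters in a bool/filter clause are not scored and are cacheable, which is why this shape stays fast even with many facets.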

Geospatial Queries

Elasticsearch supports geo_point and geo_shape field types and can do distance-based filtering and sorting. For geo queries at a scale beyond what a basic PostGIS setup handles comfortably, Elasticsearch is worth considering.


When NOT to Use Elasticsearch

  • Primary OLTP store: No multi-document transactions, no foreign key constraints. Use a relational database or document store for this.

  • High-frequency single-document updates: Each update is a delete + rewrite. If a record changes dozens of times per second, you will create segment churn. Batch updates or use a DB as the write-ahead store.

  • Heavy joins or relational data: Elasticsearch has no JOIN. nested and parent-child types exist but add complexity. If your query is fundamentally relational, stay in SQL.

  • Small datasets: The operational overhead of Elasticsearch is not worth it if a properly indexed PostgreSQL table would serve the same purpose.


Data Model Mental Shift

The biggest adjustment coming from relational databases is denormalization. In SQL you normalize to avoid duplication and rely on JOINs. In Elasticsearch you intentionally duplicate data to avoid needing JOINs.

If you are indexing articles with authors, you embed the author's name, avatar URL, and bio directly in the article document. You accept that if the author updates their avatar, you need to update every article. This is a deliberate trade-off: slower, denormalized writes in exchange for very fast reads.
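A small sketch of that trade-off, using plain dicts for the indexed documents; update_author_avatar is a hypothetical helper showing the write fan-out that denormalization creates:

```python
# Denormalized documents: author fields are copied into every article,
# so reads never need a join
articles = {
    1: {"title": "Intro",    "author": {"name": "Ada", "avatar": "a1.png"}},
    2: {"title": "Sharding", "author": {"name": "Ada", "avatar": "a1.png"}},
}

def update_author_avatar(author_name, new_avatar):
    # The write cost: one profile change becomes N document updates
    updated = 0
    for doc in articles.values():
        if doc["author"]["name"] == author_name:
            doc["author"]["avatar"] = new_avatar
            updated += 1
    return updated

print(update_author_avatar("Ada", "a2.png"))  # -> 2
```

In a real system the fan-out would be an update-by-query or a bulk request driven from the source-of-truth database, but the cost structure is the same: N writes per author change, zero joins per read.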


Summary

  • Elasticsearch is a distributed search and analytics engine built on Lucene.

  • It uses inverted indexes to make full-text search fast regardless of dataset size.

  • Core units are: cluster → node → index → shard → document.

  • It excels at full-text search, faceted navigation, log analytics, and autocomplete.

  • Treat it as a secondary read-optimized store, not a primary database.

  • Mapping decisions made at index creation time are largely permanent, so plan them up front.


Next: Setting Up Elasticsearch with Docker
