# Indexes and Performance

*My journey from 45-second page loads to sub-100ms responses*

***

## Table of Contents

1. [Introduction: The Day My Blog Became Unusable](#introduction-the-day-my-blog-became-unusable)
2. [What Is an Index?](#what-is-an-index)
3. [How PostgreSQL Finds Data Without an Index](#how-postgresql-finds-data-without-an-index)
4. [B-Tree Indexes: The Default Workhorse](#b-tree-indexes-the-default-workhorse)
5. [Other Index Types in PostgreSQL](#other-index-types-in-postgresql)
6. [Index Strategies for Real Queries](#index-strategies-for-real-queries)
7. [EXPLAIN and EXPLAIN ANALYZE](#explain-and-explain-analyze)
8. [Partial Indexes: Index Only What You Need](#partial-indexes-index-only-what-you-need)
9. [Composite Indexes and Column Order](#composite-indexes-and-column-order)
10. [Index-Only Scans and Covering Indexes](#index-only-scans-and-covering-indexes)
11. [The Cost of Indexes](#the-cost-of-indexes)
12. [Query Optimisation Techniques](#query-optimisation-techniques)
13. [Monitoring Index Usage](#monitoring-index-usage)
14. [What I Learned About Performance](#what-i-learned-about-performance)
15. [Next Steps](#next-steps)

***

## Introduction: The Day My Blog Became Unusable

It was a Saturday morning. I had just crossed 50,000 rows in my `posts` table after three years of writing. I opened my blog's admin panel to review comments, and nothing loaded. After 30 seconds, a timeout error.

I restarted the server. Same thing. Cleared the application cache. Still broken.

I connected directly to the database and ran the exact query my admin panel used:

```sql
SELECT p.*, a.name AS author_name, COUNT(c.id) AS comment_count
FROM posts p
JOIN authors a ON p.author_id = a.id
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.status = 'published'
GROUP BY p.id, a.name
ORDER BY p.published_at DESC
LIMIT 20;
```

It returned… in **43 seconds**.

The same query had been running in 80ms when I had 5,000 posts. Nobody told me databases slow down as they grow. Nobody told me I needed to *design for scale*.

Two hours later, after adding the right indexes, the same query ran in **4ms**.

This article is the guide I wish had existed that Saturday morning.

***

## What Is an Index?

An index is a separate data structure that PostgreSQL maintains alongside your table. It's like the index at the back of a textbook—instead of reading every page to find "ACID properties", you look it up in the index and jump directly to page 347.

{% @mermaid/diagram content="graph LR
Query\["Query: WHERE status = 'published'"] --> Decision{Index exists?}
Decision -- No --> SeqScan\["Sequential Scan\n(read every row)"]
Decision -- Yes --> IndexScan\["Index Scan\n(jump to matching rows)"]
SeqScan --> Slow\["50,000 rows read\n~43 seconds"]
IndexScan --> Fast\["Index lookup\n~4ms"]" %}

**Trade-off**: Indexes speed up reads but slow down writes (INSERT, UPDATE, DELETE), because PostgreSQL must update the index alongside the table. Choose indexes that serve your actual query patterns.

***

## How PostgreSQL Finds Data Without an Index

Without an index, PostgreSQL performs a **Sequential Scan** (seq scan): it reads every single row in the table from start to finish, checks if the row matches the WHERE condition, and either includes or discards it.

For a table with 50,000 posts:

* Sequential scan: read all 50,000 rows
* Your query matches 12,000 published posts
* PostgreSQL discarded 38,000 rows it didn't need

With an index on `status`:

* Index lookup: jump directly to the 12,000 matching rows
* No wasted reads

Sequential scans aren't always bad—for small tables or when you're fetching most rows, they can be faster than index scans. PostgreSQL's query planner decides which to use.
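You can watch the planner change its mind as soon as an index exists. A minimal sketch, assuming the `posts` table from this series (the plan lines in comments are abridged; exact output depends on your data and PostgreSQL version):

```sql
-- Before any index: the planner has no choice but a full scan
EXPLAIN SELECT * FROM posts WHERE status = 'published';
--   Seq Scan on posts
--     Filter: (status = 'published')

-- Add an index and refresh statistics
CREATE INDEX idx_posts_status ON posts(status);
ANALYZE posts;

-- Now the planner can jump straight to matching rows
EXPLAIN SELECT * FROM posts WHERE status = 'published';
--   Bitmap Heap Scan on posts
--     Recheck Cond: (status = 'published')
--     ->  Bitmap Index Scan on idx_posts_status
```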

***

## B-Tree Indexes: The Default Workhorse

The default index type in PostgreSQL is a **B-Tree** (Balanced Tree). It handles:

* `=`, `<`, `>`, `<=`, `>=` comparisons
* `BETWEEN` ranges
* `IN` lists
* `LIKE 'prefix%'` (prefix matches only, not `'%suffix'`)
* `IS NULL`, `IS NOT NULL`
* `ORDER BY` without a separate sort

```sql
-- Creating indexes on the blog schema
CREATE INDEX idx_posts_status ON posts(status);
CREATE INDEX idx_posts_author_id ON posts(author_id);
CREATE INDEX idx_posts_published_at ON posts(published_at);
CREATE INDEX idx_comments_post_id ON comments(post_id);
CREATE INDEX idx_post_categories_post_id ON post_categories(post_id);
CREATE INDEX idx_post_categories_category_id ON post_categories(category_id);
```

> 💡 **Foreign keys are not automatically indexed in PostgreSQL.** Always add indexes on foreign key columns—they're used in almost every JOIN.

### Naming Convention

```sql
-- Pattern: idx_{table}_{column(s)}
CREATE INDEX idx_posts_status         ON posts(status);
CREATE INDEX idx_posts_author_status  ON posts(author_id, status);

-- Unique indexes
CREATE UNIQUE INDEX uq_authors_email  ON authors(email);
```
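One caveat on the `LIKE 'prefix%'` bullet above: under most non-C collations (e.g. `en_US.UTF-8`), a plain B-Tree cannot serve prefix `LIKE` at all; you need the `text_pattern_ops` operator class. A sketch, assuming a `title` column:

```sql
-- Plain index: serves =, <, >, ORDER BY, but not LIKE under e.g. en_US
CREATE INDEX idx_posts_title ON posts(title);

-- Pattern-ops index: additionally serves LIKE 'prefix%'
CREATE INDEX idx_posts_title_pattern ON posts(title text_pattern_ops);
```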

***

## Other Index Types in PostgreSQL

### Hash Index

For equality comparisons only (`=`); in simple lookups it can be slightly faster than a B-Tree:

```sql
CREATE INDEX idx_posts_status_hash ON posts USING HASH(status);
```

In practice, B-Tree is almost always preferred because it supports range queries too and hash indexes weren't crash-safe before PostgreSQL 10.

### GIN Index (Generalised Inverted Index)

Best for array columns, JSONB, and full-text search—any column where a single row contains multiple values to index:

```sql
-- Full-text search
CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);

-- JSONB column
ALTER TABLE posts ADD COLUMN metadata JSONB;
CREATE INDEX idx_posts_metadata ON posts USING GIN(metadata);

-- Array column
ALTER TABLE posts ADD COLUMN tag_ids INTEGER[];
CREATE INDEX idx_posts_tag_ids ON posts USING GIN(tag_ids);
```

GIN indexes are larger and slower to build but fast for lookups within complex values.
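For reference, these are the kinds of predicates the GIN indexes above can serve (the `metadata` key, the tag value, and the search terms here are made up for illustration; `search_vector` is the tsvector column indexed earlier):

```sql
-- JSONB containment
SELECT id, title FROM posts WHERE metadata @> '{"featured": true}';

-- Array containment: posts carrying tag 7
SELECT id, title FROM posts WHERE tag_ids @> ARRAY[7];

-- Full-text search
SELECT id, title FROM posts
WHERE search_vector @@ to_tsquery('english', 'index & performance');
```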

### GiST Index (Generalised Search Tree)

Used for geometric data, range types, and fuzzy string matching:

```sql
-- Range type example: post scheduled window
ALTER TABLE posts ADD COLUMN publish_window TSTZRANGE;
CREATE INDEX idx_posts_window ON posts USING GIST(publish_window);

-- pg_trgm for similarity/fuzzy search (works with both GiST and GIN)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_posts_title_trgm ON posts USING GIST(title gist_trgm_ops);

-- Now unanchored pattern searches can use the index:
SELECT title FROM posts WHERE title ILIKE '%postgres%';

-- And true fuzzy matching via the trigram similarity operator:
SELECT title FROM posts WHERE title % 'postgrSQL';
```

### BRIN Index (Block Range Index)

Very small, efficient for naturally ordered data in huge tables (timestamps, sequential IDs):

```sql
-- Good for append-heavy timestamp columns in large tables
CREATE INDEX idx_posts_created_brin ON posts USING BRIN(created_at);
```

BRIN indexes are tiny (kilobytes vs megabytes for B-Tree) but only efficient when values are physically stored in order.

***

## Index Strategies for Real Queries

### Always Index Foreign Keys

```sql
-- These should exist on every blog schema
CREATE INDEX idx_posts_author_id      ON posts(author_id);
CREATE INDEX idx_comments_post_id     ON comments(post_id);
CREATE INDEX idx_post_categories_post ON post_categories(post_id);
CREATE INDEX idx_post_categories_cat  ON post_categories(category_id);
CREATE INDEX idx_post_tags_post       ON post_tags(post_id);
CREATE INDEX idx_post_tags_tag        ON post_tags(tag_id);
```

### Index Columns Used in WHERE Clauses

Identify the most common filter conditions:

```sql
-- We frequently filter by status and sort by published_at
CREATE INDEX idx_posts_status_published ON posts(status, published_at DESC);
```

### Index Columns Used in ORDER BY

Without an index, ORDER BY requires a full sort. With the right index, results arrive pre-sorted:

```sql
-- Dashboard: latest published posts (very common query)
-- This index serves both the WHERE and ORDER BY
CREATE INDEX idx_posts_published_at_desc ON posts(published_at DESC)
    WHERE status = 'published';
```
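For completeness, this is the homepage query that index is built for. With the matching partial index in place, PostgreSQL reads the first 20 entries straight off the index, with no separate sort step:

```sql
SELECT id, title, published_at
FROM posts
WHERE status = 'published'
ORDER BY published_at DESC
LIMIT 20;
```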

***

## EXPLAIN and EXPLAIN ANALYZE

`EXPLAIN` shows the query plan. `EXPLAIN ANALYZE` actually executes the query and shows real timing.

```sql
-- Check query plan without executing
EXPLAIN SELECT * FROM posts WHERE status = 'published';

-- Execute and show real timing
EXPLAIN ANALYZE SELECT * FROM posts WHERE status = 'published';

-- Full verbose output
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT p.title, a.name
FROM posts p
JOIN authors a ON p.author_id = a.id
WHERE p.status = 'published'
ORDER BY p.published_at DESC
LIMIT 20;
```

### Reading EXPLAIN Output

```
Limit  (cost=0.29..1.23 rows=20 width=136) (actual time=0.042..0.198 rows=20 loops=1)
  ->  Nested Loop  (cost=0.29..2864.51 rows=60521 width=136)
                   (actual time=0.041..0.195 rows=20 loops=1)
        ->  Index Scan using idx_posts_published_at_desc on posts p
              (cost=0.29..2847.13 rows=60521 width=136)
              (actual time=0.040..0.180 rows=20 loops=1)
              Filter: ((status)::text = 'published'::text)
        ->  Index Scan using authors_pkey on authors a
              (cost=0.15..0.17 rows=1 width=52) (actual time=0.004..0.004 rows=1 loops=20)
              Index Cond: (id = p.author_id)
Planning Time: 0.312 ms
Execution Time: 0.248 ms
```

**Key things to look for:**

| Term               | Meaning                                         |
| ------------------ | ----------------------------------------------- |
| `Seq Scan`         | ⚠️ Reading every row—may need an index          |
| `Index Scan`       | ✅ Using an index                                |
| `Index Only Scan`  | ✅✅ Reading only from index (fastest)            |
| `Bitmap Heap Scan` | ✅ Multiple index matches, then heap lookup      |
| `Hash Join`        | Joining using a hash table                      |
| `Nested Loop`      | Joining row by row (good for small inner sets)  |
| `Merge Join`       | Joining pre-sorted data                         |
| `cost=X..Y`        | Estimated startup..total cost (arbitrary units) |
| `actual time=X..Y` | Real execution time in milliseconds             |
| `rows=N`           | Estimated vs actual row count                   |

**When estimated rows and actual rows diverge significantly**, your table statistics are probably stale. Refresh them:

```sql
ANALYZE posts;
-- Or update all tables:
ANALYZE;
```

***

## Partial Indexes: Index Only What You Need

A partial index only includes rows matching a WHERE condition. Smaller, faster, and perfectly targeted:

```sql
-- 90% of queries only look at published posts
-- Index only those rows
CREATE INDEX idx_posts_pub_date ON posts(published_at DESC)
    WHERE status = 'published';

-- Only index non-deleted records
CREATE INDEX idx_authors_active ON authors(name)
    WHERE deleted_at IS NULL;

-- Index only recent posts for the homepage
-- (index predicates must be immutable, so NOW() is not allowed here;
--  use a fixed cutoff date and rebuild the index periodically)
CREATE INDEX idx_posts_recent ON posts(published_at DESC)
    WHERE status = 'published'
      AND published_at > '2023-01-01';
```

Partial indexes can be **10-100x smaller** than full-table indexes while covering 99% of real queries.
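You can check that claim against your own data by comparing a full index with its partial counterpart on the same column (the index names match the examples created earlier in this article):

```sql
SELECT relname AS index_name,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('idx_posts_published_at', 'idx_posts_pub_date');
```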

***

## Composite Indexes and Column Order

A composite (multi-column) index can serve multiple query patterns, but **column order matters enormously**.

```sql
-- This index:
CREATE INDEX idx_posts_author_status ON posts(author_id, status);

-- Serves these queries:
--   WHERE author_id = 5                        ✅ (leading column match)
--   WHERE author_id = 5 AND status = 'published' ✅ (full index match)

-- But NOT these:
--   WHERE status = 'published'                 ❌ (non-leading column)
```

### The "Equality First, Range Last" Rule

Put equality conditions (=) before range conditions (<, >, BETWEEN) in composite indexes:

```sql
-- Query: WHERE status = 'published' AND published_at > '2024-01-01'
-- Good index:
CREATE INDEX idx_posts_status_date ON posts(status, published_at);
--  status uses equality → put first
--  published_at uses range → put second

-- Inefficient index (range column first):
CREATE INDEX idx_posts_date_status ON posts(published_at, status);
-- PostgreSQL can use the range on published_at but can't narrow by status efficiently
```

***

## Index-Only Scans and Covering Indexes

An **index-only scan** retrieves data solely from the index without touching the main table. This is extremely fast:

```sql
-- Query: SELECT title, published_at FROM posts WHERE status = 'published'
-- With the index below (key: status, published_at; included: title, author_id),
-- this query is answered entirely from the index

CREATE INDEX idx_posts_cover ON posts(status, published_at DESC)
    INCLUDE (title, author_id);
--  status, published_at in the index key (for seeking)
--  title, author_id INCLUDED (available for index-only scans)
```

`INCLUDE` columns (PostgreSQL 11+) don't affect sort order but allow index-only scans for those columns.
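To confirm an index-only scan is actually happening, check the plan. One subtlety: index-only scans also depend on the visibility map, so they work best on recently vacuumed tables. A sketch using the covering index above:

```sql
-- Keep the visibility map current so the heap can be skipped
VACUUM posts;

EXPLAIN SELECT title, published_at
FROM posts
WHERE status = 'published'
ORDER BY published_at DESC;
-- Look for "Index Only Scan using idx_posts_cover" in the output
```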

***

## The Cost of Indexes

Indexes aren't free. Every index:

* **Slows down writes**: INSERT, UPDATE, DELETE must update each index
* **Uses disk space**: An index can be as large as the table itself
* **Needs maintenance**: `VACUUM` and `AUTOVACUUM` spend time on indexes

```sql
-- Check index sizes and usage on the posts table
SELECT
    indexrelname    AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    idx_scan        AS times_used,
    idx_tup_read    AS index_rows_read
FROM pg_stat_user_indexes
WHERE relname = 'posts'
ORDER BY pg_relation_size(indexrelid) DESC;
```
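There is one more operational cost worth knowing: a plain `CREATE INDEX` takes a lock that blocks writes for the duration of the build. On a live table, the `CONCURRENTLY` variant avoids the write lock at the price of a slower build (this sketch assumes the `views` column that appears in later examples):

```sql
-- Build without blocking writes (cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_posts_views ON posts(views);

-- If a concurrent build fails, it leaves an INVALID index behind;
-- drop it and retry
DROP INDEX IF EXISTS idx_posts_views;
```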

### When NOT to Add an Index

* Tables with fewer than \~1,000 rows (seq scan is faster)
* Columns with very low cardinality (e.g., `boolean` or `status` with 2 values in a small table)
* Columns rarely used in WHERE, JOIN, or ORDER BY
* Write-heavy tables where index maintenance outweighs read benefit

***

## Query Optimisation Techniques

### Avoid Functions on Indexed Columns in WHERE

```sql
-- WRONG: function on left side prevents index use
SELECT * FROM posts WHERE LOWER(title) = 'postgresql tips';

-- RIGHT option 1: use a functional index
CREATE INDEX idx_posts_title_lower ON posts(LOWER(title));
SELECT * FROM posts WHERE LOWER(title) = 'postgresql tips'; -- now uses index

-- RIGHT option 2: store data normalised
-- Store title always in lowercase at insert time
```

### Use LIMIT Aggressively

```sql
-- Without LIMIT: sorts ALL published posts, returns 20
SELECT * FROM posts WHERE status = 'published' ORDER BY published_at DESC;

-- With LIMIT: PostgreSQL stops after finding 20
SELECT * FROM posts WHERE status = 'published' ORDER BY published_at DESC LIMIT 20;
```

### Avoid `SELECT *` in Production

```sql
-- Pulls all columns (including large content TEXT)
SELECT * FROM posts WHERE status = 'published';

-- Only fetch what you need
SELECT id, title, published_at, views FROM posts WHERE status = 'published';
```

### Use Connection Pooling

Individual connections are expensive. Use PgBouncer or application-level pooling:

```sql
-- Check current vs maximum connections
SELECT
    (SELECT count(*) FROM pg_stat_activity)          AS used,
    (SELECT setting::int FROM pg_settings
      WHERE name = 'max_connections')                AS max_conn;
```

***

## Monitoring Index Usage

### Find Unused Indexes (Index Bloat)

```sql
-- Indexes that have never been used since the last stats reset
-- (join pg_index to skip primary keys and unique constraints)
SELECT
    s.schemaname,
    s.relname          AS table_name,
    s.indexrelname     AS index_name,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS size,
    s.idx_scan         AS times_used
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisprimary            -- not a primary key
  AND NOT i.indisunique             -- not a unique constraint
ORDER BY pg_relation_size(s.indexrelid) DESC;
```
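Before acting on that list, let an index sit at zero scans across a full business cycle (month-end jobs, seasonal traffic) so you don't drop something a rare report depends on. Then remove it without locking the table (the index name here is a stand-in):

```sql
DROP INDEX CONCURRENTLY IF EXISTS idx_posts_unused;
```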

### Find Slow Queries

```sql
-- Enable pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 slowest queries
SELECT
    LEFT(query, 80)          AS query_snippet,
    calls,
    ROUND(mean_exec_time::numeric, 2)  AS avg_ms,
    ROUND(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### Find Missing Indexes (Sequential Scans on Large Tables)

```sql
SELECT
    relname            AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    n_live_tup         AS row_count
FROM pg_stat_user_tables
WHERE seq_scan > 0
  AND n_live_tup > 10000           -- only tables with significant data
ORDER BY seq_tup_read DESC
LIMIT 20;
```

High `seq_tup_read` on a large table = potential missing index.

***

## What I Learned About Performance

The performance incident that broke my blog was entirely avoidable. The checklist I now follow for every new table:

**When creating a table:**

* [ ] Add indexes on all foreign key columns
* [ ] Add indexes on columns used in the most common WHERE clauses
* [ ] Add composite index if queries frequently filter by 2+ columns together
* [ ] Create partial index if only a subset of rows is queried

**When a query feels slow:**

1. Run `EXPLAIN ANALYZE` first—never guess
2. Look for `Seq Scan` on large tables
3. Check if estimated vs actual row counts diverge significantly (run `ANALYZE`)
4. Add the targeted index
5. Verify improvement with `EXPLAIN ANALYZE` again

**Golden rules:**

* ✅ Always index foreign keys
* ✅ Index columns in WHERE, JOIN ON, and ORDER BY
* ✅ Partial indexes for filtered queries
* ✅ Use `INCLUDE` for covering indexes (PostgreSQL 11+)
* ✅ Monitor with `pg_stat_user_indexes` and `pg_stat_statements`
* ❌ Don't add indexes "just in case"—measure first

***

## Next Steps

With indexes and performance mastered, the next topic is ensuring your data stays consistent even when things go wrong: **Transactions and ACID**.

* [**→ Next: Transactions and ACID**](https://blog.htunnthuthu.com/getting-started/programming/database-101/database-101-transactions-and-acid)
* [**← Previous: Advanced Queries and Joins**](https://blog.htunnthuthu.com/getting-started/programming/database-101/database-101-advanced-queries-and-joins)

***

*Part of the* [*Database 101 Series*](https://blog.htunnthuthu.com/getting-started/programming/database-101)
