# Advanced Queries and Joins

*My journey from simple SELECTs to powerful multi-table mastery*

***

## Table of Contents

1. [Introduction: The Query That Made Me Realise I Knew Nothing](#introduction-the-query-that-made-me-realise-i-knew-nothing)
2. [Recap: JOIN Types Refresher](#recap-join-types-refresher)
3. [Advanced JOIN Techniques](#advanced-join-techniques)
4. [Subqueries: Queries Within Queries](#subqueries-queries-within-queries)
5. [Common Table Expressions (CTEs)](#common-table-expressions-ctes)
6. [Window Functions](#window-functions)
7. [UNION, INTERSECT, and EXCEPT](#union-intersect-and-except)
8. [Advanced Filtering and Aggregation](#advanced-filtering-and-aggregation)
9. [Full-Text Search in PostgreSQL](#full-text-search-in-postgresql)
10. [Query Patterns for the Blog System](#query-patterns-for-the-blog-system)
11. [Common Mistakes and How to Avoid Them](#common-mistakes-and-how-to-avoid-them)
12. [Performance Implications](#performance-implications)
13. [What I Learned About Advanced Queries](#what-i-learned-about-advanced-queries)
14. [Next Steps](#next-steps)

***

## Introduction: The Query That Made Me Realise I Knew Nothing

I thought I had SQL figured out. I could do `SELECT`, `INSERT`, `UPDATE`, `DELETE`. I knew `WHERE` and `GROUP BY`. My blog was running, queries were returning data. Life was good.

Then a user asked: *"Can you show me the top 5 most commented posts from the last 30 days, but only from authors who have published at least 3 posts this year, ranked by a combination of comments and views?"*

I sat down to write the query and froze. I could see the data in my head—it was all there in the database—but I had no idea how to express that question in SQL.

I cobbled together something with five separate queries, pulled the results into Python, and filtered them manually. It worked. It was also 200 lines of code and took 3 seconds to run.

A senior engineer on my team looked at it and rewrote it as a single, elegant 20-line SQL query using CTEs and window functions. It ran in 12 milliseconds.

That was the day I committed to learning **advanced SQL**. This article is everything I wish someone had shown me back then.

***

## Recap: JOIN Types Refresher

Before diving in, let's quickly visualise the JOIN types we covered in the previous article:

{% @mermaid/diagram content="graph LR
A\["INNER JOIN\n(matching rows only)"]
B\["LEFT JOIN\n(all left + matching right)"]
C\["RIGHT JOIN\n(matching left + all right)"]
D\["FULL OUTER JOIN\n(all rows from both)"]
E\["CROSS JOIN\n(cartesian product)"]
F\["SELF JOIN\n(table joined to itself)"]" %}

We have the following blog tables to work with throughout this article:

```sql
-- Our blog schema reference
-- authors (id, name, email, bio, created_at)
-- posts (id, author_id, title, content, status, views, published_at, created_at)
-- comments (id, post_id, author_name, body, created_at)
-- categories (id, name, slug)
-- post_categories (post_id, category_id)
-- tags (id, name)
-- post_tags (post_id, tag_id)
```

***

## Advanced JOIN Techniques

### Self JOIN

A self join joins a table to itself. Useful when rows have parent-child relationships.

Imagine our authors table has a `mentor_id` column pointing to another author:

```sql
-- Add mentor relationship to authors
ALTER TABLE authors ADD COLUMN mentor_id INTEGER REFERENCES authors(id);

-- Find all authors and their mentors
SELECT
    a.name        AS author_name,
    m.name        AS mentor_name
FROM authors a
LEFT JOIN authors m ON a.mentor_id = m.id
ORDER BY a.name;
```

```
 author_name  | mentor_name
--------------+-------------
 Alice Wong   | Bob Smith
 Bob Smith    | NULL
 Carol Davis  | Bob Smith
```

Bob is a mentor but has no mentor himself (senior author).

### Multiple JOINs in One Query

Real queries often join 3, 4, or more tables:

```sql
-- Posts with author name, all their categories, and comment count
SELECT
    p.id,
    p.title,
    a.name                      AS author,
    STRING_AGG(c.name, ', ')    AS categories,
    COUNT(DISTINCT cm.id)       AS comment_count
FROM posts p
JOIN authors a          ON p.author_id = a.id
LEFT JOIN post_categories pc ON p.id = pc.post_id
LEFT JOIN categories c  ON pc.category_id = c.id
LEFT JOIN comments cm   ON p.id  = cm.post_id
WHERE p.status = 'published'
GROUP BY p.id, p.title, a.name
ORDER BY comment_count DESC
LIMIT 10;
```

**Key insight**: Use `LEFT JOIN` for optional relationships (a post may have no categories or no comments). Use `INNER JOIN` when the relationship is mandatory.

### LATERAL JOIN

`LATERAL` allows a subquery on the right-hand side to reference columns from tables on the left. Think of it as a correlated subquery in a JOIN position:

```sql
-- For each author, get their 3 most recent posts
SELECT
    a.name,
    recent.title,
    recent.published_at
FROM authors a,
LATERAL (
    SELECT title, published_at
    FROM posts
    WHERE author_id = a.id
      AND status = 'published'
    ORDER BY published_at DESC
    LIMIT 3
) AS recent;
```

Without `LATERAL`, this would require a complex window function or multiple queries. `LATERAL` makes it elegant.

***

## Subqueries: Queries Within Queries

### Scalar Subquery

Returns a single value, usable anywhere an expression is expected:

```sql
-- Posts that have more comments than average
SELECT title, 
    (SELECT COUNT(*) FROM comments WHERE post_id = p.id) AS comment_count
FROM posts p
WHERE (
    SELECT COUNT(*) FROM comments WHERE post_id = p.id
) > (
    SELECT AVG(comment_count)
    FROM (
        SELECT COUNT(*) AS comment_count
        FROM comments
        GROUP BY post_id
    ) sub
);
```

> ⚠️ **Performance note**: Correlated scalar subqueries execute once per row. For large tables, use JOINs or CTEs instead.

### IN and EXISTS Subqueries

```sql
-- Posts that have at least one comment (using IN)
SELECT id, title
FROM posts
WHERE id IN (SELECT DISTINCT post_id FROM comments);

-- Same query using EXISTS (usually more efficient)
SELECT id, title
FROM posts p
WHERE EXISTS (
    SELECT 1 FROM comments c WHERE c.post_id = p.id
);

-- Posts with NO comments (NOT EXISTS)
SELECT id, title
FROM posts p
WHERE NOT EXISTS (
    SELECT 1 FROM comments c WHERE c.post_id = p.id
);
```

`EXISTS` is generally faster than `IN` because it short-circuits on the first match.

### Derived Tables (Subquery in FROM)

```sql
-- Average comments per category
SELECT
    c.name            AS category,
    ROUND(AVG(post_comments.cnt), 2) AS avg_comments
FROM categories c
JOIN post_categories pc ON c.id = pc.category_id
JOIN (
    SELECT post_id, COUNT(*) AS cnt
    FROM comments
    GROUP BY post_id
) AS post_comments ON pc.post_id = post_comments.post_id
GROUP BY c.name
ORDER BY avg_comments DESC;
```

Derived tables are great but can only be referenced once. CTEs are better for reuse.

***

## Common Table Expressions (CTEs)

CTEs (introduced with `WITH`) make complex queries readable and allow reuse of intermediate results.

### Basic CTE

```sql
-- Find authors with above-average post counts
WITH author_post_counts AS (
    SELECT
        author_id,
        COUNT(*) AS post_count
    FROM posts
    WHERE status = 'published'
    GROUP BY author_id
),
avg_count AS (
    SELECT AVG(post_count) AS mean FROM author_post_counts
)
SELECT
    a.name,
    apc.post_count,
    ROUND(avg_count.mean, 2) AS average
FROM authors a
JOIN author_post_counts apc ON a.id = apc.author_id
CROSS JOIN avg_count
WHERE apc.post_count > avg_count.mean
ORDER BY apc.post_count DESC;
```

### Chained CTEs

You can reference earlier CTEs in later ones:

```sql
-- Top authors by engagement score (views + comments weighted)
WITH post_stats AS (
    SELECT
        p.id,
        p.author_id,
        p.views,
        COUNT(c.id)                            AS comment_count,
        p.views + (COUNT(c.id) * 10)           AS engagement_score
    FROM posts p
    LEFT JOIN comments c ON p.id = c.post_id
    WHERE p.status = 'published'
    GROUP BY p.id, p.author_id, p.views
),
author_engagement AS (
    SELECT
        author_id,
        SUM(engagement_score)          AS total_engagement,
        COUNT(*)                       AS published_posts,
        ROUND(AVG(engagement_score), 0) AS avg_engagement
    FROM post_stats
    GROUP BY author_id
)
SELECT
    a.name,
    ae.published_posts,
    ae.total_engagement,
    ae.avg_engagement
FROM authors a
JOIN author_engagement ae ON a.id = ae.author_id
ORDER BY ae.total_engagement DESC
LIMIT 5;
```

### Recursive CTEs

Powerful for hierarchical data (categories with subcategories, comment threads, org charts):

```sql
-- Build a threaded comment tree
-- Assumes comments have a parent_id column
ALTER TABLE comments ADD COLUMN parent_id INTEGER REFERENCES comments(id);

WITH RECURSIVE comment_tree AS (
    -- Base case: top-level comments
    SELECT
        id,
        post_id,
        author_name,
        body,
        parent_id,
        1          AS depth,
        ARRAY[id]  AS path
    FROM comments
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive case: replies
    SELECT
        c.id,
        c.post_id,
        c.author_name,
        c.body,
        c.parent_id,
        ct.depth + 1,
        ct.path || c.id
    FROM comments c
    JOIN comment_tree ct ON c.parent_id = ct.id
)
SELECT
    REPEAT('  ', depth - 1) || author_name AS author,
    body,
    depth
FROM comment_tree
WHERE post_id = 42
ORDER BY path;
```

***

## Window Functions

Window functions perform calculations across a set of rows related to the current row—without collapsing them into groups like `GROUP BY` does.

{% @mermaid/diagram content="graph TD
A\["Regular Aggregate\n(collapses rows)"] --> B\["1 row per group"]
C\["Window Function\n(retains all rows)"] --> D\["All rows + aggregate value"]" %}

### RANK and ROW\_NUMBER

```sql
-- Rank posts by views within each category
SELECT
    c.name           AS category,
    p.title,
    p.views,
    RANK() OVER (
        PARTITION BY c.id
        ORDER BY p.views DESC
    )                AS rank_in_category,
    ROW_NUMBER() OVER (
        PARTITION BY c.id
        ORDER BY p.views DESC
    )                AS row_num
FROM posts p
JOIN post_categories pc ON p.id = pc.post_id
JOIN categories c       ON pc.category_id = c.id
WHERE p.status = 'published';
```

**Difference**: `RANK()` gives ties the same rank (1, 1, 3...). `ROW_NUMBER()` always gives unique sequential numbers (1, 2, 3...). `DENSE_RANK()` gives ties the same rank without gaps (1, 1, 2...).

### Running Totals and Moving Averages

```sql
-- Running total of posts published per month
SELECT
    DATE_TRUNC('month', published_at)  AS month,
    COUNT(*)                            AS posts_this_month,
    SUM(COUNT(*)) OVER (
        ORDER BY DATE_TRUNC('month', published_at)
    )                                   AS running_total
FROM posts
WHERE status = 'published'
GROUP BY DATE_TRUNC('month', published_at)
ORDER BY month;
```

```sql
-- 3-month moving average of post views
SELECT
    DATE_TRUNC('month', published_at)       AS month,
    SUM(views)                              AS monthly_views,
    ROUND(AVG(SUM(views)) OVER (
        ORDER BY DATE_TRUNC('month', published_at)
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 0)                                   AS moving_avg_3m
FROM posts
WHERE status = 'published'
GROUP BY DATE_TRUNC('month', published_at)
ORDER BY month;
```

### LAG and LEAD

Access values from previous or next rows:

```sql
-- Month-over-month post growth
WITH monthly AS (
    SELECT
        DATE_TRUNC('month', published_at) AS month,
        COUNT(*)                          AS posts
    FROM posts
    WHERE status = 'published'
    GROUP BY 1
)
SELECT
    month,
    posts,
    LAG(posts) OVER (ORDER BY month)  AS prev_month_posts,
    posts - LAG(posts) OVER (ORDER BY month) AS growth,
    ROUND(
        100.0 * (posts - LAG(posts) OVER (ORDER BY month))
             / NULLIF(LAG(posts) OVER (ORDER BY month), 0),
        1
    )                                  AS growth_pct
FROM monthly
ORDER BY month;
```

### FIRST\_VALUE and LAST\_VALUE

```sql
-- Compare each post's views to the most-viewed in its category
SELECT
    c.name                                          AS category,
    p.title,
    p.views,
    FIRST_VALUE(p.views) OVER (
        PARTITION BY c.id
        ORDER BY p.views DESC
    )                                               AS category_max_views,
    ROUND(100.0 * p.views /
        FIRST_VALUE(p.views) OVER (
            PARTITION BY c.id
            ORDER BY p.views DESC
        ), 1)                                       AS pct_of_max
FROM posts p
JOIN post_categories pc ON p.id = pc.post_id
JOIN categories c       ON pc.category_id = c.id
WHERE p.status = 'published';
```

***

## UNION, INTERSECT, and EXCEPT

These set operators combine results from multiple SELECT statements.

```sql
-- All people who have interacted with the blog (authors OR commenters)
SELECT name, 'author'   AS role FROM authors
UNION
SELECT author_name, 'commenter' FROM comments;

-- UNION ALL keeps duplicates (faster, use when duplicates are OK or impossible)
SELECT name FROM authors
UNION ALL
SELECT author_name FROM comments;

-- Authors who have also left comments (by name match)
SELECT name FROM authors
INTERSECT
SELECT author_name FROM comments;

-- Authors who have never commented
SELECT name FROM authors
EXCEPT
SELECT author_name FROM comments;
```

**Rules for set operators:**

* Same number of columns in both SELECT statements
* Corresponding columns must have compatible data types
* Column names come from the first SELECT

***

## Advanced Filtering and Aggregation

### FILTER Clause

Apply aggregate functions conditionally without pivoting or CASE expressions:

```sql
-- Posts statistics broken down by status
SELECT
    author_id,
    COUNT(*)                                   AS total_posts,
    COUNT(*) FILTER (WHERE status = 'published')   AS published,
    COUNT(*) FILTER (WHERE status = 'draft')        AS drafts,
    SUM(views) FILTER (WHERE status = 'published')  AS total_views,
    ROUND(AVG(views) FILTER (WHERE status = 'published'), 0) AS avg_views
FROM posts
GROUP BY author_id
ORDER BY published DESC;
```

### GROUPING SETS, ROLLUP, CUBE

Generate multiple levels of aggregation in one query:

```sql
-- Post counts at multiple rollup levels
SELECT
    COALESCE(a.name, 'ALL AUTHORS')   AS author,
    COALESCE(c.name, 'ALL CATEGORIES') AS category,
    COUNT(p.id)                        AS post_count
FROM posts p
JOIN authors a          ON p.author_id = a.id
LEFT JOIN post_categories pc ON p.id = pc.post_id
LEFT JOIN categories c  ON pc.category_id = c.id
WHERE p.status = 'published'
GROUP BY ROLLUP(a.name, c.name)
ORDER BY a.name NULLS LAST, c.name NULLS LAST;
```

`ROLLUP(a, b)` generates: `(a, b)`, `(a)`, `()` — subtotals and grand total.\
`CUBE(a, b)` generates all combinations: `(a, b)`, `(a)`, `(b)`, `()`.

***

## Full-Text Search in PostgreSQL

PostgreSQL has built-in full-text search that's far more powerful than `LIKE`:

```sql
-- Add a full-text search vector column
ALTER TABLE posts ADD COLUMN search_vector tsvector;

-- Populate it
UPDATE posts
SET search_vector = to_tsvector('english', title || ' ' || COALESCE(content, ''));

-- Create a GIN index for fast search
CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);

-- Search for posts containing 'postgresql' and 'performance'
SELECT
    id,
    title,
    ts_rank(search_vector, query)           AS rank
FROM posts,
     to_tsquery('english', 'postgresql & performance') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 10;
```

```sql
-- Automatic maintenance with a trigger
CREATE OR REPLACE FUNCTION update_post_search_vector()
RETURNS TRIGGER AS $$
BEGIN
    NEW.search_vector := to_tsvector('english',
        NEW.title || ' ' || COALESCE(NEW.content, '')
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_post_search
BEFORE INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE FUNCTION update_post_search_vector();
```

***

## Query Patterns for the Blog System

### Pattern 1: Dashboard Overview

```sql
-- Blog dashboard stats in a single query
WITH stats AS (
    SELECT
        (SELECT COUNT(*) FROM authors)                      AS total_authors,
        (SELECT COUNT(*) FROM posts WHERE status = 'published') AS published_posts,
        (SELECT COUNT(*) FROM posts WHERE status = 'draft')     AS draft_posts,
        (SELECT COUNT(*) FROM comments)                     AS total_comments,
        (SELECT SUM(views) FROM posts WHERE status = 'published') AS total_views
)
SELECT * FROM stats;
```

### Pattern 2: Author Leaderboard

```sql
-- Author performance leaderboard
SELECT
    a.name,
    COUNT(DISTINCT p.id)                              AS posts,
    COALESCE(SUM(p.views), 0)                         AS total_views,
    COALESCE(COUNT(DISTINCT c.id), 0)                 AS total_comments,
    COALESCE(SUM(p.views), 0) + COALESCE(COUNT(DISTINCT c.id), 0) * 5 AS score,
    RANK() OVER (ORDER BY
        COALESCE(SUM(p.views), 0) + COALESCE(COUNT(DISTINCT c.id), 0) * 5
        DESC
    )                                                 AS rank
FROM authors a
LEFT JOIN posts p    ON a.id = p.author_id AND p.status = 'published'
LEFT JOIN comments c ON p.id = c.post_id
GROUP BY a.id, a.name
ORDER BY score DESC;
```

### Pattern 3: Related Posts

```sql
-- Find related posts (sharing at least 2 categories)
WITH post_cats AS (
    SELECT post_id, array_agg(category_id) AS categories
    FROM post_categories
    GROUP BY post_id
)
SELECT
    p.id,
    p.title,
    COUNT(DISTINCT shared.category_id)  AS shared_categories
FROM posts p
JOIN post_categories pc      ON p.id = pc.post_id
JOIN post_categories shared  ON pc.category_id = shared.category_id
                              AND shared.post_id != p.id
WHERE shared.post_id = 42      -- The reference post
  AND p.status = 'published'
GROUP BY p.id, p.title
HAVING COUNT(DISTINCT shared.category_id) >= 2
ORDER BY shared_categories DESC
LIMIT 5;
```

***

## Common Mistakes and How to Avoid Them

### 1. N+1 Query Problem

❌ **Wrong**: Fetching posts then querying comments for each post individually (N+1 queries).

✅ **Right**: Use a JOIN or a single IN query:

```sql
-- One query for posts + their comment counts
SELECT p.id, p.title, COUNT(c.id) AS comment_count
FROM posts p
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.status = 'published'
GROUP BY p.id, p.title;
```

### 2. Forgetting NULL Handling

```sql
-- This misses NULL author_ids
SELECT * FROM posts WHERE author_id != 5;

-- Correct: handle NULLs explicitly
SELECT * FROM posts WHERE author_id != 5 OR author_id IS NULL;
```

### 3. HAVING vs WHERE Confusion

```sql
-- WRONG: Can't use aggregate in WHERE
SELECT author_id, COUNT(*) FROM posts
WHERE COUNT(*) > 5
GROUP BY author_id;

-- CORRECT: Use HAVING for aggregate conditions
SELECT author_id, COUNT(*) AS post_count
FROM posts
GROUP BY author_id
HAVING COUNT(*) > 5;
```

### 4. Implicit vs Explicit JOINs

```sql
-- Avoid implicit join (old style, easy to create accidental cartesian product)
SELECT p.title, a.name
FROM posts p, authors a
WHERE p.author_id = a.id;

-- Use explicit JOIN (recommended)
SELECT p.title, a.name
FROM posts p
JOIN authors a ON p.author_id = a.id;
```

***

## Performance Implications

| Technique              | Performance Notes                                              |
| ---------------------- | -------------------------------------------------------------- |
| `EXISTS` vs `IN`       | `EXISTS` is usually faster for large subqueries                |
| CTE vs Subquery        | PostgreSQL 12+ optimises CTEs like subqueries by default       |
| Window Functions       | Efficient but require full sort; add appropriate indexes       |
| `UNION ALL` vs `UNION` | `UNION ALL` is faster (skips deduplication)                    |
| Recursive CTE          | Can be slow for deep hierarchies; add depth limits             |
| Full-text search       | Requires GIN index; `LIKE '%text%'` doesn't use B-tree indexes |

Always verify with `EXPLAIN ANALYZE`:

```sql
EXPLAIN ANALYZE
SELECT p.title, a.name, COUNT(c.id)
FROM posts p
JOIN authors a ON p.author_id = a.id
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.status = 'published'
GROUP BY p.id, p.title, a.name
ORDER BY COUNT(c.id) DESC
LIMIT 10;
```

***

## What I Learned About Advanced Queries

Looking back at that original complex question that stumped me, I now know exactly how to approach it:

1. **Break it down into CTEs** — each CTE answers one sub-question
2. **Use window functions** for ranking and running calculations without collapsing rows
3. **Combine set operations** for union/difference logic
4. **Validate with EXPLAIN ANALYZE** before deploying to production

The biggest shift was realising that SQL is a **declarative** language—you describe *what* you want, not *how* to get it. Once I stopped thinking in loops and started thinking in sets and relations, complex queries became intuitive.

**Key principles:**

* ✅ Favour JOINs over correlated subqueries for performance
* ✅ Use CTEs to make queries readable and maintainable
* ✅ `EXISTS` beats `IN` for large datasets
* ✅ Window functions replace many application-level calculations
* ✅ Always test with `EXPLAIN ANALYZE`

***

## Next Steps

With advanced queries under your belt, the next frontier is **Indexes and Performance**—understanding how PostgreSQL finds your data and how to make it dramatically faster.

* [**→ Next: Indexes and Performance**](https://blog.htunnthuthu.com/getting-started/programming/database-101/database-101-indexes-and-performance)
* [**← Previous: Relationships and Data Integrity**](https://blog.htunnthuthu.com/getting-started/programming/database-101/database-101-relationships-and-data-integrity)

***

*Part of the* [*Database 101 Series*](https://blog.htunnthuthu.com/getting-started/programming/database-101)
