Advanced Queries and Joins

My journey from simple SELECTs to powerful multi-table mastery

Introduction: The Query That Made Me Realise I Knew Nothing

I thought I had SQL figured out. I could do SELECT, INSERT, UPDATE, DELETE. I knew WHERE and GROUP BY. My blog was running, queries were returning data. Life was good.

Then a user asked: "Can you show me the top 5 most commented posts from the last 30 days, but only from authors who have published at least 3 posts this year, ranked by a combination of comments and views?"

I sat down to write the query and froze. I could see the data in my head—it was all there in the database—but I had no idea how to express that question in SQL.

I cobbled together something with five separate queries, pulled the results into Python, and filtered them manually. It worked. It was also 200 lines of code and took 3 seconds to run.

A senior engineer on my team looked at it and rewrote it as a single, elegant 20-line SQL query using CTEs and window functions. It ran in 12 milliseconds.

That was the day I committed to learning advanced SQL. This article is everything I wish someone had shown me back then.

Recap: JOIN Types Refresher

Before diving in, let's quickly visualise the JOIN types we covered in the previous article:

We have the following blog tables to work with throughout this article:

-- Our blog schema reference
-- authors (id, name, email, bio, created_at)
-- posts (id, author_id, title, content, status, views, published_at, created_at)
-- comments (id, post_id, author_name, body, created_at)
-- categories (id, name, slug)
-- post_categories (post_id, category_id)
-- tags (id, name)
-- post_tags (post_id, tag_id)

Advanced JOIN Techniques

Self JOIN

A self join joins a table to itself. Useful when rows have parent-child relationships.

Imagine our authors table has a mentor_id column pointing to another author:

-- Add mentor relationship to authors
ALTER TABLE authors ADD COLUMN mentor_id INTEGER REFERENCES authors(id);

-- Find all authors and their mentors
SELECT
    a.name        AS author_name,
    m.name        AS mentor_name
FROM authors a
LEFT JOIN authors m ON a.mentor_id = m.id
ORDER BY a.name;

 author_name  | mentor_name
--------------+-------------
 Alice Wong   | Bob Smith
 Bob Smith    | NULL
 Carol Davis  | Bob Smith

Bob is a mentor but has no mentor himself (senior author).

Multiple JOINs in One Query

Real queries often join 3, 4, or more tables:

-- Posts with author name, all their categories, and comment count
SELECT
    p.id,
    p.title,
    a.name                      AS author,
    STRING_AGG(c.name, ', ')    AS categories,
    COUNT(DISTINCT cm.id)       AS comment_count
FROM posts p
JOIN authors a          ON p.author_id = a.id
LEFT JOIN post_categories pc ON p.id = pc.post_id
LEFT JOIN categories c  ON pc.category_id = c.id
LEFT JOIN comments cm   ON p.id  = cm.post_id
WHERE p.status = 'published'
GROUP BY p.id, p.title, a.name
ORDER BY comment_count DESC
LIMIT 10;

Key insight: Use LEFT JOIN for optional relationships (a post may have no categories or no comments). Use INNER JOIN when the relationship is mandatory.

LATERAL JOIN

LATERAL allows a subquery on the right-hand side to reference columns from tables on the left. Think of it as a correlated subquery in a JOIN position:

-- For each author, get their 3 most recent posts
SELECT
    a.name,
    recent.title,
    recent.published_at
FROM authors a,
LATERAL (
    SELECT title, published_at
    FROM posts
    WHERE author_id = a.id
      AND status = 'published'
    ORDER BY published_at DESC
    LIMIT 3
) AS recent;

Without LATERAL, this would require a complex window function or multiple queries. LATERAL makes it elegant.

Subqueries: Queries Within Queries

Scalar Subquery

Returns a single value, usable anywhere an expression is expected:

-- Posts that have more comments than average
SELECT title, 
    (SELECT COUNT(*) FROM comments WHERE post_id = p.id) AS comment_count
FROM posts p
WHERE (
    SELECT COUNT(*) FROM comments WHERE post_id = p.id
) > (
    SELECT AVG(comment_count)
    FROM (
        SELECT COUNT(*) AS comment_count
        FROM comments
        GROUP BY post_id
    ) sub
);

⚠️ Performance note: Correlated scalar subqueries execute once per row. For large tables, use JOINs or CTEs instead.

IN and EXISTS Subqueries

-- Posts that have at least one comment (using IN)
SELECT id, title
FROM posts
WHERE id IN (SELECT DISTINCT post_id FROM comments);

-- Same query using EXISTS (usually more efficient)
SELECT id, title
FROM posts p
WHERE EXISTS (
    SELECT 1 FROM comments c WHERE c.post_id = p.id
);

-- Posts with NO comments (NOT EXISTS)
SELECT id, title
FROM posts p
WHERE NOT EXISTS (
    SELECT 1 FROM comments c WHERE c.post_id = p.id
);

EXISTS is generally faster than IN because it short-circuits on the first match.

Derived Tables (Subquery in FROM)

-- Average comments per category
SELECT
    c.name            AS category,
    ROUND(AVG(post_comments.cnt), 2) AS avg_comments
FROM categories c
JOIN post_categories pc ON c.id = pc.category_id
JOIN (
    SELECT post_id, COUNT(*) AS cnt
    FROM comments
    GROUP BY post_id
) AS post_comments ON pc.post_id = post_comments.post_id
GROUP BY c.name
ORDER BY avg_comments DESC;

Derived tables are great but can only be referenced once. CTEs are better for reuse.

Common Table Expressions (CTEs)

CTEs (introduced with WITH) make complex queries readable and allow reuse of intermediate results.

Basic CTE

-- Find authors with above-average post counts
WITH author_post_counts AS (
    SELECT
        author_id,
        COUNT(*) AS post_count
    FROM posts
    WHERE status = 'published'
    GROUP BY author_id
),
avg_count AS (
    SELECT AVG(post_count) AS mean FROM author_post_counts
)
SELECT
    a.name,
    apc.post_count,
    ROUND(avg_count.mean, 2) AS average
FROM authors a
JOIN author_post_counts apc ON a.id = apc.author_id
CROSS JOIN avg_count
WHERE apc.post_count > avg_count.mean
ORDER BY apc.post_count DESC;

Chained CTEs

You can reference earlier CTEs in later ones:

-- Top authors by engagement score (views + comments weighted)
WITH post_stats AS (
    SELECT
        p.id,
        p.author_id,
        p.views,
        COUNT(c.id)                            AS comment_count,
        p.views + (COUNT(c.id) * 10)           AS engagement_score
    FROM posts p
    LEFT JOIN comments c ON p.id = c.post_id
    WHERE p.status = 'published'
    GROUP BY p.id, p.author_id, p.views
),
author_engagement AS (
    SELECT
        author_id,
        SUM(engagement_score)          AS total_engagement,
        COUNT(*)                       AS published_posts,
        ROUND(AVG(engagement_score), 0) AS avg_engagement
    FROM post_stats
    GROUP BY author_id
)
SELECT
    a.name,
    ae.published_posts,
    ae.total_engagement,
    ae.avg_engagement
FROM authors a
JOIN author_engagement ae ON a.id = ae.author_id
ORDER BY ae.total_engagement DESC
LIMIT 5;

Recursive CTEs

Powerful for hierarchical data (categories with subcategories, comment threads, org charts):

-- Build a threaded comment tree
-- Assumes comments have a parent_id column
ALTER TABLE comments ADD COLUMN parent_id INTEGER REFERENCES comments(id);

WITH RECURSIVE comment_tree AS (
    -- Base case: top-level comments
    SELECT
        id,
        post_id,
        author_name,
        body,
        parent_id,
        1          AS depth,
        ARRAY[id]  AS path
    FROM comments
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive case: replies
    SELECT
        c.id,
        c.post_id,
        c.author_name,
        c.body,
        c.parent_id,
        ct.depth + 1,
        ct.path || c.id
    FROM comments c
    JOIN comment_tree ct ON c.parent_id = ct.id
)
SELECT
    REPEAT('  ', depth - 1) || author_name AS author,
    body,
    depth
FROM comment_tree
WHERE post_id = 42
ORDER BY path;

Window Functions

Window functions perform calculations across a set of rows related to the current row—without collapsing them into groups like GROUP BY does.

RANK and ROW_NUMBER

-- Rank posts by views within each category
SELECT
    c.name           AS category,
    p.title,
    p.views,
    RANK() OVER (
        PARTITION BY c.id
        ORDER BY p.views DESC
    )                AS rank_in_category,
    ROW_NUMBER() OVER (
        PARTITION BY c.id
        ORDER BY p.views DESC
    )                AS row_num
FROM posts p
JOIN post_categories pc ON p.id = pc.post_id
JOIN categories c       ON pc.category_id = c.id
WHERE p.status = 'published';

Difference: RANK() gives ties the same rank (1, 1, 3...). ROW_NUMBER() always gives unique sequential numbers (1, 2, 3...). DENSE_RANK() gives ties the same rank without gaps (1, 1, 2...).

Running Totals and Moving Averages

-- Running total of posts published per month
SELECT
    DATE_TRUNC('month', published_at)  AS month,
    COUNT(*)                            AS posts_this_month,
    SUM(COUNT(*)) OVER (
        ORDER BY DATE_TRUNC('month', published_at)
    )                                   AS running_total
FROM posts
WHERE status = 'published'
GROUP BY DATE_TRUNC('month', published_at)
ORDER BY month;

-- 3-month moving average of post views
SELECT
    DATE_TRUNC('month', published_at)       AS month,
    SUM(views)                              AS monthly_views,
    ROUND(AVG(SUM(views)) OVER (
        ORDER BY DATE_TRUNC('month', published_at)
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 0)                                   AS moving_avg_3m
FROM posts
WHERE status = 'published'
GROUP BY DATE_TRUNC('month', published_at)
ORDER BY month;

LAG and LEAD

Access values from previous or next rows:

-- Month-over-month post growth
WITH monthly AS (
    SELECT
        DATE_TRUNC('month', published_at) AS month,
        COUNT(*)                          AS posts
    FROM posts
    WHERE status = 'published'
    GROUP BY 1
)
SELECT
    month,
    posts,
    LAG(posts) OVER (ORDER BY month)  AS prev_month_posts,
    posts - LAG(posts) OVER (ORDER BY month) AS growth,
    ROUND(
        100.0 * (posts - LAG(posts) OVER (ORDER BY month))
             / NULLIF(LAG(posts) OVER (ORDER BY month), 0),
        1
    )                                  AS growth_pct
FROM monthly
ORDER BY month;

FIRST_VALUE and LAST_VALUE

-- Compare each post's views to the most-viewed in its category
SELECT
    c.name                                          AS category,
    p.title,
    p.views,
    FIRST_VALUE(p.views) OVER (
        PARTITION BY c.id
        ORDER BY p.views DESC
    )                                               AS category_max_views,
    ROUND(100.0 * p.views /
        FIRST_VALUE(p.views) OVER (
            PARTITION BY c.id
            ORDER BY p.views DESC
        ), 1)                                       AS pct_of_max
FROM posts p
JOIN post_categories pc ON p.id = pc.post_id
JOIN categories c       ON pc.category_id = c.id
WHERE p.status = 'published';

UNION, INTERSECT, and EXCEPT

These set operators combine results from multiple SELECT statements.

-- All people who have interacted with the blog (authors OR commenters)
SELECT name, 'author'   AS role FROM authors
UNION
SELECT author_name, 'commenter' FROM comments;

-- UNION ALL keeps duplicates (faster, use when duplicates are OK or impossible)
SELECT name FROM authors
UNION ALL
SELECT author_name FROM comments;

-- Authors who have also left comments (by name match)
SELECT name FROM authors
INTERSECT
SELECT author_name FROM comments;

-- Authors who have never commented
SELECT name FROM authors
EXCEPT
SELECT author_name FROM comments;

Rules for set operators:

Same number of columns in both SELECT statements
Corresponding columns must have compatible data types
Column names come from the first SELECT

Advanced Filtering and Aggregation

FILTER Clause

Apply aggregate functions conditionally without pivoting or CASE expressions:

-- Posts statistics broken down by status
SELECT
    author_id,
    COUNT(*)                                   AS total_posts,
    COUNT(*) FILTER (WHERE status = 'published')   AS published,
    COUNT(*) FILTER (WHERE status = 'draft')        AS drafts,
    SUM(views) FILTER (WHERE status = 'published')  AS total_views,
    ROUND(AVG(views) FILTER (WHERE status = 'published'), 0) AS avg_views
FROM posts
GROUP BY author_id
ORDER BY published DESC;

GROUPING SETS, ROLLUP, CUBE

Generate multiple levels of aggregation in one query:

-- Post counts at multiple rollup levels
SELECT
    COALESCE(a.name, 'ALL AUTHORS')   AS author,
    COALESCE(c.name, 'ALL CATEGORIES') AS category,
    COUNT(p.id)                        AS post_count
FROM posts p
JOIN authors a          ON p.author_id = a.id
LEFT JOIN post_categories pc ON p.id = pc.post_id
LEFT JOIN categories c  ON pc.category_id = c.id
WHERE p.status = 'published'
GROUP BY ROLLUP(a.name, c.name)
ORDER BY a.name NULLS LAST, c.name NULLS LAST;

ROLLUP(a, b) generates: (a, b), (a), () — subtotals and grand total. CUBE(a, b) generates all combinations: (a, b), (a), (b), ().

Full-Text Search in PostgreSQL

PostgreSQL has built-in full-text search that's far more powerful than LIKE:

-- Add a full-text search vector column
ALTER TABLE posts ADD COLUMN search_vector tsvector;

-- Populate it
UPDATE posts
SET search_vector = to_tsvector('english', title || ' ' || COALESCE(content, ''));

-- Create a GIN index for fast search
CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);

-- Search for posts containing 'postgresql' and 'performance'
SELECT
    id,
    title,
    ts_rank(search_vector, query)           AS rank
FROM posts,
     to_tsquery('english', 'postgresql & performance') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 10;

-- Automatic maintenance with a trigger
CREATE OR REPLACE FUNCTION update_post_search_vector()
RETURNS TRIGGER AS $$
BEGIN
    NEW.search_vector := to_tsvector('english',
        NEW.title || ' ' || COALESCE(NEW.content, '')
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_post_search
BEFORE INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE FUNCTION update_post_search_vector();

Query Patterns for the Blog System

Pattern 1: Dashboard Overview

-- Blog dashboard stats in a single query
WITH stats AS (
    SELECT
        (SELECT COUNT(*) FROM authors)                      AS total_authors,
        (SELECT COUNT(*) FROM posts WHERE status = 'published') AS published_posts,
        (SELECT COUNT(*) FROM posts WHERE status = 'draft')     AS draft_posts,
        (SELECT COUNT(*) FROM comments)                     AS total_comments,
        (SELECT SUM(views) FROM posts WHERE status = 'published') AS total_views
)
SELECT * FROM stats;

Pattern 2: Author Leaderboard

-- Author performance leaderboard
SELECT
    a.name,
    COUNT(DISTINCT p.id)                              AS posts,
    COALESCE(SUM(p.views), 0)                         AS total_views,
    COALESCE(COUNT(DISTINCT c.id), 0)                 AS total_comments,
    COALESCE(SUM(p.views), 0) + COALESCE(COUNT(DISTINCT c.id), 0) * 5 AS score,
    RANK() OVER (ORDER BY
        COALESCE(SUM(p.views), 0) + COALESCE(COUNT(DISTINCT c.id), 0) * 5
        DESC
    )                                                 AS rank
FROM authors a
LEFT JOIN posts p    ON a.id = p.author_id AND p.status = 'published'
LEFT JOIN comments c ON p.id = c.post_id
GROUP BY a.id, a.name
ORDER BY score DESC;

-- Find related posts (sharing at least 2 categories)
WITH post_cats AS (
    SELECT post_id, array_agg(category_id) AS categories
    FROM post_categories
    GROUP BY post_id
)
SELECT
    p.id,
    p.title,
    COUNT(DISTINCT shared.category_id)  AS shared_categories
FROM posts p
JOIN post_categories pc      ON p.id = pc.post_id
JOIN post_categories shared  ON pc.category_id = shared.category_id
                              AND shared.post_id != p.id
WHERE shared.post_id = 42      -- The reference post
  AND p.status = 'published'
GROUP BY p.id, p.title
HAVING COUNT(DISTINCT shared.category_id) >= 2
ORDER BY shared_categories DESC
LIMIT 5;

Common Mistakes and How to Avoid Them

1. N+1 Query Problem

❌ Wrong: Fetching posts then querying comments for each post individually (N+1 queries).

✅ Right: Use a JOIN or a single IN query:

-- One query for posts + their comment counts
SELECT p.id, p.title, COUNT(c.id) AS comment_count
FROM posts p
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.status = 'published'
GROUP BY p.id, p.title;

2. Forgetting NULL Handling

-- This misses NULL author_ids
SELECT * FROM posts WHERE author_id != 5;

-- Correct: handle NULLs explicitly
SELECT * FROM posts WHERE author_id != 5 OR author_id IS NULL;

3. HAVING vs WHERE Confusion

-- WRONG: Can't use aggregate in WHERE
SELECT author_id, COUNT(*) FROM posts
WHERE COUNT(*) > 5
GROUP BY author_id;

-- CORRECT: Use HAVING for aggregate conditions
SELECT author_id, COUNT(*) AS post_count
FROM posts
GROUP BY author_id
HAVING COUNT(*) > 5;

4. Implicit vs Explicit JOINs

-- Avoid implicit join (old style, easy to create accidental cartesian product)
SELECT p.title, a.name
FROM posts p, authors a
WHERE p.author_id = a.id;

-- Use explicit JOIN (recommended)
SELECT p.title, a.name
FROM posts p
JOIN authors a ON p.author_id = a.id;

Performance Implications

Technique

Performance Notes

EXISTS vs IN

EXISTS is usually faster for large subqueries

CTE vs Subquery

PostgreSQL 12+ optimises CTEs like subqueries by default

Window Functions

Efficient but require full sort; add appropriate indexes

UNION ALL vs UNION

UNION ALL is faster (skips deduplication)

Recursive CTE

Can be slow for deep hierarchies; add depth limits

Full-text search

Requires GIN index; LIKE '%text%' doesn't use B-tree indexes

Always verify with EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT p.title, a.name, COUNT(c.id)
FROM posts p
JOIN authors a ON p.author_id = a.id
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.status = 'published'
GROUP BY p.id, p.title, a.name
ORDER BY COUNT(c.id) DESC
LIMIT 10;

What I Learned About Advanced Queries

Looking back at that original complex question that stumped me, I now know exactly how to approach it:

Break it down into CTEs — each CTE answers one sub-question
Use window functions for ranking and running calculations without collapsing rows
Combine set operations for union/difference logic
Validate with EXPLAIN ANALYZE before deploying to production

The biggest shift was realising that SQL is a declarative language—you describe what you want, not how to get it. Once I stopped thinking in loops and started thinking in sets and relations, complex queries became intuitive.

Key principles:

✅ Favour JOINs over correlated subqueries for performance
✅ Use CTEs to make queries readable and maintainable
✅ EXISTS beats IN for large datasets
✅ Window functions replace many application-level calculations
✅ Always test with EXPLAIN ANALYZE

Next Steps

With advanced queries under your belt, the next frontier is Indexes and Performance—understanding how PostgreSQL finds your data and how to make it dramatically faster.

Part of the Database 101 Series

PreviousRelationships and Data Integrity NextIndexes and Performance

Last updated 1 month ago

hashtagTable of Contents

hashtagIntroduction: The Query That Made Me Realise I Knew Nothing

hashtagRecap: JOIN Types Refresher

hashtagAdvanced JOIN Techniques

hashtagSelf JOIN

hashtagMultiple JOINs in One Query

hashtagLATERAL JOIN

hashtagSubqueries: Queries Within Queries

hashtagScalar Subquery

hashtagIN and EXISTS Subqueries

hashtagDerived Tables (Subquery in FROM)

hashtagCommon Table Expressions (CTEs)

hashtagBasic CTE

hashtagChained CTEs

hashtagRecursive CTEs

hashtagWindow Functions

hashtagRANK and ROW_NUMBER

hashtagRunning Totals and Moving Averages

hashtagLAG and LEAD

hashtagFIRST_VALUE and LAST_VALUE

hashtagUNION, INTERSECT, and EXCEPT

hashtagAdvanced Filtering and Aggregation

hashtagFILTER Clause

hashtagGROUPING SETS, ROLLUP, CUBE

hashtagFull-Text Search in PostgreSQL

hashtagQuery Patterns for the Blog System

hashtagPattern 1: Dashboard Overview

hashtagPattern 2: Author Leaderboard

hashtagPattern 3: Related Posts

hashtagCommon Mistakes and How to Avoid Them

hashtag1. N+1 Query Problem

hashtag2. Forgetting NULL Handling

hashtag3. HAVING vs WHERE Confusion

hashtag4. Implicit vs Explicit JOINs

hashtagPerformance Implications

hashtagWhat I Learned About Advanced Queries

hashtagNext Steps

Table of Contents

Introduction: The Query That Made Me Realise I Knew Nothing

Recap: JOIN Types Refresher

Advanced JOIN Techniques

Self JOIN

Multiple JOINs in One Query

LATERAL JOIN

Subqueries: Queries Within Queries

Scalar Subquery

IN and EXISTS Subqueries

Derived Tables (Subquery in FROM)

Common Table Expressions (CTEs)

Basic CTE

Chained CTEs

Recursive CTEs

Window Functions

RANK and ROW_NUMBER

Running Totals and Moving Averages

LAG and LEAD

FIRST_VALUE and LAST_VALUE

UNION, INTERSECT, and EXCEPT

Advanced Filtering and Aggregation

FILTER Clause

GROUPING SETS, ROLLUP, CUBE

Full-Text Search in PostgreSQL

Query Patterns for the Blog System

Pattern 1: Dashboard Overview

Pattern 2: Author Leaderboard

Pattern 3: Related Posts

Common Mistakes and How to Avoid Them

1. N+1 Query Problem

2. Forgetting NULL Handling

3. HAVING vs WHERE Confusion

4. Implicit vs Explicit JOINs

Performance Implications

What I Learned About Advanced Queries

Next Steps