Advanced SQL Study Notes

Chapter 1: JOINS and Set Operations

1. Inner Join (Revisited with Advanced Understanding)

An inner join returns rows that have matching values in both tables. At the advanced level, understanding join cardinality and join ordering matters.

Key points:

An inner join requires a matching condition.
Rows without matches in either table are filtered out.
The final cardinality depends on the uniqueness of join keys.

Example:

SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers c
INNER JOIN orders o
    ON c.customer_id = o.customer_id;

Follow-up questions:

What happens if the join key in the orders table is not unique?
How can you detect unintentional many-to-many joins?
How does the database optimizer decide join order?

2. Left Join

A left join returns all rows from the left table and the matching rows from the right table. If there is no match, NULLs appear on the right side.

Important details:

Useful to detect missing relationships.
Preserves all rows from the left table while extending them with optional data.

Example:

SELECT e.employee_id, e.name, d.department_name
FROM employees e
LEFT JOIN departments d
    ON e.department_id = d.department_id;

Follow-up questions:

How do you filter only non-matching rows?
What issues appear when using aggregates after a left join?
How to distinguish between real NULLs and NULLs from unmatched rows?

3. Right Join

A right join is the opposite of a left join. It returns all rows from the right table and matching rows from the left table.

Example:

SELECT p.product_id, p.product_name, s.sales_count
FROM sales_summary s
RIGHT JOIN products p
  ON p.product_id = s.product_id;

Note: Right join is less preferred than left join in industry because left join keeps the “primary” table on the left side.

Follow-up questions:

When might a right join be necessary?
Why do most SQL developers rewrite right joins as left joins?

4. Full Outer Join

A full join returns all rows from both tables, with NULLs where no match exists.

Example:

SELECT a.customer_id, a.balance, b.last_order_date
FROM account_balances a
FULL OUTER JOIN customer_last_orders b
    ON a.customer_id = b.customer_id;

Use cases:

Identifying mismatches between two datasets.
Reconciling logs, audit tables, or imported datasets.

Follow-up questions:

How do you filter only rows that exist exclusively in one table?
Why do some databases not support full outer join natively?

5. Cross Join

A cross join returns the Cartesian product of two tables.

Example:

SELECT c.name, p.plan_name
FROM customers c
CROSS JOIN pricing_plans p;

Uses:

Generating combinations.
Creating time-series expansion.
Statistical analysis requiring all possible pairs.

Follow-up questions:

How do you prevent accidental cross joins?
When is a cross join more efficient than generating combinations in application code?

6. Self Join

A self join is when a table is joined to itself. Mainly used for hierarchical or relational comparisons within the same dataset.

Example:

SELECT e.employee_id, e.name, m.name AS manager_name
FROM employees e
LEFT JOIN employees m
    ON e.manager_id = m.employee_id;

Follow-up questions:

How do you detect circular references in hierarchical data?
When should self-joins be replaced with recursive CTEs?

7. Anti Join

An anti join returns rows in one table that have no match in another.

Implemented using:

LEFT JOIN + WHERE right_key IS NULL
NOT EXISTS
NOT IN (caution: NULL-related issues)

Best pattern:

SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

Follow-up questions:

Why is NOT EXISTS preferred over NOT IN?
How do null values cause logical errors in NOT IN?

8. Semi Join

A semi join returns rows where a match exists, but does not return columns from the right table.

Implemented using:

WHERE EXISTS

Example:

SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

Follow-up questions:

When is EXISTS more efficient than INNER JOIN?
How does a semi join differ from an inner join?

9. Set Operations (UNION, INTERSECT, EXCEPT)

UNION / UNION ALL

UNION removes duplicates.
UNION ALL keeps duplicates.

Example:

SELECT name FROM table1
UNION
SELECT name FROM table2;

Follow-up:

What performance issues arise with UNION (without ALL)?
How does sorting affect UNION?

INTERSECT

Returns rows common in both datasets.

Example:

SELECT email FROM leads
INTERSECT
SELECT email FROM customers;

EXCEPT / MINUS

Returns rows from left that do not exist in right.

Example:

SELECT product_id FROM warehouse1
EXCEPT
SELECT product_id FROM warehouse2;

Follow-up:

How does EXCEPT differ from anti join?
When are set operations faster than joins?

Chapter 2: Window Functions

Window functions allow operations across a set of rows without collapsing them into a single row like aggregates do. These are the foundation of analytical SQL.

1. Concept of Window Functions

Window functions operate “over” a logical window of rows.

Syntax:

function() OVER (
    PARTITION BY ...
    ORDER BY ...
    ROWS or RANGE frame_clause
)

A window function does not reduce row count. It adds an analytical column.

Follow-up:

What is the difference between PARTITION BY and GROUP BY?
Why does a window function require an ORDER BY for ranking?

2. Ranking Functions

Row_Number()

Assigns unique sequential ranking.

Example:

SELECT emp_id, salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rank
FROM employees;

Useful for:

Top-N queries
Deduplication

Rank()

Assigns the same rank for ties; skips next ranks.

Dense_Rank()

Assigns the same rank for ties; does not skip.

Follow-up:

How do ties affect RANK and DENSE_RANK?
When would ROW_NUMBER cause unintended row loss in deduplication?

3. Aggregate Window Functions

Window versions of SUM, AVG, MIN, MAX, COUNT.

Example:

SELECT date, sales,
       SUM(sales) OVER (ORDER BY date) AS running_total
FROM daily_sales;

Key concepts:

Running totals
Moving averages
Percentage contributions

Follow-up:

Why does missing ORDER BY cause incorrect analytics?
How does the frame clause modify results?

4. LAG and LEAD

Used for comparing current row with previous or next rows.

Example:

SELECT date, sales,
       LAG(sales) OVER (ORDER BY date) AS prev_sales
FROM daily_sales;

Applications:

Difference calculations
Trend analysis
Sessionizing logs

Follow-up:

What happens when there is no previous row?
How does LAG differ from self join?

5. Window Frames (ROWS vs RANGE)

Window frame defines the subset of rows used for calculations.

Common frames:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
RANGE BETWEEN INTERVAL
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

Example:

SUM(amount) OVER (
   ORDER BY transaction_time
   ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
)

ROWS frame operates on physical row positions. RANGE frame operates on value-based ranges.

Follow-up:

Why do RANGE frames cause larger windows on repeated values?
How does the default frame differ across databases?

Chapter 3: Advanced Aggregations

1. Grouping Sets

Allows multiple grouping combinations in a single query.

Example:

SELECT region, product, SUM(sales)
FROM sales
GROUP BY GROUPING SETS (
    (region),
    (product),
    (region, product)
);

Follow-up:

How do grouping sets reduce multiple queries?
How to identify subtotal and grand-total rows?

2. Rollup

Generates hierarchical subtotals.

Example:

SELECT region, product, SUM(sales)
FROM sales
GROUP BY ROLLUP (region, product);

Produces:

region + product
region subtotal
grand total

3. Cube

Generates all combinations of grouping sets.

Example:

SELECT region, product, quarter, SUM(sales)
FROM sales
GROUP BY CUBE (region, product, quarter);

Follow-up:

Why is cube expensive?
When should cube be avoided?

Chapter 4: Recursive CTEs

1. Concept

Recursive CTEs are used for hierarchical data, graph traversal, nested categories, and generating sequences.

Structure:

WITH RECURSIVE cte_name AS (
    anchor_query
    UNION ALL
    recursive_query
)
SELECT * FROM cte_name;

2. Hierarchical Example

WITH RECURSIVE hierarchy AS (
    SELECT id, manager_id, name
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    SELECT e.id, e.manager_id, e.name
    FROM employees e
    JOIN hierarchy h
    ON e.manager_id = h.id
)
SELECT * FROM hierarchy;

Follow-up:

How do you prevent infinite recursion?
What is the maximum depth?
How to detect loops in hierarchy?

Chapter 5: SQL on Semi-Structured Data (JSON, Arrays)

1. JSON Extraction

Example:

SELECT
    order_id,
    order_data->>'customer_name' AS customer,
    (order_data->'items')->0->>'product_name' AS first_item
FROM orders;

Key concepts:

JSON operators differ by database
Indexing JSON for performance
Converting JSON arrays to rows using json_each/json_table

Follow-up:

What is the difference between -> and -»?
How to index JSON columns?

2. Arrays

Example:

SELECT customer_id,
       UNNEST(preferences) AS preference
FROM customer_profiles;

Follow-up:

Chapter 6: Query Optimization Foundations

This part introduces optimization fundamentals but does not go into the tuning playbook yet.

1. Importance of EXPLAIN / EXPLAIN ANALYZE

Use EXPLAIN to understand:

Join order
Index usage
Table scans
Filter pushdown
Cardinality estimates

Example:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE amount > 500;

Follow-up:

2. Indexing Strategies

Types:

B-tree index
Hash index
Bitmap index
Columnstore index

When to index:

High-cardinality columns
Columns frequently used in WHERE, JOIN, ORDER BY

When not to index:

Highly volatile columns
Low-cardinality boolean flags

Follow-up: