QAVeda Explore QAVeda →
Interview Prep · SQL Sorcery

SQL Sorcery
Interview Questions

Master SQL for software testers and QA engineers. Learn query writing, joins, aggregations, stored procedures and database testing techniques. Includes real SQL interview questions for QA roles.

150
Questions
3
Levels
Expert
Curated

Junior (0–2 years)

1
DDL vs DML

What is the difference between DELETE, TRUNCATE, and DROP?

All three remove data but operate at completely different levels — DELETE removes specific rows, TRUNCATE empties a table in one sweep, and DROP destroys the table itself.

Why it exists:
QA engineers reset test databases constantly — before a test run, between test suites, or when cleaning up after an exploratory session. Using the wrong command is a serious mistake: DELETE is reversible and surgical, TRUNCATE is fast but total, DROP is catastrophic and irreversible. Knowing which to reach for (and when) prevents accidentally wiping production-adjacent data or killing a table structure.

``sql
DELETE FROM Orders WHERE status = 'cancelled'; -- removes only cancelled rows
TRUNCATE TABLE Orders; -- empties the table, keeps structure
DROP TABLE Orders; -- table is gone entirely
``

After DELETE: surviving rows remain. After TRUNCATE: table exists but has zero rows. After DROP: even querying the table name is an error.

Real-world QA use case:
A QA engineer is setting up a fresh test run. She runs TRUNCATE on the test Orders and Users tables to clear previous test data — fast and safe because she wants the empty table structure to remain. She would never use DROP here (that would require recreating the schema) and DELETE would be too slow on 50,000 rows. Before her next exploratory session she uses DELETE with a WHERE clause to remove only the records her specific test created, leaving other testers' data intact.

Rule of thumb: DELETE for precise row removal (reversible), TRUNCATE to empty a table fast (keep the structure), DROP to remove the table permanently (no undo).
💡 Plain English: Think of a whiteboard. **DELETE** is rubbing out specific words one at a time — slow, but you can stop and undo. **TRUNCATE** is one big swipe that clears the whole board in a second. **DROP** is throwing the entire whiteboard in the bin — board, frame, markers and all.
2
Query Writing

Write a query to find the second-highest salary from an Employee table. What should it return if there is no second salary?

-- Employee
-- +----+--------+
-- | id | salary |
-- +----+--------+
-- |  1 |    100 |
-- |  2 |    200 |
-- |  3 |    300 |
-- +----+--------+
Find the largest salary that is strictly less than the overall maximum — and wrap it in MAX() so the query returns NULL (not zero rows) when no second salary exists.

Why it exists:
"Nth highest value" is one of the most common SQL interview problems because it tests whether you understand subqueries, window functions, and NULL handling simultaneously. In QA work, the same pattern appears when validating ranking logic: confirming a leaderboard shows the correct second-place score, or verifying a "runner-up" report returns the right value when only one record exists.

``sql
-- Approach 1: subquery with MAX (most portable, returns NULL if no second)
SELECT MAX(salary) AS SecondHighestSalary
FROM Employee
WHERE salary < (SELECT MAX(salary) FROM Employee);

-- Approach 2: OFFSET (MySQL/Postgres) — returns nothing, not NULL, if no second
SELECT DISTINCT salary
FROM Employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

-- Approach 3: DENSE_RANK for Nth highest (generalises to any N)
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM Employee
) t
WHERE rnk = 2;
``

Real-world QA use case:
A QA engineer is verifying a "Top Earners" report that should show the second-highest paid employee when the top earner is filtered out. She writes the subquery approach against the test database and confirms the report UI matches — and then checks the edge case: what does the page display when only one employee exists? The MAX() approach confirms the query returns NULL, which the UI should show as "N/A" rather than crashing.

Rule of thumb: Use MAX() inside the WHERE subquery for portability and correct NULL behaviour; use DENSE_RANK() when you need the Nth highest for any N.
💡 Plain English: Picture a race podium. Gold goes to MAX(). To find silver, you ask: "who's fastest if we ignore the gold winner?" — that's exactly what `WHERE salary < (SELECT MAX...)` does. Hide the winner, grab the next-best.
3
Aggregation

Write a query to find all duplicate email addresses in a Person table.

-- Person
-- +----+---------+
-- | id | email   |
-- +----+---------+
-- |  1 | a@x.com |
-- |  2 | b@x.com |
-- |  3 | a@x.com |
-- +----+---------+
Group by email and use HAVING to keep only groups that appear more than once — the duplicate check must go in HAVING, not WHERE, because you can only count after grouping.

Why it exists:
Data integrity testing is one of the most common SQL tasks for QA engineers. Finding duplicates on fields that should be unique (email, username, order reference) is a fundamental database validation. Understanding GROUP BY + HAVING is the foundation for all data quality checks that involve counting occurrences.

``sql
-- Approach 1: GROUP BY + HAVING (simplest — returns the duplicate values)
SELECT email, COUNT(*) AS occurrences
FROM Person
GROUP BY email
HAVING COUNT(*) > 1;

-- Approach 2: window function (returns the full duplicate rows)
SELECT * FROM (
SELECT *, COUNT(*) OVER (PARTITION BY email) AS cnt
FROM Person
) t
WHERE cnt > 1;
``

Approach 1 lists the duplicated email addresses. Approach 2 returns the complete rows — useful when you need to review or delete them.

Real-world QA use case:
A QA engineer is validating that a user registration API enforces unique email addresses. After running a load test that fired 1,000 concurrent registration requests, she runs the GROUP BY + HAVING query against the test database. It finds 12 duplicate emails — a race condition in the application allows two threads to pass the uniqueness check simultaneously. The defect is raised with the exact duplicate rows as evidence.

Rule of thumb: GROUP BY + HAVING to find which values are duplicated; window function with PARTITION BY to get the actual duplicate rows for review or deletion.
💡 Plain English: Checking a party guest list for duplicates. Sort everyone into piles by name (GROUP BY), then point at any pile with more than one person (HAVING COUNT(*) > 1). Single-person piles are fine — the stacks are your duplicates.
4
Set Operators

What is the difference between UNION and UNION ALL? Which is faster and why?

Both combine results from two queries into one list — UNION ALL keeps every row including duplicates (faster), UNION deduplicates first (slower).

Why it exists:
QA engineers frequently combine results from multiple queries — comparing data across environments, merging audit logs from different date ranges, or combining test results from different sources. Choosing UNION when UNION ALL is correct (or vice versa) produces wrong results. Understanding the deduplication cost matters for query performance on large datasets.

``sql
SELECT name FROM team_a -- returns: Asha, Ben
UNION ALL
SELECT name FROM team_b; -- returns: Asha, Carol
-- Result: Asha, Ben, Asha, Carol ← Asha appears twice

SELECT name FROM team_a
UNION
SELECT name FROM team_b;
-- Result: Asha, Ben, Carol ← Asha deduped to once
``

UNION internally sorts the entire combined result to detect duplicates — on large tables this is expensive. UNION ALL skips that step entirely.

Real-world QA use case:
A QA engineer needs to compare user records from a staging database against production to find discrepancies. She uses UNION ALL to combine both sets (since she wants duplicates — they prove a record exists in both), then groups by user_id to find those appearing only once (missing from one environment). Using UNION would silently remove the very duplicates she is looking for.

Rule of thumb: Use UNION ALL by default — it's faster and correct when duplicates are intentional or impossible. Only use UNION when you genuinely need deduplication.
💡 Plain English: Merging two wedding guest lists. UNION ALL staples them together instantly — fast, but Aunt Sue who is on both lists now appears twice. UNION is the careful planner who reads every name and crosses out the repeats — tidier, but takes longer the bigger the lists.
5
Joins

Write a query to list every department along with its employee count — including departments that have zero employees.

-- Department(id, name)
-- Employee(id, name, department_id)  -- department_id may be NULL
Use a LEFT JOIN starting from Department so departments with no employees still appear — and count COUNT(e.id), not COUNT(*), or empty departments wrongly show 1.

Why it exists:
LEFT JOIN is the fundamental technique for "include everything from one side even when there's no match on the other." In QA, this is critical for data completeness tests: verifying every expected category appears in a report, confirming no reference records are missing from a summary, checking that zero-count rows are shown (not silently omitted) in dashboards.

``sql
SELECT d.name AS department,
COUNT(e.id) AS employee_count -- NOT COUNT(*) — see below
FROM Department d
LEFT JOIN Employee e ON e.department_id = d.id
GROUP BY d.id, d.name;
``

The trap: COUNT(*) counts rows — an empty department still produces one NULL row from the LEFT JOIN, so it reports 1 instead of 0. COUNT(e.id) ignores NULLs and correctly reports 0.

Real-world QA use case:
A QA engineer is validating a headcount report that should show all departments including recently created ones with no staff yet. She runs the LEFT JOIN query and confirms three new departments appear with a count of 0. She then checks the UI — the report was accidentally using an INNER JOIN and hiding those three departments entirely. That omission is the defect.

Rule of thumb: LEFT JOIN to include all rows from the primary table. Always use COUNT(column) not COUNT(*) when you want zero instead of one for unmatched rows.
💡 Plain English: Taking the register in every classroom, including the empty ones. If you count chairs (COUNT(*)), an empty room has a chair and you report "1 student" — wrong. Count actual students (COUNT(e.id)) and the empty room honestly reads zero.
6
Keys

What is the difference between a PRIMARY KEY, a UNIQUE key, and a FOREIGN KEY?

Primary key is the row's unique identifier (never NULL, only one per table), unique key enforces no-duplicate values (can hold NULLs, multiple per table), and foreign key enforces referential integrity by linking to another table's primary key.

Why it exists:
Keys are the schema constraints that enforce data integrity at the database level. For QA, understanding keys is critical for designing test data (you can't insert an order without a valid customer_id if a foreign key exists), diagnosing constraint violations in bug reports, and writing schema validation queries that confirm the database structure matches the design.

``sql
CREATE TABLE Customers (
id INT PRIMARY KEY, -- unique row identifier, never NULL
email VARCHAR(255) UNIQUE -- no duplicate emails, but not the main ID
);
CREATE TABLE Orders (
id INT PRIMARY KEY,
customer_id INT REFERENCES Customers(id) -- must be a real customer
);

-- This will be REJECTED by the foreign key constraint:
INSERT INTO Orders (id, customer_id) VALUES (1, 9999); -- customer 9999 doesn't exist
``

Real-world QA use case:
A QA engineer is writing automated test setup code and hits a foreign key violation: "insert or update on table 'Orders' violates foreign key constraint." She traces it to a test helper that creates an Order without first creating the associated Customer. Understanding the foreign key relationship tells her exactly what order to insert test data in — Customer first, then Order. She also queries the schema to build a full dependency map for the test data factory.

Rule of thumb: Primary key = row identity. Unique key = no duplicates allowed on this field. Foreign key = this field must reference a real row in another table.
💡 Plain English: Think of ID documents. Your primary key is your passport number — one official ID, never blank, never shared. A unique key is like your email address — also one-of-a-kind, but you might have a few. A foreign key is the address on a parcel — it has to point to a real house that actually exists, or the parcel is rejected.
7
Joins

What are the main types of JOINs in SQL?

JOIN types control which rows from two tables appear in the result — INNER keeps only matching rows, LEFT keeps all rows from the left table, FULL OUTER keeps everything from both sides, and CROSS produces every possible combination.

Why it exists:
Every relational database query that spans more than one table uses a JOIN. For QA, choosing the wrong JOIN type is a common source of data validation errors: an INNER JOIN silently drops unmatched rows (hiding missing data), while a LEFT JOIN preserves them (surfacing orphan records). Understanding join types is essential for writing correct database validation queries and for understanding why a report shows fewer records than expected.

``sql
-- Setup: Customers (1=Asha, 2=Ben), Orders (o1→customer 1, o2→customer 99)

INNER JOIN → Asha, o1 -- only the match; Ben and o2 dropped
LEFT JOIN → Asha, o1 | Ben, NULL -- all customers; o2 dropped
RIGHT JOIN → Asha, o1 | NULL, o2 -- all orders; Ben dropped
FULL OUTER → Asha, o1 | Ben, NULL | NULL, o2 -- everything from both
CROSS JOIN → 4 rows (Asha+o1, Asha+o2, Ben+o1, Ben+o2) — no ON clause
``

In day-to-day work, INNER JOIN and LEFT JOIN cover 95% of real queries.

Real-world QA use case:
A QA engineer is investigating why a "customers with no orders" report returns an unexpectedly low number. She checks the underlying query and finds it uses INNER JOIN — which silently excludes all customers with zero orders, the very records the report should highlight. Changing to LEFT JOIN WHERE orders.id IS NULL returns the correct unmatched set and reveals 340 customers the report was hiding.

Rule of thumb: INNER JOIN when you only want matching rows. LEFT JOIN when you need all rows from the primary table regardless of matches. FULL OUTER when you need to surface mismatches on both sides.
💡 Plain English: Two groups at a party — the guests and the cars they arrived in. INNER = only guests who have a car parked outside. LEFT = every guest, with their car listed if they drove. FULL OUTER = every guest and every car, with blanks where they don't connect. CROSS = list every guest next to every car — all possible combinations, for no obvious reason.
8
Filtering

What is the difference between WHERE and HAVING?

-- Departments with more than 5 active employees
SELECT department_id, COUNT(*) AS headcount
FROM Employee
WHERE active = 1          -- filter rows first
GROUP BY department_id
HAVING COUNT(*) > 5;      -- filter the groups after
WHERE filters individual rows before grouping; HAVING filters groups after GROUP BY. Any filter using COUNT(), SUM(), or AVG() must go in HAVING because those values don't exist until after grouping.

Why it exists:
WHERE and HAVING are applied at different stages of query execution. Putting an aggregate condition in WHERE causes an error ("invalid use of aggregate function") because the rows haven't been grouped yet. Understanding this separation is fundamental for writing any query that aggregates data — which is most real-world data validation and reporting queries.

``sql
-- Correct: WHERE filters rows, HAVING filters the groups
SELECT department_id, COUNT(*) AS headcount
FROM Employee
WHERE active = 1 -- removes inactive employees BEFORE grouping
GROUP BY department_id
HAVING COUNT(*) > 5; -- keeps only departments with more than 5 active employees

-- Wrong: this causes an error — you can't use COUNT() in WHERE
SELECT department_id FROM Employee
WHERE COUNT(*) > 5 -- ERROR: aggregate functions not allowed in WHERE
GROUP BY department_id;
``

Real-world QA use case:
A QA engineer writes a validation query to find product categories with more than 10 test records, filtering out archived items. She uses WHERE status != 'archived' to cut rows before grouping (faster — fewer rows to count), then HAVING COUNT(*) > 10 to filter the category groups. Writing this correctly is a daily skill for any QA engineer doing database-level test data validation.

Rule of thumb: If the filter involves an aggregate (COUNT, SUM, AVG, MIN, MAX) — it belongs in HAVING. Everything else belongs in WHERE, where it runs earlier and faster.
💡 Plain English: Sorting job applicants. WHERE is the first filter applied to each individual CV ("must have a driving licence"). HAVING is a filter on whole piles after you've sorted them ("only keep cities that sent more than 5 applicants"). You cannot count a pile until you've made the piles.
9
Schema Design

What is database normalization? Briefly explain 1NF, 2NF, and 3NF.

Normalization organises tables to eliminate data repetition — store each fact once, in one place, so that a change requires only one update.

Why it exists:
Unnormalised databases create update anomalies: changing a customer's city requires updating every one of their orders. They also create insert anomalies (can't add a product without an order) and delete anomalies (deleting the last order for a customer deletes the customer). For QA, understanding normalisation explains why test data needs to be set up in a specific order and why certain joins are required to reconstruct complete records.

``text
Starting table (violates all three normal forms):
order_id | products | customer | city
1 | Pen, Book | Asha | Pune
2 | Book | Asha | Pune

1NF fix — one value per cell (split "Pen, Book" into separate rows):
order_id | product | customer | city
1 | Pen | Asha | Pune
1 | Book | Asha | Pune

2NF fix — remove columns that depend on part of the key
(product price belongs in a Products table, not repeated per order line):
OrderItems(order_id, product_id, qty)
Products(id, name, price)

3NF fix — remove columns that depend on a non-key column
(city depends on customer, not on the order):
Customers(id, name, city)
Orders(id, customer_id)

End state: change Asha's city in one row of Customers → all her orders reflect it
``

Real-world QA use case:
A QA engineer notices a customer address discrepancy in a report — the same customer appears with two different postcodes in different order records. She diagnoses it as a 3NF violation: the address was stored directly on Orders rather than on Customers, and one order was updated without updating the others. The defect is an update anomaly — a direct consequence of unnormalised schema design.

Rule of thumb: 1NF = one value per cell. 2NF = every column depends on the whole key. 3NF = no column depends on another non-key column. Store each fact once, in the right place.
💡 Plain English: Tidying a kitchen. Instead of writing "salt: 1 tsp" on twenty different recipe cards, you keep one labelled jar and the recipes just reference it. Change the jar label once — all twenty recipes are automatically updated. Normalisation is the same principle applied to data.
10
Basics

What does the DISTINCT keyword do?

DISTINCT removes duplicate rows from the result set so each unique value (or combination of values) appears only once.

Why it exists:
Without DISTINCT, a query can return thousands of repeated values when you only need to know what unique values exist. In QA work this is used constantly: finding unique error codes in a log table, listing all distinct statuses an order has ever been in, checking what unique values exist in a column to spot unexpected ones.

``sql
-- Without DISTINCT: returns one row per customer (hundreds of 'India' rows)
SELECT country FROM Customers;

-- With DISTINCT: returns each country once
SELECT DISTINCT country FROM Customers;

-- Important: DISTINCT applies to all selected columns as a combination
SELECT DISTINCT city, country FROM Customers;
-- Returns unique (city, country) pairs — not unique cities alone
-- 'London, UK' and 'London, Canada' are two distinct combinations
``

Real-world QA use case:
A QA engineer is investigating an intermittent bug in an order status workflow. She runs SELECT DISTINCT status FROM Orders to see every status value that exists in the database. She discovers a status value "Pending_old" that shouldn't exist — a legacy value from a migration that was never cleaned up. Without DISTINCT she would have scrolled through 50,000 rows to spot it.

Rule of thumb: Use DISTINCT to find unique values or validate what values exist. Remember it operates on the entire selected column list — add more columns and you get unique combinations, not unique values per column.
💡 Plain English: Building a guest list from a stack of RSVPs where some people sent two replies. DISTINCT keeps one entry per person and quietly discards the extras — you get the unique set, nothing repeated.
11
Data Types

What is the difference between CHAR and VARCHAR?

CHAR(n) is fixed-length (always stores exactly n characters, padding with spaces), VARCHAR(n) is variable-length (stores only as many characters as needed, up to n).

Why it exists:
Choosing the right string type affects storage efficiency and comparison behaviour. CHAR is ideal for fixed-length codes where every value is the same size. VARCHAR is correct for anything that varies. For QA, the practical importance is comparison behaviour: CHAR pads shorter strings with trailing spaces, which can cause unexpected results in WHERE clause string comparisons if you're not aware of it.

``sql
CREATE TABLE addresses (
country_code CHAR(2), -- always 2 chars: 'US', 'IN', 'GB'
city VARCHAR(100) -- only as long as needed: 'Pune' uses 4 bytes, not 100
);

-- CHAR trailing-space gotcha:
INSERT INTO test (code CHAR(5)) VALUES ('AB');
SELECT * FROM test WHERE code = 'AB'; -- works (most DBs trim trailing spaces)
SELECT * FROM test WHERE code = 'AB '; -- also works (same stored value)
-- In some contexts, 'AB' != 'AB ' — be aware of this in test assertions
``

Real-world QA use case:
A QA engineer writes a test that checks whether a status code stored in a CHAR(10) column equals 'ACTIVE'. The assertion passes in PostgreSQL (which pads to 10 chars internally but trims on comparison) but fails in a legacy Oracle system that returns 'ACTIVE ' with trailing spaces. Understanding CHAR padding explains why the same assertion behaves differently across database platforms — and why VARCHAR is safer for most test data validation queries.

Rule of thumb: Use CHAR for fixed-size codes (country codes, status flags with known lengths). Use VARCHAR for everything else. Watch for trailing-space comparison surprises with CHAR in multi-database environments.
💡 Plain English: CHAR is a row of identical lockers — every item gets the same sized box whether it's a coat or a single key, empty space wasted. VARCHAR is a vacuum-seal bag that shrinks to fit whatever you put in — efficient, no wasted space.
12
NULL Handling

How do you test for and filter NULL values in SQL?

NULL means "unknown" or "missing" — never use = to test for it. Use IS NULL and IS NOT NULL instead, because any comparison with NULL evaluates to unknown (not true), so WHERE phone = NULL silently returns zero rows.

Why it exists:
NULL is one of SQL's most common sources of bugs and surprises. WHERE column = NULL is silently wrong — it returns nothing, with no error, and developers write it constantly. QA engineers validating data completeness, optional field handling, and form submission edge cases hit NULL issues constantly. Knowing how to correctly filter NULLs is fundamental to writing accurate database validation queries.

``sql
-- WRONG: returns zero rows, no error message
SELECT name FROM Customers WHERE phone = NULL;

-- CORRECT: returns customers with no phone
SELECT name FROM Customers WHERE phone IS NULL;

-- Customers who DO have a phone on file
SELECT name FROM Customers WHERE phone IS NOT NULL;

-- Replace NULL with a fallback value (COALESCE)
SELECT name, COALESCE(phone, 'No phone') AS contact
FROM Customers;
``

Real-world QA use case:
A QA engineer is validating that a "required field" enforcement works correctly on a form. She queries the database after submitting the form without filling in a required field. She finds rows where the field is NULL — meaning the backend accepted the submission without validation. The WHERE column IS NULL query surfaces the defect that WHERE column = NULL would have silently missed, leaving her thinking there were no issues.

Rule of thumb: Never use = NULL in a WHERE clause. Always use IS NULL or IS NOT NULL. Use COALESCE(column, fallback) to substitute a default value for NULL in query results.
💡 Plain English: NULL is a blank box on a form where someone wrote nothing. You cannot ask "does their answer equal blank?" — blank is not a value, it's the absence of one. You can only ask "did they leave it empty?" — and that is exactly what IS NULL asks.
13
Query Execution

In what order does SQL actually run the parts of a SELECT query?

You write SELECT first, but the database runs it in a different order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT.

Why it exists:
The execution order explains why certain SQL patterns fail. You can't reference a SELECT alias in a WHERE clause because WHERE runs before SELECT. You can't use aggregate functions in WHERE because aggregation hasn't happened yet. Every "why doesn't this work?" SQL question traces back to misunderstanding execution order.

``text
Logical execution order:
1. FROM / JOIN → combine the tables
2. WHERE → filter individual rows
3. GROUP BY → bundle rows into groups
4. HAVING → filter the groups
5. SELECT → compute the columns (aliases created HERE)
6. ORDER BY → sort (can use aliases from SELECT)
7. LIMIT / OFFSET → take the slice
`

`sql
-- Fails: WHERE runs before SELECT, so 'annual' doesn't exist yet
SELECT salary * 12 AS annual
FROM Employee
WHERE annual > 100000; -- ERROR: column 'annual' not found

-- Fix: repeat the expression, or wrap in a subquery
SELECT salary * 12 AS annual
FROM Employee
WHERE salary * 12 > 100000; -- works
``

Real-world QA use case:
A QA engineer writes a validation query that groups test results by status and filters to groups with more than 10 failures. She mistakenly puts COUNT(*) > 10 in the WHERE clause instead of HAVING and gets an error. Understanding execution order immediately tells her the aggregate isn't available until after GROUP BY — she moves the condition to HAVING and the query runs correctly.

Rule of thumb: Aliases created in SELECT are not available in WHERE or HAVING. Aggregate conditions go in HAVING, not WHERE. ORDER BY is the only clause that can reference SELECT aliases.
💡 Plain English: Cooking a meal, not reading a menu top to bottom. You gather ingredients (FROM), discard the bad ones (WHERE), portion them out (GROUP BY), then plate and garnish (SELECT, ORDER BY). You don't plate the dish first and then decide what goes in it.
14
Pattern Matching

How does the LIKE operator work, and what do % and _ mean?

LIKE performs pattern matching on text columns — % matches any sequence of characters (including none), _ matches exactly one character.

Why it exists:
Pattern matching is essential for searching text data when you don't know the exact value. In QA work, LIKE queries are used to find log entries containing specific error codes, search test data by partial name or reference, or identify records that match a naming convention — like all test accounts starting with "qa_".

``sql
SELECT * FROM Customers WHERE name LIKE 'A%'; -- starts with A: Asha, Amy, Alex
SELECT * FROM Customers WHERE name LIKE '%son'; -- ends with son: Jason, Wilson
SELECT * FROM Customers WHERE name LIKE '%ell%'; -- contains ell: Bell, Shelley
SELECT * FROM Customers WHERE code LIKE 'A_C'; -- A + any one char + C: ABC, A1C

-- Performance warning: leading wildcard forces full table scan (no index)
SELECT * FROM Logs WHERE message LIKE '%error%'; -- SLOW on large tables
-- Better: use full-text search for contains-style queries on large datasets
``

Real-world QA use case:
A QA engineer is hunting for all test accounts in a database that has mixed real and test data. She runs SELECT * FROM Users WHERE email LIKE 'qa_%@test.com' to find accounts following the QA naming convention. She also spots a variation — some old accounts used 'test_%@qa.com' — which she adds with an OR clause. LIKE lets her clean up without knowing every exact email address.

Rule of thumb: % = any characters, _ = exactly one character. Never start a LIKE pattern with % on a large table — it bypasses indexes and forces a full scan.
💡 Plain English: % is a "fill in anything" blank on a crossword — "A______" where the rest can be any letters. _ is a single "exactly one letter goes here" square. Together they let you describe a pattern when you don't know the exact answer.
15
Aggregation

What are aggregate functions? Name the common ones.

Aggregate functions collapse many rows into a single summary value — COUNT, SUM, AVG, MIN, and MAX are the five core ones.

Why it exists:
Aggregate functions are the foundation of every summary report, dashboard metric, and data validation check. QA engineers use them constantly: COUNT(*) to verify expected record counts after a batch process, SUM to validate financial totals, MIN/MAX to check boundary values in loaded data. Understanding the NULL behaviour is critical — COUNT(column) skips NULLs while COUNT(*) counts every row.

``sql
-- Orders table with amounts: 100, 250, 250, NULL
SELECT COUNT(*) AS total_rows, -- 4 (every row including NULL row)
COUNT(amount) AS with_amount, -- 3 (skips the NULL row)
SUM(amount) AS total, -- 600 (NULL ignored)
AVG(amount) AS average, -- 200 (NULL ignored — 600/3, not 600/4)
MIN(amount) AS smallest, -- 100
MAX(amount) AS largest -- 250
FROM Orders;

-- With GROUP BY: aggregate per group
SELECT department_id, AVG(salary) AS avg_salary
FROM Employee
GROUP BY department_id;
``

Real-world QA use case:
A QA engineer is validating a nightly batch job that should have processed 10,000 records. She runs COUNT(*) on the results table and gets 9,987 — 13 records short. She then uses COUNT(processed_at) vs COUNT(*) to check if the 13 are missing entirely or present with a NULL processed_at (failed silently). The comparison immediately tells her whether records are missing or failing without surfacing errors.

Rule of thumb: COUNT(*) counts rows (including NULLs). COUNT(column) counts non-NULL values only. All other aggregates (SUM, AVG, MIN, MAX) ignore NULLs automatically.
💡 Plain English: Aggregates are the summary line at the bottom of a receipt. The individual line items are the rows; SUM is the total, COUNT is "number of items," AVG is the average item price. One number that summarises many.
16
Date Filtering

Write a query to find all employees hired in the last 30 days.

-- Employee(id, name, hire_date)
Calculate the cutoff date on the right side of the comparison and keep the column bare — this lets the database use an index on hire_date for fast filtering.

Why it exists:
Date filtering is ubiquitous in QA work: finding records created in the last N days, validating that a batch process ran within the expected time window, checking that timestamps fall within a test's execution period. The key performance principle — keep the column bare, calculate on the constant side — is one of the most important SQL habits for writing fast date queries.

``sql
-- Correct: column is bare, index can be used
SELECT id, name, hire_date
FROM Employee
WHERE hire_date >= CURRENT_DATE - INTERVAL '30 days';

-- SQL Server equivalent:
WHERE hire_date >= DATEADD(DAY, -30, GETDATE())

-- Wrong: wrapping column in DATEDIFF defeats the index
WHERE DATEDIFF(day, hire_date, GETDATE()) <= 30; -- full table scan
``

The second form forces the database to compute DATEDIFF for every row before it can compare — it can't use a sorted index to skip to the right range.

Real-world QA use case:
A QA engineer is verifying that a "recently joined" feature shows only users created in the last 7 days. She queries the database directly using the correct date range filter to get the expected result set, then compares it against what the UI displays. She catches a bug where the feature is using the wrong timezone (UTC vs local) — the database shows 42 users in the 7-day window but the UI shows 38, a 4-hour offset at the day boundary.

Rule of thumb: Keep the column bare in date comparisons (so indexes work). Put the dynamic calculation on the right side of the comparison operator.
💡 Plain English: Finding everyone whose birthday falls in the next two weeks. You don't recalculate each person's age from scratch for every comparison — you set a fixed cutoff date and ask "is their birthday after this date?" Fast, with one calculation.
17
Sorting

How does ORDER BY work, including sorting by multiple columns?

ORDER BY sorts results by one or more columns — ASC (ascending, the default) or DESC (descending). Multiple columns are sorted left to right, with later columns breaking ties.

Why it exists:
Without ORDER BY, SQL results have no guaranteed order — the same query can return rows in a different sequence on different runs, different databases, or after a table update. For QA, consistent sort order is critical for result comparison (comparing query output against expected values), pagination validation (rows 1–10 must be stable), and test reproducibility.

``sql
-- Sort by department (A→Z), then within each department by salary (high→low)
SELECT name, department_id, salary
FROM Employee
ORDER BY department_id ASC, -- primary sort
salary DESC; -- secondary sort to break ties within same department

-- Practical QA validation pattern: sort both actual and expected the same way
SELECT id, status, amount FROM Orders
ORDER BY id ASC; -- deterministic order for result comparison
``

Real-world QA use case:
A QA engineer is comparing two result sets — one from the old system and one from the new system — to validate a data migration. Without ORDER BY on both queries, the rows come back in different orders and the diff shows hundreds of false mismatches. She adds ORDER BY id to both queries: the diff collapses from hundreds of apparent differences to exactly three genuine discrepancies that need investigating.

Rule of thumb: Never rely on implicit ordering in SQL — always add ORDER BY when result order matters. For test comparisons, always use the same ORDER BY on both sides.
💡 Plain English: A phone contacts list sorted by last name — when two people share a last name, it falls back to first name as the tiebreaker. Multiple ORDER BY columns work exactly the same: primary sort, then secondary sort to resolve ties.
18
Subqueries

What is a subquery? Give a simple example.

A subquery is a query nested inside another query — the inner query runs first and its result is used by the outer query.

Why it exists:
Subqueries let you answer questions that depend on an intermediate calculation — like "find everyone earning above average" where the average must be computed first. For QA, subqueries are used for comparative validation: find all records where a value exceeds a computed threshold, verify which IDs are missing from a second table, or check which results don't appear in an expected set.

``sql
-- WHERE subquery: find employees earning above the company average
SELECT name, salary
FROM Employee
WHERE salary > (SELECT AVG(salary) FROM Employee);

-- FROM subquery (inline view): treat a derived result as a table
SELECT dept, avg_sal
FROM (
SELECT department_id AS dept, AVG(salary) AS avg_sal
FROM Employee
GROUP BY department_id
) dept_averages
WHERE avg_sal > 60000;

-- SELECT subquery: add a computed value alongside each row
SELECT name,
salary,
(SELECT AVG(salary) FROM Employee) AS company_avg
FROM Employee;
``

Real-world QA use case:
A QA engineer is validating that a "high-value orders" report only shows orders above the account's average order value. She writes a subquery that calculates each customer's average, then an outer query that returns orders exceeding it. Comparing her query result against the UI output reveals that the report is using the global average (all customers combined) instead of per-customer averages — a data logic defect.

Rule of thumb: Subqueries run from inside out. Use them when the outer query depends on a value calculated from the data itself (an average, a count, a set membership check).
💡 Plain English: Asking a question that depends on another answer: "who is taller than the average person in the room?" You first work out the average (the subquery), then compare everyone to it (the outer query). The subquery is the preliminary question that makes the main question answerable.
19
Operators

What is the difference between the IN and BETWEEN operators?

IN checks if a value matches any item in a specific list. BETWEEN checks if a value falls within a continuous range (inclusive of both ends).

Why it exists:
IN and BETWEEN are the two most common ways to filter against multiple values without writing long chains of OR conditions. For QA engineers writing validation queries, knowing which to use — discrete set vs continuous range — produces cleaner, more readable code. BETWEEN's inclusiveness (both endpoints included) is also a common source of off-by-one errors in range boundary testing.

``sql
-- IN: value must match one of the listed items
SELECT * FROM Orders WHERE status IN ('new', 'paid', 'shipped');
-- Equivalent to: WHERE status = 'new' OR status = 'paid' OR status = 'shipped'

-- BETWEEN: value must fall in the range (both ends included)
SELECT * FROM Employee WHERE age BETWEEN 18 AND 30;
-- Equivalent to: WHERE age >= 18 AND age <= 30
-- Both 18 and 30 are INCLUDED

-- BETWEEN on dates (inclusive of both boundary dates)
SELECT * FROM Orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Jan 1st and Jan 31st both included
``

Real-world QA use case:
A QA engineer is testing a discount tier system that applies 10% off for orders between £50 and £100. She writes boundary tests using BETWEEN to find test orders at exactly £50, £51, £99, and £100 — confirming all four get the discount. She also checks £49 and £101 are excluded. Understanding that BETWEEN is inclusive of both endpoints tells her exactly which boundary values need testing (the classic BVA boundaries).

Rule of thumb: Use IN for a specific known set of values (statuses, categories, IDs). Use BETWEEN for a continuous numeric or date range. Remember BETWEEN includes both boundary values.
💡 Plain English: IN is a guest list — your name is either on it or it isn't. BETWEEN is a height restriction on a ride — you're allowed if your height lands anywhere between the two marks, ends included. Different question, different tool.
20
Pagination

How do you return just one page of results, like rows 21 to 30?

Use LIMIT with OFFSET (MySQL/Postgres) — OFFSET skips a number of rows, LIMIT takes the next N. Always pair with ORDER BY or the same rows may appear on different pages.

Why it exists:
Pagination is how applications show large datasets in manageable chunks. QA engineers test pagination heavily: verifying page boundaries (last item on page 1 is not duplicated on page 2), confirming total page count is correct, checking edge cases (last page with fewer items than page size), and validating that new records inserted between page loads don't cause rows to be skipped or duplicated.

``sql
-- MySQL / Postgres: LIMIT + OFFSET
SELECT * FROM Products
ORDER BY id -- mandatory: without this, results are non-deterministic
LIMIT 10 OFFSET 20; -- skip 20, take 10 → rows 21-30

-- SQL Server syntax:
SELECT * FROM Products
ORDER BY id
OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;

-- Page number formula: OFFSET = (page_number - 1) * page_size
-- Page 3, 10 items per page: OFFSET = (3-1) * 10 = 20
``

Real-world QA use case:
A QA engineer is testing a product listing page that shows 20 items per page. She checks the boundary between page 1 and page 2 by comparing the last item on page 1 against the first on page 2 — they must be adjacent and never overlap. She then inserts a new product while on page 1 and refreshes page 2 — she discovers the new product shifted all records by one, pushing the last item of old-page-1 onto the new page 2 and duplicating it. Cursor-based pagination would have prevented this, which she raises as an improvement suggestion.

Rule of thumb: Always include ORDER BY with pagination queries — without it, page results are non-deterministic. For large datasets, cursor-based pagination (WHERE id > last_seen_id) outperforms OFFSET at high page numbers.
💡 Plain English: Flipping to page 3 of search results. You skip the first two pages of hits (OFFSET) and show the next ten (LIMIT). Without a fixed sort order, "page 3" could show different items on different requests — like a different page falling open each time you drop the book.
21
Constraints

What is a foreign key, and what does ON DELETE CASCADE do?

A foreign key links a column to the primary key of another table, and blocks invalid links — you can't add an order for a customer who doesn't exist.

ON DELETE CASCADE decides what happens to the child rows when the parent is deleted:
- With CASCADE, deleting a customer automatically deletes all their orders too.
- Without it, the database stops you from deleting a customer who still has orders.

Use CASCADE carefully — it can delete more than you expect.

Example:
``sql
CREATE TABLE Orders (
id INT PRIMARY KEY,
customer_id INT REFERENCES Customers(id) ON DELETE CASCADE
);

DELETE FROM Customers WHERE id = 7;
-- Every order with customer_id = 7 is automatically deleted too.
-- Without ON DELETE CASCADE, this DELETE is blocked while those orders still exist.
``
💡 Plain English: A foreign key is the rule that every order must name a real customer. `ON DELETE CASCADE` is "when we close a customer's account, shred all their paperwork with them" — convenient, but make sure you really want everything gone.
22
Aggregation

Write a query to find total sales for each product category.

-- Sales(id, category, amount)
``sql
SELECT category, SUM(amount) AS total_sales
FROM Sales
GROUP BY category
ORDER BY total_sales DESC;
`

GROUP BY category bundles all rows of the same category together, SUM(amount) adds up the sales in each bundle, and ORDER BY puts the biggest category on top.

Rule of thumb: every column in
SELECT that isn't inside an aggregate must appear in GROUP BY`.
💡 Plain English: It's like sorting your shopping receipts into piles by type — groceries, fuel, clothes — and totalling each pile. `GROUP BY` makes the piles, `SUM` totals each one.
23
NULL Handling

How do you replace or neutralize a NULL value? (COALESCE, NULLIF)

- COALESCE(a, b, c) returns the first value that isn't NULL. Great for a fallback: COALESCE(nickname, first_name, 'Guest').
- NULLIF(a, b) returns NULL if the two values are equal, otherwise it returns the first. Handy to dodge divide-by-zero: total / NULLIF(count, 0) gives NULL instead of an error when count is 0.

Both are standard SQL. (SQL Server's ISNULL is similar to COALESCE, but takes only two arguments.)
💡 Plain English: `COALESCE` is a backup plan — "call my mobile; no answer? try home; still nothing? leave a voicemail." It uses the first one that works. `NULLIF` is "if these two match, treat it as nothing" — like ignoring a sensor that's stuck reading exactly zero.
24
Joins

In a JOIN, what is the difference between a condition in the ON clause and one in the WHERE clause?

For an INNER JOIN they behave the same. But for a LEFT JOIN they're very different — and this trips people up:

- A condition in ON is applied *while matching* the tables, so non-matching left rows are still kept (with NULLs).
- A condition in WHERE is applied *after* the join, to the combined result — so it can quietly drop those NULL rows and turn your LEFT JOIN back into an INNER JOIN.

Example: "list all customers, plus their 2024 orders if any":
``sql
-- RIGHT — keeps customers with no 2024 order (filter in ON):
SELECT c.name, o.id
FROM Customers c
LEFT JOIN Orders o
ON o.customer_id = c.id AND o.year = 2024;

-- WRONG — drops customers with no 2024 order (filter in WHERE):
SELECT c.name, o.id
FROM Customers c
LEFT JOIN Orders o ON o.customer_id = c.id
WHERE o.year = 2024; -- a NULL year fails this test → the row vanishes
``
💡 Plain English: Filtering in **ON** is telling the matchmaker "pair people up, and only count a 2024 order as a match" — everyone still shows up to the party. Filtering in **WHERE** is a bouncer at the *exit* who throws out anyone without a 2024 order, including the people who came alone. With a LEFT JOIN, that bouncer undoes the whole point.
25
Aggregation

If both can remove duplicates, what is the difference between DISTINCT and GROUP BY?

For *just* removing duplicate rows, SELECT DISTINCT col and SELECT col ... GROUP BY col give the same result.

The difference is intent:
- Use DISTINCT when you only want the unique values and nothing more.
- Use GROUP BY when you also want to *calculate something per group*, like a count or a sum.

``sql
-- DISTINCT: just the unique list
SELECT DISTINCT country FROM Customers;

-- GROUP BY: unique list PLUS a number per group
SELECT country, COUNT(*) FROM Customers GROUP BY country;
`

If you're not calculating anything,
DISTINCT` reads more clearly.
💡 Plain English: DISTINCT asks "which countries are our customers from?" — just the list. GROUP BY asks "how many customers from each country?" — the list *and* a tally for each. Same piles, but GROUP BY also counts them.
26
SQL Categories

What are the four categories of SQL commands — DDL, DML, DCL, and TCL?

SQL commands fall into four families by what they do:

- DDL (Data Definition) — define or change *structure*: CREATE, ALTER, DROP, TRUNCATE.
- DML (Data Manipulation) — work with the *data inside* tables: SELECT, INSERT, UPDATE, DELETE.
- DCL (Data Control) — manage *permissions*: GRANT, REVOKE.
- TCL (Transaction Control) — manage *transactions*: COMMIT, ROLLBACK, SAVEPOINT.

Example:
``sql
CREATE TABLE Orders (id INT); -- DDL: defines structure
INSERT INTO Orders VALUES (1); -- DML: changes data
GRANT SELECT ON Orders TO analyst; -- DCL: controls access
COMMIT; -- TCL: finalises the transaction
`
A useful tell: DDL auto-commits (you can't roll back a
DROP`), while DML changes can be wrapped in a transaction and rolled back.
💡 Plain English: Think of building a house. **DDL** is the architect drawing or changing the rooms. **DML** is moving furniture in and out. **DCL** is handing out door keys. **TCL** is the "save / undo" button on the day's work.
27
Constraints

What are the common column constraints, and what does each enforce? (NOT NULL, UNIQUE, CHECK, DEFAULT)

Constraints are rules the database enforces automatically, so bad data can't get in:

- NOT NULL — the column must always have a value.
- UNIQUE — no two rows can share the same value.
- CHECK — the value must satisfy a condition you write.
- DEFAULT — if no value is given, use this one.
- (PRIMARY KEY = NOT NULL + UNIQUE; FOREIGN KEY = must match a row in another table.)

Example:
``sql
CREATE TABLE Accounts (
id INT PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
balance NUMERIC DEFAULT 0,
status VARCHAR(10) CHECK (status IN ('active', 'closed'))
);
`
Now an insert with a NULL email, a duplicate email, or a status of
'banana'` is rejected before it can corrupt your data.
💡 Plain English: Constraints are a nightclub door policy. **NOT NULL** = "you must show ID", **UNIQUE** = "one entry per person", **CHECK** = "you must meet the dress code", **DEFAULT** = "no stamp? here's the standard one". The bouncer enforces it so troublemakers never get inside.
28
Indexing

What is an index, and when should you add one?

An index is a separate, sorted lookup structure the database keeps for a column, so it can find rows *without scanning the whole table*.

Add one when:
- A column is frequently used in WHERE, JOIN, or ORDER BY.
- The table is big enough that full scans are slow.

Be careful, because:
- Every INSERT/UPDATE/DELETE must also update the index, so too many indexes slow down writes.
- Indexes use extra storage.

Example:
``sql
-- Slow on a big table — scans every row:
SELECT * FROM Customers WHERE email = 'a@x.com';

-- Add an index so the lookup jumps straight to the row:
CREATE INDEX idx_customers_email ON Customers(email);
``
(*How* an index finds rows so fast, and the clustered-vs-non-clustered distinction, are deeper topics — see the mid/senior index questions.)
💡 Plain English: It's the index at the back of a book. Without it you read every page to find a topic; with it you jump straight to the right page. But that index has to be updated every time the book changes — so you don't bother indexing a one-page memo.
29
String Functions

What are the common string functions in SQL?

The everyday string toolkit:

- CONCAT(a, b) (or a || b) — join strings together.
- LENGTH(s) — number of characters.
- SUBSTRING(s, start, len) — pull out part of a string.
- UPPER(s) / LOWER(s) — change case.
- TRIM(s) — remove leading/trailing spaces.
- REPLACE(s, from, to) — swap text.

Example:
``sql
SELECT
CONCAT(first_name, ' ', last_name) AS full_name,
UPPER(country) AS country_code,
TRIM(email) AS clean_email,
SUBSTRING(phone, 1, 3) AS area_code
FROM Customers;
`
Watch out in
WHERE: WHERE UPPER(name) = 'ASHA'` wraps the column in a function and can stop an index from being used (see the non-sargable question).
💡 Plain English: String functions are a label-maker kit — cut a piece out (`SUBSTRING`), stick two together (`CONCAT`), tidy the edges (`TRIM`), or reprint it in capitals (`UPPER`).
30
Date Functions

What are the common date/time functions, and how do you work with dates?

The common ones (names vary a little by database):

- CURRENT_DATE / NOW() — today / the current timestamp.
- EXTRACT(YEAR FROM d) (or YEAR(d)) — pull out a part of a date.
- DATEDIFF / AGE — the gap between two dates.
- DATE_ADD / d + INTERVAL '7 days' — shift a date.

Example:
``sql
-- Each order with how many days ago it was placed:
SELECT id,
order_date,
CURRENT_DATE - order_date AS days_ago
FROM Orders
WHERE order_date >= DATE_TRUNC('year', CURRENT_DATE);
`
Tip: filter with a date *range* (
order_date >= '2024-01-01') rather than wrapping the column in a function (YEAR(order_date) = 2024`) — the bare-column version can use an index, the function version can't.
💡 Plain English: Date functions are a calendar plus a stopwatch. `EXTRACT` reads one number off the calendar (the month); `DATEDIFF` is the stopwatch between two dates; `DATE_ADD` flips the calendar forward.
31
Views

What is a view, and why would you use one?

A view is a *saved query* you can treat like a table. It stores no data of its own — each time you query it, the underlying SELECT runs and gives fresh results.

Why use one:
- Simplify — hide a complex join behind a friendly name.
- Reuse — define the logic once, use it everywhere.
- Security — expose only certain columns/rows, hiding the rest of the table.

Example:
``sql
CREATE VIEW active_customers AS
SELECT id, name, email
FROM Customers
WHERE status = 'active';

-- Now query it just like a table:
SELECT * FROM active_customers WHERE name LIKE 'A%';
``
(If you need the *results stored* for speed, that's a materialized view — a more advanced topic; see the senior question.)
💡 Plain English: A view is a saved filter on a spreadsheet, or a TV channel preset — not a new copy of everything, just a convenient, named window onto the data you care about.
32
Filtering

What is the difference between WHERE and HAVING? When do you use each?

Both filter rows, but at different stages of the query:

- WHERE filters *individual rows before* grouping. It cannot reference aggregate functions like COUNT or SUM.
- HAVING filters *groups after* GROUP BY runs. Use it to filter on aggregated values.

``sql
-- WHERE filters rows before they're grouped:
SELECT department_id, COUNT(*) AS headcount
FROM Employees
WHERE status = 'active' -- only count active employees
GROUP BY department_id
HAVING COUNT(*) > 10; -- only departments with more than 10
``

The rule: if you're filtering on a raw column value → WHERE. If you're filtering on the result of COUNT, SUM, AVG, MIN, MAX → HAVING.
💡 Plain English: A restaurant kitchen. **WHERE** is the chef checking ingredients *before* cooking — only fresh vegetables get in. **HAVING** is the manager checking finished dishes *after* plating — only plates with more than 3 items go out. Two different inspection points in the same pipeline.
33
Query Writing

Write a query to find each customer's total order count, showing only customers who have placed more than 5 orders.

-- Customers(id, name)
-- Orders(id, customer_id, order_date)
``sql
SELECT c.name, COUNT(o.id) AS order_count
FROM Customers c
JOIN Orders o ON o.customer_id = c.id
GROUP BY c.id, c.name
HAVING COUNT(o.id) > 5
ORDER BY order_count DESC;
``

Key points the interviewer is listening for:
- JOIN on customer_id to link the tables.
- GROUP BY the customer (not the order) to get one row per customer.
- HAVING (not WHERE) to filter on the aggregate COUNT — WHERE runs before grouping, so it can't see COUNT yet.
- ORDER BY makes the result useful.
💡 Plain English: Counting letters per person in a stack of mail. You group the stack by recipient first (GROUP BY), count each person's pile, then only hand over the piles with more than 5 letters (HAVING). Checking individual letters with WHERE can't tell you how big each pile is — you need to count first.
34
Query Writing

Write a query to find all employees who work in the 'Engineering' department.

-- Employees(id, name, department_id)
-- Departments(id, name)
``sql
SELECT e.id, e.name
FROM Employees e
JOIN Departments d ON d.id = e.department_id
WHERE d.name = 'Engineering';
`

Also valid — subquery approach:
`sql
SELECT id, name
FROM Employees
WHERE department_id = (
SELECT id FROM Departments WHERE name = 'Engineering'
);
``

The JOIN is preferred when you need other department columns in the result. The subquery is fine for a simple lookup. Both are correct — knowing both shows the interviewer you understand the trade-offs.
💡 Plain English: Finding all staff on a specific office floor. You look up which floor number belongs to "Engineering" (the Departments table), then find everyone registered on that floor number (the Employees table). Two tables, one shared key to link them.
35
Filtering

How do you find rows where a column is NULL? Why can't you use = NULL?

NULL means "unknown" — and in SQL, comparing anything to "unknown" gives "unknown", not true or false. So column = NULL never matches.

The correct operators are:
- IS NULL — find rows where the column has no value.
- IS NOT NULL — find rows where the column does have a value.

``sql
-- Find customers with no phone number:
SELECT * FROM Customers WHERE phone IS NULL;

-- Find customers who have a phone number:
SELECT * FROM Customers WHERE phone IS NOT NULL;

-- WRONG — always returns zero rows:
SELECT * FROM Customers WHERE phone = NULL;
`

Bonus: COALESCE replaces NULL with a fallback:
`sql
SELECT name, COALESCE(phone, 'no phone') AS contact
FROM Customers;
``
💡 Plain English: Asking someone "is your unknown phone number the same as this unknown number?" — you can't compare unknowns. The only valid question is "do you *have* a phone number at all?" That's IS NULL.
36
Query Writing

Write a query to get the top 5 most expensive products.

-- Products(id, name, price)
``sql
SELECT id, name, price
FROM Products
ORDER BY price DESC
LIMIT 5;
`

On SQL Server use TOP instead of LIMIT:
`sql
SELECT TOP 5 id, name, price
FROM Products
ORDER BY price DESC;
``

What an interviewer looks for:
- ORDER BY price DESC (highest first).
- LIMIT / TOP to cap the result.
- If there are ties at position 5, LIMIT cuts arbitrarily. If you need all products tied for 5th, use DENSE_RANK with a window function instead.
💡 Plain English: Sorting a product catalogue by price from highest to lowest and tearing off the first 5 pages. Simple — but if two products share the same price at the 5th slot, the tear is arbitrary. If ties matter, you need a smarter approach.
37
Query Writing

What does DISTINCT do, and write an example where it actually matters?

DISTINCT removes duplicate rows from the result — any two rows that are identical across all selected columns are collapsed into one.

``sql
-- Without DISTINCT — shows a row for every order (many per customer):
SELECT customer_id FROM Orders;

-- With DISTINCT — one row per customer, even if they have many orders:
SELECT DISTINCT customer_id FROM Orders;
`

A real use case: you want a list of countries your customers come from — not one row per customer, just the unique countries.
`sql
SELECT DISTINCT country FROM Customers ORDER BY country;
``

Watch out: DISTINCT applies to the *whole row*, not just the first column. Adding more columns means two rows must match on *all* of them to be considered duplicates.
💡 Plain English: A guest list after a wedding. The raw RSVP log has the same name appearing 3 times (they emailed, called, and texted). DISTINCT shows each guest once — deduplicated, clean.
38
Query Writing

How do you rename a column in a query result? Show an example.

Use AS to give a column a new label in the output (the underlying table is unchanged):

``sql
SELECT
first_name AS "First Name",
last_name AS "Last Name",
salary * 12 AS annual_salary
FROM Employees;
``

AS is optional in most databases — you can write the alias directly after the expression — but writing AS makes the intent clear.

Why it matters in practice:
- Calculated columns (like salary * 12) have no name without an alias.
- Aggregates like COUNT(*) output as "COUNT(*)" — giving them a clean alias makes downstream code and reports readable.
- Joining two tables with the same column name (both have "id") — alias one to avoid ambiguity.
💡 Plain English: A name badge at a conference. The employee badge might say "Priya Sharma, Senior Engineer, ID 4421" — but you stick a badge on top saying "Priya" for the networking event. The person is unchanged; the label is just friendlier for the context.
39
Query Writing

Write a query to find all customers who placed an order in the last 30 days.

-- Customers(id, name, email)
-- Orders(id, customer_id, order_date)
``sql
SELECT DISTINCT c.id, c.name, c.email
FROM Customers c
JOIN Orders o ON o.customer_id = c.id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days';
`

On MySQL:
`sql
WHERE o.order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
`

Key points:
- DISTINCT prevents the same customer appearing multiple times if they placed several orders in the period.
- Filter with a bare column (
order_date >= ...) not WHERE MONTH(order_date) = ...` — the bare column version can use an index, the function wrapper usually can't.
💡 Plain English: Checking the gym's sign-in book for everyone who visited in the past 30 days. You want each member's name once — not one line per visit. DISTINCT and a date range does exactly that.
40
Joins

What is the difference between INNER JOIN and LEFT JOIN? When would you use each?

- INNER JOIN — returns only rows where there is a match in *both* tables. Rows with no match on either side are excluded.
- LEFT JOIN — returns *all* rows from the left table, plus matched rows from the right. Where there's no match, the right-side columns come back as NULL.

``sql
-- INNER JOIN: only customers who have at least one order
SELECT c.name, o.id AS order_id
FROM Customers c
INNER JOIN Orders o ON o.customer_id = c.id;

-- LEFT JOIN: all customers, even those with no orders
SELECT c.name, o.id AS order_id
FROM Customers c
LEFT JOIN Orders o ON o.customer_id = c.id;
-- Customers with no orders appear once, with order_id = NULL
``

When to use which:
- INNER JOIN — you only care about records that exist on both sides.
- LEFT JOIN — you want everything from the primary table, with optional enrichment from the second (e.g. all customers whether or not they've ordered).
💡 Plain English: A student roll-call with grades. **INNER JOIN** only lists students who have a grade — everyone else is invisible. **LEFT JOIN** lists every student; those without a grade just get a blank in the grade column.
41
Query Writing

Write a query to list every customer and how many orders they have placed, including customers with zero orders.

-- Customers(id, name)
-- Orders(id, customer_id)
``sql
SELECT c.id, c.name, COUNT(o.id) AS order_count
FROM Customers c
LEFT JOIN Orders o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY order_count DESC;
``

Why LEFT JOIN, not INNER JOIN?
An INNER JOIN would drop customers who have never ordered — their order_count would be 0 but they'd be invisible. LEFT JOIN keeps them and the COUNT of NULLs returns 0, which is correct.

Note: COUNT(o.id) counts non-NULL values — so it correctly returns 0 when there are no matching orders (all NULLs from the LEFT JOIN). COUNT(*) would return 1 instead of 0 for those customers, which is wrong.
💡 Plain English: A school attendance report including students who were absent. You list every student first (LEFT JOIN keeps all customers), then count how many days each one showed up (COUNT of non-NULL order IDs). Absent students show 0, not "missing from the report."
42
DML

Write a query to give all products in the 'Electronics' category a 10% price increase.

-- Products(id, name, price, category)
``sql
UPDATE Products
SET price = price * 1.10
WHERE category = 'Electronics';
`

Before running any UPDATE in an interview or in production, always test your WHERE clause with a SELECT first:
`sql
-- Step 1: confirm which rows will be affected
SELECT id, name, price, price * 1.10 AS new_price
FROM Products
WHERE category = 'Electronics';

-- Step 2: only then run the UPDATE
UPDATE Products
SET price = price * 1.10
WHERE category = 'Electronics';
``

Mentioning this habit shows the interviewer you are careful with destructive statements — it's one of the real differentiators between juniors and seniors.
💡 Plain English: Before repainting a room, you tape up everything that shouldn't be painted and do a test patch first. The SELECT is the test patch — confirm you've masked the right areas before the roller goes on.
43
Filtering

What is the difference between = and LIKE in a WHERE clause?

- = matches an *exact* value. Fast — can use an index.
- LIKE matches a *pattern* using wildcards: % (any sequence of characters) and _ (exactly one character). Slower on large tables, especially with a leading %.

``sql
-- Exact match — only finds 'Smith', nothing else:
SELECT * FROM Customers WHERE last_name = 'Smith';

-- Pattern match — finds 'Smith', 'Smithson', 'Blacksmith':
SELECT * FROM Customers WHERE last_name LIKE '%smith%';

-- Starts with 'Sm':
SELECT * FROM Customers WHERE last_name LIKE 'Sm%';

-- Exactly 5 characters:
SELECT * FROM Customers WHERE postcode LIKE '_____';
``

Performance note: LIKE with a leading % (e.g. LIKE '%smith') cannot use a standard B-tree index — it must scan the whole table. LIKE without a leading wildcard (e.g. LIKE 'Sm%') can still use an index.
💡 Plain English: Finding a book. **=** is looking up an exact ISBN number — one precise match, instant. **LIKE** is searching for any title *containing* the word "dragon" — you have to browse the whole shelf.
44
Query Writing

Write a query to find all products with a price between 100 and 500.

-- Products(id, name, price)
``sql
-- Using BETWEEN (inclusive on both ends):
SELECT id, name, price
FROM Products
WHERE price BETWEEN 100 AND 500
ORDER BY price;

-- Equivalent explicit version:
SELECT id, name, price
FROM Products
WHERE price >= 100 AND price <= 500
ORDER BY price;
`

Important: BETWEEN is *inclusive* — it includes both 100 and 500. If you want to exclude the boundaries, use the explicit >= / <= form with your chosen limits.

Follow-up interviewers often ask: BETWEEN also works on dates and strings:
`sql
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
``
💡 Plain English: A price filter on a shopping website. Sliding the range slider to 100–500 includes products *at* those prices — same as BETWEEN. "Greater than 100 and less than 500" would exclude the boundary prices.
45
Query Writing

Write a query to display each customer's full name as a single column, combining first and last name.

-- Customers(id, first_name, last_name, email)
``sql
-- Standard SQL (works in Postgres, SQL Server, Oracle):
SELECT id,
first_name || ' ' || last_name AS full_name,
email
FROM Customers;

-- MySQL uses CONCAT:
SELECT id,
CONCAT(first_name, ' ', last_name) AS full_name,
email
FROM Customers;
`

Edge case the interviewer might probe: what if either name is NULL?

`sql
-- NULL + anything = NULL, so 'John' || NULL gives NULL, not 'John'
-- Fix with COALESCE:
SELECT COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') AS full_name
FROM Customers;
``

Handling the NULL edge case is usually what separates a good answer from a great one.
💡 Plain English: Writing a name tag. If you only have a first name, just use that — don't leave the tag blank because the last name field is empty. COALESCE is the fallback: "use the real value, or use this default if it's missing."
46
Aggregation

Write a query to get the total, average, minimum, maximum, and count of salaries from an Employee table.

-- Employees(id, name, salary, department_id)
``sql
SELECT
COUNT(*) AS total_employees,
SUM(salary) AS total_salary,
AVG(salary) AS average_salary,
MIN(salary) AS lowest_salary,
MAX(salary) AS highest_salary
FROM Employees;
`

Per department:
`sql
SELECT
department_id,
COUNT(*) AS headcount,
ROUND(AVG(salary), 2) AS avg_salary,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary
FROM Employees
GROUP BY department_id
ORDER BY avg_salary DESC;
``

NULL behaviour: all aggregate functions except COUNT(*) ignore NULLs. COUNT(*) counts all rows; COUNT(salary) counts only rows where salary is not NULL.
💡 Plain English: A payroll summary on a spreadsheet. Instead of listing every employee line by line, you use the SUM, AVERAGE, MIN, and MAX formulas to get the key numbers at a glance — that's what SQL aggregate functions do at the database level.
47
Filtering

Write a query to find all customers who live in either London or New York.

``sql
-- Using IN (cleanest when checking multiple values):
SELECT id, name, city
FROM Customers
WHERE city IN ('London', 'New York')
ORDER BY city;

-- Equivalent with OR:
SELECT id, name, city
FROM Customers
WHERE city = 'London' OR city = 'New York';
`

IN is preferred when checking against 3 or more values — more readable and the database can optimise it better than a long OR chain.

What NOT to do:
`sql
-- Wrong — this will never match anything:
WHERE city = 'London' AND city = 'New York'
-- A single value can't equal two different things simultaneously.
``

The AND vs OR trap is one of the most common junior mistakes interviewers look for.
💡 Plain English: A door list at a club with two VIP entrances — London and New York. IN is the bouncer checking a short list: "are you on this list?" Any match lets you in. AND would mean "you must simultaneously be from both cities" — nobody qualifies.
48
Query Writing

Write a query to find all duplicate email addresses in a Customers table and show how many times each appears.

-- Customers(id, name, email)
``sql
SELECT email, COUNT(*) AS occurrences
FROM Customers
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
`

If you also want to see the full rows (not just the emails):
`sql
SELECT *
FROM Customers
WHERE email IN (
SELECT email
FROM Customers
GROUP BY email
HAVING COUNT(*) > 1
)
ORDER BY email;
``

Why HAVING, not WHERE? COUNT(*) is an aggregate — it only exists after the GROUP BY runs. WHERE filters rows *before* grouping and can't see aggregate values.
💡 Plain English: Checking a school's enrolment records for duplicate names. You group students by name, count each group, and only raise your hand for groups with more than one — those are the duplicates worth investigating.
49
Practical

You ran a DELETE or UPDATE without a WHERE clause and affected all rows. What do you do?

This is a critical situation — act immediately:

If you are inside an open transaction (BEGIN/START TRANSACTION):
``sql
ROLLBACK;
``
This undoes all changes since the BEGIN. Check immediately — most databases auto-commit by default, so this window may be very small.

If auto-commit was on and the changes are committed:
1. Stop any dependent processes — prevent further writes that compound the damage.
2. Restore from the most recent backup — this is why backups and regular restore tests exist.
3. Use transaction logs / point-in-time recovery — most production databases (Postgres, SQL Server, MySQL with binlog) can replay or replay-minus changes from the log.
4. Escalate immediately — don't try to fix it quietly; every minute matters.

How to prevent it next time:
- Always run a SELECT with the same WHERE first to preview affected rows.
- Wrap destructive statements in an explicit BEGIN so you can ROLLBACK.
- Restrict production write permissions so full-table DELETEs require elevated access.
💡 Plain English: Accidentally erasing the whole whiteboard in a meeting room. If the cleaner is still in the room (transaction open), shout STOP and they can undo it. If they've already left and the room's been reset (auto-committed), you need the photo someone took earlier (the backup) — and you need to act fast before anything else overwrites it.
50
Query Writing

Write a query to insert a new product only if a product with the same name does not already exist in the table.

-- Products(id, name, price, category)
Approach 1 — INSERT with NOT EXISTS (portable):
``sql
INSERT INTO Products (name, price, category)
SELECT 'Wireless Mouse', 29.99, 'Electronics'
WHERE NOT EXISTS (
SELECT 1 FROM Products WHERE name = 'Wireless Mouse'
);
`

Approach 2 — Upsert / ON CONFLICT (Postgres):
`sql
INSERT INTO Products (name, price, category)
VALUES ('Wireless Mouse', 29.99, 'Electronics')
ON CONFLICT (name) DO NOTHING;
-- Requires a UNIQUE constraint on the name column
`

Approach 3 — MySQL INSERT IGNORE:
`sql
INSERT IGNORE INTO Products (name, price, category)
VALUES ('Wireless Mouse', 29.99, 'Electronics');
``

Best practice: the safest long-term solution is a UNIQUE constraint on the column — then the database itself prevents duplicates regardless of how the insert is written.
💡 Plain English: Adding a contact to a phone book. Before writing the name, you check if it already exists — if it does, you skip it; if it doesn't, you add it. A UNIQUE constraint is like the phone book rejecting duplicate entries automatically, no manual check needed.

Mid-Level (2–5 years)

1
Window Functions

Find the top 3 highest-paid employees in each department.

-- Employee(id, name, salary, department_id)
-- Department(id, name)
Rank employees within each department using DENSE_RANK() OVER (PARTITION BY department), then keep ranks 1–3. Use DENSE_RANK, not ROW_NUMBER, so tied salaries share a rank rather than arbitrarily dropping one.

Why it exists:
"Top N per group" is one of the most common business reporting patterns and a classic window function interview question. In QA work, the same pattern validates per-category reports: top 3 test failures per suite, highest-value orders per customer, most recent N records per user. Getting the tie-handling right (DENSE_RANK vs ROW_NUMBER) is critical for data accuracy.

``sql
-- Approach 1: window function (standard, recommended)
WITH ranked AS (
SELECT e.name, e.salary, d.name AS department,
DENSE_RANK() OVER (
PARTITION BY e.department_id
ORDER BY e.salary DESC
) AS rnk
FROM Employee e
JOIN Department d ON d.id = e.department_id
)
SELECT department, name, salary
FROM ranked
WHERE rnk <= 3;

-- Approach 2: correlated subquery (older databases without window functions)
SELECT e.name, e.salary, e.department_id
FROM Employee e
WHERE (
SELECT COUNT(DISTINCT e2.salary)
FROM Employee e2
WHERE e2.department_id = e.department_id
AND e2.salary > e.salary
) < 3;
``

Approach 1 is clearer and faster. Approach 2 is the portable fallback for legacy systems.

Real-world QA use case:
A QA engineer is validating a "Top Earners by Department" report. The report should show at most 3 employees per department, but tied salaries should both appear. She writes the DENSE_RANK query and discovers the report is using ROW_NUMBER — a tie at rank 3 drops one of the tied employees arbitrarily. She reproduces the defect with a specific test dataset where two employees share the department's third-highest salary, then raises it with the exact expected vs actual output.

Rule of thumb: Use DENSE_RANK for "top N" when ties should share a position. Use ROW_NUMBER only when you need exactly N rows and don't care about ties.
💡 Plain English: The Olympics giving out medals separately per sport. PARTITION BY department means "each sport gets its own podium." DENSE_RANK means if two swimmers tie for silver, both get silver and the next swimmer still gets bronze — no medal skipped. You crown the top 3 in every event independently.
2
Self Join

Given an Employee table where each row has a managerId pointing to another employee, find all employees who earn more than their manager.

-- Employee
-- +----+-------+--------+-----------+
-- | id | name  | salary | managerId |
-- +----+-------+--------+-----------+
-- |  1 | Joe   |  70000 |         3 |
-- |  2 | Henry |  80000 |         4 |
-- |  3 | Sam   |  60000 |      NULL |
-- |  4 | Max   |  90000 |      NULL |
-- +----+-------+--------+-----------+
Join the Employee table to itself — one alias as the employee, one as their manager — then filter where the employee's salary exceeds the manager's.

Why it exists:
A self-join is the technique for querying hierarchical relationships stored in a single table (employees/managers, parent/child records, category trees). In QA, the same pattern validates referential integrity within one table: checking that manager IDs point to valid senior-level employees, confirming that parent category records exist for all child records, or verifying that no circular references exist in a hierarchy.

``sql
-- Approach 1: self-join (faster, cleaner)
SELECT e.name AS employee, e.salary, m.name AS manager, m.salary AS manager_salary
FROM Employee e
JOIN Employee m ON e.managerId = m.id -- INNER JOIN: skips employees with no manager
WHERE e.salary > m.salary;
-- Result: Joe (70k) out-earns Sam (60k) → Joe appears

-- Approach 2: correlated subquery (reads like the English question)
SELECT e.name AS employee
FROM Employee e
WHERE e.salary > (
SELECT m.salary FROM Employee m WHERE m.id = e.managerId
);
``

Both return the same answer. Self-join is faster; the subquery reads more naturally.

Real-world QA use case:
A QA engineer discovers a data integrity issue: some employee records have managerId values that don't exist in the id column. She writes a self-join with a LEFT JOIN instead of INNER JOIN, then filters WHERE m.id IS NULL — this surfaces all employees with orphan manager references. The defect is a missing foreign key constraint on managerId that was never enforced.

Rule of thumb: When a table references itself (parent/child, manager/employee), self-join with two aliases. Use INNER JOIN to skip unmatched rows (no manager), LEFT JOIN to find them (orphan references).
💡 Plain English: Take one company photo and make two copies side by side. The left copy is "the employee," the right copy is "their manager." Connect each person to their boss with a string (e.managerId = m.id), then ask "is the pay on the left bigger than the pay on the right?" One table, two roles — that's a self-join.
3
Window Functions

Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK() with an example.

All three number rows in order, but differ in how ties are handled — ROW_NUMBER gives every row a unique number, RANK skips numbers after ties, DENSE_RANK never skips.

Why it exists:
The three ranking functions exist because different business problems need different tie-handling. "Pick exactly one row per group" needs ROW_NUMBER. "Show all tied records at the same position" needs RANK or DENSE_RANK. "Top N distinct salary levels" needs DENSE_RANK. Choosing the wrong one produces incorrect results that can be hard to spot without careful test data that includes ties.

``sql
-- Salaries: 100, 90, 90, 80

SELECT salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
FROM salaries;

-- Results:
-- salary | ROW_NUMBER | RANK | DENSE_RANK
-- 100 | 1 | 1 | 1
-- 90 | 2 | 2 | 2
-- 90 | 3 | 2 | 2 ← tied: same rank
-- 80 | 4 | 4 | 3 ← RANK skips 3; DENSE_RANK doesn't
``

- ROW_NUMBER: 1, 2, 3, 4 — always unique, ties broken arbitrarily
- RANK: 1, 2, 2, 4 — ties share a number, next number skips
- DENSE_RANK: 1, 2, 2, 3 — ties share a number, next number continues

Real-world QA use case:
A QA engineer is testing a leaderboard that should show all users tied for a position at the same rank. The developer used ROW_NUMBER — so two users with identical scores get ranks 3 and 4 instead of both getting rank 3. She creates a test dataset with three tied users at position 2, runs the leaderboard query, and confirms the defect: ROW_NUMBER arbitrarily picks which of the three gets rank 2, 3, and 4. DENSE_RANK would give all three rank 2.

Rule of thumb: ROW_NUMBER to grab exactly one row per group (deduplication, "latest per customer"). RANK when the business rule says "if two people tie for 2nd, the next is 4th." DENSE_RANK for "top N salary levels" where ties share a position.
💡 Plain English: Two runners tie for 2nd. RANK is the Olympics: both get silver, the next runner is 4th — bronze is skipped. DENSE_RANK is a kinder league: both get silver, the next runner gets bronze (3rd) — no position wasted. ROW_NUMBER is the photo finish: the camera picks one as 2nd and one as 3rd, even in a dead heat.
4
Window Functions

Write a query that returns each day's sales along with a running (cumulative) total of sales over time.

-- Sales
-- +------------+--------+
-- | sale_date  | amount |
-- +------------+--------+
-- | 2024-01-01 |    100 |
-- | 2024-01-02 |    150 |
-- | 2024-01-03 |     50 |
-- +------------+--------+
Use SUM() as a window function with ORDER BY to create a running total — each row's cumulative total grows as you move through the ordered dataset.

Why it exists:
Running totals appear in financial reports, progress dashboards, test pass-rate-over-time charts, and any metric that accumulates. In QA, running totals validate that batch-processed record counts accumulate correctly, that financial transaction totals reach the right final figure, or that a test run's pass count increases monotonically without dipping.

``sql
-- Approach 1: window function (fast, single pass)
SELECT sale_date,
amount,
SUM(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM Sales
ORDER BY sale_date;
-- Results: 100 | 250 | 300

-- For running total per customer (reset per group):
SUM(amount) OVER (PARTITION BY customer_id ORDER BY sale_date)

-- Approach 2: correlated subquery (legacy databases)
SELECT s.sale_date, s.amount,
(SELECT SUM(s2.amount) FROM Sales s2
WHERE s2.sale_date <= s.sale_date) AS running_total
FROM Sales s
ORDER BY s.sale_date;
``

The ROWS BETWEEN clause is important: without it, rows on the same date can produce the same running total rather than accumulating correctly.

Real-world QA use case:
A QA engineer is validating a daily revenue report that shows a running total. She runs the window function query against the test database and compares it against the UI. She discovers the report is using a plain SUM without partitioning — all customers' sales are being accumulated together instead of per-customer. Adding PARTITION BY customer_id to her query matches the expected per-customer running balances and confirms the UI bug.

Rule of thumb: SUM() OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) for running totals. Add PARTITION BY to reset the running total per group (per customer, per product, etc.).
💡 Plain English: Your bank balance, not your transaction list. A plain SUM gives you the grand total at the bottom — useful at month end, useless mid-month. The running total is the balance line on your statement: each day shows everything accumulated up to and including today.
5
Indexing

What is the difference between a clustered and a non-clustered index? How many of each can a table have?

A clustered index determines the physical order rows are stored — only one allowed per table. A non-clustered index is a separate sorted lookup structure with pointers to the actual rows — many allowed per table.

Why it exists:
Indexes are the primary tool for query performance. Understanding the difference between clustered and non-clustered matters for QA when diagnosing slow queries in performance testing, understanding why a query plan does a key lookup (expensive extra trip to the clustered index), and evaluating whether a "covering index" can eliminate that extra trip by including all needed columns.

``sql
-- Clustered: rows physically sorted by this key (one per table)
CREATE CLUSTERED INDEX ix_orders_id ON Orders(id);

-- Non-clustered: separate lookup structure with pointer to row (many allowed)
CREATE NONCLUSTERED INDEX ix_orders_email ON Orders(email);

-- Covering index: includes all columns the query needs → no extra row lookup
CREATE NONCLUSTERED INDEX ix_orders_covering
ON Orders(email)
INCLUDE (status, amount); -- query for email+status+amount hits only this index
``

Range queries (WHERE id BETWEEN 100 AND 200) are fast on the clustered index — matching rows are stored together. Non-clustered lookups by email find the pointer quickly, but then jump to the main table for other columns ("key lookup" in the query plan).

Real-world QA use case:
A QA engineer running performance tests notices that a customer lookup by email is slow despite an index existing. She checks the query execution plan and sees a "key lookup" operation — the non-clustered index on email doesn't include the status and name columns the query also needs, forcing an extra trip per row. She adds those as INCLUDE columns, reruns the performance test, and response time drops from 800ms to 40ms. The index design was the root cause.

Rule of thumb: One clustered index per table (the physical sort order). Many non-clustered indexes (separate lookup structures). When a query only needs columns that are in the index, use a covering index to eliminate the key lookup.
💡 Plain English: A clustered index is a phone book — the entries are physically printed in A–Z order, so the data itself is the sorted list. One per book. A non-clustered index is the index at the back of a textbook — a separate sorted list of topics pointing to page numbers. You can have many, but you still have to flip to the page to read the actual content.
6
Window Functions

How would you show the month-over-month change in sales?

-- MonthlySales(month, amount)
Use LAG() to pull the previous row's value onto the current row, then subtract — this gives the change without a self-join or subquery.

Why it exists:
LAG() and LEAD() are window functions that let you compare a row to its neighbours without a complex self-join. In QA work, this pattern validates trend data: confirming month-over-month metrics move in the expected direction, checking that daily active user counts don't unexpectedly drop, or verifying that test suite pass rates are improving over time rather than degrading.

``sql
SELECT month,
amount,
LAG(amount) OVER (ORDER BY month) AS prev_month,
amount - LAG(amount) OVER (ORDER BY month) AS change_vs_prev,
LEAD(amount) OVER (ORDER BY month) AS next_month
FROM MonthlySales
ORDER BY month;

-- Results for amounts 100, 150, 50:
-- month | amount | prev_month | change_vs_prev | next_month
-- 2024-01 | 100 | NULL | NULL | 150
-- 2024-02 | 150 | 100 | 50 | 50
-- 2024-03 | 50 | 150 | -100 | NULL
``

The first row's LAG is NULL (nothing before it). LEAD() is the mirror — it looks forward to the next row.

Real-world QA use case:
A QA engineer is validating a month-over-month growth report that should show positive change when sales increase. She runs the LAG query against the database and confirms the expected values. The UI shows a different figure for February — it's showing the absolute change as positive even when sales dropped. Comparing her LAG output to the UI reveals the report is calculating change_vs_prev using ABS(), masking negative trends. That's the defect.

Rule of thumb: LAG(column) OVER (ORDER BY date) gives the previous row's value on the current row. LEAD(column) gives the next row's value. Both return NULL at the boundary rows.
💡 Plain English: Checking your weight against last week's. LAG hands you last week's number to sit right next to today's so the difference is obvious at a glance — no flipping back and forth through old records. LEAD does the same but looks ahead.
7
Query Writing

How do you find the median salary?

-- Employee(id, salary)
Use PERCENTILE_CONT(0.5) for databases that support it — for others, rank from both ends and take the middle value where ascending and descending row numbers meet.

Why it exists:
The median is more robust than the mean for skewed salary data — one very high earner doesn't distort it. In QA, the same pattern validates percentile-based thresholds: checking that the median response time in a performance test stays below an SLA, verifying that test pass rates are above the 50th percentile, or validating statistical metrics in analytics features. The "two row_number" trick is a classic interview approach that tests deep window function understanding.

``sql
-- Approach 1: PERCENTILE_CONT (Postgres, SQL Server, Oracle)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median
FROM Employee;

-- Approach 2: two ROW_NUMBER window functions (works anywhere)
SELECT AVG(salary) AS median FROM (
SELECT salary,
ROW_NUMBER() OVER (ORDER BY salary) AS asc_pos,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS desc_pos
FROM Employee
) t
WHERE asc_pos IN (desc_pos, desc_pos + 1, desc_pos - 1);

-- Why it works: the centre row is where asc_pos and desc_pos are equal
-- For odd count: one row where they match exactly
-- For even count: two rows where they differ by 1 → AVG gives the median
``

Real-world QA use case:
A QA engineer is validating an analytics dashboard that shows "median order value." She runs PERCENTILE_CONT against the test data and gets £47.50. The UI shows £52.00 — it's using AVG instead of median. She inserts a test dataset with a known outlier (one £10,000 order among 99 normal orders): the AVG shoots up to £147, the median stays at £48. This proves the UI is using the wrong calculation — a data logic defect.

Rule of thumb: PERCENTILE_CONT(0.5) is the clean, direct solution for any database that supports it. The two-ROW_NUMBER technique is the portable fallback that works on any SQL database.
💡 Plain English: Line everyone up by height. The median is the person standing dead centre. If an even number share the middle (two people), you average the two middle heights. The median isn't pulled around by one giant outlier the way the average is.
8
Pivoting

How do you turn rows into columns — for example, total sales per quarter shown as four columns?

-- Sales(quarter, amount)  -- quarter is 'Q1'..'Q4'
Use CASE inside aggregate functions — one CASE expression per output column — or the PIVOT keyword on databases that support it (SQL Server/Oracle).

Why it exists:
Pivot queries transform row-oriented data into column-oriented reports, which is how most business dashboards are laid out. QA engineers validate pivot queries when testing reporting features, comparing expected column values against actual, and confirming that all categories are represented. Understanding the CASE+aggregate approach is essential because it works on every database platform (PIVOT is SQL Server/Oracle only).

``sql
-- Approach 1: CASE inside aggregates (portable — works on all databases)
SELECT
SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS Q1,
SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS Q2,
SUM(CASE WHEN quarter = 'Q3' THEN amount ELSE 0 END) AS Q3,
SUM(CASE WHEN quarter = 'Q4' THEN amount ELSE 0 END) AS Q4
FROM Sales;
-- Each CASE keeps only its quarter's rows, zeroes the rest → SUM totals just that quarter

-- Approach 2: PIVOT keyword (SQL Server / Oracle only)
SELECT Q1, Q2, Q3, Q4
FROM Sales
PIVOT (SUM(amount) FOR quarter IN (Q1, Q2, Q3, Q4)) AS p;
``

Real-world QA use case:
A QA engineer is verifying a quarterly sales summary report. She writes the CASE-based pivot query and gets Q1=45,200, Q2=38,100, Q3=51,800, Q4=0. The zero in Q4 immediately flags an issue — the report UI shows Q4=NULL instead of 0, which causes a JavaScript error when the charting library tries to render it. She raises it as a data formatting defect: the query should return 0 (or the display layer should handle NULL), not crash.

Rule of thumb: Use CASE inside SUM for portable pivot queries that work everywhere. Use PIVOT only when you're on SQL Server or Oracle and prefer the cleaner syntax.
💡 Plain English: Sorting post into labelled pigeonholes — one slot per quarter — and counting each slot. The CASE decides which slot each row drops into; the SUM tallies each pigeonhole. At the end you have one count per slot, turned into one column per quarter.
9
Subqueries

Find customers who bought product A but never bought product B.

-- Orders(customer_id, product)
Find A-buyers, then exclude anyone who also appears in the B-buyers set — use NOT EXISTS for NULL safety, LEFT JOIN...IS NULL for join-based thinking, or EXCEPT for the most readable set-theory approach.

Why it exists:
"Set A minus Set B" is a fundamental data validation pattern in QA work: find users who completed step 1 but never reached step 2, customers who were sent an email but never opened it, test cases that ran in suite A but not suite B. Each of the three approaches (NOT EXISTS, LEFT JOIN, EXCEPT) reveals different SQL thinking and has different NULL handling — knowing all three is expected at mid-level.

``sql
-- Approach 1: NOT EXISTS (safe default — handles NULLs correctly)
SELECT DISTINCT o.customer_id
FROM Orders o
WHERE o.product = 'A'
AND NOT EXISTS (
SELECT 1 FROM Orders o2
WHERE o2.customer_id = o.customer_id AND o2.product = 'B'
);

-- Approach 2: LEFT JOIN ... IS NULL
SELECT a.customer_id
FROM (SELECT DISTINCT customer_id FROM Orders WHERE product = 'A') a
LEFT JOIN (SELECT DISTINCT customer_id FROM Orders WHERE product = 'B') b
ON a.customer_id = b.customer_id
WHERE b.customer_id IS NULL;

-- Approach 3: EXCEPT (clearest to read — A-buyers minus B-buyers)
SELECT customer_id FROM Orders WHERE product = 'A'
EXCEPT
SELECT customer_id FROM Orders WHERE product = 'B';
``

All three return the same result. NOT EXISTS is the safest (NULL-proof). EXCEPT reads closest to plain English.

Real-world QA use case:
A QA engineer is validating a marketing segmentation feature that should identify customers who viewed a product page but never purchased. She writes the NOT EXISTS version to get the expected dataset, then compares it to the segment produced by the application. The application returns 340 customers; her query returns 312. Investigation reveals the app is using NOT IN and one customer has a NULL purchase record — the NOT IN silently returns zero matches (the classic NULL trap), inflating the segment count.

Rule of thumb: Use NOT EXISTS over NOT IN when the subquery could contain NULLs — NOT IN with a NULL in the list silently returns zero rows.
💡 Plain English: Finding people who own a cat but not a dog. Start with all cat owners, then cross off anyone who also appears on the dog-owner list. What remains is cat-only. Three roads to the same destination — NOT EXISTS, LEFT JOIN, and EXCEPT.
10
CTEs

What is a CTE, and when would you use one instead of a subquery?

A CTE (Common Table Expression) is a named, temporary result defined with WITH that can be referenced like a table in the main query — making complex multi-step queries readable and avoiding repeated subquery duplication.

Why it exists:
Nested subqueries become unreadable fast. A CTE breaks a complex query into named steps — each step refers to the previous by name rather than nesting deeper. For QA, CTEs are essential for writing readable database validation queries that other team members can understand and audit. They also enable recursive queries (org chart hierarchies, tree structures) that are impossible with plain subqueries.

``sql
-- As a CTE: named, readable, reusable within the same query
WITH big_spenders AS (
SELECT customer_id, SUM(amount) AS total
FROM Orders
GROUP BY customer_id
HAVING SUM(amount) > 1000
),
vip_customers AS (
SELECT id, name FROM Customers WHERE tier = 'VIP'
)
SELECT v.name, b.total
FROM big_spenders b
JOIN vip_customers v ON v.id = b.customer_id;

-- As a subquery (fine for simple, one-off use):
SELECT c.name, b.total
FROM (
SELECT customer_id, SUM(amount) AS total
FROM Orders GROUP BY customer_id HAVING SUM(amount) > 1000
) b
JOIN Customers c ON c.id = b.customer_id;
``

Use a CTE when: you'd repeat the same subquery more than once, the query has multiple steps that each depend on the previous, or you need recursion (see sql-mid-11).

Real-world QA use case:
A QA engineer writes a data validation query that needs to: (1) find all test users created this week, (2) find which of those placed orders, (3) find which of those orders triggered loyalty points. Writing this as three nested subqueries is unreadable. She writes it as three CTEs chained together — each references the previous by name. The query is reviewed and approved by the developer in minutes rather than requiring an explanation.

Rule of thumb: Use a CTE when a query has two or more logical steps, when a subquery is referenced more than once, or when readability matters more than brevity.
💡 Plain English: A subquery is a side note crammed into the middle of a sentence in brackets. A CTE is a labelled box you define at the start: "here's the big_spenders box." Then you just reference it by name — far easier to read when the query has multiple steps.
11
Recursive CTEs

How would you list an employee and everyone below them in the org chart?

-- Employee(id, name, manager_id)
Use a recursive CTE — the anchor (first part) selects the starting employee, the recursive (second part) keeps joining to find the next level down, until no new rows are found.

Why it exists:
Hierarchical data stored in a self-referencing table (parent/child, org chart, category tree, filesystem) cannot be traversed to arbitrary depth with a fixed-level JOIN. A recursive CTE walks the hierarchy level by level automatically. In QA, this is used to validate permission inheritance (does a user's role grant access at all hierarchy levels?), test category filtering in nested category trees, or confirm that an org chart report correctly shows all subordinates.

``sql
WITH RECURSIVE org AS (
-- Anchor: start at the chosen employee
SELECT id, name, manager_id, 0 AS level
FROM Employee
WHERE id = 1

UNION ALL

-- Recursive: find direct reports of everyone already in the result
SELECT e.id, e.name, e.manager_id, org.level + 1
FROM Employee e
JOIN org ON e.manager_id = org.id
)
SELECT id, name, level
FROM org
ORDER BY level, name;
-- Returns the starting employee + all subordinates at every level
``

The recursion stops automatically when the JOIN finds no new rows (no more subordinates). The level column shows depth — useful for indented display.

Real-world QA use case:
A QA engineer is testing a "view all team members" feature for a manager. She writes a recursive CTE to get the expected list of all employees under manager ID 5 (including indirect reports five levels deep). The feature only returns direct reports — two levels instead of all. The recursive CTE gives her the exact expected output to compare against, immediately quantifying the depth of the bug.

Rule of thumb: Recursive CTEs: anchor selects the starting row, recursive part joins back to the CTE to find the next level. The database stops when the recursive join returns zero rows.
💡 Plain English: Tracing a family tree downward. Start with one grandparent (anchor), find their children (first recursion), then their children's children (second recursion), and keep going until you reach people with no kids. The recursion just repeats "find the next level down" until nothing is left.
12
CASE

How do you group salaries into bands like Low / Medium / High?

-- Employee(id, salary)
Use CASE to assign a label to each row, then GROUP BY that label. CASE evaluates conditions top to bottom and stops at the first match — so condition order matters.

Why it exists:
CASE is SQL's conditional branching — the equivalent of if/else applied row by row. It's used constantly in QA for categorising data: grouping test results by severity band, classifying response times into fast/acceptable/slow buckets, converting numeric scores into pass/marginal/fail labels. Getting the order of conditions wrong (checking a wide range before a narrow one) produces incorrect bucketing that can be hard to spot.

``sql
SELECT
CASE
WHEN salary < 40000 THEN 'Low'
WHEN salary < 80000 THEN 'Medium'
ELSE 'High'
END AS band,
COUNT(*) AS people
FROM Employee
GROUP BY
CASE
WHEN salary < 40000 THEN 'Low'
WHEN salary < 80000 THEN 'Medium'
ELSE 'High'
END
ORDER BY MIN(salary); -- order bands by their lowest salary

-- Why order matters: salary = 35000
-- Check WHEN salary < 40000 first → 'Low' ✓ (stops here)
-- If we had WHEN salary < 80000 first → it would catch 35000 as 'Medium' ✗
``

Real-world QA use case:
A QA engineer is validating a performance reporting feature that bins API response times into Fast (<200ms), Acceptable (200ms–1000ms), and Slow (>1000ms). She writes the CASE query against the log table and gets Fast=840, Acceptable=120, Slow=12. The UI shows Fast=960, Acceptable=0, Slow=12 — the developer put the WHEN < 200 check before WHEN < 1000, but accidentally reversed the operators. The CASE condition order bug causes 200–999ms responses to be counted as Fast.

Rule of thumb: CASE evaluates conditions top to bottom and stops at the first match. Always put the most restrictive condition first (smallest range, lowest threshold) to avoid earlier conditions accidentally swallowing later ones.
💡 Plain English: Sorting people into T-shirt sizes. You don't record everyone's exact measurement — you check "under this? Small. Under that? Medium. Otherwise Large." CASE is that size chart applied row by row. Put the checks in the right order or a Small person gets called Medium.
13
Subqueries

What is the difference between IN, EXISTS, and a JOIN for checking related rows?

All three can answer "does a related row exist?" — IN checks against a list, EXISTS short-circuits on the first match, JOIN combines rows (use when you need the other table's columns). For large datasets, EXISTS is often faster; for NULL safety, prefer EXISTS over IN.

Why it exists:
These three approaches are conceptually equivalent but have different performance characteristics and NULL behaviours. IN silently returns zero rows if the subquery returns any NULL. EXISTS is NULL-safe and stops as soon as it finds one match. JOIN must be combined with DISTINCT to avoid row duplication when multiple matches exist. Knowing which to reach for — and why — shows depth of SQL understanding that interviewers test specifically.

``sql
-- Three ways to find customers who have placed an order:

-- IN: readable, fine for small lists, dangerous with NULLs
SELECT name FROM Customers
WHERE id IN (SELECT customer_id FROM Orders);

-- EXISTS: NULL-safe, stops at first match (fast for "does any row exist?")
SELECT name FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.customer_id = c.id);

-- JOIN: use when you also need columns from Orders
SELECT DISTINCT c.name, o.amount
FROM Customers c
JOIN Orders o ON o.customer_id = c.id;
-- Note: DISTINCT needed if a customer has multiple orders

-- The NULL trap with IN:
SELECT name FROM Customers
WHERE id NOT IN (SELECT customer_id FROM Orders WHERE customer_id IS NULL);
-- Returns ZERO rows if any customer_id is NULL in Orders
-- NOT EXISTS handles this correctly
``

Real-world QA use case:
A QA engineer writes a validation query to find test users who have no associated orders — she uses NOT IN. The query returns 0 rows, which seems wrong. She discovers some Orders rows have NULL customer_id (from a bug in a previous test). NOT IN with a NULL in the subquery always returns zero rows. Switching to NOT EXISTS returns the correct 47 orphan users. She both fixes her query and raises a defect for the NULL customer_id issue.

Rule of thumb: Use EXISTS for "does any match exist" checks — it's faster on large tables and NULL-safe. Use IN for small, known lists. Use JOIN when you need columns from the joined table.
💡 Plain English: Checking if a guest is on the VIP list. IN reads the whole list and checks your name against every entry. EXISTS stops the moment it spots your name — no need to read the rest. JOIN is when you also want to grab their table assignment while checking the list.
14
NULL Gotchas

Why can NOT IN with a subquery suddenly return no rows?

If the subquery returns even one NULL, NOT IN returns zero rows for the entire outer query — silently, with no error. The fix is NOT EXISTS, which handles NULLs correctly.

Why it exists:
This is one of SQL's most dangerous silent bugs. NOT IN works by checking "is this value not equal to any value in the list?" — but comparing anything to NULL gives UNKNOWN (not FALSE), so the comparison fails for every row. The result: zero rows returned with no hint of what went wrong. Developers and QA engineers hit this constantly when validating data with optional foreign keys.

``sql
-- Employees table: manager_id has some NULLs (top-level managers have no manager)

-- BUG: returns ZERO rows if ANY manager_id is NULL
SELECT name FROM Employee
WHERE id NOT IN (SELECT manager_id FROM Employee);
-- Fails because: id NOT IN (1, 3, NULL) → id <> 1 AND id <> 3 AND id <> NULL
-- id <> NULL is always UNKNOWN → the whole condition is UNKNOWN → row excluded

-- FIX 1: NOT EXISTS (NULL-safe — doesn't compare to NULL)
SELECT name FROM Employee e
WHERE NOT EXISTS (
SELECT 1 FROM Employee m WHERE m.manager_id = e.id
);

-- FIX 2: Filter NULLs from the subquery (if NULLs should be excluded)
SELECT name FROM Employee
WHERE id NOT IN (
SELECT manager_id FROM Employee WHERE manager_id IS NOT NULL
);
``

Real-world QA use case:
A QA engineer runs a verification query to find all non-manager employees after a migration. NOT IN returns 0 rows. She checks the source data: 3 top-level executives have NULL manager_id. The NULL in the subquery silently kills the entire NOT IN check. She switches to NOT EXISTS and gets the correct 847 non-manager employees. She also adds a comment explaining the NULL trap to the shared query library so teammates don't hit it again.

Rule of thumb: Never use NOT IN when the subquery could contain NULLs. Always prefer NOT EXISTS — it's NULL-safe and often faster. If you must use NOT IN, add WHERE column IS NOT NULL to the subquery.
💡 Plain English: Asking "is this person NOT on any of these lists?" — but one list has a smudged, unreadable entry (NULL). Since you cannot be sure they are not on the smudged one, the answer is "unknown" — so the whole check gives up. NOT EXISTS doesn't look at the smudged list the same way and gets it right.
15
Aggregation

Find the department with the highest average salary.

-- Employee(id, department_id, salary)
Group by department, compute AVG(salary), sort descending, and take the top row. Use RANK() instead of LIMIT 1 if you want all tied winners.

Why it exists:
"Maximum aggregate per group" is a classic problem that tests whether you can correctly combine GROUP BY, ORDER BY, and either LIMIT or window functions. In QA work, the same pattern validates: which test suite has the highest failure rate, which product category has the best conversion rate, which server has the highest average response time.

``sql
-- Approach 1: simple (LIMIT 1 — drops ties)
SELECT department_id, AVG(salary) AS avg_salary
FROM Employee
GROUP BY department_id
ORDER BY avg_salary DESC
LIMIT 1;

-- Approach 2: with RANK() — preserves ties (both departments returned if equal)
SELECT department_id, avg_salary
FROM (
SELECT department_id,
AVG(salary) AS avg_salary,
RANK() OVER (ORDER BY AVG(salary) DESC) AS rnk
FROM Employee
GROUP BY department_id
) t
WHERE rnk = 1;
``

LIMIT 1 is fine when ties genuinely don't matter. RANK() is required when you must return all tied winners.

Real-world QA use case:
A QA engineer is validating a "Top Department" banner on the company intranet that highlights the department with the best average salary for a quarterly competition. She writes both approaches against the test data. With Approach 1, when two departments tie, one is dropped arbitrarily — the banner shows the wrong winner half the time. She raises this as a defect and recommends Approach 2, which correctly shows both tied departments.

Rule of thumb: LIMIT 1 for "just give me one winner." RANK() WHERE rnk = 1 when tied results must all be returned.
💡 Plain English: Finding which school class has the highest average test score. You compute each class's average, line them up, and pick the top. The RANK version is for when two classes tie for first — you want both on the winner's board, not just whichever happened to come first in the sort.
16
Window Functions

How do you show each category as a percentage of total sales?

-- Sales(category, amount)
Divide each category's total by the grand total — use a window SUM() OVER() to put the grand total on every row in one pass, or a scalar subquery for the same result with simpler syntax.

Why it exists:
Percentage-of-total is one of the most common metrics in business reports. In QA, this validates: share-of-voice reports, test coverage distribution across components, error rate per category as a percentage of total errors. The double SUM pattern — SUM(SUM()) — is a classic intermediate SQL technique that tests understanding of how aggregates interact with window functions.

``sql
-- Approach 1: window function (single pass — most efficient)
SELECT category,
SUM(amount) AS category_total,
100.0 * SUM(amount) / SUM(SUM(amount)) OVER () AS pct_of_total
FROM Sales
GROUP BY category;
-- SUM(SUM(amount)) OVER () = grand total across all categories
-- Use 100.0 (not 100) to avoid integer division truncating to 0

-- Approach 2: scalar subquery (simpler to read, works on older databases)
SELECT category,
SUM(amount) AS category_total,
100.0 * SUM(amount) / (SELECT SUM(amount) FROM Sales) AS pct_of_total
FROM Sales
GROUP BY category;
``

Real-world QA use case:
A QA engineer is validating a product breakdown report that shows each category's revenue as a percentage of total. She runs Approach 2 and gets Electronics=42.3%, Clothing=31.8%, Books=25.9% — they sum to 100%. The UI shows Electronics=45.1%, Clothing=34.2%, Books=27.5% — they sum to 106.8%. The application is calculating each percentage against a different (filtered) total rather than the true grand total. The scalar subquery reveals the discrepancy precisely.

Rule of thumb: Use 100.0 (not 100) in percentage calculations to avoid integer division. Verify that percentages across all categories sum to 100% as a sanity check in your test assertions.
💡 Plain English: Working out what slice of the pie each category gets. First measure each slice (the group total), then divide by the full pie (the grand total). The window function sticks the full pie size on every row so you can divide in one pass.
17
Window Functions

For each customer, find their most recent order.

-- Orders(id, customer_id, order_date, amount)
Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) to rank each customer's orders newest-first, then keep only rank 1.

Why it exists:
"Latest record per group" is one of the most common query patterns in analytics and reporting: most recent login per user, last order per customer, latest test result per test case. Window functions are the clean modern solution. The correlated subquery is the legacy approach — but it returns multiple rows per customer if two orders share the same date.

``sql
-- Approach 1: ROW_NUMBER (guarantees exactly one row per customer)
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC, id DESC -- id as tiebreaker for same-date orders
) AS rn
FROM Orders
)
SELECT * FROM ranked WHERE rn = 1;

-- Approach 2: correlated subquery (simpler but returns multiple rows on date ties)
SELECT * FROM Orders o
WHERE order_date = (
SELECT MAX(order_date) FROM Orders o2
WHERE o2.customer_id = o.customer_id
);
``

ROW_NUMBER guarantees exactly one row per customer. Adding id DESC as a tiebreaker makes the result deterministic when two orders share the same date.

Real-world QA use case:
A QA engineer is validating a "last activity" dashboard that shows each customer's most recent order. She finds the application returns two rows for customer 142 — they placed two orders on the same day. Using ROW_NUMBER with id DESC as a tiebreaker gives exactly one row per customer and makes the result deterministic. She raises the ambiguity as a defect: the query needs a tiebreaker, or the UI needs to handle multiple same-date orders explicitly.

Rule of thumb: Use ROW_NUMBER (not RANK) for "one latest row per group" — RANK returns multiple rows on ties. Add a tiebreaker column (id DESC) to ORDER BY for deterministic results.
💡 Plain English: Grabbing the top letter from each person's pile of post, where each pile is sorted newest on top. PARTITION BY creates one pile per person. Picking row number 1 takes exactly the latest from each pile — with ROW_NUMBER guaranteeing exactly one envelope, even if two arrived on the same day.
18
Subqueries

What is a correlated subquery, and why can it be slow?

A correlated subquery references columns from the outer query, forcing it to re-execute once for every outer row — which is slow on large tables. The fix is to pre-calculate the repeated result with GROUP BY and JOIN.

Why it exists:
Correlated subqueries are intuitive to read ("for this employee, get their department's average") but can be catastrophically slow: a table with 100,000 rows re-runs the inner query 100,000 times. For QA, recognising correlated subqueries in slow query reports is an important performance testing skill — they're often the culprit in "this was fast last month but now times out" issues as table sizes grow.

``sql
-- Correlated subquery: inner query runs once PER employee row (slow on big tables)
SELECT e.name
FROM Employee e
WHERE e.salary > (
SELECT AVG(salary) FROM Employee
WHERE department_id = e.department_id -- depends on outer row → re-runs each time
);

-- Fix: compute averages once with GROUP BY, then JOIN (runs inner query once)
SELECT e.name
FROM Employee e
JOIN (
SELECT department_id, AVG(salary) AS dept_avg
FROM Employee
GROUP BY department_id
) dept ON dept.department_id = e.department_id
WHERE e.salary > dept.dept_avg;
-- Inner subquery runs ONCE → much faster at scale
``

Real-world QA use case:
A QA engineer is running performance tests and notices a "salary comparison" report times out when the Employee table reaches 50,000 rows. She checks the query plan and sees the inner AVG subquery re-executing 50,000 times — a correlated subquery. She rewrites it as a JOIN to the pre-calculated averages. The query time drops from 45 seconds to 0.3 seconds. She adds this pattern to the team's SQL performance checklist.

Rule of thumb: If an inner subquery references the outer table, it's correlated and re-runs per row. Pre-calculate repeated aggregates with GROUP BY + JOIN to run them once instead.
💡 Plain English: Phoning the bank to check your balance before every single purchase, rather than checking once and keeping a note. Per-purchase works, but it's painfully slow when you shop a lot — and catastrophic if you're shopping 100,000 times.
19
Window Functions

How do you find missing numbers (gaps) in a column of sequential IDs?

-- Invoices(id)  -- ids should be 1,2,3,... but some are missing
Compare each row to the next with LEAD, and flag where the jump is more than 1:

``sql
SELECT id + 1 AS gap_start,
next_id - 1 AS gap_end
FROM (
SELECT id, LEAD(id) OVER (ORDER BY id) AS next_id
FROM Invoices
) t
WHERE next_id - id > 1;
`

LEAD(id)` puts the next id on the same row. If the gap between this id and the next is bigger than 1, there are missing numbers in between.

Why it exists:
Sequential ID gaps are a data integrity signal — they indicate rows that were deleted, failed to insert, or were never created. For QA, detecting gaps in invoice numbers, order IDs, or transaction sequences is a common data validation task. A gap in an invoice sequence might mean a transaction was rolled back, a record was hard-deleted against policy, or a batch insert failed silently without logging an error.

Real-world QA use case:
A QA engineer validating a payment processing system notices that invoice IDs should be sequential but some numbers appear to be missing. She runs the LEAD-based gap query and discovers IDs 10,045 through 10,049 are absent. Investigation reveals a failed batch import from a partner system that silently swallowed five invoices. Without the gap query, those five missing transactions would have gone unnoticed during data validation.

Rule of thumb: when validating sequential ID or invoice-number columns, always check for gaps — they often signal silent failures, hard deletes, or batch import errors that no other automated test would catch.
💡 Plain English: It's like checking a roll of numbered tickets for missing ones. You read each ticket and peek at the next — if ticket 5 is followed by ticket 9, you know 6, 7, and 8 are missing.
20
Window Functions

Find all employees who earn more than the average salary of their own department.

-- Employee(id, name, department_id, salary)
Approach 1 — correlated subquery (easiest to read). For each employee, compare their salary to their own department's average:

``sql
SELECT name, salary, department_id
FROM Employee e
WHERE salary > (
SELECT AVG(salary) FROM Employee
WHERE department_id = e.department_id
);
`

Approach 2 — window function (faster on big tables). Work out each department's average once, alongside every row, then filter:

`sql
SELECT name, salary, department_id
FROM (
SELECT name, salary, department_id,
AVG(salary) OVER (PARTITION BY department_id) AS dept_avg
FROM Employee
) t
WHERE salary > dept_avg;
``

Approach 1 reads most clearly; Approach 2 works out each average just once instead of re-running it for every row, so it scales better on large tables.

Why it exists:
Comparing individual rows against a per-group aggregate is one of the most common patterns in data validation reporting — checking whether an individual value is above or below its peer group's benchmark. QA engineers use this to verify that bonus calculations, discount applications, and threshold rules correctly use per-group context (the department or category) rather than a misleading global average.

Real-world QA use case:
A QA engineer is validating a salary adjustment script that flags employees earning above their department's average for a pay review. She runs both approaches — correlated subquery and window function — and compares the results on the test dataset. They agree on the dev database, but on the 1M-row production clone the correlated subquery takes 90 seconds while the window function finishes in under 2 seconds. She files a performance finding and recommends the window function approach for production.

Rule of thumb: use a window function with PARTITION BY when comparing individual rows to a per-group aggregate — it computes each group's average once instead of re-running it for every row, making it safe at scale.
💡 Plain English: It's like asking "who scored above their *own class's* average?" — not the whole school's. `PARTITION BY department` works out each class's average separately, then you keep only the students who beat their own class.
21
Indexing

Why might a query ignore an index on a column, and how do you fix it?

SELECT * FROM Orders WHERE YEAR(order_date) = 2024;
Wrapping the column in a function — YEAR(order_date) — means the database can't use the index on order_date, because the index stores the raw dates, not the result of YEAR(). This is a "non-sargable" condition, and it forces a full scan.

Fix it by leaving the column bare and using a range instead:

``sql
SELECT * FROM Orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
`

Now the index works. The same problem shows up with
UPPER(name), column + 1 = 10, and a LIKE '%foo' that starts with a wildcard.

Why it exists:
Non-sargable conditions are one of the most common performance mistakes QA engineers encounter when validating SQL that scales to production sizes. A query using YEAR(order_date) may run in 100ms on a 10,000-row test database but take 30 seconds on a 50M-row production table because the index is bypassed. Understanding this pattern prevents "passes in test, slow in production" performance bugs from reaching users.

Real-world QA use case:
A QA engineer running performance tests against a staging database flags a report query using
YEAR(created_at) = 2024 in its WHERE clause. The EXPLAIN plan shows a full sequential scan on a 20M-row Orders table despite an index on created_at`. She rewrites the predicate to a date range and re-runs the test — the query drops from 45 seconds to 120 milliseconds. The fix goes into the story's acceptance criteria before the developer ships to production.

Rule of thumb: never wrap an indexed column in a function in a WHERE clause — rewrite the condition as a range or equality on the bare column so the index is actually used.
💡 Plain English: An index is a sorted phone book. Searching by surname is instant. But asking "whose name has exactly 7 letters?" forces you to read every single entry — the book isn't sorted by *length*. Wrapping a column in a function does the same thing: it asks a question the sorted index can't answer.
22
Data Quality

How do you find rows that are duplicated across several columns, not just one?

-- Orders(customer_id, product, order_date)
Approach 1 — GROUP BY + HAVING (easiest). Group by all the columns that together define a duplicate, and keep groups with more than one row:

``sql
SELECT customer_id, product, order_date, COUNT(*) AS copies
FROM Orders
GROUP BY customer_id, product, order_date
HAVING COUNT(*) > 1;
`

The combination of
customer_id + product + order_date defines what "the same" means here.

Approach 2 — window function (when you want the full duplicate rows back). Count copies within each group, alongside every row, then keep the ones with more than one:

`sql
SELECT * FROM (
SELECT *, COUNT(*) OVER (
PARTITION BY customer_id, product, order_date
) AS copies
FROM Orders
) t
WHERE copies > 1;
``

Approach 1 lists the duplicated *combinations*; Approach 2 hands back the actual rows (handy when you need to inspect or delete them).

Why it exists:
Single-column duplicate checks catch the simplest cases, but real data quality issues often involve composite keys — the same customer placing the same order on the same date twice, or the same product being loaded twice in a bulk import. QA engineers need multi-column duplicate checks for post-migration validation, ETL testing, and data integrity verification after bulk operations.

Real-world QA use case:
A QA engineer is validating an order data migration. She runs a single-column COUNT on customer_id and finds no duplicates. But when she applies a three-column GROUP BY on (customer_id, product, order_date), she discovers 340 duplicate order rows — all created during a retry of a failed batch import that didn't check for existing records. Without the composite duplicate check, the migration would have shipped with 340 duplicate orders affecting real customer accounts.

Rule of thumb: always define what "duplicate" means for the specific table before running a duplicate check — a single column is rarely enough; the check should mirror the table's natural unique key.
💡 Plain English: It's spotting people who registered twice. One person might share a *first name* with someone else — not a duplicate. But the same first name *and* last name *and* birthday? Now that's almost certainly one person signed up twice.
23
Aggregation

How do you get subtotals and a grand total in the same result?

-- Sales(region, product, amount)
ROLLUP adds subtotal and grand-total rows automatically:

``sql
SELECT region, product, SUM(amount) AS total
FROM Sales
GROUP BY ROLLUP (region, product);
``

You get the normal per-region-per-product totals, *plus* a subtotal per region, *plus* one grand total at the bottom. The summary rows show NULL in the columns they're rolling up over (the grand total has NULL for both region and product).

Why it exists:
Reporting queries that show both detail rows and summaries are extremely common in financial and operational dashboards. Without ROLLUP you need three separate queries (detail, subtotals, grand total) unified with UNION ALL — verbose and error-prone. For QA, ROLLUP also enables totals verification: you can confirm that the sum of all category subtotals equals the grand total in a single query, which is a standard data validation pattern.

Real-world QA use case:
A QA engineer is validating a sales reporting feature whose acceptance criteria states "the report must show sales by region and product, with regional subtotals and a company-wide grand total." She uses ROLLUP in her verification query to compute the expected values, then compares them against the report's output. The comparison immediately reveals that the report uses INNER JOIN instead of LEFT JOIN, silently dropping two regions with zero sales from the regional subtotals.

Rule of thumb: use ROLLUP when you need subtotals and a grand total in one result — it's cleaner than three separate queries unified with UNION ALL, and the NULL markers in summary rows make them easy to identify and filter.
💡 Plain English: It's a receipt that shows each item, then a subtotal per department, then the total at the very bottom — all in one printout. `ROLLUP` adds those summary lines for you, instead of you running three separate queries.
24
Self Join

Find all pairs of employees who work in the same department.

-- Employee(id, name, department_id)
Join the table to itself on department, and use < to avoid duplicates and self-pairs:

``sql
SELECT a.name AS emp1, b.name AS emp2, a.department_id
FROM Employee a
JOIN Employee b
ON a.department_id = b.department_id
AND a.id < b.id;
`

a.id < b.id is the key trick: without it you'd get each pair twice (A–B and B–A) plus everyone paired with themselves.

Why it exists:
Self-joins are needed whenever you want to compare rows within the same table — finding conflicting records, detecting mutual relationships, or identifying pairs that violate a business rule. In QA, self-joins appear when validating constraints like "no two users with the same role can be assigned to the same project" or verifying that a conflict-detection algorithm is actually firing.

Real-world QA use case:
A QA engineer is testing a conflict-of-interest validation rule: no two employees on the same project should report to the same manager. She writes a self-join query to find all pairs of project members who share a manager_id within the same project_id. The query returns six pairs that should have triggered a compliance warning but didn't — revealing that the validation rule had never been implemented despite being in the specification.

Rule of thumb: use a self-join with
a.id < b.id` to find all unique pairs within the same table — the inequality eliminates both reversed duplicates (A,B and B,A) and self-pairings (A,A).
💡 Plain English: It's like making a handshake list for people in the same room. You don't shake your own hand, and A–B is the same handshake as B–A — so the `a.id < b.id` rule counts each handshake just once.
25
Window Functions

How do you split rows into 4 equal groups, like salary quartiles?

-- Employee(id, name, salary)
NTILE(4) divides the ordered rows into 4 buckets as evenly as possible:

``sql
SELECT name, salary,
NTILE(4) OVER (ORDER BY salary) AS quartile
FROM Employee;
`

Quartile 1 is the lowest-paid quarter, quartile 4 the highest.
NTILE(10)` would give deciles, and so on. It's the standard way to split data into equal-sized ranked groups.

Why it exists:
Dividing data into equal-ranked groups is a core data analysis and validation technique. QA engineers use NTILE when verifying that business rules correctly treat different tiers differently — that top-quartile customers receive the advertised discount, that bottom-decile performers are included in the correct review process, or that percentile cutoffs in reports match the defined thresholds.

Real-world QA use case:
A QA engineer is validating a loyalty rewards program that promises different benefits to customers in each spending quartile. She uses NTILE(4) to rank all test customers by total spend and then verifies each customer has received the correct reward tier. She discovers that customers at the exact boundary between quartile 2 and quartile 3 are being assigned to quartile 2 by the application but NTILE places them in quartile 3 — a fencepost error in the application's ranking logic that would silently under-reward borderline customers.

Rule of thumb: use NTILE(n) to divide ordered rows into n equal-sized buckets — it's the standard way to compute quartiles, deciles, or any equal-sized ranked groups and is ideal for validating tier-based business rules.
💡 Plain English: It's like sorting runners by finish time and splitting them into four equal squads — the fastest quarter, the next quarter, and so on. `NTILE(4)` does the dividing for you, keeping the squads as even in size as possible.
26
Window Functions

What is the difference between GROUP BY and PARTITION BY?

Both group rows by a column — the difference is what comes *out*:

- GROUP BY *collapses* each group into a single summary row. The individual rows are gone.
- PARTITION BY (used with a window function) keeps *every* row, and adds the group's calculation alongside each one.

Example: same data, two very different outputs —
``sql
-- GROUP BY: one row per department (detail rows gone)
SELECT department_id, AVG(salary) AS dept_avg
FROM Employee
GROUP BY department_id;

-- PARTITION BY: every employee, with their department average attached
SELECT name, salary, department_id,
AVG(salary) OVER (PARTITION BY department_id) AS dept_avg
FROM Employee;
``
Use GROUP BY when you only want the summary; use PARTITION BY when you want to compare each row *to* its group (e.g. "is this person above their department average?").

Why it exists:
Confusing GROUP BY and PARTITION BY is one of the most common SQL mistakes — both group rows, but one collapses them and one doesn't. QA engineers need to recognise which is appropriate when writing verification queries: use GROUP BY to generate expected summary totals, and PARTITION BY when you need to validate per-row values against a group benchmark simultaneously.

Real-world QA use case:
A QA engineer is verifying a report that shows each employee's salary alongside their department's average. Using GROUP BY, she gets one row per department with no individual details — the wrong shape. Switching to PARTITION BY, she gets every employee with their department average attached in a single pass, which she can then compare column-for-column against the report's output to verify correctness.

Rule of thumb: GROUP BY collapses groups into one summary row; PARTITION BY keeps every row and adds the group's calculation alongside each one — choose based on whether you need detail rows or summary rows.
💡 Plain English: GROUP BY is the final scoreboard — one line per team, the players gone. PARTITION BY is each player's stat card that *also* prints their team's average on it — every player still on the page.
27
Transactions

How do transactions work — BEGIN, COMMIT, ROLLBACK, and SAVEPOINT?

A transaction groups several statements so they succeed or fail *together*.

- BEGIN (or START TRANSACTION) — start grouping.
- COMMIT — make all the changes permanent.
- ROLLBACK — undo everything since BEGIN.
- SAVEPOINT — a checkpoint you can roll back to *partway*, without throwing away the whole transaction.

Example:
``sql
BEGIN;
UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT after_debit;
UPDATE Accounts SET balance = balance + 100 WHERE id = 2;
ROLLBACK TO after_debit; -- undoes just the credit, keeps the debit
COMMIT;
``
Until you COMMIT, no other session sees your changes, and a crash rolls them back automatically.

Why it exists:
Transactions are the mechanism that keeps multi-step operations atomic — either everything happens or nothing does. For QA, understanding transactions is critical when writing test setup and teardown scripts, when validating that a payment or transfer can't be left half-complete, and when testing failure scenarios that should trigger a rollback.

Real-world QA use case:
A QA engineer is writing test teardown scripts that restore a test database to a known state after each test run. She wraps the entire cleanup in a BEGIN/ROLLBACK block, which lets her execute all the cleanup steps and verify the final state without actually committing the changes. If the cleanup script itself has a bug, ROLLBACK discards everything and leaves the database unchanged — making the teardown idempotent and safe to run repeatedly.

Rule of thumb: always wrap multi-step data modifications in an explicit transaction — if any step fails, ROLLBACK leaves the database in a clean state rather than a half-modified one.
💡 Plain English: It's editing a document with "track changes" on. BEGIN starts a draft, SAVEPOINT is a named checkpoint, ROLLBACK is undo (all the way, or back to a checkpoint), and COMMIT is "accept all changes" — only then does everyone else see the final version.
28
Programmability

What is the difference between a stored procedure, a function, and a trigger?

All three are saved bits of logic that live inside the database:

- Stored procedure — a saved routine you *call explicitly* to do work (often several statements). Can change data. Run it with CALL / EXEC.
- Function — returns a *single value* (or a table) and is meant to be used *inside a query* (in SELECT, WHERE, etc.). Usually can't change data.
- Trigger — logic that *fires automatically* in response to an event (INSERT/UPDATE/DELETE on a table). You never call it directly.

Example:
``sql
-- Function: used inside a query
SELECT id, fn_tax(amount) AS tax FROM Orders;

-- Procedure: called on purpose to do work
CALL archive_old_orders(2020);

-- Trigger: runs by itself on every insert
CREATE TRIGGER set_created_at
BEFORE INSERT ON Orders
FOR EACH ROW SET NEW.created_at = NOW();
`

Why it exists:
Stored procedures, functions, and triggers each automate different kinds of database logic — and confusing them leads to unmaintainable code. For QA, understanding the difference matters when testing: procedures need to be called explicitly with test parameters, functions can be validated inline in SELECT queries, and triggers fire automatically so their effects must be verified indirectly by checking what changed in the table after an INSERT or UPDATE.

Real-world QA use case:
A QA engineer is writing SQL test scripts for a financial application. She realises that a
set_created_at` trigger automatically stamps every new row — her test setup must account for this so it doesn't accidentally overwrite a timestamp she's trying to set explicitly for date-range testing. Knowing that triggers fire automatically and silently means she adds a step to verify the trigger's output rather than assuming the timestamp is what her INSERT specified.

Rule of thumb: test stored procedures by calling them explicitly; test functions by wrapping them in SELECT; test triggers by inspecting what changed in the target table after the triggering event fires.
💡 Plain English: A **function** is a calculator you use mid-sentence ("what's the tax on this?"). A **stored procedure** is a recipe you deliberately run ("do the month-end cleanup"). A **trigger** is a motion-sensor light — you don't switch it on, it reacts automatically when something happens.
29
Data Modification

How do you "insert or update" a row in a single statement (an upsert)?

-- Inventory(product_id PRIMARY KEY, qty)
An upsert inserts a new row, or updates it if it already exists — in one atomic step, avoiding a race between a separate SELECT and INSERT. The syntax differs by database:

Postgres / SQLite — INSERT ... ON CONFLICT:
``sql
INSERT INTO Inventory (product_id, qty)
VALUES (42, 10)
ON CONFLICT (product_id)
DO UPDATE SET qty = Inventory.qty + EXCLUDED.qty;
`

MySQL —
INSERT ... ON DUPLICATE KEY UPDATE:
`sql
INSERT INTO Inventory (product_id, qty)
VALUES (42, 10)
ON DUPLICATE KEY UPDATE qty = qty + VALUES(qty);
`

SQL Server / Oracle —
MERGE:
`sql
MERGE INTO Inventory AS t
USING (SELECT 42 AS product_id, 10 AS qty) AS s
ON t.product_id = s.product_id
WHEN MATCHED THEN UPDATE SET t.qty = t.qty + s.qty
WHEN NOT MATCHED THEN INSERT (product_id, qty) VALUES (s.product_id, s.qty);
``
All do the same job; they rely on a unique/primary key to decide "does this already exist?"

Why it exists:
Without an atomic upsert, you need a SELECT then INSERT-or-UPDATE — two round trips with a race condition between them. Two concurrent requests can both see "no row exists," both insert, and you get a duplicate key error. Upsert solves this atomically. For QA, upsert logic is important to test with concurrent requests and boundary conditions (first insert vs. subsequent updates).

Real-world QA use case:
A QA engineer is testing an inventory sync feature that runs every 5 minutes. She designs a concurrency test that fires two simultaneous sync requests for the same product and verifies that the database ends up with exactly one row, not two duplicates and not an error. Without the ON CONFLICT clause the application was throwing duplicate key errors under concurrent syncs — the upsert rewrite is what makes the sync idempotent.

Rule of thumb: use upsert (INSERT ... ON CONFLICT, ON DUPLICATE KEY UPDATE, or MERGE) when a write must succeed whether or not the row already exists — always test it with both the first-insert and the update path, and with concurrent requests.
💡 Plain English: It's "save" in a video game. If a save slot for this level already exists, overwrite it; if not, create one. One action — not "check whether a save exists, then maybe create it."
30
Joins

Find all customers who have never placed an order.

-- Customers(id, name)
-- Orders(id, customer_id)
Approach 1 — LEFT JOIN ... IS NULL (the classic anti-join). Keep customers whose matching order is missing:
``sql
SELECT c.id, c.name
FROM Customers c
LEFT JOIN Orders o ON o.customer_id = c.id
WHERE o.id IS NULL;
`

Approach 2 — NOT EXISTS (often the clearest).
`sql
SELECT id, name
FROM Customers c
WHERE NOT EXISTS (
SELECT 1 FROM Orders o WHERE o.customer_id = c.id
);
`

Approach 3 — NOT IN (watch the NULL trap).
`sql
SELECT id, name FROM Customers
WHERE id NOT IN (SELECT customer_id FROM Orders WHERE customer_id IS NOT NULL);
`
NOT EXISTS and the LEFT JOIN are the safe go-tos; NOT IN needs the IS NOT NULL` guard, or one NULL makes it return nothing.

Why it exists:
Finding "records with no matching child" is one of the most common data integrity checks in QA — customers with no orders, users with no activity, products with no pricing. Each approach (LEFT JOIN, NOT EXISTS, NOT IN) has different performance characteristics and a silent NULL trap in NOT IN that has bitten many QA engineers who got back zero rows and assumed everything was fine.

Real-world QA use case:
A QA engineer is validating a churn report that should list customers with no orders in the last 90 days. Her initial query uses NOT IN against the Orders table — but the query returns zero results because one order has a NULL customer_id from a guest checkout. The NOT IN NULL trap makes the entire query silently return nothing. Switching to NOT EXISTS fixes it immediately, and she adds a note in the test plan: always use NOT EXISTS or LEFT JOIN IS NULL for anti-join queries on tables that may contain NULLs.

Rule of thumb: use NOT EXISTS or LEFT JOIN ... IS NULL for anti-join queries — NOT IN is unsafe if the subquery's column can contain NULLs, which will silently return zero rows instead of the expected results.
💡 Plain English: It's the guest list minus the people who showed up. Start with everyone invited (Customers), cross off anyone who appears in the attendance log (Orders) — whoever's left never came.
31
Set Operators

What do EXCEPT and INTERSECT do?

They combine two result sets like UNION, but by *set logic* (both queries must return matching columns):

- INTERSECT — rows that appear in both queries.
- EXCEPT (called MINUS in Oracle) — rows in the first query that are not in the second.

Example:
``sql
-- Customers who bought in BOTH 2023 and 2024:
SELECT customer_id FROM Orders WHERE year = 2023
INTERSECT
SELECT customer_id FROM Orders WHERE year = 2024;

-- Customers who bought in 2023 but NOT in 2024 (churned):
SELECT customer_id FROM Orders WHERE year = 2023
EXCEPT
SELECT customer_id FROM Orders WHERE year = 2024;
``
Both remove duplicates automatically (like UNION, unlike UNION ALL).

Why it exists:
INTERSECT and EXCEPT enable concise set-comparison queries that would otherwise require complex joins or NOT EXISTS subqueries. For QA, EXCEPT is especially valuable for cross-environment data validation: comparing what's in production against what's in staging and finding rows that exist in one but not the other — a classic data migration verification pattern.

Real-world QA use case:
A QA engineer is validating a data migration from an old system to a new one. She uses EXCEPT to find customer records that exist in the old system but are missing from the new one, and then EXCEPT in the reverse direction to find records that appeared in the new system but shouldn't exist at all. The two EXCEPT queries together give her a complete picture of migration gaps and phantom records in under a minute.

Rule of thumb: use EXCEPT to find rows in set A that are missing from set B — it's the cleanest way to detect gaps in data migrations, synchronisation jobs, or any process that should produce matching records in two tables.
💡 Plain English: Picture two overlapping circles (a Venn diagram). **INTERSECT** is the overlap in the middle; **EXCEPT** is the part of the left circle that the right one doesn't cover.
32
Performance

A query that ran in 2 seconds now takes 3 minutes after the table grew from 1M to 50M rows. How do you investigate and fix it?

Step 1 — Get the execution plan.
``sql
EXPLAIN ANALYZE SELECT ...; -- Postgres
EXPLAIN SELECT ...; -- MySQL
SET STATISTICS IO ON; SELECT ...; -- SQL Server
`
Look for: table scans (Seq Scan / Table Scan) where you expected index seeks, and nested loop joins on large result sets.

Step 2 — Common culprits and fixes:

- Missing index — the plan shows a full scan on a 50M-row table. Add an index on the columns in WHERE, JOIN, and ORDER BY.
- Non-sargable WHERE clause — a function wrapped around the column stops the index being used:
`sql
-- Bad (can't use index):
WHERE YEAR(order_date) = 2024

-- Good (uses index):
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
``
- Stats are stale — the query planner makes bad decisions when it thinks the table still has 1M rows. Run ANALYZE (Postgres) or UPDATE STATISTICS (SQL Server).
- Cartesian product / bad JOIN — a missing or wrong JOIN condition multiplies rows.
- SELECT * — pulling 50 columns when you need 3 increases I/O dramatically.

Step 3 — Measure after each change. Don't pile on fixes blindly; change one thing at a time and re-EXPLAIN.

Why it exists:
Performance regressions caused by data growth are one of the most common production incidents QA can prevent. A query that ran fine on 1M rows in staging silently becomes a 3-minute timeout on 50M rows in production. Knowing how to diagnose a slow query with EXPLAIN ANALYZE — and which culprits to look for — is a core QA skill for performance testing and production incident support.

Real-world QA use case:
A QA engineer running load tests against a staging environment notices a "customer order history" endpoint timing out after the test dataset is scaled up to production size. She runs EXPLAIN ANALYZE and finds a table scan on the 50M-row Orders table despite an index on customer_id — the culprit is a YEAR(order_date) function in the WHERE clause that prevents index use. She logs the finding with the EXPLAIN output as evidence and proposes the date-range rewrite in the defect ticket.

Rule of thumb: when a query becomes slow after data growth, always read the EXPLAIN plan before guessing at a fix — table scans, stale statistics, and non-sargable predicates each have different solutions.
💡 Plain English: A delivery driver whose 5-minute route now takes 45 minutes because the city grew. First, look at the map (execution plan) to see where the bottleneck is — a missing shortcut (index), a road closure (bad join), or outdated map data (stale stats). Fix the biggest blockage first, then re-time the route.
33
Window Functions

Write a query showing each salesperson's total revenue alongside the overall team average, in the same row.

-- Sales(salesperson_id, salesperson_name, amount)
``sql
SELECT
salesperson_name,
SUM(amount) AS total_revenue,
ROUND(AVG(SUM(amount)) OVER (), 2) AS team_average,
SUM(amount) - AVG(SUM(amount)) OVER () AS diff_from_average
FROM Sales
GROUP BY salesperson_id, salesperson_name
ORDER BY total_revenue DESC;
`

Why a window function? A plain AVG(amount) in a GROUP BY would aggregate everything into one row. The window function OVER () (empty window = whole result set) computes the average *across all salespeople* while still keeping one row per salesperson.

Alternative — CTE approach (often clearer in a real codebase):
`sql
WITH totals AS (
SELECT salesperson_name, SUM(amount) AS revenue
FROM Sales
GROUP BY salesperson_id, salesperson_name
)
SELECT
salesperson_name,
revenue,
ROUND(AVG(revenue) OVER (), 2) AS team_average
FROM totals
ORDER BY revenue DESC;
``
💡 Plain English: A school report card showing each student's grade *and* the class average on every line. You want both numbers visible at once — not a separate summary row. The window function is what puts the class average on every individual line without collapsing everything into one.
34
Query Writing

Write a query to find the first purchase date and the most recent purchase date for each customer.

-- Orders(id, customer_id, order_date, amount)
``sql
SELECT
customer_id,
MIN(order_date) AS first_purchase,
MAX(order_date) AS latest_purchase,
COUNT(*) AS total_orders,
DATEDIFF(MAX(order_date), MIN(order_date)) AS days_as_customer -- MySQL
FROM Orders
GROUP BY customer_id
ORDER BY latest_purchase DESC;
`

A common follow-up: also show the amount of their first and last order:
`sql
SELECT DISTINCT
customer_id,
FIRST_VALUE(amount) OVER w AS first_order_amount,
LAST_VALUE(amount) OVER w AS last_order_amount,
MIN(order_date) OVER w AS first_date,
MAX(order_date) OVER w AS latest_date
FROM Orders
WINDOW w AS (PARTITION BY customer_id ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
``
The GROUP BY version handles most real-world cases; the window function approach is needed when you want values from specific rows (like the amount).

Why it exists:
First and last activity dates are core metrics for customer lifecycle analysis, churn detection, and data completeness checks. QA engineers use MIN/MAX aggregations to validate onboarding flows (was a first-purchase date recorded correctly?), verify retention reports (does the "days as customer" calculation match the date difference?), and confirm that activity tracking starts and ends at the right events.

Real-world QA use case:
A QA engineer is validating a customer lifetime value report. She queries MIN(order_date) and MAX(order_date) per customer from the database and compares them to the values shown in the report. The report shows some customers with a "last purchase date" of today even though their last actual order was months ago — caused by a bug that was refreshing the date field on every report generation rather than only on actual order events.

Rule of thumb: use MIN and MAX aggregations grouped by customer to derive first and last activity dates — they're your ground-truth baseline when validating any report that claims to track when a customer first or last did something.
💡 Plain English: Looking at a customer's file and noting when they first walked through the door and when they last visited. MIN gives the first date, MAX the latest — the simplest summary of their relationship history with the business.
35
Practical

You ran UPDATE without a WHERE clause and updated every row in a large table in production. What do you do?

This is a production incident — respond immediately:

If inside an open transaction:
``sql
ROLLBACK;
``
Confirm auto-commit is off first — if it was on, the changes are already committed.

If already committed:
1. Assess blast radius first — how many rows, which table, what was changed? Inform your team lead immediately.
2. Stop dependent processes — prevent downstream systems from reading or acting on the corrupted data.
3. Point-in-time recovery — most production databases support this. Restore to a snapshot taken just before the incident.
4. Replay from transaction logs if a full restore is too slow — reconstruct the original values from the WAL (Postgres), binlog (MySQL), or transaction log (SQL Server).
5. Communicate with stakeholders — don't hide it; be transparent about impact and timeline.

How to never repeat this:
- Always preview with SELECT using the same WHERE before running UPDATE or DELETE.
- Wrap destructive statements in explicit BEGIN / ROLLBACK / COMMIT.
- Add a database role restriction: application accounts should not have permission to run unfiltered mass updates.

Why it exists:
A runaway UPDATE without WHERE is one of the most catastrophic mistakes possible in a production database — and it happens more often than anyone admits. QA engineers should know the incident response steps because they are often the first person called when data is corrupted: they own the testing environments that serve as the recovery source, and they know which point-in-time snapshot was taken before the bad change.

Real-world QA use case:
A QA engineer is doing exploratory testing on a staging database and accidentally runs an UPDATE on the wrong table without a WHERE clause. She immediately runs ROLLBACK (she was in a transaction), checks the row count to confirm no permanent damage, and documents the near-miss in the team's incident log. The team subsequently adds a pre-commit hook to their SQL workflow that warns when a DML statement has no WHERE clause, preventing the same mistake from reaching production.

Rule of thumb: always wrap destructive SQL in an explicit transaction during testing so you can ROLLBACK instantly — and always preview with SELECT using the same WHERE before running UPDATE or DELETE on any real data.
💡 Plain English: Accidentally sending a mass email to every customer with the wrong name merged in. You can't un-send it — but you can send a correction quickly, figure out exactly who received the wrong email (blast radius), and change the process so a confirmation step is required before any bulk send.
36
Data Modification

How do you find duplicate rows in a table and delete them, keeping only one row per group?

-- Customers(id, name, email)  -- email should be unique but has duplicates
Step 1 — Find the duplicates:
``sql
SELECT email, COUNT(*) AS cnt
FROM Customers
GROUP BY email
HAVING COUNT(*) > 1;
`

Step 2 — Delete duplicates, keeping the row with the lowest id (Postgres/MySQL):
`sql
DELETE FROM Customers
WHERE id NOT IN (
SELECT MIN(id)
FROM Customers
GROUP BY email
);
`

Postgres with CTE (cleaner and safer — preview before you delete):
`sql
-- First, see what will be deleted:
WITH ranked AS (
SELECT id,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM Customers
)
SELECT * FROM ranked WHERE rn > 1;

-- Then delete:
WITH ranked AS (
SELECT id,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM Customers
)
DELETE FROM Customers
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
``

Always run the SELECT version first — confirm exactly which rows will go before you delete.

Why it exists:
Duplicate rows in production databases are a surprisingly common data quality problem — caused by retried imports, missing unique constraints, or race conditions in concurrent writes. QA engineers need the ability to find and safely clean duplicates both as a data validation step (proving the migration introduced no duplicates) and as a remediation step (removing duplicates after a confirmed data quality incident).

Real-world QA use case:
A QA engineer discovers that a nightly bulk import has created 1,200 duplicate rows in the Customers table over three weeks because the import job lacked an ON CONFLICT clause. She writes the ROW_NUMBER CTE approach to identify the duplicates, runs the SELECT preview version to confirm exactly which rows will be deleted, then executes the delete in a transaction with a final COUNT check before committing. She also adds the fix to the import job to prevent recurrence.

Rule of thumb: before deleting duplicates, always run the SELECT preview first to see exactly which rows will be removed — then execute the delete inside a transaction so you can ROLLBACK if the count or spot-check reveals something unexpected.
💡 Plain English: Deduplicating a contacts list. You group by name and phone number, keep the oldest entry (the original), and remove all the later copies. But before pressing delete, you check the list of what's about to be removed to make sure it's right.
37
Window Functions

Write a query to calculate the running total of daily sales.

-- DailySales(sale_date, amount)
``sql
SELECT
sale_date,
amount,
SUM(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM DailySales
ORDER BY sale_date;
`

Breaking it down:
- SUM(amount) OVER (...) — a window function, so it gives a value per row without collapsing the result.
- ORDER BY sale_date — tells the window to accumulate in date order.
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — "from the very first row up to and including this one." This is often the default when ORDER BY is specified, but writing it explicitly makes the intent clear.

Variation — running total per category:
`sql
SUM(amount) OVER (PARTITION BY category ORDER BY sale_date)
``
PARTITION BY restarts the running total for each category.

Why it exists:
Running totals appear in financial dashboards, inventory tracking, and any report that shows "balance so far" or "cumulative progress." For QA, the running total query is a standard data validation tool: by computing the expected cumulative figures independently, you can compare them against the application's displayed values and catch off-by-one errors in the application's own accumulation logic.

Real-world QA use case:
A QA engineer is validating a financial dashboard that shows a running monthly revenue total. She computes the expected running total with SUM OVER ORDER BY and compares it against the dashboard's displayed values month by month. The dashboard's running total diverges from her query at month 7 — caused by the application code incorrectly treating a reversal transaction as a positive value rather than subtracting it from the running sum.

Rule of thumb: use SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) to compute a running total — it's the ground-truth benchmark for validating any UI or report that displays cumulative figures.
💡 Plain English: A bank statement balance column. Each line shows the transaction amount *and* the total balance so far. The balance is the running sum — each row adds its amount to everything above it.
38
Query Writing

Write a query to find gaps in a sequential ID column — IDs that are missing from the sequence.

-- Orders(id)  -- id should be 1, 2, 3... but some are missing
Approach 1 — self-join to find missing numbers:
``sql
SELECT t1.id + 1 AS gap_start
FROM Orders t1
LEFT JOIN Orders t2 ON t2.id = t1.id + 1
WHERE t2.id IS NULL
AND t1.id < (SELECT MAX(id) FROM Orders)
ORDER BY gap_start;
`
This finds every ID where the *next* ID doesn't exist.

Approach 2 — window function (cleaner, finds gap ranges):
`sql
WITH numbered AS (
SELECT id,
LEAD(id) OVER (ORDER BY id) AS next_id
FROM Orders
)
SELECT id + 1 AS gap_start,
next_id - 1 AS gap_end
FROM numbered
WHERE next_id > id + 1;
``
LEAD looks at the next row's ID. If next_id > current id + 1, there's a gap between them.

Why it exists:
Gap detection in sequential ID columns is a fundamental data integrity check — sequential IDs should form an unbroken sequence, and any gap indicates a missing record that may represent a failed transaction, an unauthorised deletion, or a broken import. The LEAD-based approach is superior to a self-join when you also need to report the full gap range (gap_start through gap_end), not just the individual missing IDs.

Real-world QA use case:
A QA engineer is validating a financial system where order IDs must be sequential for audit compliance. After a system migration she runs the LEAD-based gap query and discovers three separate gap ranges — IDs 5001–5010, 8200–8201, and 12000. She reports them as a P1 data integrity finding because the missing IDs represent financial transactions that are unaccounted for in the migrated system.

Rule of thumb: use the LEAD-based gap query when you need to identify gap ranges (start to end), and the self-join approach when you only need to flag individual missing IDs — both solve gap detection, but LEAD is more readable for range reporting.
💡 Plain English: Checking a numbered ticket roll for missing stubs. You look at each stub and check whether the next number up exists. If ticket 47 is there but 48 is not and 49 is, there's a gap at 48. LEAD is what lets you peek at the next ticket in one query.
39
Performance

Your query has 4 JOINs and is very slow. What do you look at first?

Start with EXPLAIN ANALYZE — don't guess. The plan tells you where the time is actually going.

What to look for in the plan:

1. Table scans on large tables — are any of the 4 joined tables being fully scanned? Add indexes on the JOIN keys and any filtered columns.

2. Join order — the database should join small filtered sets first. If the planner is joining two huge tables before filtering, a hint or a CTE that filters first can help.

3. Nested loops on large sets — a nested loop JOIN is O(n×m). For large tables, a hash join or merge join is better. Check if missing indexes are forcing nested loops.

4. Cartesian product — a missing or incorrect JOIN condition creates a row explosion. Check that every JOIN has a proper ON condition.

5. Pulling too many columns — SELECT * across 4 JOINed tables drags in enormous amounts of data. Select only the columns you need.

6. Intermediate result sizes — add a CTE or subquery to filter a large table *before* joining it, rather than joining 50M rows and filtering afterward.

Practical rule: index every foreign key column and every column that appears in WHERE or JOIN conditions.

Why it exists:
Multi-join queries become the first to degrade as data grows, because each join multiplies the work. For QA engineers doing performance testing or investigating staging-vs-production slowdowns, knowing which patterns to look for in an execution plan — table scans, nested loops on large tables, cartesian products — is what separates a useful performance finding from a vague "it's slow" bug report.

Real-world QA use case:
A QA engineer is performance-testing an API endpoint that powers a reporting page. The endpoint uses four JOINs across Orders, Customers, Products, and Promotions tables. On the 50,000-row test dataset it takes 400ms; on the production clone with 5M rows it times out. She runs EXPLAIN ANALYZE and finds a nested loop join on the Orders–Products join with no index on the join key. Adding the index drops the response to 180ms and surfaces a second issue: a SELECT * pulling 60 columns when the report only uses 8.

Rule of thumb: when a multi-join query is slow, read EXPLAIN ANALYZE before guessing — look for table scans on large tables, nested loops, and missing indexes on JOIN and WHERE columns.
💡 Plain English: A traffic map with 4 intersections all backed up. You don't just add lanes everywhere — you look at the map to find which intersection is the actual bottleneck, fix that one, and re-check before touching the others.
40
Query Writing

Write a query to find the percentage contribution of each product category to total revenue.

-- OrderItems(order_id, category, amount)
``sql
SELECT
category,
SUM(amount) AS category_revenue,
ROUND(SUM(amount) * 100.0 / SUM(SUM(amount)) OVER (), 2) AS pct_of_total
FROM OrderItems
GROUP BY category
ORDER BY category_revenue DESC;
`

Breaking it down:
- SUM(amount) — total revenue per category (GROUP BY).
- SUM(SUM(amount)) OVER () — the window function with an empty OVER() spans the entire result set, summing all the per-category totals into the grand total.
- * 100.0 — forces decimal division (integer / integer truncates in some databases).

Alternative with CTE (avoids nested aggregates):
`sql
WITH totals AS (
SELECT category, SUM(amount) AS revenue
FROM OrderItems
GROUP BY category
),
grand AS (SELECT SUM(revenue) AS total FROM totals)
SELECT t.category, t.revenue,
ROUND(t.revenue * 100.0 / g.total, 2) AS pct
FROM totals t, grand g
ORDER BY t.revenue DESC;
``

Why it exists:
Percentage-of-total queries are ubiquitous in sales, finance, and product reporting — every pie chart and share analysis uses this pattern. For QA engineers, they're a standard data validation tool: compute the expected percentage breakdown independently and compare it against the application's displayed values to detect rounding errors, missing categories, or incorrect total calculations.

Real-world QA use case:
A QA engineer is validating a sales dashboard that shows each product category's share of total revenue. She runs the percentage-of-total query against the test database and finds the dashboard's "Electronics" category shows 34.2% while her query returns 31.8%. Investigation reveals the dashboard is computing percentages using item count rather than revenue amount — a business logic bug that makes high-volume, low-value items appear disproportionately large.

Rule of thumb: use SUM(amount) * 100.0 / SUM(SUM(amount)) OVER () to compute percentage of total in one query — the empty OVER() spans the entire result and gives you the grand total without a separate subquery or join.
💡 Plain English: A budget pie chart. Each slice is a category's share. To find each slice's percentage you need each slice's size *and* the whole pie's size. The window function is what gives you the whole pie size without a separate query.
41
Window Functions

How do you compare a value in the current row with the value in the previous row — for example, month-over-month revenue change?

-- MonthlySales(month, revenue)
Use the LAG window function to look back one row:

``sql
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS change,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month))
* 100.0 / LAG(revenue) OVER (ORDER BY month), 2
) AS pct_change
FROM MonthlySales
ORDER BY month;
`

- LAG(revenue) — returns the revenue from the *previous* row in the ORDER BY sequence.
- LEAD(revenue) — looks *forward* one row instead.
- The first row has no previous row, so LAG returns NULL — your reporting layer handles that as "no comparison available."

For year-over-year (same month, prior year):
`sql
LAG(revenue, 12) OVER (ORDER BY month)
``
The second argument is how many rows back to look.

Why it exists:
Period-over-period comparison is one of the most fundamental business metrics — whether revenue went up or down this month versus last month is in every product and finance dashboard. For QA engineers, the LAG-based query is the reference calculation: compute the expected month-over-month change independently and use it to verify that the application's displayed trends, arrows, and percentage labels are correct.

Real-world QA use case:
A QA engineer is validating a sales trend dashboard that shows month-over-month revenue change with up/down arrows. She computes the expected changes using LAG and finds that the dashboard's arrow for March is pointing up (+12%) but her query shows March was actually down (-3%). The application was mistakenly comparing March against the same month last year instead of the prior month — a business logic defect in the date arithmetic.

Rule of thumb: use LAG(column) OVER (ORDER BY date) to access the previous row's value — it's the clearest way to compute period-over-period differences without a self-join, and NULL for the first row is correct expected behaviour (no prior period exists).
💡 Plain English: Reading a bank statement where each line also shows last month's balance for comparison. LAG is what reaches back one row to get that "previous" number — without a self-join or a subquery.
42
Joins

Your JOIN query is returning more rows than expected — duplicates in the result. What are the likely causes and how do you fix them?

Duplicate rows from JOINs almost always come from one of these:

1. Many-to-many relationship without a bridging table
If Orders and Products are joined directly, and an order has 3 products and a product is in 5 orders, you get 15 rows — the Cartesian product. Fix: join through the correct junction table (OrderItems).

2. Duplicate rows in one of the source tables
Check each table independently before joining:
``sql
SELECT customer_id, COUNT(*) FROM Orders GROUP BY customer_id HAVING COUNT(*) > 1;
`
If the source is dirty, clean it or deduplicate before joining.

3. Missing or wrong JOIN condition
A join with no ON clause or an always-true condition creates a full Cartesian product.
`sql
-- Accidental Cartesian product:
SELECT * FROM Customers, Orders; -- missing WHERE/ON
``

4. Multiple matching rows on the right side
A customer with 3 addresses LEFT JOINed to their orders will duplicate every order 3 times. Fix: filter to one address per customer before joining, or aggregate first.

Diagnostic approach:
Run each table/join step in isolation, count the rows at each stage, and find where the count first exceeds what you expect.

Why it exists:
Unexpected duplicate rows from JOINs are one of the most common and confusing SQL bugs — the query "works" but returns 3× as many rows as expected, inflating aggregates or flooding a UI with repeats. QA engineers encounter this when validating JOIN-heavy reports, and the step-by-step diagnostic approach (count after each join, find where the explosion occurs) is the same skill used in debugging test failures caused by incorrect data setup.

Real-world QA use case:
A QA engineer is validating a customer order history page that shows 8 orders for a specific test user. She expects 2 orders but the report shows 8. She queries each join separately: Customers returns 1 row, Orders for that customer returns 2 rows. When she JOINs to Addresses, the row count jumps to 8 — that customer has 4 addresses, multiplying each order 4 times. The fix is to filter the Addresses join to only the primary shipping address.

Rule of thumb: when a JOIN returns more rows than expected, count the rows after each join step in isolation — the step where the count first exceeds the expected value points to the table with the duplicating rows.
💡 Plain English: Photocopying a document where one page accidentally gets fed through three times. Each extra copy multiplies everything that follows. Identify which table is the "three-copy page" and deduplicate it before it enters the stack.
43
Query Writing

Write a query to list every department and its employee headcount, including departments with zero employees.

-- Departments(id, name)
-- Employees(id, name, department_id)
``sql
SELECT d.name AS department,
COUNT(e.id) AS headcount
FROM Departments d
LEFT JOIN Employees e ON e.department_id = d.id
GROUP BY d.id, d.name
ORDER BY headcount DESC;
`

Key points:
- LEFT JOIN keeps all departments even if no employee rows match.
- COUNT(e.id) counts non-NULL employee IDs — returns 0 for departments with no employees. COUNT(*) would return 1 instead of 0, which is wrong.
- If you used INNER JOIN, departments with no employees would silently disappear from the result.

A common follow-up: also show the highest salary per department:
`sql
SELECT d.name,
COUNT(e.id) AS headcount,
MAX(e.salary) AS top_salary
FROM Departments d
LEFT JOIN Employees e ON e.department_id = d.id
GROUP BY d.id, d.name;
``

Why it exists:
LEFT JOIN with COUNT is the standard pattern for "show every parent row even when it has no children." For QA, this pattern is essential when validating completeness: a report using INNER JOIN will silently drop departments with no staff, projects with no assignments, or products with no orders — making the report appear complete while hiding missing data.

Real-world QA use case:
A QA engineer is validating an HR headcount report that should show all departments including those with no current employees. The developer used INNER JOIN, so two recently-emptied departments are completely absent from the report. Without the LEFT JOIN validation query as a baseline, the omission is not obvious — the report looks complete because there are no blank rows. The QA engineer catches it by comparing the department count in the report (18) against the Departments table row count (20).

Rule of thumb: use LEFT JOIN with COUNT(right_table.id) — not COUNT(*) — when you need to include parent records with zero children; COUNT(*) returns 1 for childless parents, while COUNT(right_table.id) correctly returns 0.
💡 Plain English: A company org chart including empty seats. You want every department to appear — even the one that just lost its whole team. LEFT JOIN is what keeps the empty departments visible; COUNT of the employee ID is what correctly reports them as 0 rather than pretending one ghost employee is there.
44
Aggregation

How does NULL behave inside aggregate functions like SUM, COUNT, and AVG? Show an example where it matters.

The rule: all aggregate functions *except* COUNT(*) silently ignore NULL values.

``sql
-- Table: Sales(id, amount)
-- Rows: 100, NULL, 200, NULL, 300
SELECT
COUNT(*) AS total_rows, -- 5 (counts everything including NULLs)
COUNT(amount) AS non_null_rows, -- 3 (skips NULLs)
SUM(amount) AS total, -- 600 (NULLs ignored)
AVG(amount) AS average -- 200 (600 / 3, not 600 / 5!)
FROM Sales;
`

The trap: AVG divides by the count of non-NULL values (3), not total rows (5). If NULLs mean "zero" in your domain, you must replace them first:
`sql
AVG(COALESCE(amount, 0)) -- treats NULL as 0: 600 / 5 = 120
``

NULL in GROUP BY: NULL values form their own group — all rows where the grouped column is NULL are placed together, not discarded.

This is one of the most common sources of wrong aggregate results in real reports — always check what NULLs in your data actually mean before writing the query.

Why it exists:
NULL handling in aggregates is a silent trap that produces wrong results without any error message. A QA engineer who doesn't account for NULLs can write a verification query that returns a plausible-looking but incorrect expected value, then incorrectly validate a bug as passing. Understanding exactly how each aggregate handles NULLs is foundational for writing reliable data validation queries.

Real-world QA use case:
A QA engineer is validating an average order value metric. She writes AVG(amount) and gets $145.00. The application displays $118.00. She initially logs this as a defect, but investigating further finds that the application treats NULL amounts (orders with no payment data) as $0 in its average calculation, while AVG ignores them. She resolves the discrepancy by using AVG(COALESCE(amount, 0)) in her verification query — matching the application's business rule.

Rule of thumb: before writing any aggregate validation query, check whether NULLs in your data should be treated as zero (use COALESCE), excluded (the default), or flagged as an error — the wrong assumption produces a plausible but incorrect expected value.
💡 Plain English: Calculating a class average. If 2 students were absent and got NULL (no grade), AVG ignores them and divides by the students who showed up. If the absent students should count as zero, you have to fill in the zero yourself before calculating.
45
Query Writing

Write a query to find customers who placed at least one order in every month of the current year.

-- Orders(id, customer_id, order_date)
``sql
SELECT customer_id
FROM Orders
WHERE order_date >= DATE_TRUNC('year', CURRENT_DATE)
AND order_date < DATE_TRUNC('year', CURRENT_DATE) + INTERVAL '1 year'
GROUP BY customer_id
HAVING COUNT(DISTINCT EXTRACT(MONTH FROM order_date)) = 12;
`

Breaking it down:
- Filter to the current year with a date range (avoids wrapping the column in a function).
- GROUP BY customer_id — one group per customer.
- COUNT(DISTINCT EXTRACT(MONTH FROM order_date)) — how many distinct months this customer ordered in.
- HAVING = 12 — only keep customers present in all 12 months.

If running mid-year and you want "every month so far":
`sql
HAVING COUNT(DISTINCT EXTRACT(MONTH FROM order_date)) = EXTRACT(MONTH FROM CURRENT_DATE)
``

Why it exists:
"Active in every period" queries are common in retention analysis, compliance reporting, and SLA verification. For QA, this pattern is useful when validating streak-based features — monthly active users, subscription renewal checks, consecutive login rewards — and when confirming that activity history is complete and uninterrupted across a defined period.

Real-world QA use case:
A QA engineer is validating a loyalty program that awards a "platinum" badge to customers who place at least one order every month for a full year. She writes the COUNT(DISTINCT MONTH) query to identify all qualifying customers, then compares the list against those who have received the badge. She finds 12 customers who qualify but haven't been awarded the badge — a bug in the awarding job that was checking the previous month's data rather than the full 12-month window.

Rule of thumb: use COUNT(DISTINCT EXTRACT(MONTH FROM date)) = 12 in a HAVING clause to find records present in every month of the year — adjust the target to the current month number when validating partial-year requirements.
💡 Plain English: Finding gym members who checked in every single month of the year — not just those who visit frequently, but those with no month where they completely vanished. COUNT DISTINCT months and keep only those with 12.
46
Query Writing

How do you pivot rows into columns in SQL? Show a practical example.

-- Sales(salesperson, month, revenue)
-- Rows: Alice/Jan/1000, Alice/Feb/1200, Bob/Jan/900, Bob/Feb/800
Approach 1 — Conditional aggregation (most portable):
``sql
SELECT
salesperson,
SUM(CASE WHEN month = 'Jan' THEN revenue ELSE 0 END) AS jan,
SUM(CASE WHEN month = 'Feb' THEN revenue ELSE 0 END) AS feb,
SUM(CASE WHEN month = 'Mar' THEN revenue ELSE 0 END) AS mar
FROM Sales
GROUP BY salesperson;
`

Approach 2 — PIVOT syntax (SQL Server / Oracle):
`sql
SELECT salesperson, [Jan], [Feb], [Mar]
FROM Sales
PIVOT (SUM(revenue) FOR month IN ([Jan], [Feb], [Mar])) AS pvt;
`

Result:
| salesperson | jan | feb |
|---|---:|---:|
| Alice | 1000 | 1200 |
| Bob | 900 | 800 |

The CASE/SUM approach works in every major database and is easier to read for most teams. PIVOT is cleaner but vendor-specific.

Why it exists:
Pivot queries transform row-based data into a column-per-category layout that matches how business stakeholders expect to read reports. For QA, pivot logic is important to test because each CASE column is an independent calculation with its own boundary conditions — a missing category, a NULL value, or an off-by-one in the CASE condition creates a silent wrong value in just one column while all other columns appear correct.

Real-world QA use case:
A QA engineer is validating a monthly sales pivot report with columns Jan through Dec. She writes the independent CASE/SUM query and compares each column against the report output. The June column shows $0 in the report but $45,000 in her query — the developer used
month = 'Jun' in the CASE condition but the data stores full month names ('June'`), so the CASE never matches. The other 11 months pass because they all have the same format inconsistency in the condition, making June the only visible failure.

Rule of thumb: when writing pivot queries, test every column independently — a mismatch in just one CASE condition produces a silent zero that looks like legitimate data rather than a bug.
💡 Plain English: Rotating a spreadsheet. The months were row labels and you want them as column headers. You're not changing the data — just reshaping how it's laid out so comparison across months becomes a left-right scan instead of a top-down one.
47
Performance

How do you identify which queries are consuming the most resources on the database right now?

Each database has its own tooling, but the approach is the same: look at the query stats views.

Postgres:
``sql
SELECT query,
calls,
total_exec_time,
mean_exec_time,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
`
Requires the pg_stat_statements extension enabled.

MySQL:
`sql
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 10;
`

SQL Server:
`sql
SELECT TOP 10
total_elapsed_time / execution_count AS avg_elapsed_ms,
execution_count,
SUBSTRING(st.text, 1, 200) AS query_snippet
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_elapsed_ms DESC;
`

For live active queries (what's running right now):
`sql
SELECT * FROM pg_stat_activity WHERE state = 'active'; -- Postgres
SHOW PROCESSLIST; -- MySQL
``

The stat views tell you cumulative cost since last reset — great for finding which queries to tune. The activity views show you what's blocking or slow right now.

Why it exists:
Performance testing generates useful results only if you know which queries are actually running and costing the most. Without querying the stat views, QA engineers guess at which queries are slow based on feature names rather than evidence. The stat views give you empirical rankings — the same data DBAs use during production incidents — so performance test findings are precise and actionable.

Real-world QA use case:
A QA engineer is investigating why a load test shows elevated response times across a whole test suite, not just one endpoint. She queries pg_stat_statements sorted by total_exec_time and finds that 78% of total database time is consumed by a single reporting query that runs on every page load because of a misconfigured cache. She didn't know which query was the culprit until she looked at the stat view — no amount of Playwright profiling would have surfaced a database-level issue.

Rule of thumb: during performance test investigations, always check the database stat views (pg_stat_statements, performance_schema, sys.dm_exec_query_stats) sorted by total execution time — the actual bottleneck is almost never the query you assumed it was.
💡 Plain English: A call centre's performance dashboard. You don't guess which agent is overwhelmed — you look at the metrics: who has the longest call duration, the highest call volume, the most hold time. The database stat views are that dashboard for queries.
48
Window Functions

Write a query to rank customers by their total spending, with the highest spender ranked 1. Handle ties correctly.

-- Orders(id, customer_id, amount)
-- Customers(id, name)
``sql
SELECT
c.name,
SUM(o.amount) AS total_spent,
RANK() OVER (ORDER BY SUM(o.amount) DESC) AS rank_with_gaps,
DENSE_RANK() OVER (ORDER BY SUM(o.amount) DESC) AS rank_no_gaps,
ROW_NUMBER() OVER (ORDER BY SUM(o.amount) DESC) AS row_num
FROM Customers c
JOIN Orders o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY total_spent DESC;
``

Which to use:
- DENSE_RANK — if two customers tie for 1st, both get rank 1 and the next gets rank 2. Best for "Top 10 spenders" where ties should count once.
- RANK — ties both get rank 1 but the next rank jumps to 3 (the gap). Reflects competition style: two gold medals, no silver.
- ROW_NUMBER — unique position for every row, ties broken arbitrarily. Best when you need exactly N rows.

The interviewer is listening for you to know the difference — it shows you've dealt with real ranking requirements.

Why it exists:
Ranking functions appear in leaderboards, loyalty tiers, report pagination, and any feature that orders users or items and assigns positions. QA engineers need to know the three variants because they produce different results on tied values — the wrong choice can under-reward tied customers, create unexpected gaps in leaderboard positions, or over-count top-N results when ties exist at the boundary.

Real-world QA use case:
A QA engineer is testing a leaderboard that shows the top 10 spenders with their rank. Two customers are tied for 9th place — the expected behaviour is that both appear in the top 10 (11 rows total, no 10th place). The developer used ROW_NUMBER, so only one of the tied customers appears at rank 9 and the other is cut off at rank 11 (outside the LIMIT 10). The QA engineer raises a defect: DENSE_RANK should be used so both tied customers receive rank 9 and both appear in the top-10 list.

Rule of thumb: use DENSE_RANK for leaderboards and top-N reports where ties should share a rank and both appear in the result; use ROW_NUMBER only when you need exactly N rows with no ties allowed at the boundary.
💡 Plain English: A marathon finishing list. DENSE_RANK: two runners cross simultaneously — both are 1st, next runner is 2nd. RANK: both are 1st, next runner is 3rd (no 2nd place given out). ROW_NUMBER: a photo finish picks one as 1st regardless of the tie.
49
Practical

How do you test a complex stored procedure or SQL function before deploying it to production?

Testing a stored procedure is systematic, not ad-hoc:

1. Test in a non-production environment first
Never run untested logic against production data. Use a staging or dev database with a realistic data set.

2. Boundary and edge cases
- Valid inputs that cover all code paths.
- Empty inputs / NULL parameters — does it handle them gracefully or throw an unhandled error?
- Boundary values (zero rows, one row, maximum rows).
- Invalid inputs — what error does the caller receive?

3. Verify the output
Cross-check the stored procedure's output against a manually written query that produces the expected result. If they disagree, find out why.

4. Test transaction behaviour
- Does it commit correctly on success?
- Does it roll back cleanly on failure without leaving partial data?
- Does it handle concurrent calls safely?

5. Performance check
Run EXPLAIN ANALYZE on the key statements inside the procedure. A procedure that works correctly on 1,000 rows may be unacceptably slow on 10 million.

6. Regression test after any change
Any modification to the procedure re-triggers the full test cycle — not just the changed path.

Why it exists:
Stored procedures encapsulate business logic inside the database, often without unit tests or version control visibility. Because they're called by the application rather than directly visible in code review, bugs in procedures can survive undetected until they're exercised by an edge-case input. A systematic test approach for procedures is critical because a bug inside a procedure may affect every feature that calls it.

Real-world QA use case:
A QA engineer is testing a stored procedure that calculates invoice totals including tax, discounts, and rounding. She systematically tests: a standard invoice (expected pass), an invoice with zero items (should return $0, not an error), an invoice with a NULL discount (should treat as 0% not crash), and an invoice with the maximum line-item count (performance check). She finds that passing NULL for the discount parameter throws an unhandled null pointer exception — a bug that would surface in production whenever a customer had no active discount code.

Rule of thumb: when testing a stored procedure, always include NULL parameter tests and empty-input tests in addition to the happy path — these are the edge cases most developers don't manually test and most commonly reach production broken.
💡 Plain English: Testing a new recipe before serving it at a dinner party. You cook it in your own kitchen first (dev environment), taste it with the edge-case ingredient variations (NULL inputs, empty lists), confirm the dish matches what the menu promised (output verification), and make sure it scales before cooking for 200 guests (performance).
50
Window Functions

Write a query to show month-over-month revenue change — both the absolute difference and the percentage change.

-- MonthlySales(month DATE, revenue NUMERIC)
``sql
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS abs_change,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month))
* 100.0
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
2) AS pct_change
FROM MonthlySales
ORDER BY month;
``

Why NULLIF(..., 0)? If last month's revenue was zero, dividing by it causes a divide-by-zero error. NULLIF returns NULL instead of 0, so the division produces NULL gracefully — which is correct ("undefined % change from zero").

Interviewer follow-up — what does the first row show? LAG returns NULL for the first row (no prior month). The abs_change and pct_change columns correctly show NULL, meaning "no comparison available."

Why it exists:
Month-over-month revenue change is the most common business metric in financial dashboards — every executive report includes trend arrows and percentage changes. For QA, the LAG-based query is the reference calculation for validating those displayed values. The NULLIF divide-by-zero guard is specifically important in testing because brand-new products or test datasets often have a month with zero revenue, which causes an application crash if the developer didn't add the guard.

Real-world QA use case:
A QA engineer is testing a new product launch dashboard. She includes a test case where January (the product's first month) has zero revenue — an edge case that simulates a pre-launch period. The dashboard crashes with a divide-by-zero error on the percentage change column for January. The fix is NULLIF(LAG(revenue), 0) — displaying NULL for the percentage instead of erroring. She adds this as a required boundary test for all new product metrics going forward.

Rule of thumb: use LAG for period-over-period comparisons, always guard division by LAG with NULLIF(..., 0) to handle zero prior-period values gracefully, and verify that your test data includes at least one period with zero revenue to catch divide-by-zero bugs.
💡 Plain English: A finance dashboard showing each month's revenue and the arrow next to it — up 12%, down 3%. LAG is what puts last month's number next to this month's so you can compute the arrow. NULLIF protects against the embarrassing divide-by-zero on a brand-new product's first month.

Senior (5+ years)

1
Transactions

Explain the four SQL transaction isolation levels and which read phenomena each one prevents.

Isolation levels are a trade-off between safety and speed. There are three problems they protect against:

- Dirty read — you read another transaction's changes *before* it has saved them (and it might still undo them).
- Non-repeatable read — you read the same row twice and get different values, because someone changed it in between.
- Phantom read — you run the same search twice and get a different set of rows, because someone added or removed matching rows in between.

| Isolation level | Dirty read | Non-repeatable read | Phantom read |
|---|:---:|:---:|:---:|
| READ UNCOMMITTED | possible | possible | possible |
| READ COMMITTED | prevented | possible | possible |
| REPEATABLE READ | prevented | prevented | possible* |
| SERIALIZABLE | prevented | prevented | prevented |

\* In MySQL, REPEATABLE READ also blocks phantom reads; the official standard doesn't require that.

How to talk about it:
- READ COMMITTED is the everyday default for most databases (Postgres, Oracle, SQL Server).
- More safety means more waiting and more locking, so it's slower. SERIALIZABLE is the safest, but transactions can fail and need a retry.
- Bonus point: many databases pull this off using "snapshots" (MVCC) instead of locks, which is why readers don't block writers.

Why it exists:
Isolation levels govern how concurrent transactions interact — a critical setting in any multi-user application. For QA engineers doing concurrency and load testing, isolation levels determine which anomalies are possible and therefore which test scenarios to design: can a dirty read occur? Can the same row return different values within a transaction? Understanding the levels helps QA write targeted concurrency tests rather than guessing what could go wrong.

Real-world QA use case:
A QA engineer is testing a banking application under load. She designs a dirty-read test: Transaction A updates an account balance but doesn't commit; Transaction B reads the same balance. With READ UNCOMMITTED, Transaction B sees the uncommitted update — a serious data integrity issue for a bank. She confirms the application uses READ COMMITTED as a minimum, then designs a non-repeatable read test to verify whether that isolation level is sufficient for the balance-check use case.

Rule of thumb: for most applications, READ COMMITTED is the correct default; use REPEATABLE READ when a transaction must see consistent data across multiple reads of the same row; use SERIALIZABLE only when absolute correctness is required and you can tolerate retry logic on serialisation failures.
💡 Plain English: Think of editing a shared Google Doc with a coworker. **READ UNCOMMITTED**: you see their half-typed, misspelled sentence before they finish (dirty read). **READ COMMITTED**: you only ever see finished sentences. **REPEATABLE READ**: the paragraph you're reading is frozen — it won't change while your eyes are on it. **SERIALIZABLE**: you each take strict turns, nobody types at the same time. The safer you go, the more everyone has to wait.
2
Gaps & Islands

Write a query to find every user who has logged in on 3 or more consecutive days, returning the start date and length of each streak.

-- Activity(user_id, login_date)  -- one row per user per day they logged in
This is the classic gaps-and-islands problem. The trick: on a run of back-to-back dates, date − row_number stays the *same*, so we can use it to group each streak.

``sql
WITH numbered AS (
SELECT user_id,
login_date,
ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
) AS rn
FROM (SELECT DISTINCT user_id, login_date FROM Activity) a
),
streaks AS (
SELECT user_id,
login_date,
DATEADD(DAY, -rn, login_date) AS grp -- stays the same within one streak
FROM numbered
)
SELECT user_id,
MIN(login_date) AS streak_start,
COUNT(*) AS streak_length
FROM streaks
GROUP BY user_id, grp
HAVING COUNT(*) >= 3;
`

Why it works (in plain terms): number each person's login days 1, 2, 3, 4… On a non-stop streak, the date goes up by 1 each day and the number also goes up by 1 — so "date minus number" never changes. The moment they skip a day, that value jumps, which tells us a new streak began. Then we just group by that value and count the days in each streak, keeping the ones with 3 or more.

*(Use
INTERVAL in Postgres or DATE_SUB` in MySQL — same idea.)*

Why it exists:
Consecutive-day or streak-based logic appears in retention metrics, loyalty features, compliance tracking, and gamification systems. The gaps-and-islands pattern is the go-to SQL technique for detecting these streaks — but it's non-obvious enough that many developers implement it incorrectly with application-side loops. For QA, understanding the pattern lets you write the reference query to validate streak calculations in the application.

Real-world QA use case:
A QA engineer is testing a "7-day login streak" badge in a gamification feature. She uses the gaps-and-islands query to independently calculate each user's streaks from the Activity table and compares the results against which badges have been awarded. She finds that the application's streak counter is resetting incorrectly when a user logs in twice in one day (because the streak code counts rows rather than distinct dates) — a bug the reference query catches immediately because DISTINCT login_date is part of the correct approach.

Rule of thumb: the gaps-and-islands pattern works by computing (date − row_number) for each row within a partition — this value stays constant within a consecutive run and jumps when a gap occurs, grouping each streak automatically.
💡 Plain English: Think of marking gym days on a calendar. Number each visit: 1, 2, 3… On a non-stop streak, the date and the visit number both go up by 1 together, so "date minus number" stays the same. Skip a day and that value suddenly jumps — that's how you spot a new streak starting. Same value = same streak; count the days in each.
3
Concurrency

Your production application is hitting deadlocks under load. How do you diagnose the cause and resolve them?

What a deadlock is: two transactions each hold something the other one needs, so both wait forever. The database notices, picks one as the "loser", and cancels it with an error.

How to find the cause:
1. Grab the deadlock report your database logs (SQL Server's system_health, the Postgres log, or MySQL's SHOW ENGINE INNODB STATUS). It tells you the two queries and what each one was waiting on.
2. Look at the order each one grabbed its locks. Deadlocks almost always come from two pieces of code locking the same things in the *opposite* order.

How to fix it (best first):
- Lock things in the same order everywhere — e.g., always update accounts before audit_log. If everyone follows the same order, a deadlock can't form. This is the real fix.
- Keep transactions short — grab locks late, save quickly, and never wait on a user or an outside service while holding a lock.
- Add the right indexes — without them, the database locks far more rows than it needs to, which makes deadlocks more likely.
- Use a lower safety level if the work allows it, so locks are held for less time.
- Add retry logic — deadlocks can't be 100% prevented, so catch the error and try again. This is a safety net, not the main fix.

Example: two sessions lock the same rows in *opposite* order, and freeze:
``sql
-- Session 1 -- Session 2 (at the same time)
BEGIN; BEGIN;
UPDATE Accounts SET ... WHERE id = 1; UPDATE Accounts SET ... WHERE id = 2;
UPDATE Accounts SET ... WHERE id = 2; UPDATE Accounts SET ... WHERE id = 1;
-- Session 1 waits for id=2 (held by Session 2)
-- Session 2 waits for id=1 (held by Session 1) → deadlock
``
The fix: make *both* sessions update id=1 before id=2 — then the cycle can't form.

Why it exists:
Deadlocks are a class of concurrency bug that surfaces only under concurrent load — they're invisible in single-user testing and typically only appear during load tests or production traffic spikes. For QA, designing concurrency tests that intentionally create deadlock conditions is the only reliable way to verify that the application handles them gracefully (with retry logic) rather than returning a 500 error to the user.

Real-world QA use case:
A QA engineer is load-testing a funds transfer feature. She designs a concurrency scenario: two threads simultaneously transfer funds between accounts A and B in opposite directions. Under sustained load, deadlock errors begin appearing in the logs. The application is not retrying — it's returning "Transfer failed, please try again" to the user. She logs this as a defect: the application must handle deadlock retries automatically rather than surfacing them to the user, and the locking order should be standardised to reduce deadlock frequency.

Rule of thumb: design at least one concurrency test for every feature that writes to multiple rows in sequence — the most reliable way to prevent deadlocks is consistent lock ordering, and retry logic is a required safety net for the cases that still occur.
💡 Plain English: Two people meet in a narrow hallway, and each politely waits for the *other* to step aside first. Nobody moves — frozen forever. That's a deadlock. The real fix isn't "wait longer", it's a shared rule: **everyone always passes on the right**. In SQL that rule is "always lock tables in the same order" — if everyone follows it, the standoff simply can't happen.
4
Data Cleanup

How would you delete duplicate rows from a large table while keeping exactly one copy of each, efficiently?

-- Person(id, email, ...)  -- 'id' is unique; 'email' is duplicated
Approach 1 — ROW_NUMBER (the standard, flexible). Number the copies of each email, then delete all but the first:

``sql
WITH ranked AS (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY email -- group the duplicates together
ORDER BY id -- keep the lowest id
) AS rn
FROM Person
)
DELETE FROM Person
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
`

rn = 1 is the copy you keep. Change the ORDER BY to choose *which* copy survives (newest, highest, etc.).

Approach 2 — keep MIN(id) per group (simpler, no window functions). Delete any row that isn't the lowest id for its email:

`sql
DELETE FROM Person
WHERE id NOT IN (
SELECT MIN(id) FROM Person GROUP BY email
);
`

Approach 2 is more portable; Approach 1 is more flexible about which copy to keep.

On a big table, being correct isn't enough — be safe too:
- Delete in small batches (a few thousand at a time). One huge delete locks the table and blocks other users.
- Index the duplicate column (
email) so the scan isn't slow.
- For very large cleanups, copy-and-swap: copy the keepers into a fresh table, then swap it in — safer than a giant delete, and you keep the original until verified.
- Stop it recurring — add a
UNIQUE rule on email`.

Why it exists:
Duplicate rows in large production tables are one of the most common data quality incidents QA engineers face — typically introduced by a missing unique constraint during an initial migration, a retry in a batch job without idempotency, or a race condition under concurrent inserts. The ability to safely identify, count, and remove duplicates at scale is a core data remediation skill.

Real-world QA use case:
A QA engineer is tasked with cleaning up 800,000 duplicate Person records discovered after a CRM data migration. She uses the ROW_NUMBER approach to identify which rows to delete, runs the SELECT preview to sample 50 rows and confirm the right records are marked for deletion, then executes the delete in batches of 10,000 rows with a brief pause between each batch to avoid locking the live application. After cleanup she adds a UNIQUE constraint on email to prevent recurrence — a schema change that was missing from the original migration plan.

Rule of thumb: when deleting duplicates from a large table, always: (1) preview the target rows with SELECT before deleting, (2) delete in batches to avoid locking, and (3) add a UNIQUE constraint afterward so the duplicates can't come back.
💡 Plain English: It's like clearing the duplicate photos on your phone. You group every shot of the same scene, label them "copy 1, copy 2, copy 3…", keep copy 1, and delete the rest. To stop your phone freezing, you don't wipe 50,000 photos in one tap — you clear them in small batches. Then you switch on "no duplicate uploads" so the mess can't come back (the `UNIQUE` rule).
5
Performance at Scale

A query that used to be fast has become slow as its table grew to 500M rows. Walk me through your diagnosis and the levers you would pull.

1. Look at the actual query plan. Run EXPLAIN ANALYZE. Compare the *guessed* row counts to the *real* ones — if they're way off, the database's stats are out of date and it's choosing a bad plan.

2. Find the slowest step. Look for the database scanning the whole 500M-row table, or sorting so much data that it spills to disk.

3. Try the cheap fixes first:
- Add indexes on the columns you filter and join on. A "covering index" (one that holds every column the query needs) is even better.
- Refresh the stats (ANALYZE). Cheap, and often fixes it instantly.
- Fix index-killers — wrapping a column in a function (like YEAR(date)) or starting a LIKE with % stops the index from working. Rewrite those.

4. Bigger changes as the table keeps growing:
- Partitioning — split the table by date so a query only looks at one slice instead of all 500M rows. It also makes deleting old data almost instant.
- Archive old data — most giant tables are mostly old rows nobody queries. Move them out.
- A pre-built summary table for heavy reports, refreshed on a schedule.

5. Last-resort, big-architecture changes:
- Read replicas — extra copies of the database to take the reporting load off the main one.
- Sharding — split the data across several servers. Powerful, but a big jump in complexity, so only when nothing else is left.

The mindset interviewers want to hear: measure first, fix the part you've proven is slow, and only add big complexity when the simple fixes run out — don't jump straight to "shard it."

Why it exists:
Performance at scale is the category of problem most likely to surface only in production because dev and staging databases are too small to exhibit the same behaviour. For QA engineers running performance tests on production-sized datasets, understanding the full diagnosis and remediation ladder — from index additions through to partitioning — enables them to write performance findings with specific, actionable recommendations rather than vague "it's slow" reports.

Real-world QA use case:
A QA engineer is performance-testing a user activity history feature against a 500M-row activity table. The query times out. She works through the diagnosis ladder: EXPLAIN shows a full table scan — the activity table has no index on user_id. Adding the index brings the query to 800ms. A second EXPLAIN then reveals stale statistics causing a bad join order. ANALYZE brings it to 120ms. She documents both findings with before/after EXPLAIN output, saving the team from a premature and unnecessary sharding discussion.

Rule of thumb: when diagnosing a slow query at scale, work through fixes in order of cost — indexes and statistics refreshes first, then query rewrites, then partitioning, then infrastructure changes — each is cheaper and less disruptive than the next.
💡 Plain English: It's a library that quietly grew from 1,000 books to 5 million. The old "just walk the shelves" trick now takes all day. First you check the catalogue to see *where* the time goes (the query plan). Then you put up better signs (indexes). Still drowning? Split the building into wings by subject (partitioning) and send the dusty old books to storage (archiving). Only when one building genuinely can't cope do you open a second branch across town (sharding) — because running two buildings is a whole new headache.
6
Transactions

What does ACID stand for in databases?

ACID is four guarantees that keep transactions reliable:
- Atomicity — all steps in a transaction happen, or none do. A half-done money transfer can't be left hanging.
- Consistency — a transaction moves the database from one valid state to another, never breaking its rules (constraints).
- Isolation — running transactions don't trip over each other; the result is as if they ran one at a time.
- Durability — once committed, the change survives even a crash or power cut.

Together, they're why you can trust a bank transfer or an order to be all-or-nothing and permanent.

Example: a money transfer wrapped in a transaction:
``sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- from
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- to
COMMIT;
`
- Atomicity — if the second UPDATE fails, the first is rolled back too; money never half-moves.
- Consistency — a rule like
CHECK (balance >= 0)` blocks the COMMIT if account 1 would go negative.
- Isolation — until COMMIT, no other session sees the half-finished transfer.
- Durability — once COMMIT returns, it survives even if the server crashes a second later.

Why it exists:
ACID guarantees are the foundation of data reliability in relational databases. For QA engineers, each property corresponds to a distinct test category: Atomicity → test partial failure rollback; Consistency → test constraint violations and boundary conditions; Isolation → test concurrent transaction behaviour; Durability → test recovery after crash or power cut. Without understanding ACID, QA engineers miss entire categories of required test cases.

Real-world QA use case:
A QA engineer is testing a fund transfer feature. She designs four test scenarios mapped to each ACID property: (A) force a network failure after the debit but before the credit — verify the debit is rolled back; (C) attempt a transfer that would take the account negative — verify the CHECK constraint blocks it; (I) run 50 concurrent transfers on the same account — verify no double-spend; (D) kill the database process immediately after COMMIT returns — verify the transfer is recoverable after restart. All four pass. She documents ACID coverage in the test plan.

Rule of thumb: when testing any financial or data-critical feature, map your test cases to each ACID property explicitly — Atomicity (failure rollback), Consistency (constraint enforcement), Isolation (concurrent access), Durability (crash recovery) — to ensure none of the four are left untested.
💡 Plain English: Think of an ATM withdrawal. **Atomicity**: you get cash *and* your balance drops, or neither — never cash without the deduction. **Consistency**: you can't end up overdrawn if the rules forbid it. **Isolation**: two people on a joint account don't both grab the last $50. **Durability**: once it says "done", a power cut won't undo it.
7
Scaling

What is the difference between sharding, partitioning, and replication?

All three deal with big or busy databases, but solve different problems:
- Partitioning — split one big table into chunks *on the same server* (e.g., by month). Queries touch only the chunk they need.
- Sharding — split the data across *multiple servers*, each holding a slice (e.g., users A–M on one, N–Z on another). Used when one machine can't handle the write load.
- Replication — keep *copies* of the same data on several servers. Great for read-heavy apps (reads spread across copies) and for failover if one dies.

Rough rule: partition for query speed, replicate for read scaling and safety, shard only when writes outgrow a single machine.

Example: all three, side by side —

Partitioning — one server, one table split into chunks:
``sql
CREATE TABLE Orders (id INT, order_date DATE, amount NUMERIC)
PARTITION BY RANGE (order_date);

CREATE TABLE orders_2023 PARTITION OF Orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
`
A query for 2023 orders now reads only the
orders_2023` chunk.

Sharding — many servers, each holding a *slice*: users A–M live on Server 1, N–Z on Server 2. A lookup for "Asha" is routed only to Server 1.

Replication — many servers, each holding a *full copy*: the primary takes all writes, and read-only copies serve reports and stand by as backups if the primary fails.

Why it exists:
As systems scale, a single database server eventually becomes a bottleneck. Understanding sharding, partitioning, and replication is important for senior QA because each approach has different implications for test environment setup, data consistency, and which test scenarios are possible: replicas introduce read-lag to test; partitioned tables have different query behaviours at the boundaries; sharded databases require testing that cross-shard queries return complete results.

Real-world QA use case:
A QA engineer is setting up a performance test environment for a system that uses read replicas. She discovers that a test scenario reads data from a replica immediately after writing it to the primary — and the replica has up to 500ms of replication lag. Her test is flaky because it sometimes reads stale data from the replica. She adds a deliberate 1-second wait after writes in the test setup, documents the replication lag as a known behaviour, and raises a recommendation that the application use the primary for consistency-critical reads.

Rule of thumb: replication lag is a required test consideration whenever your test environment uses read replicas — always account for the lag window when a test writes data and immediately reads it back.
💡 Plain English: Picture a busy restaurant. **Partitioning** is organising one kitchen into stations (grill, salad, dessert) so cooks aren't in each other's way. **Replication** is opening identical kitchens so more orders cook at once — and if one catches fire, the others carry on. **Sharding** is opening separate branches in different cities, each serving its own customers.
8
Indexing

How does a database index actually make lookups fast?

Most indexes are a B-tree — a sorted, branching structure. Instead of reading every row, the database starts at the top and follows a few branches down to the value, like a decision tree.

Because each step rules out a huge chunk of the data, finding one row in a million takes only a handful of hops instead of a million checks. The data is kept sorted, which is also why range searches (BETWEEN, >) and ORDER BY on the indexed column are fast.

The cost: every insert, update, and delete has to keep the tree sorted — so indexes speed up reads but slightly slow down writes.

Why it exists:
Understanding how a B-tree index actually works — rather than just knowing "indexes make queries fast" — is what enables QA engineers to write informed performance test findings. It explains why range queries benefit from indexes, why function-wrapped columns break indexes, and why a table scan on a small table is actually correct behaviour from the planner.

Real-world QA use case:
A QA engineer is reviewing an EXPLAIN plan that shows an index seek on a 50M-row Orders table — fast, as expected. She adds an index hint test to confirm the query regresses to a full scan if the index is dropped, to demonstrate the index's value in the performance test evidence. Understanding that B-tree indexes are sorted is what tells her to also test range queries (WHERE order_date BETWEEN) and ORDER BY on the indexed column — both benefit from the sorted structure without additional index seeks.

Rule of thumb: B-tree indexes are sorted structures — they speed up equality lookups, range queries, and ORDER BY on the indexed column; they do NOT help with function-wrapped columns, leading wildcard LIKE patterns, or columns with very low cardinality.
💡 Plain English: It's how you find a word in a dictionary. You don't read from page one — you open the middle, see you've gone too far, jump back, narrow it down, and land on the word in seconds. A B-tree lets the database "open to the middle" over and over until it pinpoints the row.
9
Concurrency

What is the difference between optimistic and pessimistic locking?

Both stop two people from overwriting each other's changes, but in opposite ways:
- Pessimistic locking — lock the row *up front* while you work on it. Others must wait. Safe, but it can cause waiting and deadlocks. Good when conflicts are common.
- Optimistic locking — don't lock. Let everyone read freely, but when saving, check whether the row changed since you read it (often via a version number or timestamp). If it did, reject the save and ask them to retry. Good when conflicts are rare.

Most web apps use optimistic locking, because most edits don't actually collide.

Example:
``sql
-- Pessimistic: lock the row now; others must wait
SELECT * FROM Accounts WHERE id = 1 FOR UPDATE;

-- Optimistic: no lock; check the version when saving
UPDATE Accounts SET balance = 50, version = version + 1
WHERE id = 1 AND version = 7; -- 0 rows updated = someone beat you → retry
``

Why it exists:
Locking strategy determines how a system behaves under concurrent access — a critical concern for QA engineers designing concurrency test scenarios. Pessimistic locking prevents conflicts but can create deadlocks and bottlenecks under load. Optimistic locking allows concurrent reads but requires the application to handle "someone beat you to it" gracefully rather than silently overwriting another user's changes.

Real-world QA use case:
A QA engineer is testing a collaborative document editing feature that uses optimistic locking with a version field. She designs a test where two users simultaneously edit the same document: User A reads version 5, User B reads version 5. User A saves first — version becomes 6. User B's save returns 0 rows affected. She verifies the application detects this correctly and shows User B a "document was updated, please refresh" message rather than silently discarding User A's changes.

Rule of thumb: always test optimistic locking with a concurrent-edit scenario — two users reading the same version, one saving first — and verify the second save is correctly rejected with a meaningful conflict message, not a silent data loss.
💡 Plain English: Pessimistic is taking the *only* library copy of a book home so nobody else can touch it until you're done. Optimistic is everyone photocopying the page freely — but before you hand in your edits, you check that nobody changed that page first. If they did, you redo your bit.
10
Architecture

What is the difference between OLTP and OLAP systems?

They're databases tuned for opposite jobs:
- OLTP (transaction processing) — lots of small, fast reads and writes. Think placing an order or updating a profile. Highly normalised, optimised for quick single-row changes.
- OLAP (analytical processing) — big, complex read queries over huge amounts of history. Think "total sales by region over the last 3 years." Often denormalised or column-stored for fast scanning.

You usually keep them separate, so heavy reports don't slow down the live app — analytics run on a copy or a data warehouse.

Example:
``sql
-- OLTP: tiny, fast, one row (a checkout, thousands per second)
UPDATE Accounts SET balance = balance - 50 WHERE id = 1;

-- OLAP: huge scan + aggregation (an analyst's report, run occasionally)
SELECT region, SUM(amount)
FROM Orders
WHERE order_date >= '2021-01-01'
GROUP BY region;
``

Why it exists:
OLTP and OLAP have fundamentally different performance characteristics — a heavy analytical query run against an OLTP database can hold locks and cause timeouts for live users. For QA engineers, understanding the distinction matters when deciding which database to run validation queries against: large analytical queries should target a warehouse or read replica, never the production OLTP database.

Real-world QA use case:
A QA engineer runs a large data validation query directly against the production OLTP database to verify a migration. The query does a full table scan across 50M rows, causing timeout errors for real users for 40 seconds. The post-incident review establishes a policy: all QA validation queries that could run for more than 5 seconds must be run against the read replica or data warehouse — never the primary OLTP database.

Rule of thumb: run small, targeted lookups against OLTP databases; run analytical aggregations, full-table comparisons, and long-running validation queries against read replicas or data warehouses to protect live user traffic.
💡 Plain English: OLTP is the shop's cash register — quick, one customer at a time, all day long. OLAP is the analyst in the back office pulling a year of receipts to spot trends. You wouldn't tie up the till to run that big report, so you give the analyst their own copy of the data.
11
Schema Design

When would you deliberately denormalize a database?

Normalisation avoids duplicate data, but it can make reads slow because you join many tables. You denormalise — store some data redundantly — when read speed matters more than tidy storage:
- A report joins 8 tables and is too slow → store a pre-joined summary.
- You constantly count something (likes, comments) → keep a running count column instead of counting every time.

The trade-off: redundant data must be kept in sync, which complicates writes. So denormalise only after measuring a real problem, never by default.

Example: showing a post's like-count:
``sql
-- Normalized (slow if run on every page view):
SELECT COUNT(*) FROM Likes WHERE post_id = 42;

-- Denormalized: keep a running count on the post itself
UPDATE Posts SET like_count = like_count + 1 WHERE id = 42;
``
Reads become instant — but now every like has to update two places.

Why it exists:
Deliberate denormalisation is an intentional performance optimisation — but it introduces a data consistency risk: the cached value (like_count) can drift from the true source-of-truth count if write updates are missed. For QA, denormalised columns require specific data integrity tests: verify that the cached counter matches the actual COUNT after every operation type (add, remove, undo), and design concurrency tests to check the counter under simultaneous updates.

Real-world QA use case:
A QA engineer is testing a like-count counter that was denormalised for performance. She tests four scenarios: like a post (count +1 ✓), un-like it (count -1 ✓), like the same post from two devices simultaneously (count +2 — race condition causes count +1 ✗). The concurrent update reveals a missing database-level atomic increment — the application is reading then writing, not using UPDATE ... SET like_count = like_count + 1, so simultaneous updates collide and lose one increment.

Rule of thumb: whenever a denormalised counter is introduced, add a data integrity test that compares the counter value against COUNT(*) from the source table — and always test concurrent increment/decrement operations to verify atomicity.
💡 Plain English: It's like writing your phone number on every page of your notebook instead of once on the cover. Looking it up is instant from any page — but if the number changes, you've got a lot of pages to update. Worth it only if you look it up constantly.
12
Views

What is the difference between a view and a materialized view?

- A view is a saved query. It stores no data — every time you use it, the underlying query runs fresh. Always up to date, but no speed gain.
- A materialized view actually *stores* the results, like a cached table. Reading it is fast because the work is already done, but the data is only as fresh as the last refresh — you have to refresh it on a schedule or trigger.

Use a plain view to simplify and reuse a query; use a materialized view when a heavy query runs often and slightly stale data is acceptable.

Example:
``sql
-- View: re-runs the query every time, always fresh
CREATE VIEW active_users AS
SELECT * FROM Users WHERE status = 'active';

-- Materialized view: stores the result, fast to read, must be refreshed
CREATE MATERIALIZED VIEW sales_by_region AS
SELECT region, SUM(amount) FROM Orders GROUP BY region;

REFRESH MATERIALIZED VIEW sales_by_region; -- run on a schedule
``

Why it exists:
Views and materialized views are both abstractions over complex queries, but with opposite freshness/performance trade-offs. For QA engineers, the difference matters when testing reporting features: a view always reflects current data (data changes are immediately visible), while a materialized view requires a refresh cycle (staleness within the refresh window is expected behaviour, not a bug).

Real-world QA use case:
A QA engineer is testing a sales dashboard backed by a materialized view refreshed hourly. She inserts a large test order and immediately checks the dashboard — the order doesn't appear. She documents this as expected behaviour, then verifies the next scheduled refresh includes the order. She also tests the edge case where a refresh fails silently — confirming the dashboard shows a visible "last updated" timestamp so users know the data may be stale.

Rule of thumb: when testing features backed by materialized views, explicitly test the staleness window — verify that changes made after the last refresh are not visible until the next refresh, and that the stated refresh frequency is actually being honoured.
💡 Plain English: A view is a recipe — every time you want the dish, you cook it from scratch (always fresh, but it takes time). A materialized view is a batch you cooked earlier and put in the fridge — grab-and-go fast, but you need to remember to cook a fresh batch now and then.
13
Indexing

Does the column order in a composite (multi-column) index matter?

Yes — a lot. A composite index on (a, b) is sorted by a first, then by b within each a. This "leftmost prefix" rule means:
- It helps queries filtering on a, or on a and b together.
- It does *not* help a query filtering on b alone.

``sql
-- Good for: WHERE a = ? and WHERE a = ? AND b = ?
-- Useless for: WHERE b = ?
CREATE INDEX idx ON Orders (a, b);
`

So put the column you filter on most (especially by equality) first.

Why it exists:
Composite index column order is one of the most commonly misunderstood indexing concepts — developers add a composite index but get no performance benefit because the query doesn't use the leftmost prefix. For QA engineers doing performance testing, knowing this rule lets them verify that index column order matches the actual query patterns — and catch indexes that were added with the wrong order and are therefore wasted write overhead.

Real-world QA use case:
A QA engineer is performance-testing a query that filters on
status and created_at. The developer added a composite index (created_at, status). The EXPLAIN plan shows the index is used for a query filtered on created_at alone, but a query filtered on status alone still does a full scan. She flags the index order as a defect — the most frequent query pattern filters by status` first, so the index should be (status, created_at) to enable the status-only filter to use the leftmost prefix.

Rule of thumb: in a composite index, put the most selective column first and the one used in range conditions last — the leftmost prefix rule means the index only helps queries that filter on the first column (or the first two columns together).
💡 Plain English: It's a phone book sorted by last name, then first name. Brilliant for finding "Smith", or "Smith, John". But useless for finding everyone called "John" — the book isn't organised by first name, so you'd have to read all of it.
14
Indexing

What is a covering index?

A covering index contains *every column a query needs* — both the columns it filters on and the columns it returns. Because everything is right there in the index, the database never has to go back to the table for the full row, saving a whole step.

``sql
-- Query: SELECT email FROM Users WHERE status = 'active';
-- Covering index includes status (to filter) and email (to return):
CREATE INDEX idx ON Users (status) INCLUDE (email);
`

It's one of the biggest easy wins for a hot, slow query.

Why it exists:
A covering index eliminates the most expensive part of an index-based lookup — the second round trip to the table to fetch the actual row data. For QA engineers running performance tests, "Key Lookup" or "Heap Fetches" in an EXPLAIN plan are a reliable signal that a covering index could provide a significant speed improvement for high-frequency queries on large tables.

Real-world QA use case:
A QA engineer is performance-testing a high-frequency API endpoint that fetches active users' email addresses. The EXPLAIN plan shows an index seek on
status followed by a Heap Fetch (table lookup) to retrieve email — two steps per row. She identifies this as a covering index opportunity and raises a recommendation: adding INCLUDE (email)` to the existing status index eliminates the table lookup entirely, dropping the endpoint's p99 latency from 340ms to 85ms under load.

Rule of thumb: when an EXPLAIN plan shows "Key Lookup" or "Heap Fetches" on a high-frequency query, check whether a covering index (adding the returned columns to the index's INCLUDE list) would eliminate the table lookup entirely.
💡 Plain English: It's like a contacts list that already shows the phone number next to each name. If all you wanted was the number, you're done — no need to open the full contact card. The index "covers" everything the question asked for.
15
Distributed Systems

What is the CAP theorem?

For a distributed database (data spread across servers), the CAP theorem says that when a network problem splits the servers apart, you can only keep *two* of these three:
- Consistency — everyone sees the same latest data.
- Availability — every request still gets an answer.
- Partition tolerance — the system keeps working despite the network split.

Since network splits *will* happen, you must tolerate partitions — so the real choice during a split is: favour Consistency (refuse stale answers) or Availability (answer anyway, possibly with old data)? Banks lean consistent; social feeds lean available.

Why it exists:
The CAP theorem frames a fundamental trade-off that shapes how distributed systems behave under failure. For QA engineers, understanding CAP is important when designing failure and chaos tests: what does the system do when it loses connectivity between nodes? Does it return an error (consistent) or stale data (available)? The answer determines which test scenarios to write and what the expected behaviour should be.

Real-world QA use case:
A QA engineer is testing a distributed loyalty points system. She designs a network partition test: she cuts connectivity between two database nodes and attempts a points redemption. The system is configured for consistency — it correctly refuses the transaction and returns an error rather than risk double-spending points. She documents this as a pass. She then tests the error message quality: instead of a generic 500 error, the system should return "Service temporarily unavailable, please try again" — the user-facing message is a separate failing test.

Rule of thumb: when testing distributed systems, always include a network partition test and verify the system's CP vs AP behaviour explicitly — document the expected behaviour in the test plan so everyone agrees on what "correct" means before testing begins.
💡 Plain English: Two shop branches lose the phone line between them. A customer wants to spend loyalty points. The **consistent** choice: "sorry, can't confirm your balance right now" (correct, but unavailable). The **available** choice: "sure, go ahead" (works, but the two branches might now disagree on the balance). You can't have both while the line is down.
16
Migrations

How would you add a new NOT NULL column to a huge, live table without downtime?

Doing it in one step can lock the table and take it offline. Instead, break it into safe stages:
1. Add the column as nullable (instant on most modern databases).
2. Backfill existing rows in small batches, so you never lock the whole table at once.
3. Update the app to start writing the column for new rows.
4. Once every row has a value, add the NOT NULL constraint.

This way the table stays online throughout, and no single step locks it for long.

Example: adding a NOT NULL country column in safe stages:
``sql
ALTER TABLE Users ADD COLUMN country VARCHAR(2); -- 1. nullable, instant
UPDATE Users SET country = 'US' -- 2. backfill in batches
WHERE country IS NULL AND id BETWEEN 1 AND 10000;
-- ...repeat step 2 for the next batches...
ALTER TABLE Users ALTER COLUMN country SET NOT NULL; -- 4. lock in the rule
`

Why it exists:
A single-step ALTER TABLE ADD NOT NULL on a large production table locks the entire table while backfilling every row — potentially locking out users for minutes or hours. The staged approach breaks this into non-blocking steps. For QA, verifying that migrations use this pattern is part of testing release safety — a migration that locks a 100M-row table is a P0 incident even if the schema change itself is correct.

Real-world QA use case:
A QA engineer reviews a migration script before a production release and sees a single-line
ALTER TABLE Users ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'active'`. On the 80M-row production table she estimates this will lock the table for 15–25 minutes. She raises a blocking defect: the migration must use the four-stage approach (add nullable, backfill in batches, set default, then add NOT NULL constraint) to avoid a production outage during the release window.

Rule of thumb: never add a NOT NULL column to a large table in a single ALTER TABLE — always use the four-stage approach to keep the table available throughout, and time the migration on a production-sized clone before scheduling it.
💡 Plain English: It's like repainting a busy shop without closing it. You don't shut the doors and paint everything at once — you cordon off one small section at a time, paint it, and move on. Customers keep shopping throughout, and eventually the whole place is done.
17
Performance

What is the N+1 query problem, and how do you fix it?

It's when code runs 1 query to get a list, then 1 *more* query for each item in that list — so fetching 100 orders fires 1 + 100 = 101 queries. It's slow because of all the round-trips to the database, and it usually comes from ORMs loading related data lazily.

Fix it by fetching everything in one go:
- Use a JOIN (or the ORM's "eager loading") to pull the related rows with the main query.
- Or fetch all the related rows with a single WHERE id IN (...).

One query instead of 101.

Example: showing 100 posts with their author:
``sql
-- N+1 (101 queries): 1 for the posts, then 1 per post for its author
SELECT * FROM Posts; -- 100 rows
SELECT * FROM Authors WHERE id = ?; -- fired 100 times, once per post

-- Fixed (1 query): join the author in
SELECT p.*, a.name
FROM Posts p JOIN Authors a ON a.id = p.author_id;
``

Why it exists:
The N+1 problem is one of the most common ORM anti-patterns — the code returns the right data but the database round-trip count scales linearly with the row count. It often passes all functional tests but causes linear performance degradation that only appears under realistic data volumes. For QA engineers, N+1 is specifically worth testing because ORMs hide it behind clean API calls that look harmless in code review but fire dozens or hundreds of silent queries.

Real-world QA use case:
A QA engineer is performance-testing a product listing page. With 20 products the response time is 180ms. With 200 products it's 1,800ms — linear scaling, a classic N+1 signal. She enables query logging and confirms: 1 query for the product list, then 200 additional queries loading each product's category. She files the N+1 defect with the query log evidence and the linear scaling data as proof.

Rule of thumb: when performance test response time scales linearly with the number of rows returned, suspect N+1 queries — enable query logging and count the database round-trips; the count should be a small constant, not proportional to the result set size.
💡 Plain English: It's like grocery shopping by walking to the shop, coming home, then going *back* for each item one at a time — bread, then again for milk, then again for eggs. The fix is obvious: take one list and buy everything in a single trip.
18
Replication

What are read replicas, and what is replication lag?

A read replica is a copy of your main database that handles read queries, taking load off the primary. Writes go to the primary, then copy across to the replicas.

Replication lag is the short delay before a write shows up on a replica. If a user saves something and immediately reads from a replica, they might briefly see the *old* data. Handle it by reading "read-your-own-write" critical data from the primary, while sending heavy reports and non-urgent reads to replicas.

Why it exists:
Read replicas are a common scaling pattern for read-heavy applications, but they introduce a class of data freshness bug that only appears when reads are routed to replicas immediately after a write. For QA engineers, replication lag is a required test consideration in any environment that uses replicas — tests that write data and immediately read it back can flake because of lag, and production users can see stale data.

Real-world QA use case:
A QA engineer notices that an automated test reliably passes when run alone but occasionally fails in the full test suite. The test creates a new order and immediately verifies it appears in a list endpoint. The list endpoint reads from a read replica with up to 300ms replication lag. She adds a targeted assertion delay in the test setup and documents the lag in the test plan. She also raises a separate recommendation: the "order created" confirmation page should always read from the primary to guarantee the user sees their own write.

Rule of thumb: when testing systems with read replicas, always check whether your test reads from the primary or a replica immediately after a write — a brief wait or primary-targeting is required to avoid flaky tests caused by replication lag.
💡 Plain English: The primary is head office; replicas are branch offices that get a copy of every memo. Branches answer most questions, so head office isn't swamped. But there's a small delay before a new memo reaches the branches — so for something you *just* changed, you ask head office directly to be sure.
19
Window Functions

How do you find the top 10% of earners?

-- Employee(id, name, salary)
Approach 1 — ORDER BY + LIMIT (simplest). Sort by salary and take the top tenth of the row count:

``sql
SELECT name, salary
FROM Employee
ORDER BY salary DESC
LIMIT (SELECT CEIL(COUNT(*) * 0.1) FROM Employee);
`

Approach 2 — NTILE (clean "which 10% bucket"). Split everyone into 100 equal ranked buckets, keep the top 10:

`sql
SELECT name, salary FROM (
SELECT name, salary,
NTILE(100) OVER (ORDER BY salary DESC) AS pct
FROM Employee
) t
WHERE pct <= 10;
`

Approach 3 — PERCENT_RANK (exact percentile). Keep rows in the top 10% by their precise rank position:

`sql
SELECT name, salary FROM (
SELECT name, salary,
PERCENT_RANK() OVER (ORDER BY salary DESC) AS pr
FROM Employee
) t
WHERE pr <= 0.10;
`

Approach 1 is the most intuitive;
NTILE / PERCENT_RANK` handle ties and exact boundaries more precisely on large datasets.

Why it exists:
Top-N percent queries are common in compensation analysis, loyalty tier assignment, and performance ranking reports. For QA engineers, the key validation concern is boundary behaviour — who sits exactly at the 10% cutoff, and do ties at the boundary cause one or more extra rows to be included or excluded? Each of the three approaches handles ties differently, and the wrong choice can over- or under-include borderline users in a tier.

Real-world QA use case:
A QA engineer is validating a bonus eligibility feature that awards bonuses to the top 10% of earners. She tests with 100 employees — the 10th-ranked employee ties with the 11th. Using LIMIT 10, only one of the tied employees qualifies. Using NTILE, both tied employees fall in the top 10 bucket (11 rows). The business rule says "ties both qualify" — NTILE is the correct implementation, and the LIMIT-based approach is a bug at the boundary.

Rule of thumb: when implementing top-N percent features, clarify the tie-breaking rule before choosing the SQL approach — NTILE includes tied boundary rows proportionally, PERCENT_RANK uses exact fractional position, and LIMIT truncates at the count regardless of ties.
💡 Plain English: It's like grading on a curve. You line everyone up by score, slice the line into 100 equal bands, and the top ten bands are your top 10%. `NTILE(100)` does the slicing evenly for you.
20
Query Optimization

How does the database decide how to run a query?

Modern databases use a cost-based optimiser. For a given query there are many possible plans — which index to use, which join order, which join method. The optimiser estimates the "cost" of each, based on table sizes and column statistics, and picks the cheapest.

This is why statistics matter: if they're stale, the optimiser misjudges how many rows it'll get and may pick a bad plan. It's also why two similar queries can run very differently. You inspect its choice with EXPLAIN.

Example: ask the database to reveal its plan without running the query:
``sql
EXPLAIN
SELECT * FROM Orders WHERE customer_id = 42;
-- "Index Scan using orders_customer_idx" → good, it's using the index
-- "Seq Scan on orders" → reading the whole table, often bad
``
A surprise Seq Scan usually means stale statistics or a missing index.

Why it exists:
The query optimiser is the database's decision-making engine — it determines whether a query takes 1ms or 1 minute based on table statistics and available indexes. For QA engineers, EXPLAIN is the primary tool for diagnosing performance findings: it shows whether the optimiser is making good decisions and whether the indexes the developers added are actually being used.

Real-world QA use case:
A QA engineer runs performance tests and finds a customer lookup endpoint timing out. She runs EXPLAIN and sees "Seq Scan on Customers (rows=1000000)" — the optimiser estimates 1M matching rows and scans the whole table, even though an index on email exists. She refreshes the statistics (ANALYZE) and re-runs EXPLAIN: the plan switches to an index scan and the query drops from 4 seconds to 8ms. Stale statistics were the entire cause — no index change required.

Rule of thumb: always run EXPLAIN before diagnosing a slow query as "missing an index" — stale statistics can cause the optimiser to choose a table scan even when a perfectly good index already exists; ANALYZE fixes the statistics cost-free.
💡 Plain English: It's like a sat-nav planning a route. It knows roughly how busy each road is (the statistics) and picks what it thinks is fastest. If its traffic info is out of date, it'll confidently send you down a jammed road — which is exactly what stale statistics do to a query.
21
Durability

How does a database avoid losing committed data in a crash?

Most use a write-ahead log (WAL): before changing the actual data, the database first writes the change to a sequential log on disk and confirms it. So when you COMMIT, the change is safely in the log even if the main data files haven't been updated yet.

If the server crashes, on restart it replays the log — redoing committed changes and undoing half-finished ones — bringing the database back to a correct state. This is what makes the "D" (durability) in ACID real.

Why it exists:
WAL/write-ahead logging is the mechanism behind database durability — it's why a COMMIT is permanent even if the server crashes a millisecond later. For QA engineers, understanding WAL is important when designing disaster recovery tests: you can verify that no committed data is lost after a simulated crash, and that in-progress (uncommitted) transactions are correctly rolled back on recovery.

Real-world QA use case:
A QA engineer is testing disaster recovery for a financial application. She runs a transaction that transfers funds and commits it, then immediately kills the database process. On restart, she verifies the transfer is still present — WAL replay ensures the committed change survived the crash. She also runs a test where she kills the process mid-transaction (before COMMIT) and verifies the partially-written transfer is rolled back on recovery, leaving no partial state.

Rule of thumb: durability testing requires two scenarios — commit then crash (data should survive) and crash mid-transaction (partial state should be rolled back) — both are required to verify the WAL correctly handles committed and uncommitted changes.
💡 Plain English: It's like jotting every transaction in a notebook *before* updating the big ledger. If the power cuts out mid-update, you reopen the notebook and redo the entries you'd noted but not yet copied over. Nothing confirmed is ever lost.
22
Performance

How do you update millions of rows without locking the table for everyone?

Don't do it in one giant statement — that holds locks for ages and bloats the log. Instead, update in small batches in a loop:

``sql
-- repeat until no rows are left to update
UPDATE Orders
SET status = 'archived'
WHERE id IN (
SELECT id FROM Orders WHERE status = 'old' LIMIT 5000
);
``

Each batch commits quickly, releasing locks so other users can get in between batches. Add a short pause between batches if the server is busy.

Why it exists:
A single UPDATE or DELETE touching millions of rows holds locks for the entire duration — potentially blocking all other users for minutes. Batch updates solve this by breaking the work into small, fast commits that each release their locks immediately. For QA engineers, batch processing is important to test because it introduces new edge cases: what happens if the batch job is interrupted halfway through? Does it restart cleanly?

Real-world QA use case:
A QA engineer is testing a "bulk archive old orders" job that updates 2M rows. She tests the single-statement approach on the staging database and measures a 4-minute table lock that times out all other active tests in the suite. She recommends the batch approach (5,000 rows per batch). She also designs an interruption test: she kills the batch job after 300,000 rows are updated and verifies that the remaining rows are correctly processed on the next run, with no rows archived twice or missed.

Rule of thumb: batch large UPDATE and DELETE operations in chunks of 1,000–10,000 rows — each batch commits quickly and releases locks, keeping the application responsive while the bulk operation runs in the background.
💡 Plain English: It's like moving house. You don't carry every box in one impossible trip that blocks the whole hallway — you move a few boxes at a time, letting other people pass between trips. Slower in total, but nobody's left stuck waiting.
23
Debugging

A query is fast in your dev database but slow in production. What is going on?

Almost always it's about scale and stats, not the SQL itself:
- Data size — dev has 1,000 rows, prod has 50 million. A missing index doesn't hurt on small data but kills you on big.
- Stale statistics in prod lead the optimiser to a bad plan.
- Concurrency — in prod, other queries compete for locks, CPU, and memory.
- Different config or hardware between the two environments.

How to investigate: run EXPLAIN on production (carefully), compare its plan to dev, and check indexes and statistics first.

Example: the same query, two very different worlds:
``sql
EXPLAIN ANALYZE SELECT * FROM Orders WHERE customer_id = 42;
-- Dev (1k rows): Seq Scan ... 2ms → scanning everything is cheap
-- Prod (50M rows): Seq Scan ... 8000ms → identical plan, now a disaster
`
The usual fix: an index on
customer_id` plus refreshed statistics in prod.

Why it exists:
"Works in dev, fails in prod" is one of the most common and frustrating categories of production incident. For QA engineers, the root cause is almost always data volume — dev databases are too small to exhibit the same query plan behaviour as production. Understanding this gap is what drives good test environment sizing: performance tests must use production-sized data or the results are misleading.

Real-world QA use case:
A QA engineer establishes a rule for her team's performance tests: all performance tests must run against a database with at least 10% of production row counts, and the test environment must have statistics refreshed after the test data is loaded. She enforces this after two incidents where "fast in staging" queries caused production timeouts — in both cases the staging database had 10,000 rows while production had 10M, making the test results completely non-representative.

Rule of thumb: run performance tests against a production-sized dataset — a query that uses a table scan on 10,000 rows takes 5ms; the same scan on 10M rows takes 5 minutes; only production-scale data reveals the real plan.
💡 Plain English: It's like a recipe that works fine for two but falls apart at a wedding for 500. The steps are identical — what changed is the scale, the busy kitchen, and whether your tools were built for that volume.
24
Data Warehousing

In a data warehouse, how do you keep history when a value changes — like a customer moving city?

That's a slowly changing dimension, and the common approach is Type 2: instead of overwriting the old value, you add a *new row* for the customer and mark which one is current.

You keep columns like valid_from, valid_to, and is_current. The old row gets an end date; the new row is flagged current. Now a report from last year still shows the customer's *old* city, and today's report shows the new one — history is preserved.

(Type 1, by contrast, just overwrites and forgets the old value.)

Example: Asha moves from Pune to Delhi — keep both rows:

| id | name | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 1 | Asha | Pune | 2020-01-01 | 2024-03-01 | false |
| 1 | Asha | Delhi | 2024-03-01 | NULL | true |

A report dated 2023 joins to the Pune row; today's report joins to the Delhi row.

Why it exists:
Slowly changing dimensions preserve the historical accuracy of reports — without Type 2, a customer's old order data would show their new city, making historical analysis wrong. For QA, SCD Type 2 tables require specific data integrity tests: verify that historical lookups return the value at the correct point in time, that only one row per entity is flagged is_current = true, and that the valid_to/valid_from boundary dates don't have gaps or overlaps.

Real-world QA use case:
A QA engineer is testing a historical sales report that shows where customers lived when they placed each order. She designs three test scenarios: (1) a customer who never moved — all orders show one address row; (2) a customer who moved once — orders before the move date show the old address, orders after show the new one; (3) a customer who moved twice — three rows, each order mapped to the correct row. She finds the report incorrectly maps all historical orders to the most recent address because it joins only on customer_id without filtering on order_date falling within valid_from/valid_to.

Rule of thumb: when testing SCD Type 2 tables, always include a test with a historical lookup date that falls before the latest row — if the join doesn't filter on the date range, old orders will show the customer's current data rather than the data at order time.
💡 Plain English: It's like keeping every version of your address on file instead of typing over the old one. An old parcel still shows where it was *actually* sent at the time, while new parcels use your current address. You never lose the history of where you used to live.
25
Reliability

How do you stop a retried request from charging a customer twice?

You make the operation idempotent — safe to run more than once with the same result. The usual trick is an idempotency key: the client sends a unique key with the request, and the database has a unique constraint on it.
- First time: the charge is recorded with that key.
- A retry with the same key hits the unique constraint and is safely ignored (or returns the original result) instead of charging again.

This protects against network retries, double-clicks, and timeouts where the client isn't sure the first attempt worked.

Example: a unique constraint on the key does the work:
``sql
INSERT INTO Charges (idempotency_key, amount)
VALUES ('abc-123', 100);

-- A retry with the SAME key fails the unique constraint instead of charging twice:
-- ERROR: duplicate key value violates unique constraint "charges_idempotency_key"
``

Why it exists:
Double-charging customers is one of the most damaging payment bugs — and it's caused by perfectly normal network behaviour (timeouts, retries) rather than application errors. Idempotency keys are the standard database-level solution. For QA engineers, testing the idempotency guarantee is a required test case for every payment or financial endpoint: send the same request twice and verify exactly one charge is recorded.

Real-world QA use case:
A QA engineer is testing a payment endpoint. She designs three idempotency scenarios: (1) same key, same amount, sent twice — only one charge recorded; (2) same key, different amount — the second request should be rejected or return the original response; (3) new key — a second legitimate charge is correctly created. She finds scenario 2 is not handled — the API accepts the same idempotency key with a different amount, creating ambiguity about which amount was charged. She logs this as a high-severity defect.

Rule of thumb: test idempotency with three cases: same key + same payload (one charge), same key + different payload (should reject), different key (new charge) — all three are required to fully validate the idempotency guarantee.
💡 Plain English: It's like a cloakroom ticket. You hand over your coat and get ticket #42. If you accidentally ask for your coat twice with the same ticket, you still get *one* coat — the ticket number stops them handing you a second one that isn't yours.
26
Schema Design

How would you design the database tables for a simple e-commerce site?

Start by finding the entities (the nouns) and the relationships between them.

Core tables:
- Customers(id, name, email)
- Products(id, name, price)
- Orders(id, customer_id → Customers, order_date, status)
- OrderItems(id, order_id → Orders, product_id → Products, quantity, unit_price)

The decisions an interviewer listens for:
- OrderItems is a junction table — an order has many products and a product appears in many orders (many-to-many), so you need a table *between* them.
- Store unit_price on OrderItems, not just on Products — the price at *purchase time* must be frozen, even if the product's price changes later.
- Foreign keys make sure every order item points to a real order and a real product.

Example:
``sql
CREATE TABLE OrderItems (
id INT PRIMARY KEY,
order_id INT REFERENCES Orders(id),
product_id INT REFERENCES Products(id),
quantity INT NOT NULL,
unit_price NUMERIC NOT NULL -- frozen at purchase time
);
``

Why it exists:
Schema design questions test whether a candidate understands relational data modelling at a practical level. For QA engineers, understanding e-commerce data model patterns is directly applicable to test data setup, SQL verification queries, and diagnosing data integrity issues. Knowing that unit_price must be frozen at order time — not read from Products at report time — prevents an entire class of historical pricing errors.

Real-world QA use case:
A QA engineer is validating an order history report. She notices that when a product's price is updated, old orders in the report show the new price rather than the purchase-time price. Investigation reveals the report is reading unit_price from the Products table rather than from OrderItems — the frozen price column exists in the schema but the report query was written before it was populated. She raises the defect and adds a test case: change a product price, verify historical orders still display the original purchase price.

Rule of thumb: always store unit_price on OrderItems at the time of purchase — never join back to Products for historical prices, because product prices change and historical orders must reflect what the customer actually paid.
💡 Plain English: It's setting up a filing system before the paperwork floods in — one drawer per kind of thing (customers, products, orders), plus a cross-reference card (OrderItems) noting which products went into which order, with the price written down on the day so it never shifts under you.
27
Data Warehousing

What is the difference between a star schema and a snowflake schema?

Both are data-warehouse layouts with a central fact table (the measurements — sales, clicks) surrounded by dimension tables (the context — date, product, store).

- Star schema — dimensions are *flat* (denormalised). DimProduct holds the product *and* its category and brand, all in one table. Fewer joins → fast, simple queries.
- Snowflake schema — dimensions are *normalised* into sub-tables: DimProductDimCategoryDimDepartment. Less duplicated data, but more joins.

Example: a sales fact joined to a flat (star) product dimension:
``sql
SELECT p.category, SUM(f.amount)
FROM FactSales f
JOIN DimProduct p ON p.id = f.product_id -- category lives right here (star)
GROUP BY p.category;
``
Star is the common default for reporting (speed + simplicity); snowflake saves space and keeps large or shared dimensions tidy, at the cost of extra joins.

Why it exists:
Star and snowflake are the two dominant patterns for structuring data warehouse schemas. For QA engineers working on BI or reporting features, understanding these patterns is important for writing accurate SQL validation queries — star schemas require fewer joins, while snowflake schemas require joining through multiple dimension levels to get the same data, and forgetting a join gives you the wrong result.

Real-world QA use case:
A QA engineer is validating a sales report that aggregates by product category. The warehouse uses a snowflake schema: FactSales → DimProduct → DimCategory → DimDepartment. Her initial validation query only joins FactSales to DimProduct, which gives product-level results rather than category-level results. She adds the DimCategory join and gets the correct category-level aggregation — matching the report. The extra join step in the snowflake schema is what she nearly missed.

Rule of thumb: in a star schema, one join from the fact table gives you dimension attributes; in a snowflake, you need a chain of joins — always trace the full join path to the dimension level you need before writing a validation query.
💡 Plain English: A **star** schema is a plate with the main dish in the middle and all the sides in one ring around it — grab anything in a single reach. A **snowflake** adds smaller dishes branching off the sides — neater and less repetition, but you reach further (more joins) to get everything.
28
Design

What is the difference between a soft delete and a hard delete, and when would you use each?

- Hard delete — actually remove the row with DELETE. It's gone.
- Soft delete — keep the row but mark it inactive, usually with a deleted_at timestamp or an is_deleted flag. Queries then filter out the "deleted" rows.

Example (soft delete):
``sql
-- "Delete" the customer:
UPDATE Customers SET deleted_at = NOW() WHERE id = 7;

-- Every normal query must now exclude them:
SELECT * FROM Customers WHERE deleted_at IS NULL;
`
Use soft delete when you need history, undo, an audit trail, or to keep referential integrity (old orders still point to the customer). Use hard delete for genuinely disposable data, or to honour "delete my data" privacy requests (GDPR).

The catch with soft delete: every query must remember the
WHERE deleted_at IS NULL filter, and the table keeps growing.

Why it exists:
Soft delete is the standard pattern for features that need an audit trail, undo capability, or referential integrity preservation — but it introduces a subtle and dangerous bug class: queries that forget the
WHERE deleted_at IS NULL filter will silently include deleted records in their results. For QA engineers, soft delete tables require a specific test category: verify that every query, report, and list endpoint correctly excludes soft-deleted rows.

Real-world QA use case:
A QA engineer is testing a product catalogue that uses soft delete. She soft-deletes a product and then checks each area where products appear: the product list page (correctly hidden), the search results (correctly hidden), the order history report (correctly hidden), but the "popular products" widget still shows the deleted product. The widget query was added after the soft-delete pattern was established and its developer forgot the
WHERE deleted_at IS NULL filter. The QA engineer adds a regression test: soft delete a product and verify it's absent from all 6 surfaces.

Rule of thumb: when testing features with soft-deleted tables, always verify every query surface that reads from the table — the
WHERE deleted_at IS NULL` filter is easy to forget when adding new queries, and the bug is invisible unless specifically tested.
💡 Plain English: Soft delete is moving a file to the Recycle Bin — recoverable, still taking up space, you just don't see it normally. Hard delete is running it through the shredder. You bin things you might want back; you shred things that must truly be gone.
29
Performance

Why does OFFSET pagination get slow on large tables, and what is the better approach?

-- Orders(id PRIMARY KEY, ...)
The problem: LIMIT 10 OFFSET 100000 still makes the database *walk through and throw away* the first 100,000 rows before returning 10. The deeper the page, the slower it gets.

Better — keyset (a.k.a. cursor / seek) pagination. Remember the last row you saw, and ask for rows *after* it:
``sql
-- Page 1:
SELECT * FROM Orders ORDER BY id LIMIT 10;

-- Next page: pass the last id you saw (e.g. 4310)
SELECT * FROM Orders
WHERE id > 4310
ORDER BY id
LIMIT 10;
`
Because
id is indexed, the database jumps straight to the spot — page 10,000 is as fast as page 1.

Trade-off: keyset can't jump to an arbitrary page number ("go to page 500"); it's built for next/previous and infinite scroll. Plain
OFFSET` is fine for small, shallow result sets.

Why it exists:
Pagination performance is a silent killer — it works perfectly at page 1 and degrades invisibly at page 500. For QA engineers, OFFSET-based pagination is important to test with deep page numbers against production-sized datasets. The first 10 pages of results may load in 50ms while page 1,000 takes 8 seconds — a regression that only appears in performance testing, not functional testing.

Real-world QA use case:
A QA engineer is performance-testing a transaction history feature that uses LIMIT/OFFSET pagination. She tests pages 1, 10, 50, 100, and 500 and records the response times: p1=45ms, p10=48ms, p50=120ms, p100=380ms, p500=4200ms — clear linear degradation. She raises a performance defect with the evidence and recommends keyset pagination for the infinite-scroll use case. The fix reduces all page response times to under 50ms regardless of depth.

Rule of thumb: always performance-test pagination at deep page numbers (100+) against production-sized data — OFFSET pagination looks fast in functional testing but degrades linearly in production; keyset pagination maintains constant performance regardless of page depth.
💡 Plain English: OFFSET pagination is counting pages from the front of the book every single time to reach page 500 — slower the deeper you go. Keyset pagination is sticking a bookmark where you stopped and opening straight to it next time.
30
Distributed Systems

What is the difference between ACID and BASE?

They're two philosophies for how a system handles data correctness:

- ACID (traditional relational databases) — Atomic, Consistent, Isolated, Durable. Every transaction is all-or-nothing and the data is always in a valid, up-to-date state. Favours *correctness*.
- BASE (many NoSQL / distributed systems) — Basically Available, Soft state, Eventually consistent. The system stays available and fast, and accepts that different nodes may briefly disagree before *catching up*. Favours *availability and scale*.

This ties back to the CAP theorem: during a network split, ACID systems lean consistent (refuse stale answers), BASE systems lean available (answer now, reconcile later).

The trade-off in practice: a bank balance wants ACID — never show wrong money, even if you have to wait. A social-media "like" count is fine with BASE — if it briefly reads 1,001 instead of 1,002, nobody cares, and the system stays fast for millions of users.

Why it exists:
ACID vs BASE is a foundational architecture decision that determines what consistency guarantees users can expect. For QA engineers, this distinction shapes which test scenarios are valid: an ACID system should be tested for strict consistency under concurrent access; a BASE system should be tested for eventual consistency — data converges to the correct state eventually, but not necessarily immediately after a write.

Real-world QA use case:
A QA engineer is testing a distributed shopping cart backed by a BASE/eventual-consistency store. She writes and immediately reads back the cart — occasionally the read returns the pre-write state. She documents this as expected eventual consistency behaviour (not a bug) and agrees with the product team on the maximum convergence window. She adds a test that verifies the cart shows the correct state within 2 seconds of a write — the agreed SLA for eventual consistency in this feature.

Rule of thumb: before testing a distributed system, clarify whether it promises ACID consistency or BASE eventual consistency — the expected test outcomes are fundamentally different, and marking eventual-consistency behaviour as a bug is incorrect.
💡 Plain English: ACID is a careful accountant — the books must balance *exactly*, right now, even if that means making you wait. BASE is a busy tally on a whiteboard where everyone adds marks fast; the count might lag a few seconds across the office, but it never stops working and it sorts itself out.
31
Indexing

When should you NOT add an index?

Indexes speed up reads but cost you on writes and storage, so they aren't free. Avoid (or rethink) one when:

- The table is small — a full scan is already instant; the index just adds overhead.
- The column is low-cardinality — very few distinct values (a boolean is_active, a gender). The database often scans anyway because the index barely narrows things down.
- The table is write-heavy — every INSERT/UPDATE/DELETE must update every index, so lots of indexes slow writes.
- The column is rarely filtered on — an index no query uses is pure cost.

Example:
``sql
-- 'is_active' has only 2 values — an index on it is usually ignored:
CREATE INDEX idx_active ON Users(is_active); -- often a waste

-- A composite index aimed at a real query pattern is far more useful:
CREATE INDEX idx_active_signup ON Users(is_active, signup_date);
`
Rule of thumb: index for the queries you actually run, and measure — don't index "just in case."

Why it exists:
Over-indexing is a real problem in write-heavy systems — every unnecessary index slows down INSERT, UPDATE, and DELETE operations. For QA engineers, knowing when NOT to add an index prevents them from suggesting indexes in performance findings that would harm write throughput more than they help read throughput.

Real-world QA use case:
A QA engineer finds a slow query during performance testing and notices there's no index on the
is_active boolean column used in the WHERE clause. She recommends adding an index — but the developer checks the data: 92% of rows are active, so the index would be ignored anyway (the planner prefers a full scan when filtering removes fewer than ~10% of rows). Instead, they add a partial index: CREATE INDEX idx_inactive ON Users(id) WHERE is_active = false` — which is tiny, effective, and only serves the minority case that actually benefits from an index.

Rule of thumb: before recommending an index, check the column's cardinality — a boolean or low-cardinality column rarely benefits from a standard B-tree index; a partial index covering only the minority value is often the better solution.
💡 Plain English: An index is like adding a detailed table-of-contents to a document. For a 2-page memo it's pointless overhead; for a section that's just "yes/no" it barely helps you find anything; and if the document changes constantly, you're forever reprinting the contents page. Add one only where it genuinely speeds up finding things.
32
Performance

Production database CPU is at 100% and queries are timing out. Walk me through your investigation.

Don't restart first — investigate first. A restart hides the evidence.

Step 1 — Find what's running:
``sql
-- Postgres: active queries right now
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;

-- MySQL
SHOW PROCESSLIST;
`

Step 2 — Identify the heaviest queries:
`sql
-- Postgres: top queries by total CPU time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
`

Step 3 — Check for locks:
`sql
-- Postgres: who is blocked?
SELECT blocked.pid, blocking.pid AS blocking_pid, blocked.query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
`

Common causes in order of likelihood:
1. A new slow query (missing index after schema change or data growth).
2. A lock chain — one long-running transaction blocking many others.
3. A runaway loop in a stored procedure.
4. Sudden traffic spike — check application logs alongside DB logs.
5. Autovacuum storm on a write-heavy table (Postgres-specific).

Mitigate: kill the worst offending query (
pg_cancel_backend(pid)`) only if you understand what it is — not blindly.

Why it exists:
A database CPU spike is a production incident that requires systematic diagnosis, not guessing. For QA engineers, this scenario comes up in performance test post-mortems (why did the database saturate at X concurrent users?) and in production incident support. Knowing how to query pg_stat_activity, pg_stat_statements, and the lock views turns a "database is slow" complaint into an actionable, evidence-backed finding.

Real-world QA use case:
A QA engineer is running load tests and the database CPU hits 100% at 50 concurrent users. She connects to the database mid-test and queries pg_stat_activity — she finds 45 active connections all running the same report query simultaneously (the test is triggering a scheduled report on every user login). The report is doing a full table scan. She stops the test, logs the finding with the pg_stat_activity output as evidence, and recommends the report be moved to a background job rather than triggered per login.

Rule of thumb: when database CPU is high during load testing, query pg_stat_activity and pg_stat_statements immediately — the active query list and cumulative execution times tell you which specific query is the culprit within seconds.
💡 Plain English: A hospital A&E with all beds occupied and new patients still arriving. First thing: who are the patients taking the longest and why? Find the blocked beds (locks), identify who's been there longest (long queries), and triage — not turn everyone away at the door.
33
Schema Design

You need to migrate a 500-million-row table with zero downtime. How do you approach it?

A full table migration with a lock is not an option at this scale. The approach is a dual-write / shadow table migration:

Phase 1 — Create the new table
Create the target table (new schema, new name) alongside the existing one — no traffic impact yet.

Phase 2 — Backfill in batches
Copy rows in small chunks (e.g., 10,000 rows at a time), with a short sleep between batches to avoid overwhelming the database:
``sql
INSERT INTO orders_new SELECT * FROM orders WHERE id BETWEEN 1 AND 10000;
-- commit, sleep, next batch
``
This runs for hours or days without locking.

Phase 3 — Dual-write
Update the application to write to *both* old and new tables simultaneously. New rows land in both; backfill catches up to the cutover point.

Phase 4 — Verify
Run checksums or row counts to confirm the new table is complete and consistent.

Phase 5 — Cutover
Briefly pause writes (maintenance window of seconds, not hours), do a final top-up of any rows written since the last batch, then switch reads and writes entirely to the new table.

Phase 6 — Cleanup
Drop the old table once confirmed stable.

Tools like gh-ost (MySQL) or pg_repack (Postgres) automate this pattern.

Why it exists:
Zero-downtime migration of large tables is one of the most complex database operations a team faces. For QA engineers, understanding this pattern is important when reviewing migration plans before production releases — a migration that takes 2 hours with a table lock is not "zero downtime" regardless of how the ticket is labelled. QA engineers should verify that the migration approach is truly non-blocking before approving a release.

Real-world QA use case:
A QA engineer is reviewing a release plan that includes migrating a 300M-row Orders table to add a new shipping_region column. The plan says "zero downtime migration." She reviews the migration script: it uses a single ALTER TABLE statement. She flags this as a blocking issue — a single ALTER on 300M rows will lock the table for 30–60 minutes. The team rewrites the migration using the dual-write/batch-backfill approach, and the QA engineer verifies the new approach on a production-sized clone before approving the release.

Rule of thumb: verify "zero downtime" migration claims by reading the actual SQL script — a single ALTER TABLE on a large table is never zero downtime; the dual-write/batch-backfill pattern (or an online schema change tool) is required for tables over ~10M rows.
💡 Plain English: Replacing a motorway while keeping traffic flowing. You build the new lane alongside the old one, gradually move traffic over (dual-write), run both in parallel until you confirm no cars are on the old road, then close and demolish it. You never close the whole motorway at once.
34
Indexing

What is a covering index and how does it eliminate a table lookup?

A covering index is an index that contains every column a query needs — so the database can answer the query entirely from the index, without ever touching the actual table rows.

Normally a query uses an index to find *which rows* match, then does a separate table lookup (heap fetch / key lookup) to get the actual column values. This second step is expensive on large result sets.

A covering index eliminates that second step:

``sql
-- Query:
SELECT name, email FROM Customers WHERE status = 'active';

-- Standard index on status alone:
-- 1. Seek index to find matching rows → get row pointers
-- 2. Fetch each row from the table to get name and email ← extra I/O

-- Covering index includes name and email too:
CREATE INDEX idx_covering ON Customers(status) INCLUDE (name, email);
-- Now the index itself has status + name + email
-- Step 2 is eliminated entirely
`

INCLUDE (SQL Server, Postgres) adds columns to the index leaf pages without making them part of the sort key — so the index stays narrow for seeking, but wide enough for the query to be self-contained.

When to use: high-frequency queries on large tables where the table fetch is visibly expensive in the execution plan (look for "Key Lookup" or "Heap Fetches" in EXPLAIN).

Why it exists:
Covering indexes eliminate the most expensive part of index-based lookups at scale — the second round trip to the heap to fetch the actual row data. For QA engineers doing performance testing, identifying "Key Lookup" in an execution plan is a direct recommendation opportunity: adding INCLUDE columns to an existing index can dramatically improve high-frequency query performance with minimal schema change.

Real-world QA use case:
A QA engineer is analysing EXPLAIN output for a high-frequency "active user lookup" query. The plan shows: Index Scan on
idx_users_status → Key Lookup on the main table (to fetch name and email). At 1,000 requests/second, those 1,000 extra table lookups per second are measurable latency. She recommends adding INCLUDE (name, email)` to the existing status index — the query drops from 340ms to 85ms p99 under load because the Key Lookup step is eliminated entirely.

Rule of thumb: when EXPLAIN shows a "Key Lookup" or "Heap Fetch" step after an index scan on a high-frequency query, check whether a covering index (add SELECT columns to INCLUDE) would eliminate the second table access entirely.
💡 Plain English: A library catalogue card that also shows the full book summary, not just the shelf location. Normally you use the card to find the shelf, then walk there to read. A covering index is the card that has everything — you never leave the catalogue room.
35
Performance

EXPLAIN shows a table scan despite an index existing on the filtered column. What are the possible reasons?

This is a common senior interview question — there are several reasons the planner might ignore an index:

1. Low cardinality / poor selectivity
The column has very few distinct values (e.g. status = 'active' covers 90% of rows). The planner calculates that a full scan is faster than reading the index and then fetching 90% of the table anyway.

2. Stale statistics
The planner thinks the table has 1,000 rows (old stats) but it now has 50 million. Run ANALYZE (Postgres) or UPDATE STATISTICS (SQL Server) to refresh.

3. Non-sargable predicate — function on the column
``sql
WHERE UPPER(email) = 'USER@X.COM' -- index on email is bypassed
WHERE YEAR(created_at) = 2024 -- index on created_at is bypassed
`
Fix: rewrite the predicate so the column is bare, or create a functional index.

4. Small table
Below a certain row count, a full scan is faster than index overhead. The planner is correct to ignore the index here.

5. Wrong data type / implicit conversion
`sql
WHERE user_id = '123' -- user_id is INT; string comparison forces a cast
`
The implicit conversion prevents the index from being used.

6. Index is bloated or fragmented
Rebuild or reindex to recover efficiency.

Diagnosis: add an index hint temporarily to force the index and compare execution times — if forcing it is slower, the planner was right.

Why it exists:
"Index exists but isn't being used" is one of the most common and confusing EXPLAIN findings for QA engineers. Understanding the six reasons the planner ignores an index prevents incorrect diagnoses ("the index is broken") and incorrect recommendations ("add another index") when the real cause is stale statistics or a non-sargable predicate that is easily fixed.

Real-world QA use case:
A QA engineer finds a query doing a full scan on a 30M-row Users table despite an index on email. She works through the checklist: cardinality is high (emails are unique) ✓, statistics are current ✓, no function wrapping ✓ — she checks the data type and finds
email` is VARCHAR but the query comparison is against an INT (user_id accidentally passed to an email parameter). The implicit type cast prevents the index from being used. Correcting the parameter type restores the index seek and drops the query from 12 seconds to 2ms.

Rule of thumb: when an index is ignored, work through the six causes in order — cardinality, stale statistics, function on column, small table, wrong data type, fragmentation — the data-type mismatch case (implicit cast) is especially easy to miss.
💡 Plain English: A shortcut on your commute that you ignore when it's rush hour because the main road is actually faster despite the distance. The planner looked at the traffic (data distribution) and made the same rational call. Understanding *why* it chose the main road is more useful than forcing it back onto the shortcut.
36
Schema Design

How do you design a table to store hierarchical data — an org chart or category tree?

There are three main approaches, each with real trade-offs:

1. Adjacency List (simplest — one parent_id column)
``sql
CREATE TABLE Categories (
id INT PRIMARY KEY,
name VARCHAR(100),
parent_id INT REFERENCES Categories(id) -- NULL = root
);
`
Pros: simple to implement and update. Cons: traversing the whole tree requires recursive CTEs or application-side loops.

Querying the full tree (Postgres/SQL Server with recursive CTE):
`sql
WITH RECURSIVE tree AS (
SELECT id, name, parent_id, 0 AS depth
FROM Categories WHERE parent_id IS NULL -- start at root
UNION ALL
SELECT c.id, c.name, c.parent_id, t.depth + 1
FROM Categories c
JOIN tree t ON t.id = c.parent_id
)
SELECT * FROM tree ORDER BY depth;
`

2. Nested Set (fast reads, complex writes)
Each node stores left and right bounds. Any node whose bounds fall inside another's is a descendant. Reads are a simple range query; inserts/deletes require renumbering every node — painful for frequently-changing trees.

3. Closure Table (best for complex traversal)
A separate table stores every ancestor-descendant pair. Reads and traversal are simple joins; writes insert multiple rows.

Practical advice: use Adjacency List + recursive CTEs for most applications — it's maintainable and correct. Move to a Closure Table only if complex tree queries are frequent and performance is a proven problem.

Why it exists:
Hierarchical data structures appear in org charts, category trees, threaded comments, bill-of-materials, and permission structures. Each approach has different performance implications for reads vs. writes. For QA engineers, testing hierarchical features requires specific test cases that probe the tree's boundaries: the root node, leaf nodes, deep nesting, moving a node, and the maximum supported depth.

Real-world QA use case:
A QA engineer is testing a product category hierarchy using Adjacency List. She designs boundary tests: a root category with no parent (parent_id IS NULL) ✓, a leaf category with no children ✓, a category moved from one parent to another (the recursive CTE picks up the new path) ✓. She then tests for an infinite loop: a category accidentally set as its own parent (parent_id = id) — the recursive CTE query runs forever. She flags the missing cycle-detection guard as a defect and recommends adding a
WHERE depth < 20` safety limit to the recursive query.

Rule of thumb: when testing recursive CTE tree traversal, always include a cycle-detection test — a row pointing to itself or a circular parent chain will cause the recursive query to run indefinitely without a depth limit or cycle guard.
💡 Plain English: Storing a family tree. Adjacency List is writing each person's parent's name — simple, but finding all descendants of a great-grandmother takes multiple lookups. Nested Set is drawing boxes within boxes on a map — one scan finds all descendants, but moving someone means redrawing all the boxes.
37
Schema Design

How do you design a full audit trail system that records every INSERT, UPDATE, and DELETE on a critical table?

An audit trail must capture who changed what, when, and what the old value was. Two main approaches:

Approach 1 — Trigger-based audit table
``sql
CREATE TABLE Orders_Audit (
audit_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
operation CHAR(1), -- 'I', 'U', 'D'
changed_at TIMESTAMPTZ DEFAULT NOW(),
changed_by TEXT,
old_data JSONB, -- full row before change
new_data JSONB -- full row after change
);

CREATE OR REPLACE FUNCTION audit_orders() RETURNS TRIGGER AS $$
BEGIN
INSERT INTO Orders_Audit(operation, changed_by, old_data, new_data)
VALUES (
LEFT(TG_OP, 1),
current_user,
row_to_json(OLD),
row_to_json(NEW)
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_audit_orders
AFTER INSERT OR UPDATE OR DELETE ON Orders
FOR EACH ROW EXECUTE FUNCTION audit_orders();
``

Approach 2 — Application-level audit log
The application writes to an audit table explicitly before/after every change. More control, no trigger overhead, but easy to miss in code paths.

What a good audit record must contain:
- The operation (INSERT / UPDATE / DELETE).
- The timestamp (with time zone).
- Who did it (user ID, not just DB user).
- The old value and new value — not just a flag that it changed.
- The source (which application, which endpoint, which IP).

Compliance note: audit tables should be append-only — no application user should have UPDATE or DELETE rights on them.

Why it exists:
Audit trails are a compliance and forensics requirement for financial systems, healthcare applications, and any feature that modifies sensitive data. For QA engineers, audit logging requires its own dedicated test suite: verify that every INSERT, UPDATE, and DELETE is captured, that the old and new values are recorded correctly, that the audit trail cannot be modified by normal users, and that the who/when information is accurate.

Real-world QA use case:
A QA engineer is testing an audit trail on a Patient table in a healthcare application. She designs four test scenarios: (1) INSERT a patient — verify one audit row with operation='I' and the new_data field matching the inserted values; (2) UPDATE the patient's email — verify one audit row with operation='U', old_data showing the old email, new_data showing the new one; (3) DELETE the patient — verify one audit row with operation='D' and old_data containing the full pre-delete record; (4) attempt to DELETE from the audit table as a normal user — verify permission denied.

Rule of thumb: audit trail testing requires four test cases: INSERT audit capture, UPDATE audit capture (old AND new values), DELETE audit capture, and permission test confirming the audit table is append-only — all four must pass for the audit trail to be considered reliable.
💡 Plain English: A CCTV system for a bank vault. Every door open and close is recorded with a timestamp, who opened it, and what was inside before and after. The recording system itself must be tamper-proof — no one who uses the vault should be able to edit the footage.
38
Performance

How do you partition a large table and what types of partitioning are available?

Partitioning splits one large table into smaller physical pieces (partitions) while keeping the logical appearance of one table. The database can then skip entire partitions when querying — called partition pruning.

Types of partitioning:

1. Range partitioning (most common — usually by date)
``sql
CREATE TABLE Orders (order_date DATE, ...) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2023 PARTITION OF Orders FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF Orders FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
`
Query for 2024 data → only the 2024 partition is scanned.

2. List partitioning — partition by discrete values (region, country, status):
`sql
PARTITION BY LIST (region);
CREATE TABLE orders_eu PARTITION OF Orders FOR VALUES IN ('EU');
``

3. Hash partitioning — distribute rows evenly across N partitions by hashing a column. Good for balancing write load when there is no natural range or list key.

When partitioning helps:
- Tables over ~100GB where most queries filter on the partition key.
- Archiving: drop or detach old partitions instantly (no DELETE needed).
- Parallel query: each partition can be scanned by a separate worker.

When it doesn't help:
- Queries that don't filter on the partition key — all partitions are still scanned.
- Small tables — overhead outweighs benefit.

Why it exists:
Partitioning is a scaling technique that makes queries on large tables faster by physically separating data into smaller chunks and allowing the planner to skip irrelevant ones. For QA engineers, partitioned tables require additional boundary tests: verify that queries at the exact partition boundary (e.g., 2023-12-31 vs 2024-01-01) return correct results and are assigned to the correct partition, and that rows outside any defined partition boundary are rejected or handled correctly.

Real-world QA use case:
A QA engineer is testing a range-partitioned Orders table. She designs partition boundary tests: an order on 2023-12-31 (goes into orders_2023 partition ✓), an order on 2024-01-01 (goes into orders_2024 partition ✓), a query with WHERE order_date = '2023-12-31' (EXPLAIN confirms only the 2023 partition is scanned ✓), and an order with a date of NULL (no partition accepts NULL — insert fails with a partition exclusion violation). She flags the NULL date handling as a defect needing a default partition or NOT NULL constraint.

Rule of thumb: when testing partitioned tables, always test the partition boundary dates (the last day of one partition and the first day of the next), NULL values in the partition key, and verify EXPLAIN confirms partition pruning is actually working.
💡 Plain English: A filing cabinet with drawers labelled by year. To find a 2024 contract, you open only the 2024 drawer — not all 10 drawers. Partition pruning is the database opening only the relevant drawer instead of searching the whole cabinet.
39
Concurrency

A long-running query is blocking all other queries. How do you resolve it without taking down the system?

Identify the blocker first:
``sql
-- Postgres: find blocking sessions
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- See the lock chain:
SELECT blocked.pid, blocked.query, blocking.pid AS blocking_pid, blocking.query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
`

Options — in order of invasiveness:

1. Wait and monitor — if the query is nearly done, waiting is safer than killing it (a rollback of a large transaction can take as long as the transaction itself).

2. Cancel the query (soft kill) — sends a cancellation signal; the session stays alive:
`sql
SELECT pg_cancel_backend(pid); -- Postgres
KILL QUERY thread_id; -- MySQL
`

3. Terminate the session (hard kill) — use if cancel doesn't work:
`sql
SELECT pg_terminate_backend(pid); -- Postgres
KILL thread_id; -- MySQL (terminates session)
`

After resolution — root cause analysis:
- Was it a runaway query (missing WHERE, bad plan)?
- Was it expected but too long (large batch, report)?
- Add a statement_timeout to prevent this in future:
`sql
SET statement_timeout = '5min';
``
- For known long-running reports, run them off a read replica so they can't block the primary.

Why it exists:
A single long-running or blocked query can cascade — it holds locks, which block other queries, which pile up, eventually overwhelming the connection pool. For QA engineers doing load testing, designing a test that reproduces this blocking cascade (one long transaction, then many short queries that wait for it) is how you verify the system's tolerance for lock contention under realistic conditions.

Real-world QA use case:
A QA engineer is investigating why her load test shows random timeouts at 30% load — not consistent, but intermittent. She connects during the test and queries the lock chain view: a background analytics job is holding a table-level lock for 45 seconds, blocking dozens of read requests that queue up. When the analytics job finishes, the queued requests all try to execute simultaneously, overwhelming the connection pool. She recommends the analytics job run on a read replica with a statement_timeout of 30 seconds — and adds a statement_timeout test to the performance test suite.

Rule of thumb: when investigating intermittent timeouts during load tests, always check the lock chain in pg_stat_activity or the equivalent — a background long-running query that periodically blocks everything is the most common cause of intermittent (not consistent) load test failures.
💡 Plain English: One lorry broken down and blocking the whole motorway. You assess first — is it about to move? If not, call for a tow (cancel). If it's on fire, close the lane and remove it (terminate). Then investigate why the lorry was allowed on this road in the first place.
40
Security

How do you implement row-level security — different users see different rows of the same table?

Postgres — Row-Level Security (RLS):
``sql
-- Enable RLS on the table
ALTER TABLE Orders ENABLE ROW LEVEL SECURITY;

-- Policy: users can only see their own orders
CREATE POLICY user_own_orders ON Orders
USING (customer_id = current_setting('app.current_user_id')::INT);

-- Application sets the context before querying:
SET app.current_user_id = '42';
SELECT * FROM Orders; -- only returns orders for customer 42
`

SQL Server — Row-Level Security:
`sql
CREATE FUNCTION dbo.fn_security_predicate(@customer_id INT)
RETURNS TABLE AS RETURN
SELECT 1 AS result WHERE @customer_id = CAST(SESSION_CONTEXT(N'user_id') AS INT);

CREATE SECURITY POLICY OrderFilter
ADD FILTER PREDICATE dbo.fn_security_predicate(customer_id) ON dbo.Orders;
``

How it works: the policy is a filter that the database applies transparently on every SELECT, UPDATE, and DELETE — users physically cannot retrieve rows the policy blocks, even if they write the query correctly.

Key design points:
- The application must set the user context before any query — don't rely on application-layer filtering alone.
- Superusers / table owners bypass RLS by default — use FORCE ROW LEVEL SECURITY for full enforcement.
- Test thoroughly: try querying as a low-privilege user to confirm the policy works.

Why it exists:
Row-level security enforces data isolation at the database level — it prevents one tenant or user from ever seeing another's data, even through SQL injection, a misconfigured query, or a developer making a direct database connection. For QA engineers, RLS policies require security-specific test scenarios that verify the isolation guarantee holds under multiple access patterns.

Real-world QA use case:
A QA engineer is testing a multi-tenant SaaS application where each customer should only see their own orders. She designs four RLS test scenarios: (1) Customer A logs in — their orders visible, Customer B's orders absent; (2) Customer A tries a UNION attack to access Customer B's data via a crafted query — returns zero rows (RLS applies to all queries); (3) a developer connects directly to the database with the application role — RLS still applies; (4) a database superuser connects — RLS bypassed (expected — document this as a privileged access risk). She flags that FORCE ROW LEVEL SECURITY is not set, meaning the table owner can read all rows.

Rule of thumb: when testing RLS policies, always verify isolation from three connection types: application user (should be restricted), direct database user with application role (should still be restricted), and superuser (bypasses RLS — flag as a privileged access risk if FORCE ROW LEVEL SECURITY is not enabled).
💡 Plain English: A hospital records system where each doctor can only see their own patients' records. The filter isn't enforced by the application — the database itself applies it on every read. Even a rogue query can't pull records that belong to another doctor.
41
Internals

What is MVCC (Multi-Version Concurrency Control) and why does it mean readers don't block writers in Postgres?

MVCC keeps multiple versions of every row simultaneously — each transaction sees a *snapshot* of the data as it existed at the moment the transaction started, regardless of what other transactions are doing.

How it works:
- When a row is updated, Postgres doesn't overwrite it. It writes a *new version* of the row and marks the old version as "deleted after transaction X."
- A concurrent reader sees the old version (the snapshot from when their transaction began) — no lock needed.
- The writer writes the new version — also no lock needed.

The result: readers and writers work on different row versions simultaneously. Neither has to wait for the other.

``
Transaction A (UPDATE): writes new row version → committed at t=5
Transaction B (SELECT, started at t=3): reads old version → no blocking
``

The cost: old row versions accumulate — they must be cleaned up. This is what VACUUM does in Postgres. A table with very high write rates needs regular vacuuming or it bloats (dead row versions pile up and slow queries).

Why this matters to a DBA: long-running transactions hold snapshots that prevent VACUUM from cleaning up old versions. A 3-hour analytics query on a busy table can cause significant bloat.

Why it exists:
MVCC is the reason Postgres (and many other databases) can handle high concurrency without constant lock contention — readers and writers don't block each other because they work on different row versions. For QA engineers, MVCC explains two important test considerations: (1) long-running read transactions can cause table bloat if VACUUM is prevented from reclaiming dead rows; (2) under MVCC, a transaction sees a snapshot of the data at its start time, so "read-your-own-write" tests within the same transaction always work, but reading from a separate transaction immediately after a commit may see a slightly older snapshot.

Real-world QA use case:
A QA engineer is investigating growing table sizes on a write-heavy test database that runs 3-hour analytics tests. She queries pg_stat_activity and finds a long-running analytics query that has been open for 180 minutes — its snapshot is holding back VACUUM from cleaning dead rows, causing 40GB of table bloat. She adds a statement_timeout of 30 minutes to the analytics test connection and schedules a manual VACUUM after each analytics test run, eliminating the bloat.

Rule of thumb: in Postgres, long-running transactions (especially read-only analytics) prevent VACUUM from reclaiming dead rows — always set a statement_timeout on long-running test connections and run VACUUM after heavy write tests to prevent table bloat from accumulating.
💡 Plain English: A Google Doc with version history. When you edit, Google doesn't delete the old version — it creates a new one. Your colleague reading the document at the same moment sees the version from when they opened it, not your half-finished edits. No one has to wait. VACUUM is periodically archiving the old versions to keep the document history manageable.
42
Performance

You are asked to tune a legacy database with no documentation, no query history, and no one to ask. Where do you start?

Start by observing, not changing anything.

Step 1 — Understand what the system is doing:
``sql
-- Top queries by cumulative time (Postgres):
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;
`
This gives you the actual workload — not what you think the system does.

Step 2 — Find the expensive queries:
Run EXPLAIN ANALYZE on the top 5 offenders. Look for table scans, missing indexes, bad join orders, and stale statistics.

Step 3 — Understand the schema:
`sql
-- List tables by size:
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class WHERE relkind = 'r' ORDER BY pg_total_relation_size(oid) DESC;

-- Check existing indexes:
SELECT * FROM pg_indexes WHERE tablename = 'your_table';
`

Step 4 — Check index usage:
`sql
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes ORDER BY idx_scan;
``
Indexes with zero or near-zero scans are unused — they cost write overhead for no benefit.

Step 5 — Make one change at a time. Add one index, measure, confirm improvement before the next change.

What not to do: don't add indexes everywhere speculatively, don't rebuild everything, and don't change isolation levels without understanding concurrency patterns first.

Why it exists:
Tuning an undocumented legacy system is a real scenario QA engineers face when brought into existing applications to establish a performance baseline. The key principle — observe before changing — applies equally to performance investigation and quality assessment: you need to understand what the system actually does before you can assess whether it does it correctly or efficiently.

Real-world QA use case:
A QA engineer joins a legacy banking application with no test documentation. She applies the same observe-first approach: she starts by querying pg_stat_statements for the top 20 queries by total execution time to understand the actual workload. The top query (a balance inquiry) runs 50,000 times per day and takes 800ms each — 11 CPU-hours per day on one query. She identifies the missing index, adds it, and documents a 95% performance improvement. The methodology — measure first, fix the proven bottleneck, verify improvement — is documented as the performance testing baseline for the project.

Rule of thumb: when joining a new project for performance assessment, start by reading pg_stat_statements sorted by total_exec_time — this shows you the actual workload, not what the codebase looks like it does, and it immediately identifies the highest-impact optimisation targets.
💡 Plain English: Taking over a messy workshop with no manual. You don't start reorganising immediately — you watch how work actually flows through the space for a week, identify the three most painful bottlenecks, fix the worst one first, and confirm it helped before touching anything else.
43
Schema Design

What is the difference between a regular view and a materialized view, and when do you choose each?

Regular view: a saved query with no stored data. Every time you query it, the underlying SELECT runs fresh. Up to date always; no extra storage; can be as slow as the underlying query.

Materialized view: the result set is computed once and stored physically on disk, like a snapshot. Queries against it are fast (reading a table). But it goes stale — you must explicitly refresh it.

``sql
-- Postgres materialized view:
CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS revenue
FROM Orders GROUP BY 1;

-- Refresh when data changes:
REFRESH MATERIALIZED VIEW monthly_revenue;

-- Refresh without blocking reads (Postgres):
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue;
``

Choose a regular view when:
- Data must be real-time.
- The underlying query is fast.
- Freshness is more important than query speed.

Choose a materialized view when:
- The query is expensive (aggregations over millions of rows).
- Slight staleness is acceptable (e.g. a dashboard refreshed every hour).
- The same expensive result is needed by many queries.

Why it exists:
Choosing between a regular and a materialized view is an architectural decision with real testing implications — the right choice depends on whether users need real-time data or can tolerate a refresh window. For QA engineers, the key testing consideration for materialized views is verifying the refresh mechanism works correctly and that the staleness window is communicated to users.

Real-world QA use case:
A QA engineer is testing a monthly revenue dashboard backed by a materialized view refreshed hourly. She designs four test scenarios: (1) data added since the last refresh is NOT visible (expected staleness behaviour) ✓; (2) after a manual refresh, new data IS visible ✓; (3) the refresh job fails silently — the view shows data from 3 hours ago with no warning to the user ✗ (defect: a "last updated" timestamp should be displayed); (4) REFRESH MATERIALIZED VIEW CONCURRENTLY allows reads during refresh ✓. She raises the missing staleness indicator as a separate defect.

Rule of thumb: when testing materialized views, always test three scenarios: staleness (changes not visible until refresh), successful refresh (changes visible after), and failed/delayed refresh (users should see a "last updated" timestamp, not stale data without warning).
💡 Plain English: A regular view is a live camera feed — always current but requires the camera to be running. A materialized view is a photograph — fast to look at, but it shows the scene as it was when the photo was taken, not right now. Refresh it to take a new photo.
44
Incident Management

Your database log file (WAL / transaction log) has filled the disk and the database is now refusing writes. What do you do?

This is a production outage — act fast and methodically:

Immediate containment:
1. Do not delete log files blindly — transaction logs are needed for recovery. Deleting the wrong file can corrupt the database.
2. Confirm the cause — is it a log backup that hasn't run, a replication slot holding back WAL, or a genuine runaway log growth?

Postgres — check for stale replication slots:
``sql
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
FROM pg_replication_slots;
`
A stale slot prevents WAL from being cleaned up. Drop the stale slot if no replica is consuming it:
`sql
SELECT pg_drop_replication_slot('stale_slot_name');
`

SQL Server — force a log backup:
`sql
BACKUP LOG MyDatabase TO DISK = 'NUL'; -- emergency truncation (dev/staging only)
-- In production: backup to a real location, then shrink
``

General — free space quickly:
- Add disk space / extend the volume if infrastructure allows.
- Clean up other large files on the same volume to buy time.
- Run a log backup to a remote location to truncate the log.

After recovery — root cause:
- Why wasn't the log backup running? (Check the backup job.)
- Was a replication consumer silently failing?
- Set up disk usage alerts at 70% and 85% so this is never a surprise.

Why it exists:
WAL/transaction log disk-full is a class of production incident that causes a complete write outage — not gradual degradation. For QA engineers, this scenario is relevant when designing disaster recovery tests and validating monitoring coverage. A full log disk is entirely preventable with proper monitoring, making it a test of alerting and operational procedures as much as a test of recovery steps.

Real-world QA use case:
A QA engineer is designing a disaster recovery test plan for a production database. She includes a WAL overflow test: she creates a stale replication slot that holds back WAL, then monitors disk usage until it reaches 80%. She verifies that the alert fires at 70% (expected) and then simulates the resolution — dropping the stale slot and confirming WAL cleanup runs. She also verifies that dropping the slot doesn't cause data loss by confirming all downstream consumers are reconnected before the slot is dropped.

Rule of thumb: a full transaction log means writes stop immediately — test your disk usage alerts before the incident, not during it; the alert at 70% gives you time to respond before the 100% outage, which is the entire point of monitoring.
💡 Plain English: A filing room where someone forgot to clear the filing trays — they fill up and new documents can't come in. You don't burn the room down. You identify the clog (stale replication slot, missed backup), clear the backlog carefully, and set up an alarm so the trays never silently fill again.
45
Architecture

What is connection pooling, why does it matter at scale, and how do you configure it correctly?

Opening a database connection is expensive — it involves authentication, memory allocation, and process/thread creation. At scale, with hundreds of web workers each opening their own connections, this becomes the bottleneck.

Connection pooling maintains a pool of pre-opened, reusable connections. Application threads borrow a connection from the pool, use it, and return it — no open/close overhead per request.

Without pooling at scale:
- Each web request opens a new connection → database thread limit hit (Postgres default: 100).
- Connection overhead dominates response time.
- Under sudden traffic spikes, the DB crashes from connection storms.

Recommended tools:
- PgBouncer (Postgres) — transaction-mode pooling: one connection reused across many clients.
- HikariCP (Java), node-postgres pool (Node.js) — application-level pools.

Key configuration to get right:
- Pool size — not "as large as possible." Optimal is usually ≈ (num_cpu_cores * 2) + effective_spindle_count per Postgres documentation. Too large causes context-switching overhead.
- Max idle time — return idle connections to the pool before the database kills them.
- Connection timeout — fail fast if the pool is exhausted rather than queuing forever.
- Health checks — pools must test connections before lending them (detect stale connections after a DB restart).

Why it exists:
Connection pooling is a prerequisite for running production-scale load tests — without it, your load test will hit the database's connection limit rather than testing the actual application behaviour. For QA engineers, connection pool configuration is an important load test variable: a pool that's too small creates a bottleneck under load, a pool that's too large overwhelms the database with context switching.

Real-world QA use case:
A QA engineer is load-testing a Node.js API. At 100 concurrent users, response times spike to 8 seconds and the database logs show "too many connections" errors. She checks the connection pool configuration: it's set to a maximum of 5 connections for the entire Node.js process. She increases it to 20 (matching the recommended formula for the CPU count) and re-runs the test — response times drop to under 200ms. She also tests the health-check behaviour: she restarts the database mid-test and verifies the pool automatically re-establishes connections rather than serving stale connection errors to users.

Rule of thumb: size your connection pool to approximately (CPU cores × 2) + effective_spindle_count for Postgres — larger pools cause context-switching overhead that makes performance worse, not better; test pool behaviour explicitly by restarting the database mid-load-test.
💡 Plain English: A taxi rank instead of hailing a new cab from scratch every trip. Cars wait at the rank (the pool). You jump in an available one, use it, and return it to the rank. No waiting for a cab to be manufactured each time — just borrowing one that's already ready.
46
Testing

How do you verify a database migration script is safe before running it on production?

A bad migration on a production database with 500M rows can mean hours of downtime or permanent data loss. Verify it rigorously:

1. Run on a production-sized clone first
Test on a copy of production (same row counts, same data distribution) — not on a 1,000-row dev database. Performance and lock behaviour are completely different at scale.

2. Time the migration
Run EXPLAIN ANALYZE or time the actual execution. A migration that takes 2 seconds on dev may take 45 minutes on prod and lock the table the entire time.

3. Verify correctness
Before and after row counts, checksums on key columns:
``sql
-- Before:
SELECT COUNT(*), SUM(amount), MAX(updated_at) FROM Orders;

-- Run migration

-- After: same query should show expected changes only
`

4. Test the rollback script
Every migration must have a verified rollback. Run the rollback on the clone and confirm it returns the schema and data to the original state.

5. Check for table locks
Some DDL operations (adding a NOT NULL column, changing a column type) lock the whole table. Use online schema change tools (pt-online-schema-change, gh-ost, pg_repack) for tables you can't afford to lock.

6. Run in a transaction where possible
`sql
BEGIN;
-- migration steps
-- verify: SELECT COUNT(*), spot-check data
ROLLBACK; -- dry run — confirm, then COMMIT on the real run
``

Why it exists:
Database migrations are permanent, shared-state changes — a bad migration can corrupt data, lock tables for hours, or require a backup restore to undo. Verifying migrations before production deployment is QA's formal role in the release process: every migration should have documented evidence of testing before a QA engineer signs off.

Real-world QA use case:
A QA engineer is assigned to validate a migration before a production release. She runs it on a production-sized clone and times it: 8 minutes — acceptable. She checks row counts before and after against expected values. She runs the rollback script and confirms the schema returns to the original state. She submits a signed-off migration test evidence document as a release gate artefact. Three months later, the team tries to skip this process to save time — she refers back to the documented process, noting that the last time it was skipped a migration locked the Orders table for 22 minutes during peak traffic.

Rule of thumb: every database migration requires three verified checks before production: (1) timing on a production-sized clone, (2) data correctness before/after, and (3) rollback tested — a migration with no tested rollback has no safe recovery path if something goes wrong.
💡 Plain English: A surgeon rehearsing a complex procedure on a simulation model before the real patient. The simulation has the same anatomy, the same constraints, and the same risks — what takes 5 minutes on a mannequin and 5 hours on a real patient is worth knowing before you pick up the scalpel.
47
Best Practices

How do you write SQL that stays readable and maintainable as the codebase grows?

SQL that works on day one becomes a maintenance nightmare two years later without intentional style. The principles that survive in real teams:

1. CTEs over nested subqueries
``sql
-- Unreadable:
SELECT * FROM (SELECT * FROM (SELECT ...) a WHERE ...) b WHERE ...;

-- Readable:
WITH active_users AS (...),
recent_orders AS (...)
SELECT ... FROM active_users JOIN recent_orders ...;
`

2. One column per line, aligned
`sql
SELECT
customer_id,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count
FROM Orders
WHERE status = 'completed'
GROUP BY customer_id;
`

3. Name everything meaningfully
Avoid single-letter aliases for non-obvious tables (
o, c is fine; a, b is not). Name CTEs after what they represent.

4. Comment the why, not the what
`sql
-- Exclude test accounts from all revenue reports
WHERE email NOT LIKE '%@internal.company.com'
``

5. Avoid SELECT *
List explicit columns — implicit columns break when the schema changes.

6. Keep transactions short and explicit
Wrap every multi-statement write in an explicit BEGIN / COMMIT so intent is clear.

7. Store complex queries in version-controlled files
Not in application strings, not in stored procedures nobody can find. Version-controlled SQL files can be reviewed, diffed, and tested like any other code.

Why it exists:
SQL maintainability directly affects QA's ability to write and maintain data validation queries. When production queries are buried in application string literals or undocumented stored procedures, QA engineers can't easily reproduce or validate the business logic. Well-structured SQL — with named CTEs, explicit column lists, and version control — is as important to test quality as well-structured application code.

Real-world QA use case:
A QA engineer is writing data validation queries for a quarterly audit. She finds that the production reports use nested subqueries five levels deep with single-letter aliases — impossible to read and validate. She refactors the report query using named CTEs (which the developer accepts as a maintenance improvement) and then uses the refactored version as her validation reference. The refactoring also surfaces a silent bug: one level of the original nesting was filtering on the wrong column, which had been invisible in the dense formatting.

Rule of thumb: when writing SQL for data validation, use named CTEs and explicit column aliases — the extra clarity makes the validation query itself reviewable and reduces the chance of a silent logic error in your own verification code.
💡 Plain English: Well-written SQL is like well-written legal language — precise, structured, and named so that anyone reading it a year later understands the intent without asking the author. Ambiguity and abbreviations that save 30 seconds today cost hours of confusion next year.
48
Security

A developer asks for read access to the database. How do you grant access to specific tables only, not the whole database?

Use principle of least privilege — grant only what is needed, nothing more.

Postgres:
``sql
-- Create a role for this developer
CREATE ROLE dev_readonly;

-- Grant CONNECT to the specific database
GRANT CONNECT ON DATABASE myapp TO dev_readonly;

-- Grant USAGE on the schema (required before table access)
GRANT USAGE ON SCHEMA public TO dev_readonly;

-- Grant SELECT on specific tables only
GRANT SELECT ON orders, customers TO dev_readonly;

-- NOT this — too broad:
-- GRANT ALL ON ALL TABLES IN SCHEMA public TO dev_readonly;

-- Create the user and assign the role
CREATE USER alice WITH PASSWORD 'secure_password';
GRANT dev_readonly TO alice;
`

Revoke when no longer needed:
`sql
REVOKE dev_readonly FROM alice;
DROP USER alice;
``

Best practices:
- Use roles, not per-user grants — roles scale to teams.
- Never grant on production directly. Prefer a read replica for developer queries — it can't lock the primary and can be revoked without touching production permissions.
- Audit who has access regularly and revoke stale accounts promptly.
- Never share the superuser / sa account for developer access.

Why it exists:
Database access control is a security requirement that QA engineers should verify as part of security testing. Overly broad permissions — granting ALL to all tables, sharing superuser credentials, or forgetting to revoke access when team members leave — are common findings in security audits. QA engineers who understand grant/revoke can write automated permission-verification tests that run as part of the security test suite.

Real-world QA use case:
A QA engineer is performing a security review for a production database. She discovers that the application uses the superuser account for all database connections — not a read/write role with minimal permissions. She also finds that a former contractor's database user still exists with full table access, 6 months after they left the project. She raises both findings as high-severity security defects and implements an automated test that queries pg_roles and verifies no application connection uses a superuser account, and no accounts have been idle for over 90 days.

Rule of thumb: verify database access permissions as part of security testing — check that the application role has only the permissions it needs, that no application uses superuser credentials, and that all accounts are actively reviewed and revoked when no longer needed.
💡 Plain English: Giving a contractor a key card to the building. You don't give them the master key — you give them a key that opens the two rooms they need and nothing else, with an expiry date. If they leave, you deactivate that key, not the whole building's security system.
49
Performance

How do you identify and resolve index fragmentation on a production database?

Index fragmentation builds up over time as INSERT, UPDATE, and DELETE operations scatter index pages out of logical order. A fragmented index takes more I/O to traverse.

Postgres — fragmentation via page bloat:
Postgres doesn't have traditional fragmentation but suffers from bloat (dead rows and empty pages). Check with:
``sql
SELECT indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
`
Fix with: REINDEX CONCURRENTLY (Postgres 12+) — rebuilds the index without locking:
`sql
REINDEX INDEX CONCURRENTLY idx_orders_customer;
`

SQL Server — check fragmentation:
`sql
SELECT object_name(object_id) AS table_name,
name AS index_name,
avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED')
JOIN sys.indexes ON object_id AND index_id
WHERE avg_fragmentation_in_percent > 10
ORDER BY avg_fragmentation_in_percent DESC;
``
- < 10% fragmentation — leave it.
- 10–30% — REORGANIZE (online, low impact).
- > 30% — REBUILD (offline by default; use ONLINE = ON for minimal locking).

Rule: run fragmentation checks and maintenance as a scheduled job, not reactively.

Why it exists:
Index fragmentation builds up silently over time in high-write tables — the index still works, but it takes progressively more I/O to traverse. Performance degrades gradually rather than suddenly, making fragmentation invisible without proactive monitoring. For QA engineers, this is important context for interpreting performance test trends: if p95 latency is creeping up week-over-week on a write-heavy table, fragmentation (or table bloat in Postgres) may be the cause rather than new code changes.

Real-world QA use case:
A QA engineer notices that her weekly performance regression test on an Orders search endpoint is creeping up by 5–8ms each week — too slow to be new code and too consistent to be noise. She investigates index health on the Orders table and finds 45% fragmentation on the primary search index after 3 months of heavy writes in the test environment. A REINDEX CONCURRENTLY drops the endpoint back to its original baseline. She adds index health monitoring to the performance test dashboard so future fragmentation is caught at 30% rather than 45%.

Rule of thumb: when performance test baselines gradually degrade over weeks without code changes, check index fragmentation — a rebuild or reorganise often restores baseline performance, and scheduled maintenance prevents the creep from recurring.
💡 Plain English: A filing cabinet where new files get inserted wherever there's space, not in alphabetical order. After months of this, finding a file means skipping around the drawer. Rebuilding the index is reorganising the cabinet back into alphabetical order — fast lookups restored.
50
Architecture

How do you approach database capacity planning for a system that is growing rapidly?

Capacity planning is about knowing *when* you'll hit a limit — not finding out when the system falls over.

What to measure continuously:
- Storage growth rate — GB per week. Project forward 6 and 12 months. When do you need more storage?
- Query response time trends — are p95 and p99 latencies creeping up? Which queries?
- Connection usage — what percentage of max connections are typically in use?
- CPU and I/O utilisation — sustained above 70% is a warning sign.
- Table and index sizes — which tables are growing fastest?

Tooling:
``sql
-- Postgres: track table growth over time
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC LIMIT 20;
``
Store this result daily in a metrics table and chart the trend.

Planning decisions driven by the data:
- Archiving strategy — at what table size do queries degrade? Archive old data before you hit that threshold, not after.
- Read replicas — add before read latency becomes a problem, not after users complain.
- Partitioning — plan before a table becomes so large that partitioning it causes downtime.
- Hardware / cloud tier upgrades — plan 2–3 months ahead, not the day you need it.

The rule: measure everything, graph it, set alerts at 70% of each limit, and act at 80%.

Why it exists:
Capacity planning is about knowing when you'll hit a limit before you hit it — not discovering the limit when the system falls over. For QA engineers running performance tests, capacity planning provides the context to interpret results: a query that uses 60% of available IOPS on the test database tells you it will saturate production in 3 months if current growth continues, even if it passes today's performance criteria.

Real-world QA use case:
A QA engineer notices during performance testing that the Orders table is growing at 2GB per week. She projects forward: at current growth, the table will hit 1TB in 10 months. She queries the current p95 query time at 50GB (current size) and estimates it against known performance at 1TB based on test results with a scaled dataset. She presents the projection to the engineering team: without partitioning or archiving by month 8, query performance will degrade below the SLA. The team adds partitioning to the roadmap — 2 months ahead of when it would have become urgent.

Rule of thumb: during performance testing, measure not just current performance but growth rate — table size, query latency trends, and connection usage trends tell you when you'll hit limits, which is more valuable than knowing current headroom.
💡 Plain English: Managing fuel for a long road trip. You don't wait until the light comes on — you check the gauge regularly, know how far each tank gets you, and plan the refuel stop before you need it. Capacity planning is the same: measure, project, act early.
Want to master SQL Sorcery?
QAVeda has 200+ structured lessons, practice tests, skill assessments and certificates — all gamified with XP, badges and ranks.
Start Learning on QAVeda →