Optimizing MySQL Queries for Finding Top Earners by Department (leetcode : 184. Department Highest Salary)

Optimizing MySQL Queries for Finding Top Earners by Department

Optimizing MySQL Queries for Finding Top Earners by Department

In this blog post, we’ll explore how to write and optimize a MySQL query to find employees with the highest salary in each department. This is a common problem in database querying, often encountered in interviews and real-world applications. We’ll compare two solutions: one using a correlated subquery and an optimized version using a Common Table Expression (CTE). We’ll explain the approach, logic, time complexity, and provide the complete SQL code for both.

Problem Statement

Given two tables:

  • Employee: Contains employee details with columns id, name, salary, and departmentId.
  • Department: Contains department details with columns id and name.

We need to find each employee who earns the highest salary in their department, returning the department name, employee name, and salary. If multiple employees share the highest salary in a department, all should be included.

Solution 1: Using a Correlated Subquery

Approach

The first solution uses a correlated subquery to identify the maximum salary for each employee’s department and filters for employees whose salaries match this maximum.

  1. Join Tables: Join Employee and Department on departmentId = id to associate employees with department names.
  2. Correlated Subquery: For each employee, compute the maximum salary in their department using a subquery referencing the employee’s departmentId.
  3. Filter: Select employees whose salary equals their department’s maximum salary.

SQL Query

SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary
FROM Employee e 
JOIN Department d 
ON e.departmentId = d.id 
WHERE e.salary = (
    SELECT MAX(salary)
    FROM Employee
    WHERE departmentId = e.departmentId
);
    

Explanation

  • Join: The JOIN links each employee to their department using e.departmentId = d.id.
  • Subquery: For each row in Employee, the subquery SELECT MAX(salary) FROM Employee WHERE departmentId = e.departmentId finds the maximum salary for that department. It’s correlated because it uses e.departmentId from the outer query.
  • Filter: The WHERE e.salary = (...) ensures only employees with the maximum salary are selected.
  • Output: Returns d.name (department name), e.name (employee name), and e.salary (salary).

Time Complexity

Let’s define:

  • n: Number of rows in Employee.
  • m: Number of rows in Department.
  • k: Average number of employees per department (n ≈ k * m).
  1. Join: With indexes on Employee.departmentId and Department.id (primary key), a hash join costs O(n + m). A nested loop join with an index on d.id costs O(n log m).
  2. Correlated Subquery: Runs for each of n rows in Employee.
    • Filter departmentId = e.departmentId: With an index, costs O(log n + k) (index lookup plus scanning k rows).
    • Compute MAX(salary) over k rows: O(k).
    • Total per subquery: O(log n + k).
    • Across n rows: O(n * (log n + k)).
  3. Filtering/Output: Comparing salaries and selecting columns is O(n).

Total: O(n * (log n + k)).

  • If k is small, approximates O(n log n).
  • If k is large (k ≈ n/m), becomes O(n²/m).
  • Without an index on departmentId, subquery costs O(n) each, leading to O(n²).

Drawbacks

  • Performance: The subquery executes n times, inefficient for large Employee tables.
  • Scalability: Performance degrades as n or k grows.
  • Readability: Nested subquery is harder to follow for complex scenarios.

Solution 2: Using a Common Table Expression (CTE)

Approach

To improve performance, we precompute the maximum salary per department using a CTE and join it with Employee and Department.

  1. Precompute Maximum Salaries: Use a CTE to calculate the maximum salary for each department with GROUP BY.
  2. Join with Employee: Join the CTE with Employee to find employees matching the maximum salary.
  3. Join with Department: Join with Department for department names.
  4. Output: Select the required columns.

SQL Query

WITH MaxSalaries AS (
    SELECT departmentId, MAX(salary) AS max_salary
    FROM Employee
    GROUP BY departmentId
)
SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary
FROM Employee e
JOIN MaxSalaries ms ON e.departmentId = ms.departmentId AND e.salary = ms.max_salary
JOIN Department d ON e.departmentId = d.id;
    

Explanation

  • CTE (MaxSalaries): The WITH MaxSalaries AS (...) creates a temporary result set with columns departmentId and max_salary. The subquery computes the maximum salary per department, producing one row per department.
  • Joins:
    • JOIN MaxSalaries ms ON e.departmentId = ms.departmentId AND e.salary = ms.max_salary: Matches employees to their department’s maximum salary.
    • JOIN Department d ON e.departmentId = d.id: Links employees to department names.
  • Output: Returns d.name, e.name, and e.salary for employees with the maximum salary.
  • Handling Ties: Includes all employees whose salaries match the maximum.

Time Complexity

  1. CTE (MaxSalaries):
    • Scan Employee: O(n).
    • GROUP BY departmentId: With an index, streaming costs O(n); sorting costs O(n log n).
    • Compute MAX(salary) for k rows per department, across m departments: O(m * k) = O(n).
    • Total: O(n) (streaming) or O(n log n) (sorting).
    • Output: m rows.
  2. Joins:
    • Employee JOIN MaxSalaries: With indexes on Employee(departmentId, salary), hash join costs O(n + m). Index-based join costs O(m log n) for r matches per department (r ≤ k).
    • Join with Department: For n’ result rows (n’ ≤ n), lookup in Department costs O(n’ log m).
    • Total: O(n + m) (hash join) or O(m log n + n’ log m).
  3. Output: O(n’), where n’ is small.

Total: O(n log n) (sorting) or O(n) (streaming with hash joins).

Advantages

  • Performance: Computes maximum salaries once, avoiding n subquery executions.
  • Readability: CTE separates logic, improving clarity.
  • Scalability: Scales better with large datasets.
  • Flexibility: CTE can be extended for additional aggregates.

Requirements

CTEs require MySQL 8.0+. For older versions, use a derived table:

SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary
FROM Employee e
JOIN (
    SELECT departmentId, MAX(salary) AS max_salary
    FROM Employee
    GROUP BY departmentId
) ms ON e.departmentId = ms.departmentId AND e.salary = ms.max_salary
JOIN Department d ON e.departmentId = d.id;
    

Comparison

Aspect Solution 1 (Correlated Subquery) Solution 2 (CTE)
Time Complexity O(n * (log n + k)) O(n log n) or O(n)
Readability Moderate (nested subquery) High (clear, modular CTE)
Performance Poor for large n or k Efficient, single scan
Scalability Limited Scales well
MySQL Version Any 8.0+ (or derived table)

Optimization Tips

  1. Indexes: Create a composite index on Employee(departmentId, salary):
    CREATE INDEX idx_emp_dept_salary ON Employee(departmentId, salary);
  2. Execution Plan: Use EXPLAIN to verify index usage:
    EXPLAIN SELECT ...;
  3. Test with Data: Compare performance with large datasets.
  4. Edge Cases:
    • Empty tables: Returns no rows.
    • No employees in a department: Excludes such departments.
    • Ties: Includes all maximum-salary employees.

Conclusion

The correlated subquery is intuitive but inefficient for large datasets due to repeated executions. The CTE-based solution precomputes maximum salaries, reducing scans and improving performance. It’s also more readable and maintainable. With proper indexes and MySQL 8.0+, the CTE approach is ideal. Run these queries, use EXPLAIN, and optimize for your data. Happy querying!

Additional Resources

For another perspective on solving this problem, check out this LeetCode solution by Bikash. It provides an easy-to-understand approach that complements the solutions discussed here.

Comments

Popular posts from this blog

3Sum Closest: O(n²) Two-Pointer Magic šŸš€ leetcode: 16

Kadane's Algorithm: Maximum Subarray Problem - LeetCode(53) Solution

14. Longest Common Prefix