The Employee
table holds all employees. The employee table has three columns: Employee Id, Company Name, and Salary.
+-----+------------+--------+ |Id | Company | Salary | +-----+------------+--------+ |1 | A | 2341 | |2 | A | 341 | |3 | A | 15 | |4 | A | 15314 | |5 | A | 451 | |6 | A | 513 | |7 | B | 15 | |8 | B | 13 | |9 | B | 1154 | |10 | B | 1345 | |11 | B | 1221 | |12 | B | 234 | |13 | C | 2345 | |14 | C | 2645 | |15 | C | 2645 | |16 | C | 2652 | |17 | C | 65 | +-----+------------+--------+
Write a SQL query to find the median salary of each company. Bonus points if you can solve it without using any built-in SQL functions.
+-----+------------+--------+ |Id | Company | Salary | +-----+------------+--------+ |5 | A | 451 | |6 | A | 513 | |12 | B | 234 | |9 | B | 1154 | |14 | C | 2645 | +-----+------------+--------+
Intuition
By the definition of median, the count of the bigger numbers than itself should be equal to the count of the smaller ones in an odd array.
Algorithm
Take array [1,3,2] for example, is the first number 1 the median? No, because this array only have 3 elements but there are 2 of them (3, 2) are greater than 1. To continue, we know 3 is not the median as well since there are 2 elements smaller. But for the last element 2, there are equal amount of bigger and smaller numbers. So it is the median in this array!
What if an array has even amount of distinct values, the median is the average of the middle two elements next to each other after sorting this array. It is not hard to understand that for either of these two elements, the difference (absolute value) of its bigger and smaller number than itself in this array is 1, which is the exactly frequency of a element in the distinct array.
So in general, the median's frequency should be equal or grater than the absolute difference of its bigger elements and small ones in an array no matter whether it has odd or even amount of numbers and whether they are distinct. This rule is the key, and it is represented as the following code.
SUM(CASE WHEN Employee.Salary = alias.Salary THEN 1 ELSE 0 END) >= ABS(SUM(SIGN(Employee.Salary - alias.Salary)))
Thus, this approach is as following in MySQL code.
MySQL
SELECT Employee.Id, Employee.Company, Employee.Salary FROM Employee, Employee alias WHERE Employee.Company = alias.Company GROUP BY Employee.Company , Employee.Salary HAVING SUM(CASE WHEN Employee.Salary = alias.Salary THEN 1 ELSE 0 END) >= ABS(SUM(SIGN(Employee.Salary - alias.Salary))) ORDER BY Employee.Id ;
Note: In MySQL 5.6, this code runs fine, but if you are using MySQL 5.7+, please use
ANY_VALUE(Employee.Id)
instead ofEmployee.Id
in the SELECT statement. Otherwise, you may encouter the following error message: Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'Employee.Id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by. For more details on how to userANY_VALUE(arg)
, please refer to this link.
Intuition
In general, we can just pick the middle one(s) to get the median if the records are ranked by salary. But how can we get them sorted particularly MySQL does not have the build-in rank function, and these are several companies in this case.
Algorithm
By adding a virtual column to simulate the ranking, we can sort these records by salary and pick up the middle one(s). Here we need to use the session variable to achieve this goal.
This approach is more efficient than the first one since it does not calculate all the salary one by one in the table.
SELECT Id, Company, Salary FROM (SELECT e.Id, e.Salary, e.Company, IF(@prev = e.Company, @Rank:=@Rank + 1, @Rank:=1) AS rank, @prev:=e.Company FROM Employee e, (SELECT @Rank:=0, @prev:=0) AS temp ORDER BY e.Company , e.Salary , e.Id) Ranking INNER JOIN (SELECT COUNT(*) AS totalcount, Company AS name FROM Employee e2 GROUP BY e2.Company) companycount ON companycount.name = Ranking.Company WHERE Rank = FLOOR((totalcount + 1) / 2) OR Rank = FLOOR((totalcount + 2) / 2) ;