274. H-Index


Given an array of citations (each citation is a non-negative integer) of a researcher, write a function to compute the researcher's h-index.

According to the definition of h-index on Wikipedia: "A scientist has index h if h of his/her N papers have at least h citations each, and the other N − h papers have no more than h citations each."

For example, given citations = [3, 0, 6, 1, 5], which means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively. Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, his h-index is 3.

Note: If there are several possible values for h, the maximum one is taken as the h-index.

Credits:
Special thanks to @jianchao.li.fighter for adding this problem and creating all test cases.


Summary

This article is for intermediate readers. It introduces the following ideas: Comparison Sort and Counting Sort.

Solution

Approach #1 (Sorting) [Accepted]

Intuition

Think geometrically. Imagine plotting a histogram where the $y$-axis represents the number of citations for each paper. After sorting in descending order, the $h$-index is the length of the largest square in the histogram.

Figure 1. $h$-index from a plot of decreasing citations for papers

Algorithm

To find such a square length, we first sort the citations array in descending order. After sorting, if $\text{citations}[i] > i$, then papers $0$ to $i$ all have at least $i + 1$ citations.

Thus, to find the $h$-index, we search for the largest $i$ (let's call it $k$) such that $\text{citations}[i] > i$, and therefore the $h$-index is $k + 1$.

For example:

i                     0     1     2     3      4      5      6
sorted citations      10    9     5     3      3      2      1
citations[i] > i ?    true  true  true  false  false  false  false

In this example, we know that the largest $i$ with $\text{citations}[i] > i$ is $k = 2$. Thus the $h$-index is $k + 1 = 3$.

Because $\text{citations}[k] > k$, the $k + 1$ papers (from paper $0$ to paper $k$) have at least $k + 1$ citations each, and since $k$ is the largest such index, $\text{citations}[k+1] \leq k + 1$, so the remaining $n - k - 1$ papers (from paper $k + 1$ to paper $n - 1$) have no more than $k + 1$ citations each. By the definition of the $h$-index, $h = k + 1$.

It is also possible to find $k$ through binary search after sorting. However, comparison sorting has a time complexity of $O(n \log n)$, which dominates the performance of the entire algorithm (linear search is only $O(n)$), so using binary search ($O(\log n)$) instead of linear search won't change the asymptotic time complexity.

Also note that we deduced the algorithm in descending order for simplicity. The sort function provided by most libraries defaults to ascending order; the same principle applies to both orders. In the case of an ascending sort, we simply scan the array from the back.
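To make that concrete, here is a minimal Java sketch of this approach; it assumes the library's default ascending sort and scans from the back, and the class and method names are only illustrative:

```java
import java.util.Arrays;

public class Solution {
    public int hIndex(int[] citations) {
        // Sort ascending (library default), then scan from the largest value down.
        Arrays.sort(citations);
        int n = citations.length;
        int i = 0;
        // citations[n - 1 - i] is the (i + 1)-th largest citation count;
        // keep growing i while that count still exceeds i.
        while (i < n && citations[n - 1 - i] > i) {
            i++;
        }
        return i; // largest h such that at least h papers have >= h citations
    }
}
```

For citations = [3, 0, 6, 1, 5], the loop performs three increments and returns 3, matching the example in the problem statement.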

Complexity Analysis

  • Time complexity: $O(n \log n)$. Comparison sorting dominates the time complexity.

  • Space complexity: $O(1)$. Most libraries use heapsort, which costs $O(1)$ extra space in the worst case.


Approach #2 (Counting) [Accepted]

Intuition

Comparison sorting algorithms have a lower bound of $\Omega(n \log n)$. To achieve better performance, we need a non-comparison-based sorting algorithm.

Algorithm

In Approach #1, we sorted the citations to find the $h$-index. However, it is well known that comparison sorting algorithms such as heapsort, mergesort and quicksort have a lower bound of $\Omega(n \log n)$. The most commonly used non-comparison sort is counting sort.

Counting sort operates by counting the number of objects that have each distinct key value, and using arithmetic on those tallies to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum and minimum keys, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items.

---by Wikipedia

However, in our problem the keys are the citation counts of the papers, which can be much larger than the number of papers $n$. It seems that we cannot use counting sort. The trick here is the following observation:

Any citation larger than $n$ can be replaced by $n$, and the $h$-index will not change after the replacement.

The reason is that the $h$-index is upper bounded by the total number of papers $n$, i.e. $h \leq n$.

In the diagram, replacing citations greater than $n$ with $n$ is equivalent to cutting off the area where $y > n$.

Figure 2. Cutting off the area with citations more than $n$

Apparently, cutting that area off changes neither the largest square nor the $h$-index.

After we have the counts, we can reconstruct the sorted citations by traversing the counts array (see the sketch below), and the rest is the same as Approach #1.
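As an aside, here is a small sketch of that reconstruction step, assuming a count array of length $n + 1$ built from the capped citations; the helper sortedDesc is hypothetical, not part of the original solution:

```java
// Rebuild the capped citations in descending order from the counts,
// so that the scan from Approach #1 could be reused unchanged.
static int[] sortedDesc(int[] count, int n) {
    int[] sorted = new int[n];
    int idx = 0;
    for (int k = n; k >= 0; k--) {          // emit larger citation values first
        for (int j = 0; j < count[k]; j++) {
            sorted[idx++] = k;
        }
    }
    return sorted;
}
```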

But we can do even better. The idea is that we don't even need to get the sorted citations. We can find the $h$-index by using the paper counts directly.

To explain this, let's look at the following example: $\text{citations} = [1, 3, 2, 3, 100]$.

The counting results are:

k        0    1    2    3    4    5
count    0    1    1    2    0    1
s_k      5    5    4    3    1    1

The value $s_k$ is defined as "the sum of all counts with citation $\geq k$", or equivalently "the number of papers having at least $k$ citations". By definition of the $h$-index, the largest $k$ with $k \leq s_k$ is our answer.

After replacing $100$ with $n = 5$, we have $\text{citations} = [1, 3, 2, 3, 5]$. Now, we count the number of papers for each citation number from $0$ to $5$. The counts are $[0, 1, 1, 2, 0, 1]$. The first $k$ from right to left ($5$ down to $0$) with $k \leq s_k$ is the $h$-index, $3$.

Since we can calculate $s_k$ on the fly while traversing the count array from right to left, we need only one pass through it, which costs $O(n)$ time.
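Putting the capping, counting, and right-to-left accumulation together, a minimal Java sketch of this approach might look as follows (same illustrative class and method names as before):

```java
public class Solution {
    public int hIndex(int[] citations) {
        int n = citations.length;
        // count[k] = number of papers with exactly k citations, with values capped at n.
        int[] count = new int[n + 1];
        for (int c : citations) {
            count[Math.min(c, n)]++;
        }
        // s accumulates s_k = number of papers with at least k citations,
        // scanning k from n down to 0; the first k with k <= s_k is the h-index.
        int s = 0;
        for (int k = n; k >= 0; k--) {
            s += count[k];
            if (k <= s) {
                return k;
            }
        }
        return 0; // never reached: k = 0 always satisfies k <= s
    }
}
```

For the example above, count = [0, 1, 1, 2, 0, 1] and the scan stops at $k = 3$.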

Complexity Analysis

  • Time complexity: $O(n)$. There are two steps. The counting part is $O(n)$ since we traverse the citations array once and only once. The second part of finding the $h$-index is also $O(n)$ since we traverse the count array at most once. Thus, the entire algorithm is $O(n)$.

  • Space complexity: $O(n)$. We use $O(n)$ auxiliary space to store the counts.

Further Thoughts

Is it possible to have multiple $h$-values?

The answer is NO. One can find this intuitively from Figure 1. The dashed line crosses the histogram once and only once, because the sorted bars are monotonic. It can also be proven from the definition of the $h$-index.