LC 0274 [M] H Index - ALawliet/algorithms GitHub Wiki

bucket/counting sort

Wikipedia: "Formally, if f is the function that corresponds to the number of citations for each publication, we compute the h-index as follows: First we order the values of f from the largest to the lowest value. Then, we look for the last position in which f is greater than or equal to the position (we call h this position)."

First we order the values of f from the largest to the lowest value. Then, we look for the last position in which f is greater than or equal to the position (we call h this position). For example, if we have a researcher with 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h-index is equal to 4 because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, then the index is 3 (i.e. the 3rd position) because the fourth paper has only 3 citations.
This type of problems always throw me off, but it just takes some getting used to. The idea behind it is some bucket sort mechanisms. First, you may ask why bucket sort. Well, the h-index is defined as the number of papers with reference greater than the number. So assume n is the total number of papers, if we have n+1 buckets, number from 0 to n, then for any paper with reference corresponding to the index of the bucket, we increment the count for that bucket. The only exception is that for any paper with larger number of reference than n, we put in the n-th bucket.

Then we iterate from the back to the front of the buckets, whenever the total count exceeds the index of the bucket, meaning that we have the index number of papers that have reference greater than or equal to the index. Which will be our h-index result. The reason to scan from the end of the array is that we are looking for the greatest h-index. For example, given array [3,0,6,5,1], we have 6 buckets to contain how many papers have the corresponding index. Hope to image and explanation help.

The idea is to see that the result can only range from 0 to the length of the array (because we can't have h-index greater than the total papers published). So we create an array "arr" which acts like a HashMap (using pigeon hole principle) and loop backwards from the highest element, then we find "tot" which is the total number of papers that has more than i citations, and we stop when tot>=i (total number of papers with more than i citations >= i). We don't need to keep going because we are trying the biggest i possible, we we stop and return the result.

The citations[ ] array contains the number of citations for each paper. The number of papers publish is citations.length.

This is actually a "counting sort", which simply counts the number of papers with x citations. Some discussion entries call this a "bucket sort", but technically a counting sort is different in that a bucket sort saves all of the original data. A counting sort throws out the original data, and only saves aggregate information. Counting sorts are O(N). I don't really consider counting sorts to be real sorts, because the original data is lost. I consider it more of a data analysis.

The h-index value cannot be higher than the total number of papers, so we only need the count sort to hold values from 0 to the number of papers (citations.length). Papers with more citations than the number of papers, will be counted as though they had number-of-papers citations.
class Solution:
    def hIndex(self, citations: List[int]) -> int:
        citations = sorted(citations, reverse=True)
        h = 0
        for pos, f in enumerate(citations):
            if f > pos:
                h += 1
        return h
class Solution:
    def hIndex(self, citations):
        citations.sort()
        n = len(citations)
        for i in range(n):
            if citations[i] >= (n-i):
                return n-i
        return 0
class Solution(object):
    def hIndex(self, citations):
        n = len(citations)
        buckets = [0 for _ in range(n + 1)]
        
        for num in citations:
            if num >= n: # can't be bigger than n
                buckets[n] += 1
            elif num < n:
                buckets[num] += 1
                
        count = 0
        
        for pos in reversed(range(len(buckets))):
            count += buckets[pos]
            
            if count >= pos: # reverse count total = h exceeds index pos
                return pos
                
        return 0