Hive getting top n records in group by query

Revised answer, fixing the bug as mentioned by @Himanshu Gahlot

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(page-id) as rank, clicks FROM (
        SELECT page-id, user-id, clicks FROM mytable
        DISTRIBUTE BY page-id
        SORT BY page-id, clicks desc
) a ) b
WHERE rank < 5
ORDER BY page-id, rank

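Note that rank() here is a custom UDF applied to the page-id column: the counter resets whenever the page-id value changes and increments otherwise, so each page-id partition is ranked independently. This only works because the inner query's DISTRIBUTE BY and SORT BY deliver the rows grouped by page-id and ordered by clicks.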
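
Since the query above depends on a custom rank() UDF, it has to be registered in the session before the query runs. A minimal sketch, assuming the UDF is already compiled into a jar; the jar path and class name are hypothetical placeholders, not names from the original answer:

```sql
-- Sketch only: replace the jar path and class name with your own (both are hypothetical).
ADD JAR /path/to/rank-udf.jar;

-- Register the UDF for the current session under the name the query calls.
-- Assumption: on Hive 0.11+, which ships its own rank(), it may be safer to
-- register under a different name and call that name in the query instead.
CREATE TEMPORARY FUNCTION rank AS 'com.example.hive.udf.Rank';
```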


As of Hive 0.11, you can do this with Hive's built-in rank() function and the simpler semantics of its built-in analytics and windowing functions. Sadly, I couldn't find as many examples of these as I would have liked, but they are really, really useful. With them, both rank() and WhereWithRankCond are built in, so you can just do:

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM mytable
) ranked_mytable
WHERE ranked_mytable.rank < 5
ORDER BY page-id, rank

No UDF required, and only one subquery! Also, all of the rank logic is localized.
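
One thing to watch with the built-in version: rank() starts at 1 and gives tied rows the same rank, so a filter like rank < 5 keeps ranks 1 through 4 and can return more than four rows per page when clicks tie. Here is a rough sketch comparing the three built-in ranking functions side by side. It assumes a table mytable(page_id, user_id, clicks); the underscore column names and the rnk, dense_rnk and row_num aliases are my own, chosen so nothing needs quoting.

```sql
-- Sketch only: mytable(page_id, user_id, clicks) is an assumed schema.
SELECT page_id, user_id, clicks, rnk, dense_rnk, row_num
FROM (
    SELECT page_id, user_id, clicks,
           -- ties share a rank and the next rank is skipped (1, 1, 3, ...)
           rank()       OVER (PARTITION BY page_id ORDER BY clicks DESC) AS rnk,
           -- ties share a rank and no rank is skipped (1, 1, 2, ...)
           dense_rank() OVER (PARTITION BY page_id ORDER BY clicks DESC) AS dense_rnk,
           -- every row gets a distinct number; order among ties is arbitrary
           row_number() OVER (PARTITION BY page_id ORDER BY clicks DESC) AS row_num
    FROM mytable
) ranked_mytable
WHERE row_num <= 5   -- at most 5 rows per page_id
ORDER BY page_id, row_num;
```

Filtering on row_num gives a hard cap of n rows per page; filter on rnk instead if you would rather keep every row that ties for a top-n spot.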

You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.