How to Group Nested Collections Based on Given Criteria?

I came up with the following:

user=> (def a [["A" 2011 "Dan"] 
               ["A" 2011 "Jon"] 
               ["A" 2010 "Tim"] 
               ["B" 2009 "Tom"] ])

user=> (into {} (for [[k v] (group-by first a)] 
                  [k (group-by second v)]))

{"A" {2011 [["A" 2011 "Dan"] 
            ["A" 2011 "Jon"]], 
      2010 [["A" 2010 "Tim"]]}, 
 "B" {2009 [["B" 2009 "Tom"]]}}

Generalization of group-by

I needed a generalization of group-by that would produce maps nested more than two levels deep. I wanted to be able to give such a function a list of arbitrary functions to apply recursively through group-by. Here’s what I came up with:

(defn map-function-on-map-vals
  "Take a map and apply a function on its values. From [1].
   [1] http://stackoverflow.com/a/1677069/500207"
  [m f]
  (zipmap (keys m) (map f (vals m))))

(defn nested-group-by
  "Like group-by but instead of a single function, this is given a list or vec
   of functions to apply recursively via group-by. An optional `final-fn` argument
   (defaults to identity) may be given to run on the vector result of the final
   group-by."
  [fs coll & [final-fn]]
  (if (empty? fs)
    ((or final-fn identity) coll)
    (map-function-on-map-vals (group-by (first fs) coll)
                              #(nested-group-by (rest fs) % final-fn))))
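
On Clojure 1.11 or later, the custom helper can probably be dropped in favor of the built-in update-vals, which takes the map and the function in the same order. A minimal sketch, assuming that version is available; the name nested-group-by* is hypothetical and only chosen to avoid clashing with the definition above:

```clojure
;; Assumes Clojure 1.11+, where clojure.core/update-vals exists.
(defn nested-group-by*
  "Variant of nested-group-by that uses update-vals instead of a custom helper."
  [fs coll & [final-fn]]
  (if (empty? fs)
    ((or final-fn identity) coll)
    (update-vals (group-by (first fs) coll)
                 #(nested-group-by* (rest fs) % final-fn))))

(nested-group-by* [first second]
                  [["A" 2011 "Dan"]
                   ["A" 2011 "Jon"]
                   ["A" 2010 "Tim"]
                   ["B" 2009 "Tom"]]
                  count)
;; => {"A" {2011 2, 2010 1}, "B" {2009 1}}
```

On older versions the zipmap-based helper remains the safer choice.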

Your example

Applied to your dataset:

cljs.user=> (def foo [ ["A" 2011 "Dan"]
       #_=>            ["A" 2011 "Jon"]
       #_=>            ["A" 2010 "Tim"]
       #_=>            ["B" 2009 "Tom"] ])
cljs.user=> (require '[cljs.pprint :refer [pprint]])
nil
cljs.user=> (pprint (nested-group-by [first second] foo))
{"A"
 {2011 [["A" 2011 "Dan"] ["A" 2011 "Jon"]], 2010 [["A" 2010 "Tim"]]},
 "B" {2009 [["B" 2009 "Tom"]]}}

This produces exactly the desired output. nested-group-by can take three, four, or more functions and produces that many levels of nested hash-maps. Perhaps this will be helpful to others.
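
To make the three-function case concrete, here is a small sketch with a third grouping level. The people data and its :dept, :year, :role and :name keys are hypothetical, made up for illustration. The two functions are repeated from above only so the snippet runs on its own:

```clojure
;; Repeated from above so this snippet is self-contained.
(defn map-function-on-map-vals [m f]
  (zipmap (keys m) (map f (vals m))))

(defn nested-group-by [fs coll & [final-fn]]
  (if (empty? fs)
    ((or final-fn identity) coll)
    (map-function-on-map-vals (group-by (first fs) coll)
                              #(nested-group-by (rest fs) % final-fn))))

;; Hypothetical data: rows are maps, so keywords can serve as key functions.
(def people
  [{:dept "A" :year 2011 :role :dev :name "Dan"}
   {:dept "A" :year 2011 :role :ops :name "Jon"}
   {:dept "A" :year 2010 :role :dev :name "Tim"}
   {:dept "B" :year 2009 :role :dev :name "Tom"}])

;; Three functions give three levels of nesting: dept -> year -> role.
(def by-dept-year-role
  (nested-group-by [:dept :year :role] people))

(get-in by-dept-year-role ["A" 2011 :ops])
;; => [{:dept "A", :year 2011, :role :ops, :name "Jon"}]
```

Since each level is an ordinary map, get-in with one key per grouping function reaches the innermost vector.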

Handy feature

nested-group-by also has a handy extra feature: final-fn. It defaults to identity, so if you don’t provide one, the deepest nesting level holds vectors of the original rows. If you do provide a final-fn, it is applied to each of those innermost vectors. To illustrate, suppose you just want to know how many rows of the original dataset fall into each category and year:

cljs.user=> (nested-group-by [first second] foo count)
                                               ;^^^^^ this is final-fn
{"A" {2011 2, 2010 1}, "B" {2009 1}}

Caveat

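This function doesn’t use recur, so each grouping function adds one level of recursion and a very long list of functions could blow the stack. The recursion depth depends only on the number of functions, not on the size of the collection, so for the anticipated use case, with only a handful of functions, this shouldn’t be a problem.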
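
As a rough sanity check that a large collection is not the problem, the sketch below groups 100,000 synthetic rows with just two functions. The rows data is hypothetical, and the definitions are repeated from above only so the snippet runs on its own:

```clojure
;; Repeated from above so this snippet is self-contained.
(defn map-function-on-map-vals [m f]
  (zipmap (keys m) (map f (vals m))))

(defn nested-group-by [fs coll & [final-fn]]
  (if (empty? fs)
    ((or final-fn identity) coll)
    (map-function-on-map-vals (group-by (first fs) coll)
                              #(nested-group-by (rest fs) % final-fn))))

;; Hypothetical synthetic rows: [i mod 3, i mod 5, i].
(def rows
  (mapv (fn [i] [(mod i 3) (mod i 5) i]) (range 100000)))

;; Two grouping functions mean two levels of recursion, however many rows there are.
(def counts
  (nested-group-by [first second] rows count))

;; Every row lands in exactly one innermost group, so the counts add up to the row count.
(reduce + (mapcat vals (vals counts)))
;; => 100000
```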
