solving a problem with map reduce

Using irb (ruby-1.9.2-p180):

list = [ {a:2, b:1, d:3}, {a:3, b:2, c:3}, {a:4, b:1, c:3} ]
 => [{:a=>2, :b=>1, :d=>3}, {:a=>3, :b=>2, :c=>3}, {:a=>4, :b=>1, :c=>3}] 

Hash[list.map(&:keys).inject(&:&).map{|key| [key,list.map{|arr| arr[key]}.inject(&:+)]}]
 => {:a=>9, :b=>4} 

this solution works with multiple arrays (2+) it finds common keys and sums them returning a hash of results

to find common keys (gather keys and find common part):

list.map(&:keys).inject(&:&)

to find sum for key (select values by keys and sum them):

list.map{|arr| arr[key]}.inject(&:+)

to build Hash from array of pairs [[:a,9], [:b,4]]:

results = [[:a,9], [:b,4]]
Hash[ results ]

I love ruby for this one liners!


You could try by considering the elements given in MapReduce wikipedia article:

  • an input reader - in your case this would probably be a method call on [key, value] pair from your input hashes.
  • a Map function - you already have keys you should be processing your data by, so your map worker would just return the [key, value] pair it got as an input
  • a partition function - a method which would assign a reduce worker based on the key. In your case it could be simply key.hash % REDUCER_COUNT.
  • a compare function - I don't think this is applicable in your case as you don't need values to be processed in any particular order.
  • a Reduce function - would be given [key, list] pair, list being list of values associated with the key. It would return the sum of list if list is more than one element long (as you want only elements appearing in both input hashes processed).
  • an output writer - could be plain Hash in your example.

And here's my (over)simplified implementation of the above.


Assuming that we have all the other map-reduce-related functions implemented (input reader, output writer, global sort, ...), these would be the map and reduce ones:

def map(input)
  input.each do |count, letter|
    yield [letter, count]
  end
end

def reduce(letter, partial_counts)
  result = if partial_counts.size == 2
    partial_counts[0] + partial_counts[1]
  end

  yield result
end

The map function will yield a pair (letter, count), which will be grouped later. Then for each letter received from maps reducewill receive an array containing every count yielded by a map for that letter. As you wish to only yield if the letter occurs on both hashes, we need counts to appear on partial_counts twice to use it to compute the sum at the end. The reduce function could be implemented in several ways. I've tried to make it as simples as possible to understand, although it's implementation is very adjusted to this problem.

Using these map and reduce implementation will return the last hash with keys and value inverted, which makes more sense, as there may be several letter with the same count. The input would be better if it inverted keys and values too. This way, map would be as simple as yielding each pair of (letter, count):

def map(input)
  input.each do |letter, count|
    yield [letter, count]
  end
end

or

def map(input)
  input.each do |i|
    yield i
  end
end

Tags:

Ruby

Mapreduce