DISTRIBUTE BY clause in Hive

In addition to @Dudu's answer: DISTRIBUTE BY only distributes the rows among the reducers. How many reducers there are is determined from the input size.

The number of reducers used for a Hive job is derived from the property hive.exec.reducers.bytes.per.reducer, so it depends on the size of the input.

As of Hive 0.14, the default is one reducer per 256MB of input, so if the input is < 256MB only one reducer will be used, unless the number of reducers is set explicitly with the mapred.reduce.tasks property. The hive.exec.reducers.max property only caps the estimated number; it cannot raise it.
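To make the arithmetic concrete, here is a rough sketch of that estimate in Python. It is a simplified model, not Hive's actual code: the function name and parameters are made up for illustration, the defaults (256,000,000 bytes per reducer, a cap of 1009) are what I understand the Hive 0.14+ defaults to be, and the real planner applies further adjustments that are ignored here.

```python
import math


def estimate_reducers(input_bytes,
                      bytes_per_reducer=256_000_000,  # assumed default of hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009,              # assumed default of hive.exec.reducers.max
                      fixed_reducers=-1):             # stands in for mapred.reduce.tasks (-1 = let Hive decide)
    """Simplified model of how the reducer count follows from the input size."""
    if fixed_reducers > 0:
        # An explicit reducer count bypasses the size-based estimate.
        return fixed_reducers
    estimated = math.ceil(input_bytes / bytes_per_reducer)
    # At least one reducer, and never more than the configured cap.
    return max(1, min(max_reducers, estimated))


print(estimate_reducers(100_000_000))                    # small input
print(estimate_reducers(1_000_000_000))                  # about 1 GB of input
print(estimate_reducers(100_000_000, fixed_reducers=8))  # explicit override
```

With a small input the estimate collapses to a single reducer, which is why every city can end up in the same reducer no matter what DISTRIBUTE BY says.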


The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
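The guarantee can be illustrated with a small simulation. This is only a sketch: zlib.crc32 is a stand-in for whatever hash Hive really uses, and the helper names and sample rows are made up. The point is that any deterministic hash taken modulo the reducer count sends equal keys to the same reducer, while different keys may still share one.

```python
import zlib
from collections import defaultdict


def assign_reducer(key, num_reducers):
    # Stand-in hash (an assumption, not Hive's own hash function).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_by(rows, key_field, num_reducers):
    """Group rows by the reducer their key hashes to."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[assign_reducer(row[key_field], num_reducers)].append(row)
    return dict(reducers)


rows = [
    {"city": "Paris", "name": "a"},
    {"city": "London", "name": "b"},
    {"city": "Paris", "name": "c"},
    {"city": "Berlin", "name": "d"},
    {"city": "London", "name": "e"},
]

for num_reducers in (1, 2):
    assignment = distribute_by(rows, "city", num_reducers)
    for reducer_id, reducer_rows in sorted(assignment.items()):
        cities = sorted({row["city"] for row in reducer_rows})
        print(num_reducers, "reducer(s): reducer", reducer_id, "gets", cities)
```

With one reducer everything lands in reducer 0. With two reducers and three cities, at least two cities must share a reducer, but a city is never split across reducers.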


A question by the OP:

Then what is the point of this DISTRIBUTE BY? There's no guarantee that each (city) would go to a different reducer, so why use it?


For 2 reasons:

  1. In the early days of Hive, DISTRIBUTE BY, SORT BY and CLUSTER BY were used to process data in ways that today are handled automatically (e.g. analytic functions, https://oren.lederman.name/?p=32).

  2. You might want to stream your data through a script (Hive "Transform") and have the script process the data in certain groups and in a certain order. For that you can use DISTRIBUTE BY + SORT BY, or CLUSTER BY (shorthand for both on the same columns). With DISTRIBUTE BY it is guaranteed that the whole group ends up in the same reducer. With SORT BY, all the records of a group arrive contiguously.
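As a sketch of the second point, this is roughly what the grouping logic of such a script could look like in Python. The function name and the sample rows are made up, and the tab-separated layout with city as the first field is an assumption about the input. itertools.groupby only merges adjacent rows with the same key, so it relies on exactly the two guarantees above: the whole group is in one reducer, and its rows are contiguous.

```python
from itertools import groupby


def summarize_groups(lines):
    """Yield (city, row_count) for each contiguous run of rows sharing a city.

    Each line is assumed to be tab-separated with the city in the first field.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for city, group in groupby(parsed, key=lambda fields: fields[0]):
        yield city, sum(1 for _ in group)


# Rows as they would arrive after DISTRIBUTE BY city SORT BY city.
sorted_rows = ["London\talice\n", "London\tbob\n", "Paris\tcarol\n"]
# Same rows without the SORT BY guarantee.
unsorted_rows = ["London\talice\n", "Paris\tcarol\n", "London\tbob\n"]

print(list(summarize_groups(sorted_rows)))    # [('London', 2), ('Paris', 1)]
print(list(summarize_groups(unsorted_rows)))  # [('London', 1), ('Paris', 1), ('London', 1)]
```

In a real Transform script the same function would be fed sys.stdin instead of a list, since Hive streams the rows to the script as tab-separated lines. The second call shows what goes wrong without SORT BY: London is seen as two separate groups.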