How to perform a DISTINCT in Pig Latin on a subset of columns?

The accepted answer is one great solution but, in case you want to reorder the fields in the output (something I had to do recently) this might not work. Here's an alternative:

A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
OUTPUT = FOREACH GP GENERATE 
    group.f1, group.f2, f4, f5, group.f3 ;

When you group on certain fields, the selection would have unique values for the group in a each tuple.


Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:

A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };

Tags:

Apache Pig