Dealing with datasets with repeated multivalued features

That is a very general question, but as far as I can tell, if you want to use some ML methods it is sensible to transform the data into a tidy data format first.

As far as I can tell from the documentation that @RootTwo nicely references in his comment, you are actually dealing with two datasets: one example flat table and one product flat table. (You can later join the two to get one table if so desired.)

Let us first create some parsers that decode the different lines into somewhat informative data structures:

For lines with examples we may use:

def process_example(example_line):
    # example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1}
    #    0        1         2           3               4          5            6               7 ...
    feature_names = ['ex_id', 'hash', 'clicked', 'propensity', 'slots', 'candidates'] + \
                    ['display_feature_' + str(i) for i in range(1, 11)]
    are_numbers = [1, 3, 4, 5, 6]  # every field except the hash (index 2) is numeric
    parts = example_line.split(' ')
    parts[1] = parts[1].replace(':', '')  # drop the trailing ':' from the example id
    for i in are_numbers:
        parts[i] = float(parts[i])
        if parts[i].is_integer():
            parts[i] = int(parts[i])
    features = [int(ft.split(':')[1]) for ft in parts[7:]]
    return dict(zip(feature_names, parts[1:7] + features))

This method is hacky but gets the job done: parse the features and cast them to numbers where possible. The output looks like:

{'ex_id': 20184824,
 'hash': '57548fae76b0aa2f2e0d96c40ac6ae3057548faee00912d106fc65fc1fa92d68',
 'clicked': 0,
 'propensity': 1.416489e-07,
 'slots': 6,
 'candidates': 30,
 'display_feature_1': 728,
 'display_feature_2': 90,
 'display_feature_3': 1,
 'display_feature_4': 10,
 'display_feature_5': 16,
 'display_feature_6': 1,
 'display_feature_7': 26,
 'display_feature_8': 11,
 'display_feature_9': 597,
 'display_feature_10': 7}

Next are the product lines. As you mentioned, the problem is the multiple occurrence of values. I think it is sensible to aggregate unique feature-value pairs by their frequency. No information gets lost, and it helps us to encode a tidy sample. That should address your second question.
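To illustrate the idea on two feature-value pairs taken from the sample output further below (a tiny sketch):

import toolz  # pip install toolz

pairs = [('product_feature_22', 14), ('product_feature_22', 54),
         ('product_feature_17', 2), ('product_feature_17', 2)]
# countby with the identity key groups identical (feature, value) pairs and counts them
toolz.countby(toolz.identity, pairs)
# {('product_feature_22', 14): 1, ('product_feature_22', 54): 1, ('product_feature_17', 2): 2}

The full product-line parser then looks like this: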

import toolz  # pip install toolz

def process_product(product_line):
    # ${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
    parts = product_line.split(' ')
    meta = {'label': int(parts[0]),
            'ex_id': int(parts[1].split(':')[1])}
    # extract features given as ${productFeat1_1}:${v1_1} into (name, value) tuples
    features = [('product_feature_' + str(i), int(v))
                for i, v in map(lambda x: x.split(':'), parts[2:])]
    # count each unique value and transform them into
    # feature_name X feature_value X feature_frequency
    products = [dict(zip(['feature', 'value', 'frequency'], (*k, v)))
                for k, v in toolz.countby(toolz.identity, features).items()]
    # now merge the meta information into each product
    return [dict(p, **meta) for p in products]

This basically extracts the label and the features of each product line (example output for line 40):

[{'feature': 'product_feature_11',
  'value': 0,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_12',
  'value': 1,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_13',
  'value': 0,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_14',
  'value': 2,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_15',
  'value': 0,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_17',
  'value': 2,
  'frequency': 2,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_21',
  'value': 55,
  'frequency': 2,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_22',
  'value': 14,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_22',
  'value': 54,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_24',
  'value': 3039,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_25',
  'value': 721,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_33',
  'value': 386,
  'frequency': 2,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_35',
  'value': 963,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103}]

So when you process your stream line by line, you can decide whether to map an example or a product:

def process_stream(stream):
    for content in stream:
        if 'example' in content:
            yield process_example(content)
        else:
            yield process_product(content)

I've decided to use a generator here because it benefits processing the data in a functional way if you decide not to use pandas. Otherwise a list comprehension will be your friend.
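For instance, the list-comprehension variant could look roughly like this (a sketch; `stream` stands for any iterable of decoded lines, such as the lines_stream defined below, and `processed` is just a placeholder name):

# materialise everything in memory at once instead of yielding lazily
processed = [process_example(content) if 'example' in content else process_product(content)
             for content in stream]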

Now for the fun part: we read the lines from a given (example) URL one by one and assign them to their corresponding datasets (example or product). I will use reduce here, because it is fun :-). I won't go into detail about what map/reduce actually does (that's up to you). You can always use a simple for loop instead.

import urllib.request
import toolz  # pip install toolz

lines_stream = (line.decode("utf-8").strip() 
                for line in urllib.request.urlopen('http://www.cs.cornell.edu/~adith/Criteo/sample.txt'))

# if you prefer a concise but hacky approach you could do:
# blubb = list(toolz.partitionby(lambda x: 'hash' in x, process_stream(lines_stream)))
# examples_only = blubb[slice(0, len(blubb), 2)]
# products_only = blubb[slice(1, len(blubb), 2)]

# but to introduce some functional approach lets implement a reducer
def dataset_reducer(datasets, content):
    which_one = 0 if 'hash' in content else 1
    datasets[which_one].append(content)
    return datasets

# and process the stream using the reducer. Which results in two datasets:
examples_dataset, product_dataset = toolz.reduce(dataset_reducer, process_stream(lines_stream), [[], []])

From here you can cast your datasets into tidy dataframes that you can use to apply machine learning. Beware of NaN/missing values, distributions, etc. You can join the two datasets with merge to get one big flat table of samples X features. Then you will be more or less able to use different methods from e.g. scikit-learn.

import pandas

examples_dataset = pandas.DataFrame(examples_dataset)
product_dataset = pandas.concat(pandas.DataFrame(p) for p in product_dataset)

Examples dataset

   candidates  clicked  ...    propensity  slots
0          30        0  ...  1.416489e-07      6
1          23        0  ...  5.344958e-01      3
2          23        1  ...  1.774762e-04      3
3          28        0  ...  1.158855e-04      6

Product dataset (product_dataset.sample(10))

       ex_id             feature  frequency  label  value
6   10244535  product_feature_21          1      0     10
9   37375474  product_feature_25          1      0      4
6   44432959  product_feature_25          1      0    263
15  62131356  product_feature_35          1      0     14
8   50383824  product_feature_24          1      0    228
8   63624159  product_feature_20          1      0     30
3   99375433  product_feature_14          1      0      0
9    3389658  product_feature_25          1      0     43
20  59461725  product_feature_31          8      0      4
11  17247719  product_feature_21          3      0      5

Be mindful of the product_dataset: you can 'pivot' the features that currently sit in rows into columns (see the pandas reshaping docs).
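A minimal sketch of both steps (the names `wide` and `flat` are just placeholders; pivot_table needs an aggregation function because a feature can carry several values for the same ex_id, so taking the first value here is purely illustrative):

# long -> wide: one column per product feature, indexed by ex_id
wide = product_dataset.pivot_table(index='ex_id', columns='feature',
                                   values='value', aggfunc='first')

# join with the examples table to get one flat samples x features table
flat = examples_dataset.merge(wide, left_on='ex_id', right_index=True, how='left')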


The sample file has some interesting features per example. Flattened out in a dict, each example looks something like:

{'ex_id': int,
 'hash': str,
 'clicked': bool,
 'propensity': float,
 'slots': int,
 'candidates': int,
 'display_feature_1': [int],
 'display_feature_2': [int],
 'display_feature_3': [int],
 'display_feature_4': [int],
 'display_feature_5': [int],
 'display_feature_6': [int],
 'display_feature_7': [int],
 'display_feature_8': [int],
 'display_feature_9': [int],
 'display_feature_10': [int],
 'display_feature_11': [int],
 'display_feature_12': [int],
 'display_feature_13': [int],
 'display_feature_14': [int],
 'display_feature_15': [int],
 'display_feature_16': [int],
 'display_feature_17': [int],
 'display_feature_18': [int],
 'display_feature_19': [int],
 'display_feature_20': [int],
 'display_feature_21': [int],
 'display_feature_22': [int],
 'display_feature_23': [int],
 'display_feature_24': [int],
 'display_feature_25': [int],
 'display_feature_26': [int],
 'display_feature_27': [int],
 'display_feature_28': [int],
 'display_feature_29': [int],
 'display_feature_30': [int],
 'display_feature_31': [int],
 'display_feature_32': [int],
 'display_feature_33': [int],
 'display_feature_34': [int],
 'display_feature_35': [int]
}

whereby features 1-35 may or may not be present, and each may or may not be repeated. A reasonable thing to do for a dataset of this size is to store it as a list of tuples, where each tuple corresponds to one example ID, like this:

(
  int, # exid
  str, # hash
  bool, # clicked
  float, # propensity
  int, # slots
  int, # candidates
  dict # the display features
)

A suitable dict structure for the 35 display features is

{k+1 : [] for k in range(35)}

Overall, this example data structure can be summarized as a list of tuples, whereby the last element in each tuple is a dictionary.

Assuming you have sample.txt locally, you can populate this structure like this:

examples = []
with open('sample.txt', 'r') as fp:
    for line in fp:

        line = line.strip('\n')

        if line[:7] == 'example':
            items = line.split(' ')
            items = [item.strip(':') for item in items]  # drop the trailing ':' from the example id
            examples.append((
                int(items[1]),                  # exid
                items[2],                       # hash
                bool(int(items[3])),            # clicked ('0'/'1' -> False/True)
                float(items[4]),                # propensity
                int(items[5]),                  # slots
                int(items[6]),                  # candidates
                {k+1 : [] for k in range(35)}   # the display features
            ))
            # the remaining items are key:value display features
            for keyval in items[7:]:
                key, val = keyval.split(':')
                examples[-1][6][int(key)].append(int(val))

        else:
            # product line: ${wasProductClicked} exid:${exID} key:value ...
            items = line.split(' ')
            while len(items) > 2:
                keyval = items.pop()
                key = int(keyval.split(':')[0])
                val = int(keyval.split(':')[1])
                examples[-1][6][key].append(val)

This data structure of records can be converted to JSON, and read into a numpy array. You can easily sort it based on any of the elements in each of the tuples, and iterate over it quickly too.
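For instance (a small sketch; 'examples.json' is just a made-up filename, tuples serialise to JSON arrays, and the integer dict keys become strings):

import json

# dump the whole list of records to JSON
with open('examples.json', 'w') as out:
    json.dump(examples, out)

# sort in place by propensity, which is element 3 of each tuple
examples.sort(key=lambda ex: ex[3])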

The approach for dealing with multiple-value record items was to store them in a dictionary of lists. This makes it easy to gather their statistics.
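For example, a rough sketch of per-feature statistics (how many values each display feature collected across all examples, and their mean):

import statistics

for feature_id in range(1, 36):
    values = [v for ex in examples for v in ex[6][feature_id]]
    if values:
        print(feature_id, len(values), statistics.mean(values))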