pandas merging based on a timestamp which do not match exactly

I would try using this method in pandas:

pandas.merge_asof()

The parameters of interest for you would be direction,tolerance,left_on, and right_on

Building off @Igor answer:

import pandas as pd
from pandas import read_csv
from io import StringIO

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)



df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])

# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)

output

date_start_time date_start_time_x   employee_id_x   session_id_x    date_start_time_y   employee_id_y   session_id_y

2016-01-01 02:03:00 2016-01-01 02:03:00 7261824 871631182   2016-01-01 02:03:00 7261824.0   871631182.0
2016-01-01 06:03:00 2016-01-01 06:03:00 7261824 871631183   2016-01-01 06:05:00 7261824.0   871631183.0
2016-01-01 11:01:00 2016-01-01 11:01:00 7261824 871631184   2016-01-01 11:04:00 7261824.0   871631184.0
2016-01-01 14:01:00 2016-01-01 14:01:00 7261824 871631185   NaT NaN NaN

Consider the following mini-version of your problem:

from io import StringIO
from pandas import read_csv, to_datetime

# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)

which gives

>>> df1
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:03:00      7261824   871631183
2 2016-01-01 11:01:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
>>> df2
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:10:00      7261824   871631185

You would like to treat df2[0:3] as duplicates of df1[0:3] when merging (since they are respectively less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.

Solution 1: Interval Matching

This is essentially what you are suggesting in your edit. You want to map timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.

Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example:

import numpy as np

# half-threshold in nanoseconds
threshold_ns = threshold * 60 * 1e9

# compute "interval" to which each session belongs
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

# join
cols = ['interval', 'employee_id', 'session_id']
print df1.merge(df2, on=cols, how='outer')[cols]

which prints

             interval  employee_id  session_id
0 2016-01-01 02:05:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:00:00      7261824   871631184
3 2016-01-01 14:00:00      7261824   871631185
4 2016-01-01 11:05:00      7261824   871631184
5 2016-01-01 14:10:00      7261824   871631185

Note that this is not totally correct. The sessions df1[2] and df2[2] are not treated as duplicates although they are only 3 minutes apart. This is because they were on different sides of the interval boundary.

Solution 2: One-to-one matching

Here is another approach which depends on the condition that sessions in df1 have either zero or one duplicates in df2.

We replace timestamps in df1 with the closest timestamp in df2 which matches on employee_id and session_id and is less than 5 minutes away.

from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's below the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]

    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)

# join
cols = ['date_start_time', 'employee_id', 'session_id']
print df1.merge(df2, on=cols, how='outer')[cols]

which prints

      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
4 2016-01-01 14:10:00      7261824   871631185

This approach is significantly slower, since you have to search through the entirety of df2 for each row in df1. What I have written can probably be optimized further, but this will still take a long time on large datasets.


I would suggest to use the built-in pandas Series dt round function, to round both dataframe to a common time, for example round up to every 5min. So the time will always be in format: 01:00:00 and then 01:05:00 for example. In that way, both dataframe will have similar time index to perform the merge.

Please see documentation and examples here pandas.Series.dt.round

Tags:

Python

Pandas