Relational joining of tables

This one is fairly fast. I used GatherBy to collect the rows that share an id and kept only the groups containing exactly two rows, i.e. the ids that occur in both tables. (I assumed that the id entries are unique within each of a and b.) The matching entries are then extracted.

On 10000/5000 entries:

a = Table[{x, RandomReal[]}, {x, 1, 10000}];
b = Table[{x, RandomReal[]}, {x, 1, 10000, 2}];
Flatten[Cases[GatherBy[a~Join~b, First], {_, _}], {{1}, {2, 3}}][[All, {1, 2, 4}]] //
  Timing // First
(* 0.014554 *)

On 100,000/50,000 entries (roughly linear growth):

a = Table[{x, RandomReal[]}, {x, 1, 100000}];
b = Table[{x, RandomReal[]}, {x, 1, 100000, 2}];
Flatten[Cases[GatherBy[a~Join~b, First], {_, _}], {{1}, {2, 3}}][[All, {1, 2, 4}]] //
  Timing // First
(* 0.175955 *)

The output on a small data set looks like this:

a = Table[{x, RandomReal[]}, {x, 1, 10}];
b = Table[{x, RandomReal[]}, {x, 1, 10, 2}];
Flatten[Cases[GatherBy[a~Join~b, First], {_, _}], {{1}, {2, 3}}][[All, {1, 2, 4}]]
(* {{1, 0.74066, 0.69329}, {3, 0.80351, 0.420397}, {5, 0.239924, 0.806693},
    {7, 0.665209, 0.0483077}, {9, 0.705487, 0.737412}} *)
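
To see what each stage of the one-liner contributes, it can be unrolled on a tiny data set. The values below are made up for illustration (they are not the data timed above), and the names groups, pairs and wide are just labels for this sketch.

```mathematica
a = {{1, 0.1}, {2, 0.2}, {3, 0.3}};
b = {{1, 0.5}, {3, 0.7}};

(* collect rows sharing the same id; groups appear in order of first occurrence *)
groups = GatherBy[a~Join~b, First]
(* {{{1, 0.1}, {1, 0.5}}, {{2, 0.2}}, {{3, 0.3}, {3, 0.7}}} *)

(* keep only groups of exactly two rows, i.e. ids present in both tables *)
pairs = Cases[groups, {_, _}]
(* {{{1, 0.1}, {1, 0.5}}, {{3, 0.3}, {3, 0.7}}} *)

(* merge each pair into a single row, then drop the repeated id in column 3 *)
wide = Flatten[pairs, {{1}, {2, 3}}]
(* {{1, 0.1, 1, 0.5}, {3, 0.3, 3, 0.7}} *)
wide[[All, {1, 2, 4}]]
(* {{1, 0.1, 0.5}, {3, 0.3, 0.7}} *)
```

The pattern {_, _} identifies a matched pair only because ids are assumed unique within each table; an id repeated inside a alone would also form a group of two.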

By comparison, Verbeia's version grows roughly quadratically and is considerably slower:

a = Table[{x, RandomReal[]}, {x, 1, 1000}];
b = Table[{x, RandomReal[]}, {x, 1, 1000, 2}];
(Join[{##}, Cases[b, {#1, _}][[All, 2]]] & @@@ a) /. {_, _} -> Sequence[] //
  Timing // First
(* 0.065370 *)

a = Table[{x, RandomReal[]}, {x, 1, 10000}];
b = Table[{x, RandomReal[]}, {x, 1, 10000, 2}];
(Join[{##}, Cases[b, {#1, _}][[All, 2]]] & @@@ a) /. {_, _} -> Sequence[] //
  Timing // First
(* 5.736539 *)

Update

Below I present two faster functions, based on a similar idea: if the data Join[a, b] were sorted on the IDs, rows to be joined would end up next to each other. I used Ordering and Differences to determine the rows with the same ID. When sorted, a difference in IDs of zero indicates two rows to be joined. SparseArray and ArrayReshape seem fairly fast here, but they're not compilable. For the compiled version, Position and Partition were used instead. On packed arrays, both have comparable speeds.

I avoid sorting the whole database, since moving all that memory around will take time.

I'm assuming that the IDs are numeric. Indeed, to get the speed below, the data have to be numeric. The speed is dependent on using packed arrays. The compiled function assumes the IDs can be represented by Real numbers.

mJoin2[a_, b_] := 
  With[{data = a ~Join~ b,
        ids = a[[All, 1]] ~Join~ b[[All, 1]],
        ncols = Last @ Dimensions @ a},
   With[{ordering = Ordering[ids]}, 
    With[{adj = SparseArray[1 - Unitize @ Differences @ ids[[ordering]]]["AdjacencyLists"]},
     ArrayReshape[
        data[[ ordering[[Flatten @ Transpose @ {#, # + 1} &@ adj]] ]],
        {Length @ adj, 2 ncols}
      ][[All, Join[Range@ncols, Range[2 + ncols, 2 ncols]]]]
     ]]];

mJoin3 = Compile[{{a, _Real, 2}, {b, _Real, 2}},
   Module[{data, ids, ordering, joinID, ncols},
    data = a ~Join~ b;
    ids = a[[All, 1]] ~Join~ b[[All, 1]];
    ncols = Last @ Dimensions @ a;
    ordering = Ordering[ids];
    joinID = Flatten @ Position[Unitize @ Differences @ ids[[ordering]], 0];
    Partition[
      Flatten[
       data[[
         Flatten @ Transpose @ {ordering[[joinID]], ordering[[1 + joinID]]}
         ]]],
      2 ncols
      ][[All, Join[Range @ ncols, Range[2 + ncols, 2 ncols]]]]
    ]
   ];
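
The step shared by both functions, finding adjacent equal IDs after sorting, can be traced on a short list. This is only a sketch with made-up IDs (three from a hypothetical a followed by four from a hypothetical b); the name sorted is introduced here for readability.

```mathematica
ids = {1., 4., 7., 1., 3., 5., 7.};   (* IDs of a followed by IDs of b *)

ordering = Ordering[ids];
sorted = ids[[ordering]]
(* {1., 1., 3., 4., 5., 7., 7.} *)

(* a zero difference marks two neighbouring rows with the same ID *)
Differences[sorted]
(* {0., 2., 1., 1., 2., 0.} *)

joinID = Flatten @ Position[Unitize @ Differences @ sorted, 0]
(* {1, 6} *)

(* ordering[[joinID]] and ordering[[1 + joinID]] are then the original
   row numbers of the two rows to be joined for each matched ID *)
```

Only the selected rows of data are ever extracted, so the full table is never rearranged.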

Comparison

$HistoryLength = 0;

a = Table[{N@x, RandomReal[]}, {x, 1, 2000000, 3}];
b = Table[{N@x, RandomReal[]}, {x, 1, 2000000, 2}];

(m3 = mJoin3[a, b]) // Length // Timing     (* Michael E2 compiled *)
(m2 = mJoin2[a, b]) // Length // Timing     (* Michael E2 SparseArray method *)
(m1 = Flatten[                              (* Michael E2 original *)
       Cases[GatherBy[a~Join~b, First], {_, _}], {{1}, {2, 3}}][[All, {1, 2, 4}]]) //
         Length // Timing
(i2 = RJoin[b, a]) // Length // Timing      (* Ian S. -- first update *)

{0.177294, 333334}
{0.210489, 333334}
{2.890227, 333334}
{5.517205, 333334}

m1 == m2 == m3 == i2
(* True *)

On non-packed arrays mJoin2 slows down. The compiled function mJoin3 converts the Integer IDs to Real and packs the arrays when it is called, which causes a slight slowdown.

a = Table[{x, RandomReal[]}, {x, 1, 2000000, 3}];
b = Table[{x, RandomReal[]}, {x, 1, 2000000, 2}];

(m3 = mJoin3[a, b]) // Length // Timing
(m2 = mJoin2[a, b]) // Length // Timing

{0.220412, 333334}
{0.580442, 333334}
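
Whether a table is packed can be checked directly. A minimal sketch using Developer`PackedArrayQ and Developer`ToPackedArray; the ten-row table and the name packed are only for illustration.

```mathematica
a = Table[{x, RandomReal[]}, {x, 1, 10}];

(* Integer ids mixed with Real values cannot be stored as a packed array *)
Developer`PackedArrayQ[a]
(* False *)

(* converting every entry to a machine real makes packing possible *)
packed = Developer`ToPackedArray[N[a]];
Developer`PackedArrayQ[packed]
(* True *)
```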

Note: My original solution can be a bit of a memory hog. I believe it was slower in Ian Schumacher's test because it filled the RAM and led to some swapping.

I might have misunderstood what you are trying to do, but I would have thought that Cases is a better option and that a looping approach is not ideal.

Here is a one-liner that seems to do what I think you want. Firstly, if you are looking for an index that is in both lists, then you only need to iterate over one list and find the indices that are also present in the other list.

Here is a better test dataset, since not all of the indices in a are in b.

a = Table[{x, RandomReal[]}, {x, 1, 1000}]; 
b = Table[{x, RandomReal[]}, {x, 1, 1000, 2}];

And here is my function:

(Join[{##}, Cases[b, {#1, _}][[All, 2]]] & @@@ a) /. 
  {_, _} ->   Sequence[]

This goes through each element of a and selects all the elements of b that have the same index (first part), and keeps only the second part, which is the value. That's what the Part specification [[All,2]] is doing. I Join that with the corresponding actual a element, and then delete, using a replacement rule, the ones that are only two elements long, since that means that no element of b actually matched.
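
The individual steps can be followed on a three-row example. The values are made up for illustration, and the name joined is just a label for the intermediate result.

```mathematica
a = {{1, 0.1}, {2, 0.2}, {3, 0.3}};
b = {{1, 0.5}, {3, 0.7}};

(* values of b whose index matches a given index of a *)
Cases[b, {1, _}][[All, 2]]
(* {0.5} *)
Cases[b, {2, _}][[All, 2]]
(* {} *)

(* append the matching values to every row of a *)
joined = Join[{##}, Cases[b, {#1, _}][[All, 2]]] & @@@ a
(* {{1, 0.1, 0.5}, {2, 0.2}, {3, 0.3, 0.7}} *)

(* rows still of length two found no partner in b and are removed *)
joined /. {_, _} -> Sequence[]
(* {{1, 0.1, 0.5}, {3, 0.3, 0.7}} *)
```

One caveat of the replacement rule: if a had exactly two rows, the outer list itself would match {_, _} and be replaced, so this sketch deliberately uses three.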

I tested it on the long dataset above and got the following output:

{{1, 0.177197, 0.569452}, {3, 0.980658, 0.544697}, {5, 0.507622, 
  0.634173}, {7, 0.645986, 0.820293}, {9, 0.215669, 0.803831}, {11, 
  0.460078, 0.293823}, {13, 0.520429, 0.813139}, {15, 0.679199, 
  0.48138}...

It took about 0.1 seconds on my three-year-old PC.

As commented by @IanSchumacher, this solution is more appropriate for a left join. Using vLookup3 from here we have:

a = Table[{x, RandomReal[]}, {x, 1, 10000}];
b = Table[{x, RandomReal[]}, {x, 1, 10000, 2}];
vLookup3[a, b] // AbsoluteTiming // First

0.041213

For an inner join you can do:

DeleteCases[vLookup3[a, b],{__,Null}]
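
Since vLookup3 itself is not reproduced here, the difference between the two joins can be illustrated with a hypothetical stand-in. leftJoinSketch below is a name invented for this sketch and is not the linked function; it assumes two-column tables, unique ids in b, and a version with Association (10.0 or later).

```mathematica
(* hypothetical stand-in for a left join; NOT the linked vLookup3 *)
leftJoinSketch[a_, b_] :=
  With[{lookup = Association[Rule @@@ b]},
   Append[#, Lookup[lookup, First[#], Null]] & /@ a];

a = {{1, 0.1}, {2, 0.2}, {3, 0.3}};
b = {{1, 0.5}, {3, 0.7}};

(* left join: every row of a is kept, unmatched rows get Null *)
left = leftJoinSketch[a, b]
(* {{1, 0.1, 0.5}, {2, 0.2, Null}, {3, 0.3, 0.7}} *)

(* inner join: drop the rows that found no match *)
DeleteCases[left, {__, Null}]
(* {{1, 0.1, 0.5}, {3, 0.3, 0.7}} *)
```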
