Python dictionary with multiple keys pointing to same list in memory efficient way

You can try something like this:

list(filter(lambda x: "C 9772980" in x, data))

No need to make a mapping structure.


You are really in a trade-off space between the time/memory it takes to generate the dictionary and the time it takes to scan the entire data set with an on-the-fly method (a sketch of the dictionary side follows the generator example below).

If you want a low-memory method, you can use a function that searches each sublist for the value. Using a generator gets the initial results back to the user faster, but with large data sets there can be long waits between results.

data = [[
        "A 5408599",
        "B 8126880",
        "A 2003529",
    ],
    [
        "C 9925336",
        "C 3705674",
        "A 823678571",
        "C 3205170186",
    ],
    [
        "C 9772980",
        "B 8960327",
        "C 4185139021",
        "D 1226285245",
        "C 2523866271",
        "D 2940954504",
        "D 5083193",
    ]]


def find_list_by_value(v, data):
    # linear scan: yield every sublist that contains the value
    for sublist in data:
        if v in sublist:
            yield sublist

for s in find_list_by_value("C 9772980", data):
    print(s)
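
For comparison, the dictionary side of the trade-off mentioned above could look roughly like the sketch below. This is an assumption about what "generating the dictionary" means here: every value maps to the sublist that contains it, and because a dict stores references, all of the keys for a given sublist point at the same list object in memory rather than at copies of it.

# minimal sketch of the full lookup dictionary (assumed approach):
# each key references the existing sublist, so no list data is copied
# (if a value appears in more than one sublist, the last one wins here)
lookup = {}
for sublist in data:
    for value in sublist:
        lookup[value] = sublist

print(lookup["C 9772980"])              # the whole third sublist
print(lookup["C 9772980"] is data[2])   # True: same object, not a copy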

As mentioned in the comments, building a hash table based on just the first letter or the first 2 or 3 characters of each value might be a good place to start. This lets you build a candidate list of sublists, then scan only those candidates to see whether the value is in the sublist.

from collections import defaultdict

def get_key(v, size=3):
    # prefix of the value used as the hash key, e.g. "C 9" for size=3
    return v[:size]

def get_keys(sublist, size=3):
    # set of distinct prefixes that occur in a sublist
    return set(get_key(v, size) for v in sublist)

def find_list_by_hash(v, data, hash_table, size=3):
    key = get_key(v, size)
    candidate_indices = hash_table.get(key, set())
    # only scan the sublists whose prefix bucket contains the value's key
    for ix in candidate_indices:
        if v in data[ix]:
            yield data[ix]

# generate the small hash table
quick_hash = defaultdict(set)
for i, sublist in enumerate(data):
    for k in get_keys(sublist, 3):
        quick_hash[k].add(i)

# lookup a value by the small hash
for s in find_list_by_hash("C 9772980", data, quick_hash, 3):
    print(s)

In this code quick_hash will take some time to build, because you are scanning your entire data structure. However, the memory footprint will be much smaller. Your main parameter for tuning performance is size: a smaller size will have a smaller memory footprint, but will take longer when running find_list_by_hash because your candidate pool will be larger. You can do some testing to see what the right size is for your data. Just be mindful that all of your values must be at least as long as size.
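
As a rough way to do that testing, you can rebuild the table at a few different prefix lengths and compare how many buckets you get and how large the biggest candidate pool is. This is only a sketch, and the particular sizes tried here are an arbitrary choice:

# compare bucket counts and worst-case candidate pool size per prefix length
for size in (1, 2, 3, 4):
    table = defaultdict(set)
    for i, sublist in enumerate(data):
        for k in get_keys(sublist, size):
            table[k].add(i)
    biggest = max(len(indices) for indices in table.values())
    print(f"size={size}: {len(table)} buckets, largest candidate pool = {biggest}")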


Try this, using pandas:

import pandas as pd
df = pd.DataFrame(data)
rows = df.shape[0]
for row in range(rows):
    print(df.iloc[[row]])    # Do something with your data

This looks like a simple solution; even if your data grows big, pandas will handle it efficiently.
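
If the goal is still to find which sublists contain a given value, a rough pandas sketch using the DataFrame built above might look like this. The isin/any approach and the example value "C 9772980" are carried over from the earlier answers rather than from this one, and shorter sublists are padded with None by the DataFrame constructor:

# boolean mask of rows whose cells contain the target value
mask = df.isin(["C 9772980"]).any(axis=1)
for row_ix in df.index[mask]:
    print(data[row_ix])    # the original, unpadded sublist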
