Read random lines from huge CSV file in Python

import os
import random

path = 'really_big_file'
filesize = os.path.getsize(path)     # size of the really big file, in bytes
offset = random.randrange(filesize)

with open(path, 'rb') as f:          # binary mode: seeking to an arbitrary byte offset is well defined
    f.seek(offset)                   # go to random position
    f.readline()                     # discard - bound to be partial line
    random_line = f.readline()       # bingo! (bytes; call .decode() if you need str)

    # extra to handle last/first line edge cases
    if len(random_line) == 0:        # we have hit the end
        f.seek(0)
        random_line = f.readline()   # so we'll grab the first line instead

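As @AndreBoos pointed out, this approach leads to biased selection: a line is picked whenever the seek lands anywhere in the line before it, so lines that follow long lines are chosen more often. If you know the minimum and maximum line lengths, you can remove this bias as follows: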

Let's assume (in this case) we have min=3 and max=15

1) Find the length (Lp) of the previous line.

Then if Lp = 3, the line is most biased against, so we should take it 100% of the time. If Lp = 15, the line is most biased towards, so we should take it only 20% of the time, as it is 5 times more likely to be selected.

We accomplish this by randomly keeping the line with probability X, where:

X = min / Lp

If we don't keep the line, we do another random pick until our dice roll comes good. :-)
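
The correction described above can be sketched as rejection sampling. This is only an illustration under some assumptions: the names `line_containing` and `unbiased_random_line` are made up for this example, the file is non-empty and read in binary mode, lengths are counted in bytes including the trailing newline, and `min_len` is no larger than the shortest line in the file.

```python
import os
import random


def line_containing(f, offset, chunk=4096):
    """Return the full line (bytes) that contains byte `offset`.

    Leaves the file position at the start of the following line.
    """
    pos = offset
    start = 0  # default: the line starts at the beginning of the file
    # Scan backward in chunks for the newline that precedes `offset`.
    while pos > 0:
        step = min(chunk, pos)
        f.seek(pos - step)
        block = f.read(step)
        idx = block.rfind(b"\n")
        if idx != -1:
            start = pos - step + idx + 1
            break
        pos -= step
    f.seek(start)
    return f.readline()


def unbiased_random_line(path, min_len, rng=random):
    """Pick a line uniformly at random using seek plus rejection.

    Assumes min_len <= byte length of every line (hypothetical parameter).
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        while True:
            offset = rng.randrange(size)
            landed = line_containing(f, offset)  # the "previous" line, length Lp
            candidate = f.readline()             # the line we would return
            if not candidate:                    # landed in the last line: wrap around
                f.seek(0)
                candidate = f.readline()
            # Keep the candidate with probability X = min / Lp, else try again.
            if rng.random() < min_len / len(landed):
                return candidate
```

With min=3 and max=15 as above, a candidate that follows a 15-byte line is kept 3/15 = 20% of the time, while one that follows a 3-byte line is always kept.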

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it

Assuming you don't need exactly 1 million lines and know the number of lines in your CSV file beforehand, you can retrieve your random subset in a single pass. Simply iterate through your data and select each line independently with a fixed probability. This is often loosely called reservoir sampling, though strictly it is Bernoulli sampling: the number of lines returned is close to, but not exactly, the desired count.

This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you'll only need to keep track of the number of entries each time the dataset changes).

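import csv
from random import random

chances_selected = desired_num_results / total_entries  # true division (Python 3)
result = []
for line in csv.reader(file):  # file: an already opened file object
    if random() < chances_selected:
        result.append(line)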
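
The `total_entries` value has to be computed once per version of the dataset. A minimal sketch of one way to count it, assuming one record per physical line (CSV fields containing embedded newlines would make this overcount); `count_lines` and `chunk_size` are names invented for this example:

```python
def count_lines(path, chunk_size=1 << 20):
    """Count lines by reading the file in fixed-size binary chunks."""
    count = 0
    last = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            count += block.count(b"\n")
            last = block
    # A final line without a trailing newline still counts as a line.
    if last and not last.endswith(b"\n"):
        count += 1
    return count
```

The result can be cached alongside the file and recomputed only when the file changes.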

You can use a variation of the probabilistic method for choosing a random line in a file.

Instead of just keeping a single line that gets chosen, you can keep a buffer of size C. For each line number, n, in the file with N lines, you want to choose that line with probability C/n (rather than the original 1/n). If the line is selected, you then choose a random location in the C-length buffer to evict.

Here's how it works:

import random

C = 2                                # number of lines to keep
fpath = 'somelines.txt'
buffer = []

with open(fpath, 'r') as f:
    for n, line in enumerate(f, start=1):
        if n <= C:
            # fill the buffer with the first C lines
            buffer.append(line.strip())
        elif random.random() < C / n:
            # keep line n with probability C/n, evicting a random entry
            loc = random.randint(0, C - 1)
            buffer[loc] = line.strip()

This requires a single pass through the file (so it's linear time) and returns exactly C lines from the file, provided the file has at least C lines. Each line will have probability C/N of being selected.

To verify that the above works, I created a file with 5 lines containing a,b,c,d,e. I ran the code 10,000 times with C=2. This should produce a roughly even distribution over the 5 choose 2 (so 10) possible pairs. The results:

a,b: 1046
b,c: 1018
b,e: 1014
a,c: 1003
c,d: 1002
d,e: 1000
c,e: 993
a,e: 992
a,d: 985
b,d: 947
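
A check of this kind can be reproduced with a small harness. The sketch below is an assumption about how such a test might look, not the code that produced the numbers above: it samples from an in-memory list instead of a file, and `reservoir_sample` and `pair_counts` are hypothetical helper names.

```python
import random
from collections import Counter


def reservoir_sample(lines, c, rng=random):
    """Keep c items from an iterable, each c-subset being equally likely."""
    buffer = []
    for n, line in enumerate(lines, start=1):
        if n <= c:
            buffer.append(line)                 # fill the buffer first
        elif rng.random() < c / n:
            buffer[rng.randrange(c)] = line     # evict a random entry
    return buffer


def pair_counts(items, c=2, trials=10_000, seed=None):
    """Run the sampler `trials` times and count how often each subset appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        key = ",".join(sorted(reservoir_sample(items, c, rng)))
        counts[key] += 1
    return counts


counts = pair_counts(["a", "b", "c", "d", "e"], seed=0)
for pair, count in counts.most_common():
    print(f"{pair}: {count}")
```

Each of the 10 pairs should appear close to 1,000 times; the exact counts depend on the seed.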
