Ruby: What's an elegant way to pick a random line from a text file?

There is already a random entry selector built into the Ruby Array class: sample().

def pick_random_line
  File.readlines("data.txt").sample
end
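Note that readlines loads every line into memory before sample picks one, which is fine for small files. As a minimal sketch of a slightly more reusable variant (the helper name pick_random_line_from and its path parameter are made up for illustration, and chomp: true assumes Ruby 2.4 or newer):

```ruby
# Hypothetical helper: takes the path as an argument instead of hard-coding it.
def pick_random_line_from(path)
  # chomp: true strips the trailing line terminator from every line.
  # Array#sample returns nil when the file has no lines.
  File.readlines(path, chomp: true).sample
end
```

Calling pick_random_line_from("data.txt") then returns one line without its trailing newline.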

You can also do it in a single pass without loading the whole file into memory, storing nothing except the most recently read line and the current candidate for the returned random line.

def pick_random_line
  chosen_line = nil
  File.foreach("data.txt").each_with_index do |line, number|
    chosen_line = line if rand < 1.0/(number+1)
  end
  return chosen_line
end

So the first line is chosen with probability 1/1 = 1; the second line is chosen with probability 1/2, so half the time it keeps the first one and half the time it switches to the second.

Then the third line is chosen with probability 1/3, so 1/3 of the time it picks the third line, and the other 2/3 of the time it keeps whichever of the first two it had picked. Since each of those had a 1/2 chance of being the pick as of line 2, each survives with probability 1/2 * 2/3 = 1/3, so all three lines have a 1/3 chance of being chosen as of line 3.

And so on. At line N, each of lines 1 through N has an equal 1/N chance of being the current pick, and that holds all the way through the file (as long as the file isn't so huge that 1/(number of lines in file) is less than epsilon :)). You make only one pass through the file and never store more than two lines at once.
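If you want to convince yourself of the 1/N claim, here is a rough empirical check. It applies the same selection rule to an in-memory array rather than a file; the helper name reservoir_pick, the four sample lines, and the trial count are all made up for illustration:

```ruby
# Same selection rule as above, applied to any enumerable of lines
# (hypothetical helper name, for illustration only).
def reservoir_pick(lines)
  chosen = nil
  lines.each_with_index do |line, number|
    # Replace the current pick with probability 1/(number + 1).
    chosen = line if rand < 1.0 / (number + 1)
  end
  chosen
end

lines  = %w[a b c d]
trials = 40_000
counts = Hash.new(0)
trials.times { counts[reservoir_pick(lines)] += 1 }

# Each line should land somewhere near trials / lines.size = 10_000 picks.
counts.sort.each { |line, count| puts "#{line}: #{count}" }
```

The exact counts vary from run to run, but all four should come out roughly equal.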

EDIT: You're not going to get a really concise solution with this algorithm, but you can turn it into a one-liner if you want to:

def pick_random_line
  File.foreach("data.txt").each_with_index.reduce(nil) { |picked,pair| 
    rand < 1.0/(1+pair[1]) ? pair[0] : picked }
end

This function does exactly what you need.

It's not a one-liner, but it works with text files of any size (except zero size, maybe :).

def random_line(filename)
  blocksize, line = 1024, ""
  File.open(filename) do |file|
    initial_position = rand(File.size(filename)-1)+1 # random pointer position. Not a line number!
    pos = Array.new(2).fill( initial_position ) # array [prev_position, current_position]
    # Find beginning of current line
    begin
      pos.push([pos[1]-blocksize, 0].max).shift # calc new position
      file.pos = pos[1] # move pointer backward within file
      offset = (n = file.read(pos[0] - pos[1]).rindex(/\n/) ) ? n+1 : nil
    end until pos[1] == 0 || offset
    file.pos = pos[1] + offset.to_i
    # Collect line text till the end
    begin
      data = file.read(blocksize)
      newline_at = data.index("\n") # nil if this block contains no newline
      line.concat(newline_at ? data[0, newline_at] : data)
    end until file.eof? || newline_at
  end
  line
end

Try it:

filename = "huge_text_file.txt"
100.times { puts random_line(filename).force_encoding("UTF-8") }

Negligible (imho) drawbacks:

  1. the longer the line, the higher the chance it'll be picked.

  2. doesn't take into account the "\r" character (Windows-specific): in a file with "\r\n" line endings, the returned line keeps a trailing "\r". Use files with Unix-style line endings!
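To get a feel for how large drawback 1 can be, here is a simplified simulation. It does not call random_line; it only models the idea of picking a uniformly random byte and returning the line that contains it, using a made-up two-line string as a stand-in for the file:

```ruby
# Hypothetical stand-in for a file: one short line and one long line.
lines  = ["ab\n", "abcdefgh\n"] # 3 bytes and 9 bytes
text   = lines.join
trials = 12_000
counts = Hash.new(0)

trials.times do
  offset = rand(text.bytesize) # uniform over bytes, not over lines
  # Index of the line containing this byte: count the newlines before it.
  # (The text is pure ASCII, so character and byte offsets coincide.)
  index = text[0, offset].count("\n")
  counts[index] += 1
end

# The 9-byte line should be picked roughly three times as often as the 3-byte one.
lines.each_with_index { |line, i| puts "#{line.chomp}: #{counts[i]}" }
```

So the bias is only negligible when the lines are of broadly similar length; with very uneven lines, the single-pass approach with File.foreach is the safer choice.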
