Read n random lines from a potentially huge file

Ruby, 104 94 92 90 bytes

File name and number of lines are passed on the command line. For example, if the program is shuffle.rb and the file name is a.txt, run ruby shuffle.rb a.txt 3 for three random lines.

-4 bytes from discovering Ruby's open syntax, used here instead of File.new

f=open$*[0]
puts [*0..f.size/n=f.gets.size+1].sample($*[1].to_i).map{|e|f.seek n*e;f.gets}
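For readers who prefer it ungolfed, here is a rough sketch of the same seek-based idea. It assumes that every line has the same byte length as the first one and that the file ends with a newline. The method name random_lines is made up for illustration, and taking the record length from the first line read in binary mode is a choice of this sketch, not something copied from the golfed code.

```ruby
# Ungolfed sketch of the seek-based approach (hypothetical helper name).
# Assumption: every line, including the last, has the same byte length
# as the first line and ends with a newline.
def random_lines(path, count)
  File.open(path, "rb") do |f|
    record = f.gets.bytesize   # bytes per line, line terminator included
    total  = f.size / record   # number of lines in the file
    # Pick `count` distinct line numbers, then jump straight to each one.
    (0...total).to_a.sample(count).map do |i|
      f.seek(i * record)
      f.gets
    end
  end
end
```

Only the chosen lines are read from disk, but the list of candidate line numbers is still built in memory, just as in the golfed version.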

Also, here's an 85-byte anonymous function solution that takes a string and a number as its arguments.

->f,l{f=open f;puts [*0..f.size/n=f.gets.size+1].sample(l).map{|e|f.seek n*e;f.gets}}

Dyalog APL, 63 bytes

⎕NREAD¨t 82l∘,¨l×¯1+⎕?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍞⎕NTIE 0

Prompts for file name, then for how many random lines are desired.

Explanation

⍞ Prompt for text input (file name)
⎕NTIE 0 Tie the file using next available tie number (-1 on a clean system)
t← Store the chosen tie number as t
83 80,⍨ Append [83,80] yielding [-1,83,80]
⎕NREAD Read the first 80 bytes of file -1 as 8-bit integers (conversion code 83)
10⍳⍨ Find the index of the first number 10 (LF)
l← Store the line length as l
(⎕NSIZE t)÷ Divide the size of file -1 by the line length
⎕ Prompt for numeric input (desired number of lines)
? X random selections (without replacement) out of the first Y natural numbers
¯1+ Add -1 to get 0-origin line numbers*
l× Multiply by the line length to get the start bytes
t 82l∘,¨ Prepend [-1,82,LineLength] to each start byte (creates list of arguments for ⎕NREAD)
⎕NREAD¨ Read each line as 8-bit character (conversion code 82)
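The steps above can be paraphrased in Ruby for readers who do not read APL. This is only a sketch, not a translation checked against Dyalog. It assumes LF line endings, a first line shorter than 80 bytes, and equal-length lines; rand_lines is a hypothetical name.

```ruby
# Step-by-step paraphrase of the APL explanation (hypothetical helper name).
# Assumptions: LF line endings, first line shorter than 80 bytes,
# and every line has the same length.
def rand_lines(path, count)
  File.open(path, "rb") do |f|
    head   = f.read(80).bytes                  # first 80 bytes as integers
    length = head.index(10) + 1                # 1-origin position of the first LF
    lines  = f.size / length                   # file size divided by line length
    picks  = (1..lines).to_a.sample(count)     # distinct picks from 1..lines
    starts = picks.map { |k| length * (k - 1) } # 0-origin start byte of each line
    starts.map do |s|
      f.seek(s)
      f.read(length)                           # exactly one line, LF included
    end
  end
end
```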

Practical example

File /tmp/records.txt contains:

Hello
Think
12345
Klaus
Nilad

Make the program RandLines contain the above code verbatim by entering the following into the APL session:

∇RandLines
⎕NREAD¨t 82l∘,¨l×¯1+⎕?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍞⎕NTIE 0
∇

In the APL session type RandLines and press Enter.

The system moves the cursor to the next line, which is a 0-length prompt for character data; enter /tmp/records.txt.

The system now outputs ⎕: and awaits numeric input; enter 4.

The system outputs four random lines.
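To make the arithmetic behind this example concrete, here is a small Ruby sketch of the numbers involved. It assumes the file uses LF line endings and ends with a trailing newline, which the example does not state explicitly.

```ruby
# Arithmetic for the example file.
# Assumption: LF line endings and a trailing newline after the last line.
content = "Hello\nThink\n12345\nKlaus\nNilad\n"
length  = content.index("\n") + 1             # 6 bytes per line, LF included
lines   = content.bytesize / length           # 30 / 6 = 5 lines
starts  = (0...lines).map { |i| i * length }  # start byte of each line
p starts                                      # prints [0, 6, 12, 18, 24]
```

Asking for four lines therefore amounts to picking four of these five start bytes and reading six bytes at each.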

Real life

In reality, you may want to give filename and count as arguments and receive the result as a table. This can be done by entering:

RandLs←{↑⎕NREAD¨t 82l∘,¨l×¯1+⍺?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍵⎕NTIE 0}

Now you make MyLines contain three random lines with:

MyLines←3 RandLs'/tmp/records.txt'

How about returning just a single random line if count is not specified:

RandL←{⍺←1 ⋄ ↑⎕NREAD¨t 82l∘,¨l×¯1+⍺?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍵⎕NTIE 0}

Now you can do both:

MyLines←2 RandL'/tmp/records.txt'

and (notice absence of left argument):

MyLine←RandL'/tmp/records.txt'

Making code readable

Golfed APL one-liners are a bad idea. Here is how I would write it in a production system:

RandL←{ ⍝ Read X random lines from file Y without reading entire file
    ⍺←1 ⍝ default count
    tie←⍵⎕NTIE 0 ⍝ tie file
    length←10⍳⍨⎕NREAD 83 80,⍨tie ⍝ find first NL
    size←⎕NSIZE tie ⍝ total file length
    starts←length×¯1+⍺?size÷length ⍝ beginning of each line
    ↑⎕NREAD¨tie 82length∘,¨starts ⍝ read each line as character and convert list to table
}

* I could save a byte by running in 0-origin mode, which is standard on some APL systems: remove ¯1+ and insert 1+ before 10.

Haskell, 240 224 236 bytes

import Test.QuickCheck
import System.IO
g=hGetLine
main=do;f<-getLine;n<-readLn;h<-openFile f ReadMode;l<-(\x->1+sum[1|_<-x])<$>g h;s<-hFileSize h;generate(shuffle[0..div s l-1])>>=mapM(\p->hSeek h(toEnum 0)(l*p)>>g h>>=putStrLn).take n

Reads filename and n from stdin.

How it works:

main=do
  f<-getLine                   -- read file name from stdin
  n<-readLn                    -- read n from stdin
  h<-openFile f ReadMode       -- open the file
  l<-(\x->1+sum[1|_<-x])<$>g h -- read first line and bind l to its length +1
                               -- sum[1|_<-x] is a custom length function
                               -- because of type restrictions, otherwise I'd have
                               -- to use "toInteger.length"
  s<-hFileSize h               -- get file size
  generate(shuffle[0..div s l-1])>>=
                               -- shuffle all possible line numbers 
  mapM (\p->  ...  ).take n    -- for each of the first n shuffled line numbers 
     hSeek h(toEnum 0)(l*p)>>  -- jump to that line ("toEnum 0" is short for "AbsoluteSeek")
     g h>>=                    -- read a line from current position
     putStrLn                  -- and print

It takes a lot of time and memory to run this program for files with many lines, because of a horribly inefficient shuffle function.
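The cost comes from shuffling every possible line number before keeping only n of them. One alternative, sketched here in Ruby rather than Haskell, is Floyd's sampling algorithm, which picks k distinct numbers using memory proportional to k instead of to the number of lines. This is not what the answer above does; sample_indices is a hypothetical helper.

```ruby
require "set"

# Floyd's algorithm: choose k distinct integers from 0...n in O(k) memory.
# Hypothetical helper, shown as an alternative to shuffling the whole range.
def sample_indices(n, k, rng = Random.new)
  raise ArgumentError, "k must not exceed n" if k > n
  chosen = Set.new
  (n - k...n).each do |j|
    t = rng.rand(j + 1)                     # uniform integer in 0..j
    chosen.add(chosen.include?(t) ? j : t)  # on collision, take j instead
  end
  chosen.to_a.shuffle(random: rng)          # randomize the output order too
end
```

The returned numbers can then be multiplied by the line length to get seek positions, as in the answers above.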

Edit: I missed the "random without replacement" part (thanks @feersum for noticing!).

Tags:

Random

Code Golf

File System
