Filter file by line number

I'd use awk, but without storing the whole content of L.txt in memory and without doing unnecessary hash look-ups ;-).

list=L.txt file=F.txt
LIST="$list" awk '
  function nextline() {
    if ((getline n < list) <=0) exit
  }
  BEGIN{
    list = ENVIRON["LIST"]
    nextline()
  }
  NR == n {
    print
    nextline()
  }' < "$file"

grep -n | sort | sed | cut

(   export LC_ALL=C
    grep -n ''   | sort -t:  -nmk1,1 ./L - |
    sed /:/d\;n  | cut  -sd: -f2-
)   <./F

That should work pretty quickly (some timed tests are included below) with input of any size. Some notes on how each stage works (a stage-by-stage sketch of the intermediate streams follows the list):

  • export LC_ALL=C
    • Because the point of the following operation is to get the entire ./F file stacked up inline with the line numbers from its ./L file, the only characters we really need to worry about are the ASCII [0-9]digits and the :colon.
    • For that reason it is simpler to look for those 11 characters among a set of 128 possibilities than it would be if UTF-8 were otherwise involved.
  • grep -n ''
    • This prepends every line on stdin - or <./F - with its line number followed by a :colon.
  • sort -t: -nmk1,1 ./L -
    • sort does not actually sort its input files at all; instead it (correctly) presumes they are presorted and -merges them in -numerically sorted order, ignoring basically anything beyond the -k1,1st field delimited by the first -t:colon anyway.
    • While this may require some temp space (depending on how far apart some sequences occur), it will not require much compared to a proper sort, and it will be very fast because it involves zero backtracking.
    • sort will output a single stream where any lineno in ./L immediately precedes the corresponding line in ./F. ./L's lines always come first because, when the numeric keys tie, sort falls back to a whole-line byte comparison, and a bare lineno is a prefix of - and so sorts before - its matching numbered line.
  • sed /:/d\;n
    • If the current line matches a /:/colon, delete it from output and start the next cycle. Otherwise n prints the current line, reads the next line into the pattern space, and that next line is auto-printed at the end of the cycle without ever being tested against /:/.
    • And so sed prunes sort's output down to sequential pairs: a colon-less line from ./L and then the line which immediately follows it in the merge.
  • cut -sd: -f2-
    • cut -suppresses from output those of its input lines which do not contain at least one of its -d:elimiter characters - and so ./L's lines are pruned completely.
    • For those lines which do, their first :colon-delimited -field is cut away - and with it goes every lineno that grep inserted.
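
Here is the stage-by-stage sketch promised above, run against a hypothetical 5-line ./F and a ./L listing lines 1, 4 and 5 (the same shape of data as the small test below - the sample files here are only an assumption for illustration). The expected output of each stage is shown in the comments:

export LC_ALL=C
seq 5 | sed 's/.*/a-z &\& 0-9/' >./F    # hypothetical sample data
printf '%s\n' 1 4 5 >./L                # sorted linenos to keep

grep -n '' <./F                         # every ./F line gets its lineno: stuck on its head
#   1:a-z 1& 0-9
#   2:a-z 2& 0-9
#   ...

grep -n '' <./F | sort -t: -nmk1,1 ./L -    # ./L's bare linenos merge in just ahead of their lines
#   1
#   1:a-z 1& 0-9
#   2:a-z 2& 0-9
#   3:a-z 3& 0-9
#   4
#   4:a-z 4& 0-9
#   5
#   5:a-z 5& 0-9

grep -n '' <./F | sort -t: -nmk1,1 ./L - |
sed /:/d\;n                             # only ./L's lines and the line right after each survive
#   1
#   1:a-z 1& 0-9
#   4
#   4:a-z 4& 0-9
#   5
#   5:a-z 5& 0-9

grep -n '' <./F | sort -t: -nmk1,1 ./L - |
sed /:/d\;n | cut -sd: -f2-             # drop the colon-less ./L lines, strip the linenos
#   a-z 1& 0-9
#   a-z 4& 0-9
#   a-z 5& 0-9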

small input test

seq 5 | sed -ne'2,3!w /tmp/L
        s/.*/a-z &\& 0-9/p' >/tmp/F

...generates 5 lines of sample input - writing the linenos outside the 2,3 range (that is, 1, 4 and 5) to /tmp/L, and all 5 transformed lines to /tmp/F. Then...

(   export LC_ALL=C; </tmp/F \
    grep -n ''   | sort -t:  -nmk1,1 /tmp/L - |
    sed /:/d\;n  | cut  -sd: -f2-
)|  head - /tmp/[FL]

...prints...

==> standard input <==
a-z 1& 0-9
a-z 4& 0-9
a-z 5& 0-9

==> /tmp/F <==
a-z 1& 0-9
a-z 2& 0-9
a-z 3& 0-9
a-z 4& 0-9
a-z 5& 0-9

==> /tmp/L <==
1
4
5

bigger timed tests

I created a couple of pretty large files:

seq 5000000 | tee /tmp/F |
sort -R | head -n1500000 |
sort -n >/tmp/L

...which put 5mil lines in /tmp/F and 1.5mil randomly selected lines of that into /tmp/L. I then did:

time \
(   export LC_ALL=C
    grep -n ''   | sort -t:  -nmk1,1 /tmp/L - |
    sed /:/d\;n  | cut  -sd: -f2-
)   </tmp/F | wc -l

It printed:

1500000
grep -n '' \
    0.82s user 0.05s system 73% cpu 1.185 total
sort -t: -nmk1,1 /tmp/L - \
    0.92s user 0.11s system 86% cpu 1.185 total
sed /:/d\;n \
    1.02s user 0.14s system 98% cpu 1.185 total
cut -sd: -f2- \
    0.79s user 0.17s system 80% cpu 1.184 total
wc -l \
    0.05s user 0.07s system 10% cpu 1.183 total

(I added the backslashes there)

Among the solutions currently offered here, this is the fastest of them all but one when pitted against the dataset generated above on my machine. Of the others, only meuh's perl came close to contending for second place.

This is by no means the original solution offered - it has dropped a third of its execution time thanks to advice/inspiration offered by others. See the post history for slower solutions (but why?).

Also, it is worth noting that some other answers might very well contend better if it were not for the multi-CPU architecture of my system and the concurrent execution of each of the processes in that pipeline. They all work at the same time - each on its own processor core - passing the data around and each doing its small part of the whole. It's pretty cool.

but the fastest solution is...

But it is not the fastest solution. The fastest solution offered here, hands-down, is the C program. I called it cselect. After copying it to my X clipboard, I compiled it like:

xsel -bo | cc -xc - -o cselect

I then did:

time \
    ./cselect /tmp/L /tmp/F |
wc -l

...and the results were...

1500000
./cselect /tmp/L /tmp/F  \
    0.50s user 0.05s system 99% cpu 0.551 total
wc -l \
    0.05s user 0.05s system 19% cpu 0.551 total

I'd use awk:

awk 'NR==FNR {a[$1]; next}; FNR in a' L.txt F.txt
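
For instance, against a small hypothetical sample (file names and contents assumed here just for illustration):

seq 5 >F.txt                  # F.txt: five lines, "1" through "5"
printf '%s\n' 1 4 5 >L.txt    # L.txt: the line numbers to keep

awk 'NR==FNR {a[$1]; next}; FNR in a' L.txt F.txt

...which prints lines 1, 4 and 5 of F.txt:

1
4
5

Unlike the line-by-line getline approach at the top, this one does not require L.txt to be sorted, since every wanted number already sits in the a[] hash before F.txt is read.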

Update: I've done some performance measurements; it seems this version scales even better with very large data sets (as is the case with the stated requirements), since the per-line comparison is very fast and more than compensates for the effort needed to build up the hash table.