Replace a string in a huge (70GB), one-line text file

For such a big file, one possibility is Flex. Let unk.l be:

%%
\<unk\>     printf("<raw_unk>");  
%%

Then compile and execute:

$ flex -o unk.c unk.l
$ cc -o unk -O2 unk.c -lfl
$ ./unk < corpus.txt > corpus.txt.new
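
Before running it over 70GB, it's worth sanity-checking the filter on a tiny sample. A sketch, assuming the binary was built as above:

$ printf 'a <unk> b <unk>\n' | ./unk
a <raw_unk> b <raw_unk>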

The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).

If there's an ASCII character that appears frequently in the file and doesn't appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn't care about any record size. Supposing that ; works:

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new
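
Whether ; is actually a good separator can be checked up front. As a rough sketch (wc -L, which prints the length of the longest line, is a GNU extension):

$ tr -cd ';' < corpus.txt | wc -c
$ tr '\n;' ';\n' < corpus.txt | wc -L

The first command counts how many ; there are, hence how many records the swap produces; the second shows the longest line that sed would have to hold in memory.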

You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and that it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… so that the substitution skips the first line, which doesn't correspond to a < in the original, and you avoid a spurious match.

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble, since the swapped stream then ends with a newline and sed never sees an incomplete last line.
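
Two quick checks may help here, as a sketch: look at the last byte of the file, and see whether your sed passes through a final line that lacks a newline:

$ tail -c 1 corpus.txt | od -c
$ printf 'abc' | sed 's/c/C/' | od -c

If the second command shows abC with no trailing \n, your sed processes an incomplete last line without appending a newline (GNU sed does).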


So you don't have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.

The necessary operations are all included in Python. There are several annoying subtleties, but it avoids having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error reporting for free (Python exceptions) :).

#!/usr/bin/python3
# This script takes input from stdin
# (but it must be a regular file, to support mapping it),
# and writes the result to stdout.

search = b'<unk>'
replace = b'<raw_unk>'


import sys
import os
import mmap

# sys.stdout requires str, but we want to write bytes
out_bytes = sys.stdout.buffer

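# Map the whole file read-only (length 0 means "map to the end of the file").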
mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
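# Only the first occurrence of the search string gets replaced.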
i = mem.find(search)
if i < 0:
    sys.exit("Search string not found")

# mmap object subscripts to bytes (making a copy)
# memoryview object subscripts to a memoryview object
# (it implements the buffer protocol).
view = memoryview(mem)

out_bytes.write(view[:i])
out_bytes.write(replace)
out_bytes.write(view[i+len(search):])
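
Saved as, say, replace.py (the name is arbitrary), it can be run with stdin redirected from the file; it must be a redirection rather than a pipe, since mmap needs a regular file:

$ python3 replace.py < corpus.txt > corpus.txt.new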