Filter 500 files with awk, then cat results to single file

Your code overwrites the output file in each iteration. You also do not actually call awk.

What you want to do is something like

awk '$5 >= 0.5' ./*.imputed.*_info >snplist.txt

This calls awk with all your files at once, and awk goes through them one by one, in the order in which the shell expands the globbing pattern. If the 5th column of any line in a file is greater than or equal to 0.5, that line is printed (into snplist.txt). This works because the default action, if no action ({...} block) is associated with a condition, is to print the current line.
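
To see the default action at work, here is a minimal sketch. The sample file name and its contents are made up for illustration; only the awk condition comes from the command above.

```shell
# Hypothetical sample data: five whitespace-separated columns per line.
tmpdir=$(mktemp -d)
printf '%s\n' 'rs1 A G 0.1 0.91' 'rs2 C T 0.2 0.32' 'rs3 G A 0.3 0.50' >"$tmpdir/sample_info"

# Condition only: awk prints every line for which the condition is true.
implicit=$(awk '$5 >= 0.5' "$tmpdir/sample_info")

# The same condition with the print action written out explicitly.
explicit=$(awk '$5 >= 0.5 { print $0 }' "$tmpdir/sample_info")

printf '%s\n' "$implicit"
rm -r "$tmpdir"
```

Both invocations should select the same lines (here the rs1 and rs3 lines), so the shorter form is enough.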

In cases where you have a large number of files (many thousands), this may generate an "Argument list too long" error. In that case, you may want to loop:

for filename in ./*.imputed.*_info; do
    awk '$5 >= 0.5' "$filename"
done >snplist.txt

Note that the result of awk does not need to be stored in a variable. Here, it's just printed, and the output of the loop (and therefore of all commands inside the loop) is redirected into snplist.txt.
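
The difference between redirecting inside the loop (which overwrites the file on every iteration) and redirecting the loop as a whole can be sketched like this. The two input files and the names inside.txt and outside.txt are hypothetical, chosen only for the demonstration.

```shell
# Hypothetical sample files, each with one matching line.
tmpdir=$(mktemp -d)
printf '%s\n' 'a 1 1 1 0.9' >"$tmpdir/one.imputed.x_info"
printf '%s\n' 'b 1 1 1 0.8' >"$tmpdir/two.imputed.x_info"

# Redirecting inside the loop truncates the output file on every
# iteration, so only the matches from the last file survive.
for filename in "$tmpdir"/*.imputed.*_info; do
    awk '$5 >= 0.5' "$filename" >"$tmpdir/inside.txt"
done

# Redirecting the loop as a whole opens the output file only once.
for filename in "$tmpdir"/*.imputed.*_info; do
    awk '$5 >= 0.5' "$filename"
done >"$tmpdir/outside.txt"

inside_lines=$(awk 'END { print NR }' "$tmpdir/inside.txt")
outside_lines=$(awk 'END { print NR }' "$tmpdir/outside.txt")
printf 'inside: %s line(s), outside: %s line(s)\n' "$inside_lines" "$outside_lines"
rm -r "$tmpdir"
```

Using >> inside the loop would also keep all lines, but then the file has to be emptied before each run.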

For many thousands of files, this would be quite slow since awk would need to be invoked for each of them individually.

To speed things up, in the cases where you have too many files for a single invocation of awk, you may consider using xargs like so:

printf '%s\0' ./*.imputed.*_info | xargs -0 awk '$5 >= 0.5' >snplist.txt

This would create a list of filenames with printf and pass them off to xargs as a nul-terminated list. The xargs utility would take these and start awk with as many of them as possible at once, in batches. The output of the whole pipeline would be redirected to snplist.txt.
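
The batching can be made visible with a small sketch. The -n 2 option is used here only to force tiny batches for illustration (normally xargs packs as many arguments as fit), and the file names are made up. As above, this assumes an xargs with the non-standard -0 option.

```shell
# Five hypothetical names, one of them containing a space.
# Each batch is handed to a tiny sh script that reports what it received.
batches=$(printf '%s\0' 'a b' c d e f |
    xargs -0 -n 2 sh -c 'echo "batch of $#: $*"' sh)

printf '%s\n' "$batches"
```

The name with the space stays a single argument because the list is nul-terminated, and the five names end up spread over three invocations.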

This xargs alternative assumes that you are using a Unix, such as Linux, whose xargs command implements the non-standard -0 option to read nul-terminated input. It also assumes that you are using a shell, like bash, that has a built-in printf utility: an external printf would have to be executed with the whole file list as arguments and would hit the same "Argument list too long" limit. (ksh, the default shell on OpenBSD, would not work here as it has no such built-in utility.)


For the zsh shell (i.e. not bash):

autoload -U zargs
zargs -- ./*.imputed.*_info -- awk '$5 >= 0.5' >snplist.txt

This uses zargs, which is basically a reimplementation of xargs as a loadable zsh shell function. See zargs --help (after loading the function) and the zshcontrib(1) manual for further information about that.


Just do this:

awk '$5 >= .5' *.imputed.*_info > snplist.txt

I have a habit of using find for this kind of thing.

find . -type f -name "*.imputed.*_info" -exec awk '$5 >= 0.5' {} \; > ./snplist.txt
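
Note that the \; form runs awk once per file, and that find, unlike the glob, also looks in subdirectories. A possible variant, sketched here on made-up sample files in a temporary directory, ends -exec with + so that find passes many pathnames to each awk invocation, much like xargs does:

```shell
# Hypothetical sample files; two of the three lines match.
tmpdir=$(mktemp -d)
printf '%s\n' 'a 1 1 1 0.9' 'b 1 1 1 0.1' >"$tmpdir/one.imputed.x_info"
printf '%s\n' 'c 1 1 1 0.7' >"$tmpdir/two.imputed.x_info"

# "{} +" batches the found pathnames instead of starting awk per file.
find "$tmpdir" -type f -name '*.imputed.*_info' \
    -exec awk '$5 >= 0.5' {} + >"$tmpdir/snplist.txt"

matches=$(awk 'END { print NR }' "$tmpdir/snplist.txt")
printf 'matching lines: %s\n' "$matches"
rm -r "$tmpdir"
```

The order in which find visits the files is not guaranteed, so the lines in snplist.txt may come out in a different order than with the glob.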

Tags: Bash, Awk, Cat