Basic grep/awk help - extracting all lines containing a list of terms from one file into a separate file

To extract the lines from data.txt with the genes listed in genelist.txt:

grep -w -F -f genelist.txt data.txt > newdata.txt

grep options used:

  • -w tells grep to match whole words only (i.e. so ABC123 won't also match ABC1234).
  • -F search for fixed strings (plain text) rather than regular expressions
  • -f genelist.txt read search patterns from the file
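
To see what -w buys you, here is a minimal sketch on made-up data (the gene IDs and values are hypothetical, not taken from your files). It runs in a scratch directory so nothing real gets overwritten:

```shell
# Work in a throwaway directory; all file contents below are invented.
cd "$(mktemp -d)"
printf '%s\n' 'ABC123  1.0  2.0' 'ABC1234  3.0  4.0' 'XYZ9  5.0  6.0' > data.txt
printf '%s\n' 'ABC123' 'XYZ9' > genelist.txt

# Whole-word, fixed-string match against every pattern in genelist.txt.
grep -w -F -f genelist.txt data.txt > newdata.txt
cat newdata.txt
# ABC123  1.0  2.0
# XYZ9  5.0  6.0

# Without -w, ABC123 also matches inside ABC1234, so all three lines match.
grep -c -F -f genelist.txt data.txt
# 3
```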

If you want the header (Sample 1, Sample 2, etc) line as well:

grep -w -F -f genelist.txt -e Sample data.txt > newdata.txt
  • -e Sample also search for "Sample"
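
A small sketch of the header variant, again on invented data (I'm assuming the header line contains the word "Sample"; the layout is hypothetical):

```shell
# Scratch directory with a made-up header line and two data lines.
cd "$(mktemp -d)"
printf '%s\n' 'Gene  Sample 1  Sample 2' 'ABC123  1.0  2.0' 'DEF456  3.0  4.0' > data.txt
printf '%s\n' 'ABC123' > genelist.txt

# Patterns from -f and -e are combined: a line is kept if it matches either.
grep -w -F -f genelist.txt -e Sample data.txt > newdata.txt
cat newdata.txt
# Gene  Sample 1  Sample 2
# ABC123  1.0  2.0
```

Keep in mind that -e Sample keeps every line containing the word "Sample", not just the header, so this only works cleanly if that word doesn't appear in the data lines.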

To find lines in genelist.txt that aren't in newdata.txt:

grep -v -w -F -f <(sed -E -e 's/(\t|  +).*//' newdata.txt) genelist.txt
  • -v invert the search, print non-matching lines.

The rest of the grep options are the same, but instead of using a file with the -f option, it's using something called process substitution (a feature of shells such as bash, zsh, and ksh), which allows you to use a command in place of an actual file. Whatever output the command creates is treated as the "file"'s contents.

In this case, we're using the command sed -E -e 's/(\t|  +).*//' newdata.txt, which outputs each line of newdata.txt after first deleting everything from either the first TAB character or the first pair of spaces it sees. In other words, it prints only the first field (e.g. "Gene A"). I had to use TAB or double space because a) I wasn't sure if your data was space-separated or TAB-separated and b) the first fields in your example contained spaces.

sed options used:

  • -E use extended regular expressions, so we can use plain (, ), and + which are more readable than having to escape them with \ as \(, \), \+.
  • -e 's/(\t|  +).*//' specifies the sed script to apply against the input (newdata.txt)

Running that command on your sample data.txt would produce the following output:

$ sed -E -e 's/(\t|  +).*//' data.txt

Gene A
Gene B
Gene C
Gene D

Anyway, the output of that sed command is used as the list of search patterns by the grep command.
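
If your shell doesn't support process substitution, one way to sketch the same check is with an explicit intermediate file. The data below is invented, found.txt is just a name I made up, and I'm assuming columns separated by two spaces:

```shell
# Scratch directory; gene names and values are hypothetical.
cd "$(mktemp -d)"
printf '%s\n' 'Gene A  1.0  2.0' 'Gene B  3.0  4.0' > newdata.txt
printf '%s\n' 'Gene A' 'Gene B' 'Gene E' > genelist.txt

# Step 1: keep only the first field of each extracted line.
sed -E -e 's/(\t|  +).*//' newdata.txt > found.txt

# Step 2: print the genelist.txt entries that are NOT among the found names.
grep -v -w -F -f found.txt genelist.txt
# Gene E
```

The process substitution version does exactly this, minus the temporary file.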


To actually answer your question:

fgrep -w -f genelist.txt data.txt >results.txt
  • fgrep looks for fixed strings, rather than regular expressions (as grep and egrep do); it is equivalent to grep -F, which recent GNU grep releases recommend using instead
  • -w tells fgrep to match whole words, so ABC123 won't match ABC1234
  • -f genelist.txt tells fgrep to read search patterns from genelist.txt.

Seeing which genes from genelist.txt were not included in the extraction is a little more complicated. One way to do it:

awk '{ print $1 }' results.txt | fgrep -w -v -f - genelist.txt >outsiders.txt
  • awk '{ print $1 }' prints the first column in a text file; this is the list of matched genes
  • fgrep again matches fixed strings
  • -w tells fgrep to match whole words
  • -v tells it to print lines that don't match
  • -f - tells it to read the list of patterns from stdin, that is the list of matched genes from awk.

You can also make things a little more efficient by eliminating duplicates from the list of matched genes before searching, by inserting sort -u between awk and fgrep:

awk '{ print $1 }' results.txt | sort -u | fgrep -w -v -f - genelist.txt >outsiders.txt
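
Here is a rough end-to-end sketch of that pipeline on invented data. I've written grep -F in place of fgrep (they behave the same), and I'm assuming gene IDs without spaces so that awk's first column is the whole ID:

```shell
# Scratch directory; IDs and values are hypothetical.
cd "$(mktemp -d)"
printf '%s\n' 'ABC123 1 2' 'ABC123 3 4' 'XYZ9 5 6' > results.txt
printf '%s\n' 'ABC123' 'XYZ9' 'QRS7' > genelist.txt

# First column -> de-duplicated -> used as patterns read from stdin (-f -).
awk '{ print $1 }' results.txt | sort -u | grep -F -w -v -f - genelist.txt > outsiders.txt
cat outsiders.txt
# QRS7
```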

This is quite an undertaking without any previous Linux experience. However, I think I understand what you need, and it shouldn't be too difficult. Pardon me in advance: this is a very concise crash course with only a very basic explanation, but I'd be happy to expand on it in detail if it doesn't make sense, or edit as necessary.

If you simply want to copy the contents of data.txt into another file, you could use cat data.txt >> newfile.txt. (newfile.txt is the other file you mentioned it going to; the name is arbitrary.) Note that >> appends to the end of the file, whereas > would overwrite it.

If you want to print out the lines for a specific name, you could use cat data.txt | grep ABCD123 >> newfile.txt and change ABCD123 to whatever you want.

This command will ONLY output the lines found using grep (kind of like a "search" function, but it searches only by line.)

The "|" is called a pipe, and when coupled with the grep command it acts a little like a filter for whatever you're looking for. (cat zoofile.txt | grep pandas, for instance, will print all lines containing "pandas" in a file named zoofile.txt.) Note that grep IS CASE SENSITIVE by default and will only find EXACTLY what you put in; add -i to ignore case. grep also matches anywhere in a line, so if you want ALL instances of "panda", "pandas", "panderoons", or "pandering", searching for pand is enough. Be careful with *: in a grep pattern it is not a shell-style wildcard but means "zero or more of the preceding character", so pand* matches "pan" followed by any number of d's.
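
A quick sketch of how grep treats partial words, case, and * in practice. The file contents are made up for illustration:

```shell
# Scratch directory with a hypothetical zoofile.txt.
cd "$(mktemp -d)"
printf '%s\n' 'two pandas' 'Panda cub' 'pandering' 'frying pan' 'a lion' > zoofile.txt

# grep matches substrings, so "pand" finds "pandas" and "pandering".
# -c prints the number of matching lines instead of the lines themselves.
cat zoofile.txt | grep -c pand
# 2

# -i ignores case, so "Panda cub" is now counted as well.
cat zoofile.txt | grep -i -c pand
# 3

# In a regular expression, d* means "zero or more d", so this is really
# a search for "pan" and it picks up "frying pan" too.
cat zoofile.txt | grep -c 'pand*'
# 3
```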

You can use awk for more fancy column parsing (it's one of my favorite tools!) but it doesn't seem like it would fit here unless you ONLY want data from one of the columns based on certain parameters.

Finally, here is a good place to learn a bit about the command line. This may help with grep, but it doesn't cover awk.

https://www.codecademy.com/learn/learn-the-command-line

After that, this should cover awk in more detail. There are a lot of VERY expansive courses on awk, but they're easy to get lost in. This is a practical site that demonstrates more what you're looking to do.

https://www.ibm.com/developerworks/library/l-awk1/

EDIT - after re-reading, I may have missed something - are you looking to compare the two files and print out only things that match from one to the other? Please advise and provide an example and I'd be happy to edit my answer accordingly.
