Printing lines from one file if part of them appears in another. Both files are millions of lines long

You can do this very easily using grep:

$ grep -Ff 123.txt 789.txt
http://www.a.com/kgjdk-jgjg/ 
http://www.b.com/gsjahk123/ 
http://www.c.com/abc.txt 

The command above prints every line of 789.txt that contains any line of 123.txt as a substring. The -f means "read the patterns to search for from this file", and the -F tells grep to treat the patterns as fixed strings rather than as regular expressions (the default).
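To try the command on something small first, here is a minimal sketch. The file contents are assumptions made up for illustration (the real 123.txt is not shown here), and the files are created in a temporary directory so nothing of yours is overwritten.

```shell
# Hypothetical sample data, written to a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'http://www.a.com' 'http://www.b.com' > "$dir/123.txt"
printf '%s\n' \
  'http://www.a.com/kgjdk-jgjg/' \
  'http://www.b.com/gsjahk123/' \
  'http://www.z.com/other/' > "$dir/789.txt"

# -F: treat patterns as fixed strings; -f: read the patterns from a file.
matches=$(grep -Ff "$dir/123.txt" "$dir/789.txt")
printf '%s\n' "$matches"
```

With this sample data only the www.a.com and www.b.com lines should be printed, since no pattern is a substring of the www.z.com line.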

This will not work if the lines of 123.txt contain trailing spaces: grep treats the spaces as part of the pattern and will not match when the pattern occurs inside a word. For example, the pattern "foo " (note the trailing space) will not match foobar. To remove trailing spaces from your file, run this command:

$ sed 's/ *$//' 123.txt > new_file

Then use the new_file to grep:

$ grep -Ff new_file 789.txt
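The effect of the trailing space can be checked with a tiny example. This is only a sketch: patterns.txt and data.txt are hypothetical names standing in for 123.txt and 789.txt, and grep -c is used just to count matching lines.

```shell
dir=$(mktemp -d)
printf 'foo \n' > "$dir/patterns.txt"   # the pattern "foo " with a trailing space
printf 'foobar\n' > "$dir/data.txt"

# grep exits non-zero when nothing matches, hence the "|| true".
before=$(grep -cFf "$dir/patterns.txt" "$dir/data.txt" || true)

# Strip trailing spaces from the patterns, then search again.
sed 's/ *$//' "$dir/patterns.txt" > "$dir/new_file"
after=$(grep -cFf "$dir/new_file" "$dir/data.txt")

echo "before=$before after=$after"
```

This should print before=0 after=1: the pattern with the trailing space matches nothing, and the cleaned pattern matches foobar.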

You can also do this without creating a new file, using sed's -i flag:

$ sed -i.bak 's/ *$//' 123.txt

This will change file 123.txt and keep a copy of the original called 123.txt.bak.

(Note that -i is not a POSIX option. The -i.bak form, with the suffix attached, works with both GNU and BSD sed; BSD sed also accepts -i .bak with a space in between.)


If the files are sorted, as in your example, and always follow that pattern, you could write:

join -t/ -1 3 -2 3 123.txt 789.txt |
  sed -n 's,\([^/]*/\)\([^/]*://\)\2,\2\1,p'

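That would probably be the most efficient approach, since join makes a single merge pass over the two sorted files instead of loading every pattern from 123.txt into memory.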
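Here is a small sketch of the join pipeline on made-up data. It assumes that 123.txt holds lines of the form http://host with no trailing slash, which is what the sed expression appears to expect, and it sorts both files on the host field under the same locale, since join needs its inputs sorted on the join field.

```shell
export LC_ALL=C   # use the same collation for sort and join
dir=$(mktemp -d)

# Hypothetical sample data, sorted on field 3 (the host) with "/" as separator.
printf '%s\n' 'http://www.c.com' 'http://www.a.com' |
  sort -t/ -k3,3 > "$dir/123.txt"
printf '%s\n' \
  'http://www.z.com/other.txt' \
  'http://www.c.com/abc.txt' \
  'http://www.a.com/index.html' |
  sort -t/ -k3,3 > "$dir/789.txt"

# join emits "host/http://http://path"; sed rebuilds "http://host/path".
joined=$(join -t/ -1 3 -2 3 "$dir/123.txt" "$dir/789.txt" |
  sed -n 's,\([^/]*/\)\([^/]*://\)\2,\2\1,p')
printf '%s\n' "$joined"
```

Unlike grep -F, which looks for each pattern anywhere in the line, join pairs lines only when the host field is exactly equal in both files, so the two approaches can give different results when a pattern is merely a substring of a host name.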