Finding the number of occurrences of a word in a file using R functions

The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. By default, scan treats both ' and " as quoting characters, so when it encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up to the next apostrophe into a single element of your names vector. One of these long elements happens to contain two instances of the word "memory", and because grep counts matching elements, the total number of matches comes out one too low.

You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:

# quote = NULL turns off quote handling, so apostrophes are read as ordinary characters
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what = character(), quote = NULL)
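To see the effect without downloading anything, here is a minimal sketch on a made-up sentence (the text and variable names are my own, not taken from the file in the question). It uses quote = "" rather than quote = NULL, which is another way of switching quote handling off.

```r
# Hypothetical sample text: the quoted chunk contains "memory" twice.
txt <- "I have 'memory and more memory' of this"

# Default quote handling: everything between the apostrophes becomes ONE element.
with_quotes <- scan(text = txt, what = character(), quiet = TRUE)

# Quote handling disabled: every whitespace-separated token is its own element.
no_quotes <- scan(text = txt, what = character(), quote = "", quiet = TRUE)

# grep returns the indices of matching elements, so count those.
n_with    <- length(grep("memory", with_quotes, ignore.case = TRUE))
n_without <- length(grep("memory", no_quotes, ignore.case = TRUE))

print(with_quotes)
print(no_quotes)
cat("matches with default quoting:", n_with, "\n")
cat("matches with quoting disabled:", n_without, "\n")
```

With default quoting the two occurrences sit inside the same element and are counted once; with quoting disabled they are counted separately.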

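Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program: it works on the elements of a character vector rather than on the lines of a file, and it returns the indices of the elements that match. Because scan has already split the text into whitespace-separated words, the way you have used it here WILL find the number of matching words and not just the total number of matching lines, as some people have suggested. Note that the pattern is matched as a substring, so an element such as "memory's" counts as well.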
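As a small illustration of how grep treats a vector of words, here is a sketch on a hand-made vector (the words are invented for the example, not taken from the file):

```r
# Hypothetical tokens, as scan might return them after splitting on whitespace.
words <- c("my", "memory", "Memory,", "memories", "remember", "memory's")

# grep returns the positions of the elements that contain the pattern.
idx <- grep("memory", words, ignore.case = TRUE)

# grepl returns one TRUE/FALSE per element; summing it gives the same count.
n_matches <- sum(grepl("memory", words, ignore.case = TRUE))

print(idx)
print(words[idx])
print(n_matches)
```

Elements with attached punctuation ("Memory," and "memory's") are matched because the pattern only has to appear somewhere in the element, while "memories" is not, since it does not contain the exact sequence "memory".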


As pointed out by @andrew, my previous answer would give wrong results if the word is repeated on the same line. Based on the other answers and comments, this version seems to work:

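# Read the text as whitespace-separated words, with quote handling turned off
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what = character(), quote = NULL)
# Indices of the words that contain "memory", ignoring case
idxs = grep("memory", names, ignore.case = TRUE)

length(idxs)
# [1] 10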
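
If the text is held line by line instead (for example after readLines), counting matching lines undercounts whenever the word is repeated on one line. One possible way around that, sketched here on invented lines rather than on the real file, is to count every match with gregexpr and regmatches:

```r
# Hypothetical lines of text; the first one contains the word three times.
lines <- c("Memory, thou memory of memory!",
           "no match here",
           "in my memory lock'd")

# Number of LINES that contain the word at least once.
n_lines <- length(grep("memory", lines, ignore.case = TRUE))

# gregexpr finds every match in each line; regmatches extracts them,
# giving character(0) for lines without a match.
hits <- regmatches(lines, gregexpr("memory", lines, ignore.case = TRUE))

# Total number of OCCURRENCES across all lines.
n_occurrences <- sum(lengths(hits))

cat("matching lines:", n_lines, "\n")
cat("total occurrences:", n_occurrences, "\n")
```

This assumes R 3.2.0 or later for lengths(); on older versions sapply(hits, length) would do the same job.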
