Read from a text file and parse lines into words in C

I suspect you really want to treat all non-alphabetic characters as separators, rather than treating only spaces as separators and silently dropping the other non-alphabetic characters. Otherwise, foo--bar would show up as the single word foobar, right? The good news is that this makes things easier: you can remove the isspace clause and just use the else clause.

Meanwhile, whether you treat punctuation specially or not, you've got a problem: you print a newline for any space at all. So a line that ends with \r\n or \n, or even a sentence that ends with ., will print a blank line. The obvious way around that is to keep track of the last character, or a flag, so you only print a newline if the previous character you handled was a letter.

For example:

int last_c = 0;

while ((c = fgetc(input_file)) != EOF )
{
    //if it's an alpha, convert it to lower case
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    else if (isalpha(last_c))
    {
        //a separator right after a letter ends the word
        putchar('\n');
    }
    last_c = c;
}

But do you really want to treat all punctuation the same? The problem statement implies that you do, but in real life, that's a bit odd. For example, foo--bar should probably show up as separate words foo and bar, but should it's really show up as separate words it and s? For that matter, using isalpha as your rule for "word characters" also means that, say, 2nd will show up as nd.

So, if isalpha isn't the appropriate rule for your use case to distinguish word characters from separator characters, you'll have to write your own function that makes the right distinction. You can easily express such a rule in logic (e.g., isalnum(c) || c == '\'') or with a table (just an array of 128 ints, so the function is c >= 0 && c < 128 && word_char_table[c]). Doing things that way has the added benefit that you can later extend your code to deal with Latin-1 or Unicode, or to handle program text (which has different word characters than English language text), or …
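A minimal sketch of both forms of that rule, assuming you want letters, digits, and apostrophes to count as word characters. The names is_word_char, init_word_char_table, and is_word_char_table are made up for illustration; only word_char_table comes from the text above.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical rule expressed in logic: letters, digits, and the
   apostrophe are word characters; everything else (including EOF)
   is a separator. */
int is_word_char(int c)
{
    return c != EOF && (isalnum((unsigned char)c) || c == '\'');
}

/* The same rule as a lookup table covering the 128 ASCII codes. */
static unsigned char word_char_table[128];

/* Fill the table once before the first lookup. */
void init_word_char_table(void)
{
    for (int c = 0; c < 128; c++)
        word_char_table[c] = (unsigned char)(isalnum(c) || c == '\'');
}

/* Table lookup; anything outside 0..127 counts as a separator. */
int is_word_char_table(int c)
{
    return c >= 0 && c < 128 && word_char_table[c];
}
```

With either version in place, you would call it wherever the loop currently calls isalpha(c). Changing the rule later then means touching one function or one table, not the loop.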


I think you just need to treat any non-alpha character (!isalpha(c)) as a separator, and otherwise convert to lowercase. In that case you will need to keep track of whether you are currently inside a word.

int found_word = 0;

while ((c = fgetc(input_file)) != EOF )
{
    if (!isalpha(c))
    {
        if (found_word) {
            putchar('\n');
            found_word = 0;
        }
    }
    else {
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}

If you need to handle apostrophes within words such as "isn't", then this should do it:

int found_word = 0;
int found_apostrophe = 0;

while ((c = fgetc(input_file)) != EOF )
{
    if (!isalpha(c))
    {
        if (found_word) {
            if (!found_apostrophe && c=='\'') {
                found_apostrophe = 1;
            }
            else {
                found_apostrophe = 0;
                putchar('\n');
                found_word = 0;
            }
        }
    }
    else {
        if (found_apostrophe) {
            putchar('\'');
            found_apostrophe = 0;
        }
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}
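One thing the loops above do not do is finish the last word when the file ends without a trailing separator. Here is a rough sketch of the same apostrophe-aware logic restructured as a small state machine that treats EOF as one more separator. The names word_splitter, splitter_feed, and split_words are hypothetical, not from the question.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical state carried between characters. */
struct word_splitter {
    int in_word;            /* a letter has been emitted for the current word */
    int pending_apostrophe; /* saw one apostrophe inside a word, not yet emitted */
};

/* Feed one character (or EOF). Writes 0, 1, or 2 output characters
   into out and returns how many were written. */
size_t splitter_feed(struct word_splitter *s, int c, char out[2])
{
    size_t n = 0;

    if (c != EOF && isalpha((unsigned char)c)) {
        if (s->pending_apostrophe) {
            /* the apostrophe was followed by a letter, so keep it */
            out[n++] = '\'';
            s->pending_apostrophe = 0;
        }
        out[n++] = (char)tolower((unsigned char)c);
        s->in_word = 1;
    } else if (s->in_word && c == '\'' && !s->pending_apostrophe) {
        /* hold the apostrophe until we know a letter follows */
        s->pending_apostrophe = 1;
    } else if (s->in_word) {
        /* any other separator, including EOF, ends the word */
        out[n++] = '\n';
        s->in_word = 0;
        s->pending_apostrophe = 0;
    }
    return n;
}

/* Read everything from in and write one lowercase word per line to out. */
void split_words(FILE *in, FILE *out)
{
    struct word_splitter s = {0, 0};
    char buf[2];
    int c;

    do {
        c = fgetc(in);
        fwrite(buf, 1, splitter_feed(&s, c, buf), out);
    } while (c != EOF);
}
```

You would call it as split_words(input_file, stdout). As in the loop above, an apostrophe that is not followed by a letter is dropped, so a trailing quote mark does not end up in the word.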
