What is the difference between bigram and unigram text feature extraction?

We are trying to teach machines how to do natural language processing. Humans can understand language easily, but machines cannot, so we teach them specific patterns of language. A single word has a meaning on its own, but combining words into groups often makes the meaning easier to understand.

An n-gram is basically a contiguous sequence of n words within a given window, so:

  • n = 1 is a unigram

  • n = 2 is a bigram

  • n = 3 is a trigram, and so on

Now suppose a machine tries to understand the meaning of the sentence "I have a lovely dog". It will split the sentence into specific chunks:

  1. It considers words one at a time (unigrams), so each word is a gram:

    "I", "have", "a" , "lovely" , "dog"

  2. It considers two words at a time (bigrams), so each pair of adjacent words is a bigram:

    "I have" , "have a" , "a lovely" , "lovely dog"

In this way, the machine splits sentences into small groups of words to understand their meaning.
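The splitting described above can be sketched in a few lines of Python. The helper name `ngrams` here is just for illustration; libraries such as NLTK provide an equivalent function:

```python
def ngrams(sentence, n):
    """Return the list of n-grams (groups of n adjacent words) in a sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I have a lovely dog"
print(ngrams(sentence, 1))  # ['I', 'have', 'a', 'lovely', 'dog']
print(ngrams(sentence, 2))  # ['I have', 'have a', 'a lovely', 'lovely dog']
```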


Example: Consider the sentence "I ate banana".

In a unigram model we assume that the occurrence of each word is independent of the previous word. Hence each word becomes a gram (feature) here.

For unigrams, we get 3 features - 'I', 'ate', 'banana' - all independent of each other, although this is not the case in real languages.

In a bigram model we assume that the occurrence of each word depends only on the previous word. Hence each pair of words is counted as one gram (feature) here.

For bigrams, we get 2 features - 'I ate' and 'ate banana'. This makes sense, since the model learns that 'banana' comes after 'ate' and not the other way around.

Similarly, we can have trigrams, and in general n-grams.
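To turn these grams into actual count features, a minimal sketch is a bag-of-n-grams: count how often each gram occurs in the text. (In practice, a library such as scikit-learn's `CountVectorizer` with its `ngram_range` parameter does this over a whole corpus; the plain-Python version below is just for illustration.)

```python
from collections import Counter

def ngrams(sentence, n):
    """Return the list of n-grams (groups of n adjacent words) in a sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I ate banana"
unigram_features = Counter(ngrams(sentence, 1))  # 3 features
bigram_features = Counter(ngrams(sentence, 2))   # 2 features

print(unigram_features)  # Counter({'I': 1, 'ate': 1, 'banana': 1})
print(bigram_features)   # Counter({'I ate': 1, 'ate banana': 1})
```

The bigram counts preserve word order ('I ate' is a different feature from 'ate I'), which is exactly the extra information a unigram model throws away.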