Extracting Key-Phrases from text based on the Topic with Python

It looks like a good approach here would be to use a Latent Dirichlet allocation model, which is an example of what are known as topic models.

A LDA is a an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through what could be an approach to solve this by training a model using the sentences in the text column. Though in the case the phrases are representative enough an contain the necessary information to be captured by the models, then they could also be a good (possibly better) candidate for training the model, though that you'll better judge by yourself.

Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords, lemmatizing and stemming. For that you can use nltk:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import lda
from sklearn.feature_extraction.text import CountVectorizer

ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()
text = []
for sentence in df.text:
    words = word_tokenize(sentence)
    stemmed = []
    for word in words:
        if word not in ignore:
    text.append(' '.join(stemmed))

Now we have more appropriate corpus to train the model:


['great game lot amazing goal team',
 'goalkeeper team made misteke',
 'four grand slam championchips',
 'best player three-point line',
 'Novak Djokovic best player time',
 'amazing slam dunk best player',
 'deserved yellow-card foul',
 'free throw point']

We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:

vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)

Note that you can use the ngram parameter to spacify the n-gram range you want to consider to train the model. By setting ngram_range=(1,2) for instance you'd end up with features containing all individual words as well as 2-grams in each sentence, here's an example having trained CountVectorizer with ngram_range=(1,2):

 'amazing goal',
 'amazing slam',
 'best player',

The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.

Then we can train the LDA with whatever amount of topics you want, in this case I'll just be selecting 3 topics (note that this has nothing to do with the topics column), which you can consider to be the Key-Phrases - or words in this case - that you mention. Here I'll be using lda, though there are several options such as gensim. Each topic will have associated a set of words from the vocabulary it has been trained with, with each word having a score measuring the relevance of the word in a topic.

model = lda.LDA(n_topics=3, random_state=1)

Through topic_word_ we can now obtain these scores associated to each topic. We can use argsort to sort the vector of scores, and use it to index the vector of feature names, which we can obtain with vec.get_feature_names:

topic_word = model.topic_word_

vocab = vec.get_feature_names()
n_top_words = 3

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card

The printed results don't really represent much in this case, since the model has been trained with the sample from the question, however you should see more clear and meaningful topics by training with your entire corpus.

Also note that for this example I've use the whole vocabulary to train the model. However it seems that in your case what would make more sense, is to split the text column into groups according to the different topics you already have, and train a separate model on each group. But hopefully this gives you a good idea on how to proceed.