How to plot text clusters?

There are several moving pieces to this question:

  1. How to vectorize text to data that kmeans clustering can understand
  2. How to plot clusters in two dimensional space
  3. How to label plots by source sentence

My solution follows a very common approach, which is to use the kmeans labels as colors for the scatter plot. (The kmeans values after fitting are just 0,1,2,3, and 4, indicating which arbitrary group each sentence was assigned to. The output is in the same order as the original samples.) Regarding how to get the points into two dimensional space, I use Principal Component Analysis (PCA). Note that I perform kmeans clustering on the full data, not the dimension-reduced output. I then use matplotlib's ax.annotate() to decorate my plot with the original sentences. (I also make the graph bigger so there's space between the points.) I can comment this further upon request.

import pandas as pd
import re
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
    'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
vectors = cv.fit_transform(x)
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
kmean_indices = kmeans.fit_predict(vectors)

pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(vectors.toarray())

colors = ["r", "b", "c", "y", "m" ]

x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]
fig, ax = plt.subplots(figsize=(20,10))

ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])

for i, txt in enumerate(x):
    ax.annotate(txt, (x_axis[i], y_axis[i]))

enter image description here


As per the documentation of matplotlib.pyplot.scatter takes an array as in input but in your case x[y_kmeans == a,b] you are feeding in a sparse matrix, so you need to convert it into an numpy array using .toarray() method. I have modified your code below:

Modification

plt.scatter(x[y_kmeans == 0,0].toarray(), x[y_kmeans==0,1].toarray(), s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0].toarray(), x[y_kmeans==1,1].toarray(), s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0].toarray(), x[y_kmeans==2,1].toarray(), s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0].toarray(), x[y_kmeans==3,1].toarray(), s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0].toarray(), x[y_kmeans==4,1].toarray(), s = 15, c= 'magenta', label = 'Cluster_5')

Output

enter image description here

Hope this helps!