scatter plots with string arrays in matplotlib

You could use np.unique(..., return_inverse=True) to get representative ints for each string. For example,

In [117]: uniques, X = np.unique(['foo', 'baz', 'bar', 'foo', 'baz', 'bar'], return_inverse=True)

In [118]: X
Out[118]: array([2, 1, 0, 2, 1, 0])

Note that X is an integer array of indices into uniques; its exact dtype is platform-dependent (int64 on most 64-bit systems, int32 on some others).
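Because X indexes into uniques, the original strings can be recovered with uniques[X], which is what makes it possible to label the ticks with the strings afterwards. A minimal sketch (the variable names labels and roundtrip are only for illustration):

```python
import numpy as np

labels = ['foo', 'baz', 'bar', 'foo', 'baz', 'bar']  # illustrative input
uniques, X = np.unique(labels, return_inverse=True)

# uniques is sorted, so the integer codes follow the sorted order of the strings.
print(uniques)
print(X)

# Indexing uniques with the codes reconstructs the input, element by element.
roundtrip = uniques[X]
print(list(roundtrip) == labels)
```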


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Importing mplot3d registers the '3d' projection on older matplotlib versions.
import mpl_toolkits.mplot3d.axes3d as axes3d

# Build a small demo frame: string columns X and Y, float column Z.
N = 12
arr = np.arange(N*2).reshape(N,2)
words = np.array(['foo', 'bar', 'baz', 'quux', 'corge'])
df = pd.DataFrame(words[arr % 5], columns=list('XY'))
df['Z'] = np.linspace(1, 1000, N)
Z = np.log10(df['Z'])
# Map each distinct string to an integer code (its index in the sorted uniques).
Xuniques, X = np.unique(df['X'], return_inverse=True)
Yuniques, Y = np.unique(df['Y'], return_inverse=True)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(X, Y, Z, s=20, c='b')
# Put ticks at the integer codes and label them with the original strings.
ax.set(xticks=range(len(Xuniques)), xticklabels=Xuniques,
       yticks=range(len(Yuniques)), yticklabels=Yuniques)
plt.show()



Try converting the characters to numbers for the plotting and then use the characters again for the axis labels.

Using hash

You could use the hash function for the conversion:

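import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib

xlab = myDataFrame.columnX.values
ylab = myDataFrame.columnY.values

# Convert each string label to an integer via hash()
X = [hash(l) for l in xlab]
Y = [hash(l) for l in ylab]

Z = myDataFrame.columnZ.values  # float column (assumed to be named columnZ)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X, Y, np.log10(Z), s=20, c='b')
# Label the ticks with the original strings
ax.set_xticks(X)
ax.set_xticklabels(xlab)
ax.set_yticks(Y)
ax.set_yticklabels(ylab)
plt.show()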

As M4rtini has pointed out in the comments, it's not clear what the spacing/scaling of string coordinates should be; the hash function could give unexpected spacings.
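To see why the spacing is unpredictable, here is a small sketch with made-up labels. It only relies on the built-in hash; note that Python randomizes string hashes per interpreter run unless PYTHONHASHSEED is set, so the printed gaps will differ between runs.

```python
labels = ['foo', 'bar', 'baz', 'foo']  # illustrative labels

positions = [hash(s) for s in labels]

# Equal strings always hash to the same value within one process...
same = positions[0] == positions[3]

# ...but the gaps between distinct labels are arbitrary and usually huge.
distinct = sorted(set(positions))
gaps = [b - a for a, b in zip(distinct, distinct[1:])]

print(same)
print(gaps)
```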

Nondegenerate uniform spacing

If you wanted to have the points uniformly spaced then you would have to use a different conversion. For example you could use

X = list(range(len(xlab)))

though that would cause each point to have a unique x-position even if the label is the same, and the x and y points would be correlated if you used the same approach for Y.
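Both drawbacks are easy to check on a toy input. The labels below are invented for illustration:

```python
xlab = ['foo', 'bar', 'foo', 'baz']  # illustrative labels
ylab = ['a', 'a', 'b', 'b']

X = list(range(len(xlab)))
Y = list(range(len(ylab)))

# The two 'foo' entries end up at different x-positions.
foo_positions = [x for x, label in zip(X, xlab) if label == 'foo']
print(foo_positions)

# X and Y are identical, so every point would lie on the diagonal.
on_diagonal = all(x == y for x, y in zip(X, Y))
print(on_diagonal)
```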

Degenerate uniform spacing

A third alternative is to first get the unique members of xlab (using e.g. set) and then map each xlab to a position using the unique set for the mapping; e.g.

xmap = {sn: i for i, sn in enumerate(set(xlab))}
X = [xmap[l] for l in xlab]
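One caveat with a plain set is that its iteration order is arbitrary, so the positions can change from run to run. A possible variant, sketched here with a hypothetical helper named encode_labels, sorts the unique labels first so the mapping is reproducible:

```python
def encode_labels(labels):
    """Hypothetical helper: map each distinct label to 0, 1, 2, ... in sorted order."""
    categories = sorted(set(labels))
    mapping = {label: i for i, label in enumerate(categories)}
    positions = [mapping[label] for label in labels]
    return categories, positions

xlab = ['foo', 'bar', 'foo', 'baz']  # illustrative labels
categories, X = encode_labels(xlab)
print(categories)  # ['bar', 'baz', 'foo']
print(X)           # [2, 0, 2, 1]
```

The ticks could then be set with ax.set_xticks(range(len(categories))) and ax.set_xticklabels(categories), so each label appears exactly once on the axis.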

Scatter does this automatically now (from at least matplotlib 2.1.0):

plt.scatter(['A', 'B', 'B', 'C'], [0, 1, 2, 1])   
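As a rough check of what happens under the hood, the sketch below inspects the positions matplotlib assigns to the strings. It assumes a matplotlib version with categorical axis support and uses the Agg backend only so that it runs without a display; the names points and x_positions are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
points = ax.scatter(['A', 'B', 'B', 'C'], [0, 1, 2, 1])

# Each distinct string should be mapped to an integer position in order of
# first appearance, so both 'B' points share the same x-coordinate.
x_positions = [float(v) for v in points.get_offsets()[:, 0]]
print(x_positions)
```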
