How to create word clouds?

Here's what I came up with

Mathematica graphics

How I did it

First we need a list of words. Here, I've taken the original list ordered by size.

tally = Tally@
   Cases[StringSplit[ExampleData[{"Text", "AliceInWonderland"}], 
     Except@LetterCharacter], _?(StringLength@# > 4 \[And] # =!= 
         "Alice" &)];
tally = Cases[tally, _?(Last@# > 10 &)];
tally = Reverse@SortBy[tally, Last];
range = {Min@(Last /@ tally), Max@(Last /@ tally)};

words = Style[First@#, FontFamily -> "Cracked", FontWeight -> Bold, 
     FontColor -> 
      Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]], 
     FontSize -> Last@Rescale[#, range, {12, 70}]] & /@ tally;

The words are rasterised and cropped to make sure the bounding box is as tight as possible.

wordsimg = ImageCrop[Image[Graphics[Text[#]]]] & /@ words;

To produce the image the words are added one by one using a Fold loop where the next word is placed as close to the centre of the existing image as possible. This is done by applying a max filter to the binarized version of the original image thus turning forbidden pixels white and looking for the black point that is closest to the centre of the image.

iteration[img1_, w_, fun_: (Norm[#1 - #2] &)] := 
 Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
  dimw = ImageDimensions[w];
  padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
  
  imdil = MaxFilter[Binarize[ColorNegate[padded1], 0.01], 
    Reverse@Floor[dimw/2 + 2]];
  centre = ImageDimensions[padded1]/2;
  
  minpos = Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0], 
      Reverse[centre], DistanceFunction -> fun][[1]];
  diff = ImageDimensions[imdil] - dimw;
  padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
  
  ImagePad[#, (-Min[#] {1, 1 }) & /@ BorderDimensions[#]] &@
   ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]

Fold[iteration, wordsimg[[1]], Rest[wordsimg]]

You can play around with the distance function. For example for a distance function

fun = Norm[{1, 1/2} (#2 - #1)] &

you get an ellipsoidal shape:

Fold[iteration[##, fun]&, wordsimg[[1]], Rest[wordsimg]]

Mathematica graphics


Updated version

The previous code places new words in the image by approximating them with rectangles. This works fine for horizontally or vertically oriented words, but not so well for rotated words or more general shapes. Luckily, the code can be easily modified to deal with this by replacing the MaxFilter with a ImageCorrelate:

iteration2[img1_, w_, fun_: ( Norm[#1 - #2] &)] := 
 Module[{imdil, centre, diff, dimw, padding, padded1, minpos}, 
  dimw = ImageDimensions[w];
  padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
  imdil = Binarize[ImageCorrelate[Binarize[ColorNegate[padded1], 0.05], 
     Dilation[Binarize[ColorNegate[w], .05], 1]]];
  centre = ImageDimensions[padded1]/2;
  minpos = 
   Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0], 
      Reverse[centre], DistanceFunction -> fun][[1]];
  Sow[minpos - centre]; (* for creating vector plot *)
  diff = ImageDimensions[imdil] - dimw;
  padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
  ImagePad[#, (-Min[#] {1, 1}) & /@ BorderDimensions[#]] &@
   ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]

To test this code we use a list of rotated words. Note that I'm using ImagePad instead of ImageCrop to crop the images. This is because ImageCrop seems to clip the words sometimes.

words = Style[First@#, FontFamily -> "Times", 
     FontColor -> 
      Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]], 
     FontSize -> (Last@Rescale[#, range, {12, 150}])] & /@ tally;

wordsimg = ImagePad[#, -3 - 
  BorderDimensions[#]] & /@ (Image[
   Graphics[Text[Framed[#, FrameMargins -> 2]]]] & /@ words);

wordsimgRot = ImageRotate[#, RandomReal[2 Pi], 
  Background -> White] & /@ wordsimg;

The iteration loop is as before:

Fold[iteration2, wordsimgRot[[1]], Rest[wordsimgRot]]

which produces

Mathematica graphics

Second update

To create a vector graphics of the previous result, we need to save the positions of the words in the image, for example by adding Sow[minpos - centre] to the definition of iteration2 somewhere towards the end of the code and using Reap to reap the results. We also need to keep the rotation angles of the words, so we'll replace wordsimgRot with

angles = RandomReal[2 Pi, Length[wordsimg]];

wordsimgRot = ImageRotate[##, Background -> White] & @@@ 
   Transpose[{wordsimg, angles}];

As mentioned before, we use Reap to create the position list

poslist = Reap[img = Fold[iteration2, wordsimgRot[[1]], 
  Rest[wordsimgRot]];][[2, 1]]

The vector graphics can then be created with

Graphics[MapThread[Text[#1, Offset[#2, {0, 0}], {0, 0}, {Cos[#3], Sin[#3]}] &,
  {words, Prepend[poslist, {0, 0}], angles}]]

A preview

Before I show any code, here's a preview of what is possible with some tweaking:

Mathematica graphics


First try

Here's a go at implementing Wordle's layout algorithm, described at cormullion's link.

First, let's generate the word data (this is pretty arbitrary):

punctuation = ",/.<>?;':\"()-_!&"

(* boring words: *)
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his", 
   "he", "it", "with", "is", "for", "as", "had", "you", "not", "be", 
   "her", "on", "at", "by", "which", "have", "or", "from", "this", 
   "him", "but", "all", "she", "they", "were", "my", "are", "me", 
   "one", "their", "so", "an", "said", "them", "we", "who", "would", 
   "been", "will", "no", "when", "there", "if", "more", "out", "up", 
   "into", "do", "any", "your", "what", "has", "man", "could", 
   "other", "than", "our", "some", "very", "time", "upon", "about", 
   "may", "its", "only", "now", "like", "little", "then", "can", 
   "should", "made", "did", "us", "such", "a", "great", "before", 
   "must", "two", "these", "see", "know", "over", "much", "down", 
   "after", "first", "mr", "good", "men"};

text = Select[
   StringSplit@
    StringReplace[ExampleData[{"Text", "AliceInWonderland"}], 
     Alternatives @@ Characters[punctuation] -> " "],
   StringLength[#] > 2 &
   ];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];

Now that we have the data, let's take a word tally and generate words at sized proportional to their frequency:

words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];

styledwords = 
 Style[#1, FontSize -> #2, FontFamily -> "Times"] & @@@ words

Mathematica graphics

Let's rasterize and binarize these (the binarization is to ease overlap detection):

images = Binarize@Rasterize[#, "Image"] & /@ styledwords;

This counts black pixels in an image:

count[img_] := ImageLevels[Binarize[img]][[1, 2]]

Now run this:

canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
Monitor[
 Do[
  x = 0;
  w = images[[i]];
  cc = count[canvas];
  centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
  compose := 
   res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]], 
     Scaled[centre + x/100 {Cos[x], Sin[x]}]];
  compose;
  While[count[res] - cc - count[w] < 0 && x < 80,
   x += 1;
   compose;
   ];
  canvas = res,
  
  {i, 1, Length[images]}
  ],
 canvas
 ]

canvas // ImageCrop

Mathematica graphics

Some explanations:

This code takes the word images one by one and tries to place them at a random position. If it does not fit at that position, it tries other positions, moving on an outward winding Archimedean spiral. The step size of moving on this spiral should be chosen so that the points are distributed with an approximately constant density in the plane, and are not on gathered on a few thin lines. I used this code to verify the point distribution for a given step size (1 was good enough for a first try):

Manipulate[Graphics[Point@Table[x/100 {Cos[x], Sin[x]}, {x, 0, 100, s}]], 
   {s, 0.1, 1.5}]

Testing for image overlap (i.e. whether the word fits) is done by composing the image onto a canvas which has all the previously placed words, and verifying that no black pixels will collide (i.e. the total pixel count will not change after placing the image). There are probably faster ways to do this.


Second try: converting all this to vector graphics

I used images but his should only be an aid for calculating positions. The images should be dilated prior to arranging them (to have more "air" between them), their positions should be recorded, and the recorded positions should be used to arrange the vector versions of words.

Here's a basic implementation:

images = ColorNegate@Dilation[ColorNegate[#], 2] & /@ images

canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
positions = {};
Monitor[
 Do[
  x = 0;
  w = images[[i]];
  cc = count[canvas];
  centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
  compose := 
   res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]], 
     pos = Scaled[centre + x/100 {Cos[x], Sin[x]}]];
  compose;
  While[count[res] - cc - count[w] < 0 && x < 80,
   x += 1;
   compose;
   ];
  canvas = res;
  AppendTo[positions, pos],
  
  {i, 1, Length[images]}
  ],
 canvas
 ]

Rasterize[
  Graphics[MapThread[Text, {styledwords, positions}], 
   ImageSize -> 1000], "Image"] // ImageCrop

Mathematica graphics

There's a lot of refinement needed though (I didn't have time to tune the parameters).


Now that two of our resident Mathematica geniuses (genii?) have produced such awesome examples, there's not much room left for anyone else... :) But that didn't stop me - and I'm here to make you guys look good. I had an idea...

I decided not to make a cloud, but a tale - or rather, a tail. I've pinched Szabolcs's code to get the words and frequencies:

punctuation = ",/.<>?;':\"()-_!&";
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his", 
 "he", "it", "with", "is", "for", "as", "had", "you", "not", "be", 
  "her", "on", "at", "by", "which", "have", "or", "from", "this", 
  "him", "but", "all", "she", "they", "were", "my", "are", "me", 
  "one", "their", "so", "an", "said", "them", "we", "who", "would", 
  "been", "will", "no", "when", "there", "if", "more", "out", "up", 
  "into", "do", "any", "your", "what", "has", "man", "could", "other",
   "than", "our", "some", "very", "time", "upon", "about", "may", 
  "its", "only", "now", "like", "little", "then", "can", "should", 
  "made", "did", "us", "such", "a", "great", "before", "must", "two", 
  "these", "see", "know", "over", "much", "down", "after", "first", 
  "mr", "good", "men"}; text = 
 Select[StringSplit@
   StringReplace[ExampleData[{"Text", "AliceInWonderland"}], 
    Alternatives @@ Characters[punctuation] -> " "], 
  StringLength[#] > 2 &];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];
words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];

Now, I can work out how big the words would be:

textToOutline[string_, size_] := 
  First[ImportString[
    ExportString[
     Style[string, FontFamily -> "Times", FontWeight -> "Bold", 
      FontSize -> size], "PDF"], "TextMode" -> "Outlines"]];

width[t_] := t[[2]][[2]][[1]]; (* the best way? - surely not! *)
height[t_] := t[[2]][[2]][[2]];

and make the graphics:

wordGraphics =  textToOutline[First[#], Last[#]] & /@ words;
baselines =   Reverse[Accumulate [Reverse[height /@ wordGraphics]]]; (* !? *)
wiggle = 0; fudgeFactor = 180; 
Graphics[
 MapThread[
   Tooltip[Text[#1, {Sin[++wiggle/4], #2/fudgeFactor}], #3] &, 
   {wordGraphics,  baselines, words}], 
 Background -> LightGreen] 

With the help of a fudge factor or two, it looks like this:

word tail

There's a number of problems with this code - I don't see how to do some things in Mathematica yet - such as find out how far I've got in a Map operation, or how the coordinates work (what is that fudge factor doing?), or, more importantly, why the first word doesn't look right. But I like the idea of reading the tale.

See The Mouse's tale - Wikipedia