Finding the most 'unique' word

APL (56)

{⎕ML←3⋄⊃{⍵,⍴∪⍵}¨W[⍙]⍴⍨↑+/∆∘.=∆←∆[⍙←⍒∆←↑∘⍴∘∪¨W←⍵⊂⍨⍵≠' ']}

This is a function (the question says that's allowed) that takes a string and returns a matrix of the words and their unique-character counts.

Usage:

      {⎕ML←3⋄⊃{⍵,⍴∪⍵}¨W[⍙]⍴⍨↑+/∆∘.=∆←∆[⍙←⍒∆←↑∘⍴∘∪¨W←⍵⊂⍨⍵≠' ']}'The quick brown fox jumps over the lazy dog.'
quick 5
brown 5
jumps 5

Explanation:

  • ⎕ML←3: set the migration level to 3 (so that ⊂ is partition instead of enclose).
  • W←⍵⊂⍨⍵≠' ': partition the given string so that each part consists of non-space characters, and store the parts in W.
  • ⍙←⍒∆←↑∘⍴∘∪¨W: get the amount (⍴) of unique (∪) elements in each part (¨) of W, and store these in ∆, then get the sort order when sorted downwards on this (⍒) and store that in ⍙.
  • ∆[⍙...]: sort ∆ by ⍙, so now we have the unique lengths in order.
  • ∆∘.=∆←∆: store the sorted ∆ back in ∆, and see which elements of ∆ are equal.
  • ↑+/: sum the rows (now we know how many elements are equal to each element) and then take the first item (now we know how many elements are equal to the first element, i.e. how many of the words are tied for first place).
  • W[⍙]⍴⍨: sort W by ⍙, and take the first N, where N is the number we just calculated.
  • {⍵,⍴∪⍵}¨: for each of these, get the word itself and the amount of unique characters in the word.
  • ⊃: format as matrix.
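
For readers who don't read APL, the steps above can be sketched in Python. This is only an illustration of the same sort-then-count-ties idea, not a translation of the golfed code; the function name `most_unique_words` is made up for this sketch.

```python
def most_unique_words(text):
    # W: split on spaces only and drop empty pieces, like the partition step.
    words = [w for w in text.split(" ") if w]
    if not words:
        return []
    # Number of distinct characters in each word.
    counts = [len(set(w)) for w in words]
    # Indices that sort the counts in descending order (the sort is stable).
    order = sorted(range(len(words)), key=lambda i: -counts[i])
    sorted_counts = [counts[i] for i in order]
    # How many entries equal the first (largest) count, i.e. the ties.
    n_ties = sum(1 for c in sorted_counts if c == sorted_counts[0])
    # Take the first n_ties words and pair each with its count.
    return [(words[i], counts[i]) for i in order[:n_ties]]


print(most_unique_words("The quick brown fox jumps over the lazy dog."))
# [('quick', 5), ('brown', 5), ('jumps', 5)]
```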

Perl 78 bytes

map{push$_[keys{map{$_,1}/./g}]||=[],$_}split for<>;print"$_ $#_
"for@{$_[-1]}

This interprets the restriction "The text document must be read in by your code" to mean that command line options that read and parse the input are not allowed. As with the PHP solution below, only characters 10 and 32 are considered to be word delimiters. Input and output are also handled in the same manner.
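
The bucketing idea (group the words under their unique-character count, then print the group with the highest count) can be sketched in Python as follows. This is a rough illustration under the delimiter rule stated above, not the golfed code itself; `most_unique_by_bucket` is a hypothetical name.

```python
import re


def most_unique_by_bucket(text):
    # buckets[k] collects, in input order, the words with k distinct characters.
    buckets = {}
    # Only space (32) and newline (10) separate words; punctuation stays attached.
    for word in re.split(r"[ \n]+", text):
        if word:
            buckets.setdefault(len(set(word)), []).append(word)
    if not buckets:
        return []
    # The highest key is the winning count; every word in that bucket ties.
    best = max(buckets)
    return [(word, best) for word in buckets[best]]


for word, count in most_unique_by_bucket("it was the age of wisdom,\nincredulity, it was"):
    print(word, count)
```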


PHP 128 bytes

<?foreach(split(~߃õ,fread(STDIN,1e6))as$s){$w[count(count_chars($s,1))][]=$s;}krsort($w)?><?=join($f=~ß.key($w).~õ,pos($w)),$f;

The only characters considered to be word delimiters are character 10 and character 32. The rest, including punctuation, are considered to be part of the word.

This contains a few binary characters, which saves quotation marks, but as a result needs to be saved with an ANSI encoding in order to function properly. Alternatively, this version can be used, which is 3 bytes heavier:

<?foreach(split(' |
',fread(STDIN,1e6))as$s){$w[count(count_chars($s,1))][]=$s;}krsort($w)?><?=join($f=' '.key($w).'
',pos($w)),$f;

Sample I/O:

input 1:

It was the best of times, it was the worst of times, it was the age of wisdom,
it was the age of foolishness, it was the epoch of belief, it was the epoch of
incredulity, it was the season of Light, it was the season of Darkness, it was
the spring of hope, it was the winter of despair, we had everything before us,
we had nothing before us, we were all going direct to Heaven, we were all going
direct the other way - in short, the period was so far like the present period,
that some of its noisiest authorities insisted on its being received, for good
or for evil, in the superlative degree of comparison only.

output 1:

$ php most-unique.php < input1.dat
incredulity, 11

input 2:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec mollis, nisl sit
amet consequat fringilla, justo risus iaculis justo, vel ullamcorper dui tellus
ut enim. Suspendisse lectus risus, molestie sed volutpat nec, eleifend vitae
ligula. Nulla porttitor elit vel augue pretium cursus. Donec in turpis lectus.
Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia
Curae; Quisque a lorem eu turpis viverra sodales. Pellentesque justo arcu,
venenatis nec hendrerit a, molestie vitae augue.

output 2:

$ php most-unique.php < input2.dat
consequat 9
ullamcorper 9
Vestibulum 9

Mathematica 115 (previously 96)

Edit: the code now finds all words with the maximum number of unique characters. I refuse to treat commas as word characters.

f@t_ := With[{r = {#, Length@Union@Characters@#} & /@ 
StringSplit[t,RegularExpression@"\\W+"]},  Cases[r, {_, Max[r[[All, 2]]]}]]
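
A rough Python equivalent of this approach, for comparison: split on runs of non-word characters, pair each word with its unique-character count, and keep every pair that reaches the maximum. The name `most_unique_alnum` is made up for this sketch, and Python's `\W` may not match Mathematica's on every edge case.

```python
import re


def most_unique_alnum(text):
    # Split on runs of non-word characters, so punctuation never counts.
    words = [w for w in re.split(r"\W+", text) if w]
    # Pair each word with its number of distinct characters.
    pairs = [(w, len(set(w))) for w in words]
    if not pairs:
        return []
    best = max(count for _, count in pairs)
    # Keep every pair tied for the maximum, in input order.
    return [pair for pair in pairs if pair[1] == best]


print(most_unique_alnum("incredulity, it was the superlative degree"))
# [('incredulity', 10), ('superlative', 10)]
```

Because the comma is stripped, "incredulity" scores 10 here rather than the 11 reported by the PHP answer.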

Examples

f@"It was the best of times,...of comparison only."

or

f@Import["t1.txt"]

{{"incredulity", 10}, {"superlative", 10}}


f@"Lorem ipsum... vitae augue."

or

f@Import["t2.txt"]

{"Vestibulum", 9}


Longer Examples

f@Import["ShakespearesSonnets.txt"]
f@Import["OriginOfSpecies.txt"]
f@Import["DeclarationOfIndependence.txt"]
f@Import["DonQuixoteISpanish.txt"]
f@Import["AliceInWonderland.txt"]
f@Import["UNHumanRightsGerman.txt"]
f@Import["GenesisKJV.txt"]

Surprise: The most "unique" word in the Declaration of Independence is also the most unique word in Alice in Wonderland!

{"prognosticate", 11}
{"undiscoverable", 13}
{"uncomfortable", 12}
{"regocijadamente", 12}
{"uncomfortable", 12}
{"Verpflichtung", 13}
{"buryingplace", 12}
