Is this even a word?

Ruby, 29 bytes

->s{!s[/[^aeiou]{3}|[jqxz]/]}

Hopefully I've got this right - it's my first time programming in Ruby. I actually did all my testing in Python, but import re was far too long for me.

This is an anonymous function which takes in a string and outputs true/false accordingly. It uses a regex which looks for one of the following two things:

  • Three consonants in a row
  • Contains one of jqxz

If either of these are present, we classify the input as not a word.

The function matches 2030 words (incorrectly failing on 206) and fails on 1782 non-words (incorrectly matching 454), for a total of 660 misclassifications. Tested on ideone.

Thanks to @MartinBüttner for the Ruby help. Martin also points out that a full program takes the same number of bytes:

p !gets[/[^aeiou]{3}|[jqxz]/]

Also thanks to user20150203 for simplifying the regex.


Ruby, 1586 1488 1349 1288 1203 bytes

For a bonus, here's a function with a much longer regex:

->s{!s[/[^aeiouyhs]{3}|[^aeiouy]{4}|q[^u]|^x|^[bdf][^aeioulry]|^[cgkprtwy][mfdsbktjgxpc]|^a[aoz]|^e[hf]|^i[ea]|^o[ecy]|^u[^ltnspgr]|[bdgktyz][cgmpw]$|[fhpvx][^aieoflnrsty]$|eh$|i[iyh]|[wkybp]z|[dghz]t|[fjzsv]y.|h[ns].|ae.|y.?[yifj]|[ejo]..[iuw]|[inv]..[gpuvz]|[eu].[jqwx]|[vyz][^t][fhmpqy]|i[^ae][fjqrv]|[ospen].?j|[ceg][iuy][ghkoux]|[bcpx]f|[hnuy]w|[ghnw]b|[txz]n|[jk]r|.[fjuyz]u|[hnt]ia|lsy|.p.o|.l.l|.tas|zal|f.p|eeb|wei|.sc.|.pl|yat|hov|hab|aug|v.re|aba|ohu|ned|s.dd|uc$|nux|oo$|dgo|lix|wua|v.o|vo$|ryo|wue|dk|oic|yol|.tr|yrb|oba|ruh|c.ls|idd|chn|doy|ekh|tk|lke|asl|cir|eez|asc|uil|iou|m..p|awt|irp|zaa|td|swk|ors|phe|aro|yps|q.e|ati|ibt|e.mo|we.y|p.de|ley|eq|tui|e..g|sps|akh|dny|swr|iul|.t.t|.tao|rcy|.p.y|idi|j.o|.kl|oms|ogi|jat|.lis|mye|uza|rsi|.ala|ibo|ipi|yaa|eun|ruy|wog|mm$|oex|koi|uyn|.hid|osc|ofe|w.op|auc|uzy|yme|aab|slm|oza|.fi|bys|z.e|nse|faf|l.h|f.va|nay|hag|opo|lal|seck|z.b|kt|agl|epo|roch|ix.|pys|oez|h.zi|nan|jor|c.ey|dui|ry.d|.sn|sek|w.no|iaz|ieb|irb|tz.|ilz|oib|cd|bye|ded|f.b|if$|mig|kue|ki.w|yew|dab|kh.|grs|no.t|cs.|.n.m|iea|y.na|vev|eag|el[uz]|eo[dkr]|e[hlsu]e|e[dho]l|eov|e[adp]y|r[grt]u|yn[klo]|.[^ilv].v|s[bdgqrz]|m[dfghrz]|[vpcwx]{2}|^[hjlmnvz][^aeiouy]|^[drw]l|l[hnr]/]}

I wanted to show that regex can still beat compression, so this one classifies every given case correctly. The regex itself is a bit like exponential decay — the first bits match a lot of non-words, then every bit after matches fewer and fewer until I gave up and just concatenated the rest (about 200 or so) at the end. Some of the ones that were left looked surprisingly like real words (such as chia which is a word).

I threw the regex at my regex golf cleaner which I wrote for another challenge — it golfed about 300 bytes before I had to try shuffling things around manually. There's still a fair bit to be golfed though.


Groovy, 77 74

x={it==~/^(?!.+[jq]|[^aeiou][^aeiouhlr]|.[^aeiouy]{3}|.[x-z])|^s[cknptw]/}

I wrote the test program in Java, which you can find in this Gist on Github. Here is the output from my test program:

Good: 2135 1708
Bad: 101 528

(Failed 629 test cases)

P.S. I think this is going to end up a regex golf problem extremely soon...

If Sp3000's answer (the function) is to be converted to Groovy, it will end up with the same character count. As named function:

x={it!=~/[^aeiou]{3}|[jqxz]/}

or unnamed function:

{i->i!=~/[^aeiou]{3}|[jqxz]/}

Javascript, 1626 bytes:

I wanted to go for a solution which for each character has a clue of which one might come after. Not that short, but no regex and a fairly good result (words: 101 mistakes, non-words, 228 mistakes)

v=function(w){var g={a:{b:3,c:4,d:4,m:6,f:1,r:14,g:4,i:6,x:2,k:2,l:10,n:12,s:6,p:4,t:7,u:2,v:3,w:3,y:3,h:1,z:1},b:{b:3,a:19,e:19,y:3,l:6,o:17,u:12,i:9,s:9,r:6},e:{d:7,l:8,t:4,s:10,n:11,e:10,r:10,c:2,x:2,w:4,a:13,f:1,p:2,g:2,v:1,b:1,m:3,u:1,i:1,k:1,y:2},l:{e:16,y:5,a:16,b:1,f:2,l:12,m:2,o:14,p:1,s:2,u:8,d:4,i:10,k:3,t:5},o:{s:7,g:3,e:3,k:3,n:10,m:4,p:5,w:6,b:3,c:2,t:6,a:5,d:5,h:1,i:2,l:8,o:9,r:8,u:4,y:2,v:2,z:1,f:2,x:1},u:{t:8,e:5,m:7,s:11,a:2,n:13,r:15,d:6,c:4,f:1,g:5,l:9,y:1,z:1,b:5,j:1,x:1,p:2,k:1,i:2},c:{e:9,h:12,i:2,r:6,t:3,o:20,k:15,a:16,l:6,u:8,y:1},h:{e:21,r:2,a:22,i:15,o:20,u:15,n:3,l:1,y:1},i:{d:8,m:5,n:18,r:7,a:2,s:8,v:2,l:13,t:10,b:1,e:6,k:2,p:5,g:3,c:6,o:2,f:2,z:1},m:{e:19,s:8,a:21,i:12,m:1,o:15,y:2,b:4,p:8,n:1,u:8},n:{e:18,u:3,a:9,d:10,n:4,o:7,s:11,t:11,g:10,k:6,i:5,y:2,c:1},r:{e:18,s:4,y:4,a:16,c:1,g:1,i:12,m:3,p:2,t:4,b:1,d:4,k:4,n:5,r:2,o:11,l:2,u:6,f:1},t:{a:14,s:17,e:18,i:9,o:15,h:10,t:3,y:2,c:1,z:1,u:5,r:5,w:2},d:{a:14,d:4,s:10,e:22,y:8,i:12,o:14,r:4,u:10,l:1},f:{a:16,f:6,e:12,y:1,i:14,l:13,o:16,r:7,u:7,t:7,n:1,s:1},g:{a:16,e:12,o:17,u:7,s:18,g:1,y:2,i:8,l:4,n:2,h:3,r:9,w:1},j:{a:25,i:7,e:14,o:25,u:29},k:{i:23,s:6,e:41,u:6,a:10,l:2,n:8,o:2,h:1,y:1},p:{s:12,e:20,a:19,y:2,i:13,t:2,u:10,l:5,o:13,r:4},s:{o:8,i:8,e:13,k:6,h:10,s:8,t:14,y:1,p:5,c:2,l:6,a:10,m:1,n:2,u:4,w:2},v:{a:18,e:47,i:22,o:8,y:6},y:{l:4,e:18,s:20,d:3,n:8,r:8,t:4,a:14,k:1,m:1,o:8,x:3,p:3,u:4,v:1},q:{u:100},w:{a:24,e:17,l:4,r:3,s:10,n:6,y:2,k:1,d:1,t:1,i:17,u:1,o:10,h:4},x:{e:35,i:18,l:6,o:6,a:6,t:12,y:18},z:{z:10,y:10,a:3,e:43,r:3,o:17,i:10,u:3}},p=1,x,y,i=0;for(;i<3;){x=w[i],y=w[++i];p*=g[x]&&g[x][y]||0}return p>60}

Here's a working implementation http://fiddle.jshell.net/jc73sjyn/

In short: the Object g holds the characters from a to z (as keys), and for each of them, there is a set of characters (also as keys) that each represent a character that may come after, along with its probability percentage. Where no object exists, there is no probability.

The 3 scores (4 letters -> 3 evaluations) are multiplied, and a word with a score of 60 and above is considered a real word.

Example: for the word 'cope' there are three lookups:

g[c][o] = 20

g[o][p] = 5

g[p][e] = 20

score = 20 * 5 * 20 = 2000, which is more than 60, so that one is valid.

(I'm quite new with javascript, so there might be ways to make it shorter that I don't know of.)

LATE EDIT:

Completely irrelevant by now, but I evaluated my way to a more correct g:

g={a:{b:7,c:4,d:4,m:6,f:2,r:14,g:4,i:6,x:2,k:2,l:10,n:12,s:6,p:4,t:7,u:2,v:3,w:12,y:3,h:1,z:1},b:{b:10,a:19,e:19,y:3,l:6,o:17,u:10,i:9,s:9,r:3},e:{d:7,l:8,t:4,s:10,n:11,e:10,r:10,c:2,x:2,w:4,a:13,f:1,p:2,g:2,v:20,b:3,m:3,u:1,i:1,k:1,y:2},l:{e:16,y:5,a:16,b:1,f:2,l:12,m:2,o:14,p:1,s:6,u:61,d:1,i:10,k:3,t:5},o:{s:7,g:3,e:3,k:3,n:20,m:4,p:5,w:6,b:3,c:2,t:6,a:5,d:5,h:10,i:2,l:8,o:3,r:8,u:4,y:2,v:2,z:1,f:20,x:1},u:{t:8,e:5,m:7,s:11,a:2,n:13,r:15,d:6,c:1,f:10,g:5,l:9,y:1,z:1,b:5,j:1,x:1,p:2,k:1,i:2},c:{e:9,h:20,i:2,r:6,t:20,o:15,k:15,a:15,l:6,u:8,y:1},h:{e:21,r:2,a:7,i:15,o:20,u:15,n:10,l:0,y:1},i:{d:8,m:5,n:18,r:7,a:5,s:8,v:2,l:13,t:20,b:1,e:21,k:2,p:5,g:20,c:4,o:2,f:2,z:1},m:{e:10,s:8,a:21,i:12,m:1,o:15,y:2,b:4,p:2,n:1,u:8},n:{e:18,u:3,a:9,d:3,n:4,o:20,s:2,t:11,g:10,k:6,i:5,y:2,c:1},r:{e:15,s:4,y:4,a:16,c:1,g:1,i:12,m:3,p:2,t:4,b:1,d:4,k:4,n:5,r:2,o:11,l:2,u:20,f:1},t:{a:14,s:15,e:18,i:2,o:15,h:10,t:3,y:2,c:1,z:1,u:5,r:5,w:2},d:{a:14,d:4,s:10,e:61,y:8,i:12,o:7,r:3,u:10,l:0},f:{a:5,f:6,e:12,y:1,i:3,l:13,o:16,r:7,u:20,t:4,n:1,s:1},g:{a:16,e:12,o:17,u:7,s:18,g:0,y:2,i:8,l:3,n:2,h:3,r:9,w:1},j:{a:8,i:7,e:14,o:5,u:29},k:{i:3,s:20,e:41,u:6,a:10,l:20,n:8,o:2,h:1,y:1},p:{s:12,e:20,a:5,y:2,i:13,t:4,u:10,l:3,o:13,r:4},s:{o:8,i:8,e:13,k:6,h:10,s:8,t:14,y:1,p:5,c:2,l:2,a:10,m:2,n:6,u:8,w:2},v:{a:10,e:20,i:22,o:6,y:6},y:{l:6,e:15,s:20,d:3,n:8,r:8,t:4,a:4,k:1,m:1,o:3,x:3,p:3,u:1,v:1},q:{u:100},w:{a:24,e:17,l:4,r:2,s:3,n:6,y:20,k:1,d:1,t:1,i:17,u:6,o:10,h:20},x:{e:35,i:6,l:3,o:6,a:6,t:3,y:7},z:{z:10,y:10,a:3,e:43,r:1,o:8,i:7,u:1}}

New results:

words: 53 mistakes, non-words: 159 mistakes

http://fiddle.jshell.net/jc73sjyn/2/