Basic NLP in CoffeeScript or JavaScript -- Punkt tokenization, simple trained Bayes models -- where to start?

I think that, as you wrote in your comment, the amount of data the algorithms need in order to work well will eventually prevent you from doing this client-side. Even basic processing requires a lot of data, for instance bigram/trigram frequencies. Symbolic approaches need significant data too (grammar rules, dictionaries, etc.). In my experience, you can't run a good NLP process with less than about 3MB to 5MB of data, which I think is too much for today's clients.
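To make the "bigram frequencies" point concrete, here is a minimal sketch of how such a table is built. The function name `countBigrams` and the naive tokenizer (lowercase, split on anything that is not a letter or apostrophe) are my own assumptions for illustration, not taken from any library.

```javascript
// Count how often each pair of adjacent words (bigram) occurs in a text.
// Assumption: a deliberately naive English-only tokenizer.
function countBigrams(text) {
  const words = text.toLowerCase().split(/[^a-z']+/).filter(Boolean);
  const counts = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    const key = words[i] + ' ' + words[i + 1];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const counts = countBigrams('the cat sat on the mat and the cat slept');
console.log(counts.get('the cat')); // 2
console.log(counts.size);           // 8 distinct bigrams
```

Ten words already produce eight distinct entries; a table built from a real corpus has to store a count for every pair it has seen, which is where the megabytes come from.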

So I would do the processing over the wire. For that I would recommend an asynchronous/push approach, perhaps using Faye or Socket.io. I'm sure you can achieve a smooth, fluid UX as long as the user is not blocked while the client waits for the server to process the text.
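One possible way to keep the user from being stuck, sketched under assumptions: send each piece of text asynchronously and only act on the response to the most recent request. Here `latestOnly` is a hypothetical helper and `fakeAnalyze` is a stand-in for the real server call (over fetch, Faye, Socket.io, or anything else that returns a Promise).

```javascript
// Wrap an async "send text to the server" function so that only the result of
// the most recent request reaches the callback; stale responses are dropped.
// `send` is an assumed transport function that returns a Promise.
function latestOnly(send, onResult) {
  let lastId = 0;
  return function request(text) {
    const id = ++lastId;
    return send(text).then((result) => {
      if (id === lastId) onResult(result); // ignore superseded requests
    });
  };
}

// Stand-in for a real server: resolves with a word count after a short delay.
function fakeAnalyze(text) {
  return new Promise((resolve) => {
    const words = text.split(/\s+/).filter(Boolean).length;
    setTimeout(() => resolve({ text, words }), 10);
  });
}

const analyze = latestOnly(fakeAnalyze, (r) => console.log('analysis:', r));
analyze('Hello');       // superseded by the next call, result is dropped
analyze('Hello world'); // only this result is logged
```

The user can keep typing while requests are in flight, and the UI never shows an analysis of text that has since changed.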


There is a quite nice natural language processing library for Node.js called natural. It's not currently built to run in the browser, but the authors have stated that they want to fix that. Most of it might even work already with something like Browserify or RequireJS.


winkjs has several packages for natural language processing:

  1. A multilingual tokenizer that tags each token with its type, such as word, number, email, mention, etc.
  2. An English part-of-speech (POS) tagger.
  3. A language-agnostic named entity recognizer.
  4. Utility functions for common NLP tasks, and more, e.g. sentiment analysis, a lemmatizer, a naive Bayes text classifier, etc.
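To illustrate what "tags each token with its type" means, here is a rough plain-JavaScript sketch. This is not winkjs code or its API; the patterns, tag names, and the `tokenize` function are assumptions of mine, and unlike the real package it only handles simple ASCII English text.

```javascript
// Try each pattern at the current position, in priority order.
// The sticky flag (y) makes exec() match only at lastIndex.
const TOKEN_PATTERNS = [
  ['email', /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/y],
  ['mention', /@\w+/y],
  ['number', /\d+(?:\.\d+)?/y],
  ['word', /[A-Za-z]+(?:'[A-Za-z]+)?/y],
  ['punctuation', /[^\sA-Za-z0-9]/y],
];

function tokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    if (/\s/.test(text[pos])) { pos++; continue; } // skip whitespace
    for (const [tag, re] of TOKEN_PATTERNS) {
      re.lastIndex = pos;
      const m = re.exec(text);
      if (m) {
        tokens.push({ value: m[0], tag });
        pos += m[0].length;
        break; // every non-space character matches at least one pattern
      }
    }
  }
  return tokens;
}

console.log(tokenize('Mail bob@example.com or ping @alice: 2 items cost 3.50.'));
```

The order of the patterns matters: the email pattern has to be tried before the word and mention patterns, otherwise `bob@example.com` would be split into three tokens.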

More broadly, winkjs provides packages for statistical analysis, natural language processing, and machine learning in Node.js. The code is thoroughly documented and has test coverage of ~100%, which makes it a reasonable base for production-grade solutions.
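Since the question asks about simple trained Bayes models, it may help to see how small the core of one is. What follows is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing in plain JavaScript. It is not the winkjs or natural implementation; the class name, method names, and tokenizer are all assumptions made for illustration.

```javascript
// Minimal multinomial naive Bayes text classifier with add-one smoothing.
class NaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.totalWords = new Map(); // label -> total word tokens seen
    this.vocabulary = new Set();
    this.totalDocs = 0;
  }

  // Assumption: naive English-only tokenizer.
  static tokenize(text) {
    return text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  }

  learn(text, label) {
    if (!this.wordCounts.has(label)) {
      this.docCounts.set(label, 0);
      this.wordCounts.set(label, new Map());
      this.totalWords.set(label, 0);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    this.totalDocs++;
    const counts = this.wordCounts.get(label);
    for (const word of NaiveBayes.tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.totalWords.set(label, this.totalWords.get(label) + 1);
      this.vocabulary.add(word);
    }
  }

  classify(text) {
    const words = NaiveBayes.tokenize(text).filter((w) => this.vocabulary.has(w));
    let best = null;
    let bestScore = -Infinity;
    for (const [label, docs] of this.docCounts) {
      // log P(label) + sum of log P(word | label), computed in log space
      let score = Math.log(docs / this.totalDocs);
      const counts = this.wordCounts.get(label);
      const denom = this.totalWords.get(label) + this.vocabulary.size;
      for (const word of words) {
        score += Math.log(((counts.get(word) || 0) + 1) / denom);
      }
      if (score > bestScore) { bestScore = score; best = label; }
    }
    return best; // null if nothing has been learned yet
  }
}

const nb = new NaiveBayes();
nb.learn('great fun and a wonderful story', 'positive');
nb.learn('loved it, great acting', 'positive');
nb.learn('boring plot and terrible acting', 'negative');
nb.learn('awful, a waste of time', 'negative');
console.log(nb.classify('a great story'));  // positive
console.log(nb.classify('terrible waste')); // negative
```

The algorithm itself is tiny; what takes space is the trained word-count table, which has to be shipped to the client or kept on the server.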