A PHP Library / Class to Count Words in Various Languages?

Counting chars is easy:

echo strlen('一个有十的字符的句子'); // 30 (WRONG! strlen() counts bytes, not characters)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10 (note: utf8_decode() is deprecated as of PHP 8.2)
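
For comparison, here is a minimal sketch of the same character count that avoids utf8_decode(). The helper name utf8_length() is hypothetical (my own, not from any library); it assumes the input is valid UTF-8, uses mb_strlen() when the mbstring extension is loaded, and otherwise falls back to PCRE in UTF-8 mode.

```php
<?php
// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumes $s is valid UTF-8.
function utf8_length(string $s): int
{
    if (function_exists('mb_strlen')) {
        // mbstring is available: let it count the characters.
        return mb_strlen($s, 'UTF-8');
    }
    // Fallback: with the /u modifier "." matches one code point,
    // and /s lets it match newlines as well.
    return (int) preg_match_all('/./us', $s);
}

echo strlen('一个有十的字符的句子'), "\n";      // 30 (bytes)
echo utf8_length('一个有十的字符的句子'), "\n"; // 10 (characters)
```

Note that this counts code points, so a base letter followed by a combining mark still counts as two.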

Counting words is where things get tricky, especially for Chinese, Japanese, and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and I don't understand how word counting works in it, so you'll have to educate me a bit: what makes a word in these languages? Is it a specific character or set of characters? I remember reading something about how hard it was to identify Japanese words in T9 input, but I can't find it anymore.

The following should correctly return the number of words in languages that use spaces or punctuation characters as word separators:

count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY)); // -1 = no limit (passing null is deprecated since PHP 8.1)
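
To make the behaviour of that one-liner easier to check, here is the same split wrapped in a small function. The name count_separated_words() is hypothetical, and this is only a sketch of the approach above, not a solution for Chinese or Japanese text.

```php
<?php
// Hypothetical wrapper around the split shown above.
// Splits on runs of Unicode separators (\p{Z}) and punctuation (\p{P}).
function count_separated_words(string $string): int
{
    $parts = preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. invalid UTF-8 input.
    return $parts === false ? 0 : count($parts);
}

echo count_separated_words('Hello, world! How are you?'), "\n"; // 5
echo count_separated_words('一个有十的字符的句子'), "\n";          // 1
```

Two things stand out when running it. The Chinese sentence comes back as a single "word", which is exactly the problem I'm asking about. And since the apostrophe is in \p{P}, a contraction like "don't" is counted as two words. Tabs and newlines are control characters rather than \p{Z}, so the pattern would probably need \s added to handle multi-line text.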

Tags: php, utf-8, nlp, word-count
