C++ library that tokenizes UTF-8 encoded text into typed tokens. Tokenization is a basic building block for almost all Natural Language Processing (NLP) approaches. The library has a simple API and is initialised with a rule file that defines the token types and the regular expressions that match them.
View it on GitHub
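
To illustrate the idea of rule-driven typed tokenization described above, here is a minimal sketch using only the C++ standard library. It is not this library's API: the `Rule` and `Token` structs, the `tokenize` and `example_rules` functions, and the token type names are all hypothetical, and the rules are hard-coded here rather than read from a rule file. Note also that `std::regex` works on bytes, so this sketch is only safe for ASCII input; real UTF-8 handling needs more care.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the library's actual API.
struct Rule {
    std::string type;    // token type name, e.g. "NUMBER"
    std::regex pattern;  // regular expression that recognises this type
};

struct Token {
    std::string type;  // name of the rule that matched
    std::string text;  // matched text
};

// Assumed example rules; a real rule file would supply these.
std::vector<Rule> example_rules() {
    return {
        {"NUMBER", std::regex("[0-9]+")},
        {"WORD", std::regex("[A-Za-z]+")},
        {"PUNCT", std::regex("[.,;:!?]")},
    };
}

// Scan the input left to right. At each position, try the rules in order and
// keep the first one that matches exactly at that position. ASCII whitespace
// is skipped; any other unmatched byte becomes a one-byte "UNKNOWN" token.
std::vector<Token> tokenize(const std::string& input,
                            const std::vector<Rule>& rules) {
    std::vector<Token> tokens;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        if (*pos == ' ' || *pos == '\t' || *pos == '\n' || *pos == '\r') {
            ++pos;
            continue;
        }
        bool matched = false;
        for (const Rule& rule : rules) {
            std::smatch m;
            // match_continuous anchors the match at the current position.
            if (std::regex_search(pos, input.cend(), m, rule.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                tokens.push_back({rule.type, m.str(0)});
                pos += m.length(0);
                matched = true;
                break;
            }
        }
        if (!matched) {
            tokens.push_back({"UNKNOWN", std::string(1, *pos)});
            ++pos;
        }
    }
    return tokens;
}
```

With these assumed rules, `tokenize("Call 911, now!", example_rules())` yields five tokens: `WORD` "Call", `NUMBER` "911", `PUNCT` ",", `WORD` "now", `PUNCT` "!". Rule order decides ties in this sketch, so more specific rules should be listed first.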