What difficulties does a tokenizer face across languages?

German compound nouns are written without internal segmentation (e.g. Lebensversicherungsgesellschaftsangestellter, "life insurance company employee").

We use a compound splitter module, which gives us a ~15% performance boost.
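As a rough illustration, compound splitting can be done by recursively looking up prefixes in a vocabulary. This is a minimal sketch assuming a toy vocabulary and only the German linking "s"; production splitters handle more linking elements and score competing splits.

```python
def split_compound(word, vocab, min_part=3):
    """Recursively split a German compound into vocabulary words."""
    word = word.lower()
    if word in vocab:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head, rest = word[:i], word[i:]
        if head in vocab:
            candidates = [rest]
            if rest.startswith("s"):  # allow the German linking "s" (Fugenelement)
                candidates.append(rest[1:])
            for cand in candidates:
                tail = split_compound(cand, vocab, min_part)
                if tail:
                    return [head] + tail
    return None  # no split found

# Toy vocabulary (hypothetical; a real splitter uses a full lexicon)
vocab = {"leben", "versicherung", "gesellschaft"}
print(split_compound("Lebensversicherungsgesellschaft", vocab))
# → ['leben', 'versicherung', 'gesellschaft']
```

Splitting the compound lets a query for Versicherung match documents containing the longer compound.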

Japanese and Chinese have no spaces between words, so word boundaries must be found by a segmentation algorithm.
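A classic baseline for such segmentation is greedy longest-match ("MaxMatch"): at each position, take the longest dictionary word that matches. A minimal sketch, assuming a toy dictionary (real segmenters use statistical or neural models):

```python
def max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation over unspaced text."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest candidate first, falling back to a single character
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

dictionary = {"我们", "喜欢", "吃", "苹果"}  # toy dictionary
print(max_match("我们喜欢吃苹果", dictionary))
# → ['我们', '喜欢', '吃', '苹果']
```

MaxMatch works poorly when a greedy long match hides the correct shorter words, which is one reason modern segmenters are trained rather than purely dictionary-driven.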

Arabic and Hebrew are basically written right to left, but certain items, such as numbers, are written left to right, so the order of characters in the file differs from the visual order on the page.

French: L'ensemble ("the set") -> one token or two? The elided article l' should arguably be split off, so that l'ensemble can match un ensemble.
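One common choice is to split elided clitics off at the apostrophe. A minimal sketch assuming a small fixed clitic list (the names `CLITIC` and `tokenize_fr` are illustrative):

```python
import re

# Elided French clitics/articles: l', d', j', qu', etc. (assumed list)
CLITIC = re.compile(r"^(l|d|j|m|n|s|t|c|qu)'(.+)$", re.IGNORECASE)

def tokenize_fr(text):
    """Whitespace-tokenize, then split clitics off at the apostrophe."""
    tokens = []
    for word in text.split():
        m = CLITIC.match(word)
        if m:
            tokens.extend([m.group(1) + "'", m.group(2)])
        else:
            tokens.append(word)
    return tokens

print(tokenize_fr("l'ensemble des données"))
# → ["l'", 'ensemble', 'des', 'données']
```

With this split, the content word ensemble is indexed on its own and matches un ensemble.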