What difficulties does the tokenizer face across different languages?
German noun compounds are not segmented; they are written as single long words.
A compound-splitter module is used to break them into their component words.
This gives about a 15% performance boost.
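A minimal sketch of how such a compound splitter could work, assuming a small hand-made lexicon and a greedy recursive split (the word list and the strategy here are illustrative, not the actual module). The example word Lebensversicherungsgesellschaftsangestellter means "life insurance company employee".

```python
# Dictionary-based German compound splitting (illustrative sketch).
# LEXICON is a toy assumption; a real splitter uses a full lexicon,
# handles linking elements, and scores candidate splits by frequency.
LEXICON = {"lebens", "versicherungs", "gesellschafts", "angestellter"}

def split_compound(word, lexicon=LEXICON):
    """Return a list of lexicon words covering `word`, or None if no split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):      # try longer prefixes first
        prefix, rest = word[:i], word[i:]
        if prefix in lexicon:
            tail = split_compound(rest, lexicon)
            if tail is not None:
                return [prefix] + tail
    return None

print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
# -> ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']
```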
Japanese and Chinese are written without spaces between words, so the tokenizer must segment the text into words itself.
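A minimal sketch of greedy longest-match ("maximum matching") segmentation; the toy lexicon is an assumption, and production systems (e.g. jieba for Chinese, MeCab for Japanese) use much larger dictionaries plus statistical models.

```python
# Greedy longest-match ("maximum matching") word segmentation sketch.
LEXICON = {"北京", "大学", "北京大学", "生", "学生"}

def max_match(text, lexicon=LEXICON, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest dictionary word starting at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # unknown character becomes its own token
            i += 1
    return tokens

# Greedy matching picks 北京大学 + 生, though 北京 + 大学生 is also a valid
# reading of the string; this is the classic segmentation ambiguity.
print(max_match("北京大学生"))   # -> ['北京大学', '生']
```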
Arabic and Hebrew are written right to left, but certain items (such as numbers) are written left to right.
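The text itself is stored in logical (reading) order; only the display is bidirectional, so the tokenizer still sees a single logical character sequence. A quick way to inspect the mixed directionality classes (the example string is just an illustration):

```python
import unicodedata

# Print the Unicode bidirectional class of each character in a mixed string.
# Hebrew letters report 'R' (right-to-left); ASCII digits report 'EN'
# (European number), which is laid out left-to-right even inside RTL text.
text = "שלום 123"
for ch in text:
    print(repr(ch), unicodedata.bidirectional(ch))
```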
French elision: should l'ensemble ("the set / the whole") be one token or two (l' + ensemble)? Splitting lets it match plain ensemble or un ensemble.
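A regex sketch of the "two tokens" option, assuming a hand-picked clitic list (l', d', qu', ...) rather than any standard tokenizer's rule set:

```python
import re

# Treat an elided article/clitic as its own token, so
# "l'ensemble" -> ["l'", "ensemble"].  The clitic list is an assumption.
CLITICS = r"(?:l|d|j|qu|n|s|t|c|m)'"

def tokenize_fr(text):
    pattern = rf"{CLITICS}|\w+|[^\w\s]"
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(tokenize_fr("L'ensemble des documents"))
# -> ["L'", 'ensemble', 'des', 'documents']
```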