What difficulties does the tokenizer face across different languages?
German noun compounds are not segmented; they are written as single long words.
A compound-splitter module is used to break them into their component words.
This gives about a 15% performance boost.
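A minimal sketch of how such a compound splitter could work, assuming a small hand-made lexicon and a greedy recursive split (the word list and the strategy here are illustrative, not the actual module). The example word Lebensversicherungsgesellschaftsangestellter means "life insurance company employee".

```python
# Dictionary-based German compound splitting (illustrative sketch).
# LEXICON is a toy assumption; a real splitter uses a full lexicon,
# handles linking elements, and scores candidate splits by frequency.
LEXICON = {"lebens", "versicherungs", "gesellschafts", "angestellter"}

def split_compound(word, lexicon=LEXICON):
    """Return a list of lexicon words covering `word`, or None if no split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):      # try longer prefixes first
        prefix, rest = word[:i], word[i:]
        if prefix in lexicon:
            tail = split_compound(rest, lexicon)
            if tail is not None:
                return [prefix] + tail
    return None

print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
# -> ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']
```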
Japanese and Chinese are written without spaces between words, so the tokenizer must segment the text into words itself.
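A minimal sketch of greedy longest-match ("maximum matching") segmentation; the toy lexicon is an assumption, and production systems (e.g. jieba for Chinese, MeCab for Japanese) use much larger dictionaries plus statistical models.

```python
# Greedy longest-match ("maximum matching") word segmentation sketch.
LEXICON = {"北京", "大学", "北京大学", "生", "学生"}

def max_match(text, lexicon=LEXICON, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest dictionary word starting at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # unknown character becomes its own token
            i += 1
    return tokens

# Greedy matching picks 北京大学 + 生, though 北京 + 大学生 is also a valid
# reading of the string; this is the classic segmentation ambiguity.
print(max_match("北京大学生"))   # -> ['北京大学', '生']
```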
Arabic and Hebrew are written right to left, but certain items (such as numbers) are written left to right.
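The text itself is stored in logical (reading) order; only the display is bidirectional, so the tokenizer still sees a single logical character sequence. A quick way to inspect the mixed directionality classes (the example string is just an illustration):

```python
import unicodedata

# Print the Unicode bidirectional class of each character in a mixed string.
# Hebrew letters report 'R' (right-to-left); ASCII digits report 'EN'
# (European number), which is laid out left-to-right even inside RTL text.
text = "שלום 123"
for ch in text:
    print(repr(ch), unicodedata.bidirectional(ch))
```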
French elision: should l'ensemble ("the set / the whole") be one token or two (l' + ensemble)? Splitting lets it match plain ensemble or un ensemble.
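A regex sketch of the "two tokens" option, assuming a hand-picked clitic list (l', d', qu', ...) rather than any standard tokenizer's rule set:

```python
import re

# Treat an elided article/clitic as its own token, so
# "l'ensemble" -> ["l'", "ensemble"].  The clitic list is an assumption.
CLITICS = r"(?:l|d|j|qu|n|s|t|c|m)'"

def tokenize_fr(text):
    pattern = rf"{CLITICS}|\w+|[^\w\s]"
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(tokenize_fr("L'ensemble des documents"))
# -> ["L'", 'ensemble', 'des', 'documents']
```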