Text extraction
What format is it in?
PDF / Word / Excel / HTML
What language is used? (same document may have multiple languages)
What character set is used?
Each of these is determined with heuristics
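A minimal sketch of what such heuristics might look like, in Python. The function names (sniff_format, sniff_charset, guess_language) and the tiny stopword lists are hypothetical and exist only for illustration. The file signatures are the well-known ones: "%PDF-" for PDF, the ZIP header for modern Word/Excel files, and the OLE2 header for legacy ones. Real systems use far more robust detectors.

```python
import codecs
import io
import zipfile

# Hypothetical helpers for illustration only, not a production detector.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc / .xls container


def sniff_format(data: bytes) -> str:
    """Guess the document format from its leading bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(OLE2_MAGIC):
        return "ole2"  # legacy Word or Excel; telling them apart needs more parsing
    if data.startswith(b"PK\x03\x04"):
        # Modern Office files are ZIP archives; inspect the member names.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = zf.namelist()
        except zipfile.BadZipFile:
            return "zip"
        if any(n.startswith("word/") for n in names):
            return "word"
        if any(n.startswith("xl/") for n in names):
            return "excel"
        return "zip"
    head = data[:1024].lower()
    if b"<!doctype html" in head or b"<html" in head:
        return "html"
    return "unknown"


# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_charset(data: bytes) -> str:
    """Guess the character set: byte-order mark first, then a trial UTF-8 decode."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # fallback guess; every byte sequence decodes as latin-1


# Assumed toy stopword lists; a real detector would use much larger models.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "de": {"der", "die", "und", "ist", "das"},
    "fr": {"le", "la", "et", "est", "les"},
}


def guess_language(text: str) -> str:
    """Pick the language whose stopwords occur most often in the text."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Because one document may mix several languages, guess_language would plausibly be applied per paragraph or per section rather than once per file.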