Text extraction

What format is it in?

PDF / Word / Excel / HTML
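One way to answer the format question is to look at a file's leading bytes. The sketch below is a minimal illustration, not a complete detector: the function name `sniff_format` and its return labels are hypothetical, and it only checks a few well-known signatures (the PDF header, the ZIP header used by modern Word/Excel files, the OLE2 header used by legacy Word/Excel files, and an HTML marker).

```python
def sniff_format(data: bytes) -> str:
    """Guess a document format from its leading bytes (heuristic only)."""
    head = data[:1024]
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # ZIP container: .docx and .xlsx are ZIP archives, but so are many
        # other files, so the archive contents would need a further look.
        return "zip-container"
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        # OLE2 compound file: legacy .doc / .xls
        return "ole2-container"
    lowered = head.lstrip().lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    return "unknown"
```

A file extension or a server-supplied content type could be consulted as well, but either may be wrong, which is why content-based checks like this are common.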

What language is used? (a single document may contain several languages)
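A very simple language heuristic counts how many common function words of each candidate language appear in the text. The sketch below assumes tiny hand-picked word lists (`STOPWORDS`) and a hypothetical helper `guess_language`; real systems typically use character n-gram statistics and much larger models.

```python
import re

# Tiny illustrative word lists; these are assumptions, not a real lexicon.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in"},
    "german": {"der", "die", "und", "das", "ist", "nicht"},
    "french": {"le", "la", "et", "les", "des", "est"},
}

def guess_language(text: str) -> str:
    """Return the language whose stopwords occur most often, or 'unknown'."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Because one document may mix languages, a function like this would more sensibly be applied per paragraph or per sentence than to the whole document.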

What character set is used?

All of the above are typically determined with heuristics
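For the character set question, a common heuristic is to check for a byte order mark first and then try a strict UTF-8 decode. The sketch below is one such approach under stated assumptions: `guess_charset` is a hypothetical name, and the Latin-1 fallback is an arbitrary choice that always decodes but may give the wrong characters.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_charset(data: bytes) -> str:
    """Guess a codec name for `data` (heuristic; the result can be wrong)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")  # strict decode: fails on invalid byte sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # assumed fallback: never fails, may mis-decode
```

Random non-UTF-8 bytes rarely form valid UTF-8 sequences, so a successful strict decode is fairly strong evidence, whereas the fallback is only a guess.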