Index every consecutive pair of terms in the text as a phrase.
“Hi I am” -> “Hi I”, “I am”
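A minimal sketch of the indexing step above, assuming whitespace tokenization and lowercasing (neither is specified in these notes). The names `biwords` and `build_biword_index` are made up for illustration.

```python
from collections import defaultdict


def biwords(text):
    """Return every consecutive pair of terms in text as a phrase string."""
    # Assumption: split on whitespace and lowercase; a real tokenizer may differ.
    terms = text.lower().split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


docs = {1: "Hi I am", 2: "I am here"}
index = build_biword_index(docs)
```

Here `index["i am"]` holds both document ids, while `index["hi i"]` holds only document 1.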
For longer phrases, use a conjunction (AND) search over the biwords: “Hi I am” -> “Hi I” AND “I am”
However, we have to maintain the doc source and check it, to make sure these terms actually appear next to each other as the full phrase, rather than at different places in the document.
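The sketch below illustrates why the check against the doc source is needed. It assumes whitespace tokenization and lowercasing, and all function names are hypothetical. Document 2 contains both “hi i” and “i am”, so it survives the AND step, but the full phrase never occurs in it.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization and lowercasing.
    return text.lower().split()


def biwords(terms):
    """Consecutive term pairs, as tuples."""
    return list(zip(terms, terms[1:]))


def build_index(docs):
    """Map each biword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(tokenize(text)):
            index[bw].add(doc_id)
    return index


def contains_phrase(terms, phrase):
    """True if the list phrase occurs as a contiguous run inside terms."""
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))


def candidates(phrase, index):
    """AND step: documents that contain every biword of the phrase."""
    pairs = biwords(tokenize(phrase))
    if not pairs:
        return set()  # a one-term query needs an ordinary term index
    return set.intersection(*(index.get(bw, set()) for bw in pairs))


def phrase_search(phrase, index, docs):
    """AND step, followed by verification against the stored doc source."""
    query = tokenize(phrase)
    return {d for d in candidates(phrase, index)
            if contains_phrase(tokenize(docs[d]), query)}


docs = {
    1: "Hi I am here",
    2: "Hi I think that I am late",  # both biwords present, phrase absent
    3: "I am hi",
}
index = build_index(docs)
```

With these three documents, `candidates("Hi I am", index)` yields documents 1 and 2, and `phrase_search("Hi I am", index, docs)` keeps only document 1.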
For extended biwords, index any sequence of terms of the form
N X* N
Where N is a noun and X* means that there are zero or more articles / prepositions in between
“Catcher in the rye” becomes “catcher rye”. This is because we tag each term as a noun or an article / preposition, and preserve only the nouns.
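A rough sketch of extended biword extraction. Real systems would use a part-of-speech tagger; here a tiny hand-written lexicon (`NOUNS`, `FUNCTION_WORDS`) stands in for one, and both the lexicon and the function names are assumptions made only for this example. Any term that is neither a noun nor an article / preposition is tagged "O" and breaks the N X* N pattern.

```python
# Toy lexicon standing in for a real part-of-speech tagger (assumption).
NOUNS = {"catcher", "rye", "field", "cost", "living"}
FUNCTION_WORDS = {"in", "the", "of", "a", "an", "on", "at"}


def toy_tag(text):
    """Tag each term as 'N' (noun), 'X' (article/preposition) or 'O' (other)."""
    tagged = []
    for term in text.lower().split():
        if term in NOUNS:
            tagged.append((term, "N"))
        elif term in FUNCTION_WORDS:
            tagged.append((term, "X"))
        else:
            tagged.append((term, "O"))
    return tagged


def extended_biwords(tagged):
    """Return 'noun noun' strings for every N X* N match in the tagged terms."""
    result = []
    prev_noun = None
    for term, tag in tagged:
        if tag == "N":
            if prev_noun is not None:
                result.append(f"{prev_noun} {term}")
            prev_noun = term  # this noun may start the next match
        elif tag != "X":
            prev_noun = None  # a non-X term between nouns breaks the pattern
    return result
```

For example, `extended_biwords(toy_tag("Catcher in the rye"))` gives `["catcher rye"]`. The same function is applied to a query before lookup, so that the query and the index use the same reduced form.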