What is a language model?

Given a query, we check the probability of documents being relevant to it.

It does this by seeing which document is likely to generate the query.

It does this by assigning probabilities to document terms, based on their frequency.

The query then swaps out query terms for these probabilities, and takes their product.

This indicates the probability that a document is likely to generate a query.

The other way around

If we do it the other way round (probability of document occurring in query), this is lossy.

Most of the query terms in the document probably won’t exist in query sample space.

Formal definition

$$ P(d|q) = P(p|d) * P(d) / P(q) $$