# Relevance feedback formulae for IR

## Pseudo relevance feedback

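Assume the top K ranked documents are relevant (pseudo feedback), or be given K relevant documents (explicit feedback).

Compute the centroids of the relevant and non-relevant docs: Cr, Cnr.

Compute the cosine similarity of q with Cr and with Cnr => qr, qnr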

Find the q where qr - qnr is maximized.

This is our new query.

Return results from this.
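A minimal sketch of the steps above, under assumptions not stated in the notes: document vectors are already unit-normalized, the similarity to a centroid is taken as a plain dot product, and q is restricted to unit length. Under those assumptions the q that maximizes qr - qnr points along Cr - Cnr. The names `centroid`, `unit_vec` and `feedback_query` and the toy vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit_vec(v):
    """Scale v to length 1 (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def feedback_query(ranked_docs, k):
    """Treat the top-k ranked doc vectors as relevant, the rest as non-relevant.

    Assumes 0 < k < len(ranked_docs) so that both groups are non-empty.
    """
    c_r = centroid(ranked_docs[:k])
    c_nr = centroid(ranked_docs[k:])
    # For unit-length q, q·Cr - q·Cnr = q·(Cr - Cnr) is largest when
    # q points in the direction of Cr - Cnr.
    return unit_vec([r - nr for r, nr in zip(c_r, c_nr)])

# Toy 2-dimensional document vectors, already ranked by the initial query.
ranked = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
new_q = feedback_query(ranked, k=2)
```

The new query can have negative components, as it does here; whether to keep or clip them is a design choice the notes do not address.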

Recap:

We are given a query as a list of words.

We calculate a query vector, q, from this list of words.

We are given a list of documents; for each we calculate a document vector, d.

For each d, we compute the cosine similarity with q as follows:

```
q·d
-------
|q|·|d|
```

Which is equivalent to the dot product of the unit vectors: `unit_vec(q) · unit_vec(d)`
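A short sketch to check that equivalence numerically. The helper names `dot`, `unit_vec` and `cosine` and the example vectors are illustrative assumptions; both vectors must be non-zero.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit_vec(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(q, d):
    """q·d divided by |q|·|d|."""
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

q = [1.0, 2.0, 0.0]
d = [2.0, 1.0, 2.0]

sim = cosine(q, d)                         # direct formula
sim_unit = dot(unit_vec(q), unit_vec(d))   # normalize first, then dot product
```

Both values agree up to floating-point rounding. This is why vectors are normalized once up front: scoring then needs only a dot product per document.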

- How do we calculate q from the list of words in a query (`lnc.ltc`, query part `ltc`)?

Weight for a single word: (1 + log(tf)) * log(N / df)

After obtaining a vec of weights, vw, get unit_vec(vw)

Recap:

- df: document frequency, the number of documents that contain the term (see: Difference between term / Collection, Document frequencies)
- N: number of documents in the collection (see: idf-weighting scheme)
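A minimal sketch of the query weighting above: (1 + log(tf)) * log(N / df) per word, then normalization to unit length. It assumes base-10 logarithms (the notes do not state the base) and gives weight 0 to words that are absent from the query or have no df entry. The function name `ltc_query_vector`, the vocabulary and the df counts are invented for illustration.

```python
import math

def ltc_query_vector(query_terms, vocabulary, df, n_docs):
    """Unit-length query vector with one weight per vocabulary term."""
    weights = []
    for term in vocabulary:
        tf = query_terms.count(term)      # term frequency in the query
        doc_freq = df.get(term, 0)        # number of documents containing the term
        if tf > 0 and doc_freq > 0:
            w = (1 + math.log10(tf)) * math.log10(n_docs / doc_freq)
        else:
            w = 0.0
        weights.append(w)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

# Toy collection statistics (made up for this example).
vocab = ["auto", "best", "car", "insurance"]
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
q = ltc_query_vector(["best", "car", "insurance"], vocab, df, n_docs=1_000_000)
```

Rarer words get larger weights: here "insurance" outweighs "car", which outweighs "best".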

- How do we calculate d from the list of words in a document (`lnc.ltc`, document part `lnc`)?

Weight for a single word: 1 + log(tf)

After obtaining a vec of weights, vw, get unit_vec(vw)
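A matching sketch for the document side: weight 1 + log(tf) per word, no idf factor, then normalization to unit length. As before, base-10 logarithms are an assumption, and the function name `lnc_document_vector` and the toy document are hypothetical.

```python
import math

def lnc_document_vector(doc_terms, vocabulary):
    """Unit-length document vector with one weight per vocabulary term."""
    weights = []
    for term in vocabulary:
        tf = doc_terms.count(term)        # term frequency in the document
        weights.append(1 + math.log10(tf) if tf > 0 else 0.0)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

vocab = ["auto", "best", "car", "insurance"]
d = lnc_document_vector(["car", "insurance", "auto", "insurance"], vocab)
```

Because the logarithm dampens tf, a word that occurs twice ("insurance") gets a weight only about 1.3 times that of a word that occurs once, not double.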