How do we perform information retrieval on unstructured data?

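If we store the data as a matrix, where x = words and y = documents, every (word, document) pair needs its own cell.

Suppose we have 1 million words & 100_000 documents.

The matrix then needs 10^6 × 10^5 = 10^11 cells. That is a huge amount of space, and most of it is wasted, because each document contains only a small fraction of the words.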
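
To make the cost concrete, the arithmetic can be sketched as below. The one-bit-per-cell figure is an assumption for illustration, not something the notes specify, and the variable names are hypothetical.

```python
# Size of a dense word-by-document matrix for the figures above.
# Assumption: each cell stores a single bit (word present / absent).
num_words = 1_000_000
num_docs = 100_000

cells = num_words * num_docs      # one cell per (word, document) pair
size_gb = cells / 8 / 10**9       # bits -> bytes -> gigabytes (decimal)

print(f"{cells:,} cells, about {size_gb} GB at 1 bit per cell")
```

Any wider cell type (a byte, a count) multiplies that figure accordingly.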

As such, we should store only the pairs that actually occur, using an adjacency list: each key maps to the list of items it is linked to.

Since we look up words more often than documents, we index by word: each word maps to the list of docIDs that contain it.
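
A rough sketch of that layout, assuming a plain in-memory dictionary is acceptable. The sample words and docIDs, and the helper name `docs_containing`, are made up for illustration.

```python
# Adjacency-list layout: each word maps to the docIDs that contain it.
# Only (word, document) pairs that actually occur take up space.
index = {
    "I": [1, 2],
    "did": [1],
}

def docs_containing(word):
    """Return the docIDs for a word, or an empty list if the word is unknown."""
    return index.get(word, [])

print(docs_containing("I"))    # [1, 2]
print(docs_containing("cat"))  # []
```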

Steps

  1. Create a table matching term & docID

     | Term | DocID |
     |------|-------|
     | I    | 2     |
     | I    | 1     |
     | did  | 1     |

  1. Sort the table by term (alphabetically, etc.), then by docID.

     | Term | DocID |
     |------|-------|
     | did  | 1     |
     | I    | 1     |
     | I    | 2     |

  1. Merge duplicate entries, so that each term appears once with its list of docIDs.
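
The three steps above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes documents arrive as a `{docID: text}` dictionary, splits on whitespace, and lowercases terms so that the sort is case-insensitive. The function name `build_index` and the sample documents are hypothetical.

```python
from itertools import groupby

def build_index(docs):
    """Build a term -> sorted docID list mapping from {docID: text}."""
    # Step 1: one (term, docID) pair per word occurrence.
    # Assumption: whitespace tokenization and lowercasing.
    pairs = [
        (term.lower(), doc_id)
        for doc_id, text in docs.items()
        for term in text.split()
    ]
    # Step 2: sort by term, then by docID (tuple ordering does both).
    pairs.sort()
    # Step 3: merge duplicates and collect the docIDs of each term.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index

docs = {1: "I did I", 2: "I"}
print(build_index(docs))  # {'did': [1], 'i': [1, 2]}
```

Document 1 contains "I" twice, yet docID 1 appears only once in the list for that term, which is the effect of the merge step.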

Lookup

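For a query with two terms, merge the two postings lists (the docID lists of the two terms). For an AND query, keep only the docIDs that appear in both lists.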
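
A minimal sketch of the merge, assuming the query is an AND of two terms and that both postings lists are already sorted by docID. The function name `intersect` and the sample lists are illustrative only.

```python
def intersect(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    result = []
    i = j = 0
    # Walk both lists in step, always advancing the smaller docID.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])  # docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # [2, 31]
```

Because both lists are sorted, each element is visited at most once, so the merge takes time proportional to the combined length of the two lists.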

Boolean retrieval model