How do we perform information retrieval on unstructured data?
If we store the data as a matrix, where x = words and y = documents,
then with 1 million words & 100_000 documents
we allocate 10^11 cells, a huge amount of space, most of it empty.
As such, we should store it as an adjacency list instead, keeping only the entries that are actually present.
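A rough back-of-the-envelope sketch of the space argument, in Python. The word and document counts come from the notes above; the 1 bit per matrix cell and the average of 1,000 distinct words per document are assumptions made up purely for illustration.

```python
NUM_WORDS = 1_000_000   # from the notes
NUM_DOCS = 100_000      # from the notes

# Assumption (hypothetical, for illustration only): each document
# contains about 1,000 distinct words.
AVG_DISTINCT_WORDS_PER_DOC = 1_000

# Matrix: one cell per (word, document) combination.
dense_cells = NUM_WORDS * NUM_DOCS
# Assumption: the cheapest possible cell, 1 bit (word present or not).
dense_bytes = dense_cells // 8

# Adjacency list: one entry per (word, document) pair that actually occurs.
sparse_entries = NUM_DOCS * AVG_DISTINCT_WORDS_PER_DOC

print(f"matrix cells:   {dense_cells:,}")
print(f"matrix bytes:   {dense_bytes:,}")
print(f"sparse entries: {sparse_entries:,}")
```

Under these assumptions the list holds about a thousandth as many entries as the matrix has cells, although each entry is larger than one bit.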
Since we look up words more often than documents, we index by word.
- Create a table of (term, docID) pairs, one row per occurrence of a term in a document.
- Sort the table by term (e.g. alphabetically), then by docID.
- Merge duplicate entries, so each (term, docID) pair appears only once.
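The three steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `build_index`, the sample documents, and the lowercase-and-split-on-whitespace tokenization are all assumptions. It also groups the docIDs under each term at the end, which is one common way to store the result.

```python
def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Build a term -> sorted list of docIDs mapping (hypothetical helper)."""
    # Step 1: table of (term, docID) pairs, one per occurrence.
    # Assumption: terms are lowercase whitespace-separated tokens.
    pairs = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    ]

    # Step 2: sort by term, then by docID (tuples compare in that order).
    pairs.sort()

    # Step 3: merge duplicates. After sorting, identical pairs are adjacent,
    # so it is enough to compare against the last docID stored for the term.
    index: dict[str, list[int]] = {}
    for term, doc_id in pairs:
        doc_ids = index.setdefault(term, [])
        if not doc_ids or doc_ids[-1] != doc_id:
            doc_ids.append(doc_id)
    return index


# Made-up sample documents, keyed by docID.
docs = {
    1: "new home sales new",
    2: "home sales rise",
    3: "new sales",
}
index = build_index(docs)
print(index["new"])    # [1, 3]
print(index["sales"])  # [1, 2, 3]
```

Looking up a word is then a single dictionary access, which fits the decision to index by word rather than by document.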
Merge the 2.