How do we perform information retrieval on unstructured data?
If we store using matrix, where x = words, y = documents,
Suppose we have 1 million words & 100_000 documents.
We allocate a huge amount of space.
As such, we should store in adjacency list.
Since we want to lookup words more often than documents, we index by word.
Steps
- Create a table matching term & docID
Term | DocID |
---|---|
I | 2 |
I | 1 |
did | 1 |
- Sort table by terms (alphabetically etc…), followed by docID.
Term | DocID |
---|---|
did | 1 |
I | 1 |
I | 2 |
- Merge duplicate entries.
Lookup
Merge the 2 posting .