Thursday, July 05, 2007

Librarianship and cosines - Who knew?

Did your mind ever drift in that mandatory trigonometry class, leaving you wondering when you would ever need to know this in the real world?

For me, those daydreams just met their match as I read about the Vector Space Model:
Vector space model (or term vector model) is an algebraic model used for information filtering, information retrieval, indexing and relevancy rankings. It represents natural language documents (or any objects, in general) in a formal manner through the use of vectors (of identifiers, such as, for example, index terms) in a multi-dimensional linear space...

Documents are represented as vectors of index terms (keywords). The set of terms is a predefined collection of terms, for example the set of all unique words occurring in the document corpus.
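To make the "documents as vectors" idea concrete for myself, here is a minimal Python sketch. It assumes the simplest possible setup: terms are lowercase words split on whitespace, and each vector entry is a raw word count. The function names and the two-document corpus are made up for illustration and are not part of the quoted description.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted list of unique lowercase words in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]


# Hypothetical two-document corpus, for illustration only.
corpus = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(corpus)
vectors = [to_term_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Each position in a vector corresponds to one term of the vocabulary, so every document (and every query) ends up in the same multi-dimensional space.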

Relevancy rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents.

In practice, it is easier to calculate the cosine of the angle between the vectors instead of the angle... A cosine value of zero means that the query and document vector were orthogonal and had no match (i.e. the query term did not exist in the document being considered).
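Here is a rough Python sketch of that cosine calculation, using the standard formula: the dot product of the two vectors divided by the product of their lengths. The function names, the three-term vocabulary, and the sample vectors are my own assumptions for illustration, and returning zero for an all-zero vector is a choice I made to avoid dividing by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an empty (all-zero) vector as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vector, document_vectors):
    """Return (score, document index) pairs, best match first."""
    scores = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(document_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[0], reverse=True)


# Hypothetical vocabulary order: (cat, dog, sat)
documents = [[1, 0, 1],   # "cat sat"
             [0, 1, 1]]   # "dog sat"
query = [1, 0, 0]         # "cat"

for score, index in rank(query, documents):
    print(index, round(score, 3))
```

With these sample vectors, the first document scores about 0.707 and the second scores exactly zero, since the query term "cat" never appears in it. That is the orthogonal, no-match case described above.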
