Vector Model of Text Retrieval






Vector Model

The vector model (applied to information retrieval) was first proposed by Gerard Salton.

Often called the "father" of the field of text retrieval, Salton earned his Ph.D. from Harvard.

SMART system.

Most web search engines

Vector Space

Formally, a vector space is defined by a set of linearly independent basis vectors.

Basis Vectors:

What should be the basis vectors for IR?

"Core" concepts

Terms

Representing Documents and Queries

Get your vector space of terms (the vocabulary for the dataset). Both documents and queries (indeed all relevant objects) are represented by vectors.

Vectors of features, in this case vocabularies.

Assuming a vocabulary of size N (also called the indexing vocabulary)
 

Q_j = { w_1 t_1, w_2 t_2, w_3 t_3, ..., w_N t_N }

D_k = { v_1 t_1, v_2 t_2, v_3 t_3, ..., v_N t_N }


The w_i and v_i are weights in [0,1] that represent the relative importance of the terms. Each vector has an entry for every term in the vocabulary.

Since any one document or query uses only a few of the vocabulary terms, most of the weights are 0.

The t_i are terms.
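
Because most of the weights are 0, such vectors are commonly stored sparsely. The following is a minimal sketch, assuming a hypothetical five-term vocabulary and Python dictionaries as the storage format; neither the vocabulary nor the names VOCABULARY and to_dense come from these notes.

```python
# Hypothetical indexing vocabulary; its order fixes the vector dimensions.
VOCABULARY = ["animals", "cow", "diseases", "farm", "mad"]


def to_dense(sparse_weights, vocabulary=VOCABULARY):
    """Expand a {term: weight} mapping into a full N-entry vector.

    Terms missing from the mapping get weight 0; terms outside the
    vocabulary are ignored.
    """
    return [sparse_weights.get(term, 0.0) for term in vocabulary]


# Only the nonzero weights are stored (illustrative values).
query = {"diseases": 1.0, "farm": 0.5, "animals": 0.5}
document = {"cow": 0.8, "farm": 0.3}

query_vector = to_dense(query)        # [0.5, 0.0, 1.0, 0.5, 0.0]
document_vector = to_dense(document)  # [0.0, 0.8, 0.0, 0.3, 0.0]
```

Both vectors have one entry per vocabulary term, so a query and a document can be compared dimension by dimension.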

Example query: " I am interested in diseases of farm animals, especially mad cow disease"
 

Q1 = { 1.0 diseases, 0.5 farm, 0.5 animals, 0.7 farm animals, 1.0 mad cow disease }
Boolean:  (farm animal? AND disease?) OR mad cow disease?
Important:  Where do the weights come from?

Indexers, Users, Statistical algorithms, Algorithms with other flavours

Normalization

Normalizing the weights with different scales for system and user.


Documents and queries are represented against the same universe of terms (i.e. vocabulary).
 

N-Dimensional Vector Space

N: represents the size of the vocabulary.

The following diagram represents a vector model representation of the terms 'information,' 'system,' and 'retrieval,' and six documents.
 
 

 Figure 1: 3-dimensional vector space where each dimension represents a term or concept in the vocabulary

Independence of terms

Three perpendicular dimensions corresponding to the three terms in the vocabulary.

Since the dimensions are at 90 degree angles to each other, the corresponding terms are considered independent.

The model assumes the terms are independent in both the statistical and linguistic senses.

Statistical independence means that the occurrence of one term is not related to the occurrence of another.

Linguistic independence means that the interpretation of one term does not affect that of another.

Ex: apple and computer;  apple and murder
 

Term Weighting in SMART


Each weighting scheme is represented by a triple ABC, where A represents "within document frequency", B represents "inverse document frequency" and C represents "cosine normalization".

Within document frequency: the number of times the term occurs in the document.
Inverse document frequency: a measure of how rare the term is in the collection; terms that occur in fewer documents receive a higher weight.
Cosine normalization: accounts for the length of the vector.

Many variations on this theme:
 
 

DIMENSION | SYMBOL         | INTERPRETATION
A         | b: binary      | 1 if term is present and 0 if absent in the document
A         | a: augmented   | 0.5 + 0.5 * tf/max_tf_in_document
A         | l: logarithmic | 1 + ln(tf)
A         | L              | (1 + log(tf)) / (1 + log(average tf in text))
A         | n: none        | tf
B         | t              | ln(N/n); N = # of docs in database; n = # of docs with term
B         | n              | idf is not used
C         | c: cosine      | sqrt(w_1^2 + w_2^2 + ... + w_m^2); m = vocabulary size
C         | n              | no normalization
C         | u              | 1 / { 0.8 + 0.2 * (# unique words in text / avg. # of unique words per document) }

Thus we can have indexing, i.e., term weighting strategies such as atn or nnc.

For example, atc is w_t = { (0.5 + 0.5 * tf/max_tf_in_document) * ln(N/n) } / sqrt(w_1^2 + w_2^2 + ... + w_m^2)
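
The atc formula can be sketched in a few lines of Python. This is an illustrative sketch, not the SMART implementation: the function name atc_weights and its parameters are hypothetical, and it assumes that the w values under the square root are the unnormalized (augmented tf times idf) weights of the terms in the same document.

```python
import math


def atc_weights(term_freqs, doc_freqs, num_docs):
    """Return {term: weight} for one document under the atc scheme.

    term_freqs: {term: tf} for this document (every tf > 0)
    doc_freqs:  {term: n}, the number of documents containing the term
    num_docs:   N, the number of documents in the collection
    """
    max_tf = max(term_freqs.values())
    # a: augmented tf, multiplied by t: idf = ln(N/n)
    raw = {
        term: (0.5 + 0.5 * tf / max_tf) * math.log(num_docs / doc_freqs[term])
        for term, tf in term_freqs.items()
    }
    # c: cosine normalization, dividing by the length of the raw vector
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0.0:
        return raw  # every term occurs in all documents; nothing to scale
    return {term: w / length for term, w in raw.items()}


# Made-up counts: a 100-document collection and a two-term document.
weights = atc_weights(
    term_freqs={"information": 3, "retrieval": 1},
    doc_freqs={"information": 10, "retrieval": 50},
    num_docs=100,
)
```

After the c step the document vector has length 1; dropping that step gives the atn variant.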

We can index the document using one strategy and the query using another.

Example: atc for documents and atn for queries.  There is a good reason for this: queries seldom show significant variation in their lengths, while documents do.

The Ltu scheme is used by Singhal et al. in the two papers you read.

In TREC-6 (1997), the best retrieval performance was achieved by an Australian group (Australian National University) using what they call the "Cornell variant of the Okapi BM25 weighting function."

w_t = tf_d * { log[ (N - n + 0.5) / (n + 0.5) ] / [ 2 * (0.25 + 0.75 * (dl / avg-dl)) + tf_d ] }, where dl is the length of the document and avg-dl is the average document length in the collection.
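
As a rough illustration, the weighting function above can be written as a small Python function. The name okapi_weight and its parameter names are hypothetical, and the sketch assumes the natural logarithm, since the formula does not state a base.

```python
import math


def okapi_weight(tf, n, num_docs, doc_len, avg_doc_len):
    """Weight of a term in a document under the formula above.

    tf:          tf_d, occurrences of the term in the document
    n:           number of documents containing the term
    num_docs:    N, number of documents in the collection
    doc_len:     dl, length of this document
    avg_doc_len: avg-dl, average document length in the collection
    """
    # Assumption: natural log; the notes only say "log".
    idf = math.log((num_docs - n + 0.5) / (n + 0.5))
    length_norm = 2 * (0.25 + 0.75 * (doc_len / avg_doc_len))
    return tf * idf / (length_norm + tf)


# Made-up statistics: the same term counts in an average-length
# document and in a document three times as long.
average_doc = okapi_weight(tf=2, n=10, num_docs=1000, doc_len=100, avg_doc_len=100)
long_doc = okapi_weight(tf=2, n=10, num_docs=1000, doc_len=300, avg_doc_len=100)
```

For the same term frequency, the longer document receives the lower weight, which is the purpose of the dl / avg-dl term.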

Note that wt in all these represents the weight of term t in the document (d).

Conducting Retrieval

Each document and query is mapped into the N-dimensional vocabulary space according to its representation, i.e., indexing.

Each document and query point is connected to the origin by a line, thus producing its vector.

The angle between the query and each document is then measured and its cosine computed.  This cosine represents the similarity between the document and query.

Because the weights are non-negative, similarity values range from 0 (minimum) to 1 (identical direction). Recall that the cosine of 90 degrees = 0 and the cosine of 0 degrees = 1.

The following diagram shows a query in the vector model representation.
 
 

 Figure 2: A query and document vectors

Example

D1 = (0.5 t_1, 0.8 t_2, 0.3 t_3)

Q = (1.5 t_1, 1 t_2, 0 t_3)

Cosine Similarity(D1,Q) = [(0.5 x 1.5) + (0.8 x 1)]/sqrt[(0.5^2 + 0.8^2 + 0.3^2)(1.5^2 + 1^2)]

= 1.55/sqrt(0.98 x 3.25) = 0.868
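
The same calculation can be checked with a short Python sketch. The function name cosine_similarity is illustrative; vectors are plain lists with one weight per vocabulary term.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


d1 = [0.5, 0.8, 0.3]  # D1 over (t1, t2, t3)
q = [1.5, 1.0, 0.0]   # Q over (t1, t2, t3)

similarity = cosine_similarity(d1, q)  # about 0.8685
```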

Cosine and vector lengths

Angle is independent of vector lengths

Generally normalize to length of one

For unit-length vectors, the inner product equals the cosine of the angle

Inner product more efficient to compute

Very common to normalize vector lengths in index.

Same example, normalized

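D1 = (0.5 t_1, 0.8 t_2, 0.3 t_3)

Q = (1.5 t_1, 1 t_2, 0 t_3)

D1' = (0.5 t_1, 0.8 t_2, 0.3 t_3)/sqrt(0.98)

= (0.505 t_1, 0.808 t_2, 0.303 t_3)

Q' = (1.5 t_1, 1 t_2, 0 t_3)/sqrt(3.25)

= (0.832 t_1, 0.555 t_2, 0 t_3)

Similarity(D1,Q) = Similarity(D1', Q')

= [(0.505 x 0.832) + (0.808 x 0.555)]/sqrt[(0.505^2 + 0.808^2 + 0.303^2)(0.832^2 + 0.555^2)]

= 0.869, the same value as before normalization up to rounding. The denominator is now 1, so the inner product alone gives the cosine.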
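
A small Python sketch of this point: once both vectors are scaled to length one, a plain inner product yields the cosine. The helper names normalize and inner_product are illustrative.

```python
import math


def normalize(vector):
    """Scale a vector to length one (an all-zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return list(vector)
    return [x / length for x in vector]


def inner_product(u, v):
    """Sum of the products of corresponding weights."""
    return sum(a * b for a, b in zip(u, v))


d1_unit = normalize([0.5, 0.8, 0.3])  # D1'
q_unit = normalize([1.5, 1.0, 0.0])   # Q'

# No division by vector lengths is needed at query time.
similarity = inner_product(d1_unit, q_unit)  # about 0.8685
```

Normalizing document vectors once, at indexing time, means the division by vector lengths does not have to be repeated for every query.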

Query Modification

To enhance the query after conducting an initial retrieval cycle.

Rocchio's Method: using Relevance Feedback


Fig. 4

Q_new = a * Q_old + b * (average relevant vector) - r * (average nonrelevant vector)


Fig. 5

Average vectors are also called centroids. Thus we have the centroid of the relevant documents and the centroid of the nonrelevant documents.

Example:

D1 = {0.5 t1, 0.3 t2}

D2 = {0.6 t1, 0.4 t3}

Centroid = { 0.55 t1, 0.15 t2, 0.2 t3}
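
The centroid example and the query update can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, vectors are stored as {term: weight} dictionaries, and the default values of a, b and r are placeholders, since the notes do not specify them.

```python
def centroid(vectors):
    """Average a non-empty list of sparse {term: weight} vectors."""
    totals = {}
    for vector in vectors:
        for term, weight in vector.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: total / len(vectors) for term, total in totals.items()}


def rocchio(query, relevant, nonrelevant, a=1.0, b=1.0, r=1.0):
    """Q_new = a * Q_old + b * relevant centroid - r * nonrelevant centroid.

    a, b, r defaults are placeholder values, not recommendations.
    Negative weights are left as they are, not clipped to 0.
    """
    rel = centroid(relevant) if relevant else {}
    nonrel = centroid(nonrelevant) if nonrelevant else {}
    terms = set(query) | set(rel) | set(nonrel)
    return {
        t: a * query.get(t, 0.0) + b * rel.get(t, 0.0) - r * nonrel.get(t, 0.0)
        for t in terms
    }


d1 = {"t1": 0.5, "t2": 0.3}
d2 = {"t1": 0.6, "t3": 0.4}
relevant_centroid = centroid([d1, d2])  # t1: 0.55, t2: 0.15, t3: 0.2 (up to float rounding)

# Hypothetical feedback round: D1 and D2 judged relevant, one document judged nonrelevant.
new_query = rocchio({"t1": 1.0}, [d1, d2], [{"t2": 0.3}], a=1.0, b=0.5, r=0.5)
```

Terms that appear only in relevant documents (t3 here) enter the new query with positive weight, while terms concentrated in nonrelevant documents are pushed down.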

Ide's Method: using Relevance Feedback

In Ide's method of feedback, the number of nonrelevant documents used is 1.