"father" of the field of text retrieval, Salton earned his Ph.D. from MIT.
SMART system.
Most web search engines
Basis Vectors:
"Core" concepts
Terms
Get your vector space of terms (the vocabulary for the dataset). Both documents and queries (indeed all relevant objects) are represented by vectors.
Vectors of features, in this case vocabularies.
Assuming a vocabulary of size N (also called the indexing vocabulary)
Q_j = { w_1 t_1, w_2 t_2, w_3 t_3, ..., w_N t_N }
D_k = { v_1 t_1, v_2 t_2, v_3 t_3, ..., v_N t_N }
The w_i and v_i are weights in [0, 1] that represent
the relative importance of the terms. Each vector has an entry
for every term in the vocabulary, so most of the weights are 0.
The t_i are the terms.
Example query: "I am interested in diseases of farm animals, especially
mad cow disease"
Q1 = { 1.0 diseases, 0.5 farm, 0.5 animals, 0.7 farm animals, 1.0 mad cow disease }
Boolean: (farm animal? AND disease?) OR mad cow disease?
Important: Where do the weights come from?
Indexers, Users, Statistical algorithms, Algorithms with other flavours
Normalizing the weights with different scales for system and user.
Documents and queries are represented against the same universe
of terms (i.e. vocabulary).
N: represents the size of the vocabulary.
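As a minimal sketch of this representation, the example query Q1 can be stored sparsely (only the nonzero weights) and expanded against the vocabulary when a full N-entry vector is needed. The helper name `to_dense` and the extra vocabulary term "tractor" are assumptions added for illustration; they are not part of the original notes.

```python
def to_dense(weights, vocabulary):
    """Expand a sparse {term: weight} mapping into a list aligned with vocabulary.

    Terms missing from the mapping get weight 0.0, which is why most
    entries of a real document or query vector are zero.
    """
    return [weights.get(term, 0.0) for term in vocabulary]


# Hypothetical indexing vocabulary; "tractor" is added only to show a zero entry.
vocabulary = ["animals", "diseases", "farm", "farm animals", "mad cow disease", "tractor"]

# Q1 from the example query above, stored sparsely.
q1 = {
    "diseases": 1.0,
    "farm": 0.5,
    "animals": 0.5,
    "farm animals": 0.7,
    "mad cow disease": 1.0,
}

q1_dense = to_dense(q1, vocabulary)
print(q1_dense)  # [0.5, 1.0, 0.5, 0.7, 1.0, 0.0]
```

Documents would be expanded against the same vocabulary list, so that position i means the same term in every vector.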
The following diagram represents a vector model representation of the
terms 'information,' 'system,' and 'retrieval,' and six documents.
Figure 1: 3-dimensional vector space where each dimension represents a term or concept in the vocabulary
Three perpendicular dimensions corresponding to the three terms in the vocabulary.
Since the dimensions are at 90-degree angles to each other, the corresponding terms are considered independent.
The terms are assumed to be independent in both the statistical and linguistic senses.
Statistical independence means that the occurrence of one term is not related to the occurrence of another.
Linguistic independence means that the interpretation of one term does not affect that of another.
Ex: apple and computer; apple and murder
A term weighting strategy is represented by a triple ABC, where A represents
"within document frequency", B represents "inverse document frequency" and C
represents "cosine normalization".
Within document frequency: number of times the term occurs in the document.
Inverse document frequency: gives more weight to terms that occur in few documents of the collection.
Cosine normalization: to account for the length of the vector.
Many variations on this theme:
Thus we can have indexing, i.e., term weighting strategies such as atn or nnc.
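For example, atc is w_t = { (0.5 + 0.5 * tf / max_tf_in_document) * ln(N/n) } / sqrt(w_1^2 + w_2^2 + ... + w_m^2), where N here is the number of documents in the collection (not the vocabulary size), n is the number of documents containing term t, and the w_i under the square root are the unnormalized weights of the m terms in the document.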
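The atc formula can be sketched in a few lines. This is an illustration under stated assumptions rather than the SMART implementation: the function name `atc_weights`, the toy document, and the document-frequency counts are all made up for the example.

```python
import math
from collections import Counter


def atc_weights(doc_terms, doc_freq, num_docs):
    """Return {term: weight} for one document under the atc scheme.

    a: augmented tf      -> 0.5 + 0.5 * tf / max_tf_in_document
    t: idf               -> ln(num_docs / n), n = documents containing the term
    c: cosine normalize  -> divide by the length of the weight vector
    """
    tf = Counter(doc_terms)
    if not tf:
        return {}
    max_tf = max(tf.values())
    raw = {
        term: (0.5 + 0.5 * count / max_tf) * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
    }
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0:
        return {term: 0.0 for term in raw}
    return {term: w / length for term, w in raw.items()}


# Hypothetical data: a four-token document in a 1000-document collection.
doc = ["information", "retrieval", "information", "system"]
doc_freq = {"information": 50, "retrieval": 10, "system": 100}
weights = atc_weights(doc, doc_freq, num_docs=1000)
print(weights)
```

With these made-up counts the rarest term ("retrieval") ends up with the largest weight even though "information" occurs twice, which is the effect the idf component is meant to have.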
We can index the document using one strategy and the query using another.
Example: atc for documents and atn for queries. There is good reason for this: queries seldom show significant variation in their lengths, while documents do.
The Ltu scheme is used by Singhal et al. in the two papers you read.
In TREC-6 (1997), the best retrieval performance was achieved by an Australian group (Australian National University) using what they call the "Cornell variant of the Okapi BM25 weighting function".
w_t = tf_d x { log[ (N - n + 0.5) / (n + 0.5) ] / [ 2 * (0.25 + 0.75 * (dl / avg_dl)) + tf_d ] }, where dl is the length of the document and avg_dl is the average document length in the collection.
Note that wt in all these represents the weight of term t in the document (d).
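The BM25 variant quoted above translates directly into code. The sketch below assumes the natural logarithm (the notes only say "log") and uses made-up collection statistics; the function name `bm25_cornell_weight` is hypothetical.

```python
import math


def bm25_cornell_weight(tf_d, n, num_docs, dl, avg_dl):
    """Weight of a term in a document, following the formula quoted above.

    tf_d     : frequency of the term in the document
    n        : number of documents containing the term
    num_docs : number of documents in the collection (N in the formula)
    dl       : length of this document
    avg_dl   : average document length in the collection
    Assumption: "log" is taken to be the natural logarithm.
    """
    idf = math.log((num_docs - n + 0.5) / (n + 0.5))
    length_factor = 2 * (0.25 + 0.75 * (dl / avg_dl))
    return tf_d * idf / (length_factor + tf_d)


# Hypothetical statistics: the same term in an average-length and a long document.
w_average = bm25_cornell_weight(tf_d=2, n=10, num_docs=1000, dl=100, avg_dl=100)
w_long = bm25_cornell_weight(tf_d=2, n=10, num_docs=1000, dl=400, avg_dl=100)
print(w_average, w_long)
```

The length factor plays the role that cosine normalization plays in atc: the same term frequency counts for less in a document that is longer than average.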
Each document and query is mapped into the N-dimensional vocabulary space according to its representation, i.e., indexing.
Each document and query point is connected to the origin by a line, thus producing its vector.
The angle between the query and each document is then measured and its cosine computed. This cosine represents the similarity between the document and query.
Values for similarity range from 0 (minimum) to 1 (identical.) Recall that the cosine of 90 degrees = 0 and the cosine of 0 degrees = 1.
The following diagram shows a query in the vector model representation.
Figure 2: A query and document vectors
D1 = (0.5t1, 0.8t2, 0.3t3)
Q = (1.5t1, 1t2, 0t3)
Cosine Similarity(D1,Q) = [(0.5 x 1.5) + (0.8 x 1)] / sqrt[(0.5^2 + 0.8^2 + 0.3^2)(1.5^2 + 1^2)]
= 1.55 / sqrt(0.98 x 3.25) = 0.868
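The worked example can be checked with a short function. This is a minimal sketch using plain lists as vectors; the name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made for the illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


d1 = [0.5, 0.8, 0.3]  # D1 over (t1, t2, t3)
q = [1.5, 1.0, 0.0]   # Q over (t1, t2, t3)

print(round(cosine_similarity(d1, q), 4))  # 0.8685
```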
Angle is independent of vector lengths
Generally normalize to length of one
Inner product then equals cosine of angle (for unit-length vectors)
Inner product more efficient to compute
Very common to normalize vector lengths in index.
D1 = (0.5t1, 0.8t2, 0.3t3)
Q = (1.5t1, 1t2, 0t3)
D1' = (0.5t1, 0.8t2, 0.3t3)/sqrt(0.98)
= 0.505t1, 0.808t2, 0.303t3
Q' = (1.5t1, 1t2, 0t3)/sqrt(3.25)
= 0.832t1, 0.555t2
Similarity(D1,Q) = Similarity(D1', Q')
= (0.505 x 0.832) + (0.808 x 0.555), since both normalized vectors have length 1
= 0.869, which agrees with the 0.868 above up to rounding
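The point of normalizing in the index is that the similarity then reduces to a plain inner product. A minimal sketch, with hypothetical helper names `normalize` and `inner_product`:

```python
import math


def normalize(v):
    """Scale a vector to length one (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)


def inner_product(u, v):
    """Sum of pairwise products; no square roots or divisions needed."""
    return sum(a * b for a, b in zip(u, v))


# Normalize once (for documents, this would happen at indexing time).
d1_unit = normalize([0.5, 0.8, 0.3])
q_unit = normalize([1.5, 1.0, 0.0])

# For unit-length vectors the inner product is the cosine of the angle.
similarity = inner_product(d1_unit, q_unit)
print(round(similarity, 4))  # 0.8685
```

Because document vectors are normalized once and reused for every query, the per-query cost drops to the multiplications and additions of the inner product.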
Rocchio's Method: using Relevance Feedback
Q_new = a Q_old + b (average relevant vector) - r (average nonrelevant vector)
Fig. 5
Average vectors are also called centroids. Thus we have the centroid of the relevant documents and the centroid of the non relevant documents.
Example:
D1 = {0.5 t1, 0.3 t2}
D2 = {0.6 t1, 0.4 t3}
Centroid = { 0.55 t1, 0.15 t2, 0.2 t3}
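The centroid example and the Rocchio update can be sketched together, using sparse {term: weight} dictionaries. The function names, the default coefficient values, and the sample query and nonrelevant document are assumptions for illustration only; the notes do not give values for a, b, and r.

```python
def centroid(vectors):
    """Average of a list of sparse {term: weight} vectors."""
    terms = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in sorted(terms)}


def rocchio(q_old, relevant, nonrelevant, a=1.0, b=0.75, r=0.25):
    """Q_new = a * Q_old + b * centroid(relevant) - r * centroid(nonrelevant).

    The defaults for a, b, r are illustrative assumptions, not values from the notes.
    """
    rel = centroid(relevant)
    nonrel = centroid(nonrelevant)
    terms = set(q_old) | set(rel) | set(nonrel)
    return {
        t: a * q_old.get(t, 0.0) + b * rel.get(t, 0.0) - r * nonrel.get(t, 0.0)
        for t in sorted(terms)
    }


d1 = {"t1": 0.5, "t2": 0.3}
d2 = {"t1": 0.6, "t3": 0.4}
print(centroid([d1, d2]))  # approximately {t1: 0.55, t2: 0.15, t3: 0.2}

# Hypothetical query and nonrelevant document, with all coefficients set to 1.
q_new = rocchio({"t1": 1.0}, [d1, d2], [{"t2": 0.4}], a=1.0, b=1.0, r=1.0)
print(q_new)
```

Note that the update can produce negative weights (t2 in this example); whether to keep them or clip them to zero is a design choice the notes do not address.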
Ide's Method: using Relevance Feedback
In Ide's method of feedback, the number of nonrelevant documents used is 1.
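One common reading of this is Ide's "dec-hi" variant, in which the relevant document vectors are added to the query without averaging and only the single highest-ranked nonrelevant document is subtracted. That this is the variant intended here is an assumption; the function name and sample vectors below are hypothetical.

```python
def ide_dec_hi(q_old, relevant, top_nonrelevant):
    """Q_new = Q_old + sum(relevant vectors) - top-ranked nonrelevant vector.

    Assumption: this follows the "dec-hi" variant of Ide's feedback, which
    uses raw sums (no centroids) and exactly one nonrelevant document.
    """
    terms = set(q_old) | set(top_nonrelevant)
    for doc in relevant:
        terms |= set(doc)
    return {
        t: q_old.get(t, 0.0)
        + sum(doc.get(t, 0.0) for doc in relevant)
        - top_nonrelevant.get(t, 0.0)
        for t in sorted(terms)
    }


# Hypothetical vectors over terms t1..t3.
q_old = {"t1": 1.0}
relevant = [{"t1": 0.5, "t2": 0.3}, {"t1": 0.6, "t3": 0.4}]
top_nonrelevant = {"t2": 0.4}

q_new = ide_dec_hi(q_old, relevant, top_nonrelevant)
print(q_new)
```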