A. Consider the table of term frequencies for the three documents Doc1, Doc2, and Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, and best for each document, using the idf values from Figure 6.8.
| term      | Doc1 | Doc2 | Doc3 |
|-----------|------|------|------|
| car       | 27   | 4    | 24   |
| auto      | 3    | 33   | 0    |
| insurance | 0    | 33   | 29   |
| best      | 14   | 0    | 17   |
Figure 6.9: Table of tf values. Term frequencies of the four terms in each of the three documents.
| term      | df_t   | idf_t |
|-----------|--------|-------|
| car       | 18,165 | 1.65  |
| auto      | 6,723  | 2.08  |
| insurance | 19,241 | 1.62  |
| best      | 25,235 | 1.5   |
Figure 6.8: Table of idf values. The idf values of terms with various document frequencies in the Reuters collection of 806,791 documents.
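For part A, the following is a minimal sketch of the computation, assuming the plain weighting tf-idf = tf × idf (raw term counts from Figure 6.9 multiplied by the idf values from Figure 6.8). The variable names are illustrative, not part of the original exercise.

```python
# Sketch: tf-idf weights for part A, assuming plain tf x idf weighting
# (raw counts from Figure 6.9 times the idf values from Figure 6.8).

tf = {  # term -> [Doc1, Doc2, Doc3] counts (Figure 6.9)
    "car":       [27, 4, 24],
    "auto":      [3, 33, 0],
    "insurance": [0, 33, 29],
    "best":      [14, 0, 17],
}

idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}  # Figure 6.8

for term, counts in tf.items():
    weights = [round(c * idf[term], 2) for c in counts]
    print(term, weights)
```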
B. Compute the vector space similarity between the query "digital cameras" and the document "digital cameras and video cameras" by filling out the empty columns in Table 6.1. Assume N = 10,000,000, logarithmic term weighting (the wf columns) for both query and document, idf weighting for the query only, and cosine normalization for the document only. Treat "and" as a stop word. Enter term counts in the tf columns. What is the final similarity score? (Please provide the details of the calculation.)
| word    | tf (query) | wf (query) | df      | idf | q_i = wf-idf | tf (document) | wf (document) | d_i = normalized wf |
|---------|------------|------------|---------|-----|--------------|---------------|---------------|---------------------|
| digital |            |            | 10,000  |     |              |               |               |                     |
| video   |            |            | 100,000 |     |              |               |               |                     |
| cameras |            |            | 50,000  |     |              |               |               |                     |

Table 6.1: Cosine similarity calculation; the empty cells are to be filled in.
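For part B, here is a rough sketch of the stated weighting scheme: logarithmic tf times idf for the query (no normalization), logarithmic tf with cosine normalization for the document (no idf), and the similarity as the dot product of the two vectors. The term counts, function, and variable names below are illustrative assumptions, not given by the exercise.

```python
# Sketch for part B: query weighting = logarithmic tf x idf (no normalization),
# document weighting = logarithmic tf, cosine-normalized (no idf),
# similarity = dot product of the two weight vectors.  "and" is a stop word.
import math

N = 10_000_000
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}

query_tf = {"digital": 1, "video": 0, "cameras": 1}   # "digital cameras"
doc_tf   = {"digital": 1, "video": 1, "cameras": 2}   # "digital cameras and video cameras"

def wf(tf):
    """Logarithmic term weighting: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query side: wf-idf weights.
q = {t: wf(query_tf[t]) * math.log10(N / df[t]) for t in df}

# Document side: wf weights, then Euclidean (cosine) normalization.
d_raw = {t: wf(doc_tf[t]) for t in df}
norm = math.sqrt(sum(w * w for w in d_raw.values()))
d = {t: w / norm for t, w in d_raw.items()}

score = sum(q[t] * d[t] for t in df)
print(round(score, 3))
```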
C. Why is the idf of a term always finite?
D. Sketch the frequency-ordered postings for the data in Figure 6.9.
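For part D, a frequency-ordered postings list simply lists each term's documents in decreasing order of term frequency. A minimal sketch using the Figure 6.9 counts (names illustrative; the assumption that a tf of 0 yields no posting is mine):

```python
# Sketch: frequency-ordered postings for part D.
# Each postings list is sorted by descending term frequency (Figure 6.9 counts);
# documents with tf = 0 are assumed to carry no posting for that term.

tf = {
    "car":       {"Doc1": 27, "Doc2": 4,  "Doc3": 24},
    "auto":      {"Doc1": 3,  "Doc2": 33, "Doc3": 0},
    "insurance": {"Doc1": 0,  "Doc2": 33, "Doc3": 29},
    "best":      {"Doc1": 14, "Doc2": 0,  "Doc3": 17},
}

for term, postings in tf.items():
    ordered = sorted(
        ((doc, f) for doc, f in postings.items() if f > 0),
        key=lambda item: item[1],
        reverse=True,
    )
    print(term, "->", ordered)
```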
E. Let the static quality scores for Doc1, Doc2 and Doc3 in Figure 6.11 be respectively 0.25, 0.5 and 1. Sketch the postings for impact ordering when each postings list is ordered by the sum of the static quality score and the Euclidean normalized tf values in Figure 6.11.
F. Derive the equivalence between the two formulas for the F measure shown in the equation below, given that α = 1/(β² + 1).
F = 1 / [α(1/P) + (1 − α)(1/R)] = (β² + 1)PR / (β²P + R)
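For part F, the following is a quick numerical sanity check (not a derivation) that the two forms agree once α = 1/(β² + 1) is substituted; the P, R, and β values are arbitrary examples chosen for illustration.

```python
# Sketch: numerically check that 1 / (a/P + (1-a)/R) with a = 1/(b^2 + 1)
# equals (b^2 + 1)PR / (b^2 P + R) for a few arbitrary P, R, beta values.
for P, R, beta in [(0.5, 0.8, 1.0), (0.25, 0.9, 2.0), (0.7, 0.3, 0.5)]:
    alpha = 1 / (beta**2 + 1)
    lhs = 1 / (alpha / P + (1 - alpha) / R)
    rhs = (beta**2 + 1) * P * R / (beta**2 * P + R)
    print(P, R, beta, round(lhs, 6), round(rhs, 6))
```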
G. What is the relationship between the value of F1 and the break-even point?
H. Below is a table showing how two human judges rated the relevance of a set of 12 documents to a particular information need (0 = nonrelevant, 1 = relevant). Let us assume that you've written an IR system that for this query returns the set of documents {4, 5, 6, 7, 8}.
| Document ID | Judge 1 | Judge 2 |
|-------------|---------|---------|
| 1           | 0       | 0       |
| 2           | 0       | 0       |
| 3           | 1       | 1       |
| 4           | 1       | 1       |
| 5           | 1       | 0       |
| 6           | 1       | 0       |
| 7           | 1       | 0       |
| 8           | 1       | 0       |
| 9           | 0       | 1       |
| 10          | 0       | 1       |
| 11          | 0       | 1       |
| 12          | 0       | 1       |
a. Calculate the kappa measure between the two judges.
b. Calculate precision, recall, and F1 of your system if a document is considered relevant only if the two judges agree.
c. Calculate precision, recall, and F1 of your system if a document is considered relevant if either judge thinks it is relevant.
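For part H, here is a minimal sketch of the kappa computation (using pooled marginals for the expected agreement) and of precision, recall, and F1 under the two relevance conventions in (b) and (c). The judgment data is copied from the table above; the function and variable names are illustrative assumptions.

```python
# Sketch for part H: kappa between the two judges, plus precision/recall/F1
# of the system output {4, 5, 6, 7, 8} under the two relevance conventions.
judge1 = {1:0, 2:0, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:0, 10:0, 11:0, 12:0}
judge2 = {1:0, 2:0, 3:1, 4:1, 5:0, 6:0, 7:0, 8:0, 9:1, 10:1, 11:1, 12:1}
retrieved = {4, 5, 6, 7, 8}
docs = sorted(judge1)
n = len(docs)

# (a) Kappa: observed agreement vs. agreement expected from pooled marginals.
p_agree = sum(judge1[d] == judge2[d] for d in docs) / n
p_rel = (sum(judge1.values()) + sum(judge2.values())) / (2 * n)
p_expected = p_rel**2 + (1 - p_rel)**2
kappa = (p_agree - p_expected) / (1 - p_expected)
print("kappa =", round(kappa, 3))

def prf(relevant):
    """Precision, recall, F1 of `retrieved` against a set of relevant docs."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# (b) Relevant only if both judges agree the document is relevant.
both = {d for d in docs if judge1[d] == 1 and judge2[d] == 1}
# (c) Relevant if either judge thinks the document is relevant.
either = {d for d in docs if judge1[d] == 1 or judge2[d] == 1}
print("both agree:  ", [round(x, 3) for x in prf(both)])
print("either judge:", [round(x, 3) for x in prf(either)])
```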