Assignment
1. "Instead of returning nothing for a query, a search engine should return some results, even if they are incorrect." Do you agree or disagree? Explain.
2. What are the differences between static and dynamic summaries? Describe a scenario where each type of summary would be the best solution to a search query.
3. Consider an information need for which there are 6 relevant documents in the collection. Contrast two systems run on this collection. Their top 10 results are judged for relevance as follows:

a. Complete this table with numerical values:

| Rank | System 1 | Recall | Precision | System 2 | Recall | Precision |
|------|----------|--------|-----------|----------|--------|-----------|
| 1    | R        |        |           | N        |        |           |
| 2    | N        |        |           | R        |        |           |
| 3    | R        |        |           | N        |        |           |
| 4    | N        |        |           | N        |        |           |
| 5    | N        |        |           | R        |        |           |
| 6    | R        |        |           | R        |        |           |
| 7    | N        |        |           | R        |        |           |
| 8    | N        |        |           | N        |        |           |
| 9    | R        |        |           | N        |        |           |
| 10   | R        |        |           | N        |        |           |
b. What is the MAP value for each system? (equation okay)
c. What is F (β = 1) for each system with 10 documents returned?
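For parts (b) and (c), a minimal Python sketch of the computations (the R/N sequences are transcribed from the table in part (a); for a single query, MAP reduces to average precision):

```python
# Precision/recall at each rank, average precision (AP), and F1 at
# rank 10 for the two judged result lists. Assumes 6 relevant
# documents exist in the collection, as the question states.

TOTAL_RELEVANT = 6

system1 = "RNRNNRNNRR"   # R = relevant, N = non-relevant, ranks 1..10
system2 = "NRNNRRRNNN"

def precision_recall_at_ranks(judgments, total_relevant):
    """Return (recall, precision) after each rank, in rank order."""
    rows, hits = [], 0
    for k, j in enumerate(judgments, start=1):
        if j == "R":
            hits += 1
        rows.append((hits / total_relevant, hits / k))
    return rows

def average_precision(judgments, total_relevant):
    """Sum of precision at each relevant rank, divided by total relevant."""
    hits, total = 0, 0.0
    for k, j in enumerate(judgments, start=1):
        if j == "R":
            hits += 1
            total += hits / k
    return total / total_relevant

def f1(precision, recall):
    """F with beta = 1: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, judged in [("System 1", system1), ("System 2", system2)]:
    recall10, precision10 = precision_recall_at_ranks(judged, TOTAL_RELEVANT)[-1]
    print(name,
          "AP =", round(average_precision(judged, TOTAL_RELEVANT), 4),
          "F1@10 =", round(f1(precision10, recall10), 4))
```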
4. True/False
a) _____ The tf-idf weight increases with the number of occurrences of a term within a document
b) _____ The tf-idf weight increases with the rarity of the term in the collection.
c) _____ The summary information displayed by a search engine must come from the "description" meta tag in an HTML file.
d) _____ Hard clustering is more common and easier than soft clustering.
e) _____ Pseudo relevance feedback is the same as indirect relevance feedback
f) _____ Query expansion means to double the number of results shown to the user
g) _____ Hierarchical agglomerative clustering is a "top down" clustering technique.
h) _____ Linear classifiers partition the dataspace into overlapping regions.
i) _____ K-means clustering is an example of unsupervised learning.
j) _____ Hub sites should be scored higher than authoritative sites.
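As a quick sanity check for statements (a) and (b), here is one common tf-idf variant (raw tf times log idf; other weightings exist) sketched in Python:

```python
# Illustration for statements (a) and (b): with the tf * log10(N/df)
# weighting, the weight grows with the number of occurrences of the
# term in the document (tf) and with the rarity of the term in the
# collection (smaller df -> larger idf).
import math

def tf_idf(tf, df, n_docs):
    return tf * math.log10(n_docs / df)

# More occurrences in the document -> larger weight.
assert tf_idf(5, 100, 10000) > tf_idf(2, 100, 10000)
# Rarer in the collection (smaller df) -> larger weight.
assert tf_idf(3, 10, 10000) > tf_idf(3, 1000, 10000)
```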
5. The following probabilities have been determined from a training set of 1000 items:

|          | Long | Sweet | Green |
|----------|------|-------|-------|
| Cucumber | 0.80 | 0.70  | 0.90  |
| Jalapeno | 0.00 | 0.50  | 0.10  |
| Other    | 0.50 | 0.75  | 0.25  |

Given a sample that is long, sweet, and green, what is the probability that it would be classified as a Cucumber using a naïve Bayes classifier? Show your work.
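A sketch of the naïve Bayes computation in Python. Note the class priors are not stated in the question as it survives here, so equal priors are assumed below purely for illustration; substitute the actual class counts from the training set of 1000 when working the problem.

```python
# Naive Bayes: score(class) = prior * product of per-feature
# likelihoods; the posterior is each score divided by their sum.
# ASSUMPTION: equal priors (1/3 each), since the real priors are
# not given in the text above.
likelihoods = {
    "Cucumber": {"Long": 0.80, "Sweet": 0.70, "Green": 0.90},
    "Jalapeno": {"Long": 0.00, "Sweet": 0.50, "Green": 0.10},
    "Other":    {"Long": 0.50, "Sweet": 0.75, "Green": 0.25},
}
features = ["Long", "Sweet", "Green"]

scores = {}
for cls, probs in likelihoods.items():
    score = 1.0 / len(likelihoods)   # assumed equal prior
    for f in features:
        score *= probs[f]
    scores[cls] = score

total = sum(scores.values())
posterior = {cls: s / total for cls, s in scores.items()}
print(posterior["Cucumber"])   # ~0.843 under the equal-prior assumption
```

Because the Jalapeno row contains a zero likelihood for "Long", its score is zero regardless of prior, so the decision here is between Cucumber and Other.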
6. Apply the KNN algorithm to classify the data item x = 14, given this known set of (value, class) pairs: { (10,1), (11,1), (15,2), (12,1), (18,2), (9,1), (20,2), (17,2) }. Show your work for K = 3 and K = 5.
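A minimal Python sketch of the KNN steps, using absolute difference on the number line as the distance. Note that items 11 and 17 are both at distance 3 from 14, so for K = 3 the answer depends on how that tie is broken; that is worth stating in your work.

```python
# KNN on 1-D data: sort by distance to x, take the K nearest,
# and vote by majority class.
from collections import Counter

data = [(10, 1), (11, 1), (15, 2), (12, 1), (18, 2), (9, 1), (20, 2), (17, 2)]
x = 14

def knn_classify(x, data, k):
    neighbors = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    classes = [c for _, c in neighbors]
    return Counter(classes).most_common(1)[0][0]

for k in (3, 5):
    print("K =", k, "->", knn_classify(x, data, k))
```

For K = 5 the five nearest values are 15, 12, and 11 or 17 (tied), plus 10 or 18 (tied); either way class 1 gets three of the five votes.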
7. Given this portion of a web graph
a) Show the node adjacency matrix
b) Convert the adjacency matrix to a transition probability matrix (i.e. Markov chain) for PageRank.
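The web graph figure itself did not survive into this text, so the sketch below uses a hypothetical 3-node graph (A→B, A→C, B→C, C→A) purely to show the mechanics asked for in (a) and (b): build the adjacency matrix, then row-normalize it into a transition probability matrix.

```python
# ASSUMED example graph -- substitute the edges from the actual figure.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
nodes = ["A", "B", "C"]
idx = {n: i for i, n in enumerate(nodes)}

# (a) Adjacency matrix: adj[i][j] = 1 if there is a link i -> j.
adj = [[0.0] * len(nodes) for _ in nodes]
for src, dst in edges:
    adj[idx[src]][idx[dst]] = 1.0

# (b) Transition probability matrix: divide each row by its out-degree,
# so each row sums to 1 (a Markov chain). A dangling node (no out-links)
# is given a uniform row here, one common convention.
trans = []
for row in adj:
    out = sum(row)
    trans.append([v / out if out else 1.0 / len(nodes) for v in row])

print(trans)  # [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```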
9. Complete the table below so that the cosine similarity for the query "brown cat" against document three is 1.0 (SMART nnc.nnc).

| Term  | q | q→   | d1 | d2 | d3 | d1→  | d2→  | d3→ | q→·d1→ | q→·d2→ | q→·d3→ |
|-------|---|------|----|----|----|------|------|-----|--------|--------|--------|
| Brown | 1 | .707 | 2  | 1  |    | .632 | .277 |     | .447   | .196   |        |
| Cat   | 1 | .707 | 1  | 1  |    | .316 | .277 |     | .223   | .196   |        |
| how   | 0 | 0    | 1  | 1  |    | .316 | .277 |     | 0      | 0      |        |
| meow  | 0 | 0    | 0  | 3  |    | 0    | .832 |     | 0      | 0      |        |
| now   | 0 | 0    | 2  | 1  |    | .632 | .277 |     | 0      | 0      |        |
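A Python sketch of the nnc.nnc computation (raw term frequency, no idf, cosine normalization). Cosine similarity equals 1.0 exactly when the normalized d3 points in the same direction as the normalized query, i.e. when d3 is any positive multiple of the query vector; d3 = (1, 1, 0, 0, 0) below is one assumed completion, not the only one.

```python
# Term order: brown, cat, how, meow, now (as in the table).
import math

q  = [1, 1, 0, 0, 0]
d1 = [2, 1, 1, 0, 2]
d2 = [1, 1, 1, 3, 1]
d3 = [1, 1, 0, 0, 0]   # assumed completion giving cos(q, d3) = 1.0

def normalize(v):
    """Cosine ('c') normalization: divide by the Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Dot product of the two length-normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

print(round(cosine(q, d1), 3))  # 0.671 (the table's .447 + .223, rounded)
print(round(cosine(q, d2), 3))  # 0.392 (the table's .196 + .196, rounded)
print(round(cosine(q, d3), 3))  # 1.0
```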