Assignment
1. Consider this dictionary: {"CAT", "COUNT", "DOG", "DONKEY", "ELEPHANT"}
Term-ID1 | Offset
1 |
2 |
3 |
4 |
5 |
a) Complete this table assuming "dictionary as a string"
b) Create a second dictionary consisting of each word reversed (e.g., CAT -> TAC). Show the dictionary as a string.
Term-ID2 | Offset
1 |
2 |
3 |
4 |
5 |
c) Complete this table using your reversed dictionary string
d) Using your two dictionaries, show how you can determine the words that satisfy the wildcard query C*T
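
The ideas in parts a) through d) can be sketched in Python. This is only an illustration under stated assumptions: offsets are 0-based, terms are stored in sorted order, and the term set {"BAT", "BOAT", "BUS"} and the function names are hypothetical, chosen so the sketch does not use the assignment's data. A real index would locate the prefix range with a binary search rather than the linear scan used here.

```python
def build_dictionary_string(terms):
    """Concatenate the sorted terms and record where each one starts."""
    offsets = []
    pieces = []
    position = 0
    for term in sorted(terms):
        offsets.append(position)  # 0-based offset of the term's first character (assumed convention)
        pieces.append(term)
        position += len(term)
    return "".join(pieces), offsets


def term_at(dictionary, offsets, term_id):
    """Recover term number term_id (1-based); the next offset marks where it ends."""
    start = offsets[term_id - 1]
    end = offsets[term_id] if term_id < len(offsets) else len(dictionary)
    return dictionary[start:end]


def wildcard(terms, pattern):
    """Answer a query with exactly one '*', such as 'B*T'."""
    prefix, suffix = pattern.split("*")
    reversed_terms = {t[::-1] for t in terms}
    # The forward dictionary answers the prefix part of the query.
    by_prefix = {t for t in terms if t.startswith(prefix)}
    # The reversed dictionary answers the suffix part: the suffix, reversed,
    # becomes a prefix of the reversed term.
    by_suffix = {r[::-1] for r in reversed_terms if r.startswith(suffix[::-1])}
    # The length check stops the prefix and suffix from overlapping.
    return {t for t in by_prefix & by_suffix if len(t) >= len(prefix) + len(suffix)}


# Hypothetical sample terms, not the assignment's dictionary.
sample = {"BAT", "BOAT", "BUS"}
dictionary, offsets = build_dictionary_string(sample)
print(dictionary, offsets)
print(term_at(dictionary, offsets, 2))
print(sorted(wildcard(sample, "B*T")))
```

The same functions can be applied to the assignment's dictionary to check your tables and your answer to part d).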
Consider the following documents:
Doc1: the wood table
Doc2: they made the wood
Doc3: the table is made of steel
Doc4: wood table or steel table
Using a shingle size of 2, compute the Jaccard coefficient of:
(Doc1, Doc2)
(Doc1, Doc3)
(Doc1, Doc4)
Based upon your results, Doc1 is most similar to ____?
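
A minimal sketch of the computation, assuming word-level shingles (a shingle is a run of 2 consecutive words) and case-insensitive comparison. Both are assumptions, since the question does not say whether shingles are built from words or characters. The two sample sentences and the function names are hypothetical and are not taken from the documents above.

```python
def shingles(text, k=2):
    """Return the set of k-word shingles in text (word-level shingles are an assumption)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    union = a | b
    # Returning 0.0 for two empty sets is an arbitrary choice made for this sketch.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sentences, not the assignment's documents.
s1 = shingles("a rose is a rose")   # {(a, rose), (rose, is), (is, a)}
s2 = shingles("a rose is red")      # {(a, rose), (rose, is), (is, red)}
print(jaccard(s1, s2))              # 2 shared shingles out of 4 distinct ones
```

Because shingles are stored in a set, a shingle that repeats within one document counts only once.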
2. Using this robots.txt file (line numbers are shown for reference):
    1. Crawl-delay: 10
    2. User-agent: crawlerbot
    3. Disallow: /includes
    4. Disallow: /misc
    5. Disallow: /setup
    6. Allow: /misc/*.jpg
    7. User-agent: *
    8. Disallow: /setup
a) What does line 1 mean?
b) What is the difference between lines 2 and 7?
c) Should any crawler access the file /setup/help.txt?
d) Should the crawler "mybot" access the file /a/b.htm?
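
One way to check answers to parts c) and d) is Python's standard-library `urllib.robotparser`. The sketch below is only an aid, not a reference answer: the helper name `build_parser` is hypothetical, and this parser matches rules by simple path prefix, so it may not interpret the wildcard rule on line 6 the way a wildcard-aware crawler would.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from the question, without the reference line numbers.
ROBOTS_TXT = """\
Crawl-delay: 10
User-agent: crawlerbot
Disallow: /includes
Disallow: /misc
Disallow: /setup
Allow: /misc/*.jpg
User-agent: *
Disallow: /setup
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in memory instead of fetching it from a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# can_fetch(useragent, url) returns True when the agent may fetch the path.
for agent, path in [("mybot", "/a/b.htm"), ("crawlerbot", "/setup/help.txt")]:
    print(agent, path, parser.can_fetch(agent, path))
```

Work out your own answer from the rules first, then compare it with what the parser reports.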
3. Consider the following text:
This tree is just one of many older-growth trees in the forest. Forests in Texas, can be over 100 years-old before they are considered "old". Trees can be over 200 years.
a) What punctuation can be removed to determine terms?
b) What stop words can be removed?
c) Which tokens can be converted to lower case?
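
The three steps in parts a) through c) can be sketched as a small pipeline. This is an illustration only: the stop-word list is a tiny hypothetical one, the sample sentence is not from the text above, and treating every non-alphanumeric character (including hyphens) as a separator is one design choice among several.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "of", "in", "a", "an"}


def tokenize(text):
    """Keep runs of letters and digits; punctuation and hyphens act as separators."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens):
    """Lower-case every token, then drop the ones in the stop-word list."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


# Hypothetical sample sentence, not the assignment's text.
sample = 'The well-known Oak is "tall".'
tokens = tokenize(sample)
print(tokens)
print(normalize(tokens))
```

Splitting on hyphens turns "well-known" into two terms. Keeping hyphenated words whole is equally defensible, and which choice is better depends on the queries the index must support.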