Assignment: Classi?cation trees
1. Introduction
A classification tree is used for classifying inputs into one of several categories. Each node in the classification tree consists of a criterion that is applied to the input to classify it into one of (potentially) many sets. The following tree illustrates this for classifying 12 TV (s1,...,s12) shows into various categories:
As can be seen there are 3 shows that are half-hour mysteries, and two shows that are hour comedies. The tree allows one to find all shows that satisfy various criteria, by starting at the top of the tree, and descending through the tree based upon supplied criteria, eventually arriving at the appropriate category (shown in rectangles) at the leaves of the tree.
In this project you will build (by hand) a binary classification tree for a fixed set of inputs and analyze its behavior.
2. Binary classi?cation trees
Each criterion in a binary classification tree classifies the input into either one of two categories. It can therefore be implemented as a binary tree in which the interior nodes contained the classification criterion and each leaf contains all elements that satisfy the all the criteria along the path from the root to the leaf. Any classification tree can be built using a binary classification tree, by splitting any non-binary criteria into a sequence of binary criteria.
For this project, the leaves in the binary tree will contain single elements. This means that the criteria nodes must uniquely select a single element from the set of elements that comprise the leaves in the tree. Ideally a search tree will be balanced, in the sense that for every node in the tree, the heights of its left and right subtrees are with k| of one another, for some small k| (ideally, 1). call such a tree a balanced binary classification search tree, or a search tree for short.
3. Constructing a search tree
Let us assume that keys we are interested in are 2 characters long, e.g.
CD
ac
kc
ol
Choose 10 such codes, all different. Write each code out in binary. Now perform this recursive algorithm:
1. If you have but one code, append = to it and return the augmented pattern.
2. Total each column, counting the number of 1's. Call the leftmost column 0. Let b[i] designate this number for column-i and let a[i] = 10-b[i] designate the number of zeros.
3. Find a collection of a's or b's, such that the sum of the collection is closest to n/2 where n is the number of elements in your list.
a. If your collection is of b's, then write a bit pattern that consists of 16 bits, where the bit-i is set to 1 if b[i] is in your collection. Append to your result an operator of & (the bitwise "and").
b. If your collection is of a's, then write a bit pattern that consists of 16 bits, where the bit-i is set to 1 if b[i] is in your collection. Append to your result an operator of ^ (the bitwise "exclusive or").
4. Apply your chosen operator one-by-by to your bit patterns. Those that return "true" (i.e., a non-zero number) remove and place into a separate collection. These will form the starting block for the left-hand side of the tree. Those that are not selected will form the starting block fro the right-hand side of the tree.
5. You now have two starting sets. Recursively apply steps 1-5 to form two subtrees, with root being the bit pattern + operator you formed in step 3 above.
Here is an example
HEX
|
Binary
|
CD
|
43
|
44
|
0
|
1
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
0
|
0
|
ac
|
61
|
63
|
1
|
0
|
1
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
1
|
kc
|
6B
|
63
|
1
|
0
|
1
|
0
|
1
|
0
|
1
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
1
|
ol
|
6F
|
6C
|
1
|
0
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
0
|
1
|
1
|
0
|
0
|
b-Totals
|
3
|
1
|
3
|
0
|
2
|
1
|
3
|
4
|
3
|
1
|
3
|
0
|
1
|
2
|
2
|
2
|
a-Totals
|
1
|
3
|
1
|
4
|
2
|
3
|
1
|
0
|
1
|
3
|
1
|
4
|
3
|
2
|
2
|
2
|
In our case n=4, so we want to form a subset of size n/2=2 of either a's exclusively, or b's, exclusively.
For b's we note there are several columns that are candidates for inclusion in our subset of (approximately) size n/2.
b[1]
b[1]
b[4]
b[5]
b[9]
b[12]
b[13]
b[14]
b[15]
These are all 2 or less.
The goal is to form a subset from the list above that will match (as close to) two of the four input rows. Clearly one could pick either b[4], b[13], b[14] or `b[15]. because matching a 1 in any of these columns would pick two rows. Say we pick b[14]. Then we form the selection pattern:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
When this pattern is matched against the four inputs (using an &), it will select exactly two:
*
ac 61 63 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 1
kc 6B 63 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1
Had b[4] been chosen, then we would form the pattern:
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
When this pattern is matched against the four inputs, it will select exactly two:
kc 6B 63 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1
ol 6F 6C 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0
In the case where there is no n/2 in the b-totals, then we can pick combinations of bits, which, when applied will choose close to n/2 inputs. This is done as follows.
Choose b[1], which will pick one row:
Pattern: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CD 43 44 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0
Having chosen that row, eliminate it from consideration and recalculate the b-totals (which can be computed by subtracting CD from the b-totals).
ac 61 63 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 1
kc 6B 63 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1
ol 6F 6C 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0
b-Totals 3 0 3 0 2 1 2 3 3 0 3 0 1 1 2 2
Now choose another column with a 1 in it (say b[12]) and "or' this with the previous pattern:
Pattern: 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Now and-ing this against the original four rows
CD
|
43
|
44
|
0
|
1
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
0
|
0
|
ac
|
61
|
63
|
1
|
0
|
1
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
1
|
kc
|
6B
|
63
|
1
|
0
|
1
|
0
|
1
|
0
|
1
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
1
|
ol
|
6F
|
6C
|
1
|
0
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
0
|
1
|
1
|
0
|
0
|
will choose any row with a 1 in column 1 or column 10, giving:
=>
CD 43 44 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0
ol 6F 6C 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0
Of course, selecting different combinations of b-totals may choose different rows, but that does not matter, because the goal is to choose n/2 rows, which is accomplished.
Writing our pattern in hex we have:
Pattern: 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Pattern: 4 0 0 8
At this point we have formed:
The designation &:4008 in the root node means that when deciding which side of the tree a candidate key would be found, perform key & 4008. If it matches, take the T branch, otherwise take the F branch.
Having split the set in half, the process is repeated recursively on both subsets.
The left set b-totals are:
CD 43 44
|
0
|
1
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
0
|
0
|
ol 6F 6C
|
1
|
0
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
0
|
1
|
1
|
0
|
0
|
b-totals:
|
1
|
1
|
1
|
0
|
1
|
1
|
2
|
2
|
1
|
1
|
1
|
0
|
1
|
2
|
0
|
0
|
In this case n/2=1, so we choose any b-total with a 1, say (b[1]), forming the pattern:
Pattern: 0 1 0
|
0
|
0
|
0 0
|
0
|
0
|
0 0
|
0
|
0
|
0 0
|
0
|
Pattern: 4
|
|
|
0
|
|
|
0
|
|
|
0
|
|
which chooses:
The b-totals for the left-hand set are:
ac 61 63
|
1
|
0
|
1
|
0
|
0
|
0
|
0
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
1
|
kc 6B 63
|
1
|
0
|
1
|
0
|
1
|
0
|
1
|
1
|
1
|
0
|
1
|
0
|
0
|
0
|
1
|
1
|
b-totals:
|
2
|
0
|
2
|
0
|
1
|
0
|
1
|
2
|
2
|
0
|
2
|
0
|
0
|
0
|
2
|
2
|
Again we choose any b-total with a 1, say b[4],
Pattern: 0 0 0
|
0
|
1
|
0 0
|
0
|
0
|
0 0
|
0
|
0
|
0 0
|
0
|
Pattern: 0
|
|
|
8
|
|
|
0
|
|
|
0
|
|
which chooses:
kc 6B 63 1 0 1
|
0
|
1
|
0 1
|
1
|
1
|
0 1
|
0
|
0
|
0 1
|
1
|
Our search tree now looks like:
Note that the nodes contain patterns to be matched against a search key. If the search key is key = "ol" then key & 0x4008 matches so the left branch is selected. key & 0x4000 does not match (evaluates to 0), so the right branch is taken. Last, key =
"ol" matches at a leaf node, so ol is in the set.
Searching for "oZ' will follow the same path, but the final comparison of "oZ" = "ol" fails, so oz is not in the set.
Analysis
Now that you have worked through an example, it is time analyze both the algorithm for building the tree and the algorithm for searching the tree.
Build a tree
To demonstrate that you understand the algorithm, choose 6 two-character keys, translate them into binary (using ASCII mapping), and build a balanced search tree. Draw your tree.
Analyze the construction algorithm
Suppose we have n| records and that each record contains a key of k| bytes and information associated with the key of r| bytes. Assume also that the tree is built using 4-byte pointers.
1. The algorithm for building the tree proceeds by splitting the list in two each time, based on the b-counts to try and create a balanced tree. However, what would happen in the following pathological case?
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
0
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
1
|
a. There are 8 members of this set. Can the b-counts be used to split the set into two equally-sized subsets?
b. Suggest a way of using the a-counts in a situation like this. Specifically,
c. How would you use the a-counts to create a pattern that would split the list in two almost equal-size sets? List a particular pattern that would accomplish this for the 8 rows above.
d. What operation(s) would need to be applied to match your pattern to select 4 rows? Which four rows would be selected?
2. Suppose the b-counts are all the same value c|. This then means that all the values for the a-counts are n-c|. For what value c| would this cause the most uneven splitting of the set? (This is tricky. Spend some time on it).
3. How much space will the tree consume? Your answer should be in terms of n|, k| and r|. You may assume the r| bytes for records are stored in the leaves of the search tree.
a. What is the best possible case? (Hint consider a perfectly balanced search tree. Further hint: Your discrete math background should be sufficient to know that count of interior nodes given the count of leaf nodes. If it makes things easier for you, assume n| is a power of 2.)
b. Bonus: What is the worst case? (Hint: consider how badly the tree might be unbalanced. Calculate the number of nodes required in such a case. Consider your answer to the problem labeled "tricky" above.)
4. What is the running time of the algorithm that creates the tree? where the measure of the size of the program is the number of rows (n|) and the number of bytes (k|) in the key, and the size of the record information (r|) in the situation in which the list can be split perfectly in half each time? This is a very hard problem and it is likely that few will get it right. It is your thinking processes that will be judged rather than perfect mathematical analysis. Big-O answer without constant coefficients is good enough.
a. How many steps will it take to calculate the b-counts from a collection of rows?
(Note that this depends on k| and n|.)
b. How many steps to find the pattern? (Worst case: you have to form the pattern from a collection of 1's.)
c. How many steps does it take to reduce the b-counts for each selected pattern?
d. How many times will three steps above have to be repeated in a perfectly split binary tree?
e. Combine your results together to give your estimate of the running time of the tree-construction algorithm.
5. What is the running time of the search routine for a perfectly balanced search tree?
6. What is the running time of the search routine for the worst-balanced search tree?