Part 1 - Building Decision Trees
Consider the following data set containing examples e1, ..., e5, each comprised of three binary input attributes (x1,x2,x3) and one output label.
Example
|
x1
|
x2
|
x3
|
Output y
|
e1
|
1
|
0
|
0
|
0
|
e2
|
1
|
0
|
1
|
0
|
e3
|
0
|
1
|
0
|
0
|
e4
|
1
|
1
|
1
|
1
|
e5
|
1
|
1
|
0
|
1
|
Construct a decision tree for this data set. Compute information gain for each attribute. Select the attribute with the highest information gain as the root node. Then recursively create the children by re-computing information gain for each remaining attribute. Repeat this procedure until no more decision is necessary or possible.
Show all necessary information gain calculations in each step.
Part 2 - Perceptron Learning
Consider the following data set illustrating the boolean function NAND (= NOT AND).
example
|
x1
|
x2
|
y
|
e1
|
0
|
0
|
1
|
e2
|
0
|
1
|
1
|
e3
|
1
|
0
|
1
|
e4
|
1
|
1
|
0
|
Now imagine we are using this data to train a perceptron with weights w0, w1, w2 and the step function as activation function. Show each step of perceptron learning by filling in the following table:
x1
|
x2
|
w0
|
w1
|
w2
|
sum
|
output
|
target y
|
|
|
0
|
0
|
0
|
|
|
|
Add one row for each training example presented to the perceptron. The columns w0, w1, w2 contain the current weights at the time when x1 and x2 are presented to the network. Assume these weights are initialized to w0 = 0, w1 = 0, w2 = 0. Each subsequent row will contain the updated weights after each training step. sum is the activation of the network that is passed into the step function. output is the output of the step function. target y is the target label for the training sample.
Continue adding rows until learning converges. Show the final weights in an extra row.
You may re-arrange the training examples so that they are presented to the perceptron in any order to make it converge faster.
Part 3 - Neural Networks
Neural networks use non-linear activation functions (such as the sigmoid or rectifier functions) to represent non-linear functions. Suppose you had a neural network with linear activation function g(a) = c a + d instead.
(a) Assume that the network has one hidden layer. Show that, with a linear activation function, you can write the output of the network as a function of the weight vector w and the input vector x, without having to explicitly mention the output of the hidden layer. Show that, as a result, there is a network with no hidden units that computes the same function.
(b) Explain how this result in (a) can be generalized to n hidden layers (an explanation suffices, formal derivation is not necessary for this part).
(c) Suppose a network with one hidden layer and linear activation functions has n input and output nodes and hhidden nodes. Assume you apply the transformation in part (a) to obtain a network with no hidden layers. What is the total number of weights in the new network as a function of n and h? Discuss what this result means for learning a network with linear activation functions (consider in particular the case h « n).