===================================================================================================
Feedforward step of a basic feedforward neural network
===================================================================================================

What is feedforward?
---------------------------------------------------------------------------------------------------

* The feedforward step for a neural network is when we pick a sample :math:`D_i` from the dataset
  :math:`D` and feed it into the network.

* The network's weights in each layer transform the sample from its initial representation into
  various other representations, which are passed from layer to layer.

* The output of the final hidden layer, :math:`Z_{L-1}`, is then transformed by the output layer to
  create the network output :math:`O`.

Differences in feedforward during training
---------------------------------------------------------------------------------------------------

During training, two additional steps are performed:

* While the input propagates through a layer, we also calculate and store the gradients of the
  output with respect to the weights of that layer.

* We also calculate the value of the error function, :math:`E(O, Y_i)`.

Vectorized feedforward with a single sample
---------------------------------------------------------------------------------------------------

.. figure:: /_static/img/neural-networks/basic-feedforward-neural-networks/basic-feed-forward-neural-network.png
   :align: center
   :alt: Basic Feedforward Neural Network

   Basic Feedforward Neural Network

Consider the example network above.

* Suppose we have a dataset :math:`D = D_0 \dots D_{N-1}`, and each sample is represented by 3
  features.

* Assume we randomly pick the :math:`38^{th}` sample in the dataset to feed to our network:

  :math:`D_{37} = \left[\begin{array}{ccc} 3 & -5 & 12 \end{array}\right]`

* The input (row) vector to the network is :math:`X_0`, which will have 4 features. The final one
  is the bias value, +1, which we concatenate to :math:`D_{37}`.

  :math:`X_0 = \left[\begin{array}{cc} D_{37}, & +1 \end{array}\right] = \left[\begin{array}{cccc} 3 & -5 & 12 & +1 \end{array}\right]`

* **Each layer in a basic feedforward network can be represented by a matrix**.

* In the example above, :math:`W_0` has 3 neurons, each of which takes 4 inputs, so we can
  represent it as a :math:`4 \times 3` matrix, where **each column is a neuron**. The final row of
  each layer's matrix holds the weights that multiply the bias element of its input (here,
  :math:`X_0`). E.g.

  .. math::

     W_0 = \left[\begin{array}{ccc}
     0.3073 & -3.31913 & -2.455 \\
     -0.121 & -2.149 & 0.041 \\
     -4.2342 & 5.6798 & 0.6527 \\
     -3.6295 & 12.88588 & -0.499
     \end{array}\right]

* We compute the vector-matrix product of these two to get the affine of the first layer, i.e.

  .. math::

     A_0 &= X_0 \cdot W_0 \\
         &= \left[\begin{array}{ccc} -52.913 & 81.83109 & -0.2366 \end{array}\right]

* To compute the output of the first layer, we apply the activation function to each element of
  the affine vector:

  .. math::

     Z_0 &= sig(A_0) \\
         &= \left[\begin{array}{ccc} sig(-52.913) & sig(81.83109) & sig(-0.2366) \end{array}\right] \\
         &\approx \left[\begin{array}{ccc} 0 & 1 & 0.441 \end{array}\right]

* Here, we have chosen the sigmoid activation function, i.e.

  .. math::

     sigmoid(x) = sig(x) = \frac{1}{1+e^{-x}}
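As a quick check on the first-layer arithmetic above, here is a minimal NumPy sketch of the same
computation. The names ``sigmoid``, ``x0``, ``w0``, ``a0`` and ``z0`` are ours, chosen for this
illustration; the numbers are the ones from the walkthrough.

.. code-block:: python

   import numpy as np

   def sigmoid(x):
       """Element-wise sigmoid activation: 1 / (1 + e^(-x))."""
       return 1.0 / (1.0 + np.exp(-x))

   # Sample D_37 with the bias value +1 concatenated, giving the row vector X_0.
   x0 = np.array([3.0, -5.0, 12.0, 1.0])

   # Layer W_0: a 4x3 matrix with one column per neuron; the last row multiplies the bias.
   w0 = np.array([[ 0.3073, -3.31913, -2.455 ],
                  [-0.121,  -2.149,    0.041 ],
                  [-4.2342,  5.6798,   0.6527],
                  [-3.6295, 12.88588, -0.499 ]])

   a0 = x0 @ w0      # affine of the first layer: [-52.913, 81.83109, -0.2366]
   z0 = sigmoid(a0)  # layer output, approximately [0, 1, 0.441]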
* We're not done yet! :math:`Z_0` is the output from the layer :math:`W_0`, but to get :math:`X_1`,
  the input to layer :math:`W_1`, we must concatenate a bias value of +1 to the end of :math:`Z_0`.

  .. math::

     X_1 &= \left[\begin{array}{cc} Z_0, & +1 \end{array}\right] \\
         &= \left[\begin{array}{cccc} 0 & 1 & 0.441 & +1 \end{array}\right]

* We pass this as the input to layer :math:`W_1`, which is also a :math:`4 \times 3` matrix, and
  similarly obtain :math:`Z_1` and :math:`X_2`.

  .. math::

     X_2 &= \left[\begin{array}{cc} Z_1, & +1 \end{array}\right] \\
         &= \left[\begin{array}{cc} activation(X_1 \cdot W_1), & +1 \end{array}\right]

* Similarly, we compute all the way until we get :math:`Z_{L-1}`. In the example above, that is
  :math:`Z_2`.

  .. math::

     Z_2 = activation(\left[\begin{array}{cc} activation(\left[\begin{array}{cc} activation(X_0 \cdot W_0), & +1 \end{array}\right] \cdot W_1), & +1 \end{array}\right] \cdot W_2)

* We feed the output of the final hidden layer into the output layer, where an *output function*
  computes the output of the network, :math:`O`.

* :math:`Z_{L-1}` does **not** have a bias unit concatenated to it when we feed it to the output
  layer.

* For the example above, assume we are performing multi-class classification with :math:`K=3`
  output classes.

* Let :math:`Z_{L-1} = Z_2 = \left[\begin{array}{ccc} 0.2 & 0.0013 & 0.998 \end{array}\right]`

* We will use the *Softmax function* to convert our outputs into a probability distribution over
  the 3 classes.

* For the :math:`i^{th}` element in :math:`Z_{L-1}`, we obtain the Softmax value as:

  .. math::

     Softmax(Z_{L-1}, i) = \frac{ e^{Z_{(L-1, i)}} }{ \sum_{k=0}^{K-1} e^{Z_{(L-1, k)}} }

  i.e. we normalize the exponentials of :math:`Z_{L-1}`.

* We calculate each of these and put them into a vector:

  .. math::

     Softmax(Z_{L-1}) = \left[\begin{array}{c} Softmax(Z_{L-1}, i) \end{array}\right]_{i=0}^{K-1}

* The softmax vector sums to :math:`1`, so each value can be considered the probability of
  belonging to the corresponding class, as predicted by our network.

* Applying the softmax operation to :math:`Z_2`, we obtain the network output, :math:`O`:

  .. math::

     O = Softmax(Z_2) = \left[\begin{array}{ccc} 0.2474 & 0.2029 & 0.5497 \end{array}\right]

* We need to calculate how (in)accurate our network's output was. For this, we use an *Error
  function*, :math:`E`.

* In our problem, there are :math:`K=3` classes: :math:`0, 1, 2`.

* Let's assume the correct class for :math:`D_{37}` was the third one, i.e. :math:`Y_{37} = 2`.

* We can't directly compare our output vector with this value. So instead, we use a mechanism
  known as *one-hot encoding* and convert :math:`Y_{37}` into the vector
  :math:`\left[\begin{array}{ccc} 0 & 0 & 1 \end{array}\right]`. The third element is :math:`1`,
  meaning our example :math:`D_{37}` belongs to the third class.

* Let's use the *Squared Error function* to calculate how different our network's prediction
  :math:`O` is from the actual output from the dataset, i.e. :math:`Y_{37}`.

* Squared Error:

  .. math::

     E(O, Y_i) = \frac{1}{2} \cdot \sum_{k=0}^{K-1} {\left( O_k - Y_{(i, k)} \right)}^2

  i.e. we sum the squared differences between each element of the predicted output and the actual
  output. This value is always non-negative.

* In the example above, we get the squared error value as:

  .. math::

     E &= \frac{1}{2} \cdot \left( (0.2474 - 0)^2 + (0.2029 - 0)^2 + (0.5497 - 1)^2 \right) \\
       &= 0.1526

.. Vectorized feedforward with a batch of samples
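To tie the output-layer steps together, here is a minimal NumPy sketch that applies the softmax to
the :math:`Z_2` vector given above, one-hot encodes the label :math:`Y_{37} = 2`, and evaluates the
squared error. The helper names ``softmax``, ``one_hot`` and ``squared_error`` are ours, chosen for
this illustration; the numbers come from the walkthrough.

.. code-block:: python

   import numpy as np

   def softmax(z):
       """Normalize the exponentials of z into a probability distribution."""
       e = np.exp(z)
       return e / e.sum()

   def one_hot(label, num_classes):
       """Convert an integer class label into a one-hot row vector."""
       y = np.zeros(num_classes)
       y[label] = 1.0
       return y

   def squared_error(o, y):
       """Half the sum of squared differences between prediction and target."""
       return 0.5 * np.sum((o - y) ** 2)

   z2 = np.array([0.2, 0.0013, 0.998])  # output of the final hidden layer, Z_{L-1}
   o = softmax(z2)                      # network output O, approximately [0.2474, 0.2029, 0.5497]
   y37 = one_hot(2, num_classes=3)      # Y_37 = 2, one-hot encoded as [0, 0, 1]
   e = squared_error(o, y37)            # error E, approximately 0.1526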