At their core, neural networks are sophisticated function approximators built from a series of interconnected computational units. Understanding the fundamental mathematics that govern their operation is essential for identifying potential vulnerabilities. This section breaks down the key computations, from the individual neuron to the full learning cycle.
The Core Unit: The Neuron
The simplest computational unit in a neural network is the neuron, or perceptron. Its operation involves two primary steps:
- Weighted Sum: The neuron receives multiple inputs, each multiplied by a corresponding weight. These weighted inputs are summed together, and a bias term is added. This linear combination is often denoted as z.
- Activation: The result of the weighted sum, z, is passed through a non-linear activation function, denoted by σ (sigma) or g, to produce the neuron’s final output, a.
For an input vector x, weight vector w, and bias b, the computation is:
z = w · x + b
a = σ(z)
The weights and biases are the learnable parameters of the network. During training, the goal is to adjust these parameters to minimize the model’s prediction error.
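The two steps above can be sketched in a few lines of NumPy. The input, weight, and bias values below are arbitrary illustrative numbers, and sigmoid is chosen here simply as a concrete σ:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (chosen arbitrarily for the example)
x = np.array([0.5, -1.0, 2.0])   # input vector
w = np.array([0.1, 0.4, -0.2])   # weight vector
b = 0.3                          # bias

z = np.dot(w, x) + b   # weighted sum:  z = w · x + b
a = sigmoid(z)         # activation:    a = σ(z)
print(z, a)
```

During training, only w and b would change; x is whatever the data provides.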
Forward Propagation: From Input to Output
Forward propagation is the process of passing an input through the network, layer by layer, to generate an output. The output of one layer becomes the input for the next. Using matrix notation, this process becomes highly efficient.
For a given layer l:
Z[l] = W[l] A[l-1] + b[l]
A[l] = g[l](Z[l])
- A[l-1] is the matrix of activations from the previous layer (or the input data X for the first layer).
- W[l] is the weight matrix for the current layer.
- b[l] is the bias vector for the current layer.
- Z[l] is the matrix of linear combinations for the current layer.
- g[l] is the activation function for the current layer.
- A[l] is the matrix of activations (output) for the current layer.
Figure 1: Illustration of forward propagation through a simple neural network.
Activation Functions: Introducing Non-Linearity
Without activation functions, a neural network would just be a series of linear operations, equivalent to a single, simpler linear model. Activation functions introduce non-linearity, enabling the network to learn complex relationships in the data.
| Function | Formula | Range | Key Characteristic |
|---|---|---|---|
| Sigmoid | σ(z) = 1 / (1 + e^(-z)) | (0, 1) | S-shaped curve. Historically used for binary classification outputs, but prone to vanishing gradients. |
| Tanh | tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)) | (-1, 1) | Zero-centered S-shaped curve. Often performs better than sigmoid in hidden layers. |
| ReLU | ReLU(z) = max(0, z) | [0, ∞) | Computationally efficient and helps mitigate vanishing gradients. Can suffer from the “dying ReLU” problem. |
| Softmax | S(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ) | (0, 1) | Used in the output layer for multi-class classification. Converts logits into a probability distribution. |
Loss Functions: Quantifying Error
A loss (or cost) function measures the discrepancy between the model’s prediction (ŷ) and the true label (y). The goal of training is to minimize this value.
Mean Squared Error (MSE)
Commonly used for regression tasks. It averages the squared differences between predictions and true values:

L(ŷ, y) = (1/m) * Σᵢ (ŷᵢ – yᵢ)²
Cross-Entropy Loss
The standard for classification tasks. It penalizes confident but incorrect predictions more heavily.
L(ŷ, y) = –(1/m) * Σᵢ [yᵢ log(ŷᵢ) + (1 – yᵢ) log(1 – ŷᵢ)]
In these formulas, m represents the number of examples in the batch.
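Both losses are a few lines of NumPy. The versions below are minimal sketches; the clipping in the cross-entropy is our addition, there only to keep log(0) from producing infinities:

```python
import numpy as np

def mse(y_hat, y):
    # Mean of squared differences over the batch
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 so the logs stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1.0, 0.0, 1.0])   # true labels
y_hat = np.array([0.9, 0.2, 0.6])   # model predictions
print(mse(y_hat, y))
print(binary_cross_entropy(y_hat, y))
```

The penalty asymmetry is easy to verify: a confident wrong prediction (ŷ = 0.99 when y = 0) costs far more than a mildly wrong one (ŷ = 0.6).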
Backpropagation: Learning from Mistakes
After calculating the loss, the network needs to learn how to improve. Backpropagation is the algorithm that makes this possible. It efficiently computes the gradient of the loss function with respect to each weight and bias in the network. It does this by applying the chain rule of calculus, propagating the error backward from the output layer to the input layer.
The core idea is to determine how much each parameter contributed to the total error. The gradient, ∂L/∂W, tells us the direction and magnitude to adjust the weights W to decrease the loss L.
The parameter update rule, driven by an optimization algorithm like Gradient Descent, is:
Wnew = Wold – α * ∂L/∂Wold
bnew = bold – α * ∂L/∂bold
Here, α (alpha) is the learning rate, a hyperparameter that controls the step size of each update. This iterative process of forward propagation, loss calculation, backpropagation, and parameter update is repeated until the model’s performance converges.
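The full cycle can be demonstrated on the smallest possible model: a single linear neuron trained with MSE, using gradients derived by hand via the chain rule. The toy data and hyperparameters below are our own choices for illustration:

```python
import numpy as np

# Toy regression data generated from y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0

w, b = 0.0, 0.0   # learnable parameters, initialized to zero
alpha = 0.05      # learning rate

for _ in range(500):
    y_hat = w * X + b                   # 1. forward propagation
    loss = np.mean((y_hat - y) ** 2)    # 2. loss calculation (MSE)
    # 3. backpropagation: chain-rule gradients of the loss
    dw = 2.0 * np.mean((y_hat - y) * X)
    db = 2.0 * np.mean(y_hat - y)
    # 4. parameter update: step against the gradient
    w -= alpha * dw
    b -= alpha * db

print(w, b)  # converges toward the true values 2 and 1
```

Shrinking α slows convergence; growing it too far makes the updates overshoot and diverge, which is why the learning rate is tuned as a hyperparameter.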
```python
# Python/NumPy example of a single forward propagation step
import numpy as np

def relu(Z):
    # ReLU activation function
    return np.maximum(0, Z)

def forward_step(A_prev, W, b):
    """
    Computes the forward propagation for one layer.

    Arguments:
    A_prev -- Activations from the previous layer (or input data)
    W -- Weight matrix for the current layer
    b -- Bias vector for the current layer

    Returns:
    A -- The output of the activation function
    """
    # 1. Linear combination: Z = W · A_prev + b
    Z = np.dot(W, A_prev) + b
    # 2. Activation
    A = relu(Z)
    return A
```
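To see the layer-by-layer chaining and the matrix shapes in action, the step above can be applied twice in a row. The sketch below repeats the two helper definitions so it runs on its own; the layer sizes and random inputs are arbitrary:

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def forward_step(A_prev, W, b):
    # One layer: linear combination followed by ReLU activation
    return relu(np.dot(W, A_prev) + b)

rng = np.random.default_rng(0)
X  = rng.standard_normal((4, 5))        # 4 features, batch of 5 examples
W1 = rng.standard_normal((3, 4)) * 0.1  # layer 1: 4 -> 3 units
b1 = np.zeros((3, 1))                   # bias broadcasts across the batch
W2 = rng.standard_normal((2, 3)) * 0.1  # layer 2: 3 -> 2 units
b2 = np.zeros((2, 1))

A1 = forward_step(X, W1, b1)    # shape (3, 5)
A2 = forward_step(A1, W2, b2)   # shape (2, 5)
print(A1.shape, A2.shape)
```

Each weight matrix has shape (units in this layer, units in the previous layer), so the output of one `forward_step` slots directly into the next.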