Artificial Neural Network Basics

 

Introduction to Artificial Neural Networks

Neural Networks are a subset of machine learning models inspired by the structure and function of the human brain. They are designed to recognize patterns and learn from data, making them powerful tools for tasks such as image recognition, natural language processing, and predictive analytics.

1. Basic Structure

  • Neurons: The fundamental building block of a neural network is the neuron (or node). Each neuron receives input, processes it, and sends output to other neurons.

  • Layers: Neural networks are composed of layers of neurons:

    • Input Layer: This is where the network receives the initial data.

    • Hidden Layers: These layers process the inputs received from the input layer. A network can have multiple hidden layers.

    • Output Layer: This layer produces the final output of the network.

2. Types of Neural Networks

  • Feedforward Neural Networks: The simplest type, where connections move only in one direction—from input to output.

  • Convolutional Neural Networks (CNNs): Primarily used for image and video recognition. They apply convolutional layers to detect patterns.

  • Recurrent Neural Networks (RNNs): Suitable for sequential data like time series or natural language. They have connections that loop back, allowing them to maintain memory of previous inputs.

3. Key Concepts

  • Activation Functions: These functions determine whether a neuron should be activated. Common activation functions include Sigmoid, ReLU (Rectified Linear Unit), and Tanh.

  • Weights and Biases: Each connection between neurons has an associated weight, and each neuron has a bias. These parameters are adjusted during training to minimize the error.

  • Forward Propagation: The process of passing the input data through the network to generate an output.

  • Backpropagation: The training algorithm that adjusts the weights and biases by minimizing the error. It involves calculating the gradient of the loss function with respect to each weight by using the chain rule.

4. Training Neural Networks

  • Data Preparation: Split the dataset into training, validation, and test sets. Normalize the data to ensure that it has a mean of zero and a standard deviation of one.

  • Loss Function: A function that measures the difference between the network's predicted output and the actual output. Common loss functions include Mean Squared Error (MSE) and Cross-Entropy Loss.

  • Optimization Algorithm: Methods like Stochastic Gradient Descent (SGD), Adam, or RMSprop are used to minimize the loss function.

  • Epochs and Batches: Training is organized into epochs; one epoch is a single pass of the entire dataset through the network, and training usually runs for many epochs. Within each epoch, the data is divided into smaller batches for efficient computation.

5. Applications of Neural Networks

  • Image and Speech Recognition: Detecting objects in images or converting speech to text.

  • Natural Language Processing: Tasks like language translation, sentiment analysis, and chatbots.

  • Healthcare: Diagnosing diseases from medical images or predicting patient outcomes.

  • Finance: Fraud detection, stock price prediction, and algorithmic trading.

Example in Python

Here's a simple example of a feedforward neural network using the Keras library:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Generate some data
X = np.random.rand(1000, 20)  # 1000 samples, 20 features
y = np.random.randint(2, size=(1000, 1))  # Binary labels

# Create the model
model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))  # Hidden layer
model.add(Dense(1, activation='sigmoid'))  # Output layer

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=10, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X, y)
print(f"Loss: {loss}, Accuracy: {accuracy}")
```

This example demonstrates:

  • Model Creation: Building a simple neural network with one hidden layer.

  • Compilation: Setting the optimizer, loss function, and metrics.

  • Training: Training the model with the input data.

  • Evaluation: Evaluating the model's performance on the same data.



What is a Perceptron?

A perceptron is a type of artificial neuron used in machine learning. It's the building block of larger neural network architectures. The perceptron algorithm was invented in 1958 by Frank Rosenblatt and is primarily used for binary classification tasks.

Structure of a Perceptron

A perceptron consists of:

  1. Input Features (x₁, x₂, ..., xₙ): These are the input values or features of the data.

  2. Weights (w₁, w₂, ..., wₙ): Each input feature is associated with a weight.

  3. Bias (b): An additional parameter added to the weighted sum of inputs to adjust the output.

  4. Activation Function: Determines whether the neuron should activate or not, based on the weighted sum of inputs.

Perceptron Equation

The perceptron computes a weighted sum of the input features and passes it through an activation function to produce an output:

z = Σ (from i=1 to n) wᵢxᵢ + b

The output is determined by applying the activation function (usually a step function):

ŷ = 1 if z > 0, otherwise 0

Training a Perceptron

Training a perceptron involves adjusting the weights and bias to minimize the classification error. The steps are:

  1. Initialize Weights and Bias: Set initial values for weights and bias, typically to small random numbers.

  2. Compute the Output: For each input sample, compute the weighted sum and apply the activation function.

  3. Update Weights and Bias: Adjust the weights and bias based on the difference between the predicted output and the actual output. This is done using the following update rule:

    • For Weights: wᵢ = wᵢ + Δwᵢ, where Δwᵢ = η * (y - ŷ) * xᵢ

    • For Bias: b = b + η * (y - ŷ)

    • Here, η is the learning rate, y is the actual output, and ŷ is the predicted output.
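The update rule above can be sketched in a few lines of NumPy. This is an illustrative sketch, not library code: the helper names and the tiny AND-gate dataset are our own choices.

```python
import numpy as np

# Minimal perceptron sketch: learn the AND function with a step activation.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights w1, w2
b = 0.0           # bias
eta = 0.1         # learning rate η

for epoch in range(20):
    for xi, target in zip(X, y):
        z = np.dot(w, xi) + b          # weighted sum
        y_hat = 1 if z > 0 else 0      # step activation
        error = target - y_hat         # (y - ŷ)
        w += eta * error * xi          # weight update: wᵢ += η(y - ŷ)xᵢ
        b += eta * error               # bias update:   b  += η(y - ŷ)

predictions = [1 if np.dot(w, xi) + b > 0 else 0 for xi in X]
print(predictions)  # converges to [0, 0, 0, 1]
```

Because AND is linearly separable, the loop converges after a handful of epochs; on non-separable data (e.g. XOR) a single perceptron cannot converge, which is what motivates multi-layer networks.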


Inputs of a Neural Network:

  • Features (X): The raw data or features provided to the network (e.g., pixel values for an image, words in a text).

Outputs of a Neural Network:

  • Predictions (Y_hat): The network's prediction or classification based on the input data (e.g., identifying the object in an image, predicting the next word in a sentence).

The input data passes through the network layers, undergoes transformations via weights and activation functions, and finally results in an output prediction.

Weights and Bias in Neural Networks

Weights and Bias are fundamental components in the architecture of neural networks. Here's a brief overview:

1. Weights

  • Purpose: Weights determine the importance of each input feature. They are the parameters that the network learns during training to make accurate predictions.

  • Function: Each input to a neuron is multiplied by its corresponding weight. This helps in amplifying or diminishing the input signal.

  • Update: Weights are updated through a process called backpropagation, where the network learns by minimizing the error (loss) using gradient descent.

2. Bias

  • Purpose: The bias allows the activation function to be shifted left or right, which helps the model in capturing the true patterns in the data.

  • Function: It's added to the weighted sum of the inputs. This helps in fitting the data better by allowing the activation function to be adjusted.

  • Update: Like weights, the bias is also adjusted during the training process to minimize the error.

Mathematical Representation

z = Σᵢ wᵢxᵢ + b

z = wᵀx + b

Example

In a simple neural network with three input features and one neuron:

  • Inputs: x₁, x₂, x₃

  • Weights: w₁, w₂, w₃

  • Bias: b

The neuron's output before applying the activation function would be:

z = w₁x₁ + w₂x₂ + w₃x₃ + b

This weighted sum z is then passed through an activation function (like ReLU or Sigmoid) to get the final output.

Understanding weights and biases is crucial as they are the parameters that a neural network learns and optimizes during the training process. This optimization allows the model to make accurate predictions and learn from the data effectively.


Working of a Neuron

  1. Inputs: The neuron receives input signals (features) x₁, x₂, ..., xₙ

  2. Weights: Each input is multiplied by a corresponding weight w₁, w₂, ..., wₙ.

  3. Weighted Sum: The neuron calculates the weighted sum of the inputs:

            z = Σᵢ wᵢxᵢ + b

Here, b is the bias term.

  4. Activation Function: The weighted sum z is passed through an activation function (like ReLU, Sigmoid, or Tanh) to produce the neuron's output a:

            a = activation(z)

  5. Output: The output a is then passed to the next layer of neurons or becomes the final output if it is the output layer.

This process enables the neuron to learn and make decisions based on the input data. Each neuron's output serves as the input for the next layer in a multi-layer network.
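A single neuron's computation can be sketched directly in NumPy; the input values and weights below are made up for illustration.

```python
import numpy as np

# One neuron: weighted sum plus bias, then an activation.
x = np.array([0.5, -1.0, 2.0])   # inputs x1, x2, x3
w = np.array([0.4, 0.3, -0.2])   # weights w1, w2, w3
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum: z = w·x + b  (≈ -0.4 here)
a = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation

print(z)
print(round(float(a), 4))        # sigmoid(-0.4) ≈ 0.4013
```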


Types of Activation Functions

Activation functions play a crucial role in determining whether a neuron should be activated, thereby influencing the model's learning and output. Here are some common types of activation functions:

1. Sigmoid (Logistic) Function

  • Formula: σ(x) = 1 / (1 + e⁻ˣ)

  • Range: 0 to 1

  • Usage: Often used in binary classification problems.

  • Pros: Smooth gradient, outputs probabilities.

  • Cons: Can cause vanishing gradient problems.

2. Hyperbolic Tangent (Tanh) Function

  • Formula: tanh(x) = 2/(1 + e⁻²ˣ) - 1

  • Range: -1 to 1

  • Usage: Commonly used in hidden layers of neural networks.

  • Pros: Zero-centered, smooth gradient.

  • Cons: Can also suffer from vanishing gradients.

3. Rectified Linear Unit (ReLU)

  • Formula: ReLU(x) = max(0, x)

  • Range: 0 to infinity

  • Usage: Widely used in hidden layers of deep neural networks.

  • Pros: Efficient computation, mitigates vanishing gradient problem.

  • Cons: Can cause "dying ReLU" problem where neurons get stuck at 0.

4. Leaky ReLU

  • Formula: Leaky ReLU(x) = max(αx, x)

  • Range: Negative infinity to infinity

  • Usage: Variant of ReLU to address the "dying ReLU" problem.

  • Pros: Prevents neurons from getting stuck at 0.

5. Parametric ReLU (PReLU)

  • Formula: PReLU(x) = max(αx, x), where α is a learnable parameter.

  • Range: Negative infinity to infinity

  • Usage: Adaptive version of Leaky ReLU.

  • Pros: Allows learning the value of α.

6. Exponential Linear Unit (ELU)

  • Formula:

            ELU(x) = { x, if x > 0; α(eˣ - 1), if x ≤ 0 }
  • Range: Negative infinity to infinity

  • Usage: Used in deep networks for faster convergence.

  • Pros: Alleviates vanishing gradient, outputs can be negative.

7. Softmax

  • Formula:

            Softmax(xᵢ) = eˣⁱ / Σⱼ eˣʲ
  • Range: 0 to 1 (sum to 1)

  • Usage: Often used in the output layer of multi-class classification problems.

  • Pros: Outputs probabilities, suitable for multi-class tasks.

Summary Table

| Activation Function | Formula | Range | Common Use |
|---|---|---|---|
| Sigmoid | 1 / (1 + e⁻ˣ) | 0 to 1 | Binary classification |
| Tanh | 2/(1 + e⁻²ˣ) - 1 | -1 to 1 | Hidden layers |
| ReLU | max(0, x) | 0 to ∞ | Hidden layers in deep networks |
| Leaky ReLU | max(αx, x) | -∞ to ∞ | Hidden layers |
| PReLU | max(αx, x), α learnable | -∞ to ∞ | Adaptive hidden layers |
| ELU | x if x > 0; α(eˣ - 1) if x ≤ 0 | -∞ to ∞ | Deep networks, faster convergence |
| Softmax | eˣⁱ / Σⱼ eˣʲ | 0 to 1 (sum to 1) | Output layer for multi-class classification |


Each activation function has its advantages and is suited for specific tasks. Choosing the right one can greatly impact the performance and efficiency of your neural network.
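For reference, most of these functions are one-liners in NumPy. The helper names below are our own; deep-learning frameworks such as Keras ship tested implementations of all of them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def relu(x):
    return np.maximum(0.0, x)             # zeroes out negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope for negatives

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    e = np.exp(x - np.max(x))             # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))                            # [0. 0. 3.]
print(softmax(x).sum())                   # probabilities sum to 1
```

Tanh is available directly as `np.tanh`.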


Parameters and Hyperparameters of Neural Networks

Understanding the difference between parameters and hyperparameters is crucial for building and training neural networks effectively.

Parameters

Parameters are the variables that the model learns from the training data. They are internal to the model and are updated during training. Key parameters include:

  • Weights: The multipliers applied to each input feature or the connections between neurons.

  • Biases: The offsets added to the weighted sum before applying the activation function.

These parameters are optimized through the training process using backpropagation and gradient descent to minimize the loss function.

Hyperparameters

Hyperparameters are the settings that need to be defined before the training process begins. They govern the overall behavior and structure of the model. Key hyperparameters include:

  • Learning Rate: The step size used by the optimization algorithm to update the model parameters. A higher learning rate can speed up training but may overshoot the optimal solution.

  • Number of Epochs: The number of times the entire training dataset passes through the network during training.

  • Batch Size: The number of training samples used in one forward/backward pass. Smaller batch sizes lead to more updates per epoch, while larger batches provide more stable updates.

  • Number of Layers and Neurons: The architecture of the neural network, including the number of hidden layers and neurons per layer.

  • Activation Functions: The functions applied to the outputs of neurons to introduce non-linearity (e.g., ReLU, Sigmoid).

  • Optimization Algorithm: The method used to minimize the loss function (e.g., Stochastic Gradient Descent, Adam).

  • Dropout Rate: The fraction of neurons randomly dropped during training to prevent overfitting.

  • Regularization Parameter: Parameters like L1 or L2 regularization to penalize large weights and reduce overfitting.


Common Notations in Neural Networks

  1. Input Variables

    • x: Input feature vector.

    • xᵢ: The ith feature in the input vector.

  2. Output Variables

    • y: Actual output or target value.

    • ŷ: Predicted output.

  3. Weights and Biases

    • w: Weight matrix.

    • wᵢⱼ: Weight connecting the ith neuron in the previous layer to the jth neuron in the current layer.

    • b: Bias term.

    • bⱼ: Bias for the jth neuron.

  4. Layers

    • L: Total number of layers in the neural network.

    • l: Layer index (1 through L).

    • n⁽ˡ⁾: Number of neurons in the lth layer.

    • a⁽ˡ⁾: Activation output of the lth layer.

  5. Activation Functions

    • σ: Sigmoid activation function.

    • ReLU: Rectified Linear Unit activation function.

    • tanh: Hyperbolic tangent activation function.

    • softmax: Softmax activation function.

  6. Feedforward and Backpropagation

    • z⁽ˡ⁾: Weighted sum (linear transformation) at the lth layer.

    • a⁽ˡ⁾ = activation(z⁽ˡ⁾): Activation output at the lth layer.

    • δ⁽ˡ⁾: Error term (delta) at the lth layer during backpropagation.

    • α: Learning rate.

  7. Loss Functions

    • L: Loss function.

    • J: Cost function.

Example

Here's an example notation for a simple two-layer neural network:

  • Input layer: x = [x₁, x₂, ..., xₙ]

  • Weights and Biases:

    • First layer: w⁽¹⁾, b⁽¹⁾

    • Second layer: w⁽²⁾, b⁽²⁾

  • Activations:

    • First layer output: a⁽¹⁾ = σ(w⁽¹⁾x + b⁽¹⁾)

    • Second layer output: ŷ = a⁽²⁾ = σ(w⁽²⁾a⁽¹⁾ + b⁽²⁾)

These notations help us systematically describe and implement neural networks, ensuring clarity and consistency in their construction and analysis.


Commonly used neural network architectures make the following simplifying assumptions:

  1. The neurons in an ANN are arranged in layers, and these layers are arranged sequentially.
  2. The neurons within the same layer do not interact with each other.
  3. The inputs are fed into the network through the input layer, and the outputs are sent out from the output layer.
  4. Neurons in consecutive layers are densely connected, i.e., all neurons in layer l are connected to all neurons in layer l+1.
  5. Every neuron in the neural network has a bias value associated with it, and each interconnection has a weight associated with it.
  6. All neurons in a particular hidden layer use the same activation function. Different hidden layers can use different activation functions, but in a hidden layer, all neurons use the same activation function.

Assumptions for Simplifying Neural Networks

  1. Linear Activation Functions:

    • Assuming linear activation functions (like identity functions) simplifies the mathematics, but this limits the network's ability to model complex, non-linear relationships.

  2. Single Layer Perceptron:

    • Simplifying to a single-layer network (perceptron) helps in understanding basic concepts, although it's limited to linearly separable problems.

  3. Small Network Size:

    • Working with smaller networks (fewer layers and neurons) simplifies the computation and conceptualization but might not capture the complexity of the data.

  4. Uniform Weight Initialization:

    • Assuming all weights are initialized to small, random values or zeros makes the initialization process straightforward. However, this can lead to poor training performance.

  5. Fixed Learning Rate:

    • Using a fixed learning rate simplifies training but may not be optimal for convergence.

  6. Ignoring Regularization:

    • Ignoring techniques like dropout or L2 regularization can simplify the model, though it may lead to overfitting.

  7. Simple Datasets:

    • Using simple, synthetic datasets (like XOR or AND gate problems) instead of real-world complex data makes it easier to visualize and understand the network's learning process.

  8. Batch Gradient Descent:

    • Assuming the entire dataset is processed at once simplifies the gradient descent algorithm, though it's less efficient than mini-batch or stochastic gradient descent.

  9. Static Model Architecture:

    • Not considering dynamic changes in network architecture during training simplifies the model-building process.

Flow of Information Between Layers in a Neural Network

In a neural network, information flows from the input layer through the hidden layers to the output layer. Here's a step-by-step overview:

  1. Input Layer:

    • The input layer receives the raw data or features. Each neuron in this layer corresponds to one feature of the input data.

    • Input Vector (x): The data is represented as a vector x = [x₁, x₂, ..., xₙ].

  2. Hidden Layers:

    • Each hidden layer consists of neurons that process the inputs. The processing involves computing the weighted sum of the inputs plus a bias and then applying an activation function.

    • Weighted Sum (z⁽ˡ⁾): For the l-th layer, the weighted sum is computed as:

                            z⁽ˡ⁾ = W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾

Here, W⁽ˡ⁾ is the weight matrix, a⁽ˡ⁻¹⁾ is the activation output from the previous layer, and b⁽ˡ⁾ is the bias vector.

  • Activation Output (a⁽ˡ⁾): The activation function (e.g., ReLU, Sigmoid) is applied to z⁽ˡ⁾ to get the activation output:

            a⁽ˡ⁾ = activation(z⁽ˡ⁾)

            This output becomes the input for the next layer.

  3. Output Layer:

    • The final layer produces the output predictions of the network.

    • Output Vector (ŷ): The output layer computes the final activation output, which represents the predictions of the network. This can be a single value (for regression) or a probability distribution (for classification).

Example of a Forward Pass

Consider a simple neural network with one hidden layer:

  • Input Layer: x = [x₁, x₂, ..., xₙ]

  • Weights and Biases:

    • Hidden Layer: W⁽¹⁾, b⁽¹⁾

    • Output Layer: W⁽²⁾, b⁽²⁾

  • Activation Functions:

    • Hidden Layer: ReLU

    • Output Layer: Sigmoid (for binary classification)

Step-by-Step Forward Pass:

  1. Compute Weighted Sum for Hidden Layer:

                z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾
  2. Apply Activation Function (ReLU):

                a⁽¹⁾ = ReLU(z⁽¹⁾)
  3. Compute Weighted Sum for Output Layer:

                z⁽²⁾ = W⁽²⁾a⁽¹⁾ + b⁽²⁾
  4. Apply Activation Function (Sigmoid):

                ŷ = Sigmoid(z⁽²⁾)

This forward pass results in the predicted output ŷ.

Summary

  • Input Layer: Receives and normalizes the raw input data.

  • Hidden Layers: Process the data through weighted sums and activation functions to extract features and learn patterns.

  • Output Layer: Produces the final predictions based on the processed information.


let's walk through a concrete example of a forward pass in a simple neural network with one hidden layer.

Network Architecture:

  • Input Layer: 2 input features (x₁, x₂)
  • Hidden Layer: 3 neurons (using ReLU activation)
  • Output Layer: 1 neuron (using Sigmoid activation for binary classification)

Weights and Biases (Randomly initialized for demonstration):

  • W¹ (Weights between Input and Hidden Layer):
[[0.2, 0.5],
 [-0.3, 0.8],
 [0.1, -0.4]]
  • b¹ (Biases for Hidden Layer):
[[0.1],
 [-0.2],
 [0.3]]
  • W² (Weights between Hidden and Output Layer):
[[0.6, -0.7, 0.2]]
  • b² (Bias for Output Layer):
[[0.5]]

Input Data:

  • x:
[[2],
 [3]]

Forward Pass Calculation:

  1. Hidden Layer Calculation:

    • z¹ = W¹ * x + b¹
    [[0.2, 0.5],   *   [[2],   +   [[0.1],
     [-0.3, 0.8],       [3]]        [-0.2],
     [0.1, -0.4]]                     [0.3]]
    
    =  [[0.4 + 1.5],   +   [[0.1],
        [-0.6 + 2.4],       [-0.2],
        [0.2 - 1.2]]        [0.3]]
    
    =  [[1.9],   +   [[0.1],
        [1.8],       [-0.2],
        [-1.0]]        [0.3]]
    
    =  [[2.0],
        [1.6],
        [-0.7]]
    
    • a¹ = ReLU(z¹) (ReLU activation)
    [[ReLU(2.0)],
     [ReLU(1.6)],
     [ReLU(-0.7)]]
    
    =  [[2.0],
        [1.6],
        [0.0]]
    
  2. Output Layer Calculation:

    • z² = W² * a¹ + b²
    [[0.6, -0.7, 0.2]]   *   [[2.0],   +   [[0.5]]
                                [1.6],
                                [0.0]]
    
    =  [[1.2 - 1.12 + 0]]   +   [[0.5]]
    
    =  [[0.08]]   +   [[0.5]]
    
    =  [[0.58]]
    
    • ŷ = Sigmoid(z²) (Sigmoid activation)
    Sigmoid(0.58) = 1 / (1 + exp(-0.58)) ≈ 0.64
    

Result:

The output of the forward pass, ŷ, is approximately 0.64. This represents the network's prediction for the given input x.

Key Points:

  • Matrix multiplication is used to efficiently calculate the weighted sums.
  • Activation functions introduce non-linearity, which is crucial for the network to learn complex patterns.
  • The output of each layer becomes the input to the next layer.

This example demonstrates a single forward pass. In a real training scenario, this forward pass would be followed by a backward pass (to calculate gradients) and an optimization step (to update the weights and biases). This process is repeated many times until the network learns to make accurate predictions.
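The worked arithmetic above can be checked directly with NumPy, using the same weights, biases, and input:

```python
import numpy as np

# Reproduce the forward pass from the worked example.
W1 = np.array([[0.2, 0.5], [-0.3, 0.8], [0.1, -0.4]])
b1 = np.array([[0.1], [-0.2], [0.3]])
W2 = np.array([[0.6, -0.7, 0.2]])
b2 = np.array([[0.5]])
x = np.array([[2.0], [3.0]])

z1 = W1 @ x + b1                   # hidden-layer weighted sum
a1 = np.maximum(0.0, z1)           # ReLU
z2 = W2 @ a1 + b2                  # output-layer weighted sum
y_hat = 1.0 / (1.0 + np.exp(-z2))  # sigmoid

print(a1.ravel())                  # [2.0, 1.6, 0.0] (up to float rounding)
print(round(y_hat.item(), 2))      # ≈ 0.64
```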


Feedforward Algorithm Steps

  1. Initialization:

    • Start with an input vector x and initialize weights W and biases b for each layer.

  2. Input Layer:

    • The input data x is fed into the network through the input layer.

  3. Hidden Layers:

    • Weighted Sum: For each hidden layer l:

                        z⁽ˡ⁾ = W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ 

Here, a⁽ˡ⁻¹⁾ is the activation from the previous layer (or the input x for the first hidden layer).

  • Activation Function: Apply an activation function (e.g., ReLU, Sigmoid) to the weighted sum to get the activation a⁽ˡ⁾:

            a⁽ˡ⁾ = activation(z⁽ˡ⁾)
  4. Output Layer:

    • Weighted Sum: Compute the weighted sum for the output layer:

                z⁽ᴸ⁾ = W⁽ᴸ⁾a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾
  • Activation Function: Apply the activation function (e.g., Sigmoid for binary classification) to get the final output ŷ:

                ŷ = activation(z⁽ᴸ⁾)

Example    

Consider a simple neural network with one hidden layer:

  • Input Vector: x = [x₁, x₂]

  • Weights and Biases:

    • Hidden Layer:

            W⁽¹⁾ = [ w₁₁⁽¹⁾  w₁₂⁽¹⁾ ]        b⁽¹⁾ = [ b₁⁽¹⁾ ]
                   [ w₂₁⁽¹⁾  w₂₂⁽¹⁾ ]               [ b₂⁽¹⁾ ]

    • Output Layer:

            W⁽²⁾ = [ w₁₁⁽²⁾  w₁₂⁽²⁾ ]        b⁽²⁾ = b₁⁽²⁾

Step-by-Step Forward Pass:

  1. Compute Weighted Sum for Hidden Layer:

        z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾

        z⁽¹⁾ = [ w₁₁⁽¹⁾  w₁₂⁽¹⁾ ] [ x₁ ] + [ b₁⁽¹⁾ ]
               [ w₂₁⁽¹⁾  w₂₂⁽¹⁾ ] [ x₂ ]   [ b₂⁽¹⁾ ]

  2. Apply Activation Function (ReLU):

        a⁽¹⁾ = ReLU(z⁽¹⁾)
  3. Compute Weighted Sum for Output Layer:

        z⁽²⁾ = W⁽²⁾a⁽¹⁾ + b⁽²⁾

        z⁽²⁾ = [ w₁₁⁽²⁾  w₁₂⁽²⁾ ] [ a₁⁽¹⁾ ] + b₁⁽²⁾
                                  [ a₂⁽¹⁾ ]

  4. Apply Activation Function (Sigmoid):

        ŷ = Sigmoid(z⁽²⁾)

Summary

  • Input Layer: Receives and normalizes the input data.

  • Hidden Layers: Compute weighted sums and apply activation functions.

  • Output Layer: Produces the final prediction by applying an activation function to the weighted sum.
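The layer-by-layer rule above (z⁽ˡ⁾ = W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾, then a⁽ˡ⁾ = activation(z⁽ˡ⁾)) can be written as one short loop. The function name, layer sizes, and random weights below are illustrative choices of ours, not a fixed API.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, layers):
    """layers: list of (W, b, activation) tuples, one per layer."""
    a = x
    for W, b, activation in layers:
        z = W @ a + b          # weighted sum for this layer
        a = activation(z)      # activation output, fed to the next layer
    return a

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 2)), np.zeros((4, 1)), relu),     # hidden layer
    (rng.normal(size=(1, 4)), np.zeros((1, 1)), sigmoid),  # output layer
]
y_hat = feedforward(np.array([[0.5], [-0.2]]), layers)
print(y_hat.shape)  # (1, 1); the sigmoid output lies in (0, 1)
```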


A loss function (also known as a cost function) measures how well a neural network's predictions match the actual target values. It quantifies the error between the predicted outputs and the true labels, guiding the model during training to learn and improve.

Here are some common loss functions used in neural networks:

1. Mean Squared Error (MSE)

  • Formula:

            L(y, ŷ) = (1/n) * Σᵢ (yᵢ - ŷᵢ)²
  • Usage: Used for regression tasks where the goal is to predict a continuous value.

  • Pros: Penalizes larger errors more heavily, promoting smaller, consistent errors.

2. Mean Absolute Error (MAE)

  • Formula: L(y, ŷ) = (1/n) * Σᵢ |yᵢ - ŷᵢ|

  • Usage: Also used for regression tasks.

  • Pros: Less sensitive to outliers compared to MSE.

3. Binary Cross-Entropy (Log Loss)

  • Formula:

            L(y, ŷ) = -(1/n) * Σᵢ [yᵢlog(ŷᵢ) + (1 - yᵢ)log(1 - ŷᵢ)]
  • Usage: Used for binary classification tasks.

  • Pros: Effective for binary outcomes by measuring the distance between the true label and the predicted probability.

4. Categorical Cross-Entropy

  • Formula: L(y, ŷ) = - Σᵢ Σⱼ yᵢⱼlog(ŷᵢⱼ)

  • Usage: Used for multi-class classification tasks.

  • Pros: Handles multiple classes by comparing true labels with predicted probabilities.

5. Hinge Loss

  • Formula: L(y, ŷ) = Σᵢ max(0, 1 - yᵢ * ŷᵢ)

  • Usage: Commonly used for Support Vector Machines (SVMs) but can be used in neural networks for classification tasks.

  • Pros: Promotes a margin of separation between classes.
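The loss formulas above translate directly to NumPy. The helper names are our own, and the eps-clipping in the cross-entropy is a common guard against log(0), not part of the formula itself.

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: average of squared differences
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: average of absolute differences
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(round(float(mse(y, y_hat)), 4))  # mean of [0.01, 0.04, 0.16] = 0.07
print(round(float(mae(y, y_hat)), 4))  # mean of [0.1, 0.2, 0.4] ≈ 0.2333
```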


Learning in Neural Networks

Learning in neural networks involves adjusting the model's parameters (weights and biases) to minimize the error between the predicted and actual outputs. This process is typically divided into several key steps:

1. Initialization

  • Weights and Biases: Initialize the weights and biases with small random values. Proper initialization is crucial to ensure that the network learns effectively.

2. Forward Pass

  • Input Data: Feed the input data through the network.

  • Activation: Compute the weighted sums and apply activation functions to produce output predictions.

3. Loss Calculation

  • Loss Function: Calculate the loss (error) by comparing the predicted output to the actual target values using a loss function (e.g., Mean Squared Error, Cross-Entropy Loss).

4. Backward Pass (Backpropagation)

  • Compute Gradients: Calculate the gradients of the loss function with respect to the weights and biases using the chain rule of calculus. This process is called backpropagation.

    • Error Term (δ): Compute the error term for each layer.

    • Weight and Bias Gradients: Calculate the gradient of the loss function with respect to each weight and bias.

5. Parameter Update

  • Optimization Algorithm: Use an optimization algorithm (e.g., Stochastic Gradient Descent, Adam) to update the weights and biases based on the computed gradients. The update rule is typically:

            w = w - α * ∂L/∂w 
            b = b - α * ∂L/∂b

Where α is the learning rate.

6. Iteration

  • Epochs: Repeat the forward pass, loss calculation, backward pass, and parameter update for a specified number of epochs until the model converges to a solution or achieves a satisfactory performance.

Example

Here's a simplified example of learning in a neural network using Python and Keras:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# Generate some data
X = np.random.rand(1000, 20)  # 1000 samples, 20 features
y = np.random.randint(2, size=(1000, 1))  # Binary labels

# Create the model
model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))  # Hidden layer
model.add(Dense(1, activation='sigmoid'))  # Output layer

# Compile the model with SGD optimizer
model.compile(optimizer=SGD(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=10, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X, y)
print(f"Loss: {loss}, Accuracy: {accuracy}")
```

Summary

  • Initialization: Start with random weights and biases.

  • Forward Pass: Compute outputs by passing inputs through the network.

  • Loss Calculation: Measure the error using a loss function.

  • Backward Pass: Calculate gradients and perform backpropagation.

  • Parameter Update: Adjust weights and biases using an optimization algorithm.

  • Iteration: Repeat the process for multiple epochs until convergence.


ANN for Regression

  1. Objective:

    • Predict a continuous value.

    • Example: Predicting house prices, temperature, or stock prices.

  2. Output Layer:

    • Typically has one neuron for predicting a single continuous value.

    • Activation function: Often none (linear activation) or ReLU.

  3. Loss Function:

    • Commonly used: Mean Squared Error (MSE), Mean Absolute Error (MAE).

  4. Evaluation Metrics:

    • Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared.

ANN for Classification

  1. Objective:

    • Predict a discrete label or category.

    • Example: Classifying emails as spam or not spam, identifying handwritten digits.

  2. Output Layer:

    • For binary classification: One neuron with a Sigmoid activation function (outputs probability).

    • For multi-class classification: Multiple neurons (one for each class) with a Softmax activation function (outputs probabilities for each class).

  3. Loss Function:

    • Binary classification: Binary Cross-Entropy (Log Loss).

    • Multi-class classification: Categorical Cross-Entropy.

  4. Evaluation Metrics:

    • Accuracy, Precision, Recall, F1 Score, AUC-ROC.

Example Comparisons

Regression ANN Example:

  • Input: Features (e.g., size, location, number of rooms) of a house.

  • Output: Predicted house price (continuous value).

Classification ANN Example:

  • Input: Features (e.g., pixel values of an image).

  • Output: Predicted class (e.g., digit 0-9).

Key Differences

  • Output Layer and Activation Functions: Regression uses linear or ReLU activation, while classification uses Sigmoid or Softmax.

  • Loss Functions: Different loss functions tailored to the nature of the task (continuous vs. discrete).

  • Evaluation Metrics: Different metrics to evaluate performance based on the type of prediction (continuous vs. categorical).
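Assuming the same Keras setup used in the earlier examples, the two task types differ mainly in the output layer and the compile settings; the layer sizes below are illustrative only.

```python
from keras.models import Sequential
from keras.layers import Dense

# Regression head: one linear output neuron, MSE loss.
reg_model = Sequential([
    Dense(32, input_dim=10, activation='relu'),
    Dense(1, activation='linear'),          # continuous value
])
reg_model.compile(optimizer='adam', loss='mse')

# Multi-class classification head: one neuron per class, softmax + cross-entropy.
clf_model = Sequential([
    Dense(32, input_dim=10, activation='relu'),
    Dense(3, activation='softmax'),         # probabilities over 3 classes
])
clf_model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
```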


Backpropagation Algorithm

The backpropagation algorithm updates the weights and biases in a neural network to minimize the loss function. Here are the steps with the relevant formulas:

1. Forward Pass

  • Input Data: x = [x₁, x₂, ..., xₙ]

  • Weighted Sum for Hidden Layer (z⁽¹⁾):

            z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾

  • Activation for Hidden Layer (a⁽¹⁾):

            a⁽¹⁾ = ReLU(z⁽¹⁾)

  • Weighted Sum for Output Layer (z⁽²⁾):

            z⁽²⁾ = W⁽²⁾a⁽¹⁾ + b⁽²⁾

  • Activation for Output Layer (ŷ):

            ŷ = Sigmoid(z⁽²⁾)

2. Compute Loss

  • Loss Function (L), here binary cross-entropy:

            L(y, ŷ) = -(1/n) * Σᵢ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]

3. Backward Pass

  • Output Layer Error (\delta^{[2]}):

\delta^{[2]} = \hat{y} - y

  • Gradients with respect to Weights and Biases in the Output Layer (W^{[2]} and b^{[2]}):

\frac{\partial \mathcal{L}}{\partial W^{[2]}} = \delta^{[2]} (a^{[1]})^T
\frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]}

  • Hidden Layer Error (\delta^{[1]}), where \odot denotes element-wise multiplication:

\delta^{[1]} = (W^{[2]})^T \delta^{[2]} \odot \text{ReLU}'(z^{[1]})

  • Gradients with respect to Weights and Biases in the Hidden Layer (W^{[1]} and b^{[1]}):

\frac{\partial \mathcal{L}}{\partial W^{[1]}} = \delta^{[1]} x^T
\frac{\partial \mathcal{L}}{\partial b^{[1]}} = \delta^{[1]}

4. Parameter Update

  • Update Weights and Biases:

W^{[2]} = W^{[2]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial W^{[2]}}
b^{[2]} = b^{[2]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial b^{[2]}}
W^{[1]} = W^{[1]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial W^{[1]}}
b^{[1]} = b^{[1]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial b^{[1]}}

  • Here, \alpha is the learning rate.

Summary

  1. Forward Pass: Compute weighted sums and activations.

  2. Compute Loss: Calculate the loss using the chosen loss function.

  3. Backward Pass: Compute gradients using backpropagation.

  4. Parameter Update: Update weights and biases using the gradients.
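The four steps above can be sketched for a single training example. This is a minimal NumPy illustration, not a production implementation; the network size (2-3-1), the inputs, and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0.5], [0.2]])       # input column vector (2 features)
y = np.array([[1.0]])              # binary target

# Illustrative small network: 2 inputs -> 3 hidden (ReLU) -> 1 output (sigmoid)
W1, b1 = rng.normal(0, 0.5, (3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(0, 0.5, (1, 3)), np.zeros((1, 1))

# 1. Forward pass
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)             # ReLU
z2 = W2 @ a1 + b2
y_hat = 1 / (1 + np.exp(-z2))      # sigmoid

# 2. Compute loss (binary cross-entropy)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).item()

# 3. Backward pass (sigmoid + cross-entropy gives delta2 = y_hat - y)
delta2 = y_hat - y
dW2, db2 = delta2 @ a1.T, delta2
delta1 = (W2.T @ delta2) * (z1 > 0)   # ReLU'(z1) is 1 where z1 > 0
dW1, db1 = delta1 @ x.T, delta1

# 4. Parameter update with learning rate alpha
alpha = 0.05
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```

A second forward pass after the update should give a slightly smaller loss on this example.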



Gradient Descent for Backpropagation

Gradient descent is an optimization algorithm used to minimize the loss function in neural networks by iteratively adjusting the weights and biases. When combined with backpropagation, it efficiently updates the network parameters to reduce the prediction error. Here's how it works:

Steps of Gradient Descent with Backpropagation

1. Initialization

  • Initialize weights W and biases b with small random values.

2. Forward Pass

  • Compute the activations for each layer by passing the input data x through the network.

  • Calculate the weighted sums and apply the activation functions to produce the predicted output \hat{y}.

3. Compute Loss

  • Calculate the loss \mathcal{L} using a suitable loss function (e.g., Mean Squared Error, Cross-Entropy).

4. Backward Pass (Backpropagation)

  • Compute the error term at the output layer:

\delta^{[L]} = \hat{y} - y

  • For each hidden layer l (from the last hidden layer to the first):

    • Compute the error term (element-wise product with the activation derivative):

\delta^{[l]} = (W^{[l+1]})^T \delta^{[l+1]} \odot \text{ReLU}'(z^{[l]})

  • Compute the gradients for weights and biases:

    • For the weights W^{[l]}:

\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T

    • For the biases b^{[l]}:

\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \delta^{[l]}

5. Parameter Update

  • Update the weights and biases using the computed gradients and the learning rate \alpha:

W^{[l]} = W^{[l]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial W^{[l]}}
b^{[l]} = b^{[l]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial b^{[l]}}

6. Iteration

  • Repeat the forward pass, loss calculation, backpropagation, and parameter update for multiple epochs until the loss converges to a minimum or the model achieves satisfactory performance.
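The per-layer loop described in these steps can be sketched as a single update function. This assumes ReLU hidden layers and a sigmoid output; the function name `backprop_step` and the list-of-matrices representation are illustrative choices:

```python
import numpy as np

def relu(z):      return np.maximum(0, z)
def relu_grad(z): return (z > 0).astype(float)
def sigmoid(z):   return 1 / (1 + np.exp(-z))

def backprop_step(Ws, bs, x, y, alpha=0.1):
    """One gradient-descent update for a network with ReLU hidden layers
    and a sigmoid output. Ws and bs are lists of per-layer parameters."""
    # Forward pass: cache pre-activations zs and activations acts per layer.
    a, zs, acts = x, [], [x]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = sigmoid(z) if l == len(Ws) - 1 else relu(z)
        zs.append(z); acts.append(a)

    # Backward pass: error at the output, then propagate layer by layer.
    delta = a - y                              # sigmoid + cross-entropy output error
    for l in range(len(Ws) - 1, -1, -1):
        dW, db = delta @ acts[l].T, delta      # dL/dW = delta (a_prev)^T
        if l > 0:                              # error term for the layer below,
            delta = (Ws[l].T @ delta) * relu_grad(zs[l - 1])  # before updating Ws[l]
        Ws[l] -= alpha * dW; bs[l] -= alpha * db
    return a                                   # prediction before this update
```

Calling `backprop_step` repeatedly on the same example should move the prediction toward the target.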

Summary

  • Gradient Descent: Optimization algorithm used to minimize the loss function.

  • Forward Pass: Compute activations to produce the predicted output.

  • Compute Loss: Calculate the error between predicted and actual outputs.

  • Backward Pass: Compute gradients and adjust weights and biases.

  • Parameter Update: Update weights and biases using the computed gradients and the learning rate.

By iterating these steps, the neural network learns and improves its predictions over time.


Let's walk through a detailed numerical example to demonstrate backpropagation in a simple neural network. We'll manually compute the forward pass, calculate the loss, perform backpropagation to find the gradients, and update the weights and biases. This hands-on example will solidify your understanding of how neural networks learn from data.

Neural Network Architecture

We'll use a small neural network suitable for manual calculation:

  • Input Layer: 2 neurons (features)

  • Hidden Layer: 2 neurons

  • Output Layer: 1 neuron

Activation Functions

  • Hidden Layer: Sigmoid function

  • Output Layer: Sigmoid function

Training Data

We'll use one training example for simplicity:

  • Input Feature Vector (x):

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.05 \\ 0.10 \end{bmatrix}

  • Target Output (y):

y = 0.01

Step 1: Initialize Weights and Biases

Let's assign some initial weights and biases.

Weights between Input and Hidden Layer (W^{[1]}), where w_{i,j}^{[1]} connects input i to hidden neuron j:

W^{[1]} = \begin{bmatrix} w_{1,1}^{[1]} & w_{1,2}^{[1]} \\ w_{2,1}^{[1]} & w_{2,2}^{[1]} \end{bmatrix} = \begin{bmatrix} 0.15 & 0.20 \\ 0.25 & 0.30 \end{bmatrix}

Biases for Hidden Layer (b^{[1]}):

b^{[1]} = \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \end{bmatrix} = \begin{bmatrix} 0.35 \\ 0.35 \end{bmatrix}

Weights between Hidden and Output Layer (W^{[2]}):

W^{[2]} = \begin{bmatrix} w_1^{[2]} & w_2^{[2]} \end{bmatrix} = \begin{bmatrix} 0.40 & 0.45 \end{bmatrix}

Bias for Output Layer (b^{[2]}):

b^{[2]} = 0.60

Step 2: Forward Pass

Compute Weighted Sum and Activation for Hidden Layer

a) Weighted Sum (z^{[1]}):

    z₁⁽¹⁾ = w₁,₁⁽¹⁾ * x₁ + w₂,₁⁽¹⁾ * x₂ + b₁⁽¹⁾ 

           = (0.15 * 0.05) + (0.25 * 0.10) + 0.35 

           = 0.0075 + 0.025 + 0.35 = 0.3825


    z₂⁽¹⁾ = w₁,₂⁽¹⁾ * x₁ + w₂,₂⁽¹⁾ * x₂ + b₂⁽¹⁾ 

           = (0.20 * 0.05) + (0.30 * 0.10) + 0.35 

           = 0.010 + 0.030 + 0.35 = 0.3900


b) Activation using Sigmoid Function (a^{[1]}):

\text{Sigmoid}(z) = \frac{1}{1 + e^{-z}}

    a₁⁽¹⁾ = Sigmoid(z₁⁽¹⁾) 

            = 1 / (1 + e⁻⁰·³⁸²⁵) 

            ≈ 1 / (1 + 0.6821) ≈ 0.5945


    a₂⁽¹⁾ = Sigmoid(z₂⁽¹⁾) 

            = 1 / (1 + e⁻⁰·³⁹⁰⁰) 

            ≈ 1 / (1 + 0.6769) ≈ 0.5963


Compute Weighted Sum and Activation for Output Layer

a) Weighted Sum (z^{[2]}):

    z⁽²⁾ = w₁⁽²⁾ * a₁⁽¹⁾ + w₂⁽²⁾ * a₂⁽¹⁾ + b⁽²⁾ 

         = (0.40 * 0.5945) + (0.45 * 0.5963) + 0.60 

         = 0.2378 + 0.2683 + 0.60 

         = 1.1061

b) Activation using Sigmoid Function (\hat{y}):

\hat{y} = \text{Sigmoid}(1.1061) = \frac{1}{1 + e^{-1.1061}} \approx 0.7514

Step 3: Compute Loss

We'll use the Mean Squared Error (MSE) loss function:

\mathcal{L} = \frac{1}{2} (y - \hat{y})^2

\mathcal{L} = \frac{1}{2} (0.01 - 0.7514)^2 = \frac{1}{2} (-0.7414)^2 \approx 0.2748

Step 4: Backpropagation

Compute Error Term for Output Layer

a) Derivative of Loss with respect to Output Activation (\frac{\partial \mathcal{L}}{\partial \hat{y}}):

\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y = 0.7514 - 0.01 = 0.7414

b) Derivative of Activation Function (Sigmoid Derivative):

\text{Sigmoid}'(z^{[2]}) = \hat{y} (1 - \hat{y}) = 0.7514 \times (1 - 0.7514) = 0.7514 \times 0.2486 \approx 0.1868

c) Error Term for Output Neuron (\delta^{[2]}):

\delta^{[2]} = (\hat{y} - y) \times \text{Sigmoid}'(z^{[2]}) = 0.7414 \times 0.1868 \approx 0.1385

Compute Gradients for Weights and Biases between Hidden and Output Layer

a) Gradients with respect to Weights (\frac{\partial \mathcal{L}}{\partial W^{[2]}}):

\frac{\partial \mathcal{L}}{\partial w_1^{[2]}} = \delta^{[2]} \times a_1^{[1]} = 0.1385 \times 0.5945 \approx 0.0823

\frac{\partial \mathcal{L}}{\partial w_2^{[2]}} = \delta^{[2]} \times a_2^{[1]} = 0.1385 \times 0.5963 \approx 0.0826

b) Gradient with respect to Bias (\frac{\partial \mathcal{L}}{\partial b^{[2]}}):

\frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]} = 0.1385

Compute Error Terms for Hidden Layer Neurons

a) Error Term for Hidden Neuron 1 (\delta_1^{[1]}):

\delta_1^{[1]} = (\delta^{[2]} \times w_1^{[2]}) \times \text{Sigmoid}'(z_1^{[1]})

First, compute the Sigmoid derivative at z_1^{[1]}:

\text{Sigmoid}'(z_1^{[1]}) = a_1^{[1]} (1 - a_1^{[1]}) = 0.5945 \times (1 - 0.5945) \approx 0.2411

Now compute \delta_1^{[1]}:

\delta_1^{[1]} = (0.1385 \times 0.40) \times 0.2411 = 0.0554 \times 0.2411 \approx 0.0134

b) Error Term for Hidden Neuron 2 (\delta_2^{[1]}):

\delta_2^{[1]} = (\delta^{[2]} \times w_2^{[2]}) \times \text{Sigmoid}'(z_2^{[1]})

Compute the Sigmoid derivative at z_2^{[1]}:

\text{Sigmoid}'(z_2^{[1]}) = a_2^{[1]} (1 - a_2^{[1]}) = 0.5963 \times (1 - 0.5963) \approx 0.2407

Now compute \delta_2^{[1]}:

\delta_2^{[1]} = (0.1385 \times 0.45) \times 0.2407 = 0.0623 \times 0.2407 \approx 0.0150

Compute Gradients for Weights and Biases between Input and Hidden Layer

a) Gradients with respect to Weights (\frac{\partial \mathcal{L}}{\partial W^{[1]}}):

\frac{\partial \mathcal{L}}{\partial w_{1,1}^{[1]}} = \delta_1^{[1]} \times x_1 = 0.0134 \times 0.05 \approx 0.0007

\frac{\partial \mathcal{L}}{\partial w_{2,1}^{[1]}} = \delta_1^{[1]} \times x_2 = 0.0134 \times 0.10 \approx 0.0013

\frac{\partial \mathcal{L}}{\partial w_{1,2}^{[1]}} = \delta_2^{[1]} \times x_1 = 0.0150 \times 0.05 \approx 0.0008

\frac{\partial \mathcal{L}}{\partial w_{2,2}^{[1]}} = \delta_2^{[1]} \times x_2 = 0.0150 \times 0.10 \approx 0.0015

b) Gradients with respect to Biases (\frac{\partial \mathcal{L}}{\partial b^{[1]}}):

\frac{\partial \mathcal{L}}{\partial b_1^{[1]}} = \delta_1^{[1]} = 0.0134

\frac{\partial \mathcal{L}}{\partial b_2^{[1]}} = \delta_2^{[1]} = 0.0150

Step 5: Update Weights and Biases

We'll use a learning rate (\eta) of 0.5 for this example.

Update Weights and Biases between Hidden and Output Layer

a) Updated Weights (W^{[2]}_{\text{new}}):

        w₁⁽²⁾_new = w₁⁽²⁾ - η * ∂L/∂w₁⁽²⁾ 

                        = 0.40 - 0.5 * 0.0823 

                        = 0.3588


        w₂⁽²⁾_new = w₂⁽²⁾ - η * ∂L/∂w₂⁽²⁾ 

                         = 0.45 - 0.5 * 0.0826 

                         = 0.4087


b) Updated Bias (b^{[2]}_{\text{new}}):

b^{[2]}_{\text{new}} = b^{[2]} - \eta \times \frac{\partial \mathcal{L}}{\partial b^{[2]}} = 0.60 - 0.5 \times 0.1385 = 0.60 - 0.0692 = 0.5308

Update Weights and Biases between Input and Hidden Layer

a) Updated Weights (W^{[1]}_{\text{new}}):

  • w₁,₁⁽¹⁾_new: 0.15 - (0.5 * 0.0007) = 0.15 - 0.00035 = 0.14965
  • w₂,₁⁽¹⁾_new: 0.25 - (0.5 * 0.0013) = 0.25 - 0.00065 = 0.24935
  • w₁,₂⁽¹⁾_new: 0.20 - (0.5 * 0.0008) = 0.20 - 0.0004 = 0.1996
  • w₂,₂⁽¹⁾_new: 0.30 - (0.5 * 0.0015) = 0.30 - 0.00075 = 0.29925

b) Updated Biases (b^{[1]}_{\text{new}}):

  • b₁⁽¹⁾_new: 0.35 - (0.5 * 0.0134) = 0.35 - 0.0067 = 0.3433
  • b₂⁽¹⁾_new: 0.35 - (0.5 * 0.0150) = 0.35 - 0.0075 = 0.3425

Summary of Updated Parameters

  • Updated Weights between Input and Hidden Layer (W^{[1]}_{\text{new}}):

W^{[1]}_{\text{new}} = \begin{bmatrix} 0.14965 & 0.1996 \\ 0.24935 & 0.29925 \end{bmatrix}

  • Updated Biases for Hidden Layer (b^{[1]}_{\text{new}}):

b^{[1]}_{\text{new}} = \begin{bmatrix} 0.3433 \\ 0.3425 \end{bmatrix}

  • Updated Weights between Hidden and Output Layer (W^{[2]}_{\text{new}}):

W^{[2]}_{\text{new}} = \begin{bmatrix} 0.3588 & 0.4087 \end{bmatrix}

  • Updated Bias for Output Layer (b^{[2]}_{\text{new}}):

b^{[2]}_{\text{new}} = 0.5308

Step 6: Evaluate the Effect of Updates

After updating the weights and biases, you can perform another forward pass to see if the loss has decreased.

Note: Due to the small changes and the fact that we've only done one iteration, the change in loss might be small, but over multiple iterations, the loss should decrease, indicating that the network is learning.
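The whole worked example can be checked in a few lines of NumPy. This sketch reproduces the forward pass, the gradients, and the parameter updates above (small discrepancies in the last decimal place are rounding in the hand calculation), and confirms that the loss does decrease after one update:

```python
import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))

x, y = np.array([0.05, 0.10]), 0.01
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])   # w[i, j]: input i -> hidden neuron j
b1 = np.array([0.35, 0.35])
W2, b2 = np.array([0.40, 0.45]), 0.60
eta = 0.5

def forward(W1, b1, W2, b2):
    a1 = sig(x @ W1 + b1)          # hidden activations a1, a2
    y_hat = sig(a1 @ W2 + b2)      # network output
    return a1, y_hat

a1, y_hat = forward(W1, b1, W2, b2)
loss = 0.5 * (y - y_hat) ** 2                  # ~0.2748

# Backpropagation
delta2 = (y_hat - y) * y_hat * (1 - y_hat)     # ~0.1385
delta1 = delta2 * W2 * a1 * (1 - a1)           # hidden error terms ~[0.0134, 0.0150]
W2_new = W2 - eta * delta2 * a1                # ~[0.3588, 0.4087]
b2_new = b2 - eta * delta2                     # ~0.5308
W1_new = W1 - eta * np.outer(x, delta1)        # dL/dw[i, j] = x_i * delta1_j
b1_new = b1 - eta * delta1

_, y_hat_new = forward(W1_new, b1_new, W2_new, b2_new)
print(0.5 * (y - y_hat_new) ** 2 < loss)       # loss decreased after the update
```

Running another forward pass with the updated parameters gives a loss of roughly 0.258, down from about 0.2748.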

Conclusion

By working through this numerical example, we've demonstrated:

  • Forward Pass: How input data is transformed through the network to produce an output.

  • Loss Computation: How the network's output is compared against the target output using a loss function.

  • Backpropagation: How gradients are calculated and propagated backward through the network.

  • Parameter Updates: How weights and biases are adjusted using the gradients and a learning rate.

This iterative process is repeated with new data or multiple epochs over the same data to train the neural network effectively.

Additional Insights

  • Choice of Learning Rate: The learning rate (η=0.5\eta = 0.5) was arbitrarily chosen for this example. In practice, learning rates are often much smaller (e.g., 0.01) to ensure stable convergence.

  • Batch Size: We used a single training example. In real-world scenarios, training is done using mini-batches or the entire dataset.

  • Activation Functions: We used the sigmoid function for both layers, which can lead to vanishing gradient problems in deeper networks. ReLU is often preferred for hidden layers in modern architectures.
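The vanishing-gradient remark can be verified numerically: the sigmoid derivative never exceeds 0.25, so gradients passed through many sigmoid layers shrink multiplicatively, whereas ReLU passes a gradient of 1 for active units. A small sketch:

```python
import numpy as np

z = np.linspace(-6, 6, 1001)
s = 1 / (1 + np.exp(-z))
sig_grad = s * (1 - s)              # peaks at 0.25 when z = 0
relu_grad = (z > 0).astype(float)   # exactly 1 for all active units

print(sig_grad.max())               # 0.25: each sigmoid layer scales the
                                    # backpropagated gradient by at most 1/4
print(0.25 ** 10)                   # after 10 sigmoid layers: ~9.5e-7
```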


The pseudocode/pseudo-algorithm of backpropagation is given as follows:

 

1: Initialise with the input 

 

Forward Propagation

2: For each layer, compute the cumulative input and apply the non-linear activation function on the cumulative input of each neuron of each layer to get the output.

3: For classification, get the probabilities of the observation belonging to a class, and for regression, compute the numeric output.

4: Assess the performance of the neural network through a loss function, for example, a cross-entropy loss function for classification and RMSE for regression.

 

Backpropagation

5: From the last layer to the first layer, for each layer, compute the gradient of the loss function with respect to the weights at each layer and all the intermediate gradients.

6: Once all the gradients of the loss with respect to the weights (and biases) are obtained, use an optimisation technique like gradient descent to update the values of the weights and biases.

 

Repeat this process until the model gives acceptable predictions:

7: Repeat the process for a specified number of iterations or until the predictions made by the model are acceptable. 

 

This is a consolidated algorithm for training a neural network.
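The consolidated algorithm maps to a short training loop. This sketch (the network size, data, learning rate, and stopping threshold are all illustrative assumptions) trains a tiny 2-2-1 sigmoid network on the worked example's data until the predictions are acceptable:

```python
import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))

# Step 1: initialise (tiny 2-2-1 network, one training example)
rng = np.random.default_rng(42)
W1, b1 = rng.normal(0, 0.5, (2, 2)), np.zeros(2)
W2, b2 = rng.normal(0, 0.5, 2), 0.0
x, y, eta = np.array([0.05, 0.10]), 0.01, 0.5

for epoch in range(10_000):                    # step 7: repeat
    # Steps 2-3: forward propagation
    a1 = sig(W1 @ x + b1)
    y_hat = sig(W2 @ a1 + b2)
    # Step 4: assess performance with a loss function (MSE here)
    loss = 0.5 * (y - y_hat) ** 2
    if loss < 1e-4:                            # predictions acceptable: stop
        break
    # Step 5: gradients from the last layer back to the first
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)
    delta1 = delta2 * W2 * a1 * (1 - a1)
    # Step 6: gradient-descent update of weights and biases
    W2 -= eta * delta2 * a1;  b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x); b1 -= eta * delta1
```

In practice the loop runs over many examples (or mini-batches) per epoch rather than a single point, but the structure is the same.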
