
Optimization Algorithms

Need: Training a neural network means optimizing its learning process, and the algorithms that do this are called optimizers. The goal of an optimizer is to minimize the network's loss function, which measures how well the network performs on the training data; the optimizer adjusts the weights and biases of the network to reduce this loss.

Main goal: The main goal of an optimizer is to find the optimal set of weights and biases that minimize the loss function of the neural network.

How it works: The optimizer uses the gradients of the loss function with respect to the weights and biases to update the weights and biases in the direction that minimizes the loss function.
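Concretely, for a weight \(w\), learning rate \(\alpha\), and loss \(L\), a basic gradient step is \( w \leftarrow w - \alpha \frac{\partial L}{\partial w} \); the same rule is applied to every weight and bias, so each update moves the parameters a small distance downhill on the loss surface.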

Most common optimizers:

  • Stochastic Gradient Descent (SGD), including mini-batch and batch variants
  • SGD with Momentum
  • RMSProp
  • Adam (Adaptive Moment Estimation)
  • Adagrad

In practice, the choice of optimizer depends on the specific task and the architecture of the neural network. Some optimizers are better suited for certain tasks or architectures than others.

  • For example, Adam is a popular optimizer that is widely used in deep learning because it is computationally efficient and converges quickly. However, there are cases where other optimizers such as SGD or RMSProp may be more suitable.
  • SGD is more suitable for small datasets and simple models, while Adam is more suitable for large datasets and complex models.
  • Adagrad is a good choice when dealing with sparse data, while RMSProp is suitable for tasks with non-stationary objectives. The sketch after this list shows how the optimizer is selected in code.
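
As a rough illustration of how this choice appears in practice, the sketch below trains the same small model for one step while only the optimizer line changes. It is a minimal sketch assuming PyTorch; the model, dummy data, and hyperparameter values are placeholders for illustration, not recommendations.

```python
import torch
import torch.nn as nn

# A small placeholder model; any nn.Module would do.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# The training loop stays the same; only this line changes per optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)         # RMSProp
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)            # Adam
# optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)          # Adagrad

loss_fn = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)  # dummy batch

optimizer.zero_grad()            # clear gradients from the previous step
loss = loss_fn(model(x), y)      # forward pass and loss
loss.backward()                  # gradients of the loss w.r.t. all parameters
optimizer.step()                 # update parameters using the chosen optimizer's rule
```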

Types of Optimizers

  1. Stochastic Gradient Descent (SGD)

    • SGD is the simplest optimizer. It updates the weights and biases of the neural network by taking small steps in the direction of the negative gradient of the loss function.
    • SGD is computationally efficient and easy to implement. However, it can be slow to converge and can get stuck in local minima.
    • SGD has several variants, such as mini-batch SGD, momentum SGD, and Nesterov accelerated gradient.
    • Plain SGD works well for small datasets and simple models; for large datasets and complex models it typically needs careful learning-rate tuning or momentum, and adaptive optimizers are often preferred. A minimal sketch of the SGD update rule is given after this list.
  2. Adam

    • Adam is a popular optimizer that combines the advantages of both AdaGrad and RMSProp.
    • Adam adapts the learning rate for each parameter by computing the first and second moments of the gradients.
    • Adam is computationally efficient and converges quickly.
    • Adam is suitable for a wide range of deep learning tasks and is the default optimizer for many neural network architectures.
    • Parameters of Adam optimizer:
      • \(\beta_1\) (default value: 0.9): Exponential decay rate for the first moment estimates.
      • \(\beta_2\) (default value: 0.999): Exponential decay rate for the second moment estimates.
      • \(\epsilon\) (default value: \(10^{-8}\)): A small constant to prevent division by zero.
      • \(\alpha\) (default value: 0.001): Learning rate.
    • How Adam works:
      • Adam keeps exponentially decaying averages of past gradients (the first moment) and past squared gradients (the second moment).
      • Both averages are bias-corrected, because they start at zero and would otherwise underestimate the true moments early in training.
      • Each parameter is then updated with a step proportional to the corrected first moment divided by the square root of the corrected second moment, which adapts the effective learning rate per parameter. A minimal sketch of this update is given after this list.
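
As referenced in the SGD item above, here is a minimal NumPy sketch of the plain SGD update rule with optional momentum, applied to a toy quadratic loss; the loss, parameter values, and hyperparameters are placeholders chosen only for illustration.

```python
import numpy as np

# Toy quadratic loss L(w) = ||w - w_true||^2 with gradient 2 * (w - w_true).
w_true = np.array([3.0, -2.0])
def grad(w):
    return 2.0 * (w - w_true)

lr = 0.1          # learning rate
momentum = 0.9    # momentum coefficient (set to 0.0 for plain SGD)

w = np.zeros(2)        # parameters
v = np.zeros_like(w)   # velocity: running combination of past gradients

for step in range(100):
    g = grad(w)                 # gradient of the loss w.r.t. the parameters
    v = momentum * v - lr * g   # accumulate velocity in the descent direction
    w = w + v                   # take a step along the velocity

print(w)  # approaches w_true as the loss is minimized
```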
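
And, as referenced under "How Adam works", here is a minimal NumPy sketch of the Adam update on the same kind of toy problem, using the default \(\beta_1\), \(\beta_2\), \(\epsilon\), and \(\alpha\) values listed above; it is an illustrative sketch of the update rule, not production code.

```python
import numpy as np

# Same toy quadratic loss: L(w) = ||w - w_true||^2, gradient 2 * (w - w_true).
w_true = np.array([3.0, -2.0])
def grad(w):
    return 2.0 * (w - w_true)

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # default Adam hyperparameters

w = np.zeros(2)
m = np.zeros_like(w)   # first moment estimate (decaying mean of gradients)
v = np.zeros_like(w)   # second moment estimate (decaying mean of squared gradients)

for t in range(1, 5001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g         # update biased first moment
    v = beta2 * v + (1 - beta2) * g**2      # update biased second moment
    m_hat = m / (1 - beta1**t)              # bias-correct the first moment
    v_hat = v / (1 - beta2**t)              # bias-correct the second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step

print(w)  # converges toward w_true
```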