Normalization

Basics

Why do we need Data Normalization in Machine Learning?

There are several reasons why data normalization is needed:

  • It guarantees that every feature contributes equally throughout the learning process, preventing larger-magnitude features from overshadowing others.
  • It enables faster convergence of optimization algorithms, especially those that rely on gradient descent.
  • It improves the performance of distance-based algorithms such as k-Nearest Neighbors.
  • It addresses model sensitivity problems in algorithms such as Support Vector Machines and Neural Networks, improving overall performance.
  • It supports regularization techniques such as L1 and L2 regularization, which assume uniform feature scales.
  • In general, normalization is necessary when working with attributes that have different scales; otherwise, an attribute that is equally important but on a lower scale can be diluted by attributes with values on a larger scale.

Normalization vs. Standardization:

|  | Normalization | Standardization |
| --- | --- | --- |
| Scales the values | To a specific range, often between 0 and 1 | To have a mean of 0 and a standard deviation of 1 |
| Applicability / effectiveness | When the feature distribution is uncertain | When the data distribution is Gaussian |
| Influence of outliers | Susceptible to the influence of outliers | Less affected by the presence of outliers |
| Shape of the original distribution | Maintains the shape of the original distribution | Alters the shape of the original distribution |
| Range of scaled values | Scales values to ranges like [0, 1] | Scaled values are not constrained to a specific range |

Types of Normalization Techniques:

| Normalization Technique | Formula | When to Use |
| --- | --- | --- |
| Linear Scaling | \(X' = \frac{X - X_{min}}{X_{max} - X_{min}}\) | When the feature is more-or-less uniformly distributed across a fixed range. |
| Clipping | If \(X > max\), then \(X' = max\); if \(X < min\), then \(X' = min\). | When the feature contains some extreme outliers. |
| Log Scaling | \(X' = \log(X)\) | When the feature conforms to a power law. |
| Z-score | \(X' = \frac{X - \mu}{\sigma}\) | When the feature distribution does not contain extreme outliers. |

Linear Scaling / Min-Max Normalization:

\(X' = \frac{X - X_{min}}{X_{max} - X_{min}}\)

  • Also known as scaling to a range.
  • This method rescales features from their natural range (for example, 100 to 900) into a fixed standard range, usually 0 to 1 (or sometimes -1 to +1).
  • Scaling to a range is a good choice when both of the following conditions are met (see the sketch after this list):
    • You know the approximate upper and lower bounds on your data, with few or no outliers.
    • Your data is approximately uniformly distributed across that range.
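
A minimal min-max scaling sketch with NumPy; the feature array `X` is made up for illustration:

```python
import numpy as np

X = np.array([100.0, 250.0, 400.0, 900.0])  # hypothetical feature in its natural range

# Rescale into [0, 1]: the minimum maps to 0 and the maximum maps to 1.
X_scaled = (X - X.min()) / (X.max() - X.min())
print(X_scaled)  # [0.     0.1875 0.375  1.    ]
```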

Log Scaling:

\(X' = \log(X)\)

  • This method transforms the features using the natural logarithm.
  • Log scaling computes the log of your values to compress a wide range into a narrow range.
  • It is helpful when a handful of your values have many points while most other values have few points; this kind of data distribution is known as a power law distribution.
  • Movie ratings are a good example: most movies have very few ratings (the data in the tail), while a few have lots of ratings (the data in the head).
  • Log scaling changes the distribution, which helps improve linear model performance (see the sketch below).
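
A minimal log-scaling sketch with NumPy; the skewed `ratings_count` array is invented to mimic a power-law distribution:

```python
import numpy as np

ratings_count = np.array([3, 5, 8, 40, 200, 15000])  # most values small, a few huge

# Natural log compresses the wide range; use np.log1p instead if zeros are possible.
log_scaled = np.log(ratings_count)
print(log_scaled.round(2))  # [1.1  1.61 2.08 3.69 5.3  9.62]
```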

Z-score Normalization (Standardization) / Zero-mean Standardization:

This method standardizes features by removing the mean and scaling to unit variance.

\(X' = \frac{X - \mu}{\sigma}\)

(where \(\mu\) is the mean of the feature vector and \(\sigma\) is the standard deviation of the feature vector)
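
A minimal z-score sketch with NumPy on a hypothetical feature array:

```python
import numpy as np

X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Subtract the mean and divide by the (population) standard deviation.
X_std = (X - X.mean()) / X.std()
print(X_std.mean(), X_std.std())  # ~0.0 and ~1.0 after standardization
```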

The choice of normalization method is determined by the data and context.

- Min-Max Scaling (MinMaxScaler) is good when a specific output range must be preserved, whereas
- Z-Score Normalization (StandardScaler) is good when the data should be centered to zero mean and unit standard deviation.

The best method depends on the machine learning task’s specific requirements.
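
A side-by-side sketch of scikit-learn's MinMaxScaler and StandardScaler on a toy single-feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one feature with an outlier

print(MinMaxScaler().fit_transform(X).ravel())   # squeezed into [0, 1]; the outlier dominates
print(StandardScaler().fit_transform(X).ravel()) # zero mean, unit variance
```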

Decimal Scaling:

This method moves the decimal point of the feature's values.

\(X' = \frac{X}{10^j}\)

(where \(j\) is the smallest integer such that \(\max(|X'|) < 1\))
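
A minimal decimal-scaling sketch in NumPy; \(j\) is derived from the largest absolute value in a hypothetical array:

```python
import numpy as np

X = np.array([120.0, -476.0, 890.0])

# Smallest j with max(|X|) / 10**j < 1; here max |X| = 890, so j = 3.
# (For an exact power of ten, np.ceil alone gives equality, so add 1 to j in that case.)
j = int(np.ceil(np.log10(np.abs(X).max())))
X_scaled = X / 10**j
print(j, X_scaled)  # 3 [ 0.12  -0.476  0.89 ]
```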

Mean Normalization:

\(X' = X - \mu\)

(where \(\mu\) is the mean of the feature vector)

  • This method centers the features by subtracting the mean of the feature vector, so the transformed feature has zero mean (see the sketch below).
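
A minimal mean-centering sketch in NumPy, following the \(X' = X - \mu\) definition above:

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])

X_centered = X - X.mean()             # subtract the feature mean
print(X_centered, X_centered.mean())  # [-3. -1.  1.  3.] 0.0
```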

Unit Vector Normalization:

\(X' = \frac{X}{||X||}\)

(where \(||X||\) is the magnitude of the feature vector)

  • This method scales the feature vector to unit length (see the sketch below).
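
A minimal unit-vector (L2) normalization sketch in NumPy:

```python
import numpy as np

X = np.array([3.0, 4.0])

X_unit = X / np.linalg.norm(X)         # divide by the Euclidean magnitude (here 5.0)
print(X_unit, np.linalg.norm(X_unit))  # [0.6 0.8] 1.0
```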

RMS Normalization:

\(X' = \frac{X}{\sqrt{\frac{1}{n} \sum X^2}}\)

(where n is the number of elements in the feature vector)

  • This method normalizes the features using the root mean square (RMS) value of the feature vector.
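
A minimal RMS-normalization sketch in NumPy:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])

rms = np.sqrt(np.mean(X**2))                      # root mean square of the elements
X_rms = X / rms
print(round(rms, 4), np.sqrt(np.mean(X_rms**2)))  # ~2.7386 and ~1.0
```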

Normalization in Deep Learning

Batch Normalization:

  • Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch.
  • It normalizes the activations by adjusting and scaling them, using statistics computed across the batch dimension.
  • This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.
  • If previous layers have been trained with batch normalization, then a new layer can be trained faster and more effectively.

Equation for Batch Normalization:

\(X' = \gamma \frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\)

  • where \(\gamma\) and \(\beta\) are learnable parameters, \(\mu\) and \(\sigma^2\) are the per-feature mean and variance computed over the mini-batch, and \(\epsilon\) is a small constant to prevent division by zero.
  • The \(\gamma\) parameter scales the normalized value, and the \(\beta\) parameter shifts the normalized value.
  • The \(\gamma\) and \(\beta\) parameters are learned during training.
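
A minimal batch-norm forward-pass sketch in NumPy: each feature is normalized with statistics taken over the mini-batch, then scaled by \(\gamma\) and shifted by \(\beta\). The batch shape and parameter values are illustrative:

```python
import numpy as np

X = np.random.randn(32, 4) * 5 + 10          # mini-batch: 32 samples, 4 features
gamma, beta, eps = np.ones(4), np.zeros(4), 1e-5

mu = X.mean(axis=0)                          # per-feature mean over the batch
var = X.var(axis=0)                          # per-feature variance over the batch
X_bn = gamma * (X - mu) / np.sqrt(var + eps) + beta
print(X_bn.mean(axis=0).round(6), X_bn.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```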

Layer Normalization:

Equation for Layer Normalization:

\(X' = \gamma \frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\)

  • This method normalizes the features by adjusting and scaling the activations in a deep learning network.
  • Layer normalization is a technique for training very deep neural networks that standardizes the inputs to a layer across the features of each individual sample, independently of the other samples in the mini-batch.
  • This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.
  • If previous layers have been trained with layer normalization, then a new layer can be trained faster and more effectively.
  • Layer normalization is similar to batch normalization, but it normalizes across the features instead of across the batch.
  • Layer normalization is useful when the batch size is small or when batch normalization is not effective.
  • Layer normalization is also useful when the input to a layer is a sequence of data, such as a sequence of words in a sentence or a sequence of frames in a video (see the sketch below).
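
A minimal layer-norm forward-pass sketch in NumPy: statistics are computed across the features of each sample (axis=1), so it works even with a batch of one. Shapes and values are illustrative:

```python
import numpy as np

X = np.random.randn(2, 4) * 5 + 10           # 2 samples, 4 features
gamma, beta, eps = np.ones(4), np.zeros(4), 1e-5

mu = X.mean(axis=1, keepdims=True)           # per-sample mean over the features
var = X.var(axis=1, keepdims=True)           # per-sample variance over the features
X_ln = gamma * (X - mu) / np.sqrt(var + eps) + beta
print(X_ln.mean(axis=1).round(6), X_ln.std(axis=1).round(3))  # ~0 mean, ~1 std per sample
```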