
Contents (Auto-generated)

Top NLP papers

Llama

Llama 2

Normalization Layer: RMSNorm

Transformer vs Llama

Comparison of Llama 1, Llama 2, and Original Transformer Architectures

Mistral

Transformer vs Mistral vs Llama

Mistral Architecture

Sliding Window Attention (SWA) vs Self-Attention

KV Cache

Mixture of Experts

Attention Mechanisms and Its Variants

Introduction to Attention Mechanisms

MHA: Multi-Head Attention

MQA: Multi-Query Attention

GQA: Grouped Query Attention

Sliding Window Attention

Other:

Reference:

KV Cache

Position Encoding Methods in Transformers

Why do we need Position Encodings in Transformers?

Absolute Positional Embeddings

Types of Position Encodings

Rotary Positional Embeddings

Transformer Toolkit

Transformer based architectures

Transformer: Optimizer and Regularization

Optimizer

Regularization

Example of a GAN in PyTorch

Generative Adversarial Networks

(To Do)

Fine Tuning

Parameter-Efficient Finetuning (PEFT) Methods

Reparametrization-based parameter-efficient finetuning methods leverage low-rank representations to minimize the number of trainable parameters. The notion that neural networks have low-dimensional representations has been widely explored in both empirical and theoretical analyses of deep learning.

LoRA: Low-Rank Adaptation
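To make the low-rank reparametrization concrete, below is a minimal PyTorch sketch of a LoRA-style linear layer: the pretrained weight is frozen and only a rank-r update B·A is trained. The class name `LoRALinear`, the rank `r=8`, and the `alpha / r` scaling are illustrative assumptions, not a definitive implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-reparametrized linear layer (illustrative, not the reference implementation).

    The frozen base weight W is augmented with a low-rank update B @ A,
    where r << min(d_in, d_out); only A and B are trained.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))         # up-projection, zero-init so training starts from W
        self.scale = alpha / r

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap an existing projection; only the ~r*(d_in + d_out) LoRA parameters are trainable.
layer = LoRALinear(nn.Linear(768, 768), r=8)
y = layer(torch.randn(2, 768))
```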


Coding Notebook

Low-Level Neural Network Implementation

Perceptron Implementation

Optimizer

Activation Functions

Sigmoid

Softmax

Tanh

ReLU (Rectified Linear Unit)

Leaky ReLU

Parametric ReLU (PReLU)

GELU (Gaussian Error Linear Unit)

Swish function (\(Swish_{\beta}\))

SiLU - Sigmoid Linear Unit (\(Swish_{1}\))

GLU (Gated Linear Unit)

SwiGLU Activation Function (Llama-2)

Comparison of Activation Functions

Other:

Monotonicity:

References

Evaluation metrics

Normalization

Basics

Normalization in Deep Learning

Optimization Algorithms

Types of Optimizers

Further Reading

Reading Material

LLMOps

ML Concepts