The Transformer, with its parallel processing capabilities, allowed for more efficient and scalable models, making it easier to train them on large datasets. It also demonstrated superior performance on several NLP tasks, such as sentiment analysis and text generation. The Transformer has since become the foundation for many state-of-the-art NLP models, such as BERT, GPT-2, and T5.
The Transformer architecture consists of two main components:
Encoder - The encoder receives an input and builds an embedding of its features; in doing so, the model learns the associations between the words of a sequence.
Decoder - The decoder uses the encoder's output embeddings along with other inputs to generate a target sequence.
Whenever we train a model on a dataset, we convert the data, explicitly or implicitly, into a representation the model can interpret, and later convert the model's output back into a representation we understand. The function of the Input Embedding block in the Transformer architecture is exactly that. In the original paper, the authors used an embedding dimension of 512. To prevent the input embeddings from being extremely small, we scale them by multiplying by the square root of the embedding dimension.
import math

import torch
import torch.nn as nn


class InputEmbedding(nn.Module):
    def __init__(self, embedding_dim, vocab_size):
        super().__init__()
        self.embedding_dim = embedding_dim  # Referred to in the paper as d_model (size == 512)
        self.vocab_size = vocab_size  # Size of the vocabulary of the input
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, x):
        # Scaling by sqrt(embedding_dim) helps prevent the size of the input embedding being diminished
        return self.embedding(x) * math.sqrt(self.embedding_dim)
The input is now converted into input embeddings of dimension 512, but unless we provide a signal about the relative or absolute position of each token in the sequence, the model cannot learn the corresponding associations. To get around that problem, the authors add a positional encoding to each token, computed with a sine formula for even indices and a cosine formula for odd indices. These encodings are computed only once and, in the paper, are not learned by the model.
"""
Positional Encoding module for Transformer models.
Args:
embedding_dim (int): The dimension of the input embeddings.
sequence_len (int): The length of the input sequence.
dropout (float): The dropout probability.
"""
=
=
=
# Creating a matrix of size (sequence_len,embedding_dim)
=
# Create a vector of shape (sequence_len,1)
=
= *
# Apply the sin formula to the even positions and cosine formula to the odd positions
= # Every two Terms even -> 0 -> 2 -> 4
= # Every two Terms odd -> 1 -> 3 -> 5
=
= +
return
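To make the even/odd sine-cosine split concrete, here is a minimal pure-Python sketch of the same formulas (illustrative only; the helper name `positional_encoding` and the tiny sizes are arbitrary, and `embedding_dim` is assumed to be even):

```python
import math

def positional_encoding(sequence_len, embedding_dim):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pe = [[0.0] * embedding_dim for _ in range(sequence_len)]
    for pos in range(sequence_len):
        for i in range(0, embedding_dim, 2):
            angle = pos / (10000 ** (i / embedding_dim))
            pe[pos][i] = math.sin(angle)      # even index -> sine
            pe[pos][i + 1] = math.cos(angle)  # odd index  -> cosine
    return pe

pe = positional_encoding(4, 8)
# Position 0 encodes as sin(0) = 0 at even indices and cos(0) = 1 at odd indices
```

Note that each row depends only on the position and the dimension index, which is why the table can be computed once up front.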
These are the Add & Norm layers in the architecture; they add the sublayer output back to its input (a residual connection) and normalize the result with layer normalization. The LayerNormalization block is already implemented in PyTorch as nn.LayerNorm, so we simply wrap it:
"""
Applies layer normalization to the input tensor.
Args:
eps (float, optional): A value added to the denominator for numerical stability. Default is 1e-5.
"""
=
"""
Applies layer normalization to the input tensor.
Args:
x (torch.Tensor): The input tensor.
Returns:
torch.Tensor: The normalized tensor.
"""
return
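For intuition, what nn.LayerNorm computes for each position can be sketched in plain Python (a simplified version that omits the learnable scale and shift parameters nn.LayerNorm also applies):

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and (approximately) unit variance
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
# The result is centered around zero and preserves the ordering of the inputs
```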
A really simple MLP consisting of two linear layers with a ReLU activation between them, plus a dropout layer to help prevent overfitting.
"""
A feed-forward block in the Transformer model.
Args:
embedding_dim (int): The dimensionality of the input embeddings.
feed_forward_dim (int): The dimensionality of the hidden layer in the feed-forward network.
dropout (float): The dropout probability.
Attributes:
linear_1 (nn.Linear): The first linear layer.
dropout (nn.Dropout): The dropout layer.
linear_2 (nn.Linear): The second linear layer.
"""
=
=
=
"""
Forward pass of the feed-forward block.
Args:
x (torch.Tensor): The input tensor.
Returns:
torch.Tensor: The output tensor.
"""
=
=
=
=
return
The Multi-Head Attention block receives the input data split into queries, keys, and values organized into matrices 𝑄, 𝐾, and 𝑉. Each matrix contains different facets of the input, and they have the same dimensions as the input. We then linearly transform each matrix by its respective weight matrix 𝑊^Q, 𝑊^K, or 𝑊^V. These transformations result in new matrices 𝑄′, 𝐾′, and 𝑉′, which are split into smaller matrices corresponding to the different heads ℎ, allowing the model to attend to information from different representation subspaces in parallel. This split creates multiple sets of queries, keys, and values for each head. Finally, we concatenate every head into a matrix 𝐻, which is then transformed by another weight matrix 𝑊^O to produce the multi-head attention output, a matrix that retains the input dimensionality.
"""
Initializes the MultiHeadAttention module.
Args:
embedding_dim (int): The input and output dimension of the model.
num_heads (int): The number of attention heads.
Raises:
AssertionError: If embedding_dim is not divisible by num_heads.
"""
assert % == 0,
=
=
= //
=
=
=
=
"""
Performs scaled dot product attention.
Args:
Q (torch.Tensor): The query tensor of shape (batch_size, seq_length, embedding_dim).
K (torch.Tensor): The key tensor of shape (batch_size, seq_length, embedding_dim).
V (torch.Tensor): The value tensor of shape (batch_size, seq_length, embedding_dim).
mask (torch.Tensor, optional): The attention mask tensor of shape (batch_size, seq_length, seq_length).
Returns:
torch.Tensor: The output tensor of shape (batch_size, seq_length, embedding_dim).
"""
= /
=
=
=
return
"""
Splits the input tensor into multiple heads.
Args:
x (torch.Tensor): The input tensor of shape (batch_size, seq_length, embedding_dim).
Returns:
torch.Tensor: The tensor with shape (batch_size, num_heads, seq_length, d_k).
"""
=
, , return
"""
Combines the heads of the input tensor.
Args:
x (torch.Tensor): The input tensor of shape (batch_size, num_heads, seq_length, d_k).
Returns:
torch.Tensor: The tensor with shape (batch_size, seq_length, embedding_dim).
"""
=
, , , return
"""
Performs forward pass of the MultiHeadAttention module.
Args:
Q (torch.Tensor): The query tensor of shape (batch_size, seq_length, embedding_dim).
K (torch.Tensor): The key tensor of shape (batch_size, seq_length, embedding_dim).
V (torch.Tensor): The value tensor of shape (batch_size, seq_length, embedding_dim).
mask (torch.Tensor, optional): The attention mask tensor of shape (batch_size, seq_length, seq_length).
Returns:
torch.Tensor: The output tensor of shape (batch_size, seq_length, embedding_dim).
"""
=
=
=
=
=
return
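The scaled dot-product at the heart of this block can also be sketched without any framework. Here is a toy single-batch example in plain Python (illustrative only; the helper names `softmax` and `attention` are hypothetical):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # out = softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time
    d_k = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# With a zero query, all keys score equally, so the output is the mean of the values
out = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
# out[0] is [2.0, 3.0]
```

Each output row is a convex combination of the value rows, with the weights determined by query-key similarity.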
An encoder layer consists of a Multi-Head Attention layer, a Position-wise Feed-Forward layer, and two Layer Normalization layers. The EncoderLayer class initializes these components, including a MultiHeadAttention module, a PositionWiseFeedForward module, two layer normalization modules, and a dropout layer. The forward method computes the encoder layer output by applying self-attention, adding the attention output to the input tensor, and normalizing the result. Then, it computes the position-wise feed-forward output, combines it with the normalized self-attention output, and normalizes the final result before returning the processed tensor.
"""
Initializes an EncoderLayer module.
Args:
embedding_dim (int): The dimensionality of the input and output feature vectors.
num_heads (int): The number of attention heads.
feed_forward_dim (int): The dimensionality of the feed-forward layer.
dropout (float): The dropout probability.
"""
=
=
=
=
=
"""
Performs a forward pass of the EncoderLayer module.
Args:
x (torch.Tensor): The input tensor of shape (batch_size, seq_len, d_model).
mask (torch.Tensor): The attention mask tensor of shape (batch_size, seq_len, seq_len).
Returns:
torch.Tensor: The output tensor of shape (batch_size, seq_len, d_model).
"""
=
=
=
=
return
After the encoder, the keys and values are taken from the encoder's output, while the query comes from the output embedding in the decoder. The decoder layer consists of two Multi-Head Attention layers (masked self-attention and cross-attention), a Position-wise Feed-Forward layer, and three Layer Normalization layers. The forward method applies each of these sublayers in turn, with a residual connection and layer normalization after each step:
"""
Initializes a DecoderLayer module.
Args:
embedding_dim (int): The dimension of the input embeddings.
num_heads (int): The number of attention heads.
feed_forward_dim (int): The dimension of the feed-forward layer.
dropout (float): The dropout probability.
"""
=
=
=
=
=
=
=
"""
Performs a forward pass of the DecoderLayer module.
Args:
x (torch.Tensor): The input tensor.
enc_output (torch.Tensor): The output of the encoder.
src_mask (torch.Tensor): The mask for the source sequence.
tgt_mask (torch.Tensor): The mask for the target sequence.
Returns:
torch.Tensor: The output tensor.
"""
=
=
=
=
=
=
return
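The target-side self-attention must not look at future tokens. The no-peek (causal) mask it relies on is just a lower-triangular matrix, sketched here in plain Python (a minimal illustration; a PyTorch implementation would typically build it with torch.triu):

```python
def no_peek_mask(seq_len):
    # Position i may attend only to positions j <= i
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = no_peek_mask(3)
# mask is [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

Positions marked 0 receive a large negative attention score, so their softmax weight becomes effectively zero.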
Now that we have built every layer, let's combine them to build the Transformer.
"""
Initializes the Transformer model.
Args:
- src_vocab_size (int): The size of the source vocabulary.
- tgt_vocab_size (int): The size of the target vocabulary.
- embedding_dim (int): The dimension of the word embeddings.
- num_heads (int): The number of attention heads.
- num_layers (int): The number of encoder and decoder layers.
- feed_forward_dim (int): The dimension of the feed-forward layer.
- max_seq_length (int): The maximum sequence length.
- dropout (float): The dropout rate.
"""
=
=
=
=
=
=
=
"""
Generates masks for the source and target sequences.
Args:
- src (Tensor): The source sequence tensor.
- tgt (Tensor): The target sequence tensor.
Returns:
- src_mask (Tensor): The source mask tensor.
- tgt_mask (Tensor): The target mask tensor.
"""
=
=
=
=
= &
return ,
"""
Performs forward pass of the Transformer model.
Args:
- src (Tensor): The source sequence tensor.
- tgt (Tensor): The target sequence tensor.
Returns:
- output (Tensor): The output tensor.
"""
=
, =
=
=
=
=
=
=
return
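As a rough sanity check on the sizes involved, here is simple arithmetic under the paper's base configuration (embedding_dim = 512, feed_forward_dim = 2048; the per-block counts below include bias terms and, notably, are independent of the number of heads, since splitting into heads only reshapes the projected tensors):

```python
embedding_dim = 512       # d_model in the paper
feed_forward_dim = 2048   # hidden size of the feed-forward block in the paper

# One multi-head attention block: four (d_model x d_model) projections
# W_q, W_k, W_v, W_o, each with a bias vector
attn_params = 4 * (embedding_dim * embedding_dim + embedding_dim)

# One feed-forward block: two linear layers (512 -> 2048 -> 512), each with a bias
ffn_params = (embedding_dim * feed_forward_dim + feed_forward_dim) \
           + (feed_forward_dim * embedding_dim + embedding_dim)

# attn_params is 1,050,624 and ffn_params is 2,099,712
```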