Transformer Encoder
The Transformer encoder is widely used in NLP tasks. It is the backbone of models such as BERT, RoBERTa, and DistilBERT, and of the encoder part of T5.
The overall Transformer encoder is made up of N identical layers, where each layer has two main sub-layers:
- Multi-Head Self-Attention
- Feed-Forward Neural Network (FFN)

Each sub-layer is wrapped in a residual connection followed by layer normalization, as in the sketch below.
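A minimal sketch of that "Add & Norm" wrapping in PyTorch; the helper name `add_norm` is my own, not from the original text:

```python
import torch
import torch.nn as nn

def add_norm(x: torch.Tensor, sublayer_out: torch.Tensor, norm: nn.LayerNorm) -> torch.Tensor:
    # Residual connection (x + sub-layer output) followed by layer normalization.
    return norm(x + sublayer_out)
```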
Input Embeddings
→ [Positional Encoding added]
→ [Encoder Layer 1]
→ [Encoder Layer 2]
→ ...
→ [Encoder Layer N]
→ Final Encoder Output
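Putting the stack together, here is a hedged sketch using PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`; the class name `EncoderStack` and the hyper-parameter values (d_model=512, 6 layers, 8 heads) are illustrative assumptions, not values from the text:

```python
import math
import torch
import torch.nn as nn

class EncoderStack(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Precompute fixed sinusoidal positional encodings of shape (max_len, d_model).
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # N identical encoder layers (each: multi-head self-attention + FFN, with Add & Norm).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids):                                    # token_ids: (batch, seq_len)
        x = self.embed(token_ids) + self.pe[: token_ids.size(1)]     # embeddings + positional encoding
        return self.encoder(x)                                       # final output: (batch, seq_len, d_model)

# Example: encode a batch of 2 sequences of 10 token ids (vocab size is illustrative).
out = EncoderStack(vocab_size=30522)(torch.randint(0, 30522, (2, 10)))
print(out.shape)  # torch.Size([2, 10, 512])
```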
Inside a Single Encoder Layer
Input
→ [Multi-Head Self-Attention + Add & Norm]
→ [Feed-Forward Network + Add & Norm]
→ Output
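A from-scratch sketch of one encoder layer matching this flow (post-layer-norm variant, as in the original Transformer); the class name `EncoderLayer` and the hyper-parameter defaults are my own assumptions:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):            # x: (batch, seq_len, d_model)
        # Sub-layer 1: multi-head self-attention, then Add & Norm.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward network, then Add & Norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```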