Tokenization and Word Embedding

Tokenization and word embedding are two common NLP preprocessing steps that come before anything else. The reason is that AI models can only take numbers as input and produce numbers as output, while NLP deals mostly with language data such as words and sentences. Tokenization and word embedding bridge this gap by transforming natural language into numbers that models can process.

What is Tokenization?

Tokenization is usually fairly straightforward. Given a word sequence, tokenization turns it into a series of numbers, where each unit of the sequence (such as a word or subword) typically corresponds to one number in the output.
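
To make this concrete, here is a minimal sketch of word-level tokenization in Python. The vocabulary and the tokenize function are hypothetical, built just for illustration; real tokenizers (such as BPE or WordPiece) learn subword vocabularies from data rather than using a hand-built word list.

```python
# A minimal sketch of word-level tokenization with a hypothetical
# toy vocabulary; id 0 is reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to its vocabulary id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
```

Note that each word maps to exactly one id here, and any word outside the vocabulary falls back to the unknown token; subword tokenizers avoid that problem by breaking rare words into smaller pieces.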
