Encoder-decoder vs transformer

 
In the encoder-decoder architecture, a single context vector is shared between the two models. This acts as an information bottleneck, because all information must pass through this one point.
 
The attention mechanism provided a solution to the bottleneck issue: it offered another route for information to pass through, without overwhelming the process, because it focuses only on the most relevant information. By passing the hidden state from every encoder timestep (the annotation vectors) into the attention mechanism, the information bottleneck is removed and information is retained far better across longer sequences.
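As a rough sketch of that idea (not the exact formulation from any particular paper), the context vector for one decoding step can be computed as a softmax-weighted sum of the encoder hidden states; the dot-product scoring and all sizes here are purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Weighted sum of encoder hidden states (annotation vectors).

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (T, d) one hidden state per input timestep
    """
    scores = encoder_states @ decoder_state   # (T,) alignment scores
    weights = softmax(scores)                 # attention distribution over timesteps
    context = weights @ encoder_states        # (d,) context vector for this decoding step
    return context, weights

# toy example: 5 input timesteps, hidden size 8
enc = np.random.randn(5, 8)
dec = np.random.randn(8)
ctx, w = attention_context(dec, enc)
```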
 
 
Transformers outperform RNNs because of:
- Positional encoding (takes the position of each word into account)
- Self-attention (relates each word to every other word)
- Multi-head attention (several attention mechanisms working in parallel)
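As a sketch of the first of these three, here is the sinusoidal positional encoding from the original Transformer paper; self-attention itself is sketched at the end of these notes, and the dimensions below are just illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=64)   # added to the token embeddings
```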
 
Pretrained models - BERT spawned a whole host of further models and derivations, such as DistilBERT, RoBERTa, and ALBERT, covering tasks such as classification, Q&A, POS tagging, and more.
 
Problem - Transformers work on word/token-level embeddings, not sentence-level embeddings.
 
The BERT cross-encoder architecture consists of a BERT model which consumes sentences A and B. Both are processed in the same sequence, separated by a [SEP] token. All of this is followed by a feedforward NN classifier that outputs a similarity score.
 
The cross-encoder network does produce very accurate similarity scores (better than SBERT), but it’s not scalable. If we wanted to perform a similarity search through a small 100K sentence dataset, we would need to complete the cross-encoder inference computation 100K times.
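A sketch of what that pairwise scoring looks like with the sentence-transformers CrossEncoder class, assuming the library is installed; the checkpoint name is only an example:

```python
from sentence_transformers import CrossEncoder

# example checkpoint; any cross-encoder trained for semantic similarity works
model = CrossEncoder("cross-encoder/stsb-roberta-base")

query = "How do I reset my password?"
candidates = ["Steps to change your account password",
              "Opening hours for the main office",
              "Password recovery instructions"]

# one full BERT forward pass per (query, candidate) pair -> O(N) inference per query
scores = model.predict([(query, c) for c in candidates])
```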
 
Two approaches
💡
Ideally, we need to pre-compute sentence vectors that can be stored and then used whenever required. If these vector representations are good, all we need to do is calculate the cosine similarity between them. With the original BERT (and other transformers), we can build a sentence embedding in one of two ways: by averaging the values across all token embeddings output by BERT (if we input 512 tokens, we output 512 embeddings), or by using the output of the first [CLS] token (a BERT-specific token whose output embedding is used in classification tasks).
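A sketch of both pooling options with the Hugging Face transformers library (assumes torch is available; the checkpoint name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat sat on the mat.", "A feline rested on the rug."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state       # (batch, seq_len, 768)

# option 1: mean pooling (ignore padding tokens via the attention mask)
mask = inputs["attention_mask"].unsqueeze(-1).float()           # (batch, seq_len, 1)
mean_pooled = (token_embeddings * mask).sum(1) / mask.sum(1)    # (batch, 768)

# option 2: the [CLS] token is always the first position
cls_embedding = token_embeddings[:, 0]                          # (batch, 768)
```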
 
Enter SBERT and the sentence-transformers library.
 
Reimers and Gurevych demonstrated the dramatic speed increase in 2019. Finding the most similar sentence pair from 10K sentences took 65 hours with BERT. With SBERT, embeddings are created in ~5 seconds and compared with cosine similarity in ~0.01 seconds.
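A sketch of that workflow with the sentence-transformers library; the checkpoint name is illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example SBERT-style checkpoint

sentences = ["The cat sat on the mat.",
             "A feline rested on the rug.",
             "The stock market fell sharply today."]

# embeddings are computed once and can be stored / indexed
embeddings = model.encode(sentences, convert_to_tensor=True)

# pairwise cosine similarity between all sentence vectors
cosine_scores = util.cos_sim(embeddings, embeddings)
print(cosine_scores)
```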
 
 

Residual Connections

Residual connections are a vital part of the transformer architecture. They allow the model to preserve information from earlier layers and help in mitigating the vanishing gradient problem. In transformers, each sub-layer (multi-head self-attention and position-wise feed-forward networks) has a residual connection followed by a layer normalization step. This means that the output of each sub-layer is added to its input, and this sum is then normalized before being fed to the next sub-layer.
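A minimal PyTorch sketch of that Add & Norm pattern; the module name and sizes are made up for illustration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps any sub-layer (self-attention or feed-forward) with
    a residual connection followed by layer normalization."""
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))   # Add & Norm

# toy usage: wrap a position-wise feed-forward network
d_model = 64
ffn = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, d_model))
block = ResidualBlock(ffn, d_model)
out = block(torch.randn(2, 10, d_model))          # (batch, seq_len, d_model)
```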
 
 
Encoder: The encoder takes the input sequence and processes it to generate a continuous representation. This continuous representation preserves the contextual information of the input and can be effectively used by the decoder for generating the target sequence.
 
The encoder’s output serves as the keys and values, whereas the decoder’s hidden states act as the queries. This setup enables the decoder to align itself to different parts of the input sequence when generating the output, leading to improved translation and sequence generation.
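A sketch of this encoder-decoder (cross-)attention using torch.nn.MultiheadAttention, with decoder states as queries and the encoder output as keys and values; all shapes are illustrative:

```python
import torch
import torch.nn as nn

d_model, num_heads = 64, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(2, 12, d_model)   # (batch, source_len, d_model)
decoder_states = torch.randn(2, 7, d_model)    # (batch, target_len, d_model)

# queries come from the decoder, keys and values from the encoder
context, attn_weights = cross_attn(query=decoder_states,
                                   key=encoder_output,
                                   value=encoder_output)
# context:      (2, 7, 64) - one attended vector per target position
# attn_weights: (2, 7, 12) - how much each target position looks at each source position
```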
 
The queries are compared against the keys to score how relevant each input element is, while the values carry the information that gets aggregated. The softmax function is applied to the computed attention scores to form a probability distribution, emphasizing the most relevant elements in the sequence.
 
RNN
At each timestep t, the hidden state is updated as h_t = f(U·x_t + W·h_{t-1}) and the output is o_t = g(V·h_t), where x, h, and o are the input sequence, hidden state, and output sequence, respectively, and U, V, and W are the trained weights.
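A minimal NumPy sketch of that recurrence using the same U, V, W naming; the tanh nonlinearity and toy sizes are assumptions for illustration:

```python
import numpy as np

def rnn_forward(x_seq, U, W, V):
    """Vanilla RNN: h_t = tanh(U @ x_t + W @ h_{t-1}), o_t = V @ h_t."""
    hidden_size = W.shape[0]
    h = np.zeros(hidden_size)            # initial hidden state
    outputs = []
    for x_t in x_seq:                    # one step per input timestep
        h = np.tanh(U @ x_t + W @ h)     # hidden state carries information forward
        outputs.append(V @ h)            # output at this timestep
    return np.stack(outputs), h

# toy sizes: input dim 3, hidden dim 5, output dim 2, sequence length 4
U = np.random.randn(5, 3); W = np.random.randn(5, 5); V = np.random.randn(2, 5)
x_seq = np.random.randn(4, 3)
o_seq, h_final = rnn_forward(x_seq, U, W, V)
```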
 
Sequence-to-sequence (seq2seq), also known as the encoder-decoder architecture.
 

HIDDEN STATE?

The hidden state is the vector an RNN carries from one timestep to the next; it summarizes what the network has seen so far in the sequence.

 
 
LSTMs (and also GRUs) can extend the dependency range they can learn somewhat, thanks to deeper processing of the hidden states through gating units (which comes with an increased number of parameters to train), but the problem is inherent to the recursion itself.
 
Meaning of vanishing gradient: a well-known problem is vanishing/exploding gradients, which means the model is biased towards the most recent inputs in the sequence; in other words, older inputs have practically no effect on the output at the current step.
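A tiny numeric illustration, assuming the gradient picks up a factor of roughly 0.9 (or 1.1) at every timestep as it flows back through the recurrence:

```python
# gradient contribution of an input t steps back shrinks (or blows up) geometrically
factor_small, factor_large = 0.9, 1.1
for t in (10, 50, 100):
    print(t, factor_small ** t, factor_large ** t)
# t=10:  ~0.35      ~2.6
# t=50:  ~0.0052    ~117
# t=100: ~0.000027  ~13781
```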
What attention does: it gives the decoder direct access to the relevant parts of the input sequence, improving how well the model can translate from one sequence to another.
Encoder: a self-attention layer followed by a feed-forward NN.
 
Decoder: a self-attention layer, an encoder-decoder attention layer (which focuses on the relevant parts of the input sentence), and a feed-forward NN.
 
The word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer, but the feed-forward NN has no such dependencies, so the various paths can be executed in parallel while flowing through the feed-forward layer.
 
 

How is self attention calculated?
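A minimal NumPy sketch of scaled dot-product self-attention for a single head; the random projection matrices and toy sizes are purely illustrative:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model) token embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                          # each output is a weighted mix of the values

# toy example: 6 tokens, model dim 16, head dim 8
d_model, d_k = 16, 8
X = np.random.randn(6, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)          # (6, 8)
```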