Transformer - based solely on attention mechanisms, ditching recurrence and convolution
Recurrent NN, CNN, and Encoder-Decoder
RNN - sequential computation precludes parallelization within a training example, which becomes a bottleneck at longer sequence lengths
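A toy numpy sketch of that sequential bottleneck; the tanh cell, sizes, and random weights are illustrative assumptions, not any particular model: each hidden state needs the previous one, so the loop over positions cannot run in parallel.

```python
# Toy RNN unroll (illustrative sizes and random weights, not a real model):
# h_t depends on h_{t-1}, so the per-position loop is inherently sequential.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # hidden / input size (assumed)
W_h, W_x = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def rnn_states(xs):
    h = np.zeros(d)
    states = []
    for x_t in xs:                           # cannot be parallelized across t
        h = np.tanh(W_h @ h + W_x @ x_t)     # h_t = f(h_{t-1}, x_t)
        states.append(h)
    return np.stack(states)

xs = rng.standard_normal((20, d))            # a sequence of 20 positions
print(rnn_states(xs).shape)                  # (20, 8)
```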
Attention mechanisms - prominent even without RNNs.
In the Transformer → the number of operations needed to learn dependencies between any two positions is reduced to a constant
→ at the cost of → reduced effective resolution, because of averaging attention-weighted positions
:: solved with the help of Multi-Head Attention (see the sketch after the self-attention note below)

Self-attention:
attention mechanism → relates different positions of a single sequence to compute a representation of that sequence
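A minimal numpy sketch of both ideas, single-head scaled dot-product self-attention and a multi-head wrapper; the random per-head projections, d_model = 64, and num_heads = 4 are illustrative assumptions (the final output projection is omitted):

```python
# A minimal sketch, not the paper's implementation: scaled dot-product
# self-attention, plus a multi-head wrapper whose random per-head projections
# (w_q, w_k, w_v) stand in for learned weights. All shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V -- the averaging over attention-weighted positions
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def multi_head_self_attention(x, num_heads=4):
    # Each head attends in its own lower-dimensional subspace; concatenating
    # the heads is what counteracts the reduced resolution of a single average.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        heads.append(attention(x @ w_q, x @ w_k, x @ w_v))  # Q, K, V all from the same x
    return np.concatenate(heads, axis=-1)

x = rng.standard_normal((10, 64))             # 10 positions, d_model = 64
print(attention(x, x, x).shape)               # single-head self-attention: (10, 64)
print(multi_head_self_attention(x).shape)     # multi-head: (10, 64)
```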
Most neural sequence transduction models have → an encoder-decoder structure
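A schematic sketch of that generic encoder-decoder loop (not the Transformer itself); encode, decode_step, eos, and max_len are hypothetical placeholders for whatever the concrete model supplies:

```python
# Generic encoder-decoder transduction pattern (schematic, not a Transformer):
# the encoder maps the input sequence to a continuous representation z, and the
# decoder emits output symbols one at a time, auto-regressively consuming its
# own previous outputs alongside z.
from typing import Callable, List, Sequence

def transduce(src: Sequence[int],
              encode: Callable[[Sequence[int]], Sequence[float]],
              decode_step: Callable[[Sequence[float], List[int]], int],
              eos: int,
              max_len: int = 50) -> List[int]:
    z = encode(src)                 # encoder: input symbols -> continuous representation z
    out: List[int] = []
    while len(out) < max_len:
        y = decode_step(z, out)     # decoder: looks at z and previously generated symbols
        if y == eos:
            break
        out.append(y)               # auto-regressive: generated symbol becomes new input
    return out

# Toy usage with dummy stand-ins for encode / decode_step:
print(transduce([1, 2, 3],
                encode=lambda s: [float(sum(s))],
                decode_step=lambda z, out: 0 if len(out) >= 3 else len(out) + 1,
                eos=0))             # -> [1, 2, 3]
```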