
Transformers

  • Transformers are a type of neural network architecture developed in 2017.

  • They scale well with data, i.e., the bigger the data, the better the model!

  • Transformers are data-agnostic and work well with many different data types, including text, images, and audio.

  • They’re good at transfer learning.

  • Deep neural networks - as the number of layers in a network increases, so does its complexity.

 

Traditional language models are count-based; they use n-gram models and statistical techniques.

They suffer from fixed context windows.

They struggle with long-range dependencies.


An n-gram is a sequence of n adjacent symbols in a particular order; the symbols may be letters or words. Because an n-gram model conditions only on the previous n−1 symbols, it suffers from a fixed context window.
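
To make that limitation concrete, here is a minimal sketch (not from this post) of a count-based bigram model (n = 2) in Python; the toy corpus and helper names are illustrative assumptions.

```python
# Sketch of a count-based bigram language model (n = 2).
from collections import defaultdict, Counter

def train_bigram(corpus):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def next_word_probs(counts, prev_word):
    """Estimate P(word | prev_word) from the counts."""
    total = sum(counts[prev_word].values())
    return {w: c / total for w, c in counts[prev_word].items()}

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = train_bigram(corpus)
print(next_word_probs(counts, "the"))
```

No matter how long the input is, the prediction for the next word here depends only on the single preceding word; that is exactly the fixed context window that transformer models were designed to move past.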



Transformer-based large language models

Transformers capture long-range dependencies using self-attention mechanisms.

The model assesses all words in an input sequence while processing each individual word; each word is assigned weights based on its relevance to the others:

In the self-attention mechanisms used in transformer models, each word in a text sequence has a relationship with all other words in the sequence, including itself. This is one of the key features that makes self-attention so powerful. Here's a brief explanation of how this works:

  1. Query, Key, and Value vectors: For each word, three vectors are created - a query vector, a key vector, and a value vector.

  2. Attention scores: The query vector of each word is compared with the key vectors of all words (including itself) to produce attention scores. These scores represent how much focus should be placed on other words when encoding a particular word.

  3. Weighted sum: The attention scores are then used to create a weighted sum of the value vectors, producing the final representation for each word.

This process allows each word to attend to all other words in the sequence, capturing both short-range and long-range dependencies in the text. The strength of these relationships is learned during training and can vary based on the context and the specific task.

It's worth noting that while each word technically has a relationship with all other words, the strength of these relationships can vary significantly. Some words might have strong relationships with only a few other words, while others might have more distributed attention across the sequence.
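
The numbered steps above map almost directly onto code. Below is a minimal NumPy sketch of scaled dot-product self-attention; the toy dimensions, random projection matrices, and function names are illustrative assumptions rather than the implementation of any particular model.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) word embeddings; W_*: learned projection matrices."""
    Q = X @ W_q                      # 1. query vector for each word
    K = X @ W_k                      #    key vector for each word
    V = X @ W_v                      #    value vector for each word

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # 2. attention scores: every word compared with every word

    # softmax over each row so one word's attention weights sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V, weights      # 3. weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))           # embeddings for 4 toy "words"
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))
```

Each row of the printed weights matrix sums to 1 and shows how much one word attends to every word in the sequence, including itself; stronger relationships appear as larger weights, while weak ones stay close to zero.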



