Introduction to Transformer Models

NLP

NLP is a field of linguistics and machine learning focused on understanding everything related to human language.

What is NLP

Classifying whole sentences — sentiment analysis
Classifying each word in a sentence — grammatical tagging (e.g., part of speech)
Generating text content — automatically generating text from a prompt

NLP is challenging because computers don't process language the way humans do. We can easily see how two sentences relate to each other in meaning, but a computer cannot do that out of the box; it has no sense of the feelings or intent behind the words. For example, here are 2 sentences:

I am hungry
I am sad

A human instantly sees that both sentences describe how someone feels, yet mean different things. This is essentially what machine learning does for NLP: it trains a model on data so it can capture that kind of meaning.

Transformers

These are models that can handle almost every NLP task; some are mentioned below. The most basic object in the Transformers library that can perform these tasks is the pipeline() function.

Sentiment analysis
It classifies sentences as positive or negative.

A score of 0.999… means the model is roughly 99.9% confident in its prediction.
We can also pass several sentences; a label and score will be returned for each.
By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when we create the classifier object.
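
A minimal sketch of this pipeline in code (the example sentences are illustrative, not taken from the article):

```python
from transformers import pipeline

# Creating the classifier downloads and caches the default English sentiment model
classifier = pipeline("sentiment-analysis")

# A single sentence returns one label with a confidence score
print(classifier("I love this course!"))

# Several sentences can be passed at once; each gets its own label and score
print(classifier(["I love this course!", "I hate waiting in line."]))
```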

Zero-shot classification
It allows us to supply the labels we want to classify the text with, instead of relying on the labels the model was trained on. This is why it is called zero-shot: the model does not need to be fine-tuned on our labels first.
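
A short sketch, with candidate labels chosen purely for illustration:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

# We supply our own candidate labels; the model was never fine-tuned on them
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result["labels"])  # labels ranked from most to least likely
print(result["scores"])  # a probability for each label
```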

Text generation
The main idea of text generation is that we provide a prompt and the model generates the rest of the text. We can also control the total length of the output text.
If we don't specify any model, the default model is used; otherwise we can name a specific model from the Hub, as in the sketch below.
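
A sketch of text generation; the prompt and the distilgpt2 checkpoint are illustrative choices:

```python
from transformers import pipeline

# Omitting `model=` would fall back to the default text-generation model
generator = pipeline("text-generation", model="distilgpt2")

outputs = generator(
    "In this course, we will teach you how to",
    max_length=30,            # controls the total length of the output text
    num_return_sequences=2,   # how many different continuations to return
)
print(outputs)
```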

Mask filling
The idea of this task is to fill in the blanks in a piece of text.
The top_k value tells how many possibilities to show in place of the mask token.
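
A sketch using the default fill-mask model, which expects <mask> as the placeholder token:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask")

# top_k controls how many candidate words are returned for the masked position
results = unmasker("This course will teach you all about <mask> models.", top_k=2)
for r in results:
    print(r["token_str"], r["score"])
```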

Named entity recognition
It can identify and separate the persons, organizations, locations, or other entities mentioned in a sentence (see the sketch after the label list below).

PER – person
ORG – organization
LOC – location
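
A sketch of the NER pipeline; the sentence is an illustrative example:

```python
from transformers import pipeline

# grouped_entities=True merges sub-word pieces that belong to the same entity
ner = pipeline("ner", grouped_entities=True)

for entity in ner("My name is Sylvain and I work at Hugging Face in Brooklyn."):
    # entity_group is one of PER, ORG, LOC, MISC for the default model
    print(entity["entity_group"], entity["word"], entity["score"])
```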

Question answering
It will give the answer based on the provided information. It does not generate answers; it just extracts the answer from the given context.
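
A sketch of extractive question answering; both the question and the context are illustrative:

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")

result = question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn.",
)
# The answer is a span extracted from the context, not newly generated text
print(result["answer"], result["score"])
```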

Summarization
In this case, it will summarize the whole paragraph that we provide, keeping the important points.
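
A sketch of the summarization pipeline; the input paragraph is a placeholder:

```python
from transformers import pipeline

summarizer = pipeline("summarization")

article = (
    "Transformer models have become the standard approach for many natural "
    "language processing tasks. They are pretrained on large amounts of raw "
    "text and then fine-tuned for specific applications such as translation, "
    "summarization, and question answering."
)
# max_length / min_length (in tokens) bound how long the summary can be
print(summarizer(article, max_length=40, min_length=10))
```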

Translation
It will translate the provided text into a different language.
I have provided a model name as well as the translation pair "en-ur" (English to Urdu).
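
A sketch of English-to-Urdu translation; the checkpoint name Helsinki-NLP/opus-mt-en-ur is my assumption for an "en-ur" model on the Hub:

```python
from transformers import pipeline

# Explicit model for the "en-ur" (English to Urdu) pair; checkpoint name assumed
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ur")

print(translator("I like this course."))
```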

How do transformers work?

The Transformer architecture was introduced in 2017; influential models built on it, such as GPT and BERT, followed in 2018.
Transformer models are basically language models, meaning they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning means that humans are not needed to label the data. Such a pretrained model is not directly useful for specific practical tasks, so we use transfer learning: the knowledge of the pretrained model is transferred to another model for a specific task.
Transformers are large models; to achieve better results they need to be trained on large amounts of data, but training on large data has a heavy environmental impact due to carbon dioxide emissions.
So instead of pretraining (training a model from scratch), we fine-tune existing pretrained models in order to reduce time and environmental impact.
Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

General Architecture
It generally consists of 2 sections

Encoders
Decoders

The encoder receives the input and builds a representation of its features.
The decoder uses that representation to generate the output.

Models
There are 3 types of models

Only encoders — these are good for tasks that require understanding of the input, such as named entity recognition.
Only decoders — these are good for generative tasks.
Both encoders and decoders — these are good for generative tasks that need input such as summarization or translation.

Attention Layers
A key feature of Transformer models is the attention layer. It tells the model which specific words to pay attention to.
For example, in the context of translation,
suppose we want to translate "You like this course" to French.
The attention layers will tell the model to pay attention to words like "You", "like", and "course", because translating one of them may require looking at the others: the French translation of "like" depends on the subject "You", and the translation of "this" depends on the gender of "course".
A word's meaning is strongly related to its context, and the attention layers are what let the model take that context into account.
Originally, Transformers were made for translation purposes only. During training, the encoder receives sentences in the input language while the decoder produces the translation; this is what the original structure of the Transformer initially was.

ENCODERS

The architecture of BERT (the most popular encoder model) is "encoder-only".

How does it actually work?
It takes words as input and generates a numerical representation (a feature vector) for each of them.
The vector produced for a word is not just a value for that word in isolation; it is computed from the context of the whole sentence (the self-attention mechanism), looking at words both to the left and to the right (bi-directional).
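
A sketch showing the feature vectors an encoder produces, using the bert-base-uncased checkpoint as an illustrative choice:

```python
from transformers import pipeline

# "feature-extraction" exposes the encoder's output directly:
# one numerical vector per token, computed from the full (left and right) context
feature_extractor = pipeline("feature-extraction", model="bert-base-uncased")

features = feature_extractor("I like this course")
print(len(features[0]))      # number of tokens (including special tokens)
print(len(features[0][0]))   # size of each feature vector (768 for BERT base)
```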

When encoders can be used

Classification tasks
Question answering tasks
Masked language modeling
Encoders really shine in these tasks.

Representatives of this family

ALBERT
BERT
DistilBERT
ELECTRA
RoBERTa

DECODERS

We can do similar tasks with decoders as with encoders, usually with a small loss of performance.
The difference between encoders and decoders is that encoders use a (bi-directional) self-attention mechanism, while decoders use a masked self-attention mechanism: the representation generated for a word only takes into account the words positioned before it, not the full context on both sides.

When we should use a decoder

Text generation (the ability to generate a word or continue a known sequence of words, which in NLP is called causal language modeling)
Word prediction
At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
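
A minimal sketch of this auto-regressive loop with GPT-2 (greedy decoding, chosen for simplicity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Transformers are", return_tensors="pt").input_ids

for _ in range(10):
    logits = model(input_ids).logits   # scores for every possible next token
    next_id = logits[0, -1].argmax()   # pick the most likely next token (greedy)
    # the predicted word becomes part of the input for the next step
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```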

Representatives of this family

CTRL
GPT
GPT-2
Transformer XL

ENCODER-DECODER

In this type of model, we use an encoder alongside a decoder.

Working
Let’s take an example of translation (transduction)
We give a sentence as input to the encoder, which generates a numerical representation (feature vector) for the words; this representation is then passed to the decoder. The decoder uses it, together with a start-of-sequence token that tells it to begin decoding, to output the first word. Once we have the first word and the feature vector generated by the encoder, the encoder is no longer needed.
We have already seen the auto-regressive manner of the decoder: the word it just output can now be used as part of its input to generate the second word, and this goes on until the sequence is finished.
In this model, the encoder takes care of understanding the input sequence, and the decoder takes care of generating the output based on that understanding.
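
A sketch of this flow with a translation checkpoint (Helsinki-NLP/opus-mt-en-fr, chosen for illustration):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("I like this course", return_tensors="pt")

# The encoder runs once and produces one feature vector per input token
encoder_outputs = model.get_encoder()(**inputs)
print(encoder_outputs.last_hidden_state.shape)  # (batch, input tokens, hidden size)

# generate() then runs the decoder auto-regressively, reusing the encoder output
# at every step until an end-of-sequence token is produced
translated_ids = model.generate(**inputs, max_length=40)
print(tokenizer.decode(translated_ids[0], skip_special_tokens=True))
```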

Where we can use these

Translation
Summarization
Generative question answering

Representatives of this family

BART
mBART
Marian
T5

Limitations
An important note at the end of this article: whether you use a pretrained model directly or fine-tune it, these models are powerful but come with limitations.
One example is bias. When a fill-mask model is asked to suggest an occupation for a man versus a woman, the possible words it returns tend to be gender-specific stereotypes. So if you are using any of these models, this can be an issue.
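
A sketch of one way to observe this with the fill-mask pipeline, using bert-base-uncased (which expects [MASK] as its placeholder). The exact suggestions vary from run to run, but they tend to be gender-stereotyped occupations:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Compare the occupations suggested for a man vs. a woman
print([r["token_str"] for r in unmasker("This man works as a [MASK].")])
print([r["token_str"] for r in unmasker("This woman works as a [MASK].")])
```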