NLP
NLP is a field of linguistics and machine learning focused on understanding everything related to human language.
What is NLP
Classifying whole sentences → sentiment analysis
Classifying each word in a sentence → grammatically
Generating text content → auto-generated text
It is challenging because computers don't process language the way humans do. We can easily see the similarity between different sentences, but computers cannot; they have no emotions or lived context, so they sometimes fail to understand how two sentences relate. For example, here are 2 sentences:
I am hungry
I am sad
This is basically what ML does: train a model on data so it learns such distinctions.
Transformers
These are basically models that can do almost every task of NLP; some are mentioned below. The most basic object that can do these tasks is the pipeline() function.
Sentiment analysis
It can classify sentences that are positive or negative.
A score of 0.999… tells us that the model is about 99.9% confident in this prediction.
We can also pass several sentences; a score for each will be provided.
By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when we create the classifier object.
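As a minimal sketch, assuming the Hugging Face `transformers` library (and a backend such as PyTorch) is installed:

```python
from transformers import pipeline

# Creating the classifier downloads and caches the default
# English sentiment model the first time this runs.
classifier = pipeline("sentiment-analysis")

# Several sentences at once: one label/score pair per sentence.
results = classifier(["I love this course!", "I hate waiting in line."])
for r in results:
    print(r["label"], round(r["score"], 3))
```

The score is the model's confidence in the predicted label, between 0 and 1.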
Zero-shot classification
It allows us to supply whatever labels we want for the data instead of relying on the labels the model was trained with.
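A quick sketch of supplying our own labels (the candidate labels here are arbitrary examples):

```python
from transformers import pipeline

# Zero-shot classification: we choose the labels at inference time,
# without fine-tuning the model on them.
classifier = pipeline("zero-shot-classification")

result = classifier(
    "This course teaches you how to build NLP pipelines.",
    candidate_labels=["education", "politics", "business"],
)
# Labels come back sorted by score, highest first.
print(result["labels"][0], round(result["scores"][0], 3))
```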
Text generation
The main idea of text generation is that we provide a prompt and the model generates text to continue it. We can also control the total length of the output text.
If we don't specify any model, the default one is used; otherwise we can specify a model of our choice.
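A short sketch with the default model; `max_length` caps the total length (prompt plus generated tokens) and the prompt text is just an example:

```python
from transformers import pipeline

generator = pipeline("text-generation")  # default model is GPT-2

outputs = generator(
    "In this course, we will teach you how to",
    max_length=30,            # total length of prompt + continuation
    num_return_sequences=2,   # ask for two alternative continuations
)
for out in outputs:
    print(out["generated_text"])
```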
Mask filling
The idea of this task is to fill in the blanks in a sentence, where the blank is marked by the model's mask token.
The value of top_k tells the pipeline how many candidate words to return in the place of the mask.
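A sketch using the pipeline's default model, whose mask token is `<mask>` (other models may use a different token such as `[MASK]`):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask")

# top_k=2 asks for the two most likely fillings for the blank.
results = unmasker("This course will teach you all about <mask> models.", top_k=2)
for r in results:
    print(r["token_str"], round(r["score"], 3))
```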
Named entity recognition
It can identify the persons, organizations, locations, and other entities in a sentence.
PER → person
ORG → organization
LOC → location
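A sketch with the default NER model; `grouped_entities=True` merges sub-word pieces back into whole entity names:

```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)

entities = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
for e in entities:
    # entity_group is one of the tags above: PER, ORG, LOC, ...
    print(e["entity_group"], e["word"])
```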
Question answering
It gives an answer based on the provided information. It does not generate answers; it extracts the answer as a span of the given context.
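A minimal sketch; note the answer below is always a substring of the context, which is the point being made:

```python
from transformers import pipeline

qa = pipeline("question-answering")

context = "My name is Sylvain and I work at Hugging Face in Brooklyn."
result = qa(question="Where do I work?", context=context)

# The answer is extracted from the context, not generated.
print(result["answer"])
```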
Summarization
In this case, it condenses the paragraph we provide into a shorter summary.
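A sketch with the default summarization model; the `max_length`/`min_length` arguments bound the summary length in tokens, and the input text is just a sample paragraph:

```python
from transformers import pipeline

summarizer = pipeline("summarization")

text = (
    "America has changed dramatically during recent years. Not only has the "
    "number of graduates in traditional engineering disciplines declined, but "
    "in most of the premier American universities engineering curricula now "
    "concentrate on and encourage largely the study of engineering science."
)
summary = summarizer(text, max_length=40, min_length=10)
print(summary[0]["summary_text"])
```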
Translation
It translates the provided text into a different language.
Here we provide a model name as well as the translation pair "en-ur" (English to Urdu).
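As a sketch, using one public English-to-Urdu checkpoint from the Hugging Face Hub that matches the "en-ur" naming above:

```python
from transformers import pipeline

# Translation pipelines need an explicit model for most language pairs;
# "Helsinki-NLP/opus-mt-en-ur" is an English-to-Urdu checkpoint.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ur")

result = translator("This course is produced by Hugging Face.")
print(result[0]["translation_text"])
```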
How do transformers work?
The Transformer architecture was introduced in 2017; some influential models based on it are GPT, BERT, etc.
Transformer models are basically language models, meaning they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning means that humans are not needed to label the data. Such a pretrained model is not directly useful for specific practical tasks, so in that case we use transfer learning: transferring the knowledge of a pretrained model to another model for a specific task.
Transformers are large models. To achieve better results, the models should be trained on large amounts of data, but such training impacts the environment heavily due to carbon dioxide emissions.
So instead of pretraining (training a model from scratch), we fine-tune existing pretrained models in order to reduce time, cost, and environmental impact.
Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.
General Architecture
It generally consists of 2 parts:
Encoders
Decoders
Encoders receive input and build a representation of its features.
Decoders use that representation to generate output.
Models
There are 3 types of models:
Only encoders → good for tasks that require understanding of the input, such as named entity recognition.
Only decoders → good for generative tasks.
Both encoders and decoders → good for generative tasks that need an input, such as summarization or translation.
Attention Layers
A key feature of transformers is the attention layer. It tells the model to pay attention to specific words.
For example in context of translation,
I want to translate "You like this course" to French.
These layers tell the model to pay attention to specific words like "You", "like", and "course", because depending on context they translate differently into French (for example, the translation of "like" depends on the subject "You").
A word can have different meanings strongly related to its context, and attention layers are what let the model handle that.
Originally, the Transformer was made for translation purposes only: during training, the encoder received sentences in the source language and the decoder produced the translation. That was the original structure of the Transformer.
ENCODERS
The architecture of BERT(the most popular model) is âencoder onlyâ.
How does it actually work?
It takes a sequence of words as input and generates a numerical representation (feature vector) for each word.
The value generated for each word is not just a fixed value for that word; it depends on the context of the sentence, using words both to its left and right (bi-directional) via the self-attention mechanism.
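This can be seen directly by running BERT as a bare encoder; a sketch assuming `transformers` and PyTorch are installed:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# BERT as an encoder: each input token gets one contextual feature vector.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I like this course", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, sequence_length, hidden_size) -> one vector per token,
# computed from the whole sentence, not the token alone.
print(outputs.last_hidden_state.shape)
```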
When encoders can be used
Classification tasks
Question answering tasks
Masked language modeling
In these tasks Encoders really shine.
Representatives of this family
ALBERT
BERT
DistilBERT
ELECTRA
RoBERTa
DECODERS
We can do similar tasks with decoders as with encoders, with a little loss of performance.
The difference between encoders and decoders is that encoders use a self-attention mechanism while decoders use a masked self-attention mechanism: when generating the representation for a word, a decoder can only use the words before it, not the full context on both sides.
When we should use a decoder
Text generation (the ability to generate a word or a sequence of words; in NLP this is called causal language modeling)
Word prediction
At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
Representatives of this family
CTRL
GPT
GPT-2
Transformer XL
ENCODER-DECODER
In these types of models, we use an encoder alongside a decoder.
Working
Let's take an example of translation (transduction).
We give a sentence as input to the encoder, which generates a numerical sequence (feature vector) for the words; the decoder then takes this representation as input. The decoder decodes it and outputs a word. A special start-of-sequence token tells the decoder to begin decoding. Once we have the first word and the feature vector generated by the encoder, the encoder is no longer needed.
We have learnt about the auto-regressive manner of the decoder: the word it outputs can be fed back as its input to generate the 2nd word. This goes on until the sequence is finished.
In this model, the encoder takes care of understanding the input sequence and the decoder takes care of generating output based on the encoder's understanding.
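The loop described above can be made explicit with a small encoder-decoder model; a sketch using T5, where the manual greedy loop mimics what `model.generate()` does internally:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder runs once on the input sentence.
enc = tokenizer("translate English to French: I like this course",
                return_tensors="pt")

# The decoder starts from the special start-of-sequence token.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])

for _ in range(20):
    with torch.no_grad():
        logits = model(**enc, decoder_input_ids=decoder_ids).logits
    # Greedily pick the next word and feed it back in (auto-regressive).
    next_id = logits[0, -1].argmax().reshape(1, 1)
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```

In practice the encoder output would be cached after the first pass; this sketch re-runs it each step only for simplicity.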
Where can we use these?
Translation
Summarization
generative question answering
Representatives of this family
BART
mBART
Marian
T5
Limitations
An important note at the end of this article: whether you use a pretrained model or fine-tune one, these models are powerful but come with limitations.
For example, when asked to fill in a masked word in a sentence about occupations, a model may suggest gender-stereotyped words. If you are using any of these models, this bias can be an issue.