Transformers in NLP: Creating a Translator Model from Scratch

Article by Rahul Agarwal | September 25, 2020

Transformers have now become the de facto standard for NLP tasks. Originally developed for sequence transduction tasks such as speech recognition, translation, and text-to-speech, transformers rely entirely on attention mechanisms rather than recurrence or convolutions, which makes them far more parallelizable and efficient than previous architectures. And although transformers were developed for NLP, they’ve also been applied in computer vision and music generation.

However, for all their wide and varied uses, transformers are still very difficult to understand, which is why I wrote a detailed post describing how they work on a basic level. It covers the encoder and decoder architecture, and the whole dataflow through the different pieces of the neural network.

In this post, we’ll dig deeper into transformers by implementing our own English-to-German translator. The contents are below:

  • Task Description
  • Data Preprocessing
  • The Transformer
  • Define Optimizer and Model
  • Training our Translator
  • Results
  • Caveats/Improvements
  • References

 

Task Description

We want to create a translator that uses transformers to convert English to German. If we look at it as a black box, our network will take an English sentence as input and give us a German sentence as output.

The NLP transformer shown as a black box in this translation example.

 

Data Preprocessing

To train our English-German translation model, we’ll need translated sentence pairs between English and German.

Fortunately, we can use the IWSLT (International Workshop on Spoken Language Translation) dataset, available through torchtext.datasets. This machine translation dataset is a de facto standard for translation tasks. It contains translations of TED and TEDx talks covering a variety of topics in many languages.

But before we dive into the coding section, let’s look at what we need as input and output for the model while training. We’ll need two matrices to be input to our network:

 

  • The Source English sentences (Source): A matrix of shape (batch size x source sentence length). The numbers in this matrix correspond to words based on the English vocabulary we need to create. For example, 234 in the English vocabulary might correspond to the word “the”. Because not all sentences are the same length, shorter sentences are padded at the end with a word index that refers to a <blank> token (6 in the toy example below).
  • The Shifted Target German sentences (Target): A matrix of shape (batch size x target sentence length). As with our Source sentences, the numbers in this matrix correspond to words based on the German vocabulary we need to create. There is a pattern to this particular matrix: all sentences start with a word whose index in the German vocabulary is 2, and they invariably end with the pattern [3 followed by zero or more 1’s]. This is intentional: we want to start the target sentence with a start token (so 2 is the <s> token), end it with an end token (so 3 is the </s> token), and pad the rest with blank tokens (so 1 refers to the <blank> token), as in the toy example below. Note: This is covered in more detail in my post on transformers, so take a look and come back if you’re feeling confused here.
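To make these two matrices concrete, here is a small, purely illustrative toy batch with made-up word indices (the real indices depend on the vocabularies we build later):

import torch

# Hypothetical indices, for illustration only
source = torch.tensor([
    [234,  57,  12,  89,   4],   # a full-length English sentence
    [101,  77,  19,   6,   6],   # a shorter sentence, padded with 6 (<blank>)
    [ 45, 234,   8,  33,   6],   # padded with 6 (<blank>)
])

# Every target row starts with 2 (<s>) and ends with 3 (</s>) followed by 1's (<blank>)
target = torch.tensor([
    [2, 410,  22,  97,   3,   1],
    [2,  88,  15,   3,   1,   1],
    [2, 300,  41,  67,  12,   3],
])

print(source.size())  # torch.Size([3, 5]) -> (batch size x source sentence length)
print(target.size())  # torch.Size([3, 6]) -> (batch size x target sentence length)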

Now that we know how to preprocess our data, let’s get into the actual code for the preprocessing steps.

Please note that it doesn’t really matter if you preprocess using other methods. What matters is that, in the end, you send the source and target sentences to your model in a way the transformer can use, i.e. source sentences should be padded with blank tokens, and target sentences need to have a start token, an end token, and the rest padded with blank tokens.

 

We’ll start by loading the spaCy models, which provide the tokenizers we use to split German and English text into tokens.

 


import spacy

# Load the spaCy models
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    # Split German text into a list of token strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # Split English text into a list of token strings
    return [tok.text for tok in spacy_en.tokenizer(text)]
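As a quick sanity check, we can run one of the tokenizers on a sample sentence (the exact token list depends on the spaCy model you have installed):

print(tokenize_en("This is an example to check how our model is performing."))
# ['This', 'is', 'an', 'example', 'to', 'check', 'how', 'our', 'model', 'is', 'performing', '.']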

 

We also define some special tokens we’ll use for specifying blank/padding words, and for the beginning and end of sentences as I discussed above.

 

# Special Tokens
BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"

 

We can now define a preprocessing pipeline for both our source and target sentences using data.Field from torchtext. You’ll notice that while we only specify pad_token for the source sentences, we specify pad_token, init_token, and eos_token for the target sentences. We also define which tokenizers to use.

 

from torchtext import data, datasets

SRC = data.Field(tokenize=tokenize_en, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_de, init_token=BOS_WORD,
                 eos_token=EOS_WORD, pad_token=BLANK_WORD)

 

Until now, we haven’t seen any data, so let’s use the IWSLT data from torchtext.datasets to create a train, validation, and test dataset. Let’s also filter our sentences using the MAX_LEN parameter so our code runs faster. Note that we get the data with .en and .de extensions and we specify the preprocessing steps using the fields parameter.

 

MAX_LEN = 20
train, val, test = datasets.IWSLT.splits(
    exts=('.en', '.de'), fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN
    and len(vars(x)['trg']) <= MAX_LEN)

 

Now that we have our training data, let’s see what it looks like:

 

for i, example in enumerate([(x.src, x.trg) for x in train[0:5]]):
    print(f"Example_{i}:{example}")

---------------------------------------------------------------

Example_0:(['David', 'Gallo', ':', 'This', 'is', 'Bill', 'Lange', '.', 'I', "'m", 'Dave', 'Gallo', '.'], ['David', 'Gallo', ':', 'Das', 'ist', 'Bill', 'Lange', '.', 'Ich', 'bin', 'Dave', 'Gallo', '.'])

Example_1:(['And', 'we', "'re", 'going', 'to', 'tell', 'you', 'some', 'stories', 'from', 'the', 'sea', 'here', 'in', 'video', '.'], ['Wir', 'werden', 'Ihnen', 'einige', 'Geschichten', 'über', 'das', 'Meer', 'in', 'Videoform', 'erzählen', '.'])

Example_2:(['And', 'the', 'problem', ',', 'I', 'think', ',', 'is', 'that', 'we', 'take', 'the', 'ocean', 'for', 'granted', '.'], ['Ich', 'denke', ',', 'das', 'Problem', 'ist', ',', 'dass', 'wir', 'das', 'Meer', 'für', 'zu', 'selbstverständlich', 'halten', '.'])

Example_3:(['When', 'you', 'think', 'about', 'it', ',', 'the', 'oceans', 'are', '75', 'percent', 'of', 'the', 'planet', '.'], ['Wenn', 'man', 'darüber', 'nachdenkt', ',', 'machen', 'die', 'Ozeane', '75', '%', 'des', 'Planeten', 'aus', '.'])

Example_4:(['Most', 'of', 'the', 'planet', 'is', 'ocean', 'water', '.'], ['Der', 'Großteil', 'der', 'Erde', 'ist', 'Meerwasser', '.'])

 

You might notice that while the data.Field object has done the tokenization, it has not yet applied the start, end, and pad tokens. This is intentional: we don’t have batches yet, and the number of pad tokens inherently depends on the maximum length of a sentence in a particular batch.

As mentioned at the start, we also create a Source and Target language vocabulary by using the built-in build_vocab method on the data.Field objects. We specify a MIN_FREQ of 2 so that any word that doesn’t occur at least twice isn’t included in our vocabulary.

 

MIN_FREQ = 2
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)

 

Once we’re done here, we can use data.BucketIterator, which groups sentences of similar lengths into the same batch, to get our train and validation iterators. Note that we use a batch_size of 1 for our validation data. This is optional, but it is done to avoid (or at least minimize) padding while checking performance on the validation data.

 

BATCH_SIZE = 350

# Create iterators to process text in batches of approx. the same length by sorting on sentence lengths

train_iter = data.BucketIterator(train, batch_size=BATCH_SIZE, repeat=False, sort_key=lambda x: len(x.src))

val_iter = data.BucketIterator(val, batch_size=1, repeat=False, sort_key=lambda x: len(x.src))

 

Before we proceed, it’s a good idea to see what our batch looks like and what we are sending to the model as input while training.

 

batch = next(iter(train_iter))
src_matrix = batch.src.T
print(src_matrix, src_matrix.size())

 

This prints our source matrix and its size:

 

trg_matrix = batch.trg.T
print(trg_matrix, trg_matrix.size())

 

And this prints our target matrix and its size:

 

So in the first batch, the src_matrix contains 350 sentences of length 20 and the trg_matrix is 350 sentences of length 22. Just so we’re sure of our preprocessing, let’s see what some of these numbers represent in the src_matrix and the trg_matrix.

 

print(SRC.vocab.itos[1])
print(TGT.vocab.itos[2])
print(TGT.vocab.itos[1])
--------------------------------------------------------------------
<blank>
<s>
<blank>

 

Just as expected. The reverse mapping, i.e. string to index (stoi), also works well.

 

print(TGT.vocab.stoi['</s>'])
--------------------------------------------------------------------
3

 

The Transformer

 

So, now that we have a way to send the source sentence and the shifted target to our transformer, we can look at creating the transformer for our NLP task.

 

A lot of the blocks we need are taken from the PyTorch nn module. In fact, PyTorch has a Transformer module too, but it doesn’t include several pieces described in the paper, such as the embedding layer and the positional encoding layer. So this is a more complete implementation that still borrows heavily from the PyTorch one.

We create our Transformer by combining these various blocks from the PyTorch nn module.
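The full model definition is in the GitHub repository linked at the end of this post, but to make the structure concrete, here is a minimal sketch of what such a MyTransformer can look like, built from nn.TransformerEncoder, nn.TransformerDecoder, nn.Embedding, and a sinusoidal positional encoding layer, with the hyperparameter defaults from the paper (d_model=512, 8 heads, 6 encoder and 6 decoder layers). Details may differ from the original implementation.

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    # Adds fixed sinusoidal positional encodings to the word embeddings
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(1))  # (max_len, 1, d_model)

    def forward(self, x):
        # x: (sequence length, batch size, d_model)
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

class MyTransformer(nn.Module):
    # Embedding + positional encoding wrapped around PyTorch's encoder/decoder stacks
    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6,
                 dim_feedforward=2048, dropout=0.1,
                 source_vocab_length=60000, target_vocab_length=60000):
        super().__init__()
        self.d_model = d_model
        self.source_embedding = nn.Embedding(source_vocab_length, d_model)
        self.target_embedding = nn.Embedding(target_vocab_length, d_model)
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_encoder_layers, nn.LayerNorm(d_model))
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers, nn.LayerNorm(d_model))
        self.out = nn.Linear(d_model, target_vocab_length)  # projects to target vocabulary logits

    def forward(self, src, tgt, tgt_mask=None, src_key_padding_mask=None,
                tgt_key_padding_mask=None, memory_key_padding_mask=None):
        # src, tgt: (sequence length, batch size) tensors of word indices
        src = self.pos_encoder(self.source_embedding(src) * math.sqrt(self.d_model))
        tgt = self.pos_encoder(self.target_embedding(tgt) * math.sqrt(self.d_model))
        memory = self.encoder(src, src_key_padding_mask=src_key_padding_mask)
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask,
                              tgt_key_padding_mask=tgt_key_padding_mask,
                              memory_key_padding_mask=memory_key_padding_mask)
        return self.out(output)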

Also, note that what happens inside these layers is just a series of matrix operations. See in particular how the decoder stack takes the memory produced by the encoder as one of its inputs. We also create a positional encoding layer, which lets us add positional embeddings to our word embeddings.

Feel free to look at the PyTorch source code of all these blocks. I often used the source code as a reference to make sure I was giving the right inputs to these layers.

 

Define Optimizer and Model

Now, we can initialize the transformer and the optimizer using:

 

source_vocab_length = len(SRC.vocab)
target_vocab_length = len(TGT.vocab)

model = MyTransformer(source_vocab_length=source_vocab_length, target_vocab_length=target_vocab_length)

optim = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

model = model.cuda()

 

In the paper the authors used an Adam optimizer with a scheduled learning rate, but here I use a normal Adam optimizer to keep things simple.
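For reference, the schedule from the paper (learning rate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)) is easy to add on top of Adam with torch.optim.lr_scheduler.LambdaLR. This is just a sketch and is not used in the rest of the post; it assumes the paper’s values of d_model=512 and 4000 warmup steps.

def noam_lr(d_model=512, warmup=4000):
    # Learning rate schedule from "Attention Is All You Need"
    def lr_lambda(step):
        step = max(step, 1)  # avoid division by zero on the very first call
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    return lr_lambda

# With a base lr of 1.0, LambdaLR's multiplicative factor becomes the actual learning rate
optim = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=noam_lr())
# Call scheduler.step() after every optimizer step during training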

 

Training our Translator

Now we can train our transformer using the train function below. Essentially, what we are doing in the training loop is:

  • Getting the src_matrix and trg_matrix from a batch.
  • Creating a src_mask: the mask that tells the model which positions in the src_matrix are padding.
  • Creating a trg_mask: this prevents our model from looking at subsequent (future) target words at any point in time.
  • Getting the prediction from the model.
  • Calculating the loss using cross-entropy. (In the paper they use a label-smoothed KL divergence loss, but cross-entropy works fine for understanding.)
  • Backprop.
  • Save the best model based on validation loss.
  • We also predict the model output at every epoch for some sentences of our choice as a debug step using the function greedy_decode_sentence. We will discuss this function in the results section.
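The full train function is in the repository; below is a minimal sketch that follows the steps above. It assumes the MyTransformer interface sketched earlier, uses plain cross-entropy that ignores padding positions, and calls the greedy_decode_sentence helper defined in the results section. Details such as how the epoch losses are aggregated may differ from the original code.

import torch
import torch.nn.functional as F

def train(train_iter, val_iter, model, optim, num_epochs):
    train_losses, valid_losses = [], []
    best_val_loss = float('inf')
    src_pad_idx = SRC.vocab.stoi[BLANK_WORD]  # 1 in our vocabulary
    tgt_pad_idx = TGT.vocab.stoi[BLANK_WORD]  # 1 in our vocabulary

    def run_batch(batch, training):
        src, trg = batch.src.cuda(), batch.trg.cuda()            # (src_len, batch), (trg_len, batch)
        trg_input = trg[:-1, :]                                  # shifted target fed to the decoder
        targets = trg[1:, :].reshape(-1)                         # words the decoder should predict
        # src_mask: hide padded source positions; trg_mask: hide future target positions
        src_pad_mask = (src == src_pad_idx).transpose(0, 1)
        tgt_pad_mask = (trg_input == tgt_pad_idx).transpose(0, 1)
        size = trg_input.size(0)
        tgt_mask = torch.triu(torch.full((size, size), float('-inf'), device=src.device), diagonal=1)
        preds = model(src, trg_input, tgt_mask=tgt_mask,
                      src_key_padding_mask=src_pad_mask,
                      tgt_key_padding_mask=tgt_pad_mask,
                      memory_key_padding_mask=src_pad_mask)
        loss = F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets, ignore_index=tgt_pad_idx)
        if training:
            optim.zero_grad()
            loss.backward()
            optim.step()
        return loss.item()

    for epoch in range(num_epochs):
        model.train()
        train_loss = sum(run_batch(b, training=True) for b in train_iter)
        model.eval()
        with torch.no_grad():
            val_loss = sum(run_batch(b, training=False) for b in val_iter)
        train_losses.append(train_loss)
        valid_losses.append(val_loss)
        print(f"Epoch [{epoch + 1}/{num_epochs}] complete. Train Loss: {train_loss:.3f}. Val Loss: {val_loss:.3f}")
        if val_loss < best_val_loss:                             # keep the best model on validation loss
            best_val_loss = val_loss
            torch.save(model.state_dict(), "checkpoint_best_epoch.pt")
        # Debug: translate a fixed sentence after every epoch
        print("Original Sentence: This is an example to check how our model is performing.")
        print("Translated Sentence:", greedy_decode_sentence(model, "This is an example to check how our model is performing."))
    return train_losses, valid_losses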

 

Now let’s run our training using:

 

train_losses,valid_losses = train(train_iter, val_iter, model, optim, 35)

 

Below is the output of the training loop (shown only for some epochs):

 

Epoch [1/35] complete. Train Loss: 86.092. Val Loss: 64.514
Original Sentence: This is an example to check how our model is performing.
Translated Sentence: Und die der der der der der der der der der der der der der der der der der der der der der der der

Epoch [2/35] complete. Train Loss: 59.769. Val Loss: 55.631
Original Sentence: This is an example to check how our model is performing.
Translated Sentence: Das ist ein paar paar paar sehr , die das ist ein paar sehr Jahre . </s>

.
.
.
.

Epoch [16/35] complete. Train Loss: 21.791. Val Loss: 28.000
Original Sentence: This is an example to check how our model is performing.
Translated Sentence: Hier ist ein Beispiel , um zu prüfen , wie unser Modell aussieht . Das ist ein Modell . </s>

.
.
.
.

Epoch [34/35] complete. Train Loss: 9.492. Val Loss: 31.005
Original Sentence: This is an example to check how our model is performing.
Translated Sentence: Hier ist ein Beispiel , um prüfen zu überprüfen , wie unser Modell ist . Wir spielen . </s>

Epoch [35/35] complete. Train Loss: 9.014. Val Loss: 32.097
Original Sentence: This is an example to check how our model is performing.
Translated Sentence: Hier ist ein Beispiel , um prüfen wie unser Modell ist . Wir spielen . </s>

We can see how our model begins with a gibberish translation (“Und die der der der der der der der der der der der der der der der der der der der der der der der”) but gives us something much more understandable after a few epochs.

 

Results

We can plot the training and validation losses using Plotly express.

 

import pandas as pd
import plotly.express as px

losses = pd.DataFrame({'train_loss': train_losses, 'val_loss': valid_losses})
px.line(losses, y=['train_loss', 'val_loss'])

 

If we want to deploy this model we can load it by simply using:

 

model.load_state_dict(torch.load("checkpoint_best_epoch.pt"))

 

We can also predict the translation for any source sentence using the greedy_decode_sentence function.


 

This function builds the prediction piece by piece. The greedy search works as follows:

  • Passing the whole English sentence as encoder input and just the start token <s> as shifted output (input to the decoder) to the model and doing the forward pass.
  • The model will predict the next word — der
  • Then, we pass the whole English sentence as encoder input, add the last predicted word to the shifted output (input to the decoder = <s> der), and do the forward pass.
  • The model will predict the next word — schnelle
  • Passing the whole English sentence as encoder input and <s> der schnelle as shifted output (input to the decoder) to the model and doing the forward pass.
  • and so on, until the model predicts the end token </s> or we reach some maximum number of generated tokens (something we can define), so the translation doesn’t run forever if the model never produces an end token.
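The function itself isn’t reproduced here, so below is a minimal sketch of such a greedy decoder, consistent with the steps above and with the MyTransformer interface sketched earlier; the implementation in the repository may differ in its details.

def greedy_decode_sentence(model, sentence, max_len=60):
    model.eval()
    # Numericalize the English sentence; unknown words fall back to the <unk> index
    src_indices = [SRC.vocab.stoi[tok] for tok in tokenize_en(sentence)]
    src = torch.LongTensor(src_indices).unsqueeze(1).cuda()          # (src_len, 1)
    trg_indices = [TGT.vocab.stoi[BOS_WORD]]                         # start with <s>
    with torch.no_grad():
        for _ in range(max_len):
            trg = torch.LongTensor(trg_indices).unsqueeze(1).cuda()  # (trg_len, 1)
            size = trg.size(0)
            tgt_mask = torch.triu(torch.full((size, size), float('-inf'), device=src.device), diagonal=1)
            preds = model(src, trg, tgt_mask=tgt_mask)               # (trg_len, 1, vocab)
            next_word = preds[-1, 0].argmax().item()                 # most probable next token
            trg_indices.append(next_word)
            if next_word == TGT.vocab.stoi[EOS_WORD]:                # stop at </s>
                break
    return " ".join(TGT.vocab.itos[i] for i in trg_indices[1:])      # drop the leading <s>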

Now we can translate any sentence using this:

 

sentence = "Isn't Natural language processing just awesome? Please do let me know in the comments."

print(greedy_decode_sentence(model, sentence))

------------------------------------------------------------------

Ist es nicht einfach toll ? Bitte lassen Sie mich gerne in den Kommentare kennen . </s>

 

Since I don’t have a German translator at hand, I’ll use the next best thing to see how our NLP transformer model is performing. Let’s use Google’s translation service to understand what this German sentence means.

There are a few obvious mistakes such as a noticeably missing “Natural Language Processing” (ironic?), but it seems like a good enough translation considering the neural network can now understand the structure of both languages after only an hour of training.

 

Caveats/Improvements

We might have achieved better results if we did everything in the same way the paper did:

  • Train on whole data
  • Byte Pair Encoding
  • Learning Rate Scheduling
  • KL Divergence loss with label smoothing (see the sketch after this list)
  • Beam search, and
  • Checkpoint ensembling
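As an example of the loss point above, the paper’s label-smoothed KL divergence loss can be sketched roughly as follows, assuming a smoothing value of 0.1 as in the paper. It expects log-probabilities, so in the training loop you would apply F.log_softmax to the model output and use this criterion in place of F.cross_entropy.

import torch
import torch.nn as nn

class LabelSmoothingLoss(nn.Module):
    # KL divergence against a label-smoothed target distribution, as in the paper
    def __init__(self, vocab_size, padding_idx, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.vocab_size = vocab_size
        self.padding_idx = padding_idx
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (N, vocab) log-probabilities, target: (N,) word indices
        true_dist = torch.full_like(log_probs, self.smoothing / (self.vocab_size - 2))
        true_dist.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        true_dist[:, self.padding_idx] = 0
        true_dist[target == self.padding_idx] = 0   # no loss on padding positions
        return self.criterion(log_probs, true_dist)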

I discussed all of these in my last post, and they’re all easy to implement. However, I didn’t include them here because my goal was to understand how a transformer works, and I didn’t want to make things more complicated. That said, there have been many advancements on top of transformers that have resulted in better translation models, and I intend to discuss those in an upcoming post about BERT, a popular NLP model that uses a transformer at its core.

In this post, we looked at the transformer architecture in NLP by creating an English-to-German translation network almost from scratch. For a closer look at the code, please visit my GitHub repository, where you can find the code for this post as well as all my posts.

If you’re interested in more technical machine learning articles, you can check out my other articles in the related resources section below. And if you’d like machine learning articles delivered directly to your inbox, you can subscribe to the Lionbridge AI newsletter here.


 

References

The Author
Rahul Agarwal

Rahul is a data scientist currently working with WalmartLabs. He enjoys working with data-intensive problems and is constantly in search of new ideas to work on. Contact him on Twitter: @MLWhiz
