We have discussed the basics of tokenization and models. In this post we'll go over some practical considerations concerning model input and output: passing multiple input sequences to the model, padding, and truncation. We will consider the same code snippet as before.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = ["I really enjoy swimming!"]
tokens = tokenizer(sequence, return_tensors="pt")
output = model(**tokens)

predictions = torch.nn.functional.softmax(output.logits, dim=-1)
highest_probability = predictions[0].max().item()
highest_label = predictions[0].argmax().item()
print("Highest probability:", highest_probability, ", label:",
      model.config.id2label[highest_label])

Let's start with truncation. If, in the code above, we replace the line defining the input sequence with

sequence = ["I really enjoy swimming!"*200]

We get RuntimeError: The size of tensor a (1002) must match the size of tensor b (512) at non-singleton dimension 1. This error occurs because we provided more input tokens than the model can handle: the model accepts at most 512 tokens, while our input produced 1002. Note that these numbers refer to the number of tokens in the tensor tokens, which usually lies somewhere between the number of words and the number of characters of our input. The error can be solved by passing the keyword argument truncation=True to the tokenizer, as in the sketch below.
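A minimal sketch of that fix, reusing the tokenizer and model defined above and assuming this checkpoint's default maximum length of 512 tokens:

sequence = ["I really enjoy swimming!"*200]
tokens = tokenizer(sequence, truncation=True, return_tensors="pt")
# Everything beyond the model's maximum length has been dropped
print(tokens["input_ids"].shape)  # torch.Size([1, 512])
output = model(**tokens)          # no RuntimeError anymore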

We can also provide multiple sequences as input to the model.

sequence = ["I really enjoy swimming!", "I do not like running in circles."]

This gives the error ValueError: Unable to create tensor, you should probably activate truncation and/or padding ... The reason is that the two input sequences have different lengths, while the tokenizer wants to return a single tensor, which requires all tokenized sequences to have the same length. This can be achieved by adding the keyword argument padding=True to the tokenizer call. Doing so gives the following code and output.

sequence = ["I really enjoy swimming!", 
            "I do not like running in circles."]
tokens = tokenizer(sequence, padding=True, 
                   truncation=True, return_tensors="pt")

output = model(**tokens)

predictions = torch.nn.functional.softmax(output.logits, dim=-1)

for i in range(2):
    highest_probability = predictions[i].max().item()
    highest_label = predictions[i].argmax().item()
    print("Highest probability:", highest_probability,
          ", label:", model.config.id2label[highest_label])

Output:

Highest probability: 0.9998099207878113 , label: POSITIVE
Highest probability: 0.9936527013778687 , label: NEGATIVE

Now, what did adding padding actually do? In the token ids below we see some 0s after the end of the first sentence. This is a special token that decodes to [PAD]. Because the tokenizer also returns an attention mask marking these positions as padding, they do not influence the model's behaviour: the input is processed as if the padding weren't there.

[[ 101, 1045, 2428, 5959, 5742,  999,  102,    0,    0,    0],
 [ 101, 1045, 2079, 2025, 2066, 2770, 1999, 7925, 1012,  102]]
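To make this explicit, we can inspect the tokenizer output directly. The sketch below assumes the tokens variable from the padded example above; the decoded string is approximate.

print(tokens["input_ids"])       # the token ids shown above; 0 is the id of [PAD]
print(tokens["attention_mask"])  # 1 for real tokens, 0 for padding positions
print(tokenizer.decode(tokens["input_ids"][0]))
# roughly: [CLS] i really enjoy swimming! [SEP] [PAD] [PAD] [PAD]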

Note that padding and truncation errors occur at different places. The tokenizer happily tokenizes an over-long input sequence; the error is only raised once the resulting tensor is passed to the model. With padding it is the other way around: if two sequences of different lengths are passed jointly to the tokenizer, the tokenizer itself throws an error.
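A quick way to see this difference in practice (a sketch, reusing the long sequence from before):

long_sequence = ["I really enjoy swimming!"*200]
encoding = tokenizer(long_sequence, return_tensors="pt")  # works, possibly with a warning
print(encoding["input_ids"].shape[1])   # 1002 tokens, well beyond the limit
print(tokenizer.model_max_length)       # 512 for this checkpoint
# output = model(**encoding)            # this line would raise the RuntimeError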

Now, on to the output. In the last part of the code we transform the model's raw output (logits) as follows: first we normalize the logits with softmax, which gives us a probability distribution over the possible labels, POSITIVE and NEGATIVE. We then determine which of these labels has the highest probability and print it.

predictions = torch.nn.functional.softmax(output.logits, dim=-1)
highest_probability = predictions[0].max().item()
highest_label = predictions[0].argmax().item()
print("Highest probability:", highest_probability, 
      ", label:", model.config.id2label[highest_label])