LLMs with Hugging Face - Part 2: Tokenizers
This is the second post on using LLMs with the Hugging Face library, based on Chapter 2 of the Hugging Face course.
The following code gives a step-by-step breakdown of what a text-classification pipeline does to generate its output. In this post we focus on the tokenizer, which turns natural language into tokens that the model can interpret.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model for the fine-tuned sentiment checkpoint
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Tokenize the input and run it through the model
sequence = ["I really enjoy swimming!"]
tokens = tokenizer(sequence, return_tensors="pt")
output = model(**tokens)

# Turn the logits into probabilities and report the most likely label
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
highest_probability = predictions[0].max().item()
highest_label = predictions[0].argmax().item()
print("Highest probability:", highest_probability, ", label:", model.config.id2label[highest_label])
Hugging Face introduced auto classes to make retrieving the right architecture for a checkpoint easier. In the AutoTokenizer.from_pretrained call above, the tokenizer associated with "distilbert-base-uncased-finetuned-sst-2-english" is loaded. If we print the tokens obtained from the code above, we get
{'input_ids': tensor([[ 101, 1045, 2428, 5959, 5742, 999, 102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
We can get the token-to-ID mapping with the .get_vocab() method, as in the code below.
vocab = tokenizer.get_vocab()
print(vocab)
Running this shows all mappings: {'##wara': 11872, '[unused786]': 791, '1839': 10011, 'click': 11562, 'alternating': 15122, '##nable': 22966, 'drives': 9297, 'cannabis': 17985, 'likes': 7777, 'designate': 24414, 'blanket': 8768, 'harold': 7157, '##wc': 16526, '1763': 18432, 'robber': 27307, '[unused852]': 857, 'starters': 29400, 'committing': 16873, 'difficulties': 8190, 'fortunate': 19590, 'decorate': 29460, 'abused': 16999, 'er': 9413, 'hiroshima': 20168, '[unused700]': 705, ...
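Since the mapping goes from token strings to IDs, we can also look up individual entries. A minimal sketch (the specific IDs are tied to this checkpoint's vocabulary):
# Look up individual entries in the vocabulary
print(len(vocab))          # 30522 entries for this tokenizer
print(vocab["swimming"])   # 5742, the same ID that appears in input_ids above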
We see that most entries in the vocabulary are plain words, but some of them contain ##. We'll get to those in a moment. We can use the method tokenizer.decode([token1, token2, ...]) to decode the sentence we obtained above. Going over the tokens one by one gives the following.
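For example, a minimal loop like the one below (reusing the tokenizer and tokens variables from the snippet above) prints each token ID next to its decoded string.
# Decode each input ID of our sentence individually
for token_id in tokens["input_ids"][0]:
    print(token_id.item(), "->", repr(tokenizer.decode([token_id.item()])))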
101 -> '[CLS]'
1045 -> 'i'
2428 -> 'really'
5959 -> 'enjoy'
5742 -> 'swimming'
999 -> '!'
102 -> '[SEP]'
We see that all words are in lower case. Some models, including this one, work this way; it is what uncased in distilbert-base-uncased refers to. By lowercasing the text, the number of distinct tokens the vocabulary needs to cover shrinks considerably while most of the meaning of the text is preserved. Furthermore, we have obtained two extra tokens, 101 and 102, which map to [CLS] and [SEP], respectively. These are special tokens: [CLS] is placed by BERT-style tokenizers at the start of every input sequence, and [SEP] marks a separation in content, such as the boundary between sentences or the end of the sequence.
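The tokenizer also exposes these special tokens as attributes, so we can inspect them directly. A small sketch (the attribute names below come from the standard transformers tokenizer API):
# Inspect the special tokens used by this tokenizer
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.all_special_tokens)                 # the full list of special tokens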
There are some more special tokens, including [UNK], which indicates an unknown word. I tried to find one. My guess was that it wouldn't be too hard, since running tokenizer.vocab_size shows that our model has 30,522 tokens, while Merriam-webster.com claims: "Webster's Third New International Dictionary, Unabridged, together with its 1993 Addenda Section, includes some 476,000 entries." Let's have a look at xylophone, which does not appear directly in the model's vocabulary. Encoding it and decoding the result token by token gives the following mapping.
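A minimal sketch of this check, reusing the tokenizer loaded earlier:
# Encode "xylophone" and decode each resulting ID individually
ids = tokenizer("xylophone")["input_ids"]
for token_id in ids:
    print(token_id, "->", repr(tokenizer.decode([token_id])))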
101 -> '[CLS]'
1060 -> 'x'
8516 -> '##yl'
25232 -> '##ophone'
102 -> '[SEP]'
Here we see the use of the ##. It indicates that the token continues the word started by the tokens before it, which allows complex words to be built from subword pieces. Now, we have glossed over one detail: how does a model give meaning to tokens? We will get to that in two posts.