BERT is an early transformer-based model for NLP tasks that is small and fast enough to train on a home computer. Like all deep learning models, it requires a tokenizer to convert text into integer tokens. This article shows how to train a WordPiece tokenizer following BERT's original design.
Let's get started.
Training a Tokenizer for BERT Models
Photo by JOHN TOWNER. Some rights reserved.
Overview
This article is divided into two parts; they are:
- Choosing a Dataset
- Training a Tokenizer
Choosing a Dataset
To keep things simple, we'll use English text only. WikiText is a popular preprocessed dataset for experiments, available via the Hugging Face datasets library:
```python
import random

from datasets import load_dataset

# path and name of the dataset
path, name = "wikitext", "wikitext-2-raw-v1"
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Print a few samples
for idx in random.sample(range(len(dataset)), 5):
    text = dataset[idx]["text"].strip()
    print(f"{idx}: {text}")
```
On the first run, the dataset downloads to ~/.cache/huggingface/datasets and is cached for future use. WikiText-2, used above, is a smaller dataset suitable for quick experiments, while WikiText-103 is larger and more representative of real-world text, making it better for training a model.
The output of this code may look like this:
```
size: 36718
23905: Dudgeon Creek
4242: In 1825 the Congress of Mexico established the Port of Galveston and in 1830 …
7181: Crew : 5
24596: On March 19 , 2007 , Sports Illustrated posted on its website an article in its …
12920: The most recent building included in the list is in the Quantock Hills . The …
```
The dataset contains strings of varying lengths with spaces around punctuation marks. While you could split on whitespace, this wouldn't capture sub-word components. That's what the WordPiece tokenization algorithm is good at.
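As a quick illustration of the difference, here is a minimal sketch contrasting plain whitespace splitting with a WordPiece tokenizer. It borrows the pretrained bert-base-uncased tokenizer from the Hugging Face Hub purely for comparison; we will train our own in the next section.

```python
from tokenizers import Tokenizer

text = "Tokenization handles uncommon words gracefully ."

# Plain whitespace splitting keeps every word whole, however rare it is
print(text.split())

# A WordPiece tokenizer (here the pretrained bert-base-uncased, used for
# illustration only) can break rare words into sub-word pieces marked with "##"
pretrained = Tokenizer.from_pretrained("bert-base-uncased")
print(pretrained.encode(text).tokens)
```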
Training a Tokenizer
Several tokenization algorithms support sub-word components. BERT uses WordPiece, while modern LLMs often use Byte-Pair Encoding (BPE). We'll train a WordPiece tokenizer following BERT's original design.
The tokenizers library implements several tokenization algorithms that can be configured to your needs. It saves you the effort of implementing a tokenization algorithm from scratch. You should install it with the pip command:
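```shell
pip install tokenizers
```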
Let's train a tokenizer:
```python
import tokenizers
from datasets import load_dataset

path, name = "wikitext", "wikitext-103-raw-v1"
vocab_size = 30522
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Collect texts, skipping title lines that start with "="
texts = []
for line in dataset["text"]:
    line = line.strip()
    if line and not line.startswith("="):
        texts.append(line)

# Configure a WordPiece tokenizer with NFKC normalization and special tokens
tokenizer = tokenizers.Tokenizer(tokenizers.models.WordPiece())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
tokenizer.decoder = tokenizers.decoders.WordPiece()
tokenizer.normalizer = tokenizers.normalizers.NFKC()
trainer = tokenizers.trainers.WordPieceTrainer(
    vocab_size=vocab_size,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"],
)

# Train the tokenizer and save it
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
tokenizer_path = f"{name}_wordpiece.json"
tokenizer.save(tokenizer_path, pretty=True)

# Test the tokenizer
tokenizer = tokenizers.Tokenizer.from_file(tokenizer_path)
print(tokenizer.encode("Hello, world!").tokens)
print(tokenizer.decode(tokenizer.encode("Hello, world!").ids))
```
Running this code may print the following output:
```
wikitext-103-raw-v1/train-00000-of-00002(…): 100%|█████| 157M/157M [00:46<00:00, 3.40MB/s]
wikitext-103-raw-v1/train-00001-of-00002(…): 100%|█████| 157M/157M [00:04<00:00, 37.0MB/s]
Generating test split: 100%|███████████████| 4358/4358 [00:00<00:00, 174470.75 examples/s]
Generating train split: 100%|████████| 1801350/1801350 [00:09<00:00, 199210.10 examples/s]
Generating validation split: 100%|█████████| 3760/3760 [00:00<00:00, 201086.14 examples/s]
size: 1801350
[00:00:04] Pre-processing sequences       ████████████████████████████ 0        /        0
[00:00:00] Tokenize words                 ████████████████████████████ 606445   /   606445
[00:00:00] Count pairs                    ████████████████████████████ 606445   /   606445
[00:00:04] Compute merges                 ████████████████████████████ 22020    /    22020
['Hell', '##o', ',', 'world', '!']
Hello, world!
```
This code uses the WikiText-103 dataset. The first run downloads 157MB of data containing 1.8 million lines. The training itself takes only a few seconds. The example shows how "Hello, world!" becomes 5 tokens, with "Hello" split into "Hell" and "##o" (the "##" prefix indicates a sub-word component).
The tokenizer created in the code above has the following properties:
- Vocabulary size: 30,522 tokens (matching the original BERT model)
- Special tokens: [PAD], [CLS], [SEP], [MASK], and [UNK] are added to the vocabulary even though they do not appear in the dataset.
- Pre-tokenizer: Whitespace splitting (since the dataset has spaces around punctuation)
- Normalizer: NFKC normalization for Unicode text. Note that you can also configure the tokenizer to convert everything to lowercase, as the common BERT-uncased model does.
- Algorithm: WordPiece is used. Hence the decoder should be set accordingly so that the "##" prefix for sub-word components is recognized.
- Padding: Enabled with the [PAD] token for batch processing. This is not demonstrated in the code above, but it will be useful when you are training a BERT model; a short sketch follows this list.
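Since batch padding is not shown in the training script, here is a minimal sketch of how it behaves, assuming the tokenizer JSON file saved above. With padding enabled, encode_batch pads every sequence in a batch to the length of the longest one:

```python
from tokenizers import Tokenizer

# Reload the trained tokenizer (file name assumed from the training script above)
tokenizer = Tokenizer.from_file("wikitext-103-raw-v1_wordpiece.json")
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

# Encode a batch: shorter sequences are padded to match the longest one
batch = tokenizer.encode_batch(["Hello, world!", "A longer sentence with a few more words."])
for encoding in batch:
    print(encoding.tokens)          # shorter sequences end with [PAD] tokens
    print(encoding.attention_mask)  # 1 for real tokens, 0 for padding
```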
The tokenizer saves to a fairly large JSON file containing the full vocabulary, allowing you to reload the tokenizer later without retraining.
To convert a string into a list of tokens, you use the syntax tokenizer.encode(text).tokens, in which each token is just a string. For use in a model, you should use tokenizer.encode(text).ids instead, in which the result will be a list of integers. The decode method can be used to convert a list of integers back to a string. This is demonstrated in the code above.
This article demonstrated how to train a WordPiece tokenizer for BERT using the WikiText dataset. You learned how to configure the tokenizer with appropriate normalization and special tokens, and how to encode text into tokens and decode it back into strings. This is just a starting point for tokenizer training. Consider leveraging existing libraries and tools to optimize tokenizer training speed so it doesn't become a bottleneck in your training process.