
fix: add missing pre-tokenizer type for GPT-2 BPE models #508

Open

raphaelbgr wants to merge 1 commit into microsoft:main from raphaelbgr:fix/tokenizer-pre-type

Conversation

@raphaelbgr

Summary

  • Fix BitnetModel.set_vocab() in convert-hf-to-gguf-bitnet.py to call _set_vocab_gpt2() instead of _set_vocab_sentencepiece()
  • Add add_token_pre_type("gpt-2") in convert-ms-to-gguf-bitnet.py for GPT-2 BPE models

Problem

The GGUF conversion scripts do not write the tokenizer.ggml.pre metadata key, causing llama.cpp to fall back to the default pre-tokenizer with the warning:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
GENERATION QUALITY WILL BE DEGRADED! CONSIDER REGENERATING THE MODEL

This results in incoherent/garbage output from llama-cli and run_inference.py, even though the model weights are correct. The issue affects all users running inference via bitnet.cpp.

Root cause

  1. convert-hf-to-gguf-bitnet.py: BitnetModel.set_vocab() calls _set_vocab_sentencepiece(), which hardcodes tokenizer.ggml.pre = "default" and sets tokenizer.ggml.model = "llama". BitNet-b1.58-2B-4T uses a GPT-2 BPE tokenizer (128K vocab, tiktoken-based), not SentencePiece.

  2. convert-ms-to-gguf-bitnet.py: add_meta_vocab() writes add_tokenizer_model() but never calls add_token_pre_type(), leaving the pre-tokenizer field empty in the GGUF.

The default pre-tokenizer applies different regex rules for text splitting, so token boundaries no longer match those the model was trained with, and decoding produces nonsensical output.
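To make the boundary difference concrete, here is a stdlib-only toy comparison. These regexes are simplified illustrations, not the actual llama.cpp pre-tokenizer patterns (the real GPT-2 rule uses Unicode categories and contraction handling):

```python
import re

# Simplified, illustrative split rules -- NOT the exact llama.cpp regexes.
# A GPT-2-style rule keeps a leading space attached to the following word;
# a naive whitespace-splitting rule emits the space as its own piece.
GPT2_STYLE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")
NAIVE_SPLIT = re.compile(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]+|\s+")

text = "Hello world, 2024!"
print(GPT2_STYLE.findall(text))   # ['Hello', ' world', ',', ' 2024', '!']
print(NAIVE_SPLIT.findall(text))  # ['Hello', ' ', 'world', ',', ' ', '2024', '!']
```

The pieces themselves differ (` world` vs. ` ` + `world`), so a BPE vocabulary trained on one splitting scheme maps the other scheme's pieces to entirely different token IDs.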

Fix

  • Change _set_vocab_sentencepiece() to _set_vocab_gpt2() in the HF converter — this correctly detects the pre-tokenizer type via hash matching and writes it to the GGUF.
  • Add add_token_pre_type("gpt-2") in the MS converter for GPT-2 models.
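In miniature, the invariant the fix establishes looks like this. `MockGGUFWriter` is a hypothetical stand-in for the converters' bundled gguf writer; only the two method names mirror the real scripts:

```python
# Minimal sketch of what the fix guarantees: a converted GGUF must carry
# tokenizer.ggml.pre alongside tokenizer.ggml.model.
# MockGGUFWriter is a hypothetical stand-in for the bundled gguf writer.
class MockGGUFWriter:
    def __init__(self):
        self.kv = {}

    def add_tokenizer_model(self, model: str):
        self.kv["tokenizer.ggml.model"] = model

    def add_token_pre_type(self, pre: str):
        self.kv["tokenizer.ggml.pre"] = pre

writer = MockGGUFWriter()
writer.add_tokenizer_model("gpt2")
writer.add_token_pre_type("gpt-2")  # the call this PR adds for GPT-2 BPE models

assert writer.kv["tokenizer.ggml.pre"] == "gpt-2"
```

Before the fix, the second call was simply never made, leaving llama.cpp to substitute "default" at load time.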

Test plan

  • Reconverted BitNet-b1.58-2B-4T from bf16 safetensors with the fix
  • Verified tokenizer.ggml.pre = "gpt-2" is present in the output GGUF
  • No more "missing pre-tokenizer type" warning during model load
  • Coherent text output at ~41 tokens/sec on Apple M4 (ARM, i2_s kernel)
  • Output quality matches HuggingFace transformers reference
  • Benchmark throughput unchanged (no performance regression)
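The metadata check in the test plan can be approximated with a stdlib-only smoke test. This scans raw bytes near the start of the file rather than parsing the GGUF KV table, so it only confirms the key exists, not its value:

```python
# Smoke test: does a converted GGUF contain the tokenizer.ggml.pre key?
# Scans the raw metadata region instead of parsing the full KV table,
# so this is a quick sanity check, not a validator.
def has_pre_tokenizer_key(path: str) -> bool:
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":  # GGUF magic bytes
            raise ValueError(f"{path} is not a GGUF file")
        blob = f.read(1 << 20)  # KV metadata sits near the start of the file
    return b"tokenizer.ggml.pre" in blob
```

A full verification would parse the KV table (or use the gguf Python package) to confirm the stored value is actually "gpt-2".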

Related issues

The pre-built GGUF on HuggingFace (microsoft/BitNet-b1.58-2B-4T-gguf) was also converted without this metadata and has a non-standard chat template. Regenerating it with these fixes would resolve the output quality issues.

Commit message

The GGUF conversion scripts do not write the `tokenizer.ggml.pre`
metadata key for BitNet models, causing llama.cpp to fall back to the
default pre-tokenizer. This produces degraded or incoherent output
with the warning:

  "missing pre-tokenizer type, using: 'default'"
  "GENERATION QUALITY WILL BE DEGRADED!"

Root cause:
- convert-hf-to-gguf-bitnet.py: BitnetModel.set_vocab() calls
  _set_vocab_sentencepiece() which hardcodes pre="default", instead
  of _set_vocab_gpt2() which correctly detects and writes the
  pre-tokenizer type.
- convert-ms-to-gguf-bitnet.py: add_meta_vocab() writes the
  tokenizer model but never writes the pre-tokenizer type.

Fix:
- Change BitnetModel.set_vocab() to call _set_vocab_gpt2()
- Add add_token_pre_type("gpt-2") in add_meta_vocab() for GPT-2 models

Tested on Mac Mini M4 (ARM64) with BitNet-b1.58-2B-4T: reconverted
model produces coherent output at ~41 tokens/sec via bitnet.cpp,
matching the quality seen through HuggingFace transformers.
@raphaelbgr
Author

@microsoft-github-policy-service agree
