
fix: add missing pre-tokenizer type for GPT-2 BPE models #508

Open

raphaelbgr wants to merge 1 commit into microsoft:main from raphaelbgr:fix/tokenizer-pre-type

Conversation

@raphaelbgr

Summary

  • Fix BitnetModel.set_vocab() in convert-hf-to-gguf-bitnet.py to call _set_vocab_gpt2() instead of _set_vocab_sentencepiece()
  • Add add_token_pre_type("gpt-2") in convert-ms-to-gguf-bitnet.py for GPT-2 BPE models

Problem

The GGUF conversion scripts do not write the tokenizer.ggml.pre metadata key, causing llama.cpp to fall back to the default pre-tokenizer with the warning:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
GENERATION QUALITY WILL BE DEGRADED! CONSIDER REGENERATING THE MODEL

This results in incoherent/garbage output from llama-cli and run_inference.py, even though the model weights are correct. The issue affects all users running inference via bitnet.cpp.

Root cause

  1. convert-hf-to-gguf-bitnet.py: BitnetModel.set_vocab() calls _set_vocab_sentencepiece(), which hardcodes tokenizer.ggml.pre = "default" and sets tokenizer.ggml.model = "llama". BitNet-b1.58-2B-4T uses a GPT-2 BPE tokenizer (128K vocab, tiktoken-based), not SentencePiece.

  2. convert-ms-to-gguf-bitnet.py: add_meta_vocab() writes add_tokenizer_model() but never calls add_token_pre_type(), leaving the pre-tokenizer field empty in the GGUF.

The default pre-tokenizer applies different regex rules for text splitting, so token boundaries no longer match those the model was trained with, and decoding produces nonsensical output.
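To make the boundary difference concrete, here is a stdlib-only toy comparison. These regexes are simplified illustrations, not the actual llama.cpp pre-tokenizer patterns (the real GPT-2 rule uses Unicode categories and contraction handling):

```python
import re

# Simplified, illustrative split rules -- NOT the exact llama.cpp regexes.
# A GPT-2-style rule keeps a leading space attached to the following word;
# a naive whitespace-splitting rule emits the space as its own piece.
GPT2_STYLE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")
NAIVE_SPLIT = re.compile(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]+|\s+")

text = "Hello world, 2024!"
print(GPT2_STYLE.findall(text))   # ['Hello', ' world', ',', ' 2024', '!']
print(NAIVE_SPLIT.findall(text))  # ['Hello', ' ', 'world', ',', ' ', '2024', '!']
```

The pieces themselves differ (` world` vs. ` ` + `world`), so a BPE vocabulary trained on one splitting scheme maps the other scheme's pieces to entirely different token IDs.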

Fix

  • Change _set_vocab_sentencepiece() to _set_vocab_gpt2() in the HF converter — this correctly detects the pre-tokenizer type via hash matching and writes it to the GGUF.
  • Add add_token_pre_type("gpt-2") in the MS converter for GPT-2 models.
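In miniature, the invariant the fix establishes looks like this. `MockGGUFWriter` is a hypothetical stand-in for the converters' bundled gguf writer; only the two method names mirror the real scripts:

```python
# Minimal sketch of what the fix guarantees: a converted GGUF must carry
# tokenizer.ggml.pre alongside tokenizer.ggml.model.
# MockGGUFWriter is a hypothetical stand-in for the bundled gguf writer.
class MockGGUFWriter:
    def __init__(self):
        self.kv = {}

    def add_tokenizer_model(self, model: str):
        self.kv["tokenizer.ggml.model"] = model

    def add_token_pre_type(self, pre: str):
        self.kv["tokenizer.ggml.pre"] = pre

writer = MockGGUFWriter()
writer.add_tokenizer_model("gpt2")
writer.add_token_pre_type("gpt-2")  # the call this PR adds for GPT-2 BPE models

assert writer.kv["tokenizer.ggml.pre"] == "gpt-2"
```

Before the fix, the second call was simply never made, leaving llama.cpp to substitute "default" at load time.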

Test plan

  • Reconverted BitNet-b1.58-2B-4T from bf16 safetensors with the fix
  • Verified tokenizer.ggml.pre = "gpt-2" is present in the output GGUF
  • No more "missing pre-tokenizer type" warning during model load
  • Coherent text output at ~41 tokens/sec on Apple M4 (ARM, i2_s kernel)
  • Output quality matches HuggingFace transformers reference
  • Benchmark throughput unchanged (no performance regression)
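The metadata check in the test plan can be approximated with a stdlib-only smoke test. This scans raw bytes near the start of the file rather than parsing the GGUF KV table, so it only confirms the key exists, not its value:

```python
# Smoke test: does a converted GGUF contain the tokenizer.ggml.pre key?
# Scans the raw metadata region instead of parsing the full KV table,
# so this is a quick sanity check, not a validator.
def has_pre_tokenizer_key(path: str) -> bool:
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":  # GGUF magic bytes
            raise ValueError(f"{path} is not a GGUF file")
        blob = f.read(1 << 20)  # KV metadata sits near the start of the file
    return b"tokenizer.ggml.pre" in blob
```

A full verification would parse the KV table (or use the gguf Python package) to confirm the stored value is actually "gpt-2".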

Related issues

The pre-built GGUF on HuggingFace (microsoft/BitNet-b1.58-2B-4T-gguf) was also converted without this metadata and has a non-standard chat template. Regenerating it with these fixes would resolve the output quality issues.

Commit message

The GGUF conversion scripts do not write the `tokenizer.ggml.pre`
metadata key for BitNet models, causing llama.cpp to fall back to the
default pre-tokenizer. This produces degraded or incoherent output
with the warning:

  "missing pre-tokenizer type, using: 'default'"
  "GENERATION QUALITY WILL BE DEGRADED!"

Root cause:
- convert-hf-to-gguf-bitnet.py: BitnetModel.set_vocab() calls
  _set_vocab_sentencepiece() which hardcodes pre="default", instead
  of _set_vocab_gpt2() which correctly detects and writes the
  pre-tokenizer type.
- convert-ms-to-gguf-bitnet.py: add_meta_vocab() writes the
  tokenizer model but never writes the pre-tokenizer type.

Fix:
- Change BitnetModel.set_vocab() to call _set_vocab_gpt2()
- Add add_token_pre_type("gpt-2") in add_meta_vocab() for GPT-2 models

Tested on Mac Mini M4 (ARM64) with BitNet-b1.58-2B-4T: reconverted
model produces coherent output at ~41 tokens/sec via bitnet.cpp,
matching the quality seen through HuggingFace transformers.
@raphaelbgr
Author

@microsoft-github-policy-service agree
