microsoft · mohitcr77 · Mar 22, 2026
diff --git a/docs/I2_s.md b/docs/I2_s.md
@@ -0,0 +1,57 @@
+# I2_S Quantization Format
+
+I2_S is the quantization format used to store BitNet b1.58 ternary weights in GGUF files.
+It packs 4 ternary values per byte using a block-interleaved layout.
+
+This document is intended for developers building alternative inference runtimes
+(WebGPU, Vulkan, Metal, etc.) who need to load and dequantize BitNet weights directly.
+
+### Ternary encoding
+
+Each weight is one of three values, stored in 2 bits:
+
+| Bits | Value |
+|------|-------|
+| 00 | 0     |
+| 01 | +1    |
+| 10 | -1    |
+| 11 | unused |
+
+### Block layout
+
+Weights are stored in blocks of 128 elements (32 bytes each).
+Within a block, the 128 elements are split into 4 groups of 32.
+Each byte encodes one element from each group:
+```
+bits [7:6] → element at position gp        (group 0, offset 0)
+bits [5:4] → element at position 32 + gp   (group 1, offset 32)
+bits [3:2] → element at position 64 + gp   (group 2, offset 64)
+bits [1:0] → element at position 96 + gp   (group 3, offset 96)
+```
+
+To extract the element at logical index `k`:
+```python
+block       = k // 128
+pos         = k % 128
+group       = pos // 32
+gp          = pos % 32
+byte_offset = block * 32 + gp
+shift       = 6 - 2 * group
+value       = (byte >> shift) & 0x03
+```
+
+### Scale factor
+
+The total byte size per tensor is:
+```
+ceil(num_elements / 4) + 32
+```
+
+The trailing 32 bytes store a single `float32` scale value, replicated 8 times.
+
+### GGUF notes
+
+- Type ID is **36** in the Eddie-Wang1120/llama.cpp fork (not type 27, which is I64 in upstream ggml)
+- GGUF metadata uses architecture prefix `bitnet-25`, not `bitnet` or `llama`
+- `token_embd.weight` is stored as F16 (type 1) — embeddings are not quantized to I2_S
+- There is no `output.weight` tensor — the model uses tied embeddings (`lm_head` reuses `token_embd`)