Implementation:Ggml org Llama cpp Common Unicode
| Knowledge Sources | |
|---|---|
| Domains | Unicode, Streaming |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements UTF-8 codepoint parsing for Unicode text that may arrive incrementally, as in streaming.
Description
`utf8_sequence_length` determines the expected byte count of a UTF-8 sequence from its first byte, using a lookup table indexed by the byte's top 4 bits. `parse_utf8_codepoint` decodes a single codepoint from a `std::string_view` at a given byte offset, handling 1-byte ASCII (fast path) and 2-, 3-, and 4-byte sequences. It returns a status: SUCCESS (with the decoded codepoint and the number of bytes consumed), INCOMPLETE (the buffer ends before the sequence does, which matters for streaming), or INVALID (malformed sequence). Each multi-byte path checks that every continuation byte has the form `10xxxxxx`, i.e. `(byte & 0xC0) == 0x80`.
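The behavior described above can be sketched as follows. This is a minimal illustration, not the code in `common/unicode.cpp`: the names `sketch_status`, `sketch_result`, `sketch_sequence_length`, and `sketch_parse` are hypothetical, the table value for stray continuation bytes is an assumption, and checks such as overlong-encoding or surrogate rejection are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-ins for the documented types; names are illustrative only.
enum class sketch_status { SUCCESS, INCOMPLETE, INVALID };

struct sketch_result {
    sketch_status status;
    uint32_t      codepoint      = 0;
    size_t        bytes_consumed = 0;
};

// Expected sequence length, looked up by the top 4 bits of the first byte.
// 0x0-0x7: ASCII (1 byte); 0x8-0xB: continuation byte, reported as 1 here (assumption);
// 0xC-0xD: 2 bytes; 0xE: 3 bytes; 0xF: 4 bytes.
inline size_t sketch_sequence_length(unsigned char first_byte) {
    static const size_t lookup[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4};
    return lookup[first_byte >> 4];
}

inline sketch_result sketch_parse(std::string_view input, size_t offset) {
    if (offset >= input.size()) {
        return {sketch_status::INCOMPLETE};  // nothing to read yet
    }
    const unsigned char first = static_cast<unsigned char>(input[offset]);
    if (first < 0x80) {
        return {sketch_status::SUCCESS, first, 1};  // ASCII fast path
    }
    if ((first & 0xC0) == 0x80 || first >= 0xF8) {
        return {sketch_status::INVALID};  // stray continuation byte or impossible lead byte
    }
    const size_t len = sketch_sequence_length(first);
    // Payload bits of the lead byte: 0x1F for 2 bytes, 0x0F for 3 bytes, 0x07 for 4 bytes.
    uint32_t cp = first & (0xFF >> (len + 1));
    for (size_t i = 1; i < len; ++i) {
        if (offset + i >= input.size()) {
            return {sketch_status::INCOMPLETE};  // buffer ends mid-sequence
        }
        const unsigned char c = static_cast<unsigned char>(input[offset + i]);
        if ((c & 0xC0) != 0x80) {
            return {sketch_status::INVALID};  // not of the form 10xxxxxx
        }
        cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
    }
    return {sketch_status::SUCCESS, cp, len};
}
```

Continuation bytes are checked in order, so a malformed byte that is already present is reported as INVALID even when later bytes are still missing.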
Usage
Use this module for correct UTF-8 handling in streaming contexts where input may arrive in chunks. It is used by the console and chat parsing subsystems to decode Unicode codepoints incrementally without requiring the entire string to be available.
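One way to picture the chunked-input problem is a small buffer that holds back a trailing, unfinished sequence until the next chunk arrives. The sketch below is an alternative, self-contained technique that inspects lead bytes directly; it does not call this module, and `complete_prefix_length` and `utf8_chunk_buffer` are hypothetical names. With the real API, the same condition is signalled by an INCOMPLETE status.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns how many leading bytes of `buf` can be decoded without waiting for more input.
// A UTF-8 sequence is at most 4 bytes, so an unfinished one must start in the last 3 bytes.
inline size_t complete_prefix_length(std::string_view buf) {
    const size_t n = buf.size();
    for (size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(buf[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue;  // continuation byte: keep looking backwards for the lead byte
        }
        // Found the last lead byte; work out how many bytes its sequence needs.
        const size_t need = c >= 0xF0 ? 4 : c >= 0xE0 ? 3 : c >= 0xC0 ? 2 : 1;
        return need > back ? n - back : n;  // hold the sequence back if it is cut off
    }
    return n;  // empty buffer, or only continuation bytes in the tail (left to the decoder)
}

// Accumulates chunks and releases only text that ends on a sequence boundary.
struct utf8_chunk_buffer {
    std::string pending;

    std::string feed(std::string_view chunk) {
        pending.append(chunk);
        const size_t ready = complete_prefix_length(pending);
        std::string out = pending.substr(0, ready);
        pending.erase(0, ready);  // keep only the unfinished tail
        return out;
    }
};
```

Feeding `"caf\xC3"` and then `"\xA9!"` releases `"caf"` first and the completed `"\xC3\xA9!"` second, so a decoder never sees half of a character.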
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: common/unicode.cpp
- Lines: 1-64
Signature
size_t utf8_sequence_length(unsigned char first_byte);
utf8_parse_result parse_utf8_codepoint(std::string_view input, size_t offset);
Import
#include "unicode.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| first_byte | unsigned char | Yes | The first byte of a UTF-8 sequence to determine its expected length |
| input | std::string_view | Yes | The input buffer containing UTF-8 encoded text |
| offset | size_t | Yes | Byte offset within the input to start parsing from |
Outputs
| Name | Type | Description |
|---|---|---|
| sequence length | size_t | Expected number of bytes in the UTF-8 sequence (1-4) |
| parse result | utf8_parse_result | Contains the status (SUCCESS, INCOMPLETE, INVALID) and, on SUCCESS, the decoded codepoint and the number of bytes consumed (1-4) |
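The three statuses in the table above suggest a simple decode loop: advance on SUCCESS, stop and wait for more input on INCOMPLETE, and apply some recovery policy on INVALID. The sketch below is self-contained, so it uses a stand-in result type `demo_result` and a toy parser `toy_parse` that handles only ASCII and 2-byte sequences; both are hypothetical. Replacing an invalid byte with U+FFFD and skipping one byte is an assumed policy, not something this module prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Stand-in for the documented result type (assumption: same field and status names).
struct demo_result {
    enum status_t { SUCCESS, INCOMPLETE, INVALID } status;
    uint32_t codepoint;
    size_t   bytes_consumed;
};

using demo_parser = demo_result (*)(std::string_view, size_t);

// Decodes as many codepoints as possible and returns the offset of the first
// byte that was not consumed (equal to input.size() if everything was decoded).
inline size_t decode_available(std::string_view input, demo_parser parse,
                               std::vector<uint32_t> & out) {
    size_t offset = 0;
    while (offset < input.size()) {
        const demo_result r = parse(input, offset);
        if (r.status == demo_result::INCOMPLETE) {
            break;  // keep the tail and retry once more bytes have arrived
        }
        if (r.status == demo_result::INVALID) {
            out.push_back(0xFFFD);  // assumed policy: emit the replacement character
            offset += 1;            // and resynchronize one byte later
            continue;
        }
        out.push_back(r.codepoint);
        offset += r.bytes_consumed;
    }
    return offset;
}

// Toy parser used only to exercise the loop: ASCII and 2-byte sequences.
inline demo_result toy_parse(std::string_view in, size_t off) {
    const unsigned char b0 = static_cast<unsigned char>(in[off]);
    if (b0 < 0x80) return {demo_result::SUCCESS, b0, 1};
    if ((b0 & 0xE0) != 0xC0) return {demo_result::INVALID, 0, 0};
    if (off + 1 >= in.size()) return {demo_result::INCOMPLETE, 0, 0};
    const unsigned char b1 = static_cast<unsigned char>(in[off + 1]);
    if ((b1 & 0xC0) != 0x80) return {demo_result::INVALID, 0, 0};
    return {demo_result::SUCCESS, static_cast<uint32_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)), 2};
}
```

Returning the stop offset lets the caller keep the undecoded tail and prepend it to the next chunk.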
Usage Examples
#include <string_view>
#include "unicode.h"
// Parse ASCII character
std::string_view text = "Hello";
auto result = parse_utf8_codepoint(text, 0);
// result.status == utf8_parse_result::SUCCESS
// result.codepoint == 'H' (72)
// result.bytes_consumed == 1
// Parse a multi-byte character (U+00E9, "é", encoded as the two bytes 0xC3 0xA9)
std::string_view utf8_text = "\xC3\xA9";
auto result2 = parse_utf8_codepoint(utf8_text, 0);
// result2.status == utf8_parse_result::SUCCESS
// result2.codepoint == 0xE9
// result2.bytes_consumed == 2
// Handle incomplete streaming input
std::string_view partial = "\xC3"; // first byte of 2-byte sequence, missing continuation
auto result3 = parse_utf8_codepoint(partial, 0);
// result3.status == utf8_parse_result::INCOMPLETE
// Get the expected sequence length from a lead byte
size_t len = utf8_sequence_length(0xC3); // returns 2 (0xC3 >> 4 == 0xC, a 2-byte lead)