Implementation:Ggml org Llama cpp Common Unicode Header
| Knowledge Sources | |
|---|---|
| Domains | Unicode, Streaming |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares UTF-8 parsing utilities for streaming-aware decoding of Unicode codepoints.
Description
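Defines the `utf8_parse_result` struct, which holds a decoded codepoint (`uint32_t`), a `bytes_consumed` count (1-4 for a decoded codepoint), and a `status` enum (`SUCCESS`, `INCOMPLETE`, `INVALID`). The constructor takes the status first and defaults the codepoint and byte count to 0, so a non-success result can be built from the status alone. The header also declares `utf8_sequence_length`, which returns the expected byte length of a sequence given its first byte, and `parse_utf8_codepoint`, which decodes a single codepoint from a `std::string_view` starting at a given byte offset.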
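To illustrate how a sequence length can be derived from a first byte, here is a minimal sketch based on the standard UTF-8 lead-byte bit patterns. The name `sketch_sequence_length` is hypothetical and this is not the library's implementation of `utf8_sequence_length`; in particular, returning 0 for a byte that cannot start a sequence is a convention chosen for this sketch, and the real function may behave differently.

```cpp
#include <cstddef>

// Hypothetical stand-in, not the library's utf8_sequence_length.
// Classifies a lead byte by its high bits, following the UTF-8 encoding rules.
std::size_t sketch_sequence_length(unsigned char first_byte) {
    if ((first_byte & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((first_byte & 0xE0) == 0xC0) return 2; // 110xxxxx: lead of a 2-byte sequence
    if ((first_byte & 0xF0) == 0xE0) return 3; // 1110xxxx: lead of a 3-byte sequence
    if ((first_byte & 0xF8) == 0xF0) return 4; // 11110xxx: lead of a 4-byte sequence
    return 0; // continuation byte (10xxxxxx) or other non-lead byte; sketch-only convention
}
```

Because only the high bits of the first byte are inspected, the result is an expectation, not a guarantee: the following bytes still have to be checked as continuation bytes before the sequence can be treated as valid.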
Usage
Include this header in any module that needs safe UTF-8 handling when text arrives incrementally, for example when a multi-byte sequence may be split across chunks of a stream. It provides the minimal Unicode interface that the common library needs for incremental codepoint decoding.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: common/unicode.h
- Lines: 1-22
Signature
struct utf8_parse_result {
    uint32_t codepoint;
    size_t   bytes_consumed;
    enum status { SUCCESS, INCOMPLETE, INVALID } status;

    utf8_parse_result(enum status s, uint32_t cp = 0, size_t bytes = 0);
};

size_t utf8_sequence_length(unsigned char first_byte);
utf8_parse_result parse_utf8_codepoint(std::string_view input, size_t offset);
Import
#include "unicode.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| first_byte | unsigned char | Yes | First byte of a UTF-8 sequence for length determination |
| input | std::string_view | Yes | Buffer containing UTF-8 encoded data |
| offset | size_t | Yes | Starting byte offset in the input |
Outputs
| Name | Type | Description |
|---|---|---|
| sequence length | size_t | Expected byte count for the UTF-8 sequence (1-4) |
| result | utf8_parse_result | Decoded codepoint, bytes consumed, and status (SUCCESS/INCOMPLETE/INVALID) |
Usage Examples
#include <string_view>

#include "unicode.h"

int main() {
    // Determine the expected sequence length from a first byte
    size_t len = utf8_sequence_length(0xE4); // 0xE4 starts a 3-byte sequence, so this returns 3

    // Parse the codepoint at offset 0 and branch on the status
    std::string_view buf = "...";
    auto res = parse_utf8_codepoint(buf, 0);
    switch (res.status) {
        case utf8_parse_result::SUCCESS:
            // Use res.codepoint, then advance by res.bytes_consumed
            break;
        case utf8_parse_result::INCOMPLETE:
            // The buffer ends mid-sequence: wait for more data from the stream
            break;
        case utf8_parse_result::INVALID:
            // Malformed UTF-8 sequence
            break;
    }
    return 0;
}
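The three statuses matter most when a multi-byte sequence is split across stream chunks. The following is a minimal, self-contained sketch of that pattern. So that it compiles without the header, it uses hypothetical stand-ins (`sketch_result`, `sketch_parse`) that mirror the declared interface; they are not the library's implementation, and the stand-in decoder is simplified (it does not reject overlong encodings or surrogates). The helper `drain_complete` and its policy of replacing an invalid byte with U+FFFD are likewise assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Stand-in mirroring the declared utf8_parse_result (hypothetical name).
struct sketch_result {
    uint32_t    codepoint;
    std::size_t bytes_consumed;
    enum status { SUCCESS, INCOMPLETE, INVALID } status;
    sketch_result(enum status s, uint32_t cp = 0, std::size_t bytes = 0)
        : codepoint(cp), bytes_consumed(bytes), status(s) {}
};

// Simplified stand-in decoder: validates lead and continuation bytes only.
sketch_result sketch_parse(std::string_view in, std::size_t offset) {
    if (offset >= in.size()) return sketch_result(sketch_result::INCOMPLETE);
    const unsigned char b0 = static_cast<unsigned char>(in[offset]);
    std::size_t len = 0;
    uint32_t cp = 0;
    if      (b0 < 0x80)           { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return sketch_result(sketch_result::INVALID); // not a lead byte
    for (std::size_t i = 1; i < len; ++i) {
        // Running out of input mid-sequence means "wait for more data".
        if (offset + i >= in.size()) return sketch_result(sketch_result::INCOMPLETE);
        const unsigned char b = static_cast<unsigned char>(in[offset + i]);
        if ((b & 0xC0) != 0x80) return sketch_result(sketch_result::INVALID);
        cp = (cp << 6) | (b & 0x3F);
    }
    return sketch_result(sketch_result::SUCCESS, cp, len);
}

// Decodes every complete codepoint in `pending`, appends it to `out`, and
// erases the decoded bytes. A truncated trailing sequence stays in `pending`
// until the next chunk is appended. Invalid bytes are replaced with U+FFFD
// and skipped one at a time (a policy chosen for this sketch).
void drain_complete(std::string & pending, std::vector<uint32_t> & out) {
    std::size_t pos = 0;
    while (pos < pending.size()) {
        const sketch_result res = sketch_parse(pending, pos);
        if (res.status == sketch_result::INCOMPLETE) break;
        if (res.status == sketch_result::INVALID) {
            out.push_back(0xFFFD);
            pos += 1;
            continue;
        }
        out.push_back(res.codepoint);
        pos += res.bytes_consumed;
    }
    pending.erase(0, pos);
}
```

A caller would append each incoming chunk to `pending` and then call `drain_complete`. Keeping the undecoded tail in the buffer, instead of treating it as an error, is what distinguishes `INCOMPLETE` from `INVALID` in this pattern.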