Implementation:Ggml org Llama cpp Common Unicode Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Unicode, Streaming
Last Updated	2026-02-15 00:00 GMT

Overview

Declares UTF-8 parsing utilities for streaming-aware Unicode codepoint decoding.

Description

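Defines the `utf8_parse_result` struct, which holds a decoded `codepoint` (`uint32_t`), a `bytes_consumed` count (`size_t`, 1-4 for a decoded codepoint), and a `status` enum (`SUCCESS`, `INCOMPLETE`, `INVALID`). Its constructor takes the status first and defaults the codepoint and byte count to 0, so a failure result can be built from the status alone. Declares `utf8_sequence_length`, which determines the expected byte length of a sequence from its first byte, and `parse_utf8_codepoint`, which decodes a single codepoint from a `std::string_view` starting at a given byte offset.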
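
To make the first-byte length rule concrete, the following is a minimal sketch of how a lead byte can be classified by its high bits. The function name `sketch_utf8_sequence_length` is hypothetical and this is not the llama.cpp implementation; returning 0 for a byte that cannot start a sequence is an assumption of the sketch, not something this page states.

```cpp
#include <cstddef>

// Hypothetical stand-in, NOT the llama.cpp implementation: classify a UTF-8
// lead byte by its high bits. Returning 0 for bytes that cannot start a
// sequence is an assumption of this sketch.
inline std::size_t sketch_utf8_sequence_length(unsigned char first_byte) {
    if ((first_byte & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((first_byte & 0xE0) == 0xC0) return 2; // 110xxxxx: 2-byte lead
    if ((first_byte & 0xF0) == 0xE0) return 3; // 1110xxxx: 3-byte lead
    if ((first_byte & 0xF8) == 0xF0) return 4; // 11110xxx: 4-byte lead
    return 0; // continuation byte (10xxxxxx) or 0xF8..0xFF
}
```

The sketch only inspects the bit pattern of the lead byte. A full decoder would also have to validate the continuation bytes and reject overlong encodings.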

Usage

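Include this header in any module that needs safe UTF-8 handling when text arrives in pieces, for example when a multi-byte character may be split across chunk boundaries. The `INCOMPLETE` status lets a caller tell a truncated sequence (wait for more bytes) apart from a malformed one (`INVALID`). The header provides the minimal Unicode interface needed throughout the common library for incremental codepoint decoding.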
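
The streaming scenario can be sketched as a small decoder that carries an undecoded tail from one chunk to the next. Everything below is illustrative: `sketch_parse_result`, `sketch_parse_codepoint`, and `stream_decoder` are hypothetical stand-ins that mirror the shape of the declared interface so the example compiles on its own. They are not the llama.cpp code, and the stand-in parser is simplified (no overlong or surrogate checks, and a truncated sequence is reported as incomplete before its available bytes are validated).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the result type; NOT the llama.cpp definition.
struct sketch_parse_result {
    enum status_t { SUCCESS, INCOMPLETE, INVALID } status;
    uint32_t codepoint;
    std::size_t bytes_consumed;
};

// Hypothetical, simplified decoder used only to drive the example.
inline sketch_parse_result sketch_parse_codepoint(std::string_view in, std::size_t off) {
    if (off >= in.size()) return {sketch_parse_result::INCOMPLETE, 0, 0};
    const unsigned char b0 = static_cast<unsigned char>(in[off]);
    std::size_t len = 0;
    uint32_t cp = 0;
    if      (b0 < 0x80)           { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return {sketch_parse_result::INVALID, 0, 0};
    // Not enough bytes yet: the caller should wait for the next chunk.
    if (off + len > in.size()) return {sketch_parse_result::INCOMPLETE, 0, 0};
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(in[off + i]);
        if ((b & 0xC0) != 0x80) return {sketch_parse_result::INVALID, 0, 0};
        cp = (cp << 6) | (b & 0x3F);
    }
    return {sketch_parse_result::SUCCESS, cp, len};
}

// Accumulates bytes across chunks and emits only complete codepoints.
struct stream_decoder {
    std::string pending; // undecoded tail carried over from earlier chunks

    std::vector<uint32_t> feed(std::string_view chunk) {
        pending.append(chunk);
        std::vector<uint32_t> out;
        std::size_t off = 0;
        while (off < pending.size()) {
            const auto r = sketch_parse_codepoint(pending, off);
            if (r.status == sketch_parse_result::INCOMPLETE) break; // wait for more bytes
            if (r.status == sketch_parse_result::INVALID) {
                out.push_back(0xFFFD); // replacement character: a policy choice of this sketch
                off += 1;              // skip one byte and try to resynchronize
                continue;
            }
            out.push_back(r.codepoint);
            off += r.bytes_consumed;
        }
        pending.erase(0, off); // keep only the bytes that were not decoded
        return out;
    }
};
```

Feeding `"a\xE6\x97"` and then `"\xA5"` yields U+0061 from the first call and U+65E5 from the second, because the two leading bytes of the 3-byte sequence stay in `pending` until the final byte arrives. Replacing invalid bytes with U+FFFD is only one possible policy; a caller could equally stop and report the error.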

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: common/unicode.h
Lines: 1-22

Signature

struct utf8_parse_result {
    uint32_t codepoint;
    size_t bytes_consumed;
    enum status { SUCCESS, INCOMPLETE, INVALID } status;

    utf8_parse_result(enum status s, uint32_t cp = 0, size_t bytes = 0);
};

size_t utf8_sequence_length(unsigned char first_byte);
utf8_parse_result parse_utf8_codepoint(std::string_view input, size_t offset);

Import

#include "unicode.h"

I/O Contract

Inputs

Name	Type	Required	Description
first_byte	unsigned char	Yes	First byte of a UTF-8 sequence for length determination
input	std::string_view	Yes	Buffer containing UTF-8 encoded data
offset	size_t	Yes	Starting byte offset in the input

Outputs

Name	Type	Description
sequence length	size_t	Expected byte count for the UTF-8 sequence (1-4)
result	utf8_parse_result	Decoded codepoint, bytes consumed, and status (SUCCESS/INCOMPLETE/INVALID)

Usage Examples

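#include <string_view>

#include "unicode.h"

// Illustrative wrapper (hypothetical name) so the snippet compiles as a unit
void decode_first_codepoint(std::string_view buf) {
    // Determine the expected sequence length from a lead byte
    size_t len = utf8_sequence_length(0xE4); // returns 3 for a 3-byte sequence
    (void) len; // unused in this example

    // Parse the codepoint at offset 0 and check the status
    auto res = parse_utf8_codepoint(buf, 0);
    switch (res.status) {
        case utf8_parse_result::SUCCESS:
            // Use res.codepoint, then advance the offset by res.bytes_consumed
            break;
        case utf8_parse_result::INCOMPLETE:
            // Need more data from the stream before this sequence can be decoded
            break;
        case utf8_parse_result::INVALID:
            // Malformed UTF-8 sequence
            break;
    }
}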

Related Pages

Principle:Ggml_org_Llama_cpp_Unicode
