

Implementation:Ggml org Llama cpp Common Unicode Header

From Leeroopedia
Knowledge Sources
Domains Unicode, Streaming
Last Updated 2026-02-15 00:00 GMT

Overview

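Declares UTF-8 parsing utilities that decode Unicode codepoints one at a time and report when a sequence is cut off at the end of the available input, which makes them usable on streamed data.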

Description

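Defines the `utf8_parse_result` struct, which holds the decoded codepoint (`uint32_t`), a `bytes_consumed` count (1-4 for a decoded codepoint), and a `status` enum with the values `SUCCESS`, `INCOMPLETE`, and `INVALID`. Its constructor takes the status first and defaults the codepoint and byte count to 0. The header also declares `utf8_sequence_length`, which returns the expected byte length of a sequence given its first byte, and `parse_utf8_codepoint`, which decodes a single codepoint from a `string_view` starting at a given byte offset.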
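As a rough illustration of how a first byte determines sequence length, the sketch below classifies a byte by its top four bits. The name `sketch_sequence_length` is hypothetical and this is not the library's implementation; in particular, returning 1 for continuation bytes (`10xxxxxx`) is an assumption of this sketch.

```cpp
#include <cstddef>

// Hypothetical stand-in, not the library function: maps a lead byte to the
// length of the sequence it announces, using only the top four bits.
inline std::size_t sketch_sequence_length(unsigned char first_byte) {
    // 0xxx -> 1 byte (ASCII), 110x -> 2, 1110 -> 3, 1111 -> 4.
    // 10xx marks a continuation byte, which cannot start a sequence;
    // returning 1 for it is an assumption of this sketch.
    static const std::size_t lookup[16] = {
        1, 1, 1, 1, 1, 1, 1, 1,   // 0x0_ .. 0x7_ : ASCII
        1, 1, 1, 1,               // 0x8_ .. 0xB_ : continuation bytes
        2, 2,                     // 0xC_ .. 0xD_ : 2-byte lead
        3,                        // 0xE_         : 3-byte lead
        4                         // 0xF_         : 4-byte lead
    };
    return lookup[first_byte >> 4];
}
```

A lookup like this only classifies the byte; it does not validate it. Bytes such as `0xF8` still map to 4 here, so rejecting malformed input is left to the full decoding step.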

Usage

Include this header in any module that needs safe UTF-8 handling when text arrives in pieces, for example when a buffer may end in the middle of a multi-byte sequence. It provides the minimal Unicode interface that the common library uses for incremental codepoint decoding.
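To show why streaming needs special care, the following sketch holds back an unfinished multi-byte tail until the next chunk arrives. It does not call this header; `incomplete_tail_length` and `take_complete_prefix` are hypothetical helpers that inspect lead bytes directly, and the sketch assumes the rest of the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: number of trailing bytes in `buf` that belong to a
// multi-byte sequence whose remaining bytes have not arrived yet.
inline std::size_t incomplete_tail_length(std::string_view buf) {
    const std::size_t max_back = buf.size() < 3 ? buf.size() : 3;
    for (std::size_t back = 1; back <= max_back; ++back) {
        const unsigned char b = static_cast<unsigned char>(buf[buf.size() - back]);
        if ((b & 0xC0) == 0x80) {
            continue;                      // continuation byte, keep looking for the lead
        }
        std::size_t need = 1;
        if ((b & 0xE0) == 0xC0)      need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        return need > back ? back : 0;     // hold back only if bytes are still missing
    }
    return 0;
}

// Appends `chunk` to the bytes carried over in `pending`, returns the prefix
// that ends on a codepoint boundary, and leaves any unfinished tail in `pending`.
inline std::string take_complete_prefix(std::string & pending, std::string_view chunk) {
    pending.append(chunk);
    const std::size_t tail = incomplete_tail_length(pending);
    std::string ready = pending.substr(0, pending.size() - tail);
    pending.erase(0, pending.size() - tail);
    return ready;
}
```

A decoder that reports an `INCOMPLETE` status lets the caller make the same decision without scanning backwards: stop at the offset where decoding ran out of bytes and retry once more data has been appended.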

Code Reference

Source Location

Signature

struct utf8_parse_result {
    uint32_t codepoint;
    size_t bytes_consumed;
    enum status { SUCCESS, INCOMPLETE, INVALID } status;

    utf8_parse_result(enum status s, uint32_t cp = 0, size_t bytes = 0);
};

size_t utf8_sequence_length(unsigned char first_byte);
utf8_parse_result parse_utf8_codepoint(std::string_view input, size_t offset);

Import

#include "unicode.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| first_byte | unsigned char | Yes | First byte of a UTF-8 sequence for length determination |
| input | std::string_view | Yes | Buffer containing UTF-8 encoded data |
| offset | size_t | Yes | Starting byte offset in the input |

Outputs

| Name | Type | Description |
|------|------|-------------|
| sequence length | size_t | Expected byte count for the UTF-8 sequence (1-4) |
| result | utf8_parse_result | Decoded codepoint, bytes consumed, and status (SUCCESS/INCOMPLETE/INVALID) |
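The three-way status can be made concrete with a self-contained decoder. This is a minimal sketch under stated assumptions, not the code behind `parse_utf8_codepoint`: `sketch_result` and `sketch_parse` are hypothetical names, and the choices to report 0 bytes consumed on failure and to reject overlong encodings and surrogates are assumptions of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical mirror of the declared result type, for illustration only.
struct sketch_result {
    enum status_t { SUCCESS, INCOMPLETE, INVALID };
    status_t      status;
    std::uint32_t codepoint;
    std::size_t   bytes_consumed;
};

inline sketch_result sketch_parse(std::string_view input, std::size_t offset) {
    if (offset >= input.size()) {
        return {sketch_result::INCOMPLETE, 0, 0};      // nothing to read yet
    }
    const unsigned char b0 = static_cast<unsigned char>(input[offset]);
    if (b0 < 0x80) {
        return {sketch_result::SUCCESS, b0, 1};        // single-byte ASCII
    }
    std::size_t   len;
    std::uint32_t cp;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else {
        return {sketch_result::INVALID, 0, 0};         // stray continuation or bad lead byte
    }
    for (std::size_t i = 1; i < len; ++i) {
        if (offset + i >= input.size()) {
            return {sketch_result::INCOMPLETE, 0, 0};  // sequence cut off by end of buffer
        }
        const unsigned char b = static_cast<unsigned char>(input[offset + i]);
        if ((b & 0xC0) != 0x80) {
            return {sketch_result::INVALID, 0, 0};     // expected a continuation byte
        }
        cp = (cp << 6) | (b & 0x3F);                   // append the 6 payload bits
    }
    // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
    static const std::uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return {sketch_result::INVALID, 0, 0};
    }
    return {sketch_result::SUCCESS, cp, len};
}
```

The distinction that matters for callers is that `INCOMPLETE` is recoverable (more bytes may complete the sequence), while `INVALID` is not, so the two usually lead to different handling.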

Usage Examples

#include <cstddef>
#include <string_view>

#include "unicode.h"

// Hypothetical wrapper function so the snippet compiles as a unit.
void example() {
    // Determine expected sequence length
    size_t len = utf8_sequence_length(0xE4); // returns 3 for a 3-byte sequence
    (void) len; // silence the unused-variable warning in this snippet

    // Parse a codepoint and check status
    std::string_view buf = "...";
    auto res = parse_utf8_codepoint(buf, 0);
    switch (res.status) {
        case utf8_parse_result::SUCCESS:
            // Use res.codepoint and res.bytes_consumed
            break;
        case utf8_parse_result::INCOMPLETE:
            // Need more data from the stream
            break;
        case utf8_parse_result::INVALID:
            // Malformed UTF-8 sequence
            break;
    }
}
