Implementation:Ggml org Llama cpp Common Unicode

From Leeroopedia
Knowledge Sources
Domains Unicode, Streaming
Last Updated 2026-02-15 00:00 GMT

Overview

Implements UTF-8 codepoint parsing for Unicode text that may arrive incrementally, as in streaming.

Description

`utf8_sequence_length` returns the expected byte count of a UTF-8 sequence from its first byte, using a lookup table indexed by that byte's top 4 bits. `parse_utf8_codepoint` decodes a single UTF-8 codepoint from a `string_view` at a given offset, handling 1-byte ASCII (fast path), 2-byte, 3-byte, and 4-byte sequences. It returns a result whose status is SUCCESS (with the decoded codepoint and byte count), INCOMPLETE (not enough bytes are available yet, which matters for streaming), or INVALID (malformed sequence). Each multi-byte path checks that every continuation byte has the form 10xxxxxx, that is, `(byte & 0xC0) == 0x80`.
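The decoding steps described above can be sketched as follows. This is a minimal illustration, not the code from `unicode.h`: the names `sketch_status`, `sketch_result`, `sketch_sequence_length`, and `sketch_parse` are hypothetical stand-ins, the table entry for continuation bytes is an assumption, and the sketch does not reject overlong encodings or surrogate codepoints.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-ins for illustration; not the definitions from unicode.h.
enum class sketch_status { SUCCESS, INCOMPLETE, INVALID };

struct sketch_result {
    sketch_status status;
    uint32_t      codepoint;
    size_t        bytes_consumed;
};

// Expected sequence length, looked up by the top 4 bits of the first byte.
// 0x0-0x7: ASCII. 0x8-0xB: continuation bytes, mapped to 1 here (an
// assumption of this sketch). 0xC-0xD: 2 bytes. 0xE: 3 bytes. 0xF: 4 bytes.
inline size_t sketch_sequence_length(unsigned char first_byte) {
    static const size_t lookup[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4};
    return lookup[first_byte >> 4];
}

inline sketch_result sketch_parse(std::string_view input, size_t offset) {
    if (offset >= input.size()) {
        return {sketch_status::INCOMPLETE, 0, 0};  // nothing to read yet
    }
    const unsigned char first = static_cast<unsigned char>(input[offset]);
    if (first < 0x80) {
        return {sketch_status::SUCCESS, first, 1};  // ASCII fast path
    }
    if (first < 0xC0 || first > 0xF7) {
        // Stray continuation byte, or a lead byte in 0xF8..0xFF.
        return {sketch_status::INVALID, 0, 0};
    }
    const size_t len = sketch_sequence_length(first);
    // Payload bits of the lead byte: 5 bits for len 2, 4 for len 3, 3 for len 4.
    uint32_t cp = first & (0xFF >> (len + 1));
    for (size_t i = 1; i < len; ++i) {
        if (offset + i >= input.size()) {
            return {sketch_status::INCOMPLETE, 0, 0};  // more bytes may still arrive
        }
        const unsigned char c = static_cast<unsigned char>(input[offset + i]);
        if ((c & 0xC0) != 0x80) {
            return {sketch_status::INVALID, 0, 0};  // not of the form 10xxxxxx
        }
        cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
    }
    return {sketch_status::SUCCESS, cp, len};
}
```

Distinguishing INCOMPLETE from INVALID lets a caller wait for more input when a sequence is merely truncated, instead of treating the truncation as corruption.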

Usage

Use this module for correct UTF-8 handling in streaming contexts where input may arrive in chunks. It is used by the console and chat parsing subsystems to decode Unicode codepoints incrementally without requiring the entire string to be available.
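To illustrate why chunked input needs special care, the sketch below holds back a trailing partial sequence until the next chunk arrives. It is self-contained and does not call this module: `pending_tail_length` and `chunk_buffer` are hypothetical names, and in real code the INCOMPLETE status from `parse_utf8_codepoint` would provide the same signal for each codepoint.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: number of bytes at the end of `buf` that start a
// multi-byte UTF-8 sequence whose remaining bytes have not arrived yet.
inline size_t pending_tail_length(std::string_view buf) {
    const size_t max_back = buf.size() < 3 ? buf.size() : 3;
    for (size_t back = 1; back <= max_back; ++back) {
        const unsigned char c = static_cast<unsigned char>(buf[buf.size() - back]);
        if ((c & 0xC0) == 0x80) {
            continue;  // continuation byte: keep looking for the lead byte
        }
        size_t expected = 1;
        if      ((c & 0xE0) == 0xC0) expected = 2;
        else if ((c & 0xF0) == 0xE0) expected = 3;
        else if ((c & 0xF8) == 0xF0) expected = 4;
        // Hold the tail back only if the sequence is still missing bytes.
        return expected > back ? back : 0;
    }
    return 0;  // no lead byte within the last 3 bytes: nothing to hold back
}

// Hypothetical accumulator: feed() returns the longest prefix that ends on a
// sequence boundary and keeps any incomplete tail for the next chunk.
struct chunk_buffer {
    std::string pending;

    std::string feed(std::string_view chunk) {
        pending.append(chunk);
        const size_t keep = pending_tail_length(pending);
        std::string ready = pending.substr(0, pending.size() - keep);
        pending.erase(0, pending.size() - keep);
        return ready;
    }
};
```

The helper only inspects the last three bytes because a valid sequence is at most four bytes long, so an incomplete one can have at most three bytes present. Validation of the returned text is left to the decoder.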

Code Reference

Source Location

Signature

size_t utf8_sequence_length(unsigned char first_byte);
utf8_parse_result parse_utf8_codepoint(std::string_view input, size_t offset);

Import

#include "unicode.h"

I/O Contract

Inputs

Name | Type | Required | Description
first_byte | unsigned char | Yes | The first byte of a UTF-8 sequence to determine its expected length
input | std::string_view | Yes | The input buffer containing UTF-8 encoded text
offset | size_t | Yes | Byte offset within the input to start parsing from

Outputs

Name | Type | Description
sequence length | size_t | Expected number of bytes in the UTF-8 sequence (1-4)
parse result | utf8_parse_result | Contains the decoded codepoint, bytes consumed (1-4), and status (SUCCESS, INCOMPLETE, INVALID)

Usage Examples

#include "unicode.h"

// Parse ASCII character
std::string_view text = "Hello";
auto result = parse_utf8_codepoint(text, 0);
// result.status == utf8_parse_result::SUCCESS
// result.codepoint == 'H' (72)
// result.bytes_consumed == 1

// Parse a multi-byte character (U+00E9, "é", encoded as the two bytes C3 A9)
std::string_view utf8_text = "\xC3\xA9";
auto result2 = parse_utf8_codepoint(utf8_text, 0);
// result2.status == utf8_parse_result::SUCCESS
// result2.codepoint == 0xE9
// result2.bytes_consumed == 2

// Handle incomplete streaming input
std::string_view partial = "\xC3"; // first byte of 2-byte sequence, missing continuation
auto result3 = parse_utf8_codepoint(partial, 0);
// result3.status == utf8_parse_result::INCOMPLETE

// Get expected sequence length
size_t len = utf8_sequence_length(0xC3); // returns 2
