Implementation:ArroyoSystems Arroyo Proto Schema Converter
Overview
The Protobuf Schema Converter turns Protocol Buffer message descriptors into Arrow schemas. It maps each protobuf field type to an Arrow data type and handles nested messages, repeated fields (as lists), maps (as JSON strings), enums, and optional fields.
Description
The module provides:
protobuf_to_arrow: Converts a MessageDescriptor into an Arrow Schema by iterating over message fields and mapping each to an Arrow Field.
protobuf_to_arrow_datatype: Maps individual protobuf field descriptors to Arrow data types:
- Bool -> Boolean
- Int32, Sint32, Sfixed32 -> Int32
- Int64, Sint64, Sfixed64 -> Int64
- Uint32, Fixed32 -> UInt32
- Uint64, Fixed64 -> UInt64
- Float -> Float32
- Double -> Float64
- String, Bytes -> Utf8
- Nested messages -> Struct (recursive)
- Maps -> Utf8 with JSON extension (maps are not natively supported)
- Enums -> Utf8
- Repeated fields -> List (wrapping the element type)
get_pool: Creates a DescriptorPool from encoded file descriptor set bytes, using the global pool as a base.
is_nullable: Determines field nullability -- fields with Optional cardinality, lists, or maps are nullable.
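The mapping and nullability rules listed above can be sketched with standard-library types only. This is a minimal illustration, not the module's code: `ProtoKind`, `ProtoField`, `ArrowType`, `ArrowField`, and the helper functions are hypothetical stand-ins for the real protobuf descriptor and Arrow types, and `JsonUtf8` merely represents "Utf8 with JSON extension".

```rust
/// Hypothetical stand-in for a protobuf field kind (not the real descriptor API).
#[derive(Debug, Clone, PartialEq)]
enum ProtoKind {
    Bool,
    Int32, Sint32, Sfixed32,
    Int64, Sint64, Sfixed64,
    Uint32, Fixed32,
    Uint64, Fixed64,
    Float, Double,
    String, Bytes,
    Enum,
    Message(Vec<ProtoField>),
}

/// Hypothetical stand-in for a protobuf field descriptor.
#[derive(Debug, Clone, PartialEq)]
struct ProtoField {
    name: String,
    kind: ProtoKind,
    optional: bool,
    repeated: bool,
    map: bool,
}

/// Hypothetical stand-in for an Arrow data type.
#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Boolean, Int32, Int64, UInt32, UInt64, Float32, Float64,
    Utf8,
    JsonUtf8, // models "Utf8 with JSON extension"
    Struct(Vec<ArrowField>),
    List(Box<ArrowType>),
}

/// Hypothetical stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    data_type: ArrowType,
    nullable: bool,
}

/// Convenience constructor: a required, singular, non-map field.
fn field(name: &str, kind: ProtoKind) -> ProtoField {
    ProtoField { name: name.to_string(), kind, optional: false, repeated: false, map: false }
}

/// Optional fields, lists, and maps are nullable.
fn is_nullable(f: &ProtoField) -> bool {
    f.optional || f.repeated || f.map
}

/// Maps the element kind, ignoring list and map wrappers.
fn element_type(kind: &ProtoKind) -> ArrowType {
    match kind {
        ProtoKind::Bool => ArrowType::Boolean,
        ProtoKind::Int32 | ProtoKind::Sint32 | ProtoKind::Sfixed32 => ArrowType::Int32,
        ProtoKind::Int64 | ProtoKind::Sint64 | ProtoKind::Sfixed64 => ArrowType::Int64,
        ProtoKind::Uint32 | ProtoKind::Fixed32 => ArrowType::UInt32,
        ProtoKind::Uint64 | ProtoKind::Fixed64 => ArrowType::UInt64,
        ProtoKind::Float => ArrowType::Float32,
        ProtoKind::Double => ArrowType::Float64,
        ProtoKind::String | ProtoKind::Bytes | ProtoKind::Enum => ArrowType::Utf8,
        // Nested messages become structs, converted recursively.
        ProtoKind::Message(fields) => {
            ArrowType::Struct(fields.iter().map(to_arrow_field).collect())
        }
    }
}

/// Applies the map and list wrappers on top of the element mapping.
fn to_arrow_type(f: &ProtoField) -> ArrowType {
    if f.map {
        ArrowType::JsonUtf8
    } else if f.repeated {
        ArrowType::List(Box::new(element_type(&f.kind)))
    } else {
        element_type(&f.kind)
    }
}

/// Builds one output field: name, mapped type, and nullability.
fn to_arrow_field(f: &ProtoField) -> ArrowField {
    ArrowField { name: f.name.clone(), data_type: to_arrow_type(f), nullable: is_nullable(f) }
}
```

In this sketch the map check comes before the list check because protobuf encodes map fields as repeated entries, so a map field would otherwise be treated as a list.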
The module also includes utilities for compiling .proto files with protoc and validating file paths for safety.
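This page does not say which checks the path validation performs. As an illustration of the general idea only, the sketch below assumes that "safe" means a non-empty relative path with no parent-directory components; `is_safe_relative_path` is a hypothetical helper, not a function from the module.

```rust
use std::path::{Component, Path};

/// Hypothetical check: accept only non-empty relative paths that never
/// reference a root, a drive prefix, or a parent directory ("..").
fn is_safe_relative_path(path: &str) -> bool {
    !path.is_empty()
        && Path::new(path)
            .components()
            .all(|c| matches!(c, Component::Normal(_) | Component::CurDir))
}
```

Under these assumptions, `protos/events.proto` is accepted, while `/etc/passwd` and `../secrets.proto` are rejected.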
Usage
The converter is called during CREATE TABLE processing when a user specifies the Protobuf format. The protobuf descriptor is resolved and converted to an Arrow schema for internal processing.
Code Reference
Source Location
crates/arroyo-formats/src/proto/schema.rs
Signature
pub fn protobuf_to_arrow(proto_schema: &MessageDescriptor) -> anyhow::Result<Schema>
pub fn get_pool(encoded: &[u8]) -> anyhow::Result<DescriptorPool>
Import
use arroyo_formats::proto::schema::{protobuf_to_arrow, get_pool};
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| proto_schema | &MessageDescriptor | Protobuf message descriptor defining the message structure |
| encoded | &[u8] | Encoded protobuf file descriptor set bytes |
Outputs
| Name | Type | Description |
|---|---|---|
| arrow_schema | Schema | Arrow schema derived from the protobuf message definition |
| pool | DescriptorPool | Protobuf descriptor pool for resolving message types |
Usage Examples
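use anyhow::anyhow;

// Load a descriptor pool from the encoded file descriptor set bytes
// (encoded_descriptor_set is assumed to be a byte buffer, e.g. Vec<u8>)
let pool = get_pool(&encoded_descriptor_set)?;
// Look up the message by its fully qualified name
let message_desc = pool
    .get_message_by_name("my.package.MyMessage")
    .ok_or_else(|| anyhow!("message not found"))?;
// Convert to Arrow schema
let arrow_schema = protobuf_to_arrow(&message_desc)?;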
Related Pages
- ArroyoSystems_Arroyo_Avro_Schema_Converter - Similar conversion for Avro schemas
- ArroyoSystems_Arroyo_Json_Schema_Converter - Similar conversion for JSON schemas
- ArroyoSystems_Arroyo_Format_Deserializer - Uses descriptor pools for protobuf deserialization