Implementation:Datahub project Datahub Protoc Descriptor Set Out

From Leeroopedia


Field | Value
Implementation Name | Protoc_Descriptor_Set_Out
Type | External Tool Doc
Workflow | Protobuf_Schema_Ingestion
Repository | https://github.com/datahub-project/datahub
Implements | Principle:Datahub_project_Datahub_Protobuf_Compilation
Last Updated | 2026-02-09 17:00 GMT

Overview

Description

Protoc Descriptor Set Out is the external tool invocation that compiles Protocol Buffer .proto source files into binary descriptor set files (.protoc) using the protoc compiler. It is the entry point of the protobuf ingestion pipeline: without a compiled descriptor set, no downstream metadata extraction can occur. The invocation passes flags that preserve all transitive imports, source code comments, and custom option extensions in the output binary.

The DataHub project uses this compilation step in its Gradle build for test resources, but the same protoc invocation is expected to be run by users on their own proto schemas before invoking Proto2DataHub.

Usage

The protoc compiler is invoked from the command line or from a build system (Gradle, Bazel, Make) to produce descriptor set files. The output files are then passed to Proto2DataHub via the --descriptor flag.

Code Reference

Source Location

metadata-integration/java/datahub-protobuf/build.gradle, lines 42-58.

Signature

protoc --proto_path=. --include_imports --include_source_info --descriptor_set_out=<output.protoc> <input.proto>

Import

This is an external tool invocation. The protoc binary must be installed on the system. It is part of the Protocol Buffers compiler distribution from Google.
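
Because a missing compiler otherwise only shows up as a shell error partway through a build, a preflight check can be useful. The following is a minimal sketch; `require_tool` is a hypothetical helper name and is not part of protoc or DataHub.

```shell
# Hypothetical helper (not part of protoc or DataHub): fail early with a
# readable message when a required executable is not on PATH.
require_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: '$1' not found on PATH" >&2
    return 1
  fi
}

# Print the compiler version only when protoc is actually installed.
if require_tool protoc; then
  protoc --version
fi
```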

I/O Contract

Direction | Type | Description
Input | .proto files | One or more Protocol Buffer schema source files. May include imports referencing other .proto files.
Output | .protoc binary descriptor set file | A serialized google.protobuf.FileDescriptorSet message containing the compiled descriptors for the input file and all its transitive imports.

Key Flags

Flag | Description | Required
--proto_path=. | Base path for resolving import statements in proto files. All import paths are resolved relative to this directory. The flag may be repeated to search several directories in order. | Optional (protoc defaults to the current working directory when it is omitted)
--include_imports | Include all transitively imported file descriptors in the output descriptor set. Without this flag, only the directly compiled file's descriptor is included, and type references to imported messages would be unresolvable. | Yes
--include_source_info | Include SourceCodeInfo in the output, which preserves source code comments and their positions. This is essential for extracting documentation, ownership annotations (e.g., @datahub-project/data-team), and Slack channel references (e.g., #data-eng) from proto file comments. | Yes
--descriptor_set_out=<path> | Path to write the binary descriptor set output file. By convention, DataHub uses the .protoc extension for these files. | Yes
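
One way to check that these flags took effect is to decode the descriptor set back into text and look for the imported files and the source info. The sketch below makes several assumptions: protoc can locate google/protobuf/descriptor.proto (an extra --proto_path pointing at the protobuf include directory may be needed), the decoded text uses protoc's usual two-space indentation, and the function names and the `PROTOC` override variable are hypothetical.

```shell
PROTOC="${PROTOC:-protoc}"   # hypothetical override hook, e.g. for dry runs

# Decode a binary descriptor set file into protobuf text format on stdout.
dump_descriptor_set() {
  "$PROTOC" --decode=google.protobuf.FileDescriptorSet \
    google/protobuf/descriptor.proto < "$1"
}

# Read decoded text on stdin and print the name of each contained .proto file.
# Only names indented by exactly two spaces are file-level names.
list_descriptor_files() {
  sed -n 's/^  name: "\(.*\)"$/\1/p'
}

# Read decoded text on stdin; succeed if any file carries SourceCodeInfo.
has_source_info() {
  grep -q '^  source_code_info {'
}
```

For example, `dump_descriptor_set my_schema.protoc | list_descriptor_files` would be expected to list my_schema.proto together with each transitive import when --include_imports was used, and piping the same dump into `has_source_info` would be expected to succeed only when --include_source_info was used.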

Usage Examples

Single File Compilation

protoc --proto_path=. \
  --include_imports \
  --include_source_info \
  --descriptor_set_out=protobuf/my_schema.protoc \
  protobuf/my_schema.proto

Batch Compilation via Gradle (as used in DataHub tests)

import java.nio.file.Paths

task compileProtobuf {
    doLast {
        // protoc runs from this directory, so input paths are made relative to it.
        def basePath = Paths.get("${projectDir}/src/test/resources")
        [
                fileTree("${projectDir}/src/test/resources/protobuf") { include "*.proto" },
                fileTree("${projectDir}/src/test/resources/extended_protobuf") { include "*.proto" }
        ].collectMany { it.collect() }.each { f ->
            def input = basePath.relativize(Paths.get(f.getAbsolutePath()))
            // One protoc call per file; foo.protoc is written next to foo.proto.
            exec {
                workingDir "${projectDir}/src/test/resources"
                commandLine 'protoc', '--proto_path=.', '--include_imports', '--include_source_info',
                        "--descriptor_set_out=${input.toString().replace(".proto", ".protoc")}",
                        input
            }
        }
    }
}
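
Outside Gradle, the same batch loop can be approximated in POSIX shell. This is a sketch rather than anything shipped with DataHub: it assumes it is run from the directory used as --proto_path, and `compile_descriptors`, `descriptor_path_for`, and the `PROTOC` override variable are hypothetical names.

```shell
PROTOC="${PROTOC:-protoc}"   # hypothetical override hook; PROTOC=echo gives a dry run

# Map path/to/foo.proto to path/to/foo.protoc.
descriptor_path_for() {
  printf '%s\n' "${1%.proto}.protoc"
}

# Compile every *.proto directly inside each given directory into a
# sibling .protoc descriptor set, stopping at the first failure.
compile_descriptors() {
  for dir in "$@"; do
    for f in "$dir"/*.proto; do
      [ -e "$f" ] || continue   # the glob matched nothing in this directory
      "$PROTOC" --proto_path=. --include_imports --include_source_info \
        "--descriptor_set_out=$(descriptor_path_for "$f")" "$f" || return 1
    done
  done
}
```

Run from src/test/resources, `compile_descriptors protobuf extended_protobuf` would cover the same two file trees as the Gradle task.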

Multiple Proto Files With Custom Import Paths

protoc --proto_path=./schemas \
  --proto_path=./third_party \
  --include_imports \
  --include_source_info \
  --descriptor_set_out=output/all_schemas.dsc \
  schemas/events/*.proto \
  schemas/entities/*.proto
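
protoc writes the descriptor set to the given path but does not create missing parent directories, so the command above fails if output/ does not exist yet. A minimal sketch of a guard follows; `prepare_output_path` is a hypothetical helper name.

```shell
# Hypothetical helper: create the parent directory of a descriptor set path
# and print the path back so it can be used inline in a protoc call.
prepare_output_path() {
  mkdir -p "$(dirname "$1")" && printf '%s\n' "$1"
}

# Example use (shown as a comment so the helper can be sourced without protoc):
#   protoc --proto_path=./schemas --proto_path=./third_party \
#     --include_imports --include_source_info \
#     "--descriptor_set_out=$(prepare_output_path output/all_schemas.dsc)" \
#     schemas/events/*.proto schemas/entities/*.proto
```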

Compilation Then Ingestion

# Step 1: Compile
protoc --proto_path=. --include_imports --include_source_info \
  --descriptor_set_out=my_schema.protoc my_schema.proto

# Step 2: Ingest into DataHub
java -jar datahub-protobuf.jar \
  --descriptor my_schema.protoc \
  --file my_schema.proto \
  --datahub_api http://localhost:8080
