Implementation:Datahub project Datahub Protoc Descriptor Set Out
| Field | Value |
|---|---|
| Implementation Name | Protoc_Descriptor_Set_Out |
| Type | External Tool Doc |
| Workflow | Protobuf_Schema_Ingestion |
| Repository | https://github.com/datahub-project/datahub |
| Implements | Principle:Datahub_project_Datahub_Protobuf_Compilation |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Protoc Descriptor Set Out is the external tool invocation that compiles Protocol Buffer .proto source files into binary descriptor set files (.protoc) using the protoc compiler. This is the entry point of the protobuf ingestion pipeline: without a compiled descriptor set, no downstream metadata extraction can occur. The compilation is configured with specific flags so that all transitive imports, source code comments, and custom option extensions are preserved in the output binary.
The DataHub project uses this compilation step in its Gradle build for test resources, but the same protoc invocation is expected to be run by users on their own proto schemas before invoking Proto2DataHub.
Usage
The protoc compiler is invoked from the command line or from a build system (Gradle, Bazel, Make) to produce descriptor set files. The output files are then passed to Proto2DataHub via the --descriptor flag.
Code Reference
Source Location
metadata-integration/java/datahub-protobuf/build.gradle, lines 42-58.
Signature
protoc --proto_path=. --include_imports --include_source_info --descriptor_set_out=<output.protoc> <input.proto>
Import
This is an external tool invocation. The protoc binary must be installed on the system. It is part of the Protocol Buffers compiler distribution from Google.
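A build script can fail fast when the compiler is missing. The following is a minimal sketch of such a check; the helper name require_tool is hypothetical and is not part of DataHub or protoc.

```shell
# Hypothetical helper: succeed only if the named executable is on PATH.
require_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: '$1' not found on PATH" >&2
    return 1
  fi
}
```

A typical call would be `require_tool protoc && protoc --version`, which prints the installed compiler version when the binary is available.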
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | .proto files | One or more Protocol Buffer schema source files. May include imports referencing other .proto files. |
| Output | .protoc binary descriptor set file | A serialized google.protobuf.FileDescriptorSet message containing the compiled descriptors for the input file and all its transitive imports. |
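To see what a descriptor set actually contains, protoc can decode the binary back into protobuf text format. This is a minimal sketch, assuming google/protobuf/descriptor.proto is resolvable on protoc's import path (it ships with the standard protoc distribution). The function name and the PROTOC override variable are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: print a descriptor set file as protobuf text format.
# PROTOC may point at a specific protoc binary (an assumption of this sketch).
dump_descriptor_set() {
  "${PROTOC:-protoc}" \
    --decode=google.protobuf.FileDescriptorSet \
    google/protobuf/descriptor.proto < "$1"
}
```

For example, `dump_descriptor_set my_schema.protoc | grep leading_comments` should list the preserved source comments when the file was compiled with source info included.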
Key Flags
| Flag | Description | Required |
|---|---|---|
| --proto_path=. | Base path for resolving import statements in proto files. All import paths are resolved relative to this directory. | Yes |
| --include_imports | Include all transitively imported file descriptors in the output descriptor set. Without this flag, only the directly compiled file's descriptor is included, and type references to imported messages would be unresolvable. | Yes |
| --include_source_info | Include SourceCodeInfo in the output, which preserves source code comments and their positions. This is essential for extracting documentation, ownership annotations (e.g., @datahub-project/data-team), and Slack channel references (e.g., #data-eng) from proto file comments. | Yes |
| --descriptor_set_out=<path> | Path to write the binary descriptor set output file. By convention, DataHub uses the .protoc extension for these files. | Yes |
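Because all four flags are required, it can help to wrap them in one function so that none is forgotten. The sketch below assumes it is run from the directory that serves as the proto path; the function name compile_descriptor_set and the PROTOC override variable are hypothetical.

```shell
# Hypothetical wrapper: compile one .proto file (path relative to the
# current directory) into a descriptor set written next to it.
compile_descriptor_set() {
  input=$1
  output="${input%.proto}.protoc"   # swap the .proto suffix for .protoc
  "${PROTOC:-protoc}" \
    --proto_path=. \
    --include_imports \
    --include_source_info \
    --descriptor_set_out="$output" \
    "$input"
}
```

Calling `compile_descriptor_set protobuf/my_schema.proto` would write protobuf/my_schema.protoc, following the .protoc naming convention.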
Usage Examples
Single File Compilation
protoc --proto_path=. \
  --include_imports \
  --include_source_info \
  --descriptor_set_out=protobuf/my_schema.protoc \
  protobuf/my_schema.proto
Batch Compilation via Gradle (as used in DataHub tests)
// Paths is not a default import in Gradle build scripts;
// this line belongs at the top of build.gradle.
import java.nio.file.Paths

task compileProtobuf {
  doLast {
    def basePath = Paths.get("${projectDir}/src/test/resources")
    [
      fileTree("${projectDir}/src/test/resources/protobuf") { include "*.proto" },
      fileTree("${projectDir}/src/test/resources/extended_protobuf") { include "*.proto" }
    ].collectMany { it.collect() }.each { f ->
      // Path of the .proto file relative to the proto path root
      def input = basePath.relativize(Paths.get(f.getAbsolutePath()))
      exec {
        workingDir "${projectDir}/src/test/resources"
        commandLine 'protoc', '--proto_path=.', '--include_imports', '--include_source_info',
          "--descriptor_set_out=${input.toString().replace(".proto", ".protoc")}",
          input
      }
    }
  }
}
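Users without a Gradle build can get the same batch behavior from a plain shell loop. The following is a rough shell counterpart of the task above, not code from the DataHub repository; the function name compile_all and the PROTOC override variable are hypothetical.

```shell
# Hypothetical shell counterpart of the Gradle task: compile every .proto
# file found directly inside the given subdirectories of a base directory.
compile_all() {
  base=$1
  shift
  (
    # Run from the base directory so --proto_path=. resolves imports there.
    cd "$base" || exit 1
    for dir in "$@"; do
      for input in "$dir"/*.proto; do
        [ -e "$input" ] || continue   # the glob matched nothing
        "${PROTOC:-protoc}" --proto_path=. --include_imports --include_source_info \
          --descriptor_set_out="${input%.proto}.protoc" "$input" || exit 1
      done
    done
  )
}
```

Under these assumptions, `compile_all src/test/resources protobuf extended_protobuf` would mirror the two file trees used in the Gradle task, producing one descriptor set per input file.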
Multiple Proto Files With Custom Import Paths
protoc --proto_path=./schemas \
  --proto_path=./third_party \
  --include_imports \
  --include_source_info \
  --descriptor_set_out=output/all_schemas.dsc \
  schemas/events/*.proto \
  schemas/entities/*.proto
Compilation Then Ingestion
# Step 1: Compile
protoc --proto_path=. --include_imports --include_source_info \
  --descriptor_set_out=my_schema.protoc my_schema.proto

# Step 2: Ingest into DataHub
java -jar datahub-protobuf.jar \
  --descriptor my_schema.protoc \
  --file my_schema.proto \
  --datahub_api http://localhost:8080
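In automation, the two steps are usually chained so that ingestion never runs against a missing or empty descriptor set. The sketch below is one possible arrangement, not a script shipped with DataHub. The function name compile_and_ingest and the PROTOC and JAVA override variables are hypothetical; the jar name and flags are the ones used in the example above.

```shell
# Hypothetical pipeline: compile, verify the output exists, then ingest.
# PROTOC and JAVA are override hooks added for this sketch (assumptions).
compile_and_ingest() {
  proto=$1
  api=$2
  descriptor="${proto%.proto}.protoc"
  # Step 1: compile; stop if protoc reports an error.
  "${PROTOC:-protoc}" --proto_path=. --include_imports --include_source_info \
    --descriptor_set_out="$descriptor" "$proto" || return 1
  # Guard: refuse to ingest a missing or empty descriptor set.
  if [ ! -s "$descriptor" ]; then
    echo "error: $descriptor is missing or empty" >&2
    return 1
  fi
  # Step 2: ingest into DataHub.
  "${JAVA:-java}" -jar datahub-protobuf.jar \
    --descriptor "$descriptor" --file "$proto" --datahub_api "$api"
}
```

An invocation such as `compile_and_ingest my_schema.proto http://localhost:8080` would then perform both steps, returning a non-zero status as soon as either one fails.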