Implementation:PacktPublishing_LLM_Engineers_Handbook_ZenML_Step_Context_Metadata
| Aspect | Detail |
|---|---|
| Type | Wrapper Doc (ZenML external API) |
| API | `get_step_context().add_output_metadata(output_name: str, metadata: dict) -> None` |
| Source | `steps/etl/crawl_links.py:L42-54` |
| Import | `from zenml import get_step_context` |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Pipeline_Reporting |
Overview
The ZenML Step Context Metadata mechanism allows pipeline steps to attach structured runtime metadata to their outputs. In the Digital Data ETL pipeline, the crawl_links step uses this API to report per-domain crawling statistics (number of successfully crawled URLs, broken down by platform). This metadata becomes visible in the ZenML dashboard and is queryable via the ZenML API.
Usage in Source
```python
from zenml import get_step_context

# Inside the crawl_links step, after processing all URLs:
step_context = get_step_context()
step_context.add_output_metadata(
    output_name="crawled_links",
    metadata=metadata,
)
```
Full Step Context
The metadata attachment occurs at the end of the crawl_links ZenML step:
```python
from typing import Annotated

from loguru import logger
from zenml import get_step_context, step

from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher
from llm_engineering.domain.documents import UserDocument


@step
def crawl_links(user: UserDocument, links: list[str]) -> Annotated[list[str], "crawled_links"]:
    dispatcher = (
        CrawlerDispatcher.build()
        .register_linkedin()
        .register_medium()
        .register_github()
    )

    metadata = {}
    successful_links = []
    for link in links:
        try:
            crawler = dispatcher.get_crawler(link)
            crawler.extract(link=link, user=user)
            successful_links.append(link)

            # Track per-domain statistics
            domain = link.split("/")[2]
            metadata[domain] = metadata.get(domain, 0) + 1
        except Exception:
            logger.exception(f"Failed to crawl link: {link}")

    step_context = get_step_context()
    step_context.add_output_metadata(
        output_name="crawled_links",
        metadata=metadata,
    )

    return successful_links
```

Note that the output is declared as `Annotated[list[str], "crawled_links"]` so that the name passed to `add_output_metadata(output_name="crawled_links", ...)` matches a declared step output.
API Reference
get_step_context()
| Aspect | Detail |
|---|---|
| Import | `from zenml import get_step_context` |
| Signature | `get_step_context() -> StepContext` |
| Description | Returns the current step's execution context object. Must be called from within a running ZenML step (i.e., a function decorated with `@step`). |
| Raises | `RuntimeError` if called outside a ZenML step execution |
StepContext.add_output_metadata()
| Aspect | Detail |
|---|---|
| Signature | `add_output_metadata(output_name: str, metadata: dict) -> None` |
| Parameter: `output_name` | The name of the step output to annotate. Must match a declared output of the step. In the crawl step, this is `"crawled_links"`. |
| Parameter: `metadata` | A dictionary of key-value pairs to attach. Keys are strings; values can be strings, numbers, booleans, or nested dictionaries. |
| Returns | `None` |
| Side Effect | The metadata is persisted with the step's output artifact in ZenML's metadata store |
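Because metadata values are restricted to JSON-friendly types (strings, numbers, booleans, nested dictionaries), a quick serialization round-trip can catch invalid values before the step hands them to ZenML. A minimal sketch with a hypothetical metadata dictionary:

```python
import json

# Hypothetical metadata dict mixing the allowed value types
metadata = {
    "medium.com": 5,
    "run": {"failed": 1, "retried": False},
    "note": "nightly crawl",
}

# A JSON round-trip preserves the dict exactly when only allowed types are used
assert json.loads(json.dumps(metadata)) == metadata

# Non-serializable values (e.g., a set) fail fast instead of at persistence time
try:
    json.dumps({"domains": {"medium.com"}})
except TypeError as exc:
    print(f"rejected: {exc}")
```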
Inputs
| Parameter | Type | Description |
|---|---|---|
| `output_name` | `str` | Name of the step output artifact to annotate (e.g., `"crawled_links"`) |
| `metadata` | `dict` | Dictionary containing crawl statistics, keyed by URL domain |
Example Metadata Dictionary
```python
{
    "medium.com": 5,
    "linkedin.com": 12,
    "github.com": 3,
}
```
This indicates that in a given pipeline run, 5 Medium articles, 12 LinkedIn posts, and 3 GitHub repositories were successfully crawled.
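A dictionary of this shape can be built from a list of crawled URLs with a small standalone sketch. Here `urllib.parse.urlparse(...).netloc` stands in for the step's `link.split("/")[2]` trick; the two are equivalent for well-formed absolute URLs, which is an assumption of this sketch:

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(links: list[str]) -> dict[str, int]:
    """Build the per-domain metadata dictionary from successfully crawled URLs."""
    # urlparse(link).netloc extracts the domain, like link.split("/")[2]
    return dict(Counter(urlparse(link).netloc for link in links))


links = [
    "https://medium.com/some-article",
    "https://medium.com/another-article",
    "https://github.com/user/repo",
]
print(count_domains(links))  # {'medium.com': 2, 'github.com': 1}
```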
Outputs
The method returns None. The metadata is persisted as a side effect in ZenML's metadata store and becomes accessible through:
- ZenML Dashboard: Visible in the step detail view of a pipeline run
- ZenML Python Client: Queryable via `client.get_artifact_version()`, which returns metadata on the artifact
- ZenML CLI: Inspectable via `zenml artifact describe` commands
Metadata Lifecycle
1. Step execution begins (ZenML @step decorator)
2. Step logic runs (crawling URLs, collecting statistics)
3. get_step_context() retrieves the active step context
4. add_output_metadata() attaches statistics to the output artifact
5. Step returns its output value
6. ZenML persists both the output artifact and its metadata
7. Dashboard/API surfaces the metadata for this run
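Steps 3 through 6 of this lifecycle can be sketched with a toy stand-in for the step context. `FakeStepContext` below is invented for illustration and is not ZenML's actual `StepContext` implementation:

```python
class FakeStepContext:
    """Toy stand-in for ZenML's StepContext (illustration only)."""

    def __init__(self) -> None:
        self._output_metadata: dict[str, dict] = {}

    def add_output_metadata(self, output_name: str, metadata: dict) -> None:
        # Step 4: merge the statistics into the named output's metadata
        self._output_metadata.setdefault(output_name, {}).update(metadata)


def run_step() -> tuple[list[str], dict]:
    ctx = FakeStepContext()  # step 3: retrieve the active context
    metadata = {"medium.com": 2, "github.com": 1}
    ctx.add_output_metadata(output_name="crawled_links", metadata=metadata)
    # Step 5: return the output; step 6 would persist artifact plus metadata
    return ["https://medium.com/a", "https://github.com/u/r"], ctx._output_metadata


output, persisted = run_step()
print(persisted)  # {'crawled_links': {'medium.com': 2, 'github.com': 1}}
```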
Design Notes
- Timing: `add_output_metadata()` should be called before the step returns. Metadata added after the return statement is not captured.
- Idempotency: Calling `add_output_metadata()` multiple times for the same `output_name` merges the metadata dictionaries (later calls overwrite keys from earlier calls).
- Schema Freedom: The metadata dictionary has no enforced schema. It is the step author's responsibility to maintain consistent key naming across runs.
- Error Isolation: If `add_output_metadata()` fails (e.g., due to serialization issues), it should not prevent the step from returning its primary output. In practice, ZenML handles this gracefully.
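The merge-and-overwrite behavior described under Idempotency matches plain `dict.update` semantics, which makes it easy to illustrate without ZenML (a simulation, not the library internals):

```python
# Simulate two successive add_output_metadata() calls on the same output_name
merged: dict = {}
merged.update({"medium.com": 2, "github.com": 1})    # first call
merged.update({"github.com": 5, "linkedin.com": 3})  # second call wins on "github.com"
print(merged)  # {'medium.com': 2, 'github.com': 5, 'linkedin.com': 3}
```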
External References
Source References
- Crawl links step: steps/etl/crawl_links.py:L42-54
- Full step file: steps/etl/crawl_links.py
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Pipeline_Reporting -- the principle this implements
- Implementation:PacktPublishing_LLM_Engineers_Handbook_CrawlerDispatcher_Build -- the dispatcher that produces the crawl results being reported
- Implementation:PacktPublishing_LLM_Engineers_Handbook_BaseCrawler_Extract -- the crawlers whose results are aggregated in the metadata
- Environment:PacktPublishing_LLM_Engineers_Handbook_Python_3_11_Poetry_Environment