Implementation:PacktPublishing_LLM_Engineers_Handbook_ZenML_Step_Context_Metadata
| Aspect | Detail |
|---|---|
| Type | Wrapper Doc (ZenML external API) |
| API | `get_step_context().add_output_metadata(output_name: str, metadata: dict) -> None` |
| Source | `steps/etl/crawl_links.py:L42-54` |
| Import | `from zenml import get_step_context` |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Pipeline_Reporting |
Overview
The ZenML Step Context Metadata mechanism allows pipeline steps to attach structured runtime metadata to their outputs. In the Digital Data ETL pipeline, the crawl_links step uses this API to report per-domain crawling statistics (number of successfully crawled URLs, broken down by platform). This metadata becomes visible in the ZenML dashboard and is queryable via the ZenML API.
Usage in Source
```python
from zenml import get_step_context

# Inside the crawl_links step, after processing all URLs:
step_context = get_step_context()
step_context.add_output_metadata(
    output_name="crawled_links",
    metadata=metadata,
)
```
Full Step Context
The metadata attachment occurs at the end of the crawl_links ZenML step:
```python
from typing import Annotated

from loguru import logger
from zenml import get_step_context, step

from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher
from llm_engineering.domain.documents import UserDocument


@step
def crawl_links(user: UserDocument, links: list[str]) -> Annotated[list[str], "crawled_links"]:
    dispatcher = (
        CrawlerDispatcher.build()
        .register_linkedin()
        .register_medium()
        .register_github()
    )

    metadata = {}
    successful_links = []
    for link in links:
        try:
            crawler = dispatcher.get_crawler(link)
            crawler.extract(link=link, user=user)
            successful_links.append(link)

            # Track per-domain statistics
            domain = link.split("/")[2]
            metadata[domain] = metadata.get(domain, 0) + 1
        except Exception:
            logger.exception(f"Failed to crawl link: {link}")

    step_context = get_step_context()
    step_context.add_output_metadata(
        output_name="crawled_links",
        metadata=metadata,
    )

    return successful_links
```

Note that the output is declared as `Annotated[list[str], "crawled_links"]` so that the name passed to `add_output_metadata(output_name="crawled_links", ...)` matches a declared step output.
API Reference
get_step_context()
| Aspect | Detail |
|---|---|
| Import | `from zenml import get_step_context` |
| Signature | `get_step_context() -> StepContext` |
| Description | Returns the current step's execution context object. Must be called from within a running ZenML step (i.e., a function decorated with `@step`). |
| Raises | `RuntimeError` if called outside a ZenML step execution |
StepContext.add_output_metadata()
| Aspect | Detail |
|---|---|
| Signature | `add_output_metadata(output_name: str, metadata: dict) -> None` |
| Parameter: `output_name` | The name of the step output to annotate. Must match a declared output of the step. In the crawl step, this is `"crawled_links"`. |
| Parameter: `metadata` | A dictionary of key-value pairs to attach. Keys are strings; values can be strings, numbers, booleans, or nested dictionaries. |
| Returns | `None` |
| Side Effect | The metadata is persisted with the step's output artifact in ZenML's metadata store |
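Because metadata values are restricted to JSON-friendly types (strings, numbers, booleans, nested dictionaries), a quick serialization round-trip can catch invalid values before the step hands them to ZenML. A minimal sketch with a hypothetical metadata dictionary:

```python
import json

# Hypothetical metadata dict mixing the allowed value types
metadata = {
    "medium.com": 5,
    "run": {"failed": 1, "retried": False},
    "note": "nightly crawl",
}

# A JSON round-trip preserves the dict exactly when only allowed types are used
assert json.loads(json.dumps(metadata)) == metadata

# Non-serializable values (e.g., a set) fail fast instead of at persistence time
try:
    json.dumps({"domains": {"medium.com"}})
except TypeError as exc:
    print(f"rejected: {exc}")
```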
Inputs
| Parameter | Type | Description |
|---|---|---|
| `output_name` | `str` | Name of the step output artifact to annotate (e.g., `"crawled_links"`) |
| `metadata` | `dict` | Dictionary containing crawl statistics, keyed by URL domain |
Example Metadata Dictionary
```python
{
    "medium.com": 5,
    "linkedin.com": 12,
    "github.com": 3,
}
```
This indicates that in a given pipeline run, 5 Medium articles, 12 LinkedIn posts, and 3 GitHub repositories were successfully crawled.
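A dictionary of this shape can be built from a list of crawled URLs with a small standalone sketch. Here `urllib.parse.urlparse(...).netloc` stands in for the step's `link.split("/")[2]` trick; the two are equivalent for well-formed absolute URLs, which is an assumption of this sketch:

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(links: list[str]) -> dict[str, int]:
    """Build the per-domain metadata dictionary from successfully crawled URLs."""
    # urlparse(link).netloc extracts the domain, like link.split("/")[2]
    return dict(Counter(urlparse(link).netloc for link in links))


links = [
    "https://medium.com/some-article",
    "https://medium.com/another-article",
    "https://github.com/user/repo",
]
print(count_domains(links))  # {'medium.com': 2, 'github.com': 1}
```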
Outputs
The method returns None. The metadata is persisted as a side effect in ZenML's metadata store and becomes accessible through:
- ZenML Dashboard: Visible in the step detail view of a pipeline run
- ZenML Python Client: Queryable via `client.get_artifact_version()`, which returns metadata on the artifact
- ZenML CLI: Inspectable via `zenml artifact describe` commands
Metadata Lifecycle
1. Step execution begins (ZenML @step decorator)
2. Step logic runs (crawling URLs, collecting statistics)
3. get_step_context() retrieves the active step context
4. add_output_metadata() attaches statistics to the output artifact
5. Step returns its output value
6. ZenML persists both the output artifact and its metadata
7. Dashboard/API surfaces the metadata for this run
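Steps 3 through 6 of this lifecycle can be sketched with a toy stand-in for the step context. `FakeStepContext` below is invented for illustration and is not ZenML's actual `StepContext` implementation:

```python
class FakeStepContext:
    """Toy stand-in for ZenML's StepContext (illustration only)."""

    def __init__(self) -> None:
        self._output_metadata: dict[str, dict] = {}

    def add_output_metadata(self, output_name: str, metadata: dict) -> None:
        # Step 4: merge the statistics into the named output's metadata
        self._output_metadata.setdefault(output_name, {}).update(metadata)


def run_step() -> tuple[list[str], dict]:
    ctx = FakeStepContext()  # step 3: retrieve the active context
    metadata = {"medium.com": 2, "github.com": 1}
    ctx.add_output_metadata(output_name="crawled_links", metadata=metadata)
    # Step 5: return the output; step 6 would persist artifact plus metadata
    return ["https://medium.com/a", "https://github.com/u/r"], ctx._output_metadata


output, persisted = run_step()
print(persisted)  # {'crawled_links': {'medium.com': 2, 'github.com': 1}}
```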
Design Notes
- Timing: `add_output_metadata()` should be called before the step returns. Metadata added after the return statement is not captured.
- Idempotency: Calling `add_output_metadata()` multiple times for the same `output_name` merges the metadata dictionaries (later calls overwrite keys from earlier calls).
- Schema Freedom: The metadata dictionary has no enforced schema. It is the step author's responsibility to maintain consistent key naming across runs.
- Error Isolation: If `add_output_metadata()` fails (e.g., due to serialization issues), it should not prevent the step from returning its primary output. In practice, ZenML handles this gracefully.
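The merge-and-overwrite behavior described under Idempotency matches plain `dict.update` semantics, which makes it easy to illustrate without ZenML (a simulation, not the library internals):

```python
# Simulate two successive add_output_metadata() calls on the same output_name
merged: dict = {}
merged.update({"medium.com": 2, "github.com": 1})    # first call
merged.update({"github.com": 5, "linkedin.com": 3})  # second call wins on "github.com"
print(merged)  # {'medium.com': 2, 'github.com': 5, 'linkedin.com': 3}
```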
External References
Source References
- Crawl links step: steps/etl/crawl_links.py:L42-54
- Full step file: steps/etl/crawl_links.py
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Pipeline_Reporting -- the principle this implements
- Implementation:PacktPublishing_LLM_Engineers_Handbook_CrawlerDispatcher_Build -- the dispatcher that produces the crawl results being reported
- Implementation:PacktPublishing_LLM_Engineers_Handbook_BaseCrawler_Extract -- the crawlers whose results are aggregated in the metadata
- Environment:PacktPublishing_LLM_Engineers_Handbook_Python_3_11_Poetry_Environment