
Implementation:PacktPublishing LLM Engineers Handbook ZenML Step Context Metadata

From Leeroopedia


Type: Wrapper Doc (ZenML external API)
API: get_step_context().add_output_metadata(metadata: dict, output_name: str | None = None) -> None
Source: steps/etl/crawl_links.py:L42-54
Import: from zenml import get_step_context
Implements: Principle:PacktPublishing_LLM_Engineers_Handbook_Pipeline_Reporting

Overview

The ZenML Step Context Metadata mechanism allows pipeline steps to attach structured runtime metadata to their outputs. In the Digital Data ETL pipeline, the crawl_links step uses this API to report per-domain crawling statistics (number of successfully crawled URLs, broken down by platform). This metadata becomes visible in the ZenML dashboard and is queryable via the ZenML API.

Usage in Source

from zenml import get_step_context

# Inside the crawl_links step, after processing all URLs:
step_context = get_step_context()
step_context.add_output_metadata(
    output_name="crawled_links",
    metadata=metadata,
)

Full Step Context

The metadata attachment occurs at the end of the crawl_links ZenML step:

from loguru import logger
from typing_extensions import Annotated
from zenml import step, get_step_context

from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher
from llm_engineering.domain.documents import UserDocument


@step
def crawl_links(user: UserDocument, links: list[str]) -> Annotated[list[str], "crawled_links"]:
    dispatcher = (
        CrawlerDispatcher.build()
        .register_linkedin()
        .register_medium()
        .register_github()
    )

    metadata = {}
    successful_links = []

    for link in links:
        try:
            crawler = dispatcher.get_crawler(link)
            crawler.extract(link=link, user=user)

            successful_links.append(link)

            # Track per-domain statistics
            domain = link.split("/")[2]
            metadata[domain] = metadata.get(domain, 0) + 1
        except Exception:
            logger.exception(f"Failed to crawl link: {link}")

    step_context = get_step_context()
    step_context.add_output_metadata(
        output_name="crawled_links",
        metadata=metadata,
    )

    return successful_links

API Reference

get_step_context()

Import: from zenml import get_step_context
Signature: get_step_context() -> StepContext
Description: Returns the current step's execution context object. Must be called from within a running ZenML step (i.e., a function decorated with @step).
Raises: RuntimeError if called outside a ZenML step execution

StepContext.add_output_metadata()

Signature: add_output_metadata(metadata: dict, output_name: str | None = None) -> None
Parameter metadata: A dictionary of key-value pairs to attach. Keys are strings; values can be strings, numbers, booleans, or nested dictionaries.
Parameter output_name: The name of the step output to annotate. Must match a declared output of the step; it may be omitted when the step has exactly one output. In the crawl step, this is "crawled_links".
Returns: None
Side Effect: The metadata is persisted with the step's output artifact in ZenML's metadata store

Inputs

output_name (str): Name of the step output artifact to annotate (e.g., "crawled_links")
metadata (dict): Dictionary containing crawl statistics, keyed by URL domain

Example Metadata Dictionary

{
    "medium.com": 5,
    "linkedin.com": 12,
    "github.com": 3,
}

This indicates that in a given pipeline run, 5 Medium articles, 12 LinkedIn posts, and 3 GitHub repositories were successfully crawled.
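The per-domain tally can be reproduced in isolation. The sketch below uses the same `link.split("/")[2]` extraction as the step; the URLs are illustrative placeholders, not data from an actual run:

```python
# Tally successful crawls per domain, mirroring the step's bookkeeping.
links = [
    "https://medium.com/@user/post-1",
    "https://medium.com/@user/post-2",
    "https://github.com/user/repo",
]

metadata: dict[str, int] = {}
for link in links:
    # For an absolute URL, index 2 of the "/" split is the network location.
    domain = link.split("/")[2]
    metadata[domain] = metadata.get(domain, 0) + 1

print(metadata)  # {'medium.com': 2, 'github.com': 1}
```

Note that this naive split assumes absolute URLs with a scheme; a relative link would yield a wrong "domain", which is acceptable here because the crawler dispatcher only handles absolute URLs.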

Outputs

The method returns None. The metadata is persisted as a side effect in ZenML's metadata store and becomes accessible through:

  • ZenML Dashboard: Visible in the step detail view of a pipeline run
  • ZenML Python Client: Queryable via client.get_artifact_version() which returns metadata on the artifact
  • ZenML CLI: Inspectable via zenml artifact describe commands

Metadata Lifecycle

1. Step execution begins (ZenML @step decorator)
2. Step logic runs (crawling URLs, collecting statistics)
3. get_step_context() retrieves the active step context
4. add_output_metadata() attaches statistics to the output artifact
5. Step returns its output value
6. ZenML persists both the output artifact and its metadata
7. Dashboard/API surfaces the metadata for this run

Design Notes

  • Timing: add_output_metadata() must be called before the step returns; code placed after the return statement never executes, so a metadata call there has no effect.
  • Repeated Calls: Calling add_output_metadata() multiple times for the same output_name merges the metadata dictionaries, with later calls overwriting keys from earlier calls.
  • Schema Freedom: The metadata dictionary has no enforced schema. It is the step author's responsibility to maintain consistent key naming across runs.
  • Error Isolation: If add_output_metadata() fails (e.g., due to serialization issues), it should not prevent the step from returning its primary output. In practice, ZenML handles this gracefully.
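The merge-on-repeat behavior described above can be illustrated with a minimal stand-in for the context object. `FakeStepContext` is not part of ZenML; it only mimics the merge semantics assumed in these notes:

```python
class FakeStepContext:
    """Illustrative stand-in for ZenML's StepContext; not the real class."""

    def __init__(self) -> None:
        self.output_metadata: dict[str, dict] = {}

    def add_output_metadata(self, metadata: dict, output_name: str) -> None:
        # Repeated calls for the same output merge; later keys win.
        self.output_metadata.setdefault(output_name, {}).update(metadata)


ctx = FakeStepContext()
ctx.add_output_metadata({"medium.com": 5, "github.com": 1}, output_name="crawled_links")
ctx.add_output_metadata({"github.com": 3}, output_name="crawled_links")

print(ctx.output_metadata["crawled_links"])
# {'medium.com': 5, 'github.com': 3}
```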
