Implementation:Huggingface Datasets Xml Builder

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Loading, Structured_Data
Last Updated	2026-02-14 18:00 GMT

Overview

Packaged dataset builder for loading XML files into datasets.

Description

Xml is an ArrowBasedBuilder subclass that loads XML files into HuggingFace Datasets. It is one of the built-in packaged modules, meaning users can invoke it directly via load_dataset("xml", data_files=...) without writing a custom builder script. The builder is configured through XmlConfig, a BuilderConfig dataclass that controls encoding and error handling.

Unlike the Text builder, which supports multiple sampling modes, the Xml builder reads each file in its entirety as a single example. By default, the raw XML content is stored as a single string in an "xml" column. If custom features are specified in the config, the builder uses those column names and casts the table to that schema, using a cheap Arrow cast where possible and a more expensive storage cast otherwise (e.g., string to numeric types).

The builder handles split generation by downloading and extracting data files, then iterating over them. Each file produces exactly one row in the output dataset, with the full XML content as the value.
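
The one-file-per-row behavior described above can be pictured with a small standard-library sketch. This is not the builder's actual code (the real builder emits Arrow tables); `read_xml_rows` is a hypothetical helper, and only the `encoding` and `encoding_errors` names mirror the config fields.

```python
from typing import Dict, Iterable, Iterator, Optional


def read_xml_rows(
    files: Iterable[str],
    encoding: str = "utf-8",
    encoding_errors: Optional[str] = None,
    column: str = "xml",
) -> Iterator[Dict[str, str]]:
    """Yield one row per file, holding the file's full text (hypothetical helper)."""
    for path in files:
        # The whole file is read at once; nothing is split per line or per element.
        with open(path, encoding=encoding, errors=encoding_errors) as f:
            yield {column: f.read()}
```

Under this model, three input files always yield three rows, regardless of how many elements each document contains.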

Usage

Use Xml when you need to load raw XML files into a HuggingFace Dataset for downstream processing. It is typically invoked via load_dataset("xml", data_files=...). This builder loads each XML file as a single string example, so further parsing of the XML structure should be done via Dataset.map() or similar post-processing.

Code Reference

Source Location

Repository: datasets
File: src/datasets/packaged_modules/xml/xml.py
Lines: 1-70

Signature

@dataclass
class XmlConfig(datasets.BuilderConfig):
    """BuilderConfig for xml files."""
    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None


class Xml(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = XmlConfig

    # Method bodies are elided; only the signatures are shown.
    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, files): ...
    def _generate_tables(self, files): ...

Import

# Typically used via load_dataset, not imported directly
from datasets import load_dataset

ds = load_dataset("xml", data_files="path/to/file.xml")

I/O Contract

XmlConfig Fields

Name	Type	Default	Description
features	`Optional[datasets.Features]`	`None`	Explicit schema for the output dataset. If None, a single `"xml"` string column is used.
encoding	`str`	`"utf-8"`	Character encoding used to read the XML files.
encoding_errors	`Optional[str]`	`None`	How to handle encoding errors (e.g., `"strict"`, `"ignore"`, `"replace"`). Passed to Python's `open()` as its `errors` argument, where `None` behaves like `"strict"`.
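
To make the `encoding_errors` choices concrete, here is a minimal sketch of how the standard codec error handlers behave. It uses `bytes.decode` on a made-up byte string rather than the builder itself; the same handler names are accepted by `open()`.

```python
# A byte string that is valid Latin-1 but invalid UTF-8 (0xE9 is "é" in Latin-1).
raw = b"<name>Ren\xe9</name>"


def decode_with(errors: str) -> str:
    """Decode `raw` as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=errors)


ignored = decode_with("ignore")    # the invalid byte is silently dropped
replaced = decode_with("replace")  # the invalid byte becomes U+FFFD

try:
    decode_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    # "strict" refuses to guess and raises instead.
    strict_failed = True

# Choosing the right encoding avoids the problem altogether.
correct = raw.decode("latin-1")
```

Both `"ignore"` and `"replace"` lose information, so they are best kept for files whose real encoding cannot be determined.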

Inputs

Name	Type	Required	Description
data_files	`str`, `List[str]`, or `Dict[str, str/List[str]]`	Yes	Path(s) to the XML file(s) to load.

Outputs

Name	Type	Description
dataset	`Dataset`	Arrow-backed dataset with an `"xml"` column (or custom features if specified). Each row contains the full content of one XML file.

Usage Examples

Basic XML Loading

from datasets import load_dataset

# Load XML files - each file becomes one row
ds = load_dataset("xml", data_files="data/*.xml", split="train")
print(ds[0])  # {"xml": "<?xml version='1.0'?>..."}

Loading with Custom Encoding

from datasets import load_dataset

# Load XML files with latin-1 encoding
ds = load_dataset(
    "xml",
    data_files="legacy_data/*.xml",
    encoding="latin-1",
    encoding_errors="replace",
    split="train",
)

Post-Processing XML Content

from datasets import load_dataset
import xml.etree.ElementTree as ET

ds = load_dataset("xml", data_files="records.xml", split="train")

# Parse the XML content in a map function
def extract_fields(example):
    root = ET.fromstring(example["xml"])
    # findtext returns None instead of raising when an element is missing
    return {"title": root.findtext("title"), "body": root.findtext("body")}

ds = ds.map(extract_fields)
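
When a single XML file holds many records, the one-row-per-file output can be expanded into one row per record with a batched map. The sketch below assumes a hypothetical layout in which each document contains repeated `<record>` elements with `<title>` and `<body>` children; `explode_records` and the `record` tag name are illustrative, not part of the library.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def explode_records(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn every <record> element in every document into its own row."""
    titles: List[str] = []
    bodies: List[str] = []
    for document in batch["xml"]:
        root = ET.fromstring(document)
        for record in root.iter("record"):
            # findtext returns the default instead of raising when a child is missing.
            titles.append(record.findtext("title", default=""))
            bodies.append(record.findtext("body", default=""))
    return {"title": titles, "body": bodies}
```

It could be applied with `ds.map(explode_records, batched=True, remove_columns=["xml"])`. Removing the original column matters here because a batched function may only change the number of rows when every returned column has the same new length.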

Related Pages

Implements Principle

Principle:Huggingface_Datasets_XML_Dataset_Building
