Implementation:Huggingface Datasets Xml Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Structured_Data |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Packaged dataset builder for loading XML files into datasets.
Description
Xml is an ArrowBasedBuilder subclass that loads XML files into HuggingFace Datasets. It is one of the built-in packaged modules, meaning users can invoke it directly via load_dataset("xml", data_files=...) without writing a custom builder script. The builder is configured through XmlConfig, a BuilderConfig dataclass that controls encoding and error handling.
Unlike the Text builder, which can split a file into lines, paragraphs, or whole documents (its sample_by option), the Xml builder always reads each file in its entirety as a single example. By default, the raw XML content is stored as a single string in an "xml" column. If custom features are specified in the config, the builder uses those column names and casts the table to that schema, using a cheap Arrow cast when possible and a more expensive storage cast when a feature requires it (e.g., string to numeric types).
The builder handles split generation by downloading and extracting data files, then iterating over them. Each file produces exactly one row in the output dataset, with the full XML content as the value.
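The one-file-per-row behaviour described above can be modelled in plain Python. This is a minimal sketch, not the builder's actual code: `read_xml_rows` is a hypothetical helper, and it yields dicts rather than Arrow tables so that it runs with the standard library alone.

```python
import tempfile
from pathlib import Path


def read_xml_rows(paths, encoding="utf-8", encoding_errors=None, column="xml"):
    """Yield one dict per file holding the file's full text (hypothetical helper)."""
    for path in paths:
        # Text mode translates "\r\n" and "\r" into "\n" by default.
        with open(path, encoding=encoding, errors=encoding_errors) as f:
            yield {column: f.read()}


# Write two small XML files to a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, body in [
        ("a.xml", "<doc><title>A</title></doc>"),
        ("b.xml", "<doc><title>B</title></doc>"),
    ]:
        path = Path(tmp) / name
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    rows = list(read_xml_rows(paths))

print(len(rows))  # one row per input file
print(rows[0])
```

The point of the sketch is that row count tracks file count: splitting a file into finer-grained records is left to post-processing.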
Usage
Use Xml when you need to load raw XML files into a HuggingFace Dataset for downstream processing. It is typically invoked via load_dataset("xml", data_files=...). This builder loads each XML file as a single string example, so further parsing of the XML structure should be done via Dataset.map() or similar post-processing.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/packaged_modules/xml/xml.py
- Lines: 1-70
Signature
@dataclass
class XmlConfig(datasets.BuilderConfig):
    """BuilderConfig for xml files."""

    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None


class Xml(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = XmlConfig

    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, files): ...
    def _generate_tables(self, files): ...
Import
# Typically used via load_dataset, not imported directly
from datasets import load_dataset
ds = load_dataset("xml", data_files="path/to/file.xml")
I/O Contract
XmlConfig Fields
| Name | Type | Default | Description |
|---|---|---|---|
| features | Optional[datasets.Features] | None | Explicit schema for the output dataset. If None, a single "xml" string column is used. |
| encoding | str | "utf-8" | Character encoding used to read the XML files. |
| encoding_errors | Optional[str] | None | How to handle encoding errors (e.g., "strict", "ignore", "replace"). Passed to Python's open(). |
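To make the encoding_errors choices concrete, the sketch below reads a Latin-1 encoded file with Python's built-in open() under different settings. It uses only the standard library and does not call the builder; the file contents and the `read` helper are made up for illustration.

```python
import os
import tempfile

# Create a file whose bytes are Latin-1, so they are not valid UTF-8.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write("<name>café</name>".encode("latin-1"))


def read(encoding, errors=None):
    """Read the whole file as text (hypothetical helper)."""
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()


# errors=None behaves like "strict": undecodable bytes raise an exception.
try:
    read("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = read("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = read("utf-8", errors="ignore")    # bad byte is silently dropped
correct = read("latin-1")                   # right encoding, no data loss

os.remove(path)
print(strict_failed, repr(replaced), repr(ignored), repr(correct))
```

Both "replace" and "ignore" lose information, so when the true encoding is known, setting encoding correctly is usually preferable to relaxing encoding_errors.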
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_files | str, List[str], or Dict[str, str/List[str]] | Yes | Path(s) to the XML file(s) to load. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Arrow-backed dataset with an "xml" column (or custom features if specified). Each row contains the full content of one XML file. |
Usage Examples
Basic XML Loading
from datasets import load_dataset
# Load XML files - each file becomes one row
ds = load_dataset("xml", data_files="data/*.xml", split="train")
print(ds[0]) # {"xml": "<?xml version='1.0'?>..."}
Loading with Custom Encoding
from datasets import load_dataset
# Load XML files with latin-1 encoding
ds = load_dataset(
    "xml",
    data_files="legacy_data/*.xml",
    encoding="latin-1",
    encoding_errors="replace",
    split="train",
)
Post-Processing XML Content
from datasets import load_dataset
import xml.etree.ElementTree as ET
ds = load_dataset("xml", data_files="records.xml", split="train")
# Parse the XML content in a map function
def extract_fields(example):
    root = ET.fromstring(example["xml"])
    # findtext returns None instead of raising when an element is missing
    return {"title": root.findtext("title"), "body": root.findtext("body")}

ds = ds.map(extract_fields)
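When a single XML file holds many records, it can be useful to expand each one-row document into one row per record. The following is a sketch under assumptions: the `<record>`, `<title>` and `<body>` element names and the `explode_records` function are hypothetical, and the function is written as a plain batch-to-batch transform so it can be tried without loading a dataset.

```python
import xml.etree.ElementTree as ET


def explode_records(batch, record_tag="record", fields=("title", "body")):
    """Turn a batch of XML documents into one row per record element."""
    out = {field: [] for field in fields}
    for doc in batch["xml"]:
        root = ET.fromstring(doc)
        for rec in root.iter(record_tag):
            for field in fields:
                # findtext returns None when the child element is absent.
                out[field].append(rec.findtext(field))
    return out


# A one-document batch shaped like the builder's default "xml" column.
sample = {
    "xml": [
        "<records>"
        "<record><title>T1</title><body>B1</body></record>"
        "<record><title>T2</title></record>"
        "</records>"
    ]
}
rows = explode_records(sample)
print(rows)
```

With a loaded dataset, a function of this shape would typically be applied as `ds.map(explode_records, batched=True, remove_columns=["xml"])`; removing the original column is needed because the output batch has a different number of rows than the input.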