Implementation:CrewAIInc CrewAI RAG Data Types

From Leeroopedia
Knowledge Sources
Domains RAG, Data_Loading
Last Updated 2026-02-11 00:00 GMT

Overview

Defines the DataType enumeration of supported content types, maps each type to its loader and chunker, and provides automatic content type detection for the CrewAI RAG system.

Description

This module contains two components that connect content types to the loaders and chunkers used in the RAG pipeline.

DataType is a string enum with 17 members representing all supported content types: FILE, PDF_FILE, TEXT_FILE, CSV, JSON, XML, DOCX, MDX, MYSQL, POSTGRES, GITHUB, DIRECTORY, WEBSITE, DOCS_SITE, YOUTUBE_VIDEO, YOUTUBE_CHANNEL, and TEXT. Each enum member provides two methods:

  • get_chunker() returns the appropriate chunker instance by dynamically importing from the chunkers package. Text-based types use TextChunker, structured formats (CSV, JSON, XML) use specialized chunkers, and web content uses WebsiteChunker.
  • get_loader() returns the appropriate loader instance by dynamically importing from the loaders package. Each data type maps to a specific loader class (e.g., PDF_FILE maps to PDFLoader, GITHUB maps to GithubLoader).
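
The dynamic-import dispatch described above can be sketched as follows. This is a minimal illustration of the pattern, not the module's actual code: SketchType, get_handler, and the registry are hypothetical names, and standard-library classes stand in for the real loader and chunker classes.

```python
from enum import Enum
from importlib import import_module


class SketchType(str, Enum):
    """Hypothetical, reduced stand-in for DataType."""

    JSON = "json"
    CSV = "csv"

    def get_handler(self):
        # Map each member to (module path, class name). The stdlib
        # targets are placeholders for loader/chunker classes.
        registry = {
            SketchType.JSON: ("json", "JSONDecoder"),
            SketchType.CSV: ("csv", "Sniffer"),
        }
        if self not in registry:
            raise ValueError(f"No handler registered for {self!r}")
        module_path, class_name = registry[self]
        # The import runs only when a handler is requested, so unused
        # handlers never pull in their dependencies.
        module = import_module(module_path)
        return getattr(module, class_name)()


handler = SketchType.JSON.get_handler()
print(type(handler).__name__)
```

Deferring the import to call time is a common way to keep optional dependencies out of the import path until a given type is actually used.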

DataTypes is a utility class with a static from_content() method that performs automatic type detection. It examines file extensions (mapping .pdf, .csv, .json, .xml, .docx, .mdx, .md, and .txt to their respective types), URL patterns (detecting GitHub, docs sites, and general websites), and filesystem paths (distinguishing files from directories). Unrecognized content falls back to plain TEXT.
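
A rough sketch of this kind of detection is shown below. The order of the checks, the exact matching rules, and the names Kind, EXTENSIONS, and detect are assumptions made for illustration; the real from_content() may differ in all of these.

```python
import os
from enum import Enum
from urllib.parse import urlparse


class Kind(str, Enum):
    """Hypothetical, reduced stand-in for DataType."""

    PDF_FILE = "pdf_file"
    CSV = "csv"
    JSON = "json"
    TEXT_FILE = "text_file"
    GITHUB = "github"
    DOCS_SITE = "docs_site"
    WEBSITE = "website"
    DIRECTORY = "directory"
    TEXT = "text"


# Assumed subset of the extension mapping described above.
EXTENSIONS = {
    ".pdf": Kind.PDF_FILE,
    ".csv": Kind.CSV,
    ".json": Kind.JSON,
    ".txt": Kind.TEXT_FILE,
}


def by_extension(path):
    """Return the Kind for a known extension, or None."""
    suffix = os.path.splitext(path)[1].lower()
    return EXTENSIONS.get(suffix)


def detect(content=None):
    if content is None:
        return Kind.TEXT
    text = str(content)  # accepts str or pathlib.Path

    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # A URL that points at a known file type wins first.
        kind = by_extension(parsed.path)
        if kind is not None:
            return kind
        if "github.com" in parsed.netloc:
            return Kind.GITHUB
        if "docs" in parsed.netloc or "docs" in parsed.path:
            return Kind.DOCS_SITE
        return Kind.WEBSITE

    if os.path.isdir(text):
        return Kind.DIRECTORY

    # This sketch does not check that the file exists.
    return by_extension(text) or Kind.TEXT


print(detect("https://github.com/crewAIInc/crewAI"))
print(detect("data.json"))
```

Note that this sketch classifies by extension alone for non-URL strings; a stricter implementation might require the path to exist before treating it as a file.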

Usage

Import DataType when you need to explicitly specify a content type or access its loader/chunker. Import DataTypes when you need automatic content type detection from file paths, URLs, or strings.

Code Reference

Source Location

  • Repository: CrewAI
  • File: lib/crewai-tools/src/crewai_tools/rag/data_types.py
  • Lines: 1-154

Signature

class DataType(str, Enum):
    FILE = "file"
    PDF_FILE = "pdf_file"
    TEXT_FILE = "text_file"
    CSV = "csv"
    JSON = "json"
    XML = "xml"
    DOCX = "docx"
    MDX = "mdx"
    MYSQL = "mysql"
    POSTGRES = "postgres"
    GITHUB = "github"
    DIRECTORY = "directory"
    WEBSITE = "website"
    DOCS_SITE = "docs_site"
    YOUTUBE_VIDEO = "youtube_video"
    YOUTUBE_CHANNEL = "youtube_channel"
    TEXT = "text"

    def get_chunker(self) -> BaseChunker: ...
    def get_loader(self) -> BaseLoader: ...

class DataTypes:
    @staticmethod
    def from_content(content: str | Path | None = None) -> DataType: ...
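
Because DataType subclasses both str and Enum, its members behave like plain strings in comparisons and serialization. The sketch below shows this standard Python behavior on a reduced copy of the enum; DataTypeSketch and parse_type are hypothetical names used only for illustration.

```python
import json
from enum import Enum


class DataTypeSketch(str, Enum):
    """Reduced copy of the enum, for illustration only."""

    CSV = "csv"
    PDF_FILE = "pdf_file"


def parse_type(raw):
    """Look up a member by its string value; return None if unknown."""
    try:
        return DataTypeSketch(raw)
    except ValueError:
        return None


print(DataTypeSketch("csv") is DataTypeSketch.CSV)   # True
print(DataTypeSketch.CSV == "csv")                    # True
print(json.dumps({"type": DataTypeSketch.PDF_FILE}))  # {"type": "pdf_file"}
print(parse_type("parquet"))                          # None
```

This makes it straightforward to store a type in configuration or JSON as its string value and recover the member later.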

Import

from crewai_tools.rag.data_types import DataType, DataTypes

I/O Contract

Inputs (DataType.get_chunker)

Name  Type      Required  Description
self  DataType  Yes       The data type enum member

Inputs (DataType.get_loader)

Name  Type      Required  Description
self  DataType  Yes       The data type enum member

Inputs (DataTypes.from_content)

Name     Type               Required  Description
content  str | Path | None  No        File path, URL, or content string whose type should be detected (the default, None, returns TEXT)

Outputs

Name                   Type         Description
get_chunker() return   BaseChunker  Appropriate chunker instance for the data type
get_loader() return    BaseLoader   Appropriate loader instance for the data type
from_content() return  DataType     Detected data type enum member

Usage Examples

Basic Usage

from crewai_tools.rag.data_types import DataType, DataTypes

# Automatic type detection
dtype = DataTypes.from_content("/path/to/document.pdf")
# Returns DataType.PDF_FILE

dtype = DataTypes.from_content("https://github.com/crewAIInc/crewAI")
# Returns DataType.GITHUB

dtype = DataTypes.from_content("https://docs.example.com/guide")
# Returns DataType.DOCS_SITE

# Get loader and chunker for a type
loader = DataType.CSV.get_loader()     # Returns CSVLoader()
chunker = DataType.CSV.get_chunker()   # Returns CsvChunker()

# Detect and process
dtype = DataTypes.from_content("data.json")
loader = dtype.get_loader()
chunker = dtype.get_chunker()
