Implementation:CrewAIInc CrewAI RAG Data Types
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data_Loading |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Defines the data type enumeration for the CrewAI RAG system, maps each content type to its loader and chunker, and detects the type of a given piece of content automatically.
Description
This module contains two components, DataType and DataTypes, which together determine how a piece of content is loaded and chunked in the RAG pipeline.
DataType is a string enum with 17 members representing all supported content types: FILE, PDF_FILE, TEXT_FILE, CSV, JSON, XML, DOCX, MDX, MYSQL, POSTGRES, GITHUB, DIRECTORY, WEBSITE, DOCS_SITE, YOUTUBE_VIDEO, YOUTUBE_CHANNEL, and TEXT. Each enum member provides two methods:
- get_chunker() returns the appropriate chunker instance by dynamically importing from the chunkers package. Text-based types use TextChunker, structured formats (CSV, JSON, XML) use specialized chunkers, and web content uses WebsiteChunker.
- get_loader() returns the appropriate loader instance by dynamically importing from the loaders package. Each data type maps to a specific loader class (e.g., PDF_FILE maps to PDFLoader, GITHUB maps to GithubLoader).
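The dynamic-import dispatch described for get_chunker() and get_loader() can be sketched as follows. This is a minimal illustration, not the module's actual code: SketchType, get_handler(), and the registry are hypothetical names, and standard-library classes stand in for the real chunker and loader classes so the sketch runs without CrewAI installed.

```python
from enum import Enum
from importlib import import_module


class SketchType(str, Enum):
    """Hypothetical stand-in for DataType, reduced to two members."""

    CSV = "csv"
    JSON = "json"

    def get_handler(self):
        # Registry of member -> (module path, class name). The stdlib targets
        # are placeholders (an assumption for this sketch); the real module
        # points at classes in its own chunkers and loaders packages.
        registry = {
            SketchType.CSV: ("csv", "Sniffer"),
            SketchType.JSON: ("json", "JSONDecoder"),
        }
        module_path, class_name = registry[self]
        # The import is deferred until the method is called, so importing the
        # enum itself does not pull in every handler's dependencies.
        module = import_module(module_path)
        return getattr(module, class_name)()


handler = SketchType.JSON.get_handler()
print(type(handler).__name__)
```

Deferring the import to call time is the usual reason for this pattern: a caller that only handles CSV never pays for, or needs to install, the dependencies of the other handlers.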
DataTypes is a utility class with a static from_content() method that performs automatic type detection. It examines file extensions (mapping .pdf, .csv, .json, .xml, .docx, .mdx, .md, and .txt to their respective types), URL patterns (detecting GitHub, docs sites, and general websites), and filesystem paths (distinguishing files from directories). Content that matches none of these falls back to plain TEXT.
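The detection steps above can be approximated with a short standalone function. This is a sketch under stated assumptions rather than the actual from_content() implementation: detect_type() and EXTENSION_TYPES are hypothetical names, the function returns plain strings instead of DataType members, and the order of the checks, the targets chosen for .md and .txt, and the "docs" substring test are guesses made for illustration.

```python
import os
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical extension table. Mapping ".md" to "mdx" and ".txt" to
# "text_file" is an assumption; the source only says each extension maps
# to its respective type.
EXTENSION_TYPES = {
    ".pdf": "pdf_file",
    ".csv": "csv",
    ".json": "json",
    ".xml": "xml",
    ".docx": "docx",
    ".mdx": "mdx",
    ".md": "mdx",
    ".txt": "text_file",
}


def detect_type(content=None) -> str:
    """Return a type label for a path, URL, or raw string (illustrative only)."""
    if content is None:
        return "text"

    text = str(content)  # accepts both str and Path
    parsed = urlparse(text)
    is_url = bool(parsed.scheme and parsed.netloc)

    # 1. File extension, taken from the URL path when the input is a URL.
    suffix = Path(parsed.path if is_url else text).suffix.lower()
    if suffix in EXTENSION_TYPES:
        return EXTENSION_TYPES[suffix]

    # 2. URL patterns (the substring checks are assumptions).
    if is_url:
        if "github.com" in parsed.netloc:
            return "github"
        if "docs" in parsed.netloc or "docs" in parsed.path:
            return "docs_site"
        return "website"

    # 3. Filesystem paths.
    if os.path.isdir(text):
        return "directory"
    if os.path.isfile(text):
        return "file"

    # 4. Anything else is treated as raw text.
    return "text"


print(detect_type("/path/to/document.pdf"))
print(detect_type("https://github.com/crewAIInc/crewAI"))
```

In this sketch the extension check does not touch the filesystem, so a path such as data.json is classified even when no such file exists; whether the real method behaves the same way is not stated in the source.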
Usage
Import DataType when you need to explicitly specify a content type or access its loader/chunker. Import DataTypes when you need automatic content type detection from file paths, URLs, or strings.
Code Reference
Source Location
- Repository: CrewAI
- File: lib/crewai-tools/src/crewai_tools/rag/data_types.py
- Lines: 1-154
Signature
class DataType(str, Enum):
    FILE = "file"
    PDF_FILE = "pdf_file"
    TEXT_FILE = "text_file"
    CSV = "csv"
    JSON = "json"
    XML = "xml"
    DOCX = "docx"
    MDX = "mdx"
    MYSQL = "mysql"
    POSTGRES = "postgres"
    GITHUB = "github"
    DIRECTORY = "directory"
    WEBSITE = "website"
    DOCS_SITE = "docs_site"
    YOUTUBE_VIDEO = "youtube_video"
    YOUTUBE_CHANNEL = "youtube_channel"
    TEXT = "text"

    def get_chunker(self) -> BaseChunker: ...
    def get_loader(self) -> BaseLoader: ...

class DataTypes:
    @staticmethod
    def from_content(content: str | Path | None = None) -> DataType: ...
Import
from crewai_tools.rag.data_types import DataType, DataTypes
I/O Contract
Inputs (DataType.get_chunker)
| Name | Type | Required | Description |
|---|---|---|---|
| self | DataType | Yes | The data type enum member |
Inputs (DataType.get_loader)
| Name | Type | Required | Description |
|---|---|---|---|
| self | DataType | Yes | The data type enum member |
Inputs (DataTypes.from_content)
| Name | Type | Required | Description |
|---|---|---|---|
| content | str \| Path \| None | No | File path, URL, or content string whose type should be detected. Defaults to None, in which case TEXT is returned. |
Outputs
| Name | Type | Description |
|---|---|---|
| get_chunker() return | BaseChunker | Appropriate chunker instance for the data type |
| get_loader() return | BaseLoader | Appropriate loader instance for the data type |
| from_content() return | DataType | Detected data type enum member |
Usage Examples
Basic Usage
from crewai_tools.rag.data_types import DataType, DataTypes

# Automatic type detection
dtype = DataTypes.from_content("/path/to/document.pdf")
# Returns DataType.PDF_FILE

dtype = DataTypes.from_content("https://github.com/crewAIInc/crewAI")
# Returns DataType.GITHUB

dtype = DataTypes.from_content("https://docs.example.com/guide")
# Returns DataType.DOCS_SITE

# Get loader and chunker for a type
loader = DataType.CSV.get_loader()    # Returns CSVLoader()
chunker = DataType.CSV.get_chunker()  # Returns CsvChunker()

# Detect and process
dtype = DataTypes.from_content("data.json")
loader = dtype.get_loader()
chunker = dtype.get_chunker()