Workflow: PacktPublishing LLM Engineer's Handbook Feature Engineering
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Feature_Engineering, NLP |
| Last Updated | 2026-02-08 07:45 GMT |
Overview
End-to-end process for transforming raw crawled documents into cleaned text and vector embeddings stored in a Qdrant vector database for downstream retrieval and dataset generation.
Description
This workflow takes the raw documents collected by the ETL pipeline from MongoDB, applies content-type-specific cleaning transformations, splits cleaned text into chunks, generates vector embeddings using sentence-transformers, and loads everything into Qdrant. It uses a dispatcher pattern to route each document to the correct cleaning, chunking, and embedding handler based on its content type (article, post, repository). The pipeline produces two parallel outputs: cleaned documents for dataset generation and embedded chunks for RAG retrieval.
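The dispatcher pattern described above can be sketched as a simple lookup from content type to handler. The handler names and logic below are illustrative placeholders, not the book's exact API:

```python
# Minimal sketch of the dispatcher pattern: route each document to a
# handler keyed by its content type. Handler bodies are stand-ins.
from typing import Callable, Dict

def clean_article(text: str) -> str:
    return " ".join(text.split())  # placeholder cleaning logic

def clean_post(text: str) -> str:
    return " ".join(text.split())

def clean_repository(text: str) -> str:
    return " ".join(text.split())

HANDLERS: Dict[str, Callable[[str], str]] = {
    "article": clean_article,
    "post": clean_post,
    "repository": clean_repository,
}

def dispatch_clean(content_type: str, text: str) -> str:
    """Look up and apply the handler for this content type."""
    try:
        handler = HANDLERS[content_type]
    except KeyError:
        raise ValueError(f"No cleaning handler for content type: {content_type}")
    return handler(text)
```

The same routing shape is reused for chunking and embedding, with a separate handler table per stage.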
Usage
Execute this workflow after the Digital Data ETL pipeline has populated the MongoDB data warehouse with raw documents. Use it when you need to transform raw content into searchable vector representations for the RAG system and to prepare cleaned text for fine-tuning dataset generation.
Execution Steps
Step 1: Query Data Warehouse
Fetch all raw documents from MongoDB for the specified author(s). This queries across all document types (ArticleDocument, PostDocument, RepositoryDocument) and aggregates them into a single collection for processing.
Key considerations:
- Queries are filtered by author full name(s) specified in the pipeline configuration
- All document types are retrieved: articles, posts, and repositories
- Documents are returned as typed domain model instances
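In-memory terms, the query step amounts to filtering every document type by author full name and aggregating the results. This stand-in uses assumed field names and skips the actual MongoDB client:

```python
# Illustrative stand-in for the warehouse query: filter raw documents by
# author full name across all document types. Field names are assumptions.
from dataclasses import dataclass
from typing import Iterable, List, Sequence

@dataclass
class RawDocument:
    doc_type: str          # "article" | "post" | "repository"
    author_full_name: str
    content: str

def fetch_raw_documents(
    warehouse: Iterable[RawDocument], author_names: Sequence[str]
) -> List[RawDocument]:
    """Aggregate every document type for the given authors into one list."""
    wanted = set(author_names)
    return [doc for doc in warehouse if doc.author_full_name in wanted]
```

In the real pipeline the equivalent filter is issued per collection against MongoDB, and the typed results are concatenated.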
Step 2: Clean Documents
Apply content-type-specific cleaning transformations to each raw document using the CleaningDispatcher. Each content type has a dedicated handler that normalizes text by removing special characters, collapsing whitespace, and stripping platform-specific formatting artifacts.
Key considerations:
- The CleaningDispatcher routes documents to type-specific handlers (article, post, repository)
- Cleaning operations include regex-based text normalization and whitespace collapsing
- Output is a set of CleanedDocument instances, which the next steps store in Qdrant for later retrieval
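The regex-based normalization mentioned above might look like the following. The exact patterns in the book's handlers differ per content type; this shows only the general shape:

```python
import re

def clean_text(text: str) -> str:
    """Normalize raw text: drop special characters, collapse whitespace.

    A simplified sketch of a cleaning handler, not the book's exact regexes.
    """
    text = re.sub(r"[^\w\s.,!?'\"-]", " ", text)  # strip special characters
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip()
```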
Step 3: Load Cleaned Documents to Vector DB
Bulk-insert the cleaned documents into Qdrant. These serve as the source material for dataset generation pipelines that need access to cleaned (but not chunked) text.
Key considerations:
- Cleaned documents are stored in Qdrant collections organized by document type
- The VectorBaseDocument base class handles Qdrant CRUD operations
- This step runs in parallel with the chunking/embedding branch
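Before insertion, documents are organized into per-type collections. The real pipeline delegates the bulk insert to the VectorBaseDocument base class; this sketch only shows the grouping idea, with assumed collection names:

```python
# Hedged sketch: group cleaned documents into type-specific collection
# names prior to a bulk insert. Collection naming here is an assumption.
from collections import defaultdict
from typing import Dict, Iterable, List

def group_by_collection(cleaned_docs: Iterable[dict]) -> Dict[str, List[dict]]:
    """Map each cleaned document to a per-type collection bucket."""
    collections: Dict[str, List[dict]] = defaultdict(list)
    for doc in cleaned_docs:
        collections[f"cleaned_{doc['type']}s"].append(doc)
    return dict(collections)
```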
Step 4: Chunk and Embed
Split each cleaned document into smaller chunks using the ChunkingDispatcher, then generate vector embeddings for each chunk using the EmbeddingDispatcher. Chunking strategies are content-type-specific (different chunk sizes and overlap for articles vs. code repositories). Embeddings are generated using a sentence-transformers model loaded as a thread-safe singleton.
Key considerations:
- Chunking is dispatched by content type with configurable chunk sizes and overlap
- Embeddings are generated in batches of 10 chunks for efficiency
- The EmbeddingModelSingleton ensures the model is loaded only once across threads
- Output is a list of EmbeddedChunk instances with vector representations
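The mechanics above can be sketched in a few lines: character-level chunking with overlap, fixed-size batching (10 chunks per batch, per the text), and a thread-safe lazy singleton. Character splitting and the stubbed-out model loading are simplifications:

```python
import threading
from typing import Iterator, List, Sequence

def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into overlapping character chunks (simplified strategy)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def batched(items: Sequence, batch_size: int = 10) -> Iterator[Sequence]:
    """Yield successive batches, as when embedding chunks 10 at a time."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

class EmbeddingModelSingleton:
    """Thread-safe lazy singleton; actual model loading is stubbed out here."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:                # double-checked locking
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance
```

Every call to `EmbeddingModelSingleton()` returns the same instance, so the sentence-transformers model is loaded at most once per process regardless of how many worker threads request it.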
Step 5: Load Embedded Chunks to Vector DB
Bulk-insert the embedded chunks into Qdrant vector database collections. These embedded chunks power the RAG retrieval system, enabling semantic search across the author's content.
Key considerations:
- Embedded chunks are stored in type-specific Qdrant collections (articles, posts, repositories)
- Vector indices enable efficient similarity search at query time
- Metadata (author ID, content type) is preserved for filtered retrieval
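A toy illustration of the metadata-filtered similarity search this storage enables; a real deployment issues the equivalent query to Qdrant, but the scoring idea is the same:

```python
import math
from typing import List, Optional, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(
    chunks: List[dict],
    query_vector: Sequence[float],
    author_id: Optional[str] = None,
    top_k: int = 3,
) -> List[dict]:
    """Rank embedded chunks by similarity, optionally filtering by author."""
    candidates = [c for c in chunks if author_id is None or c["author_id"] == author_id]
    candidates.sort(key=lambda c: cosine(c["vector"], query_vector), reverse=True)
    return candidates[:top_k]
```

In Qdrant the author filter is applied server-side via payload filtering, so the similarity search only scores chunks that match the metadata condition.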