
Workflow:PacktPublishing LLM Engineers Handbook Feature Engineering

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Feature_Engineering, NLP
Last Updated 2026-02-08 07:45 GMT

Overview

End-to-end process for transforming raw crawled documents into cleaned text and vector embeddings stored in a Qdrant vector database for downstream retrieval and dataset generation.

Description

This workflow takes the raw documents collected by the ETL pipeline from MongoDB, applies content-type-specific cleaning transformations, splits cleaned text into chunks, generates vector embeddings using sentence-transformers, and loads everything into Qdrant. It uses a dispatcher pattern to route each document to the correct cleaning, chunking, and embedding handler based on its content type (article, post, repository). The pipeline produces two parallel outputs: cleaned documents for dataset generation and embedded chunks for RAG retrieval.
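The dispatcher pattern described above can be sketched as a registry that maps each content type to its handler. This is a minimal illustration, not the book's exact implementation; the class name `CleaningDispatcher` comes from the workflow, but the registration mechanism and the toy handler are assumptions.

```python
class CleaningDispatcher:
    """Routes a document to the handler registered for its content type."""

    _handlers = {}

    @classmethod
    def register(cls, content_type):
        """Decorator that registers a handler for one content type."""
        def decorator(handler):
            cls._handlers[content_type] = handler
            return handler
        return decorator

    @classmethod
    def dispatch(cls, document):
        handler = cls._handlers.get(document["type"])
        if handler is None:
            raise ValueError(f"no handler for content type {document['type']!r}")
        return handler(document)


@CleaningDispatcher.register("article")
def clean_article(doc):
    # Placeholder cleaning; the real handler applies the Step 2 transforms.
    return {**doc, "content": doc["content"].strip()}
```

The same registry shape works for the chunking and embedding dispatchers, which is why the pipeline can add a new content type by registering one handler per stage.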

Usage

Execute this workflow after the Digital Data ETL pipeline has populated the MongoDB data warehouse with raw documents. Use it when you need to transform raw content into searchable vector representations for the RAG system and to prepare cleaned text for fine-tuning dataset generation.

Execution Steps

Step 1: Query Data Warehouse

Fetch all raw documents from MongoDB for the specified author(s). This queries across all document types (ArticleDocument, PostDocument, RepositoryDocument) and aggregates them into a single collection for processing.

Key considerations:

  • Queries are filtered by author full name(s) specified in the pipeline configuration
  • All document types are retrieved: articles, posts, and repositories
  • Documents are returned as typed domain model instances
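The per-author aggregation in Step 1 can be sketched as a loop over the document-type collections. An in-memory dict stands in for MongoDB here, and the field name `author_full_name` is an assumption based on the configuration described above.

```python
DOCUMENT_TYPES = ("articles", "posts", "repositories")

def fetch_raw_documents(warehouse, author_full_names):
    """Query every document-type collection for the given authors and
    aggregate the results into a single list (Step 1)."""
    results = []
    for doc_type in DOCUMENT_TYPES:
        for doc in warehouse.get(doc_type, []):
            if doc["author_full_name"] in author_full_names:
                # Tag each document with its type so later dispatchers
                # can route it to the right handler.
                results.append({**doc, "type": doc_type})
    return results
```

In the real pipeline each collection query returns typed domain model instances (ArticleDocument, PostDocument, RepositoryDocument) rather than plain dicts.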

Step 2: Clean Documents

Apply content-type-specific cleaning transformations to each raw document using the CleaningDispatcher. Each content type has a dedicated handler that normalizes text by removing special characters, collapsing whitespace, and stripping platform-specific formatting artifacts.

Key considerations:

  • The CleaningDispatcher routes documents to type-specific handlers (article, post, repository)
  • Cleaning operations include regex-based text normalization and whitespace collapsing
  • Output is a set of CleanedDocument instances stored in Qdrant for later retrieval
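The regex-based normalization mentioned above might look like the following sketch. The exact character set kept by each handler is an assumption; the real handlers differ per content type.

```python
import re

def clean_text(text: str) -> str:
    """Normalize raw text: drop special characters, collapse whitespace."""
    # Keep word characters, whitespace, and basic punctuation
    # (assumption: the real per-type handlers use different sets).
    text = re.sub(r"[^\w\s.,!?'\"()-]", " ", text)
    # Collapse any run of whitespace (newlines, tabs, spaces) to one space.
    return re.sub(r"\s+", " ", text).strip()
```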

Step 3: Load Cleaned Documents to Vector DB

Bulk-insert the cleaned documents into Qdrant. These serve as the source material for dataset generation pipelines that need access to cleaned (but not chunked) text.

Key considerations:

  • Cleaned documents are stored in Qdrant collections organized by document type
  • The VectorBaseDocument base class handles Qdrant CRUD operations
  • This step runs in parallel with the chunking/embedding branch
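A minimal sketch of the per-type bulk insert, with a plain dict standing in for Qdrant; the `cleaned_` collection-name prefix is illustrative, and in the real pipeline this step is handled by the VectorBaseDocument CRUD layer.

```python
def load_cleaned_documents(store, documents):
    """Group cleaned documents by content type and bulk-insert each group
    into its own collection (in-memory stand-in for Qdrant)."""
    for doc in documents:
        # One collection per document type, as described in Step 3.
        store.setdefault(f"cleaned_{doc['type']}", []).append(doc)
    return {name: len(docs) for name, docs in store.items()}
```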

Step 4: Chunk and Embed

Split each cleaned document into smaller chunks using the ChunkingDispatcher, then generate vector embeddings for each chunk using the EmbeddingDispatcher. Chunking strategies are content-type-specific (different chunk sizes and overlap for articles vs. code repositories). Embeddings are generated using a sentence-transformers model loaded as a thread-safe singleton.

Key considerations:

  • Chunking is dispatched by content type with configurable chunk sizes and overlap
  • Embeddings are generated in batches of 10 chunks for efficiency
  • The EmbeddingModelSingleton ensures the model is loaded only once across threads
  • Output is a list of EmbeddedChunk instances with vector representations
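The chunking, batching, and singleton behaviors above can be sketched as follows. The chunk sizes are illustrative (the real pipeline picks them per content type), and a toy `embed()` stands in for the actual sentence-transformers model; only the double-checked locking in `EmbeddingModelSingleton` mirrors the described thread-safe loading.

```python
import threading

def chunk_text(text, chunk_size=500, overlap=50):
    """Sliding-window chunking with overlap between consecutive chunks."""
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

class EmbeddingModelSingleton:
    """Thread-safe lazy singleton: the model is constructed at most once,
    no matter how many threads request it."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:  # double-checked locking
                    cls._instance = super().__new__(cls)
        return cls._instance

    def embed(self, texts):
        # Stub: the real model returns a dense float vector per text.
        return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

def embed_chunks(chunks, batch_size=10):
    """Embed chunks in batches (the workflow uses batches of 10)."""
    model = EmbeddingModelSingleton()
    embedded = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        embedded.extend(
            {"chunk": c, "vector": v} for c, v in zip(batch, model.embed(batch))
        )
    return embedded
```

With `chunk_size=500` and `overlap=50`, each chunk shares its last 50 characters with the start of the next, so sentences cut at a boundary remain retrievable from at least one chunk.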

Step 5: Load Embedded Chunks to Vector DB

Bulk-insert the embedded chunks into Qdrant vector database collections. These embedded chunks power the RAG retrieval system, enabling semantic search across the author's content.

Key considerations:

  • Embedded chunks are stored in type-specific Qdrant collections (articles, posts, repositories)
  • Vector indices enable efficient similarity search at query time
  • Metadata (author ID, content type) is preserved for filtered retrieval
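Metadata-filtered retrieval over the stored chunks can be sketched as cosine ranking after a metadata filter; an in-memory list stands in for a Qdrant collection, and the `author_id` field name is an assumption (Qdrant itself applies payload filters and uses vector indices rather than a linear scan).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_search(collection, query_vector, author_id=None, top_k=3):
    """Rank points by similarity to the query, optionally restricting to
    one author's content first (filtered retrieval, Step 5)."""
    candidates = [p for p in collection
                  if author_id is None or p["author_id"] == author_id]
    candidates.sort(key=lambda p: cosine(p["vector"], query_vector),
                    reverse=True)
    return candidates[:top_k]
```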

Execution Diagram

GitHub URL

Workflow Repository