Principle:Togethercomputer Together python Text Preprocessing

From Leeroopedia

Overview

Text Preprocessing is the principle of cleaning and preparing raw text data before sending it to embedding generation or document reranking APIs in the Together Python SDK.

Description

Text preprocessing is a user-defined step for preparing raw text before sending it to embedding or reranking APIs. This may include removing HTML tags, normalizing whitespace, truncating to model token limits, splitting long documents into chunks, and deduplicating input text. The Together Python SDK does not provide built-in preprocessing utilities -- this responsibility is left entirely to the user.

Preprocessing ensures that the text inputs passed to Embeddings.create() or Rerank.create() are clean, well-formed, and appropriately sized for the target model. Poor input quality (e.g., excessive whitespace, HTML artifacts, text exceeding token limits) degrades embedding quality and reranking accuracy.
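A minimal sketch of this kind of user-side preprocessing is shown below. The clean_text helper is user code rather than an SDK feature, and the embedding model name is only an example; substitute whichever model your application targets.

```python
import re
from html import unescape

from together import Together

def clean_text(raw: str) -> str:
    """Strip HTML tags and collapse whitespace (user-defined; not part of the SDK)."""
    text = re.sub(r"<[^>]+>", " ", unescape(raw))  # drop HTML tags
    return re.sub(r"\s+", " ", text).strip()       # normalize whitespace

client = Together()  # reads TOGETHER_API_KEY from the environment

raw_docs = ["<p>Together&nbsp;AI   offers   embedding models.</p>"]
cleaned = [clean_text(doc) for doc in raw_docs]

# Example embedding model name; any Together embedding model works here.
response = client.embeddings.create(
    model="togethercomputer/m2-bert-80M-8k-retrieval",
    input=cleaned,
)
print(len(response.data[0].embedding))
```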

Usage

Use text preprocessing before calling client.embeddings.create() or client.rerank.create() to ensure clean, appropriately sized text inputs. Common scenarios include:

  • Web scraping pipelines -- Strip HTML tags and boilerplate before embedding page content
  • Document ingestion -- Split long documents into chunks that fit within model token limits (see the chunking sketch after this list)
  • RAG pipelines -- Normalize and deduplicate retrieved passages before reranking
  • Multi-source aggregation -- Standardize text from different sources (PDFs, emails, databases) into a consistent format
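For the document-ingestion case, the sketch below splits a long document into overlapping word windows before embedding. The chunk_by_words helper, the window sizes, and the model name are illustrative assumptions; actual limits are model-specific and a tokenizer-aware splitter gives tighter control.

```python
from together import Together

def chunk_by_words(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Naive word-window chunking with overlap between consecutive chunks."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

client = Together()

with open("report.txt", encoding="utf-8") as f:
    long_document = f.read()

chunks = chunk_by_words(long_document)

# Example model; choose a chunk size that fits the target model's token limit.
response = client.embeddings.create(
    model="togethercomputer/m2-bert-80M-8k-retrieval",
    input=chunks,
)
vectors = [item.embedding for item in response.data]
```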

Theoretical Basis

Embedding quality depends directly on input text quality. The key preprocessing considerations are:

  • Tokenization-aware chunking -- Models have maximum token limits (e.g., 512 or 8192 tokens depending on the model). Text exceeding these limits is silently truncated by the API, potentially losing important content. Splitting text into chunks that respect these limits ensures complete semantic coverage.
  • Normalization -- Consistent casing, whitespace normalization, and Unicode standardization reduce spurious variation in embedding space. Two semantically identical texts with different formatting should produce similar embeddings.
  • Noise removal -- HTML tags, markdown formatting, headers/footers, and boilerplate text add noise that can shift embeddings away from the core semantic content. Removing these artifacts improves embedding relevance.
  • Deduplication -- Duplicate or near-duplicate texts waste API calls and can skew retrieval results. Deduplication before embedding saves costs and improves result diversity.
  • Reranking considerations -- For reranking, it is critical to preserve semantic content while stripping formatting artifacts. Unlike embeddings, where chunking is common, reranked documents should generally remain coherent units, since the cross-encoder needs to assess each document's overall relevance to the query (see the sketch after this list).
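The sketch below illustrates the normalization and deduplication points above in a reranking context. The normalize and deduplicate helpers, the sample passages, and the reranker model name are illustrative assumptions rather than SDK features.

```python
import re
import unicodedata

from together import Together

def normalize(text: str) -> str:
    """NFKC Unicode normalization, lowercasing, and whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(passages: list[str]) -> list[str]:
    """Drop duplicates detected after normalization, preserving order and original text."""
    seen, unique = set(), []
    for passage in passages:
        key = normalize(passage)
        if key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique

client = Together()

retrieved_passages = [
    "Embedding  quality depends on INPUT text quality.",
    "Embedding quality depends on input text quality.",   # duplicate after normalization
    "Rerankers score whole documents against the query.",
]
candidates = deduplicate(retrieved_passages)

# Example reranker model; documents are kept whole rather than chunked.
results = client.rerank.create(
    model="Salesforce/Llama-Rank-V1",
    query="How does input quality affect embeddings?",
    documents=candidates,
    top_n=2,
)
for result in results.results:
    print(result.index, result.relevance_score)
```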

Metadata

Property | Value
Principle | Text Preprocessing
Domain | NLP, Information_Retrieval, RAG
Workflow | Embeddings_And_Reranking
Related Concepts | Tokenization, Text Normalization, Document Chunking, Deduplication
Implementation | Implementation:Togethercomputer_Together_python_Text_Preprocessing_Pattern

Knowledge Sources

2026-02-15 16:00 GMT
