Principle:Explodinggradients Ragas Persona Generation
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| explodinggradients/ragas | LLM Evaluation, Test Data Generation, Persona Modeling | 2026-02-10 |
Overview
Description
Persona Generation is the principle of creating diverse user personas from knowledge graph content to enable varied perspectives in test question synthesis. Rather than generating test queries from a single viewpoint, Ragas creates multiple personas -- each with a unique name and role description -- that represent different types of users who might interact with the content. This diversity ensures the generated test set covers a broad range of question types, styles, and levels of domain expertise.
Usage
Persona generation is an intermediate step in the Ragas test data generation pipeline. It occurs after the knowledge graph has been constructed and enriched (with summaries and embeddings) and before query synthesis begins. The pipeline:
- Filters knowledge graph nodes to find those with summary embeddings (typically DOCUMENT or CHUNK nodes that have been enriched).
- Computes cosine similarity between summary embeddings to group related nodes.
- Selects a representative summary from each group (the longest summary in the cluster).
- Sends each representative summary to an LLM with a persona generation prompt to produce a Persona(name, role_description).
- Passes the resulting personas to query synthesizers, which use them to diversify the perspective of generated questions.
Theoretical Basis
Embedding-Based Clustering for Thematic Grouping: The persona generation process begins by computing pairwise cosine similarity between the summary embeddings of filtered nodes. Nodes whose similarity exceeds a threshold (0.75) are grouped together. This clustering ensures that each persona is derived from a coherent thematic cluster rather than from random or overlapping content. By using embedding similarity rather than keyword matching, the system captures semantic relatedness even when documents use different terminology.
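The grouping step can be illustrated with a small standard-library sketch. It assumes a greedy strategy in which each unvisited node seeds a group and absorbs every later node whose similarity to the seed exceeds the threshold; the description above does not spell out the exact grouping strategy, so this is one plausible reading, and the function names and toy 2-D vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(embeddings, threshold=0.75):
    """Greedy grouping (an assumption): each unvisited index seeds a group and
    pulls in later unvisited indices whose similarity to the seed exceeds threshold."""
    groups = []
    visited = set()
    for i in range(len(embeddings)):
        if i in visited:
            continue
        group = [i]
        visited.add(i)
        for j in range(i + 1, len(embeddings)):
            if j in visited:
                continue
            if cosine_similarity(embeddings[i], embeddings[j]) > threshold:
                group.append(j)
                visited.add(j)
        groups.append(group)
    return groups


# Toy summary embeddings: two roughly "horizontal" and two roughly "vertical" vectors.
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.95],
]
groups = group_by_similarity(embeddings, threshold=0.75)
print(groups)
```

With these toy vectors the first two and the last two indices land in separate groups, because similarity within each pair is close to 1 while similarity across pairs is close to 0.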
Representative Summary Selection: Within each cluster, the longest summary is selected as the representative. The rationale is that longer summaries typically contain more specific and contextual information, giving the LLM richer material from which to infer a plausible user persona. This heuristic avoids the need for a separate LLM call to select representatives.
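As a minimal sketch of this heuristic, the snippet below picks the longest summary per group by character length. The helper name, the sample summaries, and the group indices are illustrative assumptions, not taken from the Ragas source.

```python
def pick_representatives(summaries, groups):
    """Return the longest summary (by character count) from each group of indices.
    On a tie, max() keeps the first candidate it encounters."""
    return [max((summaries[i] for i in group), key=len) for group in groups]


# Hypothetical summaries and groups, for illustration only.
summaries = [
    "SEO basics.",
    "A guide to SEO, paid search, and email campaigns for small teams.",
    "Kubernetes pods.",
    "How Kubernetes schedules pods across nodes.",
]
groups = [[0, 1], [2, 3]]

representatives = pick_representatives(summaries, groups)
for text in representatives:
    print(text)
```

Because the selection is a plain length comparison, it costs no additional LLM calls, which is the trade-off the heuristic is making.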
LLM-Driven Persona Synthesis: Each representative summary is fed to an LLM via a structured prompt (PersonaGenerationPrompt) that instructs the model to generate a single persona with a name and a concise role description. The prompt includes a few-shot example (a digital marketing guide producing a "Digital Marketing Specialist" persona) to guide the LLM's output format and style. The temperature is set to 1.0 to encourage diversity across personas.
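The shape of this step can be sketched as follows. This is not the actual PersonaGenerationPrompt: the prompt wording, the few-shot text, the JSON output convention, and the `llm` callable are all assumptions made for illustration. Only the Persona fields (name, role_description), the "Digital Marketing Specialist" example, and the temperature of 1.0 come from the description above.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Persona:
    name: str
    role_description: str


# Illustrative few-shot example; the real prompt's wording is not reproduced here.
FEW_SHOT_EXAMPLE = (
    "Summary: A guide to digital marketing strategies for engaging online audiences.\n"
    'Persona: {"name": "Digital Marketing Specialist", '
    '"role_description": "Plans and optimizes online campaigns."}'
)


def build_prompt(summary: str) -> str:
    """Assemble a hypothetical persona-generation prompt around one summary."""
    return (
        "Generate a single persona (a name and a concise role description) for a "
        "user who would interact with the content summarized below. "
        "Answer as a JSON object.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n\n"
        f"Summary: {summary}\nPersona:"
    )


def generate_persona(
    summary: str,
    llm: Callable[[str, float], str],
    temperature: float = 1.0,
) -> Persona:
    """Call a user-supplied LLM function (hypothetical interface) and parse its JSON reply."""
    raw = llm(build_prompt(summary), temperature)
    data = json.loads(raw)
    return Persona(name=data["name"], role_description=data["role_description"])


def fake_llm(prompt: str, temperature: float) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return json.dumps(
        {"name": "Platform Engineer", "role_description": "Operates Kubernetes clusters."}
    )


persona = generate_persona("How Kubernetes schedules pods across nodes.", fake_llm)
print(persona)
```

Swapping `fake_llm` for a real client only requires a callable that accepts a prompt and a temperature and returns the model's text.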
Diversity Through Multiple Perspectives: The resulting personas represent different user archetypes who would interact with the corpus. When these personas are later assigned to query synthesizers, they influence the framing, vocabulary, and complexity of generated questions. For example, a "Data Scientist" persona might generate technical, implementation-focused questions, while a "Product Manager" persona might generate strategic, outcome-focused questions about the same content.
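To make the effect concrete, the sketch below injects a persona into a query-synthesis prompt. The template text and the helper name are hypothetical and do not reflect how Ragas synthesizers actually build their prompts; the point is only that the same content yields differently framed requests per persona.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    role_description: str


def build_query_prompt(persona: Persona, context: str) -> str:
    """Hypothetical template: ask for one question from the persona's point of view."""
    return (
        f"You are a {persona.name}. {persona.role_description}\n"
        "Write one question this person would ask about the following content.\n\n"
        f"Content: {context}"
    )


personas = [
    Persona("Data Scientist", "Builds models and cares about implementation details."),
    Persona("Product Manager", "Owns the roadmap and cares about outcomes."),
]
context = "The recommendation service re-ranks results using a gradient-boosted model."

prompts = [build_query_prompt(p, context) for p in personas]
for prompt in prompts:
    print(prompt, end="\n\n")
```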
Bounded Generation: The num_personas parameter caps the number of personas generated, balancing diversity against computational cost. If there are fewer thematic clusters than the requested number, the system oversamples from existing clusters to meet the target.
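A minimal sketch of this bounding logic, under two assumptions that the description does not confirm: surplus clusters are dropped by simple truncation, and oversampling draws uniformly with replacement. The function name and the `seed` parameter are hypothetical.

```python
import random


def fit_to_target(representatives, num_personas, seed=None):
    """Return exactly num_personas summaries when any representatives exist.

    - More representatives than requested: keep the first num_personas (assumed policy).
    - Fewer than requested: pad by sampling with replacement from the existing ones.
    """
    if not representatives:
        return []
    if len(representatives) >= num_personas:
        return representatives[:num_personas]
    rng = random.Random(seed)
    extra = rng.choices(representatives, k=num_personas - len(representatives))
    return representatives + extra


padded = fit_to_target(["marketing summary", "kubernetes summary"], 5, seed=0)
capped = fit_to_target(["a", "b", "c"], 2)
print(len(padded), capped)
```

Note that oversampling reuses the same summaries, so diversity among the padded personas then depends on the sampling temperature of the LLM rather than on distinct source content.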
Related Pages
- Implementation:Explodinggradients_Ragas_Generate_Personas_From_KG
- Principle:Explodinggradients_Ragas_Knowledge_Graph_Enrichment -- provides the summary embeddings used for clustering
- Principle:Explodinggradients_Ragas_Test_Query_Synthesis -- consumes personas to diversify generated queries