Implementation:Neuml Txtai Baseball Example

Knowledge Sources	txtai, txtai Documentation
Domains	Example, Semantic_Search
Last Updated	2026-02-09 00:00 GMT

Overview

Pattern Doc: Streamlit application demonstrating txtai Embeddings for numeric vector search on baseball statistics.

Description

This example application demonstrates how to use txtai's Embeddings class for numeric vector search -- searching over structured statistical data rather than natural language text. Instead of encoding text with a transformer model, it constructs numeric vectors directly from baseball statistics (batting averages, home runs, ERA, etc.) and uses txtai's ANN index to find players with similar statistical profiles.

The application loads historical baseball data from the Lahman Baseball Database (hosted on Hugging Face), computes derived statistics, builds per-player-season vectors, and indexes them with txtai. Users interact via a Streamlit web interface to find statistically similar players across baseball history.

This is a Pattern Doc that illustrates a non-obvious use of txtai: using the embeddings index as a general-purpose numeric similarity search engine by providing a custom transform function that bypasses the default text encoding pipeline.

Key Pattern: Custom Transform for Numeric Vectors

The central technique is passing a custom transform function to the Embeddings constructor:

embeddings = Embeddings({"transform": Stats.transform})
embeddings.index((uid, vectors[uid], None) for uid in vectors)

This bypasses txtai's default text-to-vector encoding and allows indexing pre-computed numeric vectors directly. The Stats.transform method converts a stats row dictionary into a NumPy array, or passes through an existing NumPy array unchanged:

def transform(self, row):
    if isinstance(row, np.ndarray):
        return row
    return np.array([0.0 if not row[x] or np.isnan(row[x]) else row[x] for x in self.columns])
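
To see the missing-value handling in isolation, the sketch below applies the same expression to a plain dictionary. The standalone function `to_vector` and the three-column list are illustrative assumptions, not part of the example's code.

```python
import numpy as np

# Hypothetical column list; the real example uses 30 or 31 columns
columns = ["AB", "H", "HR"]

def to_vector(row, columns):
    """Standalone version of the transform logic shown above."""
    if isinstance(row, np.ndarray):
        return row
    # Missing (None), zero, and NaN values all map to 0.0
    return np.array([0.0 if not row[x] or np.isnan(row[x]) else row[x] for x in columns])

row = {"AB": 600, "H": float("nan"), "HR": None}
vector = to_vector(row, columns)
print(vector)

# An existing array is passed through as the same object
passthrough = to_vector(vector, columns)
```

Because `not row[x]` is checked first, `None` never reaches `np.isnan`, which would otherwise raise a `TypeError`.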

Code Reference

Source Location

Repository: txtai
File: examples/baseball.py
Lines: L1-712

Class Hierarchy

class Stats:            # Base class - data loading, indexing, search
class Batting(Stats):   # Batting statistics (30 columns, 350+ PA filter)
class Pitching(Stats):  # Pitching statistics (31 columns, 20+ G filter)
class Application:      # Streamlit UI with Player and Search tabs

Import

# This is an example application - run directly:
#   pip install txtai streamlit
#   streamlit run examples/baseball.py

Architecture

Stats (Base Class)

The Stats class provides the framework for loading, indexing, and searching baseball statistics. It defines four abstract methods that subclasses must implement:

Method	Returns	Purpose
`loadcolumns()`	list of str	Column names used to build vectors.
`load()`	DataFrame	Raw statistics data merged with player info.
`metric()`	str	Primary metric column name for ranking.
`vector(row)`	np.ndarray	Compute derived stats and build a vector from an input row.

The constructor calls these methods in sequence: load columns, load data, load player names (with weighted scoring for UI selection), and build the embeddings index.

Stats.__init__ Flow

def __init__(self):
    self.columns = self.loadcolumns()   # Define vector dimensions
    self.stats = self.load()             # Load raw data from CSV
    self.names = self.loadnames()        # Build name -> playerID mapping with weights
    self.vectors, self.data, self.maxyear, self.embeddings = self.index()  # Build search index

Stats.loadnames()

Builds a name-to-player mapping with weighted scores for the random selection UI. Players are sorted by the primary metric in descending order. The top 5% of players receive a squared season count as their weight (increasing their likelihood of being selected as the default), while others use a linear season count.

exponent = 2 if ((len(rows) - x) / len(rows)) >= 0.95 else 1
score = math.pow(len(self.stats[self.stats["playerID"] == row["playerID"]]), exponent)

Stats.index()

Builds the txtai embeddings index using pre-computed numeric vectors:

vectors = {f'{row["yearID"]}{row["playerID"]}': self.transform(row) for _, row in self.stats.iterrows()}
data = {f'{row["yearID"]}{row["playerID"]}': dict(row) for _, row in self.stats.iterrows()}
maxyear = max(row["yearID"] for _, row in self.stats.iterrows())

embeddings = Embeddings({"transform": Stats.transform})
embeddings.index((uid, vectors[uid], None) for uid in vectors)

The unique ID for each record is the concatenation of year and playerID (e.g., "2023troutmi01").

Stats.search()

Runs an embeddings search with deduplication (one result per player) and optional season window filtering:

def search(self, name=None, year=None, window=None, row=None, limit=10):

The method supports two search modes:

Player-season mode: Look up a player's vector by name and year, search for similar vectors.
Stats mode: Accept a row of statistics directly (from the Search tab), compute a vector, and search.

Results are enriched with Baseball Reference links and deduplicated by playerID. The window parameter restricts results to the most recent N seasons.
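
The deduplication and window filtering can be sketched as a post-processing step over raw `(uid, score)` results. This is a minimal illustration under assumptions: the helper name `dedupe`, the toy `data` dictionary and scores, and the exact window comparison are hypothetical and may differ from the code in examples/baseball.py.

```python
def dedupe(results, data, maxyear, window=None, limit=10):
    """Keep the best-scoring season per player, optionally within a season window."""
    seen, output = set(), []
    for uid, score in results:  # results are assumed sorted by descending score
        row = data[uid]
        # Assumed window rule: keep only the last `window` seasons up to maxyear
        if window and row["yearID"] <= maxyear - window:
            continue
        # Skip players that already have a higher-scoring season in the output
        if row["playerID"] in seen:
            continue
        seen.add(row["playerID"])
        output.append((uid, score))
        if len(output) == limit:
            break
    return output

# Toy data keyed by year + playerID, mirroring the uid format used for indexing
data = {
    "2023troutmi01": {"yearID": 2023, "playerID": "troutmi01"},
    "2019troutmi01": {"yearID": 2019, "playerID": "troutmi01"},
    "2022judgeaa01": {"yearID": 2022, "playerID": "judgeaa01"},
    "1998griffke02": {"yearID": 1998, "playerID": "griffke02"},
}
# Toy scores, not real search output
results = [
    ("2023troutmi01", 0.99),
    ("2019troutmi01", 0.98),
    ("2022judgeaa01", 0.97),
    ("1998griffke02", 0.95),
]

all_time = dedupe(results, data, maxyear=2023)
recent = dedupe(results, data, maxyear=2023, window=5)
```

Since duplicates are dropped after the search, it helps to request more raw results than `limit` so that enough distinct players remain.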

Batting (Stats Subclass)

Loads batting data with a minimum 350 plate appearances filter. Computes derived columns:

Column	Formula
age	yearID - birthYear
POS	Primary fielding position (numeric: P=1, C=2, 1B=3, 2B=4, 3B=5, SS=6, OF=7)
AVG	H / AB
OBP	(H + BB) / (AB + BB)
1B	H - 2B - 3B - HR
TB	1B + 2*2B + 3*3B + 4*HR
SLG	TB / AB
OPS	OBP + SLG
OPS+	100 + (OPS - mean(OPS)) * 100

Uses 30 vector dimensions including physical attributes (birthMonth, age, height, weight), counting stats, and rate stats.
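
As a worked example of the batting formulas, the sketch below computes the derived columns for a single made-up season in plain Python. The function name `batting_rates`, the sample counts, and the `mean_ops` value of 0.750 are assumptions for illustration; in the example application the mean is taken over the loaded dataset.

```python
def batting_rates(H, AB, BB, doubles, triples, HR, mean_ops):
    """Compute the derived batting columns from raw counting stats."""
    avg = H / AB
    obp = (H + BB) / (AB + BB)
    singles = H - doubles - triples - HR
    # Total bases: each hit type weighted by the number of bases it earns
    tb = singles + 2 * doubles + 3 * triples + 4 * HR
    slg = tb / AB
    ops = obp + slg
    # OPS+ as defined in the table: 100 is the dataset average
    ops_plus = 100 + (ops - mean_ops) * 100
    return {"AVG": avg, "OBP": obp, "1B": singles, "TB": tb, "SLG": slg, "OPS": ops, "OPS+": ops_plus}

# Hypothetical season line
season = batting_rates(H=180, AB=600, BB=60, doubles=35, triples=2, HR=30, mean_ops=0.750)

for name, value in season.items():
    print(f"{name}: {value:.3f}")
```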

Pitching (Stats Subclass)

Loads pitching data with a minimum 20 game appearances filter. Computes derived columns:

Column	Formula
age	yearID - birthYear
WHIP	(BB + H) / (IPouts / 3)
WADJ	(W + SV) / (ERA + WHIP)

Uses 31 vector dimensions. The primary metric is WADJ (Wins Adjusted), a custom metric that balances wins, saves, ERA, and WHIP.
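
The two pitching formulas can be checked by hand with a short sketch. The function name `pitching_metrics` and both stat lines are made up for illustration.

```python
def pitching_metrics(W, SV, ERA, BB, H, IPouts):
    """Compute WHIP and WADJ using the formulas from the table above."""
    # IPouts counts outs recorded, so dividing by 3 gives innings pitched
    whip = (BB + H) / (IPouts / 3)
    wadj = (W + SV) / (ERA + whip)
    return whip, wadj

# Hypothetical starter: 200 innings, 15 wins
starter = pitching_metrics(W=15, SV=0, ERA=3.00, BB=50, H=180, IPouts=600)

# Hypothetical closer: 65 innings, 40 saves
closer = pitching_metrics(W=3, SV=40, ERA=2.00, BB=20, H=45, IPouts=195)

print(f"starter WHIP={starter[0]:.2f} WADJ={starter[1]:.2f}")
print(f"closer  WHIP={closer[0]:.2f} WADJ={closer[1]:.2f}")
```

In this toy comparison the closer ends up with the higher WADJ, because saves count the same as wins in the numerator while a low ERA and WHIP shrink the denominator.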

Application Class

The Application class creates and manages the Streamlit web interface with two tabs:

Player tab: Select a player name (weighted random default), year (slider), category (Batting/Pitching), and optional season window. Displays a metric trend chart and a results table of similar player-seasons.

Search tab: Enter raw statistics via a data editor form. Submitting the form computes a vector and searches for matching player-seasons.

The application instance is cached with @st.cache_resource, so the data is loaded and indexed once per Streamlit server process and then reused across reruns and sessions.

Data Flow

CSV Data (Hugging Face Hub)
    |
    v
pandas DataFrame (merge players + stats, filter, compute derived columns)
    |
    v
Numeric Vectors (np.array of stat columns per player-season)
    |
    v
txtai Embeddings Index (ANN search over numeric vectors)
    |
    v
Streamlit UI (player selection -> search -> results table)

Usage Examples

Running the Application

pip install txtai streamlit
streamlit run examples/baseball.py

Pattern: Numeric Vector Search with txtai

import numpy as np
from txtai import Embeddings

# Custom transform: pass through numeric arrays unchanged
def transform(row):
    if isinstance(row, np.ndarray):
        return row
    return np.array(list(row.values()), dtype=np.float32)

# Create embeddings with custom transform (no text model needed)
embeddings = Embeddings({"transform": transform})

# Index numeric data
data = {
    "player_a_2023": np.array([150, 600, 95, 180, 35, 2, 30, 100, 0.300, 0.380, 0.550], dtype=np.float32),
    "player_b_2023": np.array([140, 550, 80, 150, 28, 1, 25, 85, 0.273, 0.350, 0.500], dtype=np.float32),
    "player_c_2023": np.array([160, 620, 110, 200, 40, 5, 45, 120, 0.323, 0.400, 0.620], dtype=np.float32),
}

embeddings.index((uid, vec, None) for uid, vec in data.items())

# Search for similar stat profiles
query_vector = np.array([155, 590, 100, 185, 33, 3, 32, 95, 0.314, 0.390, 0.560], dtype=np.float32)
results = embeddings.search(query_vector, limit=3)

for uid, score in results:
    print(f"Player: {uid}, Similarity: {score:.4f}")

Pattern: Weighted Random Selection

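import math
import random

import pandas as pd

# Toy stand-in for the stats table: one row per player-season (hypothetical data)
stats = pd.DataFrame({
    "playerID": ["a01", "a01", "a01", "b01", "b01", "c01"],
    "nameFirst": ["Al", "Al", "Al", "Bo", "Bo", "Cy"],
    "nameLast": ["Ace", "Ace", "Ace", "Bat", "Bat", "Cap"],
    "metric": [150, 140, 130, 110, 100, 90],
})

# One row per player, best metric first; reset_index makes x a 0-based rank
sorted_players = (
    stats.sort_values("metric", ascending=False)
    .drop_duplicates("playerID")
    .reset_index(drop=True)
)

# Weighted player selection (adapted from the example)
# Top 5% of players by metric get a squared season count as weight
names = {}
for x, row in sorted_players.iterrows():
    key = f"{row['nameFirst']} {row['nameLast']}"
    exponent = 2 if ((len(sorted_players) - x) / len(sorted_players)) >= 0.95 else 1
    num_seasons = len(stats[stats["playerID"] == row["playerID"]])
    score = math.pow(num_seasons, exponent)
    names[key] = (row["playerID"], score)

# Weighted random choice for the UI default
default = random.choices(list(names.keys()), weights=[names[x][1] for x in names])[0]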
