# Workflow: HKUDS AI Trader Data Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Market_Data, ETL |
| Last Updated | 2026-02-09 14:00 GMT |
## Overview
End-to-end data acquisition and preparation pipeline that fetches market price data from external APIs and transforms it into the unified JSONL format consumed by AI-Trader agents across three market domains (US stocks, Chinese A-shares, and cryptocurrency).
## Description
This workflow covers the complete data pipeline from raw API responses to agent-consumable JSONL files. It supports three distinct market domains, each with its own data source and transformation logic: US stocks use Alpha Vantage for NASDAQ-100 hourly/daily data, Chinese A-shares use Alpha Vantage or Tushare for SSE-50 constituents, and cryptocurrency uses Alpha Vantage for ten major coins, from which the CD5 composite index (BTC, ETH, XRP, SOL, ADA) is synthesized. The pipeline normalizes field names, applies anti-look-ahead-bias masking, and produces a single `merged.jsonl` per market domain.
## Usage
Execute this workflow when you need to refresh or initialize market data before running a trading simulation. This is a prerequisite for any agent run. Choose the appropriate market variant based on your target: US stocks (NASDAQ-100), Chinese A-shares (SSE-50), or cryptocurrency (CD5 index).
## Execution Steps
### Step 1: Select Market Domain
Determine which market's data pipeline to execute. Each market has its own directory structure, fetch scripts, and merge scripts. US stock data lives in the root `data` directory, A-share data in `data/A_stock`, and crypto data in `data/crypto`. The choice determines which fetch and merge scripts to run.
Key considerations:
- US stocks: ~102 symbols (NASDAQ-100 + QQQ index)
- A-shares: ~50 symbols (SSE-50 constituents) with two data source options (Alpha Vantage or Tushare)
- Crypto: 10 coins (BTC, ETH, XRP, SOL, ADA, AVAX, DOT, LINK, LTC, SUI) plus a composite CD5 index
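The domain-to-directory mapping above can be captured in a small lookup table. This is an illustrative sketch: `select_market` and the per-market symbol counts are taken from this document, but the helper itself is hypothetical, not a confirmed repository module.

```python
from pathlib import Path

# Hypothetical lookup table for the three market domains described above.
MARKETS = {
    "us": {
        "data_dir": Path("data"),       # root data directory
        "symbols": 102,                 # NASDAQ-100 constituents + QQQ
    },
    "cn": {
        "data_dir": Path("data/A_stock"),
        "symbols": 50,                  # SSE-50 constituents
    },
    "crypto": {
        "data_dir": Path("data/crypto"),
        "symbols": 10,                  # 10 coins + synthesized CD5 index
    },
}

def select_market(name: str) -> dict:
    """Return the pipeline configuration for a market domain."""
    if name not in MARKETS:
        raise ValueError(f"unknown market {name!r}; expected one of {sorted(MARKETS)}")
    return MARKETS[name]
```

Downstream steps would read `data_dir` from this table to locate the fetch output and merge input for the chosen market.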
### Step 2: Fetch Raw Price Data
Call the external market data API to download OHLCV (Open, High, Low, Close, Volume) time series for each symbol. Each symbol's response is saved as an individual JSON file named with the pattern `daily_prices_{SYMBOL}.json`. The fetcher iterates over the predefined symbol list, handles API errors and rate limits, and writes raw responses to disk.
Key considerations:
- Alpha Vantage has rate limits (typically 5 calls/minute on free tier); the fetcher must handle throttling
- Tushare (A-shares alternative) requires a separate API token and uses a different response format
- Crypto fetcher includes synthesis of the CD5 composite index from individual coin prices
- Hourly data (intraday) uses a separate fetch script from daily data
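The daily fetch loop can be sketched as follows. `fetch_all` and its `delay` parameter are illustrative names, not the repository's actual script; the response layout is assumed from Alpha Vantage's public `TIME_SERIES_DAILY` endpoint, and the fetch callable is injectable so the loop can be exercised without network access.

```python
import json
import time
from pathlib import Path
from urllib.request import urlopen

# Alpha Vantage daily endpoint (free tier: ~5 calls/minute).
API_URL = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={sym}&apikey={key}"

def fetch_symbol(symbol: str, api_key: str) -> dict:
    """Download one symbol's daily OHLCV series from Alpha Vantage."""
    with urlopen(API_URL.format(sym=symbol, key=api_key), timeout=30) as resp:
        return json.load(resp)

def fetch_all(symbols, api_key, out_dir, fetch=fetch_symbol, delay=12.0):
    """Fetch each symbol and save it as daily_prices_{SYMBOL}.json.

    delay=12.0 seconds between calls stays under ~5 calls/minute.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, sym in enumerate(symbols):
        data = fetch(sym, api_key)
        # Alpha Vantage signals throttling/errors inside the JSON body.
        if "Note" in data or "Error Message" in data:
            raise RuntimeError(f"API error for {sym}: {data}")
        (out_dir / f"daily_prices_{sym}.json").write_text(json.dumps(data))
        if i < len(symbols) - 1:
            time.sleep(delay)
```

Hourly (intraday) fetching would follow the same shape against a different endpoint, and the Tushare variant would swap in its own client and response parsing.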
### Step 3: Merge into JSONL
Consolidate individual per-symbol JSON files into a single `merged.jsonl` file. During this step, price field names are standardized (`1. open` becomes `1. buy price`, `4. close` becomes `4. sell price`) to match the trading agent's buy/sell abstraction. The latest date's entry is modified to expose only the buy price, masking the sell price to prevent look-ahead bias in backtesting.
Key considerations:
- Only symbols matching the target index constituent list are included in the merge
- The merged file is the single source of truth for the MCP price lookup tool
- A-share and crypto mergers follow the same field renaming convention but with market-specific adjustments
- Running the merge script overwrites the previous merged.jsonl
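A minimal merge sketch under stated assumptions: the `merged.jsonl` record shape and the `merge_to_jsonl` helper are hypothetical, while the constituent filtering, the `1. open`/`4. close` renaming, and the latest-day masking come from this document.

```python
import json
from pathlib import Path

# Field renaming described above: open -> buy price, close -> sell price.
RENAME = {"1. open": "1. buy price", "4. close": "4. sell price"}

def merge_to_jsonl(raw_dir, out_path, constituents):
    """Merge per-symbol JSON files into one merged.jsonl (sketch)."""
    out_path = Path(out_path)
    with out_path.open("w") as out:  # overwrites any previous merged.jsonl
        for f in sorted(Path(raw_dir).glob("daily_prices_*.json")):
            symbol = f.stem.removeprefix("daily_prices_")
            if symbol not in constituents:
                continue  # only target-index constituents enter the merge
            raw = json.loads(f.read_text())
            series = raw.get("Time Series (Daily)", {})
            renamed = {
                date: {RENAME.get(k, k): v for k, v in bar.items()}
                for date, bar in series.items()
            }
            if renamed:
                latest = max(renamed)  # ISO dates sort lexicographically
                # Mask everything but the buy price on the latest day so the
                # agent cannot see that day's close (anti-look-ahead bias).
                renamed[latest] = {"1. buy price": renamed[latest]["1. buy price"]}
            record = {"symbol": symbol, "Time Series (Daily)": renamed}
            out.write(json.dumps(record) + "\n")
```

The A-share and crypto mergers would reuse this renaming convention with their market-specific adjustments layered on top.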
### Step 4: Validate Data Integrity
Verify that the merged JSONL file contains valid entries for all expected symbols and date ranges. Check that field renaming was applied correctly, that the latest date only contains buy prices, and that no future data leaks through the masking logic.
A runnable sketch of these checks (the record layout, with a `symbol` field and a `Time Series (Daily)` key per line, is an assumption; `start_date` is the earliest date the series is expected to cover):

```python
import json

def validate_merged(path, expected_symbols, start_date):
    seen = set()
    with open(path) as fh:
        for line in fh:
            rec = json.loads(line)
            series = rec.get("Time Series (Daily)")
            assert series, f"{rec.get('symbol')}: time series key missing"
            assert min(series) <= start_date, f"{rec['symbol']}: series starts after {start_date}"
            latest = max(series)
            for date, bar in series.items():
                assert "1. open" not in bar and "4. close" not in bar, "field rename not applied"
                if date == latest:
                    assert set(bar) == {"1. buy price"}, f"{rec['symbol']} {date}: sell price leaked"
            seen.add(rec["symbol"])
    missing = set(expected_symbols) - seen
    assert not missing, f"symbols absent from merge: {missing}"
```