
Principle:Langfuse Export Storage Upload

From Leeroopedia
Domains Batch Export, Blob Storage, Cloud Infrastructure
Last Updated 2026-02-14 00:00 GMT

Overview

Export Storage Upload is the principle of streaming export data directly into cloud blob storage using multipart upload, then generating a time-limited signed download URL, so that the worker never needs to buffer the complete file in memory or on local disk.

Description

Batch exports can produce files ranging from kilobytes to gigabytes, depending on the volume of data and the chosen format. Buffering such files in memory or writing them to temporary local storage introduces several risks:

  • Memory exhaustion: A multi-gigabyte CSV file would exceed the worker's heap allocation.
  • Disk space constraints: Containerized worker environments often have limited ephemeral storage.
  • Upload latency: Writing the entire file first and only then uploading it roughly doubles the end-to-end processing time.
  • Failure recovery: If the upload fails after the file is fully generated, the entire generation must be repeated.

Export Storage Upload eliminates these risks by piping the format-transformed stream directly into a multipart upload operation. The stream is consumed in fixed-size parts (configurable via partSize), and each part is uploaded to the storage backend as soon as it is filled. Multiple parts can be uploaded concurrently (controlled by queueSize), overlapping network I/O with data generation.
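The part-buffering and bounded concurrency described above can be sketched as pure logic. In the real system this work is delegated to @aws-sdk/lib-storage's Upload class; `toParts`, `multipartUpload`, and `uploadPart` below are illustrative names, not Langfuse identifiers.

```typescript
type UploadPartFn = (partNumber: number, part: Uint8Array) => Promise<void>;

// Re-chunk an arbitrary byte stream into fixed-size parts; the last
// part may be shorter than partSize.
async function* toParts(
  chunks: AsyncIterable<Uint8Array>,
  partSize: number,
): AsyncGenerator<Uint8Array> {
  let buffer = new Uint8Array(0);
  for await (const chunk of chunks) {
    const merged = new Uint8Array(buffer.length + chunk.length);
    merged.set(buffer);
    merged.set(chunk, buffer.length);
    buffer = merged;
    while (buffer.length >= partSize) {
      yield buffer.slice(0, partSize);
      buffer = buffer.slice(partSize);
    }
  }
  if (buffer.length > 0) yield buffer; // final short part
}

// Consume the stream part by part, keeping at most queueSize part
// uploads in flight so network I/O overlaps with data generation.
async function multipartUpload(
  chunks: AsyncIterable<Uint8Array>,
  uploadPart: UploadPartFn,
  partSize = 5 * 1024 * 1024,
  queueSize = 4,
): Promise<number> {
  const inFlight = new Set<Promise<void>>();
  let partNumber = 0;
  for await (const part of toParts(chunks, partSize)) {
    partNumber += 1;
    const p = uploadPart(partNumber, part).finally(() => inFlight.delete(p));
    inFlight.add(p);
    if (inFlight.size >= queueSize) await Promise.race(inFlight);
  }
  await Promise.all(inFlight); // drain remaining uploads
  return partNumber;
}
```

Because the stream is consumed as it is produced, peak memory stays near partSize × queueSize regardless of total file size.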

The principle encompasses three key aspects:

  1. Backend abstraction: A factory pattern selects the appropriate storage implementation based on configuration. Supported backends include S3-compatible object storage (including MinIO), Azure Blob Storage, and Google Cloud Storage. All backends expose the same interface: upload a stream and return a signed URL.
  2. Multipart upload: For S3, the @aws-sdk/lib-storage Upload class handles the complexity of initiating a multipart upload, uploading individual parts, and completing or aborting the upload. The default part size of 5 MB supports files up to approximately 50 GB (5 MB × 10,000 parts). For larger files, the part size can be increased (e.g., 100 MB for up to ~1 TB). Azure and GCS use their respective streaming upload APIs.
  3. Signed URL generation: After the upload completes, a time-limited download URL is generated using the storage backend's presigning capability. The expiration is controlled by BATCH_EXPORT_DOWNLOAD_LINK_EXPIRATION_HOURS. This URL is stored on the export record and shared with the user via email or the UI. When the URL expires, the UI displays the export as "expired."
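The size limits in point 2 follow from S3's 10,000-part cap on multipart uploads. A small worked sketch (helper names are illustrative; sizes use decimal megabytes):

```typescript
const MAX_PARTS = 10_000; // S3 multipart upload part-count limit
const MB = 1_000_000;

// Largest file a given part size can accommodate.
function maxFileBytes(partSizeBytes: number): number {
  return partSizeBytes * MAX_PARTS;
}

// Smallest whole-MB part size that fits targetBytes within 10,000 parts.
function minPartSizeMB(targetBytes: number): number {
  return Math.ceil(targetBytes / MAX_PARTS / MB);
}
```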

The approach also supports server-side encryption (SSE) for S3 backends, including AWS KMS encryption, ensuring that exported data is encrypted at rest.
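As an illustration of the encryption option, the S3 request fields involved look like the fragment below. ServerSideEncryption and SSEKMSKeyId are real S3 API parameters; the environment variable name is an assumption for this sketch.

```typescript
// Attach these to the upload parameters to encrypt exports at rest.
const encryptionParams = {
  ServerSideEncryption: "aws:kms" as const,
  SSEKMSKeyId: process.env.S3_SSE_KMS_KEY_ID, // omit to use the bucket's default KMS key
};
```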

Usage

Use Export Storage Upload whenever:

  • The output data is generated as a stream and the final size is unknown in advance.
  • The file must be stored in cloud blob storage for later retrieval by end users.
  • You need to support multiple cloud storage providers with a unified API.
  • Time-limited access control (signed URLs) is required for security.

Theoretical Basis

The upload follows a streaming multipart upload pattern:

FUNCTION uploadExportFile(dataStream, config):

  -- Select storage backend based on environment configuration
  backend = StorageFactory.getInstance(config)
    -- Returns S3StorageService, AzureBlobStorageService, or GoogleCloudStorageService

  -- Construct file path
  fileName = "{prefix}{timestamp}-lf-{tableName}-export-{projectId}.{extension}"
  expiresInSeconds = BATCH_EXPORT_DOWNLOAD_LINK_EXPIRATION_HOURS * 3600

  -- Execute streaming upload with signed URL generation
  result = backend.uploadWithSignedUrl({
    fileName,
    fileType,        -- e.g., "text/csv; charset=utf-8"
    data: dataStream, -- the piped format-transformed stream
    expiresInSeconds,
    partSize,        -- bytes per multipart part (default: 5 MB for S3)
    queueSize,       -- concurrent part uploads (default: 4)
  })

  -- Internal implementation (S3 example):
  --   1. Create multipart upload: POST /{bucket}/{key}?uploads
  --   2. For each partSize chunk of dataStream:
  --        Upload part: PUT /{bucket}/{key}?partNumber=N&uploadId=X
  --        (up to queueSize concurrent uploads)
  --   3. Complete multipart upload: POST /{bucket}/{key}?uploadId=X
  --   4. Generate signed GET URL with expiration

  RETURN result.signedUrl
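The path and expiration construction from the pseudocode translates directly; the helper names below are illustrative, not Langfuse's actual identifiers.

```typescript
// Mirrors the pseudocode's template:
// {prefix}{timestamp}-lf-{tableName}-export-{projectId}.{extension}
function buildExportFileName(opts: {
  prefix: string;
  tableName: string;
  projectId: string;
  extension: string;
  now?: Date;
}): string {
  // ISO timestamp with characters awkward in object keys replaced
  const ts = (opts.now ?? new Date()).toISOString().replace(/[:.]/g, "-");
  return `${opts.prefix}${ts}-lf-${opts.tableName}-export-${opts.projectId}.${opts.extension}`;
}

// BATCH_EXPORT_DOWNLOAD_LINK_EXPIRATION_HOURS is configured in hours;
// the presigning API expects seconds.
function expirationSeconds(hours: number): number {
  return hours * 3600;
}
```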

The DNS failure handling deserves mention: if the storage service endpoint cannot be resolved (DNS EAI_AGAIN error), the system throws a ServiceUnavailableError rather than a generic error. This allows the calling code to implement retry logic appropriate for transient network issues.
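A minimal sketch of that classification, assuming a Node-style error object with a code property (the class and function names here are illustrative):

```typescript
// Retryable error type surfaced to callers for transient failures.
class ServiceUnavailableError extends Error {}

// Map a low-level storage error to a caller-facing one. A DNS EAI_AGAIN
// code signals a transient resolution failure, so it is wrapped as
// ServiceUnavailableError; everything else passes through unchanged.
function wrapStorageError(err: unknown): Error {
  const code =
    typeof err === "object" && err !== null
      ? (err as { code?: string }).code
      : undefined;
  if (code === "EAI_AGAIN") {
    return new ServiceUnavailableError(`storage endpoint unreachable: ${code}`);
  }
  return err instanceof Error ? err : new Error(String(err));
}
```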

The external endpoint pattern is important for self-hosted deployments: the S3 client may use an internal endpoint (e.g., http://minio:9000) for uploading, but the signed URL must use an externally accessible endpoint (e.g., https://storage.example.com). A separate S3 client instance is configured for URL signing when an external endpoint is provided.
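A configuration sketch of that two-client setup, using the real @aws-sdk/client-s3 options; the endpoint values and environment variable name are placeholders:

```typescript
import { S3Client } from "@aws-sdk/client-s3";

// Client used for the multipart upload itself, over the fast
// internal network path.
const uploadClient = new S3Client({
  region: process.env.S3_REGION,
  endpoint: "http://minio:9000", // internal endpoint
  forcePathStyle: true, // path-style addressing, as required by MinIO
});

// Separate client whose endpoint is externally reachable; presigning
// against it yields URLs that work outside the cluster.
const signingClient = new S3Client({
  region: process.env.S3_REGION,
  endpoint: "https://storage.example.com", // external endpoint
  forcePathStyle: true,
});
```

A second client is needed because the endpoint host is part of the SigV4 signature: rewriting the host of an already-signed URL after the fact would invalidate it.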

Related Pages

Implemented By
