
Workflow:Apache Dolphinscheduler Datasource Connection Management

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Database_Integration, Connection_Pooling
Last Updated 2026-02-10 10:00 GMT

Overview

End-to-end process for configuring, validating, pooling, and using database connections in Apache DolphinScheduler's multi-datasource environment.

Description

This workflow describes how DolphinScheduler manages database connections across its 28+ supported datasource types. The system provides a unified API for creating, validating, caching, and using connections regardless of the underlying database technology. It covers the lifecycle from user-provided connection parameters through validation, pooling with HikariCP, cached access via a Guava cache, and secure credential management. The architecture separates ad-hoc connections (for testing and one-time queries) from pooled connections (for production workloads).

Usage

Execute this workflow when you need to configure a new database connection in DolphinScheduler, troubleshoot connection issues, optimize connection pool settings, or understand how tasks interact with external databases during execution.

Execution Steps

Step 1: Configure Datasource Parameters

The user provides datasource configuration through the API, specifying the database type, host, port, database name, credentials, and any additional connection properties. The system maps this input to a BaseDataSourceParamDTO subclass specific to the chosen database type, applying default values where appropriate (e.g., default ports per database).

Key considerations:

  • Each database type has a dedicated ParamDTO with type-specific defaults
  • Additional connection properties are stored as key-value pairs in the datasource's other field
  • Kerberos authentication is supported for Hadoop-ecosystem databases via BaseHDFSDataSourceParamDTO
  • The API validates required fields before proceeding
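The parameter-DTO pattern above can be sketched in a few lines. This is a simplified stand-in modeled loosely on DolphinScheduler's BaseDataSourceParamDTO hierarchy; the field and method names here are illustrative, not the project's exact API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of a per-database ParamDTO with type-specific defaults.
abstract class BaseDataSourceParamDTO {
    String host;
    Integer port;
    String database;
    String userName;
    String password;
    // Additional connection properties, stored as key-value pairs.
    Map<String, String> other = new LinkedHashMap<>();

    // Each subclass supplies its database-specific default port.
    abstract int defaultPort();

    // Apply defaults where the caller left fields unset.
    void applyDefaults() {
        if (port == null) {
            port = defaultPort();
        }
    }
}

class MySQLDataSourceParamDTO extends BaseDataSourceParamDTO {
    @Override
    int defaultPort() {
        return 3306; // MySQL's standard port
    }
}
```

Each new datasource type then only needs to subclass the base DTO and declare its defaults; the shared API layer can call applyDefaults() uniformly.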

Step 2: Validate and Transform Parameters

The DataSourceProcessor for the selected database type validates the parameters against security rules and database-specific constraints. Validation includes checking host format (no illegal characters), database name validity, and filtering dangerous connection properties. Valid parameters are transformed into a ConnectionParam object containing the constructed JDBC URL, encoded credentials, and driver class.

Key considerations:

  • AbstractDataSourceProcessor enforces a blocklist of dangerous parameter keys
  • Host and database name are validated against regex patterns
  • Password is encrypted using PasswordUtils before storage
  • JDBC URL is constructed using database-specific format rules
  • A unique datasource ID is generated for caching purposes
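The validation and URL-construction steps above can be sketched as follows. The blocklist entries, regex, and MySQL-style URL format are illustrative stand-ins for the checks AbstractDataSourceProcessor performs; the real project's patterns and key lists differ in detail:

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Sketch of the validate-and-transform step (illustrative, not the real API).
class ParamValidator {
    // Example connection properties that could subvert the JDBC driver.
    private static final Set<String> DANGEROUS_KEYS =
            Set.of("autoDeserialize", "allowLoadLocalInfile", "allowUrlInLocalInfile");

    // Host: letters, digits, dots, and hyphens only (no '/', '?', '&', ...).
    private static final Pattern HOST_PATTERN =
            Pattern.compile("^[A-Za-z0-9][A-Za-z0-9.\\-]*$");

    static void validate(String host, String database, Map<String, String> other) {
        if (host == null || !HOST_PATTERN.matcher(host).matches()) {
            throw new IllegalArgumentException("illegal host: " + host);
        }
        if (database == null || database.isBlank() || database.contains("/")) {
            throw new IllegalArgumentException("illegal database name: " + database);
        }
        for (String key : other.keySet()) {
            if (DANGEROUS_KEYS.contains(key)) {
                throw new IllegalArgumentException("blocked connection property: " + key);
            }
        }
    }

    // Construct a MySQL-style JDBC URL from validated parameters.
    static String buildJdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }
}
```

Rejecting dangerous property keys before URL construction matters because several JDBC drivers interpret properties such as autoDeserialize in ways that can lead to remote code execution.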

Step 3: Test Connection

Before saving the datasource configuration, the system tests the connection by creating an ad-hoc client. The AdHocDataSourceClient establishes a direct JDBC connection using the validated ConnectionParam, executes a basic connectivity check, and immediately closes the connection. This confirms that the credentials and network path are valid.

Key considerations:

  • Ad-hoc connections bypass the connection pool
  • Connection test respects configured timeouts
  • Failed tests return descriptive error messages for troubleshooting
  • Special handling for non-JDBC datasources (e.g., SSH, K8s, SageMaker)
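A minimal sketch of the ad-hoc connectivity check: open a direct JDBC connection with no pool, verify it, and close it immediately via try-with-resources. The class and method names are illustrative; DolphinScheduler's AdHocDataSourceClient wraps the same idea with per-datasource drivers and configured timeouts:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Optional;

// Illustrative ad-hoc connection tester (not the project's actual class).
class AdHocConnectionTester {

    // Returns an empty Optional on success, or a descriptive error message.
    static Optional<String> testConnection(String jdbcUrl, String user,
                                           String password, int timeoutSeconds) {
        DriverManager.setLoginTimeout(timeoutSeconds);
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            // isValid() issues a lightweight round-trip to confirm liveness.
            if (conn.isValid(timeoutSeconds)) {
                return Optional.empty();
            }
            return Optional.of("connection established but failed validity check");
        } catch (SQLException e) {
            // Surface a descriptive message for troubleshooting.
            return Optional.of("connection test failed: " + e.getMessage());
        }
    }
}
```

Because the connection is opened and closed inside one try-with-resources block, a successful test leaves no resources behind and never touches the production pool.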

Step 4: Establish Connection Pool

For production use, the DataSourceClientProvider creates a PooledDataSourceClient backed by HikariCP. The pooled client manages a pool of reusable connections with configurable sizing, idle timeout, and validation settings. The provider uses a Guava cache (24-hour TTL, max 100 entries) to store pooled clients, keyed by the datasource unique ID.

Key considerations:

  • HikariCP provides high-performance connection pooling
  • Cache eviction triggers proper cleanup of pooled connections
  • Multiple tasks sharing the same datasource share the same pool
  • Pool configuration adapts to database-specific requirements
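The cache semantics described above (TTL, bounded size, cleanup on eviction) can be sketched with a stdlib stand-in. The real implementation uses a Guava cache holding HikariCP-backed clients; this simplified version only mimics the two eviction rules and shows that eviction must close the evicted pool:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stdlib stand-in for the provider's Guava client cache (illustrative only).
class PooledClientCache<C extends AutoCloseable> {
    private final long ttlMillis;
    private final int maxEntries;
    // Insertion-ordered map: the oldest entry is first, so size-based
    // eviction drops the oldest client.
    private final LinkedHashMap<String, Entry<C>> cache = new LinkedHashMap<>();

    private record Entry<T>(T client, long createdAt) {}

    PooledClientCache(long ttlMillis, int maxEntries) {
        this.ttlMillis = ttlMillis;
        this.maxEntries = maxEntries;
    }

    // Tasks sharing the same datasource ID receive the same pooled client.
    synchronized C get(String datasourceId, Supplier<C> factory) {
        evictExpired();
        Entry<C> e = cache.get(datasourceId);
        if (e == null) {
            if (cache.size() >= maxEntries) {
                evictOldest();
            }
            e = new Entry<>(factory.get(), System.currentTimeMillis());
            cache.put(datasourceId, e);
        }
        return e.client();
    }

    private void evictExpired() {
        long now = System.currentTimeMillis();
        Iterator<Entry<C>> it = cache.values().iterator();
        while (it.hasNext()) {
            Entry<C> e = it.next();
            if (now - e.createdAt() >= ttlMillis) {
                close(e.client());
                it.remove();
            }
        }
    }

    private void evictOldest() {
        Iterator<Map.Entry<String, Entry<C>>> it = cache.entrySet().iterator();
        if (it.hasNext()) {
            close(it.next().getValue().client());
            it.remove();
        }
    }

    private void close(C client) {
        try {
            client.close(); // eviction triggers cleanup of the pooled connections
        } catch (Exception ignored) {
        }
    }

    synchronized int size() {
        return cache.size();
    }
}
```

In the real system the cache is configured with a 24-hour TTL and a 100-entry maximum, and the removal listener closes the underlying HikariCP pool so evicted datasources do not leak connections.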

Step 5: Execute Queries Through Connection

When a task needs to interact with a database, it obtains a connection from the pool through DataSourceClientProvider. The client executes SQL statements, handles result sets, and returns the connection to the pool upon completion. SQL statements are preprocessed to remove comments and split into individual statements for safe execution.

Key considerations:

  • SQL parsing uses the Druid SQL parser for correctness
  • Comment removal prevents injection through SQL comments
  • Statement splitting handles multi-statement execution
  • Connection is returned to pool even on execution failure (try-with-resources)
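The preprocessing steps above can be illustrated with a simplified scanner that strips comments and splits on semicolons outside string literals. DolphinScheduler relies on the Druid SQL parser for this, which handles far more edge cases (dialect syntax, escaped quotes, nested constructs); this sketch only demonstrates the two steps:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified SQL preprocessing sketch (the real code uses the Druid parser).
class SqlPreprocessor {

    static List<String> split(String sql) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inSingleQuote = false;
        int i = 0;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (inSingleQuote) {
                current.append(c);
                if (c == '\'') inSingleQuote = false;
                i++;
            } else if (c == '\'') {
                inSingleQuote = true;
                current.append(c);
                i++;
            } else if (c == '-' && i + 1 < sql.length() && sql.charAt(i + 1) == '-') {
                // Line comment: skip to end of line.
                while (i < sql.length() && sql.charAt(i) != '\n') i++;
            } else if (c == '/' && i + 1 < sql.length() && sql.charAt(i + 1) == '*') {
                // Block comment: skip past the closing */.
                int end = sql.indexOf("*/", i + 2);
                i = (end < 0) ? sql.length() : end + 2;
            } else if (c == ';') {
                // Statement boundary outside any string literal.
                addIfNonEmpty(statements, current);
                current.setLength(0);
                i++;
            } else {
                current.append(c);
                i++;
            }
        }
        addIfNonEmpty(statements, current);
        return statements;
    }

    private static void addIfNonEmpty(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) out.add(s);
    }
}
```

Note that semicolons inside quoted literals do not split a statement, which is exactly why a real parser rather than a plain String.split(";") is required.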

Step 6: Monitor and Manage Connection Lifecycle

The system monitors connection health through pool statistics and periodic validation. Stale connections are evicted from the pool, and the Guava cache removes unused datasource clients after the TTL expires. Administrators can update datasource configurations, which triggers pool recreation with the new settings.

Key considerations:

  • Pool statistics are exposed for monitoring (active connections, idle connections, wait time)
  • Configuration changes require pool recreation, not hot-reload
  • Credential rotation is supported by updating the datasource and recreating the pool
  • Proper shutdown drains all pools and closes all connections
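The update-and-shutdown behavior above can be sketched with a small registry. The class and method names are illustrative, not DolphinScheduler's API: updating a datasource (including credential rotation) invalidates its cached client and closes the old pool so the next access builds a pool with the new settings, and shutdown drains every remaining pool:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative lifecycle registry for pooled datasource clients.
class DataSourceLifecycleManager<C extends AutoCloseable> {
    private final Map<String, C> clients = new ConcurrentHashMap<>();

    void register(String datasourceId, C client) {
        clients.put(datasourceId, client);
    }

    // Called when an administrator updates configuration or rotates
    // credentials: close the old pool; a new one is created on next use.
    void invalidate(String datasourceId) {
        closeQuietly(clients.remove(datasourceId));
    }

    // Proper shutdown drains all pools and closes all connections.
    void shutdown() {
        clients.values().forEach(this::closeQuietly);
        clients.clear();
    }

    private void closeQuietly(C client) {
        if (client == null) return;
        try {
            client.close();
        } catch (Exception ignored) {
        }
    }

    int activeClientCount() {
        return clients.size();
    }
}
```

Recreating the pool rather than hot-reloading keeps the design simple: in-flight connections finish against the old pool while new requests see the updated configuration.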

Execution Diagram
