Workflow: Apache DolphinScheduler Datasource Connection Management
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Database_Integration, Connection_Pooling |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
End-to-end process for configuring, validating, pooling, and using database connections in Apache DolphinScheduler's multi-datasource environment.
Description
This workflow describes how DolphinScheduler manages database connections across its 28+ supported datasource types. The system provides a unified API for creating, validating, caching, and using connections regardless of the underlying database technology. It covers the lifecycle from user-provided connection parameters through validation, pooling with HikariCP, cached access via Guava cache, and secure credential management. The architecture separates ad-hoc connections (for testing and one-time queries) from pooled connections (for production workloads).
Usage
Execute this workflow when you need to configure a new database connection in DolphinScheduler, troubleshoot connection issues, optimize connection pool settings, or understand how tasks interact with external databases during execution.
Execution Steps
Step 1: Configure Datasource Parameters
The user provides datasource configuration through the API, specifying the database type, host, port, database name, credentials, and any additional connection properties. The system maps this input to a BaseDataSourceParamDTO subclass specific to the chosen database type, applying default values where appropriate (e.g., default ports per database).
Key considerations:
- Each database type has a dedicated ParamDTO with type-specific defaults
- Additional connection properties are stored as key-value pairs in the "other" field
- Kerberos authentication is supported for Hadoop-ecosystem databases via BaseHDFSDataSourceParamDTO
- The API validates required fields before proceeding
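As a rough sketch of the pattern above, a type-specific ParamDTO can carry its own defaults and an "other" map for extra connection properties. The class and method names below are hypothetical illustrations, not DolphinScheduler's actual classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a BaseDataSourceParamDTO subclass (e.g., for MySQL).
class MySQLDataSourceParamDTO {
    String host;
    int port = 3306;                              // type-specific default port
    String database;
    String userName;
    String password;
    // Additional connection properties, stored as key-value pairs
    Map<String, String> other = new HashMap<>();

    static MySQLDataSourceParamDTO fromUserInput(String host, Integer port,
                                                 String database, String user,
                                                 String password) {
        MySQLDataSourceParamDTO dto = new MySQLDataSourceParamDTO();
        dto.host = host;
        if (port != null) {
            dto.port = port;                      // user override; otherwise keep default
        }
        dto.database = database;
        dto.userName = user;
        dto.password = password;
        return dto;
    }
}
```

The point of the per-type DTO is that each database contributes its own defaults (here, port 3306) without the caller having to know them.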
Step 2: Validate and Transform Parameters
The DataSourceProcessor for the selected database type validates the parameters against security rules and database-specific constraints. Validation includes checking host format (no illegal characters), database name validity, and filtering dangerous connection properties. Valid parameters are transformed into a ConnectionParam object containing the constructed JDBC URL, encoded credentials, and driver class.
Key considerations:
- AbstractDataSourceProcessor enforces a blocklist of dangerous parameter keys
- Host and database name are validated against regex patterns
- Password is encrypted using PasswordUtils before storage
- JDBC URL is constructed using database-specific format rules
- A unique datasource ID is generated for caching purposes
Step 3: Test Connection
Before saving the datasource configuration, the system tests the connection by creating an ad-hoc client. The AdHocDataSourceClient establishes a direct JDBC connection using the validated ConnectionParam, executes a basic connectivity check, and immediately closes the connection. This confirms that the credentials and network path are valid.
Key considerations:
- Ad-hoc connections bypass the connection pool
- Connection test respects configured timeouts
- Failed tests return descriptive error messages for troubleshooting
- Special handling for non-JDBC datasources (e.g., SSH, K8s, SageMaker)
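For JDBC datasources, the ad-hoc test boils down to opening a direct connection, checking it, and closing it immediately. A minimal sketch (the method name and error handling are assumptions, not the AdHocDataSourceClient API):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical sketch of an ad-hoc connectivity check that bypasses the pool.
class AdHocConnectionCheck {
    static boolean testConnection(String jdbcUrl, String user, String password,
                                  int timeoutSeconds) {
        // try-with-resources closes the connection immediately after the check
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            // Connection.isValid() respects the configured timeout
            return conn.isValid(timeoutSeconds);
        } catch (SQLException e) {
            // Surface a descriptive message for troubleshooting rather than a stack trace
            System.err.println("connection test failed: " + e.getMessage());
            return false;
        }
    }
}
```

Because the connection never enters a pool, a failed test leaves no stale resources behind.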
Step 4: Establish Connection Pool
For production use, the DataSourceClientProvider creates a PooledDataSourceClient backed by HikariCP. The pooled client manages a pool of reusable connections with configurable sizing, idle timeout, and validation settings. The provider uses a Guava cache (24-hour TTL, max 100 entries) to store pooled clients, keyed by the datasource unique ID.
Key considerations:
- HikariCP provides high-performance connection pooling
- Cache eviction triggers proper cleanup of pooled connections
- Multiple tasks sharing the same datasource share the same pool
- Pool configuration adapts to database-specific requirements
Step 5: Execute Queries Through Connection
When a task needs to interact with a database, it obtains a connection from the pool through DataSourceClientProvider. The client executes SQL statements, handles result sets, and returns the connection to the pool upon completion. SQL statements are preprocessed to remove comments and split into individual statements for safe execution.
Key considerations:
- SQL parsing uses the Druid SQL parser for correctness
- Comment removal prevents injection through SQL comments
- Statement splitting handles multi-statement execution
- Connection is returned to pool even on execution failure (try-with-resources)
Step 6: Monitor and Manage Connection Lifecycle
The system monitors connection health through pool statistics and periodic validation. Stale connections are evicted from the pool, and the Guava cache removes unused datasource clients after the TTL expires. Administrators can update datasource configurations, which triggers pool recreation with the new settings.
Key considerations:
- Pool statistics are exposed for monitoring (active connections, idle connections, wait time)
- Configuration changes require pool recreation, not hot-reload
- Credential rotation is supported by updating the datasource and recreating the pool
- Proper shutdown drains all pools and closes all connections
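The "recreate, don't hot-reload" rule for configuration changes can be sketched as a close-then-rebuild sequence. The names below are hypothetical; the real lifecycle is driven by the cache's eviction and the client's close logic:

```java
import java.util.function.Supplier;

// Hypothetical sketch: a configuration update closes the old pool and
// builds a fresh one from the new settings, rather than mutating in place.
class PoolLifecycle {
    static Object recreatePool(AutoCloseable oldPool, Supplier<Object> newPoolFactory) {
        if (oldPool != null) {
            try {
                oldPool.close();              // drain and close existing connections
            } catch (Exception e) {
                throw new RuntimeException("failed to close old pool", e);
            }
        }
        return newPoolFactory.get();          // pool built from updated parameters
    }
}
```

The same close-then-rebuild sequence serves credential rotation: update the stored datasource, evict the cached client, and the next task to request a connection gets a pool built with the new credentials.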