feat(cubemaster): add PostgreSQL support for dao layer#674
Conversation
| connTimeout = 5 | ||
| } | ||
| return fmt.Sprintf( | ||
| "host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d", |
There was a problem hiding this comment.
Security: sslmode=disable is hardcoded with no config knobs to enable TLS. The dao.Config struct has an Extra map[string]string field explicitly documented for engine-specific knobs, but this driver never reads it. This means database credentials (username, password) and all query data are transmitted in cleartext over the network between CubeMaster and PostgreSQL.
The MySQL driver does not hardcode tls=false, making this PG driver strictly worse. In production environments where the DB is not on the loopback interface, this is a real risk (CWE-319).
Recommendation: Read an SSL mode from cfg.Extra["sslmode"] (defaulting to require or at minimum prefer), and support sslrootcert for verified TLS.
| func buildDSN(cfg dao.Config) string { | ||
| connTimeout := cfg.ConnTimeoutSeconds | ||
| if connTimeout <= 0 { | ||
| connTimeout = 5 | ||
| } | ||
| return fmt.Sprintf( | ||
| "host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d", | ||
| cfg.Addr, cfg.User, cfg.Pwd, cfg.DBName, connTimeout, | ||
| ) | ||
| } |
There was a problem hiding this comment.
Behavioral gap vs MySQL: The MySQL driver's buildDSN plumbers cfg.ReadTimeoutSeconds and cfg.WriteTimeoutSeconds into the DSN (mysql.go:87-93). This driver ignores both. The pgx driver has no native DSN read/write timeout, so a hung or deadlocked PG query occupies a connection from the pool indefinitely without deadline protection.
On large tables, the migration backfill queries (e.g., the leading-wildcard LIKE '%' || a.artifact_id || '%' in 20260624121500_template_definition_rootfs_artifact_id.sql) can take minutes. With read/write timeout unset, the only recourse is the OS-level TCP timeout.
Recommendation: Either set a session-level statement_timeout after connecting, or pass a context with deadline derived from the config values.
| func (s *sessionLocker) SessionLock(ctx context.Context, conn *sql.Conn) error { | ||
| deadline := time.Now().Add(time.Duration(s.timeout) * time.Second) | ||
| const retryInterval = 200 * time.Millisecond | ||
|
|
||
| for { | ||
| var acquired bool | ||
| if err := conn.QueryRowContext(ctx, | ||
| "SELECT pg_try_advisory_lock($1)", s.id).Scan(&acquired); err != nil { | ||
| return fmt.Errorf("acquire advisory lock %d: %w", s.id, err) | ||
| } | ||
| if acquired { | ||
| return nil | ||
| } | ||
| if time.Now().After(deadline) { | ||
| return fmt.Errorf("acquire advisory lock %d: timeout after %ds", s.id, s.timeout) | ||
| } | ||
| select { | ||
| case <-ctx.Done(): | ||
| return fmt.Errorf("acquire advisory lock %d: %w", s.id, ctx.Err()) | ||
| case <-time.After(retryInterval): |
There was a problem hiding this comment.
Thundering herd risk: The retry loop polls pg_try_advisory_lock at a fixed 200ms interval with no backoff. The MySQL counterpart (mysql.go:115-128) makes a single blocking GET_LOCK(?, timeout) call and lets MySQL handle retries internally. During a cluster rolling restart (common with Kubernetes), N instances all pound PG at 200ms intervals — 10 instances contending for 30s = 1500 round trips.
Recommendation: Add exponential backoff with jitter (e.g., base 50ms → max 2s) to desynchronize concurrent lockers and reduce PG load.
| // advisoryLockID is the pg_advisory_lock key held for the entire | ||
| // goose.Up() run (outer layer of the two-layer locking scheme). | ||
| // The value is arbitrary but MUST remain stable across versions; | ||
| // changing it would let a paused old instance and a new instance | ||
| // both acquire the lock and race. Derived from: | ||
| // SELECT hashtext('cubemaster_schema_migration_global'); | ||
| advisoryLockID = 3764529487 |
There was a problem hiding this comment.
Minor comment inaccuracy: The comment says the constant is "Derived from: SELECT hashtext('cubemaster_schema_migration_global')". PostgreSQL's hashtext() returns int4 (signed 32-bit, max 2147483647), so the actual PG result would be different from 3764529487. The outer locker never calls hashtext — it uses this constant directly as a bigint advisory lock key. No functional bug, but the comment misleads anyone trying to verify the value against a live PG instance.
Recommendation: Either store the actual signed int32 output of SELECT hashtext(...), or simplify the comment to "Stable arbitrary value; do not change across versions" and remove the "Derived from" claim.
Review: PR #674 ��� PostgreSQL DAO LayerA well-structured, additive PR that cleanly extracts a fingerprintStore interface and provides a pgx-based PostgreSQL driver. Findings below are production hardening issues rather than fundamental design problems. Critical1. Hardcoded sslmode=disable ��� no TLS possible (postgres.go:90) 2. ReadTimeout/WriteTimeout silently ignored (postgres.go:84-93) High3. Session locker retry has no backoff (session_locker.go:28-49) 4. Test locker duplicates production code with regression (migrate_postgres_test.go:129) 5. Duplicated preflight validation logic (fingerprint.go:224-286) 6. driver.Open() error paths untested (postgres_test.go) 7. No fingerprint-content-drift test for PostgreSQL Medium8. advisoryLockID comment misleading (postgres.go:28-34) 9. currentlyAppliedVersionsPostgres query (fingerprint_postgres.go:61-69) SummarySolid architecture ��� pure additive, no MySQL regression, proper interface extraction. Address TLS and timeout gaps before production PG deployment. |
| connTimeout = 5 | ||
| } | ||
| return fmt.Sprintf( | ||
| "host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d", |
There was a problem hiding this comment.
Security: sslmode=disable is hardcoded with no config knobs to enable TLS. The dao.Config struct has an Extra map[string]string field explicitly documented for engine-specific knobs, but this driver never reads it. This means database credentials (username, password) and all query data are transmitted in cleartext over the network between CubeMaster and PostgreSQL.
The MySQL driver does not hardcode tls=false, making this PG driver strictly worse. In production environments where the DB is not on the loopback interface, this is a real risk (CWE-319).
Recommendation: Read an SSL mode from cfg.Extra["sslmode"] (defaulting to require or at minimum prefer), and support sslrootcert for verified TLS.
| func buildDSN(cfg dao.Config) string { | ||
| connTimeout := cfg.ConnTimeoutSeconds | ||
| if connTimeout <= 0 { | ||
| connTimeout = 5 | ||
| } | ||
| return fmt.Sprintf( | ||
| "host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d", | ||
| cfg.Addr, cfg.User, cfg.Pwd, cfg.DBName, connTimeout, | ||
| ) | ||
| } |
There was a problem hiding this comment.
Behavioral gap vs MySQL: The MySQL driver's buildDSN plumbers cfg.ReadTimeoutSeconds and cfg.WriteTimeoutSeconds into the DSN (mysql.go:87-93). This driver ignores both. The pgx driver has no native DSN read/write timeout, so a hung or deadlocked PG query occupies a connection from the pool indefinitely without deadline protection.
On large tables, the migration backfill queries (e.g., the leading-wildcard LIKE '%' || a.artifact_id || '%' in 20260624121500_template_definition_rootfs_artifact_id.sql) can take minutes. With read/write timeout unset, the only recourse is the OS-level TCP timeout.
Recommendation: Either set a session-level statement_timeout after connecting, or pass a context with deadline derived from the config values.
| func (s *sessionLocker) SessionLock(ctx context.Context, conn *sql.Conn) error { | ||
| deadline := time.Now().Add(time.Duration(s.timeout) * time.Second) | ||
| const retryInterval = 200 * time.Millisecond | ||
|
|
||
| for { | ||
| var acquired bool | ||
| if err := conn.QueryRowContext(ctx, | ||
| "SELECT pg_try_advisory_lock($1)", s.id).Scan(&acquired); err != nil { | ||
| return fmt.Errorf("acquire advisory lock %d: %w", s.id, err) | ||
| } | ||
| if acquired { | ||
| return nil | ||
| } | ||
| if time.Now().After(deadline) { | ||
| return fmt.Errorf("acquire advisory lock %d: timeout after %ds", s.id, s.timeout) | ||
| } | ||
| select { | ||
| case <-ctx.Done(): | ||
| return fmt.Errorf("acquire advisory lock %d: %w", s.id, ctx.Err()) | ||
| case <-time.After(retryInterval): |
There was a problem hiding this comment.
Thundering herd risk: The retry loop polls pg_try_advisory_lock at a fixed 200ms interval with no backoff. The MySQL counterpart (mysql.go:115-128) makes a single blocking GET_LOCK(?, timeout) call and lets MySQL handle retries internally. During a cluster rolling restart (common with Kubernetes), N instances all pound PG at 200ms intervals — 10 instances contending for 30s = 1500 round trips.
Recommendation: Add exponential backoff with jitter (e.g., base 50ms → max 2s) to desynchronize concurrent lockers and reduce PG load.
| // advisoryLockID is the pg_advisory_lock key held for the entire | ||
| // goose.Up() run (outer layer of the two-layer locking scheme). | ||
| // The value is arbitrary but MUST remain stable across versions; | ||
| // changing it would let a paused old instance and a new instance | ||
| // both acquire the lock and race. Derived from: | ||
| // SELECT hashtext('cubemaster_schema_migration_global'); | ||
| advisoryLockID = 3764529487 |
There was a problem hiding this comment.
Minor comment inaccuracy: The comment says the constant is "Derived from: SELECT hashtext('cubemaster_schema_migration_global')". PostgreSQL's hashtext() returns int4 (signed 32-bit, max 2147483647), so the actual PG result would be different from 3764529487. The outer locker never calls hashtext — it uses this constant directly as a bigint advisory lock key. No functional bug, but the comment misleads anyone trying to verify the value against a live PG instance.
Recommendation: Either store the actual signed int32 output of SELECT hashtext(...), or simplify the comment to "Stable arbitrary value; do not change" and remove the "Derived from" claim.
chenhengqi
left a comment
There was a problem hiding this comment.
Why do we have to duplicate all 13 *.sql files?
|
Will this PR affect the MySQL part? If not, I think it's not a big deal. It would be best to have some documentation explaining how to use PostgreSQL as the database for CubeMaster, how to configure CubeMaster, etc. Of course, this can be included as part of a future PR. |
Implement the PostgreSQL driver for CubeMaster's data-access layer,
fulfilling the multi-database design that was pre-wired into the
dao.Driver interface.
Changes:
- Add pkg/base/dao/driver/postgres with pgx-based Driver implementation
and pg_advisory_lock session locker
- Translate all 13 MySQL migrations to PostgreSQL equivalents under
migrations/postgres/, preserving identical logical schema
- Refactor fingerprint defence layer into a fingerprintStore interface
with per-dialect implementations (MySQL: ON DUPLICATE KEY UPDATE,
PG: ON CONFLICT DO UPDATE)
- Register postgres dialect in migrate.go dialectSpecs
- Blank-import postgres driver in cmd/cubemaster/app/main.go
- Add dockertest-based integration tests (postgres:16) verifying
fresh migration and idempotent re-run
The postgres driver is selected via conf.yaml driver field:
ossdb_config:
driver: postgres
All existing MySQL tests pass without regression.
Closes TencentCloud#517
Signed-off-by: chrisloong <chrisloong5@gmail.com>
d460872 to
e0d0273
Compare
|
Thanks for the review! @chenhengqi — The 13 SQL files under
goose runs migrations per-dialect from embedded FS ( @kinwin-ustc — Correct, this PR is purely additive and does not touch any existing MySQL code paths. All existing MySQL tests continue to pass unchanged. I'll follow up with a documentation PR explaining PG configuration (config fields, sslmode, example YAML) once this lands. Regarding the automated review findings — I've addressed all Critical and High items in the latest force-push:
|
| } | ||
|
|
||
| // currentlyAppliedVersionsPostgres uses $1 placeholder syntax for PG. | ||
| func currentlyAppliedVersionsPostgres(ctx context.Context, db *sql.DB) (map[int64]bool, error) { |
There was a problem hiding this comment.
This function is a near-identical copy of currentlyAppliedVersions in fingerprint.go (same query structure, same scan loop, same JOIN subquery). The only differences are the boolean literal (true vs 1), schema lookup (current_schema() vs DATABASE()), and placeholder syntax ($1 vs ?). Consider extracting the dialect-specific fragments into configurable parameters to eliminate ~35 lines of duplicated logic.
| return out, nil | ||
| } | ||
|
|
||
| func tableExistsPostgres(ctx context.Context, db *sql.DB, name string) (bool, error) { |
There was a problem hiding this comment.
Same duplication pattern as currentlyAppliedVersionsPostgres: this is a near-copy of tableExists in fingerprint.go, differing only in current_schema() vs DATABASE() and the placeholder character. These could share a parameterized helper.
| return &pgTestLocker{id: 999999999, timeout: 30} | ||
| } | ||
|
|
||
| type pgTestLocker struct { |
There was a problem hiding this comment.
This test locker duplicates the production sessionLocker logic but with two behavioral differences: (1) no ctx.Done() check — the test can't be cancelled by context deadline, and (2) fixed 200ms sleep instead of exponential backoff. If the production locker's ctx.Done() path or backoff logic changes, the test locker silently diverges. Consider using the real driver's SessionLocker instead, or exporting a session locker helper that both production and test can use.
| // Versions that are NOT currently applied are intentionally skipped: this lets | ||
| // the operator remediation runbook (delete a goose_db_version row to force a | ||
| // re-apply) work without tripping the check. | ||
| // Deprecated: delegates to preflightFingerprintsWithStore using the MySQL store |
There was a problem hiding this comment.
These wrapper functions (preflightFingerprints at line 226, and its counterparts recordFingerprints at line 263 and ensureFingerprintTable at line 120) are now dead code after the refactor — they only delegate to the MySQL store and have no callers. The MySQL-specific DDL inside ensureFingerprintTable (ENGINE=InnoDB, CHARSET=utf8mb4) could confuse future readers who don't realize it's dead. Consider removing them.
Summary
Implement the PostgreSQL driver for CubeMaster's data-access layer, fulfilling the multi-database design pre-wired into the
dao.Driverinterface.Closes #517
Changes
pkg/base/dao/driver/postgres/— pgx-baseddao.Driverimplementation withpg_advisory_locksession lockermigrations/postgres/— 1:1 translation of all MySQL migrations producing identical logical schemafingerprintStoreinterface with per-dialect implementations (MySQL:ON DUPLICATE KEY UPDATE, PG:ON CONFLICT DO UPDATE)"postgres"indialectSpecsmap + embed PG migrationscmd/cubemaster/app/main.gogorm.io/driver/postgres+github.com/jackc/pgx/v5How to use
Set
driver: postgresin conf.yaml:Testing
dockertest+postgres:16-alpine):TestPostgres_Run_Fresh— full migration from empty DB to HEAD ✅TestPostgres_Run_Idempotent— re-run is a no-op ✅Design decisions
pg_try_advisory_lockwith retry loop mirrors MySQLGET_LOCK(name, timeout)semanticsfingerprint.gothat explicitly calls for extracting an interface before adding a second dialect