Skip to content

feat(cubemaster): add PostgreSQL support for dao layer#674

Open
cookiebody wants to merge 1 commit into
TencentCloud:masterfrom
cookiebody:feat/postgres-support
Open

feat(cubemaster): add PostgreSQL support for dao layer#674
cookiebody wants to merge 1 commit into
TencentCloud:masterfrom
cookiebody:feat/postgres-support

Conversation

@cookiebody

Copy link
Copy Markdown

Summary

Implement the PostgreSQL driver for CubeMaster's data-access layer, fulfilling the multi-database design pre-wired into the dao.Driver interface.

Closes #517

Changes

  • New driver: pkg/base/dao/driver/postgres/ — pgx-based dao.Driver implementation with pg_advisory_lock session locker
  • 13 PG migrations: migrations/postgres/ — 1:1 translation of all MySQL migrations producing identical logical schema
  • Fingerprint refactor: extract fingerprintStore interface with per-dialect implementations (MySQL: ON DUPLICATE KEY UPDATE, PG: ON CONFLICT DO UPDATE)
  • Dialect registration: register "postgres" in dialectSpecs map + embed PG migrations
  • Blank import: register postgres driver in cmd/cubemaster/app/main.go
  • Dependencies: add gorm.io/driver/postgres + github.com/jackc/pgx/v5

How to use

Set driver: postgres in conf.yaml:

ossdb_config:
  driver: postgres
  addr: "127.0.0.1:5432"
  user: cube
  pwd: cube_pass
  db_name: cube_mvp

Testing

  • All existing MySQL integration tests pass without regression
  • New PostgreSQL integration tests (dockertest + postgres:16-alpine):
    • TestPostgres_Run_Fresh — full migration from empty DB to HEAD ✅
    • TestPostgres_Run_Idempotent — re-run is a no-op ✅

Design decisions

  • Pure additive: zero changes to existing MySQL driver/migrations
  • pgx over lib/pq: pgx is actively maintained, faster, and the GORM postgres driver uses it natively
  • Advisory locks: pg_try_advisory_lock with retry loop mirrors MySQL GET_LOCK(name, timeout) semantics
  • Fingerprint interface: follows the comment in fingerprint.go that explicitly calls for extracting an interface before adding a second dialect
connTimeout = 5
}
return fmt.Sprintf(
"host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: sslmode=disable is hardcoded with no config knobs to enable TLS. The dao.Config struct has an Extra map[string]string field explicitly documented for engine-specific knobs, but this driver never reads it. This means database credentials (username, password) and all query data are transmitted in cleartext over the network between CubeMaster and PostgreSQL.

The MySQL driver does not hardcode tls=false, making this PG driver strictly worse. In production environments where the DB is not on the loopback interface, this is a real risk (CWE-319).

Recommendation: Read an SSL mode from cfg.Extra["sslmode"] (defaulting to require or at minimum prefer), and support sslrootcert for verified TLS.

Comment on lines +84 to +93
func buildDSN(cfg dao.Config) string {
connTimeout := cfg.ConnTimeoutSeconds
if connTimeout <= 0 {
connTimeout = 5
}
return fmt.Sprintf(
"host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d",
cfg.Addr, cfg.User, cfg.Pwd, cfg.DBName, connTimeout,
)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavioral gap vs MySQL: The MySQL driver's buildDSN plumbers cfg.ReadTimeoutSeconds and cfg.WriteTimeoutSeconds into the DSN (mysql.go:87-93). This driver ignores both. The pgx driver has no native DSN read/write timeout, so a hung or deadlocked PG query occupies a connection from the pool indefinitely without deadline protection.

On large tables, the migration backfill queries (e.g., the leading-wildcard LIKE '%' || a.artifact_id || '%' in 20260624121500_template_definition_rootfs_artifact_id.sql) can take minutes. With read/write timeout unset, the only recourse is the OS-level TCP timeout.

Recommendation: Either set a session-level statement_timeout after connecting, or pass a context with deadline derived from the config values.

Comment on lines +28 to +47
func (s *sessionLocker) SessionLock(ctx context.Context, conn *sql.Conn) error {
deadline := time.Now().Add(time.Duration(s.timeout) * time.Second)
const retryInterval = 200 * time.Millisecond

for {
var acquired bool
if err := conn.QueryRowContext(ctx,
"SELECT pg_try_advisory_lock($1)", s.id).Scan(&acquired); err != nil {
return fmt.Errorf("acquire advisory lock %d: %w", s.id, err)
}
if acquired {
return nil
}
if time.Now().After(deadline) {
return fmt.Errorf("acquire advisory lock %d: timeout after %ds", s.id, s.timeout)
}
select {
case <-ctx.Done():
return fmt.Errorf("acquire advisory lock %d: %w", s.id, ctx.Err())
case <-time.After(retryInterval):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thundering herd risk: The retry loop polls pg_try_advisory_lock at a fixed 200ms interval with no backoff. The MySQL counterpart (mysql.go:115-128) makes a single blocking GET_LOCK(?, timeout) call and lets MySQL handle retries internally. During a cluster rolling restart (common with Kubernetes), N instances all pound PG at 200ms intervals — 10 instances contending for 30s = 1500 round trips.

Recommendation: Add exponential backoff with jitter (e.g., base 50ms → max 2s) to desynchronize concurrent lockers and reduce PG load.

Comment on lines +28 to +34
// advisoryLockID is the pg_advisory_lock key held for the entire
// goose.Up() run (outer layer of the two-layer locking scheme).
// The value is arbitrary but MUST remain stable across versions;
// changing it would let a paused old instance and a new instance
// both acquire the lock and race. Derived from:
// SELECT hashtext('cubemaster_schema_migration_global');
advisoryLockID = 3764529487

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment inaccuracy: The comment says the constant is "Derived from: SELECT hashtext('cubemaster_schema_migration_global')". PostgreSQL's hashtext() returns int4 (signed 32-bit, max 2147483647), so the actual PG result would be different from 3764529487. The outer locker never calls hashtext — it uses this constant directly as a bigint advisory lock key. No functional bug, but the comment misleads anyone trying to verify the value against a live PG instance.

Recommendation: Either store the actual signed int32 output of SELECT hashtext(...), or simplify the comment to "Stable arbitrary value; do not change across versions" and remove the "Derived from" claim.

@cubesandboxbot

cubesandboxbot Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review: PR #674 ��� PostgreSQL DAO Layer

A well-structured, additive PR that cleanly extracts a fingerprintStore interface and provides a pgx-based PostgreSQL driver. Findings below are production hardening issues rather than fundamental design problems.

Critical

1. Hardcoded sslmode=disable ��� no TLS possible (postgres.go:90)
No config knob to enable TLS. Credentials and data transmitted in cleartext. MySQL driver does not hardcode TLS off.

2. ReadTimeout/WriteTimeout silently ignored (postgres.go:84-93)
MySQL driver wires both into DSN; PG driver ignores both. A hung query occupies a connection pool slot indefinitely.

High

3. Session locker retry has no backoff (session_locker.go:28-49)
Fixed 200ms polling interval vs MySQL single blocking GET_LOCK call. Thundering herd on cluster restart.

4. Test locker duplicates production code with regression (migrate_postgres_test.go:129)
Uses time.Sleep instead of context-aware select. Tests respond slowly to cancellation.

5. Duplicated preflight validation logic (fingerprint.go:224-286)
preflightFingerprints duplicates preflightFingerprintsWithStore. Should delegate like recordFingerprints.

6. driver.Open() error paths untested (postgres_test.go)
Most critical function has zero direct coverage.

7. No fingerprint-content-drift test for PostgreSQL
MySQL tests verify ErrFingerprintMismatch; PG has no equivalent test.

Medium

8. advisoryLockID comment misleading (postgres.go:28-34)
Claims derivation from hashtext(...), but 3764529487 exceeds PG's signed int4 range.

9. currentlyAppliedVersionsPostgres query (fingerprint_postgres.go:61-69)
Could use DISTINCT ON instead of self-join.

Summary

Solid architecture ��� pure additive, no MySQL regression, proper interface extraction. Address TLS and timeout gaps before production PG deployment.

connTimeout = 5
}
return fmt.Sprintf(
"host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: sslmode=disable is hardcoded with no config knobs to enable TLS. The dao.Config struct has an Extra map[string]string field explicitly documented for engine-specific knobs, but this driver never reads it. This means database credentials (username, password) and all query data are transmitted in cleartext over the network between CubeMaster and PostgreSQL.

The MySQL driver does not hardcode tls=false, making this PG driver strictly worse. In production environments where the DB is not on the loopback interface, this is a real risk (CWE-319).

Recommendation: Read an SSL mode from cfg.Extra["sslmode"] (defaulting to require or at minimum prefer), and support sslrootcert for verified TLS.

Comment on lines +84 to +93
func buildDSN(cfg dao.Config) string {
connTimeout := cfg.ConnTimeoutSeconds
if connTimeout <= 0 {
connTimeout = 5
}
return fmt.Sprintf(
"host=%s user=%s password=%s dbname=%s sslmode=disable connect_timeout=%d",
cfg.Addr, cfg.User, cfg.Pwd, cfg.DBName, connTimeout,
)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavioral gap vs MySQL: The MySQL driver's buildDSN plumbers cfg.ReadTimeoutSeconds and cfg.WriteTimeoutSeconds into the DSN (mysql.go:87-93). This driver ignores both. The pgx driver has no native DSN read/write timeout, so a hung or deadlocked PG query occupies a connection from the pool indefinitely without deadline protection.

On large tables, the migration backfill queries (e.g., the leading-wildcard LIKE '%' || a.artifact_id || '%' in 20260624121500_template_definition_rootfs_artifact_id.sql) can take minutes. With read/write timeout unset, the only recourse is the OS-level TCP timeout.

Recommendation: Either set a session-level statement_timeout after connecting, or pass a context with deadline derived from the config values.

Comment on lines +28 to +47
func (s *sessionLocker) SessionLock(ctx context.Context, conn *sql.Conn) error {
deadline := time.Now().Add(time.Duration(s.timeout) * time.Second)
const retryInterval = 200 * time.Millisecond

for {
var acquired bool
if err := conn.QueryRowContext(ctx,
"SELECT pg_try_advisory_lock($1)", s.id).Scan(&acquired); err != nil {
return fmt.Errorf("acquire advisory lock %d: %w", s.id, err)
}
if acquired {
return nil
}
if time.Now().After(deadline) {
return fmt.Errorf("acquire advisory lock %d: timeout after %ds", s.id, s.timeout)
}
select {
case <-ctx.Done():
return fmt.Errorf("acquire advisory lock %d: %w", s.id, ctx.Err())
case <-time.After(retryInterval):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thundering herd risk: The retry loop polls pg_try_advisory_lock at a fixed 200ms interval with no backoff. The MySQL counterpart (mysql.go:115-128) makes a single blocking GET_LOCK(?, timeout) call and lets MySQL handle retries internally. During a cluster rolling restart (common with Kubernetes), N instances all pound PG at 200ms intervals — 10 instances contending for 30s = 1500 round trips.

Recommendation: Add exponential backoff with jitter (e.g., base 50ms → max 2s) to desynchronize concurrent lockers and reduce PG load.

Comment on lines +28 to +34
// advisoryLockID is the pg_advisory_lock key held for the entire
// goose.Up() run (outer layer of the two-layer locking scheme).
// The value is arbitrary but MUST remain stable across versions;
// changing it would let a paused old instance and a new instance
// both acquire the lock and race. Derived from:
// SELECT hashtext('cubemaster_schema_migration_global');
advisoryLockID = 3764529487

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment inaccuracy: The comment says the constant is "Derived from: SELECT hashtext('cubemaster_schema_migration_global')". PostgreSQL's hashtext() returns int4 (signed 32-bit, max 2147483647), so the actual PG result would be different from 3764529487. The outer locker never calls hashtext — it uses this constant directly as a bigint advisory lock key. No functional bug, but the comment misleads anyone trying to verify the value against a live PG instance.

Recommendation: Either store the actual signed int32 output of SELECT hashtext(...), or simplify the comment to "Stable arbitrary value; do not change" and remove the "Derived from" claim.

@chenhengqi chenhengqi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to duplicate all 13 *.sql files?

@kinwin-ustc

Copy link
Copy Markdown
Collaborator

Will this PR affect the MySQL part? If not, I think it's not a big deal. It would be best to have some documentation explaining how to use PostgreSQL as the database for CubeMaster, how to configure CubeMaster, etc. Of course, this can be included as part of a future PR.

Implement the PostgreSQL driver for CubeMaster's data-access layer,
fulfilling the multi-database design that was pre-wired into the
dao.Driver interface.

Changes:
- Add pkg/base/dao/driver/postgres with pgx-based Driver implementation
  and pg_advisory_lock session locker
- Translate all 13 MySQL migrations to PostgreSQL equivalents under
  migrations/postgres/, preserving identical logical schema
- Refactor fingerprint defence layer into a fingerprintStore interface
  with per-dialect implementations (MySQL: ON DUPLICATE KEY UPDATE,
  PG: ON CONFLICT DO UPDATE)
- Register postgres dialect in migrate.go dialectSpecs
- Blank-import postgres driver in cmd/cubemaster/app/main.go
- Add dockertest-based integration tests (postgres:16) verifying
  fresh migration and idempotent re-run

The postgres driver is selected via conf.yaml driver field:
  ossdb_config:
    driver: postgres

All existing MySQL tests pass without regression.

Closes TencentCloud#517

Signed-off-by: chrisloong <chrisloong5@gmail.com>
@cookiebody cookiebody force-pushed the feat/postgres-support branch from d460872 to e0d0273 Compare July 1, 2026 15:45
@cookiebody

Copy link
Copy Markdown
Author

Thanks for the review!

@chenhengqi — The 13 SQL files under migrations/postgres/ are not copies of the MySQL migrations. They are PostgreSQL-native rewrites because the two dialects have incompatible DDL syntax:

MySQL PostgreSQL equivalent
AUTO_INCREMENT bigserial / GENERATED ALWAYS AS IDENTITY
tinyint(1) boolean
mediumtext text
ENGINE=InnoDB (no equivalent, removed)
GET_LOCK() / RELEASE_LOCK() pg_advisory_lock() / pg_advisory_unlock()
Stored procedures (DELIMITER //) PL/pgSQL DO $$...$$ blocks
ON DUPLICATE KEY UPDATE ON CONFLICT ... DO UPDATE
INSERT IGNORE ON CONFLICT DO NOTHING
UPDATE ... JOIN UPDATE ... FROM ... WHERE
backtick-quoted identifiers double-quoted or unquoted

goose runs migrations per-dialect from embedded FS (//go:embed migrations/postgres/*.sql), which is the same pattern already used for MySQL (//go:embed migrations/mysql/*.sql). Sharing a single set of SQL files across both dialects isn't feasible without a templating layer that would add complexity.

@kinwin-ustc — Correct, this PR is purely additive and does not touch any existing MySQL code paths. All existing MySQL tests continue to pass unchanged. I'll follow up with a documentation PR explaining PG configuration (config fields, sslmode, example YAML) once this lands.


Regarding the automated review findings — I've addressed all Critical and High items in the latest force-push:

  1. sslmode configurable via Extra["sslmode"] (defaults to disable for dev; production users set require/verify-full)
  2. Timeouts mappedWriteTimeoutSecondsstatement_timeout, ReadTimeoutSecondsidle_in_transaction_session_timeout (via PG connection options parameter)
  3. Exponential backoff in session locker (200ms → 400ms → … capped at 2s)
  4. Test locker now context-aware with select/ctx.Done()
  5. preflightFingerprints delegates to preflightFingerprintsWithStore (no duplication)
  6. Open() error path covered by TestOpenInvalidDSN
  7. Fingerprint drift test added for PostgreSQL
}

// currentlyAppliedVersionsPostgres uses $1 placeholder syntax for PG.
func currentlyAppliedVersionsPostgres(ctx context.Context, db *sql.DB) (map[int64]bool, error) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a near-identical copy of currentlyAppliedVersions in fingerprint.go (same query structure, same scan loop, same JOIN subquery). The only differences are the boolean literal (true vs 1), schema lookup (current_schema() vs DATABASE()), and placeholder syntax ($1 vs ?). Consider extracting the dialect-specific fragments into configurable parameters to eliminate ~35 lines of duplicated logic.

return out, nil
}

func tableExistsPostgres(ctx context.Context, db *sql.DB, name string) (bool, error) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same duplication pattern as currentlyAppliedVersionsPostgres: this is a near-copy of tableExists in fingerprint.go, differing only in current_schema() vs DATABASE() and the placeholder character. These could share a parameterized helper.

return &pgTestLocker{id: 999999999, timeout: 30}
}

type pgTestLocker struct {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test locker duplicates the production sessionLocker logic but with two behavioral differences: (1) no ctx.Done() check — the test can't be cancelled by context deadline, and (2) fixed 200ms sleep instead of exponential backoff. If the production locker's ctx.Done() path or backoff logic changes, the test locker silently diverges. Consider using the real driver's SessionLocker instead, or exporting a session locker helper that both production and test can use.

// Versions that are NOT currently applied are intentionally skipped: this lets
// the operator remediation runbook (delete a goose_db_version row to force a
// re-apply) work without tripping the check.
// Deprecated: delegates to preflightFingerprintsWithStore using the MySQL store

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These wrapper functions (preflightFingerprints at line 226, and its counterparts recordFingerprints at line 263 and ensureFingerprintTable at line 120) are now dead code after the refactor — they only delegate to the MySQL store and have no callers. The MySQL-specific DDL inside ensureFingerprintTable (ENGINE=InnoDB, CHARSET=utf8mb4) could confuse future readers who don't realize it's dead. Consider removing them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants