Skip to content

feat(cubeapi): add webhook event notifications#702

Open
hyy321 wants to merge 6 commits into
TencentCloud:masterfrom
hyy321:feat/cubeapi-webhooks
Open

feat(cubeapi): add webhook event notifications#702
hyy321 wants to merge 6 commits into
TencentCloud:masterfrom
hyy321:feat/cubeapi-webhooks

Conversation

@hyy321

@hyy321 hyy321 commented Jul 1, 2026

Copy link
Copy Markdown

Summary

This PR adds CubeAPI webhook event notifications for sandbox lifecycle events.

Implemented features:

  • Configure one or more webhook endpoints through CUBE_API_WEBHOOK_ENDPOINTS
  • Subscribe endpoints by event type
  • Support sandbox lifecycle events:
    • sandbox.created
    • sandbox.deleted
    • sandbox.paused
    • sandbox.resumed
  • Deliver JSON webhook payloads through asynchronous HTTP POST
  • Use a bounded queue and background dispatcher so webhook delivery does not block sandbox lifecycle APIs
  • Support optional HMAC-SHA256 signatures
  • Retry retryable delivery failures with bounded exponential backoff
  • Log delivery failures without exposing endpoint secrets or payload signatures
  • Provide a standard-library Python webhook receiver example
  • Document configuration, payload format, signature verification, retry behavior, and adapter-based alerting integration

Design Notes

Webhook delivery is implemented as a CubeAPI logging backend.

API handlers emit structured LogEvent values. The existing logger fan-out keeps FileLogger behavior and optionally adds HttpLogger when webhook endpoints are configured.

Webhook delivery is best-effort and asynchronous. Endpoint failures, timeouts, or retries do not change sandbox lifecycle API responses.

Lifecycle webhook payloads include id, timestamp, level, event, and sandbox_id. template_id is included when available.

The payload is not a complete Sandbox object and does not include tokens, secrets, environment variables, or runtime/network details.

This PR does not add a REST API for managing webhooks, persistent outbox, disk spool, dead-letter queue, batch delivery, or exactly-once delivery.

Validation

Tested on a local CubeSandbox PVM deployment.

cargo fmt --check
cargo test logging::http::tests -- --nocapture
cargo test
python3 -m py_compile examples/webhook-receiver/receiver.py
cargo build

Results:

logging::http::tests: 9 passed, 0 failed
cargo test: 79 passed, 0 failed
receiver.py py_compile: passed
cargo build: passed

End-to-end webhook verification:

Git commit:
4c7730b fix(cubeapi): silence webhook test helper warning

Received webhook events:
event: sandbox.created
event: sandbox.paused
event: sandbox.resumed
event: sandbox.deleted

API responses:
pause:  HTTP/1.1 204 No Content
resume: HTTP/1.1 201 Created
delete: HTTP/1.1 204 No Content

Build warning check:
no endpoint_matches_event warning

Runtime log check:
no file logger permission errors

The example receiver was started with WEBHOOK_SECRET=test-secret, so receiving these events also verifies the documented HMAC-SHA256 signing path.

Closes #642

Comment thread CubeAPI/src/config/mod.rs
}
}

#[derive(Debug, Deserialize, Clone)]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: ServerConfig derives Debug without redacting credential-bearing fields

Unlike WebhookEndpointConfig (which has a custom Debug impl that redacts the URL and secret), ServerConfig uses the standard derive macro. This means database_url (which contains the MySQL DSN with password) and auth_callback_url (which may contain tokens/credentials) would be printed verbatim if the struct is ever logged, included in panic output, or displayed in test failure messages.

Consider implementing a custom Debug for ServerConfig mirroring the pattern used for WebhookEndpointConfig, redacting database_url and auth_callback_url (showing only database_url_configured: true/false).

|| status.is_server_error()
}

fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reliability: Backoff lacks jitter (thundering herd risk on recovery)

The pure deterministic exponential backoff means that when a transient outage heals (e.g., receiver comes back up after a brief downtime), all retrying delivery tasks wake at exactly the same instant and fire HTTP requests at the receiver simultaneously. Under load, this can overwhelm the receiver and trigger cascading retries or its own rate limiting.

Consider adding random jitter: compute the base delay then apply base / 2 + rand::thread_rng().gen_range(0..base) — this spreads retries across a window and prevents synchronized wake-up.

matches!(key, "id" | "timestamp" | "level" | "event")
}

fn is_sensitive_field(key: &str) -> bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: is_sensitive_field uses broad substring matching — both too aggressive and too permissive

The contains() substring approach will:

  • Over-redact: legitimate fields like token_bucket_config, password_reset_url, or credentials_verification_status get silently stripped even though they carry no secret.
  • Under-redact: misses common credential field names like private_key, passwd, jwt, bearer, auth, csrf.

Consider splitting the logic: use word-boundary matching (exact match or split on _) for broad terms like token/secret to reduce over-redaction, and add explicit entries for credential shorthand names (passwd, jwt, bearer, private_key) to close under-redaction gaps. A doc comment explaining the intentional conservatism would also help future readers understand the trade-off.

let queue_capacity = config.queue_capacity;
let max_concurrency = config.max_concurrency;
let flush_timeout = Duration::from_secs(config.flush_timeout_secs);
let max_outstanding = queue_capacity

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance / Memory: max_outstanding can grow very large with custom config

max_outstanding = queue_capacity.saturating_mul(endpoints.len().max(1)).max(max_concurrency) — with the defaults (1024, 1 endpoint) this is 1024, which is fine. But if a user configures queue_capacity: 100000 with 5 endpoints, max_outstanding becomes 500,000. The JoinSet holds that many task handles (each with a captured Delivery struct containing heap-allocated Url, Vec<u8> body, etc.), while the Semaphore only limits HTTP concurrency to max_concurrency (default 32).

Consider capping max_outstanding with a hard upper bound (e.g., min(computed_value, 100_000)), or tying it directly to max_concurrency (e.g., max_concurrency * 10). Also consider validating this in validate_config.

timestamp: event.timestamp,
level: event.level,
event: event.event.clone(),
fields: sanitized_fields(&event.fields),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance: sanitized_fields runs once per endpoint per event

When an event matches N endpoints, Delivery::new is called N times (line ~102 inside spawn_deliveries). Each call runs sanitized_fields(&event.fields) which clones and filters the entire field HashMap. For an event matching 10 endpoints with many fields, the same work is done 10 times.

Consider hoisting sanitized_fields into spawn_deliveries so it runs once per event, then pass the sanitized map into each Delivery::new call.

|| status.is_server_error()
}

fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage: backoff_delay and validate_config are completely untested

backoff_delay (line 527) is a pure function with zero dependencies — ideal for unit testing. It contains checked_shl, saturating_mul, and min logic that would silently produce wrong delays if regressed. Similarly, validate_config (line 223) has five validation checks (zero queue_capacity, zero timeout, zero concurrency, zero flush_timeout, inverted backoff) — none are tested.

These would be high-value, low-effort additions to the test suite, especially since they're pure functions requiring no infrastructure.

if length < 0:
self._respond(400, "invalid Content-Length\n")
return None
return self.rfile.read(length)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: Example receiver has no upper bound on body read

self.rfile.read(length) reads the full Content-Length value with no bound. An attacker sending Content-Length: 1073741824 (1 GB) would cause the receiver to allocate that much memory. While this is example code, users who copy this pattern into production would have a trivially exploitable OOM DoS.

Consider adding MAX_BODY_BYTES = 1024 * 1024 and returning 413 if length > MAX_BODY_BYTES.

Comment thread CubeAPI/src/config/mod.rs
}
}

#[derive(Debug, Deserialize, Clone)]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: ServerConfig derives Debug without redacting credential-bearing fields

Unlike WebhookEndpointConfig (which has a custom Debug impl that redacts the URL and secret), ServerConfig uses the standard derive macro. This means database_url (which contains the MySQL DSN with password) and auth_callback_url (which may contain tokens/credentials) would be printed verbatim if the struct is ever logged, included in panic output, or displayed in test failure messages.

Consider implementing a custom Debug for ServerConfig mirroring the pattern used for WebhookEndpointConfig, redacting database_url and auth_callback_url (showing only database_url_configured: true/false).

|| status.is_server_error()
}

fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reliability: Backoff lacks jitter (thundering herd risk on recovery)

The pure deterministic exponential backoff means that when a transient outage heals (e.g., receiver comes back up after a brief downtime), all retrying delivery tasks wake at exactly the same instant and fire HTTP requests at the receiver simultaneously. Under load, this can overwhelm the receiver and trigger cascading retries or its own rate limiting.

Consider adding random jitter: compute the base delay then apply base / 2 + rand::thread_rng().gen_range(0..base) — this spreads retries across a window and prevents synchronized wake-up.

matches!(key, "id" | "timestamp" | "level" | "event")
}

fn is_sensitive_field(key: &str) -> bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: is_sensitive_field uses broad substring matching — both too aggressive and too permissive

The contains() substring approach will:

  • Over-redact: legitimate fields like token_bucket_config, password_reset_url, or credentials_verification_status get silently stripped even though they carry no secret.
  • Under-redact: misses common credential field names like private_key, passwd, jwt, bearer, auth, csrf.

Consider splitting the logic: use word-boundary matching (exact match or split on _) for broad terms like token/secret to reduce over-redaction, and add explicit entries for credential shorthand names (passwd, jwt, bearer, private_key) to close under-redaction gaps. A doc comment explaining the intentional conservatism would also help future readers understand the trade-off.

let queue_capacity = config.queue_capacity;
let max_concurrency = config.max_concurrency;
let flush_timeout = Duration::from_secs(config.flush_timeout_secs);
let max_outstanding = queue_capacity

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance / Memory: max_outstanding can grow very large with custom config

max_outstanding = queue_capacity.saturating_mul(endpoints.len().max(1)).max(max_concurrency) — with the defaults (1024, 1 endpoint) this is 1024, which is fine. But if a user configures queue_capacity: 100000 with 5 endpoints, max_outstanding becomes 500,000. The JoinSet holds that many task handles (each with a captured Delivery struct containing heap-allocated Url, Vec<u8> body, etc.), while the Semaphore only limits HTTP concurrency to max_concurrency (default 32).

Consider capping max_outstanding with a hard upper bound (e.g., min(computed_value, 100_000)), or tying it directly to max_concurrency (e.g., max_concurrency * 10). Also consider validating this in validate_config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant