feat(cubeapi): add webhook event notifications by hyy321 · Pull Request #702 · TencentCloud/CubeSandbox

hyy321 · 2026-07-01T17:20:24Z

Summary

This PR adds CubeAPI webhook event notifications for sandbox lifecycle events.

Implemented features:

- Configure one or more webhook endpoints through CUBE_API_WEBHOOK_ENDPOINTS
- Subscribe endpoints by event type
- Support sandbox lifecycle events:
  - sandbox.created
  - sandbox.deleted
  - sandbox.paused
  - sandbox.resumed
- Deliver JSON webhook payloads through asynchronous HTTP POST
- Use a bounded queue and background dispatcher so webhook delivery does not block sandbox lifecycle APIs
- Support optional HMAC-SHA256 signatures
- Retry retryable delivery failures with bounded exponential backoff
- Log delivery failures without exposing endpoint secrets or payload signatures
- Provide a standard-library Python webhook receiver example
- Document configuration, payload format, signature verification, retry behavior, and adapter-based alerting integration

Design Notes

Webhook delivery is implemented as a CubeAPI logging backend.

API handlers emit structured LogEvent values. The existing logger fan-out keeps FileLogger behavior and optionally adds HttpLogger when webhook endpoints are configured.

Webhook delivery is best-effort and asynchronous. Endpoint failures, timeouts, or retries do not change sandbox lifecycle API responses.
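
The non-blocking hand-off described above can be illustrated with a minimal, standard-library-only sketch. It is not the code in this PR: `EventQueue`, `bounded`, and `try_enqueue` are hypothetical names, the payload is simplified to a `String`, and the real dispatcher presumably uses an async channel rather than `std::sync::mpsc`.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical producer side of a bounded webhook queue.
struct EventQueue {
    tx: SyncSender<String>,
}

impl EventQueue {
    /// Creates the queue and returns the receiving half for a background
    /// dispatcher thread or task to drain.
    fn bounded(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Never blocks the caller. Returns false when the event was dropped
    /// because the queue is full or the dispatcher has shut down.
    fn try_enqueue(&self, payload: String) -> bool {
        match self.tx.try_send(payload) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}
```

Because `try_enqueue` returns immediately, an API handler can emit the event and respond regardless of how slow or unavailable the receiver is. The cost is that events are dropped once the queue is full, which matches the best-effort delivery stated above.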

Lifecycle webhook payloads include id, timestamp, level, event, and sandbox_id. template_id is included when available.

The payload is not a complete Sandbox object and does not include tokens, secrets, environment variables, or runtime/network details.

This PR does not add a REST API for managing webhooks, persistent outbox, disk spool, dead-letter queue, batch delivery, or exactly-once delivery.

Validation

Tested on a local CubeSandbox PVM deployment.

cargo fmt --check
cargo test logging::http::tests -- --nocapture
cargo test
python3 -m py_compile examples/webhook-receiver/receiver.py
cargo build

Results:

logging::http::tests: 9 passed, 0 failed
cargo test: 79 passed, 0 failed
receiver.py py_compile: passed
cargo build: passed

End-to-end webhook verification:

Git commit:
4c7730b fix(cubeapi): silence webhook test helper warning

Received webhook events:
event: sandbox.created
event: sandbox.paused
event: sandbox.resumed
event: sandbox.deleted

API responses:
pause:  HTTP/1.1 204 No Content
resume: HTTP/1.1 201 Created
delete: HTTP/1.1 204 No Content

Build warning check:
no endpoint_matches_event warning

Runtime log check:
no file logger permission errors

The example receiver was started with WEBHOOK_SECRET=test-secret, so receiving these events also verifies the documented HMAC-SHA256 signing path.

Closes #642

cubesandboxbot · 2026-07-01T17:25:08Z

+    }
+}
+
 #[derive(Debug, Deserialize, Clone)]


Security: ServerConfig derives Debug without redacting credential-bearing fields

Unlike WebhookEndpointConfig (which has a custom Debug impl that redacts the URL and secret), ServerConfig uses the standard derive macro. This means database_url (which contains the MySQL DSN with password) and auth_callback_url (which may contain tokens/credentials) would be printed verbatim if the struct is ever logged, included in panic output, or displayed in test failure messages.

Consider implementing a custom Debug for ServerConfig mirroring the pattern used for WebhookEndpointConfig, redacting database_url and auth_callback_url (showing only database_url_configured: true/false).
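
A minimal sketch of the suggested redacting Debug impl follows. The field types and the `bind_addr` field are assumptions made for illustration; the real ServerConfig has more fields and also derives Deserialize and Clone, which are omitted here to keep the example dependency-free.

```rust
use std::fmt;

/// Hypothetical, reduced version of ServerConfig.
struct ServerConfig {
    bind_addr: String,                 // assumed non-sensitive field
    database_url: String,              // DSN, may embed a password
    auth_callback_url: Option<String>, // may embed tokens
}

impl fmt::Debug for ServerConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print only whether the credential-bearing fields are set,
        // never their values.
        f.debug_struct("ServerConfig")
            .field("bind_addr", &self.bind_addr)
            .field("database_url_configured", &!self.database_url.is_empty())
            .field(
                "auth_callback_url_configured",
                &self.auth_callback_url.is_some(),
            )
            .finish()
    }
}
```

With this impl, `{:?}` output in logs, panics, or test failures shows `database_url_configured: true` instead of the DSN itself.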

cubesandboxbot · 2026-07-01T17:25:09Z

+        || status.is_server_error()
+}
+
+fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration {


Reliability: Backoff lacks jitter (thundering herd risk on recovery)

The pure deterministic exponential backoff means that when a transient outage heals (e.g., receiver comes back up after a brief downtime), all retrying delivery tasks wake at exactly the same instant and fire HTTP requests at the receiver simultaneously. Under load, this can overwhelm the receiver and trigger cascading retries or its own rate limiting.

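Consider adding random jitter: compute the base delay, then sleep for base / 2 + rand::thread_rng().gen_range(0..base). This spreads retries across a window and avoids synchronized wake-ups. Note that this expression yields delays from base / 2 up to (but excluding) 1.5 × base, so the result should be clamped to max_ms, and gen_range(0..base) panics on an empty range when base is 0.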
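
One dependency-free way to sketch jitter is the "equal jitter" variant below, which differs from the expression above: it keeps half of the deterministic delay and randomizes the other half, so the result never exceeds the base delay (and therefore never exceeds max_ms if the base is already capped). `with_equal_jitter` and its `sample` parameter are hypothetical; in practice `sample` would come from a random number generator such as the `rand` crate.

```rust
use std::time::Duration;

/// Returns a delay in the inclusive range [base / 2, base].
///
/// `sample` is any uniformly distributed u64 supplied by the caller. It is a
/// parameter so this sketch stays deterministic and easy to unit test.
fn with_equal_jitter(base: Duration, sample: u64) -> Duration {
    let base_ms = u64::try_from(base.as_millis()).unwrap_or(u64::MAX);
    let half = base_ms / 2;
    // Number of distinct values in [half, base_ms]; always at least 1,
    // so a zero base delay is handled without panicking.
    let span = base_ms - half + 1;
    Duration::from_millis(half + sample % span)
}
```

A caller would compute the capped exponential delay first and then pass it through `with_equal_jitter` with a fresh random sample on every retry.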

cubesandboxbot · 2026-07-01T17:25:10Z

+    matches!(key, "id" | "timestamp" | "level" | "event")
+}
+
+fn is_sensitive_field(key: &str) -> bool {


Security: is_sensitive_field uses broad substring matching — both too aggressive and too permissive

The contains() substring approach will:

Over-redact: legitimate fields like token_bucket_config, password_reset_url, or credentials_verification_status get silently stripped even though they carry no secret.

Under-redact: misses common credential field names like private_key, passwd, jwt, bearer, auth, csrf.

Consider splitting the logic: use word-boundary matching (exact match or split on _) for broad terms like token/secret to reduce over-redaction, and add explicit entries for credential shorthand names (passwd, jwt, bearer, private_key) to close under-redaction gaps. A doc comment explaining the intentional conservatism would also help future readers understand the trade-off.
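
As a rough illustration of segment-based matching, the sketch below splits the key on `_`, `-`, and `.` and compares whole segments instead of substrings. The term list is an assumption, not the list used in this PR. Note that segment matching still flags `token_bucket_config`, because `token` is a full segment there; it only removes substring hits such as `tokenizer` or `author`.

```rust
/// Hypothetical list of segments that mark a field as sensitive.
const SENSITIVE_SEGMENTS: &[&str] = &[
    "token", "secret", "password", "passwd", "credential", "credentials",
    "jwt", "bearer", "auth", "csrf",
];

/// Returns true when any whole segment of `key` is a sensitive term, or when
/// the key contains the two-segment name `private_key`.
/// Deliberately conservative: a false positive only drops a field from the
/// payload, while a false negative could leak a secret.
fn is_sensitive_field(key: &str) -> bool {
    let key = key.to_ascii_lowercase();
    let segments: Vec<&str> = key
        .split(|c: char| c == '_' || c == '-' || c == '.')
        .collect();
    if segments
        .windows(2)
        .any(|pair| pair[0] == "private" && pair[1] == "key")
    {
        return true;
    }
    segments.iter().any(|segment| SENSITIVE_SEGMENTS.contains(segment))
}
```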

cubesandboxbot · 2026-07-01T17:25:12Z

+        let queue_capacity = config.queue_capacity;
+        let max_concurrency = config.max_concurrency;
+        let flush_timeout = Duration::from_secs(config.flush_timeout_secs);
+        let max_outstanding = queue_capacity


Performance / Memory: max_outstanding can grow very large with custom config

max_outstanding = queue_capacity.saturating_mul(endpoints.len().max(1)).max(max_concurrency) — with the defaults (1024, 1 endpoint) this is 1024, which is fine. But if a user configures queue_capacity: 100000 with 5 endpoints, max_outstanding becomes 500,000. The JoinSet holds that many task handles (each with a captured Delivery struct containing heap-allocated Url, Vec<u8> body, etc.), while the Semaphore only limits HTTP concurrency to max_concurrency (default 32).

Consider capping max_outstanding with a hard upper bound (e.g., min(computed_value, 100_000)), or tying it directly to max_concurrency (e.g., max_concurrency * 10). Also consider validating this in validate_config.
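
The hard-cap option could look like the sketch below. The constant name and the free function are hypothetical, and the value 100_000 is the example figure from this comment rather than a tuned limit. The final `max(max_concurrency)` keeps the bound from dropping below the configured HTTP concurrency.

```rust
/// Hypothetical upper bound on tracked delivery tasks.
const MAX_OUTSTANDING_HARD_CAP: usize = 100_000;

/// Computes the number of delivery tasks that may be outstanding at once:
/// queue_capacity * endpoint count, capped, but never below max_concurrency.
fn max_outstanding(queue_capacity: usize, endpoint_count: usize, max_concurrency: usize) -> usize {
    queue_capacity
        .saturating_mul(endpoint_count.max(1))
        .min(MAX_OUTSTANDING_HARD_CAP)
        .max(max_concurrency)
}
```

With the defaults (1024, 1 endpoint, concurrency 32) the result is unchanged at 1024, while the 100000 × 5 configuration from the example is held to 100,000 instead of 500,000.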

cubesandboxbot · 2026-07-01T17:25:23Z

+            timestamp: event.timestamp,
+            level: event.level,
+            event: event.event.clone(),
+            fields: sanitized_fields(&event.fields),


Performance: sanitized_fields runs once per endpoint per event

When an event matches N endpoints, Delivery::new is called N times (line ~102 inside spawn_deliveries). Each call runs sanitized_fields(&event.fields) which clones and filters the entire field HashMap. For an event matching 10 endpoints with many fields, the same work is done 10 times.

Consider hoisting sanitized_fields into spawn_deliveries so it runs once per event, then pass the sanitized map into each Delivery::new call.
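
A sketch of the hoisting idea, under simplified assumptions: fields are modeled as `HashMap<String, String>`, the sanitizer is a stand-in that only drops keys containing "secret", and `plan_deliveries` is a hypothetical substitute for spawn_deliveries. Sharing the sanitized map through `Arc` is one possible design; cloning the sanitized map per endpoint would also avoid repeating the filtering.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type Fields = HashMap<String, String>;

/// Stand-in sanitizer: drops any key containing "secret".
fn sanitized_fields(fields: &Fields) -> Fields {
    fields
        .iter()
        .filter(|(key, _)| !key.contains("secret"))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}

/// Hypothetical, reduced Delivery that shares one sanitized map.
struct Delivery {
    endpoint: String,
    fields: Arc<Fields>,
}

/// Sanitizes once per event, then hands each endpoint a cheap Arc clone.
fn plan_deliveries(endpoints: &[&str], fields: &Fields) -> Vec<Delivery> {
    let shared = Arc::new(sanitized_fields(fields));
    endpoints
        .iter()
        .map(|endpoint| Delivery {
            endpoint: endpoint.to_string(),
            fields: Arc::clone(&shared),
        })
        .collect()
}
```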

cubesandboxbot · 2026-07-01T17:25:24Z

+        || status.is_server_error()
+}
+
+fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration {


Test coverage: backoff_delay and validate_config are completely untested

backoff_delay (line 527) is a pure function with zero dependencies — ideal for unit testing. It contains checked_shl, saturating_mul, and min logic that would silently produce wrong delays if regressed. Similarly, validate_config (line 223) has five validation checks (zero queue_capacity, zero timeout, zero concurrency, zero flush_timeout, inverted backoff) — none are tested.

These would be high-value, low-effort additions to the test suite, especially since they're pure functions requiring no infrastructure.
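
The tests could take roughly the following shape. Both function bodies are assumptions reconstructed from this comment (checked_shl, saturating_mul, min, and the five listed checks), not the code in the PR. The `WebhookConfig` struct, the `request_timeout_secs`, `initial_backoff_ms`, and `max_backoff_ms` field names, the zero-based `attempt`, and the `valid_config` helper are all hypothetical.

```rust
use std::time::Duration;

/// Assumed implementation: initial_ms * 2^attempt, capped at max_ms.
fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration {
    let factor = u32::try_from(attempt)
        .ok()
        .and_then(|shift| 1u64.checked_shl(shift))
        .unwrap_or(u64::MAX);
    Duration::from_millis(initial_ms.saturating_mul(factor).min(max_ms))
}

/// Hypothetical subset of the webhook settings.
struct WebhookConfig {
    queue_capacity: usize,
    request_timeout_secs: u64,
    max_concurrency: usize,
    flush_timeout_secs: u64,
    initial_backoff_ms: u64,
    max_backoff_ms: u64,
}

/// Assumed implementation of the five checks listed above.
fn validate_config(c: &WebhookConfig) -> Result<(), String> {
    if c.queue_capacity == 0 {
        return Err("queue_capacity must be greater than zero".into());
    }
    if c.request_timeout_secs == 0 {
        return Err("request timeout must be greater than zero".into());
    }
    if c.max_concurrency == 0 {
        return Err("max_concurrency must be greater than zero".into());
    }
    if c.flush_timeout_secs == 0 {
        return Err("flush_timeout_secs must be greater than zero".into());
    }
    if c.initial_backoff_ms > c.max_backoff_ms {
        return Err("initial backoff must not exceed max backoff".into());
    }
    Ok(())
}

/// Baseline config that passes validation; tests override one field each.
fn valid_config() -> WebhookConfig {
    WebhookConfig {
        queue_capacity: 1024,
        request_timeout_secs: 5,
        max_concurrency: 32,
        flush_timeout_secs: 5,
        initial_backoff_ms: 100,
        max_backoff_ms: 10_000,
    }
}

#[test]
fn backoff_doubles_then_caps() {
    assert_eq!(backoff_delay(0, 100, 10_000), Duration::from_millis(100));
    assert_eq!(backoff_delay(3, 100, 10_000), Duration::from_millis(800));
    assert_eq!(backoff_delay(7, 100, 10_000), Duration::from_millis(10_000));
    // Shift overflow must saturate to the cap, not wrap to a tiny delay.
    assert_eq!(backoff_delay(64, 100, 10_000), Duration::from_millis(10_000));
}

#[test]
fn validate_config_rejects_each_invalid_setting() {
    assert!(validate_config(&valid_config()).is_ok());
    assert!(validate_config(&WebhookConfig { queue_capacity: 0, ..valid_config() }).is_err());
    assert!(validate_config(&WebhookConfig { max_concurrency: 0, ..valid_config() }).is_err());
    assert!(validate_config(&WebhookConfig { initial_backoff_ms: 20_000, ..valid_config() }).is_err());
}
```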

cubesandboxbot · 2026-07-01T17:25:26Z

+        if length < 0:
+            self._respond(400, "invalid Content-Length\n")
+            return None
+        return self.rfile.read(length)


Security: Example receiver has no upper bound on body read

self.rfile.read(length) reads however many bytes the Content-Length header declares, with no upper bound. An attacker sending Content-Length: 1073741824 (1 GiB) could make the receiver try to buffer that much data in memory. While this is example code, users who copy this pattern into production would have a trivially exploitable OOM DoS.

Consider adding MAX_BODY_BYTES = 1024 * 1024 and returning 413 if length > MAX_BODY_BYTES.

hengyy123 and others added 6 commits July 1, 2026 16:17

feat(cubeapi): add webhook configuration

7219924

feat(cubeapi): implement async webhook logger

651a9bd

feat(cubeapi): wire webhook logger

2e69cd1

docs(cubeapi): add webhook documentation and receiver example

520d980

fix(cubeapi): harden webhook delivery and docs

f9705d0

fix(cubeapi): silence webhook test helper warning

4c7730b

hyy321 requested review from chenhengqi, fslongjin, ls-ggg, tinklone and up2wing as code owners July 1, 2026 17:20

hyy321 wants to merge 6 commits into TencentCloud:master from hyy321:feat/cubeapi-webhooks
