feat(cubeapi): add webhook event notifications#702
Conversation
| } | ||
| } | ||
|
|
||
| #[derive(Debug, Deserialize, Clone)] |
There was a problem hiding this comment.
Security: ServerConfig derives Debug without redacting credential-bearing fields
Unlike WebhookEndpointConfig (which has a custom Debug impl that redacts the URL and secret), ServerConfig uses the standard derive macro. This means database_url (which contains the MySQL DSN with password) and auth_callback_url (which may contain tokens/credentials) would be printed verbatim if the struct is ever logged, included in panic output, or displayed in test failure messages.
Consider implementing a custom Debug for ServerConfig mirroring the pattern used for WebhookEndpointConfig, redacting database_url and auth_callback_url (showing only database_url_configured: true/false).
| || status.is_server_error() | ||
| } | ||
|
|
||
| fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration { |
There was a problem hiding this comment.
Reliability: Backoff lacks jitter (thundering herd risk on recovery)
The pure deterministic exponential backoff means that when a transient outage heals (e.g., receiver comes back up after a brief downtime), all retrying delivery tasks wake at exactly the same instant and fire HTTP requests at the receiver simultaneously. Under load, this can overwhelm the receiver and trigger cascading retries or its own rate limiting.
Consider adding random jitter: compute the base delay then apply base / 2 + rand::thread_rng().gen_range(0..base) — this spreads retries across a window and prevents synchronized wake-up.
| matches!(key, "id" | "timestamp" | "level" | "event") | ||
| } | ||
|
|
||
| fn is_sensitive_field(key: &str) -> bool { |
There was a problem hiding this comment.
Security: is_sensitive_field uses broad substring matching — both too aggressive and too permissive
The contains() substring approach will:
- Over-redact: legitimate fields like
token_bucket_config,password_reset_url, orcredentials_verification_statusget silently stripped even though they carry no secret. - Under-redact: misses common credential field names like
private_key,passwd,jwt,bearer,auth,csrf.
Consider splitting the logic: use word-boundary matching (exact match or split on _) for broad terms like token/secret to reduce over-redaction, and add explicit entries for credential shorthand names (passwd, jwt, bearer, private_key) to close under-redaction gaps. A doc comment explaining the intentional conservatism would also help future readers understand the trade-off.
| let queue_capacity = config.queue_capacity; | ||
| let max_concurrency = config.max_concurrency; | ||
| let flush_timeout = Duration::from_secs(config.flush_timeout_secs); | ||
| let max_outstanding = queue_capacity |
There was a problem hiding this comment.
Performance / Memory: max_outstanding can grow very large with custom config
max_outstanding = queue_capacity.saturating_mul(endpoints.len().max(1)).max(max_concurrency) — with the defaults (1024, 1 endpoint) this is 1024, which is fine. But if a user configures queue_capacity: 100000 with 5 endpoints, max_outstanding becomes 500,000. The JoinSet holds that many task handles (each with a captured Delivery struct containing heap-allocated Url, Vec<u8> body, etc.), while the Semaphore only limits HTTP concurrency to max_concurrency (default 32).
Consider capping max_outstanding with a hard upper bound (e.g., min(computed_value, 100_000)), or tying it directly to max_concurrency (e.g., max_concurrency * 10). Also consider validating this in validate_config.
| timestamp: event.timestamp, | ||
| level: event.level, | ||
| event: event.event.clone(), | ||
| fields: sanitized_fields(&event.fields), |
There was a problem hiding this comment.
Performance: sanitized_fields runs once per endpoint per event
When an event matches N endpoints, Delivery::new is called N times (line ~102 inside spawn_deliveries). Each call runs sanitized_fields(&event.fields) which clones and filters the entire field HashMap. For an event matching 10 endpoints with many fields, the same work is done 10 times.
Consider hoisting sanitized_fields into spawn_deliveries so it runs once per event, then pass the sanitized map into each Delivery::new call.
| || status.is_server_error() | ||
| } | ||
|
|
||
| fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration { |
There was a problem hiding this comment.
Test coverage: backoff_delay and validate_config are completely untested
backoff_delay (line 527) is a pure function with zero dependencies — ideal for unit testing. It contains checked_shl, saturating_mul, and min logic that would silently produce wrong delays if regressed. Similarly, validate_config (line 223) has five validation checks (zero queue_capacity, zero timeout, zero concurrency, zero flush_timeout, inverted backoff) — none are tested.
These would be high-value, low-effort additions to the test suite, especially since they're pure functions requiring no infrastructure.
| if length < 0: | ||
| self._respond(400, "invalid Content-Length\n") | ||
| return None | ||
| return self.rfile.read(length) |
There was a problem hiding this comment.
Security: Example receiver has no upper bound on body read
self.rfile.read(length) reads the full Content-Length value with no bound. An attacker sending Content-Length: 1073741824 (1 GB) would cause the receiver to allocate that much memory. While this is example code, users who copy this pattern into production would have a trivially exploitable OOM DoS.
Consider adding MAX_BODY_BYTES = 1024 * 1024 and returning 413 if length > MAX_BODY_BYTES.
| } | ||
| } | ||
|
|
||
| #[derive(Debug, Deserialize, Clone)] |
There was a problem hiding this comment.
Security: ServerConfig derives Debug without redacting credential-bearing fields
Unlike WebhookEndpointConfig (which has a custom Debug impl that redacts the URL and secret), ServerConfig uses the standard derive macro. This means database_url (which contains the MySQL DSN with password) and auth_callback_url (which may contain tokens/credentials) would be printed verbatim if the struct is ever logged, included in panic output, or displayed in test failure messages.
Consider implementing a custom Debug for ServerConfig mirroring the pattern used for WebhookEndpointConfig, redacting database_url and auth_callback_url (showing only database_url_configured: true/false).
| || status.is_server_error() | ||
| } | ||
|
|
||
| fn backoff_delay(attempt: usize, initial_ms: u64, max_ms: u64) -> Duration { |
There was a problem hiding this comment.
Reliability: Backoff lacks jitter (thundering herd risk on recovery)
The pure deterministic exponential backoff means that when a transient outage heals (e.g., receiver comes back up after a brief downtime), all retrying delivery tasks wake at exactly the same instant and fire HTTP requests at the receiver simultaneously. Under load, this can overwhelm the receiver and trigger cascading retries or its own rate limiting.
Consider adding random jitter: compute the base delay then apply base / 2 + rand::thread_rng().gen_range(0..base) — this spreads retries across a window and prevents synchronized wake-up.
| matches!(key, "id" | "timestamp" | "level" | "event") | ||
| } | ||
|
|
||
| fn is_sensitive_field(key: &str) -> bool { |
There was a problem hiding this comment.
Security: is_sensitive_field uses broad substring matching — both too aggressive and too permissive
The contains() substring approach will:
- Over-redact: legitimate fields like
token_bucket_config,password_reset_url, orcredentials_verification_statusget silently stripped even though they carry no secret. - Under-redact: misses common credential field names like
private_key,passwd,jwt,bearer,auth,csrf.
Consider splitting the logic: use word-boundary matching (exact match or split on _) for broad terms like token/secret to reduce over-redaction, and add explicit entries for credential shorthand names (passwd, jwt, bearer, private_key) to close under-redaction gaps. A doc comment explaining the intentional conservatism would also help future readers understand the trade-off.
| let queue_capacity = config.queue_capacity; | ||
| let max_concurrency = config.max_concurrency; | ||
| let flush_timeout = Duration::from_secs(config.flush_timeout_secs); | ||
| let max_outstanding = queue_capacity |
There was a problem hiding this comment.
Performance / Memory: max_outstanding can grow very large with custom config
max_outstanding = queue_capacity.saturating_mul(endpoints.len().max(1)).max(max_concurrency) — with the defaults (1024, 1 endpoint) this is 1024, which is fine. But if a user configures queue_capacity: 100000 with 5 endpoints, max_outstanding becomes 500,000. The JoinSet holds that many task handles (each with a captured Delivery struct containing heap-allocated Url, Vec<u8> body, etc.), while the Semaphore only limits HTTP concurrency to max_concurrency (default 32).
Consider capping max_outstanding with a hard upper bound (e.g., min(computed_value, 100_000)), or tying it directly to max_concurrency (e.g., max_concurrency * 10). Also consider validating this in validate_config.
Summary
This PR adds CubeAPI webhook event notifications for sandbox lifecycle events.
Implemented features:
CUBE_API_WEBHOOK_ENDPOINTSsandbox.createdsandbox.deletedsandbox.pausedsandbox.resumedDesign Notes
Webhook delivery is implemented as a CubeAPI logging backend.
API handlers emit structured
LogEventvalues. The existing logger fan-out keepsFileLoggerbehavior and optionally addsHttpLoggerwhen webhook endpoints are configured.Webhook delivery is best-effort and asynchronous. Endpoint failures, timeouts, or retries do not change sandbox lifecycle API responses.
Lifecycle webhook payloads include
id,timestamp,level,event, andsandbox_id.template_idis included when available.The payload is not a complete
Sandboxobject and does not include tokens, secrets, environment variables, or runtime/network details.This PR does not add a REST API for managing webhooks, persistent outbox, disk spool, dead-letter queue, batch delivery, or exactly-once delivery.
Validation
Tested on a local CubeSandbox PVM deployment.
Results:
End-to-end webhook verification:
The example receiver was started with
WEBHOOK_SECRET=test-secret, so receiving these events also verifies the documented HMAC-SHA256 signing path.Closes #642