fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups#9652

Merged
mudler merged 2 commits into master from fix/distributed-router-selector-and-reconciler
May 4, 2026
@localai-bot

Collaborator

Summary

Two distinct bugs were causing tight retry loops in the distributed scheduler — both observed in a live cluster:

1. FindAndLockNodeWithModel ignored NodeSelector → eviction-busy loop

When a model was loaded on multiple nodes and only some matched the current selector, the function returned the lowest-in_flight node — even one the selector excluded. Route()'s post-check then fell through to scheduleNewModel, which targeted the matching node where the model was already at MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded model on that node was the one being requested, and it was busy), so every request looped:

Cached model on node that no longer matches selector, falling through node="agx-orin-slow"
Chosen node has no free replica slot, evicting LRU node="nvidia-thor"
All models have in-flight requests, waiting for capacity
... eviction failed: all models busy, cannot evict

Fix: thread an optional candidateNodeIDs filter through FindAndLockNodeWithModel. Route() resolves the selector once via a new resolveSelectorCandidates helper and passes the matching IDs to both the cached-replica lookup and scheduleNewModel. Same helper replaces the inline selector block already inside scheduleNewModel.
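The effect of the filter can be sketched in isolation. This is a minimal in-memory illustration, not the registry code from this PR: the `replica` type and `pickReplica` function are hypothetical, the locking and database side of FindAndLockNodeWithModel is omitted, and treating a nil or empty candidate list as "no selector" is an assumption of the sketch.

```go
package main

import "fmt"

// replica is a hypothetical stand-in for one loaded copy of a model on a node.
type replica struct {
	NodeID   string
	InFlight int
}

// pickReplica returns the replica with the fewest in-flight requests.
// Assumption of this sketch: a nil or empty candidates slice means
// "no selector, every node is eligible"; otherwise only the listed
// node IDs may be returned.
func pickReplica(replicas []replica, candidates []string) (replica, bool) {
	allowed := make(map[string]struct{}, len(candidates))
	for _, id := range candidates {
		allowed[id] = struct{}{}
	}

	var best replica
	found := false
	for _, r := range replicas {
		if len(candidates) > 0 {
			if _, ok := allowed[r.NodeID]; !ok {
				continue // node excluded by the selector
			}
		}
		if !found || r.InFlight < best.InFlight {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	loaded := []replica{
		{NodeID: "agx-orin-slow", InFlight: 0},
		{NodeID: "nvidia-thor", InFlight: 3},
	}

	// Without a filter the idle but excluded node wins (the old behaviour).
	r, ok := pickReplica(loaded, nil)
	fmt.Println(r.NodeID, ok) // agx-orin-slow true

	// With the selector's candidates the busier matching node is returned.
	r, ok = pickReplica(loaded, []string{"nvidia-thor"})
	fmt.Println(r.NodeID, ok) // nvidia-thor true

	// Loaded only on excluded nodes: not found, so the caller schedules fresh.
	_, ok = pickReplica(loaded, []string{"some-other-node"})
	fmt.Println(ok) // false
}
```

Resolving the selector once and passing the same ID list to both the lookup and the scheduler means the two steps cannot disagree about which nodes are eligible.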

2. Reconciler scaleup with empty backend type

ScheduleAndLoadModel fell back to scheduleNewModel(ctx, "", modelName, nil) when GetModelLoadInfo had nothing stored (model never loaded yet). The worker rejected the resulting backend.install ("backend name is empty") on every reconciler tick (~30s):

Reconciler: scaling up to meet minimum model="whisper-large-v3-turbo-it-ggml" current=0 min=1 adding=1
No stored model load info for reconciler scale-up, falling back to backend install only
Sending NATS backend.install nodeID="..." backend="" modelID="whisper-..."
Reconciler: failed to scale up replica ... installing backend from gallery: backend name is empty

Fix: remove the broken fallback. When GetModelLoadInfo has nothing stored, return a clear error instead of firing a doomed NATS install. The reconciler's existing scale-up failure log surfaces it once per tick; the model auto-replicates as soon as Route() serves it once and stores load info.

3. Drive-by: noisy "error stopping model" log

The defensive StopGRPC after a failed LoadModel was logged at ERROR. That cleanup usually hits "model not found" because LoadModel failed before the process was registered, and the outer "Failed to load model" already carries the real reason. Downgraded to Debug.

Test plan

  • go build ./core/services/nodes/... ./pkg/model/... ./tests/e2e/distributed/...
  • go test ./core/services/nodes/... -ginkgo.focus="SmartRouter"
  • go test ./core/services/nodes/... -ginkgo.focus="FindAndLockNodeWithModel"
  • go test ./core/services/nodes/... -ginkgo.focus="Reconciler"
  • Full nodes suite: go test ./core/services/nodes/... -timeout 10m
  • Live cluster: confirm the agx-orin-slow / nvidia-thor request stops looping and serves from the cached replica on the selector-matching node
  • Live cluster: confirm whisper-large-v3-turbo-it-ggml reconciler stops sending empty-backend installs (one Warn per tick instead of Warn + failed NATS roundtrip)

Files changed

  • core/services/nodes/registry.go: FindAndLockNodeWithModel adds optional candidateNodeIDs filter
  • core/services/nodes/router.go: new resolveSelectorCandidates helper; Route and scheduleNewModel use it; ScheduleAndLoadModel removes broken empty-backend fallback
  • core/services/nodes/interfaces.go: ModelRouter interface signature update
  • core/services/nodes/{router,model_router,registry}_test.go: fakes/callers updated
  • tests/e2e/distributed/model_routing_test.go: call site updated
  • pkg/model/initializers.go: downgrade post-load cleanup StopGRPC log to Debug
mudler added 2 commits May 4, 2026 06:49
…mpty-backend reconciler scaleups

Two distinct bugs were causing tight retry loops in the distributed scheduler:

1. FindAndLockNodeWithModel ignored the model's NodeSelector. When a model
   was loaded on multiple nodes and only some matched the current selector,
   the function returned the lowest-in_flight node — even one the selector
   excluded. Route()'s post-check then fell through to scheduleNewModel,
   which targeted the matching node where the model was already at
   MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded
   model on that node was the one being requested, and it was busy), so
   every request looped through "evicting LRU" → "all models busy".

   Fix: thread an optional candidateNodeIDs filter through
   FindAndLockNodeWithModel. Route() resolves the selector once via a new
   resolveSelectorCandidates helper and passes the matching IDs to both
   the cached-replica lookup and scheduleNewModel. The same helper
   replaces the inline selector block in scheduleNewModel.

2. ScheduleAndLoadModel (reconciler scale-up path) fell back to
   scheduleNewModel with backendType="" when no replica had ever been
   loaded for a model. The worker rejected the resulting backend.install
   ("backend name is empty") on every reconciler tick (~30s).

   Fix: remove the broken fallback. When GetModelLoadInfo has nothing
   stored, return a clear error instead of firing a doomed NATS install.
   The reconciler's existing scale-up failure log surfaces it once per
   tick; the model auto-replicates as soon as Route() serves it once and
   stores load info.

Also downgrade the post-LoadModel-failure StopGRPC error to Debug — that
cleanup attempt usually hits "model not found" because LoadModel failed
before registering the process, and the outer "Failed to load model"
error already carries the real reason.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
…reconciler scaleup guard

Two regression tests for the bugs fixed in the previous commit:

1. FindAndLockNodeWithModel — registry-level integration tests verify the
   candidateNodeIDs filter:
   - Returns the included node even when an excluded node has lower
     in_flight (the original selector-mismatch loop scenario).
   - Returns not-found when the model is loaded only on excluded nodes,
     forcing Route() to fall through to a fresh schedule instead of
     reusing the excluded replica.

2. ScheduleAndLoadModel — mock-based test verifies the reconciler scale-up
   path returns an error and does NOT fire backend.install when no replica
   has been loaded yet. fakeUnloader gains an installCalls slice so this
   negative assertion is direct.
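A recording fake of the kind described in item 2 might look like the sketch below. It is an assumption-laden illustration rather than the test code from this commit: `backendInstaller`, `fakeInstaller`, `installCall`, and `scaleUpReplica` are hypothetical names, and only the idea of an `installCalls` slice comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// installCall records the arguments of one install request.
type installCall struct {
	NodeID, Backend, ModelID string
}

// backendInstaller is a hypothetical interface for sending installs to a node.
type backendInstaller interface {
	InstallBackend(nodeID, backend, modelID string) error
}

// fakeInstaller records every call so a test can assert that none happened.
type fakeInstaller struct {
	installCalls []installCall
}

func (f *fakeInstaller) InstallBackend(nodeID, backend, modelID string) error {
	f.installCalls = append(f.installCalls, installCall{NodeID: nodeID, Backend: backend, ModelID: modelID})
	return nil
}

// scaleUpReplica is a hypothetical function under test: it refuses to
// install when the backend name is unknown.
func scaleUpReplica(inst backendInstaller, nodeID, backend, modelID string) error {
	if backend == "" {
		return errors.New("no stored load info; refusing to install an empty backend")
	}
	return inst.InstallBackend(nodeID, backend, modelID)
}

func main() {
	fake := &fakeInstaller{}
	err := scaleUpReplica(fake, "node-1", "", "whisper-large-v3-turbo-it-ggml")

	// The negative assertion: an error came back and nothing was sent.
	fmt.Println(err != nil, len(fake.installCalls)) // true 0
}
```

Recording calls in a slice lets the test check directly that no install was sent, instead of inferring it from the returned error alone.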

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
@mudler mudler merged commit 170d55c into master May 4, 2026
49 checks passed
@mudler mudler deleted the fix/distributed-router-selector-and-reconciler branch May 4, 2026 07:42
@localai-bot localai-bot added the bug Something isn't working label May 9, 2026
Labels

bug Something isn't working

2 participants