fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups by localai-bot · Pull Request #9652 · mudler/LocalAI

localai-bot · 2026-05-04T06:50:08Z

Summary

Two distinct bugs were causing tight retry loops in the distributed scheduler — both observed in a live cluster:

1. `FindAndLockNodeWithModel` ignored `NodeSelector` → eviction-busy loop

When a model was loaded on multiple nodes and only some matched the current selector, the function returned the lowest-in_flight node — even one the selector excluded. Route()'s post-check then fell through to scheduleNewModel, which targeted the matching node where the model was already at MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded model on that node was the one being requested, and it was busy), so every request looped:

Cached model on node that no longer matches selector, falling through node="agx-orin-slow"
Chosen node has no free replica slot, evicting LRU node="nvidia-thor"
All models have in-flight requests, waiting for capacity
... eviction failed: all models busy, cannot evict

Fix: thread an optional candidateNodeIDs filter through FindAndLockNodeWithModel. Route() resolves the selector once via a new resolveSelectorCandidates helper and passes the matching IDs to both the cached-replica lookup and scheduleNewModel. Same helper replaces the inline selector block already inside scheduleNewModel.
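
The filter described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in registry.go or router.go: the `replica` type, `resolveCandidates`, `pickReplica`, and the `tier` label are hypothetical names invented for the example, and the real lookup also takes a lock on the chosen node. The point is only that a nil candidate set means "no selector", while a non-nil set restricts the lowest-in-flight choice to matching nodes.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a hypothetical, simplified view of one loaded model replica.
type replica struct {
	NodeID   string
	InFlight int
}

// errNoReplica reports that no eligible node has the model loaded.
var errNoReplica = errors.New("no eligible replica")

// resolveCandidates returns the IDs of nodes whose labels satisfy every
// key/value pair in selector. A nil result means "no selector, no filter".
func resolveCandidates(nodeLabels map[string]map[string]string, selector map[string]string) map[string]bool {
	if len(selector) == 0 {
		return nil
	}
	out := map[string]bool{}
	for id, labels := range nodeLabels {
		match := true
		for k, v := range selector {
			if labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out[id] = true
		}
	}
	return out
}

// pickReplica returns the least-loaded replica among the candidates.
// With a nil candidate set every replica is eligible.
func pickReplica(replicas []replica, candidates map[string]bool) (replica, error) {
	var best replica
	found := false
	for _, r := range replicas {
		if candidates != nil && !candidates[r.NodeID] {
			continue // excluded by the selector
		}
		if !found || r.InFlight < best.InFlight {
			best, found = r, true
		}
	}
	if !found {
		return replica{}, errNoReplica
	}
	return best, nil
}

func main() {
	// Hypothetical labels: only nvidia-thor matches the selector.
	labels := map[string]map[string]string{
		"agx-orin-slow": {"tier": "slow"},
		"nvidia-thor":   {"tier": "fast"},
	}
	replicas := []replica{{"agx-orin-slow", 0}, {"nvidia-thor", 3}}
	candidates := resolveCandidates(labels, map[string]string{"tier": "fast"})
	r, err := pickReplica(replicas, candidates)
	fmt.Println(r.NodeID, err)
}
```

Resolving the candidate set once and passing it to both the cached lookup and the fresh-schedule path keeps the two decisions consistent, which is what removes the fall-through loop in this sketch.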

2. Reconciler scaleup with empty backend type

ScheduleAndLoadModel fell back to scheduleNewModel(ctx, "", modelName, nil) when GetModelLoadInfo had nothing stored (model never loaded yet). The worker rejected the resulting backend.install ("backend name is empty") on every reconciler tick (~30s):

Reconciler: scaling up to meet minimum model="whisper-large-v3-turbo-it-ggml" current=0 min=1 adding=1
No stored model load info for reconciler scale-up, falling back to backend install only
Sending NATS backend.install nodeID="..." backend="" modelID="whisper-..."
Reconciler: failed to scale up replica ... installing backend from gallery: backend name is empty

Fix: remove the broken fallback. When GetModelLoadInfo has nothing stored, return a clear error instead of firing a doomed NATS install. The reconciler's existing scale-up failure log surfaces it once per tick; the model auto-replicates as soon as Route() serves it once and stores load info.
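
A rough sketch of the guard, under stated assumptions: `loadInfo`, `infoStore`, `fakeInstaller`, and `scaleUp` are hypothetical stand-ins, not the actual ScheduleAndLoadModel or GetModelLoadInfo code. It only illustrates the shape of the change: when nothing is stored, return an error before any install request is sent, and let a fake record calls so the "no install fired" case can be asserted directly.

```go
package main

import (
	"errors"
	"fmt"
)

// loadInfo is a hypothetical stand-in for the stored model load info.
type loadInfo struct {
	Backend string
}

// infoStore is a hypothetical in-memory store keyed by model name.
type infoStore map[string]loadInfo

// fakeInstaller records every install request so a test can assert on it.
type fakeInstaller struct {
	installCalls []string
}

// Install appends "backend/model" to installCalls and always succeeds.
func (f *fakeInstaller) Install(backend, model string) error {
	f.installCalls = append(f.installCalls, backend+"/"+model)
	return nil
}

// errNoLoadInfo signals that the model has never been loaded, so the
// backend to install is unknown.
var errNoLoadInfo = errors.New("no stored load info")

// scaleUp refuses to send an install when the backend is unknown,
// instead of falling back to an empty backend name.
func scaleUp(store infoStore, inst *fakeInstaller, model string) error {
	info, ok := store[model]
	if !ok || info.Backend == "" {
		return fmt.Errorf("scale up %q: %w", model, errNoLoadInfo)
	}
	return inst.Install(info.Backend, model)
}

func main() {
	inst := &fakeInstaller{}
	err := scaleUp(infoStore{}, inst, "whisper-large-v3-turbo-it-ggml")
	fmt.Println(err, len(inst.installCalls))
}
```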

3. Drive-by: noisy "error stopping model" log

The defensive StopGRPC after a failed LoadModel was logged at ERROR. That cleanup usually hits "model not found" because LoadModel failed before the process was registered, and the outer "Failed to load model" already carries the real reason. Downgraded to Debug.
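
For illustration, the effect of the downgrade can be sketched with Go's standard log/slog package. This is an assumption made for the example, not the logger initializers.go uses, and `loadWithCleanup` and `demoOutput` are hypothetical names. At the default Info level, the cleanup failure is dropped and only the outer load error is emitted.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
)

// loadWithCleanup runs load; on failure it attempts a defensive stop and
// logs that cleanup error at Debug, keeping the real cause at Error.
func loadWithCleanup(log *slog.Logger, load, stop func() error) error {
	err := load()
	if err == nil {
		return nil
	}
	if stopErr := stop(); stopErr != nil {
		// Usually "model not found": the process was never registered.
		log.Debug("cleanup after failed load", "error", stopErr)
	}
	log.Error("Failed to load model", "error", err)
	return err
}

// demoOutput returns what an Info-level text logger emits for a failed load.
func demoOutput() string {
	var buf bytes.Buffer
	log := slog.New(slog.NewTextHandler(&buf, &slog.HandlerOptions{Level: slog.LevelInfo}))
	_ = loadWithCleanup(log,
		func() error { return errors.New("backend crashed") },
		func() error { return errors.New("model not found") },
	)
	return buf.String()
}

func main() {
	fmt.Print(demoOutput())
}
```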

Test plan

go build ./core/services/nodes/... ./pkg/model/... ./tests/e2e/distributed/...
go test ./core/services/nodes/... -ginkgo.focus="SmartRouter"
go test ./core/services/nodes/... -ginkgo.focus="FindAndLockNodeWithModel"
go test ./core/services/nodes/... -ginkgo.focus="Reconciler"
Full nodes suite: go test ./core/services/nodes/... -timeout 10m ✅
Live cluster: confirm the agx-orin-slow / nvidia-thor request stops looping and serves from the cached replica on the selector-matching node
Live cluster: confirm whisper-large-v3-turbo-it-ggml reconciler stops sending empty-backend installs (one Warn per tick instead of Warn + failed NATS roundtrip)

Files changed

core/services/nodes/registry.go — FindAndLockNodeWithModel adds optional candidateNodeIDs filter
core/services/nodes/router.go — new resolveSelectorCandidates helper; Route and scheduleNewModel use it; ScheduleAndLoadModel removes broken empty-backend fallback
core/services/nodes/interfaces.go — ModelRouter interface signature update
core/services/nodes/{router,model_router,registry}_test.go — fakes/callers updated
tests/e2e/distributed/model_routing_test.go — call site updated
pkg/model/initializers.go — downgrade post-load cleanup StopGRPC log to Debug

…mpty-backend reconciler scaleups

Two distinct bugs were causing tight retry loops in the distributed scheduler:

1. FindAndLockNodeWithModel ignored the model's NodeSelector. When a model was loaded on multiple nodes and only some matched the current selector, the function returned the lowest-in_flight node, even one the selector excluded. Route()'s post-check then fell through to scheduleNewModel, which targeted the matching node where the model was already at MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded model on that node was the one being requested, and it was busy), so every request looped through "evicting LRU" → "all models busy".

Fix: thread an optional candidateNodeIDs filter through FindAndLockNodeWithModel. Route() resolves the selector once via a new resolveSelectorCandidates helper and passes the matching IDs to both the cached-replica lookup and scheduleNewModel. The same helper replaces the inline selector block in scheduleNewModel.

2. ScheduleAndLoadModel (reconciler scale-up path) fell back to scheduleNewModel with backendType="" when no replica had ever been loaded for a model. The worker rejected the resulting backend.install ("backend name is empty") on every reconciler tick (~30s).

Fix: remove the broken fallback. When GetModelLoadInfo has nothing stored, return a clear error instead of firing a doomed NATS install. The reconciler's existing scale-up failure log surfaces it once per tick; the model auto-replicates as soon as Route() serves it once and stores load info.

Also downgrade the post-LoadModel-failure StopGRPC error to Debug: that cleanup attempt usually hits "model not found" because LoadModel failed before registering the process, and the outer "Failed to load model" error already carries the real reason.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

…reconciler scaleup guard

Two regression tests for the bugs fixed in the previous commit:

1. FindAndLockNodeWithModel: registry-level integration tests verify the candidateNodeIDs filter:
   - Returns the included node even when an excluded node has lower in_flight (the original selector-mismatch loop scenario).
   - Returns not-found when the model is loaded only on excluded nodes, forcing Route() to fall through to a fresh schedule instead of reusing the excluded replica.

2. ScheduleAndLoadModel: mock-based test verifies the reconciler scale-up path returns an error and does NOT fire backend.install when no replica has been loaded yet. fakeUnloader gains an installCalls slice so this negative assertion is direct.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

mudler added 2 commits May 4, 2026 06:49

mudler merged commit 170d55c into master May 4, 2026
49 checks passed

mudler deleted the fix/distributed-router-selector-and-reconciler branch May 4, 2026 07:42

localai-bot added the bug label May 9, 2026

BrewTestBot mentioned this pull request May 11, 2026

localai 4.2.0 Homebrew/homebrew-core#282016

Merged
