fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups #9652
Merged
…mpty-backend reconciler scaleups
Two distinct bugs were causing tight retry loops in the distributed scheduler:
1. FindAndLockNodeWithModel ignored the model's NodeSelector. When a model
was loaded on multiple nodes and only some matched the current selector,
the function returned the lowest-in_flight node — even one the selector
excluded. Route()'s post-check then fell through to scheduleNewModel,
which targeted the matching node where the model was already at
MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded
model on that node was the one being requested, and it was busy), so
every request looped through "evicting LRU" → "all models busy".
Fix: thread an optional candidateNodeIDs filter through
FindAndLockNodeWithModel. Route() resolves the selector once via a new
resolveSelectorCandidates helper and passes the matching IDs to both
the cached-replica lookup and scheduleNewModel. The same helper
replaces the inline selector block in scheduleNewModel.
2. ScheduleAndLoadModel (reconciler scale-up path) fell back to
scheduleNewModel with backendType="" when no replica had ever been
loaded for a model. The worker rejected the resulting backend.install
("backend name is empty") on every reconciler tick (~30s).
Fix: remove the broken fallback. When GetModelLoadInfo has nothing
stored, return a clear error instead of firing a doomed NATS install.
The reconciler's existing scale-up failure log surfaces it once per
tick; the model auto-replicates as soon as Route() serves it once and
stores load info.
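
The candidate filter from fix 1 can be sketched as follows. This is a minimal in-memory illustration, not the registry code from this PR: the `replica` type and the `pickReplica` function are hypothetical names, and the real FindAndLockNodeWithModel also takes a lock, which is omitted here. It assumes a nil candidate list means "no selector, every node is eligible".

```go
package main

import "fmt"

// replica is a hypothetical stand-in for one loaded copy of a model.
type replica struct {
	NodeID   string
	InFlight int
}

// pickReplica returns the eligible replica with the fewest in-flight requests.
// A nil candidates slice means every node is eligible (assumption); a non-nil
// slice restricts the choice to the listed node IDs.
func pickReplica(replicas []replica, candidates []string) (replica, bool) {
	var allowed map[string]struct{}
	if candidates != nil {
		allowed = make(map[string]struct{}, len(candidates))
		for _, id := range candidates {
			allowed[id] = struct{}{}
		}
	}
	var best replica
	found := false
	for _, r := range replicas {
		if allowed != nil {
			if _, ok := allowed[r.NodeID]; !ok {
				continue // node excluded by the selector
			}
		}
		if !found || r.InFlight < best.InFlight {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	replicas := []replica{{"node-a", 0}, {"node-b", 3}}

	// node-a is less busy, but only node-b matches the selector.
	r, ok := pickReplica(replicas, []string{"node-b"})
	fmt.Println(r.NodeID, ok) // node-b true

	// Model loaded only on excluded nodes: report not-found.
	_, ok = pickReplica(replicas, []string{"node-c"})
	fmt.Println(ok) // false
}
```

Returning not-found in the second case is what lets the caller fall through to a fresh schedule instead of reusing a replica the selector excludes.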
Also downgrade the post-LoadModel-failure StopGRPC error to Debug — that
cleanup attempt usually hits "model not found" because LoadModel failed
before registering the process, and the outer "Failed to load model"
error already carries the real reason.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
…reconciler scaleup guard
Two regression tests for the bugs fixed in the previous commit:
1. FindAndLockNodeWithModel — registry-level integration tests verify the
candidateNodeIDs filter:
- Returns the included node even when an excluded node has lower
in_flight (the original selector-mismatch loop scenario).
- Returns not-found when the model is loaded only on excluded nodes,
forcing Route() to fall through to a fresh schedule instead of
reusing the excluded replica.
2. ScheduleAndLoadModel — mock-based test verifies the reconciler scale-up
path returns an error and does NOT fire backend.install when no replica
has been loaded yet. fakeUnloader gains an installCalls slice so this
negative assertion is direct.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Summary
Two distinct bugs were causing tight retry loops in the distributed scheduler, both observed in a live cluster:

1. FindAndLockNodeWithModel ignored NodeSelector → eviction-busy loop
When a model was loaded on multiple nodes and only some matched the current selector, the function returned the lowest-in_flight node, even one the selector excluded. Route()'s post-check then fell through to scheduleNewModel, which targeted the matching node where the model was already at MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded model on that node was the one being requested, and it was busy), so every request looped through "evicting LRU" → "all models busy".
Fix: thread an optional candidateNodeIDs filter through FindAndLockNodeWithModel. Route() resolves the selector once via a new resolveSelectorCandidates helper and passes the matching IDs to both the cached-replica lookup and scheduleNewModel. The same helper replaces the inline selector block already inside scheduleNewModel.

2. Reconciler scaleup with empty backend type
ScheduleAndLoadModel fell back to scheduleNewModel(ctx, "", modelName, nil) when GetModelLoadInfo had nothing stored (model never loaded yet). The worker rejected the resulting backend.install ("backend name is empty") on every reconciler tick (~30s).
Fix: remove the broken fallback. When GetModelLoadInfo has nothing stored, return a clear error instead of firing a doomed NATS install. The reconciler's existing scale-up failure log surfaces it once per tick; the model auto-replicates as soon as Route() serves it once and stores load info.

3. Drive-by: noisy "error stopping model" log
The defensive StopGRPC after a failed LoadModel was logged at ERROR. That cleanup usually hits "model not found" because LoadModel failed before the process was registered, and the outer "Failed to load model" already carries the real reason. Downgraded to Debug.

Test plan
- go build ./core/services/nodes/... ./pkg/model/... ./tests/e2e/distributed/...
- go test ./core/services/nodes/... -ginkgo.focus="SmartRouter"
- go test ./core/services/nodes/... -ginkgo.focus="FindAndLockNodeWithModel"
- go test ./core/services/nodes/... -ginkgo.focus="Reconciler"
- go test ./core/services/nodes/... -timeout 10m ✅
- agx-orin-slow/nvidia-thor request stops looping and serves from the cached replica on the selector-matching node
- whisper-large-v3-turbo-it-ggml reconciler stops sending empty-backend installs (one Warn per tick instead of Warn + failed NATS roundtrip)

Files changed
- core/services/nodes/registry.go: FindAndLockNodeWithModel adds optional candidateNodeIDs filter
- core/services/nodes/router.go: new resolveSelectorCandidates helper; Route and scheduleNewModel use it; ScheduleAndLoadModel removes broken empty-backend fallback
- core/services/nodes/interfaces.go: ModelRouter interface signature update
- core/services/nodes/{router,model_router,registry}_test.go: fakes/callers updated
- tests/e2e/distributed/model_routing_test.go: call site updated
- pkg/model/initializers.go: downgrade post-load cleanup StopGRPC log to Debug