Refactor/torch autocast encapsulate global state #7946

Open
nathon-lee wants to merge 15 commits into deepspeedai:master from nathon-lee:refactor/torch-autocast-encapsulate-global-state
@nathon-lee nathon-lee commented Apr 2, 2026

Contributor

refactor: replace bare global vars in torch_autocast with _AutocastState

TORCH_AUTOCAST_INITIALIZED and TORCH_AUTOCAST_DTYPE were module-level
globals mutated via global statements inside init_autocast_params().
This pattern is fragile: it is invisible to type checkers, prevents
isolation between multiple engine instances, and makes the state harder
to reset in tests.

Replace them with a private _AutocastState dataclass instance
_autocast_state. The public API (is_autocast_initialized,
get_autocast_dtype) is unchanged, so no call sites are affected.


fix: store autocast state per-engine to support multiple engine configs

Previously, _autocast_state was a module-level singleton in
torch_autocast.py. When a second DeepSpeed engine called
init_autocast_params(), it would overwrite the first engine's dtype
and initialized state, making it impossible to run two engines with
different autocast configurations concurrently.

Fix by attaching _AutocastState directly to the engine instance
(engine._autocast_state). Update is_autocast_initialized() and
get_autocast_dtype() to accept an engine argument. For ZeRO
optimizers (which hold no engine reference), switch from the global
state query to the per-parameter has_comm_dtype() check; parameters
are already stamped by their own engine inside init_autocast_params(),
so isolation is automatic.
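
The per-engine approach described above could look roughly like the following sketch. `FakeEngine` and `FakeParam` are hypothetical stand-ins for a DeepSpeed engine and a `torch.nn.Parameter`, the `comm_dtype` attribute name is an assumption, and dtypes are plain strings so the example runs without PyTorch. It illustrates the idea and is not the PR's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class _AutocastState:
    initialized: bool = False
    dtype: Optional[str] = None  # stand-in for torch.dtype


class FakeParam:
    """Hypothetical stand-in for torch.nn.Parameter."""


class FakeEngine:
    """Hypothetical stand-in for a DeepSpeed engine."""

    def __init__(self, n_params: int) -> None:
        self.params: List[FakeParam] = [FakeParam() for _ in range(n_params)]


def init_autocast_params(engine: FakeEngine, dtype: str) -> None:
    # State is attached to the engine, so a second engine cannot overwrite it.
    engine._autocast_state = _AutocastState(initialized=True, dtype=dtype)
    # Each parameter is stamped by its own engine (attribute name assumed).
    for param in engine.params:
        param.comm_dtype = dtype


def is_autocast_initialized(engine: FakeEngine) -> bool:
    state = getattr(engine, "_autocast_state", None)
    return state is not None and state.initialized


def get_autocast_dtype(engine: FakeEngine) -> Optional[str]:
    state = getattr(engine, "_autocast_state", None)
    return state.dtype if state is not None else None


def has_comm_dtype(param: FakeParam) -> bool:
    # Optimizer-side check: needs only the parameter, not an engine reference.
    return hasattr(param, "comm_dtype")
```

With this shape, two engines initialized with different dtypes each report their own value, and code that holds only parameters can still tell whether autocast was set up for them.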

@tohtana tohtana left a comment

Collaborator

Hi @nathon-lee, thank you for opening this PR!
_autocast_state is still global and doesn't seem to support different configs for multiple engines. Did I misunderstand something?

nathon-lee commented Apr 3, 2026

Contributor Author

@tohtana thank you, good catch. I still need to make one more change.

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
3 participants