
feat: add Trackio as a new experiment monitoring backend #8065

Open
chanduripranav wants to merge 2 commits into deepspeedai:master from chanduripranav:feature/trackio-monitor-7964



@chanduripranav


Closes #7964

Summary

Adds Trackio as a new experiment monitoring backend to DeepSpeed,
following the existing pattern used by WandB, TensorBoard, Comet, and CSV monitors.

Trackio is a lightweight, offline-first logging library developed by
Hugging Face with a WandB-compatible API. Runs can be visualized as
an HF Space or dataset on the HF Hub.

Changes

  • deepspeed/monitor/trackio.py — new TrackioMonitor class implementing
    the Monitor interface with log() and write_events() methods
  • deepspeed/monitor/config.py — new TrackioConfig with enabled and
    project fields; registered in get_monitor_config() and
    DeepSpeedMonitorConfig; included in check_enabled validator
  • deepspeed/monitor/utils.py — new check_trackio_availability() helper
    with install instructions
  • deepspeed/monitor/monitor.py — TrackioMonitor imported and wired
    into MonitorMaster.__init__() and write_events()
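
To make the pattern concrete, here is a minimal, self-contained sketch of a monitor that gates on rank 0 and forwards (label, value, step) events to a logging backend. It is not the code in deepspeed/monitor/trackio.py: `TrackioMonitorSketch` and `RecordingBackend` are hypothetical names, the rank is passed in explicitly instead of being read from the distributed runtime, and the backend is a stand-in that only assumes `init(project=...)` and `log(data, step=...)` calls in the WandB style.

```python
from typing import Any, Dict, List, Optional, Tuple

# (label, value, step): the event shape assumed for write_events() in this sketch.
Event = Tuple[str, float, int]


class RecordingBackend:
    """Hypothetical stand-in for the trackio module; records calls instead of logging."""

    def __init__(self) -> None:
        self.inits: List[str] = []
        self.logged: List[Tuple[Dict[str, Any], Optional[int]]] = []

    def init(self, project: str) -> None:
        self.inits.append(project)

    def log(self, data: Dict[str, Any], step: Optional[int] = None) -> None:
        self.logged.append((data, step))


class TrackioMonitorSketch:
    """Illustrative only; not the TrackioMonitor class added by this PR."""

    def __init__(self, enabled: bool, project: str, rank: int, backend: Any) -> None:
        self.enabled = enabled
        self.rank = rank
        self.backend = backend
        # Only the rank-0 process starts a run, mirroring the other monitors.
        if self._active():
            backend.init(project=project)

    def _active(self) -> bool:
        return self.enabled and self.rank == 0

    def log(self, data: Dict[str, Any], step: Optional[int] = None) -> None:
        if self._active():
            self.backend.log(data, step=step)

    def write_events(self, event_list: List[Event]) -> None:
        # Each event becomes one single-metric log call at its own step.
        for label, value, step in event_list:
            self.log({label: value}, step=step)


backend = RecordingBackend()
monitor = TrackioMonitorSketch(True, "my-deepspeed-run", rank=0, backend=backend)
monitor.write_events([("Train/loss", 0.5, 1), ("Train/lr", 0.001, 1)])
```

Injecting the backend keeps the sketch runnable without trackio installed and makes the rank-0 gating easy to check in isolation.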

Usage

Add to your DeepSpeed config:

{
  "trackio": {
    "enabled": true,
    "project": "my-deepspeed-run"
  }
}
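
If the config is built in Python rather than kept in a JSON file, the same section can be added to a dict. The sketch below assumes a hypothetical helper `with_trackio` and a placeholder `train_batch_size` value; only the `trackio` keys `enabled` and `project` come from this PR.

```python
import json


def with_trackio(ds_config: dict, project: str = "deepspeed") -> dict:
    """Return a copy of ds_config with the trackio section enabled (hypothetical helper)."""
    merged = dict(ds_config)
    merged["trackio"] = {"enabled": True, "project": project}
    return merged


base_config = {"train_batch_size": 8}  # placeholder value for illustration
config = with_trackio(base_config, project="my-deepspeed-run")

# The resulting dict can be dumped to a JSON config file or passed on as a dict.
print(json.dumps(config, indent=2))
```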

Testing

  • Verified pattern consistency with existing WandB and Comet backends
  • TrackioMonitor only initializes on rank 0, consistent with other monitors
  • Graceful ImportError with helpful install message if trackio not installed
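
The graceful ImportError behaviour can be sketched as follows. This is a generic illustration, not the body of check_trackio_availability(): the function name `check_package_availability`, its parameters, and the hint text are assumptions made for the example.

```python
import importlib


def check_package_availability(package: str, install_hint: str) -> None:
    """Raise ImportError with an install hint if `package` cannot be imported."""
    try:
        importlib.import_module(package)
    except ImportError as err:
        # Re-raise with a message that tells the user how to fix the problem.
        raise ImportError(f"{package} is not installed. {install_hint}") from err


# A module that exists passes silently.
check_package_availability("json", "It ships with Python.")

# A missing module produces the helpful message instead of a bare failure.
try:
    check_package_availability("no_such_pkg_for_sketch_12345", "Try `pip install trackio`.")
except ImportError as err:
    message = str(err)
```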
…#7964)

- Add TrackioMonitor class in deepspeed/monitor/trackio.py
- Add TrackioConfig with enabled and project fields in config.py
- Add check_trackio_availability() helper in utils.py
- Register TrackioMonitor in MonitorMaster in monitor.py
- Trackio is a lightweight offline-first logging library with
  WandB-compatible API, logs can be visualized on HF Hub

Signed-off-by: Pranav Chanduri <preethivardhanc@gmail.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 486c1022a4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +125 to +131
class TrackioConfig(DeepSpeedConfigModel):
    """Sets parameters for Trackio monitor."""

    enabled: bool = False
    """ Whether logging to Trackio is enabled. Requires `trackio` package is installed. """

    project: str = "deepspeed"


P1: Add required Trackio tests and docs

This introduces a new monitoring backend but the diff only changes production code; the workspace AGENTS.md requires new features to include corresponding tests and documentation updates. Please add coverage for the Trackio config/MonitorMaster wiring and logging behavior, and update the monitor docs so users can discover the new trackio configuration.

Useful? React with 👍 / 👎.

- Add TestTrackio covering config defaults, enabled config, and
  write_events() logging behavior (mocks the optional trackio dependency)
- Add TestMonitorMasterTrackioWiring covering MonitorMaster registration
  when trackio is enabled/disabled
- Update docs/_tutorials/monitor.md with Trackio overview, config
  example, and Custom Monitoring API reference

Addresses Codex review feedback on PR deepspeedai#8065

Signed-off-by: Pranav Chanduri <preethivardhanc@gmail.com>
@chanduripranav

Author

Added unit tests covering TrackioConfig defaults, MonitorMaster wiring,
and write_events() logging behavior, plus updated docs/_tutorials/monitor.md
with the Trackio overview and config example. Should address the earlier
Codex feedback on missing tests/docs.
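
For readers unfamiliar with mocking an optional dependency, the approach can be sketched like this. It is not the test code from this PR: `log_events` is a hypothetical function under test, and the `trackio.log(data, step=...)` call shape is an assumption based on the WandB-compatible API mentioned in the summary.

```python
import sys
from unittest import mock


def log_events(event_list):
    """Hypothetical code under test: imports trackio lazily and logs each event."""
    import trackio  # resolved through sys.modules, so a mock can stand in

    for label, value, step in event_list:
        trackio.log({label: value}, step=step)


def test_log_events_uses_trackio():
    fake_trackio = mock.MagicMock()
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, {"trackio": fake_trackio}):
        log_events([("Train/loss", 0.5, 1)])
    fake_trackio.log.assert_called_once_with({"Train/loss": 0.5}, step=1)


test_log_events_uses_trackio()
```

Placing the fake module in `sys.modules` lets the test run on machines where trackio is not installed.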

@chanduripranav

Author

Note: the CI failure in modal-torch-latest / DeepSpeedAI CI appears
unrelated to this PR — it's a 3600s timeout in
test_zero_user_backward.py::test_leaf_module_non_scalar_backward, which
has nothing to do with the Trackio monitor changes. Likely CI flakiness
on the shared test infra. Happy to have it re-run if a maintainer can
trigger that.

