A rapidly scaling AI-infrastructure company, backing many of the world’s leading research labs and next-generation AI builders, is seeking a Network Engineer focused on Operations and Repair.
They’re building colossal GPU clusters in the US: think 100k+ GPUs, liquid cooling, and multi-GW power draw. This is the infrastructure that determines how fast the future gets built.
This role is for an experienced network operations engineer who wants true ownership. You’ll be the primary operator for a datacenter region, responsible for keeping large-scale network fabrics healthy, responding to complex incidents, and coordinating repair and recovery when things go wrong.
This is not a NOC role and not a design-only position. You’ll work closely with centralized monitoring teams, deployment engineers, and onsite operations to ensure production networks stay available and performant.
What you’ll do
Own network operations for an assigned datacenter region, supporting datacenter deployments, turn-ups, and expansions
Act as Tier 2/3 escalation point for network incidents
Troubleshoot complex L1–L3 and fabric-level issues
Coordinate network break-fix with onsite teams and vendors
Manage RMAs and vendor escalations
Build and maintain regional/network observability dashboards
Validate production readiness and operational handover
Requirements
4+ years of network engineering with heavy production ops exposure
Proven experience running and troubleshooting live datacenter networks
Strong incident response and outage leadership experience
Hands-on with EVPN/VXLAN, BGP, Clos fabrics, and high-radix switching
Confident in troubleshooting L2/L3, routing, fabric, and physical faults
Experience with SQL-backed dashboards (Grafana, Tableau, similar)
Working knowledge of Python for ops, analysis, or scripting
Pragmatic operator: prioritizes impact, documents as they go
Comfortable with ~30–40% travel
Nice to have
AI/ML or HPC network operations (RDMA, RoCEv2, lossless Ethernet)
Previous site, campus, or regional ops ownership
Hands-on hardware break-fix and RMA coordination
Experience with network monitoring, alerting, and telemetry
Follow-the-sun or globally distributed ops experience
Compensation
$150k–$260k + meaningful equity
Generous PTO policy
Remote flexibility available, though in-office presence is encouraged.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: Software Development