Job Description
CoreWeave is The Essential Cloud for AI™, delivering a platform that enables innovators to build and scale AI with confidence. As a Bare Metal Support Engineer, you will support, operate, and maintain CoreWeave’s GPU fleet, ensuring reliability and performance while collaborating with customers and engineering teams.
Responsibilities
- Provide high-level support for customers using bare-metal GPU fleets on CoreWeave Cloud
- Diagnose, triage, and investigate reported customer issues and high-priority incidents, identifying root causes and escalating when necessary
- Develop a deep understanding of customer workloads and use cases to provide tailored technical support
- Coordinate remote troubleshooting and hardware interventions with Data Center Technicians
- Create and maintain internal documentation, including troubleshooting guides, best practices, and knowledge base articles
- Participate in an on-call rotation to support production clusters and ensure operational reliability
- Collaborate with engineering teams to improve hardware reliability, software stability, and system performance
- Implement automation and scripting to streamline support workflows and reduce manual interventions
- Perform in-depth log analysis and debugging across multiple layers of the stack (firmware, drivers, hardware)
- Provide feedback to internal teams on common support issues to drive continuous improvements
- Work with networking teams to troubleshoot connectivity issues affecting customer workloads
- Support supercomputing infrastructure running GPU workloads at scale
- Drive operational excellence by refining internal processes and support methodologies
Skills
- Experience in data centers, GPU clusters, server deployments, system administration, or hardware troubleshooting
- Demonstrated experience driving issue resolution and continuous improvement across cross-functional teams in a data center environment
- Intermediate knowledge of Linux (Ubuntu, CentOS, or similar), including command-line proficiency
- Experience with NVIDIA GPUs, Supermicro systems, Dell systems, high-performance computing (HPC), and large-scale data center environments
- Experience with networking fundamentals (TCP/IP, VLANs, DNS, DHCP) and network troubleshooting tools
- Hands-on experience with firmware updates, BIOS configurations, and driver management
- Experience analyzing system logs and debugging issues across firmware, drivers, and hardware layers
- Experience working with Jira, Confluence, Notion, or other issue-tracking and documentation platforms
- Experience in scripting and automation (Python, Bash, Ansible, or similar)
- You're curious about Kubernetes, Docker, and containerized infrastructure
- You have strong problem-solving skills with a proactive and analytical mindset
- You have excellent communication skills and a demonstrated ability to work collaboratively in a fast-paced environment
Benefits
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short-term and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Ability to participate in the Employee Stock Purchase Program (ESPP)
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations
- A casual work environment
- A work culture focused on innovative disruption