Job Description
CoreWeave is The Essential Cloud for AI™, providing a platform for innovators to build and scale AI. The Operations Engineer will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics, ensuring their stability and performance.

Responsibilities
- Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes
- Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks
- Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams
- Perform routine maintenance and upgrades on InfiniBand switches and control plane components
- Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise

Skills
- At least 1 year of experience with InfiniBand or similar networking technologies
- Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting
- Experience with Linux system administration and maintenance
- Proficiency in at least one scripting language
- Hands-on experience with Nvidia UFM or similar fabric management tools
- Familiarity with SLURM job scheduler and its role in HPC environments
- Experience with monitoring and visualization platforms such as Grafana or Prometheus
- Experience with operational tooling and automation frameworks like Ansible
- Knowledge of data center operations, including server racks and cabling
- Python or Bash scripting

Benefits
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Ability to Participate in Employee Stock Purchase Program (ESPP)
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations
- A casual work environment
- A work culture focused on innovative disruption

Company Overview
CoreWeave provides cloud infrastructure services designed to support artificial intelligence and high-performance computing workloads. It was founded in 2017 and is headquartered in Livingston, New Jersey, USA, with a workforce of 1001-5000 employees. Its website is