Director of AI-ML Infra & Ops Engineering

🌍 Remote, USA 🎯 Full-time 🕐 Posted Recently

Job Description

About the position

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by diversity and inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health equity on a global scale. Join us to start Caring. Connecting. Growing together.

As the Director of Infrastructure and Ops, you will manage support for UAIS (United AI Studio, the enterprise AI/ML platform). This director-level role requires proven experience managing SRE teams for large-scale AI/ML platforms, ensuring stability, reliability, scalability, and performance. Extensive experience with modern infrastructure and DevOps tools and paradigms, as well as hands-on knowledge of major cloud services such as Azure, AWS and GCP, is a must.

You'll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges.

Responsibilities

• Manage geographically distributed SRE support for users of the UAIS platform: triage support, liaise with customers, actively participate in war rooms, work with suppliers
• Improve automation across the infrastructure lifecycle, leveraging Infrastructure as Code (IaC) and DevOps principles and best practices to streamline deployment and management processes
• Manage monitoring frameworks for infrastructure, identifying areas for performance improvement and optimization, and ensuring high availability
• Manage disaster recovery and business continuity plans to ensure minimal downtime and data integrity
• Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats
• Provide strong technical and personal mentorship

Requirements

• Bachelor's degree in computer science, information technology, or a related STEM field
• 12+ years of cloud infrastructure experience: proven experience working on large-scale, cloud-based, enterprise-level platforms; deep understanding of multi-cloud architectures, specifically Azure, AWS, and GCP, with hands-on experience in cloud management
• 6+ years of practical experience in Infrastructure-as-Code and CI/CD tools like Terraform, GitHub Actions, and the like
• 6+ years of practical experience in containerization technologies (Kubernetes, Docker), observability and orchestration
• 5+ years of practical experience in scripting and automation: advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts
• 5+ years of experience leading geographically distributed support and SRE teams

Nice-to-haves

• Strong understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks
• Exposure to modern tools and techniques in the MLOps and LLMOps fields
• Exposure to AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale
• Experience working within a healthcare or regulated industry, with a solid understanding of the unique challenges and compliance requirements
• Ability to work independently, manage multiple projects simultaneously, and adapt to changing priorities in a fast-paced environment

Benefits

• Comprehensive benefits package
• Incentive and recognition programs
• Equity stock purchase
• 401k contribution