Job Description
About the position
- Responsibilities
- Design, analyze, and troubleshoot large-scale distributed systems.
- Participate in on-call rotation, engage with product teams to fix production outages, and carry forward action items to improve ongoing reliability.
- Develop effective tooling, alerts, and responses that identify and address reliability risks, including automated problem detection and mitigation.
- Requirements
- Minimum of 7 years of experience.
- Proficient in Linux.
- Expert in configuration management tools like Ansible.
- Knowledgeable in creating CI/CD pipelines, preferably with Jenkins.
- Skilled in optimizing container builds.
- Hands-on experience with Kubernetes or OpenShift.
- Comfortable writing scripts in Bash and Python.
- Practical experience in building React front-end applications with strong proficiency in JavaScript/TypeScript.
- Expertise in developing backend services and APIs, particularly using Python frameworks.
- Strong understanding of both SQL and NoSQL databases.
- Familiar with message brokers and task queues such as Kafka, Redis, and Celery for asynchronous task processing.
- Nice-to-haves
- Experience working with production Kubernetes/OpenShift environments (strongly preferred).
- In-depth experience with Ansible, Python, Terraform, and CI/CD tools such as Jenkins, IBM Continuous Delivery, and ArgoCD.
- Hands-on experience crafting alerts and dashboards using tools such as Instana, New Relic, and Grafana/Prometheus.