Job Description
Microsoft is a leading technology company dedicated to empowering every person and organization on the planet. The Site Reliability Engineer will leverage technical expertise to improve the reliability and performance of large-scale distributed systems, collaborating with product teams and automating operational tasks to enhance service efficiency.
Responsibilities
- Independently creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of one or more platforms, systems, or products operating at scale
- Leverages technical expertise in cloud technologies and specific products, as well as objective insights drawn from analyses of production telemetry data, to suggest changes or add-ons to product features or automation that improve the product components or features supported by their team
- Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations and incident responses throughout product development and operations cycles
- Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products
- Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale
- Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands, and analyzes telemetry data using existing capacity planning models
- Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across a limited range of use conditions and system parameters
- Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging artificial intelligence (AI) and machine learning (ML) capabilities
- Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams
- Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required
- Models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions
- Proposes changes and drives implementation of solutions to identified performance and resource challenges
- Embodies our culture and values
Skills
- Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
- Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role
- The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
- Failure to obtain or maintain the appropriate U.S. Government clearance and/or to meet customer screening requirements may result in employment action up to and including termination
- This position requires successful verification of the stated security clearance to meet federal government customer requirements
- You will be asked to provide clearance verification information prior to an offer of employment
- This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
- This position requires verification of U.S. citizenship due to citizenship-based legal restrictions
- Citizenship will be verified via a valid passport, other approved documents, or a verified U.S. Government clearance
- Experience working on large-scale distributed services with on-call responsibilities
- Ability to build and influence broadly towards common goals and priorities
- Experience with distributed database systems and SQL databases such as PostgreSQL