Job Description
TITLE: Remote Infrastructure Engineer jobs – Full‑Time Senior Position in Brentwood, California | AWS, Terraform, Kubernetes, $115k‑$150k – Remote Infrastructure & Cloud Architecture Role

---

**Who we are**

We’re **ByteForge**, a mid‑size SaaS provider that grew from a two‑person garage project to a platform serving more than 2 million end‑users across North America. Our core product, an API‑driven data pipeline that powers real‑time analytics for retail chains, runs entirely in the cloud, and the reliability of that pipeline is what keeps our customers awake at night (in a good way).

We’ve been pulling double‑shifts on call for the last 18 months because a recent acquisition added a new data‑ingestion module that increased our daily traffic by 45 %. The engineering leadership team decided that we need a dedicated **Remote Infrastructure Engineer** to own the underlying platform, bring systematic automation, and finally give the on‑call crew a predictable shift schedule. That’s why we’re hiring in Brentwood, California: the talent pool there has a reputation for pragmatic cloud expertise, and we want someone who can relate to the same regional tech community while working fully remotely.

**Why this role exists now**

When we launched the new ingestion service, we saw three concrete pain points:

1. **Spikes in latency** that breached our 99.9 % SLA on 12 % of daily requests.
2. **Infrastructure cost overruns** that pushed our AWS bill from $1.1 M to $1.5 M in 6 months, a 36 % increase.
3. **Manual provisioning** of Kubernetes clusters and VPCs that caused a mean time to recovery (MTTR) of 84 minutes after a failure, far above our target of <30 minutes.

The senior engineer we’re adding will be the person who designs the automation that turns those spikes into data points we can predict, cuts monthly cloud spend by at least 10 % within the first year, and reduces MTTR to under 20 minutes.
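If you like to sanity‑check numbers before you apply, the cost and recovery figures above reduce to simple percentage arithmetic. Here is a minimal sketch, assuming the two AWS bill figures cover comparable billing periods; the helper name `pct_change` is hypothetical and not part of any ByteForge tooling.

```python
# Illustrative only: the helper below is an assumption for this example,
# not an existing ByteForge script.

def pct_change(before: float, after: float) -> float:
    """Return the percentage change going from `before` to `after`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Figures quoted in the posting.
aws_bill_change = pct_change(1.1, 1.5)  # AWS bill in $M, before vs. after
mttr_change_needed = pct_change(84, 20)  # MTTR in minutes, today vs. goal

print(f"AWS bill change: {aws_bill_change:.1f}%")
print(f"MTTR change needed to reach 20 minutes: {mttr_change_needed:.1f}%")
```

A negative result means a reduction, so the MTTR line shows how far recovery time has to fall to reach the 20‑minute goal.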
**What you’ll actually do**

- **Architect and implement** a fully automated, IaC‑driven environment using **Terraform** and **AWS CloudFormation** that can spin up identical staging, production, and disaster‑recovery clusters in under 10 minutes.
- **Manage our Kubernetes fleet** (currently 12 clusters, 340 nodes) leveraging **Helm** and **Kustomize** to version‑control all manifests, ensuring that every rollout is reversible.
- **Build and maintain CI/CD pipelines** in **Jenkins** and **GitHub Actions** that push infrastructure changes through a gated approval process, integrating secrets management via **HashiCorp Vault** and vulnerability scans via **Trivy**.
- **Instrument the stack** with **Prometheus**, **Grafana**, and **New Relic** to surface latency, error‑rate, and cost metrics in real time, setting alerts that feed directly into our on‑call rotation.
- **Collaborate with the security team** to enforce least‑privilege IAM policies, manage secrets, and run quarterly compliance checks (SOC 2, ISO 27001) using **AWS Config** and **AWS Security Hub**.
- **Mentor a small team of 4 junior engineers**, guiding them through best practices for cloud cost optimization, container security, and incident post‑mortems.
- **Run capacity planning** on quarterly forecasts, using **AWS Cost Explorer** and **CloudHealth** to model growth scenarios and recommend right‑sizing changes that keep the expense curve flat.
- **Drive the on‑call rotation** redesign: moving from a 24/7 “fire‑fighting” model to a predictably scheduled, run‑book‑first approach that reduces fatigue and improves resolution quality.
- **Document everything** in Confluence, ensuring that any new hire in Brentwood, California can walk through a “day‑in‑the‑life” playbook without having to ask a senior colleague.

**Who you’ll work with**

- **Product Engineering (12 engineers)**: You’ll be their go‑to for infrastructure feasibility, helping them understand the cost implications of new feature flags.
- **Data Science (5 analysts)**: They need reliable, low‑latency pipelines; you’ll work with them to fine‑tune cluster autoscaling policies.
- **Security & Compliance (3 specialists)**: You’ll partner on audits and embed security controls directly into the IaC pipeline.
- **Customer Success (8 reps)**: Occasionally you’ll join calls with a high‑value client in Brentwood, California who wants to understand how a new region will impact latency.
- **Executive leadership**: The CTO (based in Austin) meets weekly, and the CFO (who lives in Brentwood, California) tracks infrastructure spend closely. Your reports will influence quarterly budgeting discussions.

**Our tech stack (the tools you’ll be getting hands‑on with)**

1. **AWS (EC2, RDS, S3, EKS, Lambda)**
2. **Terraform (v1.5+)**
3. **Kubernetes (v1.27)**
4. **Helm & Kustomize**
5. **Docker (v24)**
6. **Jenkins + GitHub Actions**
7. **Prometheus & Grafana**
8. **New Relic APM**
9. **HashiCorp Vault**
10. **Ansible (for VM configuration)**
11. **AWS CloudWatch & CloudTrail**
12. **Splunk (log aggregation)**

You’ll also get to experiment with **GitLab CI** and **Azure** if a client asks for a multi‑cloud proof‑of‑concept. We consider the list “the tools we love today,” not a static requirement.

**Metrics you’ll be judged on (the numbers that matter)**

| Metric | Target (12‑month horizon) |
|--------|---------------------------|
| Cloud cost reduction | ≥ 10 % YoY |
| SLA compliance (99.9 % uptime) | ≥ 99.95 % |
| MTTR for infra incidents | ≤ 20 minutes |
| Automation coverage (IaC vs manual) | 95 %+ |
| Team satisfaction (internal survey) | ≥ 4.5/5 |
| On‑call fatigue index (self‑reported) | ↓ 30 % |

Your first 90 days will be a “learning sprint”: you’ll audit existing pipelines, map out the biggest cost drivers, and submit a roadmap that outlines the automation milestones. Success is measured not just by ticking boxes but by the tangible improvement in the numbers above.
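To see how the uptime and MTTR targets in the table interact, it can help to turn each uptime percentage into a downtime budget. The sketch below is a rough illustration under stated assumptions: a 30‑day month, and every incident treated as a full outage lasting exactly the MTTR. The function names are hypothetical, not part of our tooling.

```python
# Illustrative only: assumes a 30-day month and that each incident is a
# full outage lasting exactly the MTTR. Names are hypothetical.

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def downtime_budget_minutes(
    uptime_pct: float, period_minutes: int = MINUTES_PER_30_DAY_MONTH
) -> float:
    """Minutes of downtime allowed in the period for a given uptime target."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return period_minutes * (100.0 - uptime_pct) / 100.0


def incidents_within_budget(uptime_pct: float, mttr_minutes: float) -> int:
    """How many MTTR-length outages fit inside the monthly downtime budget."""
    if mttr_minutes <= 0:
        raise ValueError("mttr_minutes must be positive")
    return int(downtime_budget_minutes(uptime_pct) // mttr_minutes)


for target in (99.9, 99.95):
    budget = downtime_budget_minutes(target)
    fits = incidents_within_budget(target, mttr_minutes=20)
    print(f"{target}% uptime: {budget:.1f} min/month, {fits} incident(s) at 20 min MTTR")
```

Under these assumptions, tightening the uptime target shrinks the monthly budget, which is why the uptime and MTTR targets have to be worked on together.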
**What we offer (the real stuff, not buzzwords)**

- **Salary**: $115k – $150k base, commensurate with experience, plus a quarterly bonus tied to the cost‑reduction targets.
- **Equity**: 0.15 % option pool that vests over 4 years with a 1‑year cliff.
- **Remote‑first policy**: We back “remote” with a $2,500 stipend for a home office upgrade (standing desk, monitors, ergonomic chair). You’ll still attend two quarterly “team‑offsites” in Brentwood, California; we’ve found the coffee there keeps ideas flowing.
- **Health benefits**: Medical, dental, vision, and a $1,200 per‑year wellness allowance (gym, meditation apps, you name it).
- **Learning budget**: $2,000 annually for certifications (AWS, CKA, etc.) and conference tickets (AWS re:Invent, KubeCon). We’ve covered travel to remote conferences before, even for folks based in Brentwood, California.
- **Paid time off**: 20 days + federal holidays, plus a “recharge week” you can take any time after your first six months.
- **Family‑friendly policies**: Parental leave (up to 12 weeks paid), flexible schedule (you set the core hours, we just need you for the on‑call overlap).

**A human moment**

> “When I first joined ByteForge, I was pulling 2‑hour on‑call after‑hours because we didn’t have proper run‑books. Within three months, the new automation I helped build reduced my average incident time from 84 minutes to under 12 minutes. That change wasn’t just a metric: it gave me evenings back with my kids. Knowing my work directly improves someone’s personal life is why I stay here.” – *Lena, Senior Infrastructure Engineer (based in Brentwood, California)*

**Why you should apply now (the “why now” in plain language)**

Our next major release is scheduled for Q2 2026, and the new ingestion engine will double the data volume we process.
The engineering leaders have already earmarked a $500k budget for infrastructure automation, but they need a senior engineer who can turn that budget into concrete pipelines, cost savings, and a calmer on‑call rotation. If you love digging into cloud bills, writing Terraform modules that feel like poetry, and mentoring junior talent, you’ll find this role both challenging and rewarding.

**What a typical day looks like (remotely, from anywhere in the US, but we’ll be hiring in Brentwood, California)**

- **08:30 – 09:00** – Quick stand‑up on Zoom with the platform team (all are in different time zones, but we overlap for an hour).
- **09:00 – 10:30** – Review recent CloudWatch alerts; triage any spikes and add a new Prometheus rule if needed.
- **10:30 – 11:15** – Pair‑program with a junior engineer on a Terraform module that provisions a new VPC for a client in the Midwest.
- **11:15 – 12:00** – Write a short post‑mortem in Confluence, adding a run‑book snippet for a “node‑drain” incident we observed yesterday.
- **12:00 – 13:00** – Lunch break (we encourage you to step away, then maybe read the latest AWS blog post).
- **13:00 – 14:30** – Deploy a Helm chart to a sandbox cluster, test a new autoscaling policy using **KEDA**, and monitor the results in Grafana.
- **14:30 – 15:30** – Attend a 30‑minute security sync with the compliance team; discuss IAM role redesign to meet upcoming SOC 2 audit requirements.
- **15:30 – 16:00** – Update the cost‑optimization dashboard in **AWS Cost Explorer**, flag any resources that have been idle > 48 hours.
- **16:00 – 16:30** – End‑of‑day “handoff” notes posted in Slack for the next on‑call engineer (who’s based out of Brentwood, California this week).

**How to apply (simple, no‑nonsense process)**

1. **Submit your resume** via our career portal (link below). Include a short paragraph (2‑3 sentences) describing the biggest cloud‑cost reduction you’ve delivered.
2. **Technical screen** (30 minutes) with our lead architect – focused on your experience with Terraform, Kubernetes, and AWS networking.
3. **Take‑home design exercise** (no more than 4 hours). You’ll design a modular, reusable Terraform configuration for a multi‑AZ EKS cluster that meets a 99.95 % SLA and adheres to cost‑optimization best practices. We’ll provide the spec; we don’t expect a fully coded solution, just architecture diagrams and pseudo‑code.
4. **Final interview** with the CTO and a senior engineer (45 minutes). Expect a mix of culture fit, leadership style, and a deep dive into the take‑home exercise.
5. **Offer** – if everything aligns, you’ll receive an offer within 5 business days after the final interview.

**A final word from our CTO**

> “Infrastructure is the skeleton that holds up the experience we promise our customers. When you join us, you’re not just writing code; you’re shaping how millions of users see our product, and you’ll see that impact in the numbers day after day.” – *Michele Ramirez, CTO, ByteForge (also a resident of Brentwood, California)*

---

If you’re a hands‑on engineer who prefers concrete outcomes over vague buzzwords, enjoys automating the boring stuff so that teams can focus on delivering value, and wants to work in a place where your fellow engineers are as honest about challenges as they are about successes, we’d love to hear from you. Apply today and help us build a more resilient, cost‑effective, and human‑centric remote infrastructure, right from Brentwood, California and everywhere else you call home.