Site Reliability Engineer

Job Type

Full-Time

About the Role

Site Reliability Engineer — Position Summary

Adroit Affine is looking for a seasoned Site Reliability Engineer to own the reliability, performance and scalability of production systems serving enterprise clients across multiple industries and geographies. You will define and track SLOs, build automation to eliminate toil, lead incident response and partner with engineering teams to make services observable, resilient and self-healing.

Key Responsibilities
• Define, measure and report on SLOs, SLAs and error budgets for critical production services.
• Build and maintain observability infrastructure: metrics (Prometheus/Thanos), logging (ELK/Loki) and distributed tracing (Jaeger/Tempo).
• Lead incident triage, establish communication protocols and facilitate thorough post-incident reviews (PIRs).
• Identify and eliminate toil through automation using Python, Go or Bash.
• Partner with engineering teams during design to bake reliability into new services.
• Manage and optimise Kubernetes clusters (EKS, GKE or AKS) and related cloud infrastructure.
• Define capacity planning models and lead proactive performance testing and chaos engineering.
• Maintain and improve runbooks, incident response playbooks and SRE documentation.

Required Qualifications
• 4+ years of experience in SRE, platform engineering or senior DevOps roles supporting production systems.
• Deep understanding of Linux systems, networking and distributed systems fundamentals.
• Strong proficiency with Kubernetes in production (scheduling, networking, storage, security policies).
• Hands-on experience with Prometheus, Grafana, ELK, Jaeger or equivalents.
• Scripting / automation skills in Python, Go or Bash.
• In-depth infrastructure experience with at least one major cloud provider (AWS, GCP or Azure).
• Demonstrated experience managing on-call rotations and high-severity incident response.

Preferred Qualifications
• CKA/CKAD or a professional-level AWS, Google Cloud or Azure certification.
• Experience with chaos engineering tools (Chaos Monkey, LitmusChaos, Gremlin).
• Knowledge of eBPF-based observability (Cilium, Pixie).
• Familiarity with GitOps (ArgoCD, Flux) and progressive delivery (Flagger, Argo Rollouts).

Compensation & Benefits
• Salary: $75,000 – $125,000 based on experience.
• Comprehensive health, dental and vision insurance.
• On-call compensation and recognition program.
• Cloud / SRE certification reimbursement.
• Remote-friendly, with a modern office in Phoenix, AZ.
• High-trust environment with significant autonomy and ownership.

Requirements

• 4+ years of experience in SRE, platform engineering or senior DevOps roles supporting production systems.
• Deep Linux, networking and distributed systems knowledge.
• Strong Kubernetes proficiency; hands-on experience with Prometheus, Grafana, ELK/Loki and Jaeger.
• Python, Go or Bash scripting; in-depth infrastructure experience with AWS, GCP or Azure.
• Demonstrated on-call and high-severity incident response experience.

About the Company

Adroit Affine is a premier IT solutions and staffing company headquartered in Phoenix, AZ. We partner with Fortune 100 enterprises globally, delivering cloud transformation, cybersecurity, data analytics, DevOps and IT staffing services. Our team of 100+ trained professionals brings deep technical expertise and a commitment to measurable outcomes. We offer competitive compensation, certification reimbursement, a professional development budget and a collaborative culture where every engineer has the opportunity to make a real impact.
