
Senior Site Reliability Engineer

New York
$180,000 Per Year
Permanent

Reference: 44820

Business Sector: Infrastructure

Description

Senior Site Reliability Engineer
Terms: Direct Hire
Work Location: 4 days onsite, NYC

Join a high-impact engineering environment where reliability and scalability are mission-critical. This role is ideal for a seasoned Site Reliability Engineer who thrives on building resilient, cloud-native systems and driving operational excellence at scale. You’ll play a key role in shaping modern infrastructure practices, enhancing system performance, and ensuring business continuity across complex production environments. Working cross-functionally with engineering, security, and platform teams, you’ll influence architecture decisions, standardize automation, and lead initiatives that directly improve uptime, efficiency, and recovery readiness.
Key Responsibilities
Reliability & Operational Excellence

Drive service reliability by defining and managing SLOs/SLIs, error budgets, and performance benchmarks
Build and enhance observability frameworks, including monitoring, logging, tracing, and alerting, to improve system visibility and reduce downtime
Lead incident management processes, including on-call optimization, root cause analysis, and continuous improvement initiatives
Partner with development teams to optimize system performance, capacity planning, and fault tolerance

Cloud Infrastructure & Architecture

Architect and support highly available, fault-tolerant cloud environments across distributed systems
Implement scalable infrastructure patterns such as autoscaling, load balancing, backup strategies, and data replication
Promote cloud governance standards, including access controls, environment design, and operational guardrails

Automation, IaC & DevOps Practices

Develop and maintain Infrastructure as Code solutions to enable consistent, repeatable infrastructure deployments
Build and optimize CI/CD pipelines to support secure, automated software delivery
Champion DevOps best practices, including automated testing, immutable infrastructure, and progressive deployment strategies
Ensure consistency across environments and proactively manage configuration drift

Disaster Recovery & Business Continuity

Define and align recovery objectives (RTO/RPO) with business and technical stakeholders
Design and implement disaster recovery strategies, including failover mechanisms, backup solutions, and validation processes
Lead structured disaster recovery testing and ensure continuous improvement based on findings
Maintain comprehensive recovery documentation, runbooks, and readiness reporting

Required Qualifications

7+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or similar roles supporting production systems
Strong expertise in observability tools and monitoring frameworks
Hands-on experience designing and operating cloud-based systems, particularly within AWS environments
Proficiency with Infrastructure as Code tools (e.g., Terraform, CloudFormation, or similar)
Experience building and maintaining CI/CD pipelines and automation frameworks
Demonstrated experience implementing disaster recovery strategies and validating recovery objectives
Experience with container orchestration platforms such as Kubernetes
Strong understanding of Linux systems, networking fundamentals, and distributed system troubleshooting
Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash)
Strong documentation and communication skills

Preferred Qualifications

Exposure to multi-cloud environments (e.g., Azure or Oracle Cloud)
Experience with service mesh architectures, API gateways, and distributed tracing tools
Familiarity with observability standards such as OpenTelemetry
Knowledge of cloud security and compliance best practices, including IAM and secrets management
Experience with advanced deployment strategies (e.g., canary releases, blue/green deployments)
Relevant certifications in cloud or Kubernetes technologies
Experience with modern tools such as ArgoCD or cluster autoscaling solutions


Consultant

Adrian Kinnersley