Senior Site Reliability Engineer

  • New York

  • $180,000 Per Year

  • Permanent

Reference: 44820

Business Sector: Infrastructure

Description

Senior Site Reliability Engineer
Terms: Direct Hire
Work Location: 4 days onsite, NYC

Join a high-impact engineering environment where reliability and scalability are mission-critical. This role is ideal for a seasoned Site Reliability Engineer who thrives on building resilient, cloud-native systems and driving operational excellence at scale. You’ll play a key role in shaping modern infrastructure practices, enhancing system performance, and ensuring business continuity across complex production environments. Working cross-functionally with engineering, security, and platform teams, you’ll influence architecture decisions, standardize automation, and lead initiatives that directly improve uptime, efficiency, and recovery readiness.
Key Responsibilities
Reliability & Operational Excellence
  • Drive service reliability by defining and managing SLOs/SLIs, error budgets, and performance benchmarks
  • Build and enhance observability frameworks, including monitoring, logging, tracing, and alerting, to improve system visibility and reduce downtime
  • Lead incident management processes, including on-call optimization, root cause analysis, and continuous improvement initiatives
  • Partner with development teams to optimize system performance, capacity planning, and fault tolerance

Cloud Infrastructure & Architecture
  • Architect and support highly available, fault-tolerant cloud environments across distributed systems
  • Implement scalable infrastructure patterns such as autoscaling, load balancing, backup strategies, and data replication
  • Promote cloud governance standards, including access controls, environment design, and operational guardrails

Automation, IaC & DevOps Practices
  • Develop and maintain Infrastructure as Code solutions to enable consistent, repeatable infrastructure deployments
  • Build and optimize CI/CD pipelines to support secure, automated software delivery
  • Champion DevOps best practices, including automated testing, immutable infrastructure, and progressive deployment strategies
  • Ensure consistency across environments and proactively manage configuration drift

Disaster Recovery & Business Continuity
  • Define and align recovery objectives (RTO/RPO) with business and technical stakeholders
  • Design and implement disaster recovery strategies, including failover mechanisms, backup solutions, and validation processes
  • Lead structured disaster recovery testing and ensure continuous improvement based on findings
  • Maintain comprehensive recovery documentation, runbooks, and readiness reporting

Required Qualifications
  • 7+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or similar roles supporting production systems
  • Strong expertise in observability tools and monitoring frameworks
  • Hands-on experience designing and operating cloud-based systems, particularly within AWS environments
  • Proficiency with Infrastructure as Code tools (e.g., Terraform, CloudFormation, or similar)
  • Experience building and maintaining CI/CD pipelines and automation frameworks
  • Demonstrated experience implementing disaster recovery strategies and validating recovery objectives
  • Experience with container orchestration platforms such as Kubernetes
  • Strong understanding of Linux systems, networking fundamentals, and distributed system troubleshooting
  • Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash)
  • Strong documentation and communication skills

Preferred Qualifications
  • Exposure to multi-cloud environments (e.g., Azure or Oracle Cloud)
  • Experience with service mesh architectures, API gateways, and distributed tracing tools
  • Familiarity with observability standards such as OpenTelemetry
  • Knowledge of cloud security and compliance best practices, including IAM and secrets management
  • Experience with advanced deployment strategies (e.g., canary releases, blue/green deployments)
  • Relevant certifications in cloud or Kubernetes technologies
  • Experience with modern tools such as Argo CD or cluster autoscaling solutions