Senior Site Reliability Engineer - New York - $180,000 Per Year - Permanent
Reference: 44820
Business Sector: Infrastructure
Description
Terms: Direct Hire
Work Location: 4 days onsite, NYC
Join a high-impact engineering environment where reliability and scalability are mission-critical. This role is ideal for a seasoned Site Reliability Engineer who thrives on building resilient, cloud-native systems and driving operational excellence at scale. You’ll play a key role in shaping modern infrastructure practices, enhancing system performance, and ensuring business continuity across complex production environments. Working cross-functionally with engineering, security, and platform teams, you’ll influence architecture decisions, standardize automation, and lead initiatives that directly improve uptime, efficiency, and recovery readiness.
Key Responsibilities
Reliability & Operational Excellence
- Drive service reliability by defining and managing SLOs/SLIs, error budgets, and performance benchmarks
- Build and enhance observability frameworks, including monitoring, logging, tracing, and alerting, to improve system visibility and reduce downtime
- Lead incident management processes, including on-call optimization, root cause analysis, and continuous improvement initiatives
- Partner with development teams to optimize system performance, capacity planning, and fault tolerance
Cloud Infrastructure & Architecture
- Architect and support highly available, fault-tolerant cloud environments across distributed systems
- Implement scalable infrastructure patterns such as autoscaling, load balancing, backup strategies, and data replication
- Promote cloud governance standards, including access controls, environment design, and operational guardrails
Automation, IaC & DevOps Practices
- Develop and maintain Infrastructure as Code solutions to enable consistent, repeatable infrastructure deployments
- Build and optimize CI/CD pipelines to support secure, automated software delivery
- Champion DevOps best practices, including automated testing, immutable infrastructure, and progressive deployment strategies
- Ensure consistency across environments and proactively manage configuration drift
Disaster Recovery & Business Continuity
- Define and align recovery objectives (RTO/RPO) with business and technical stakeholders
- Design and implement disaster recovery strategies, including failover mechanisms, backup solutions, and validation processes
- Lead structured disaster recovery testing and ensure continuous improvement based on findings
- Maintain comprehensive recovery documentation, runbooks, and readiness reporting
Required Qualifications
- 7+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or similar roles supporting production systems
- Strong expertise in observability tools and monitoring frameworks
- Hands-on experience designing and operating cloud-based systems, particularly within AWS environments
- Proficiency with Infrastructure as Code tools (e.g., Terraform, CloudFormation, or similar)
- Experience building and maintaining CI/CD pipelines and automation frameworks
- Demonstrated experience implementing disaster recovery strategies and validating recovery objectives
- Experience with container orchestration platforms such as Kubernetes
- Strong understanding of Linux systems, networking fundamentals, and distributed system troubleshooting
- Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash)
- Strong documentation and communication skills
Preferred Qualifications
- Exposure to multi-cloud environments (e.g., Azure or Oracle Cloud)
- Experience with service mesh architectures, API gateways, and distributed tracing tools
- Familiarity with observability standards such as OpenTelemetry
- Knowledge of cloud security and compliance best practices, including IAM and secrets management
- Experience with advanced deployment strategies (e.g., canary releases, blue/green deployments)
- Relevant certifications in cloud or Kubernetes technologies
- Experience with modern tools such as ArgoCD or cluster autoscaling solutions