
Senior Site Reliability Engineer (SRE)
As a Senior SRE at Salla, you will lead reliability initiatives, handle complex incidents, improve platform performance, and guide engineering teams toward building resilient systems. You will also participate in the on-call rotation as part of our commitment to platform reliability.
Reliability & Incident Management
• Lead high-severity incident response and drive post-incident reviews.
• Troubleshoot complex issues across applications, infrastructure, and networks.
• Improve MTTR through better monitoring, alerts, and diagnostic tooling.
• Participate in the on-call rotation supporting production systems.
Performance & Scalability
• Identify and resolve performance bottlenecks and scaling challenges.
• Conduct load testing and capacity planning for high-traffic scenarios.
Infrastructure & Operations
• Enhance cloud-native infrastructure, deployment processes, and automation.
• Improve resilience, fault-tolerance, and recovery mechanisms across systems.
Observability
• Build and refine dashboards, alerts, metrics, logs, and traces.
• Define SLIs/SLOs and improve visibility into system behavior.
Tooling & Automation
• Develop tools that reduce operational toil and increase reliability.
• Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows.
Collaboration
• Work closely with engineering teams to ensure services are robust and production-ready.
• Mentor engineers on reliability, debugging, and operational best practices.
Bonus Skills
• Background in large-scale, high-traffic systems.
• Experience with fault-tolerant design, DR, and HA patterns.
• Familiarity with SLOs, SLIs, and error budgets.
Location Preference
• Candidates located within GMT 0 to +6 time zones are preferred to align with team collaboration and on-call coverage.
Requirements
• Strong experience with Kubernetes, service mesh technologies, and cloud platforms (AWS, GCP, or Azure).
• Deep understanding of Linux, networking, distributed systems, and load balancing.
• Hands-on experience with Terraform or similar Infrastructure-as-Code tools.
• Experience with observability platforms such as Prometheus, Grafana, Loki, Mimir, Elastic, or equivalent.
• Proficiency in scripting or programming languages such as Bash, Python, or Go.
• Experience with CI/CD pipelines and GitOps practices.
• Strong debugging, incident response, and performance analysis skills.
