Sonalake is a software partnering company that helps our clients realise their product roadmaps. Product design and engineering are at the heart of our business. Our engineering teams work with clients right across the stack: UX, UI design, frontend, backend, analytics, infrastructure, operations, and everything else that goes into delivering great products.

We thrive on variety and are highly adaptable. Our teams are exposed to domains as varied as telecom billing, ad tech, securities-based lending, travel tech analytics, and many more.

Innovation is central to our mission: anticipating future client needs, analysing emerging technologies, and developing new products and services.

We are now seeking to grow our service management capabilities. That’s where you come in!

You will

Identify and create service level indicators (SLIs) using historical data. Create realistic service level objectives (SLOs) in order to meet SLAs
Establish monitoring tools and the system-wide observability necessary to support rapid response during service level interruptions
Participate in on-call activities, high-priority incident response, and disaster recovery activities
Participate in incident retrospectives to improve overall resolution times
Have an “automate first” mentality around incident response, infrastructure, monitoring, and playbooks
Provide technical leadership through mentoring, a commitment to technical excellence, accountability, transparency and skills development
Stay up-to-date with the latest application SRE developments and trends to continually improve internal processes and tooling
Assess current applications and architecture to determine where Site Reliability Engineering methodologies can be meaningfully applied
Create an open, honest, accountable and collaborative team environment, providing timely and meaningful feedback

You may be a fit for this role if you have

5+ years of experience with agile and site reliability engineering practices
Demonstrable operational experience successfully supporting large-scale cloud deployments, including areas such as incident management, on-call, metrics and monitoring, and general observability
Deep understanding of networking: Global Service Load Balancing, BGP, network redundancy, DNS, and routing algorithms
Deep operational experience managing multi-region, multi-cloud, large-scale infrastructure
Experience implementing scalability, availability, and resiliency principles
A track record of applying SRE principles and tools used for cloud-native applications (e.g. Terraform, microservice architecture, Kubernetes, Docker, Istio, Envoy, Skaffold, Spinnaker)

We take pride in being a people-oriented company. Openness and opportunity are really important to us. We build teams that span from experienced leaders to bright graduates and work to develop all of us within our coaching culture.

Site Reliability Engineer
