Sonalake is a software partnering company that helps our clients realise their product roadmaps. Product design and engineering are at the heart of our business. Our engineering teams work with clients right across the stack: UX, UI design, frontend, backend, analytics, infrastructure, operations, and everything else that goes into delivering great products.
We thrive on variety and are highly adaptable. Our teams are exposed to domains as varied as telecom billing, ad tech, securities-based lending, travel tech analytics, and many more.
Innovation is central to our mission: anticipating future client needs, analysing emerging technologies, and developing new products and services.
We are now seeking to grow our service management capabilities. That’s where you come in!
- Identify and create service level indicators (SLIs) using historical data, and set realistic service level objectives (SLOs) in order to meet SLAs
- Establish monitoring tools and the system-wide observability necessary to support rapid response during service level interruptions
- Participate in on-call activities, high-priority incident response, and disaster recovery activities
- Participate in incident retrospectives to improve overall resolution times
- Have an “automate first” mentality around incident response, infrastructure, monitoring, and playbooks
- Provide technical leadership through mentoring, a commitment to technical excellence, accountability, transparency and skills development
- Stay up-to-date with the latest application SRE developments and trends to continually improve internal processes and tooling
- Assess current applications and architecture to determine where Site Reliability Engineering methodologies can be meaningfully applied
- Create an open, honest, accountable and collaborative team environment, providing timely and meaningful feedback
You may be a fit for this role if you have:
- 5+ years of experience with agile and site reliability engineering practices
- Demonstrable operational experience successfully supporting large-scale cloud deployments, including areas such as incident management, on-call, metrics and monitoring, and general observability
- Deep understanding of networking: global server load balancing (GSLB), BGP, network redundancy, DNS, and routing algorithms
- Deep operational experience managing large-scale, multi-region, multi-cloud infrastructure
- Experience implementing scalability, availability, and resiliency principles
- A track record of applying SRE principles and tools to cloud-native applications (e.g. Terraform, microservice architecture, Kubernetes, Docker, Istio, Envoy, Skaffold, Spinnaker)
We take pride in being a people-oriented company. Openness and opportunity are really important to us. We build teams that range from experienced leaders to bright graduates, and we work to develop everyone within our coaching culture.