Background : Decision Analytics has just laid the foundations to move from a collection of individual products towards a consolidated cloud-based platform delivering industry-leading capabilities across data, analytics and decisioning, embedding the latest technologies such as machine learning, Big Data and AI.
We have good momentum: our core product's revenue has trebled over the last 3 years, and if we were a stand-alone business we would probably qualify as a unicorn.
As part of the next phase in our growth, we are looking to expand our Site Reliability Engineering team to offer round-the-globe cover.
As an organisation, we are fully convinced that everything should be automated and that software should run software, and we believe in the Site Reliability Engineering model.
We have established a platform using cutting-edge technology, such as Kubernetes, containers, pipelines and monitoring. The candidate will be a forward-looking engineer with an understanding of how SRE will enable operations in the future.
You will have broad operations and automation interests, will not shy away from the operational aspects of life, and will understand that the best way to build reliability is to break things often.
The ideal candidate will have either experience of operations, a passion for automation and an interest in software development, or experience of software development, a passion for automation and an interest in operational excellence.
If you have incident management skills and are able to manage rationally and calmly during a crisis, that would be a bonus.
You will be expected to work occasional peak weekends and to meet some on-call requirements. This is the beginning of a growing team and we are looking for individuals to grow with it.
As a team lead you will have a small team (4) of SREs to manage, coach, mentor and inspire. A necessary part of the role will be working with other SRE teams, regional stakeholders and a global IT team, ensuring that stakeholders are kept in the loop and that local processes are adhered to.
Job Responsibilities : Primary Accountabilities :
Uptime of Experian One, Experian’s Cloud SaaS offering for Decision Analytics.
Significant Demands :
Monitoring and Alerting of our platform
Responding to incidents and restoring service
Over time, gaining a good enough understanding of the systems to efficiently triage issues and find owners for problem resolution
An ability to identify an issue or a manual process and ensure that it never occurs again
Incident management; able to co-ordinate others and be co-ordinated during service disruptions with a focus on restoring availability
Ability to write complex queries using various tools
Reviewing systems designs and implementations to identify resiliency, scalability and monitoring issues prior to implementation
Strong knowledge of Kubernetes, Infrastructure as Code and high-availability principles.
Excellent communication skills in English with colleagues across the globe.
Able to lead and mentor more junior colleagues both technically and from a line management perspective
Working Practices and Relationships :
Strong relationships with other members of the SRE team, primarily based in Kuala Lumpur but also in London, Arizona and Sofia
Working relationships with colleagues in other departments and with third parties who support backing applications.
Collaborative relationships with developers, security and architects to influence them to build resilient, maintainable solutions
Proficiency in one programming or scripting language and willingness to apply software development best practices to an operational role
Direct experience of supporting complex, highly scaled systems in production
Linux knowledge, with experience troubleshooting issues and predicting them in advance
Networking, troubleshooting and monitoring
Cloud Native application designs for high performance, scalability and resilience
Incident Management and co-ordination, Blameless PIRs
Kubernetes, OpenShift, Splunk, Dynatrace, ThousandEyes, ServiceNow, Jira, Jenkins, Python, Prometheus
Java, Cassandra, Redis, Rundeck, MongoDB, Apigee, Okta, PostgreSQL, AWS, Azure, GCP
Infrastructure as Code, GitOps
Key Behaviours :
Excellent communication skills. Written and verbal fluency in English is required
Highly organised, with good attention to detail
Working across boundaries: geographical, team, language and cultural
Curious and willing and able to learn new technologies and practices
Cloud aware: you understand how cloud technologies differ from other technical approaches and are able to explain the differences to others.
Lives and breathes availability and operational excellence in technology
Demonstrates an ability to lead a team, manage others through coaching and develop their careers.
Is this you?
You strive to remove repetitive tasks from your daily existence
You are a keen follower of technology trends
You believe that software is to be used, not admired
You solve for the future as well as the immediate
You empower others to deliver
You develop trust, make conflict constructive, create commitment, and drive accountability and results
You are articulate, clear, concise, and you can tailor your approach to the audience
You can manage stakeholders at all levels and influence decision making