Cloud Engineer - Chameleon Technologies
Job #2464: Chameleon Technologies is searching for three (3) Cloud Engineers for a contract-to-hire opportunity with a public company. We are seeking reliability engineers who will be accountable for a Live Site Production environment. You will be responsible for the fidelity of the alerts that come through ServiceNow Event Management and for the resolution of incidents, whether you resolve them directly or escalate them to Engineering.
The Cloud Engineer will also be accountable for automating alert rules, tuning alerts (by updating, editing, and authoring event management rules), and maintaining alert fidelity in the Event Management module.
This company believes that Cloud Engineers are the heart and soul of its superb service levels and industry-leading customer satisfaction, and are core to its continued growth. It empowers Cloud Engineers to deliver and manage its services with high availability and stellar performance at cloud speed and cloud scale.
Specifically, we are searching for someone who is enthusiastic about cloud services, brings fresh ideas, demonstrates a unique and informed viewpoint, and enjoys collaborating with a cross-functional team to develop real-world solutions and positive customer experiences. We want individuals who constantly look for ways to improve services, design solutions, and automate responses to events.
- Run the delivery of services via the production environment through effective monitoring and by taking a holistic end-to-end perspective of system health across the global live site environment.
- Drive efficiency and take ownership of the end-to-end workflow, response time, relief time, and long-term resolution of each incident that impacts, degrades, or otherwise affects our customers or the underlying infrastructure or applications.
- Initiate and lead cross-team technical and troubleshooting bridges for complex, high-impact incidents to drive immediate customer resolution.
- Lead and drive forensic investigations and root cause eradication. Identify short and long-term action items to mitigate and eliminate faults.
- Build software and systems that manage platform infrastructure and applications.
- Improve reliability, quality, and time-to-market across our suite of software solutions
- Measure service performance and build analytics with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational oversight and direction for multiple large distributed software applications, across internationally located datacenters hosting our infrastructure.
- Analyze and respond to alerts in real time, promote alerts to incidents (or create automation rules that promote them), and drive immediate relief for high-priority issues.
- Drive and lead Major Incident bridges as lead Crisis Manager to quickly resolve complex, high-impact, or highly visible incidents.
- Author and edit knowledge base articles for frequent symptoms and alerts. Automate common response actions.
- Regularly review, tune, and regulate alerts from disparate systems. Author event rules to build hierarchies, correlate across configuration items, and reduce noise to ensure fidelity of signal and, in turn, service levels.
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Collaborate across Engineering and Development teams for Security, Disaster Recovery, Virtual Desktop, Desktop as a Service, Hybrid Cloud, and Distributed Storage requirements and buildout as we add and extend services.
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and orchestration
- Balance feature development speed and reliability with well-defined service level objectives
- Act as the gatekeeper who ensures that no planned changes are permitted on a configuration item while a service-impacting event is in progress on that same item.
Required Skills and Qualifications
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
- Familiarity with core internet services and infrastructure (DNS, FTP, SMTP, TCP/UDP, database technologies, CDN, hypervisors, storage, VPN, and application servers).
- Experience with distributed storage technologies like NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
- Experience with cloud-based incident, change, and problem management processes and familiarity with operating at cloud speed.
- Direct experience and enthusiasm for working in an interrupt-driven environment.
- Bachelor's degree in Computer Science or Information Technology; other technical or scientific disciplines considered.
- Previous success in engineering and service support
- Previous experience with monitoring and management tools like Prometheus, ELK, and Grafana
- Coding experience beyond simple scripts
- Previous experience in large scale or hyper-growth cloud environments