Technical Duty Officer – Sr Site Reliability Engineer

Chicago, IL, US

Job Description / Skills Required

The Technical Duty Officer (TDO) at Groupon is a senior technical leadership role that integrates Site Reliability Engineering and Solutions Architecture with ITIL-based change and incident management. During any given 8-hour shift, the TDO acts as Incident Commander, change manager, and senior technical resource, responsible for preventing, identifying, triaging, documenting, investigating, mitigating, and recovering from site/service-impacting incidents across Groupon’s 600+ globally dispersed services. TDOs are also responsible for assessing, approving, and scheduling risky changes, load testing, and maintenance windows; for coordinating and driving Incident Reviews and best practices; and for overseeing Problem Management (Service Actions, Top Ops issues). The TDO is responsible for global site availability and reliability and for identifying and resolving all site/business-impacting events worldwide.

A strong candidate should have expert knowledge of the technologies and best practices used in a high-volume, highly virtualized Linux/KVM/Docker/NGINX environment that serves 62 million customers and merchants and integrates over 600 geo-distributed services and platforms behind the NAM, LATAM, EMEA, and APAC eCommerce websites and their supporting business/marketing services.

We’re looking for an individual who enjoys solving complex problems, who can act independently, and who can stay calm and focused in high-stress situations, while driving paths to mitigation and restoration with SMEs and teams across the entire global Groupon development and operations environments. The position requires quick learning, constant attentiveness, critical thinking and decision-making skills, and continuous situational awareness.

The position requires a strong but persuasive personality, a broad scope of knowledge with expertise in two or more major knowledge areas (networking, storage, kernels, database/object-store, caching/proxying, SOA API, message buses), and broad overall experience in large production eCommerce environments. Equally important is the ability to perform some project management and ticketing work (ticket and action item updates, projects requiring some programming/scripting, some project management/scheduling, and follow-up meetings). TDOs are usually too busy with day-to-day activities during working shifts to perform deep-dive analyses or spend a lot of time on projects/coding, so this isn’t a DevOps role per se, but it does require a DevOps background.

Responsibilities:
As part of a Global TDO team, you would work rotating morning and evening shifts, M-F, and be on call one weekend out of every five. The shifts are set up to allow for project days as well as on-duty days. You will be required to work out of one of the Groupon regional offices (PA, SF, Seattle, CHI, or NY) at least 3 days a week.

Prioritize the focus of the Global Systems Engineering Center staff and SRE resources for both routine and significant site events, planned maintenance windows, or risky changes.

Review, approve and schedule all risky changes and maintenance window activities.

Take ownership of all site- or service-impacting events until they are mitigated/recovered or handed off, including all documentation and action items.

During a crisis or service-impacting event, lead the effort with SOC, SRE, and OPS/Development SMEs to triage, investigate, mitigate, and recover.

Manage real-time communications during service outages with both technical and non-technical audiences.

Evangelize Best Practices to the rest of the company.

Follow up with service owners on Incident Review action items, change approvals, and general requests for assistance.

Help develop policies and procedures that improve overall production stability.

Design and create tools to manage the site and services.

Drive and/or participate in daily Site Status meetings and Incident Reviews to prevent incidents and improve overall product quality and stability.

Foster relationships with development teams and technology leaders across the company.

Groupon provides a global marketplace where people can buy just about anything, anywhere, anytime. We’re enabling real-time commerce across an expanding range of categories including local businesses, travel destinations, consumer products, and live or lively events. At the same time, we are providing advertising options and tools that merchants can use to grow and manage their businesses. Culturally, we believe that great people make great companies and that starting with the customer and working backward moves us forward. Community matters to us on an internal, local and global scale—it’s fundamental to our company’s growth and to the well-being of the world at large. We also value self-awareness, candor, lunch and WiFi. If we match with you, please apply to join us.