Dylan Bochman
Sr. Site Reliability Engineer - Technical Incident Manager
Specializing in Reliability, Resilience, and Incident Management, with experience spanning SRE and Product Management at Nvidia, Groq, HashiCorp, and Spotify. Focused on enhancing service availability and streamlining operations in complex AI and cloud environments.
Remote
Currently Employed
Career Goals
I'm currently at Nvidia, where I apply my resilience expertise and incident management skills to strengthen the operations of AI inference infrastructure. My goal remains to empower engineers, drive operational excellence, and cultivate collaborative, blameless engineering cultures.
Core Expertise
Technical Skills
- Action Item Tracking
- Alerting Strategy
- Automated Remediation
- Backstage
- Blameless Retrospectives
- Chaos Engineering
- Crisis Communication
- Customer Communication
- Datadog
- Error Budgets
- Executive Reporting
- Executive Updates
- Facilitation
- Incident Command
- Kubernetes
- Learning Culture
- Observability
- Onboarding
- Prometheus
- Root Cause Analysis
- Runbook Development
- Runbooks
- SLO/SLI Design
- Status Pages
- Synthetic Testing
- Terraform
Let's Connect
Interested in discussing on-call tooling, challenging incidents, or potential opportunities?