● COMPLETED · 2024
PagerDuty & Grafana
Incident Platform
Configured PagerDuty on-call management and real-time Grafana monitoring dashboards, reducing incident response time by 35% and downtime by 20%.
ROLE
IT Operations Engineer
YEAR
2024
TYPE
Infrastructure & Monitoring
COMPANY
M2
35%
Faster Response
20%
Less Downtime
24/7
Monitoring
500+
Users Protected
// the_problem
The Problem
✕ Incident response was entirely reactive: the team only knew about issues when users called to report them.
✕ No structured on-call rotation meant the same people got called for everything at any hour of the day.
✕ No escalation policies meant critical P1 incidents and minor questions were treated identically.
✕ Grafana was installed but completely unconfigured with no actionable dashboards.
✕ Mean time to resolution was high because there was no clear ownership, workflow, or runbook.
// the_solution
The Solution
Configured PagerDuty with structured on-call schedules, escalation chains, and monitoring tool integrations.
Built Grafana dashboards that surface system health proactively before users experience any impact.
Defined clear severity levels P1 through P4 with matching response time SLAs for each.
Integrated PagerDuty with existing monitoring tools for fully automated alert routing and acknowledgment.
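As an illustration of how severity levels can map to response-time SLAs, here is a minimal Python sketch. The P1 through P4 labels come from the project; the acknowledgment targets, the `Incident` class, and the function names are hypothetical, since the actual SLA values are not stated here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical acknowledgment targets. The project defines P1-P4,
# but these specific durations are assumptions for illustration.
ACK_SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


@dataclass
class Incident:
    severity: str          # one of "P1".."P4"
    triggered_at: datetime


def ack_deadline(incident: Incident) -> datetime:
    """Latest time by which the incident should be acknowledged."""
    return incident.triggered_at + ACK_SLA[incident.severity]


def sla_breached(incident: Incident, acknowledged_at: datetime) -> bool:
    """True if acknowledgment came after the SLA deadline."""
    return acknowledged_at > ack_deadline(incident)
```

Keeping the targets in a single table means the response expectation for each severity is defined in one place and can be reviewed alongside the escalation policy.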
// architecture
System Architecture
System Metrics → Grafana Dashboards → Alert Rules → PagerDuty → On-Call Engineer → Resolution
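The "Alert Rules" stage of this flow can be sketched in Python as a threshold check that separates early degradation from a likely outage. This is not Grafana's rule syntax; the function names, thresholds, and the "sustained for N samples" condition are assumptions chosen to show the idea of catching degradation before it becomes an outage.

```python
from typing import Sequence


def classify(value: float, warn_at: float, crit_at: float) -> str:
    """Map a metric reading to a state using two thresholds."""
    if value >= crit_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return "ok"


def should_fire(samples: Sequence[float], warn_at: float, for_samples: int) -> bool:
    """Fire only when the last `for_samples` readings are all at or
    above `warn_at`, so a single spike does not page anyone."""
    if len(samples) < for_samples:
        return False
    return all(v >= warn_at for v in samples[-for_samples:])


if __name__ == "__main__":
    # Hypothetical CPU utilisation readings (percent), oldest first.
    cpu = [70, 85, 88, 91]
    print(classify(cpu[-1], warn_at=80, crit_at=95))
    print(should_fire(cpu, warn_at=80, for_samples=3))
```

Requiring several consecutive readings above the warning threshold trades a short detection delay for fewer false pages, which matters once alerts route to an on-call engineer.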
// features_built
8 Features Built
01
On-Call Schedule Design
Weekly rotating on-call schedules across the IT team ensuring fair distribution and coverage.
02
Escalation Policy Build
Multi-tier escalation — primary to secondary to manager — with time-based auto-escalation triggers.
03
Monitoring Integration
Connected existing monitoring stack to PagerDuty for fully automated alert ingestion and routing.
04
Severity Classification
Defined P1 / P2 / P3 / P4 severity levels with matching response time SLAs and procedures.
05
Grafana Dashboard Build
System health dashboards covering servers, network performance, applications, and key services.
06
Proactive Alert Rules
Threshold-based alerts configured to catch degradation before it becomes a full outage.
07
Incident Runbooks
Documented step-by-step response procedures for the 10 most frequent incident types.
08
Post-Incident Review
Established blameless post-mortem workflow for all P1 and P2 incidents.
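The weekly rotation and time-based escalation described in features 01 and 02 can be sketched as follows. PagerDuty handles both natively through its schedule and escalation policy configuration; this Python version only illustrates the logic, and the roster names, rotation start date, and escalation delays are hypothetical.

```python
from datetime import date
from typing import Sequence

# Hypothetical delays (minutes without acknowledgment) before each tier is paged.
ESCALATION_TIERS = [(0, "primary"), (15, "secondary"), (30, "manager")]


def on_call_for(day: date, roster: Sequence[str], rotation_start: date) -> str:
    """Return who is on call on `day`, rotating through `roster` weekly."""
    weeks_elapsed = (day - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]


def tier_to_page(minutes_unacknowledged: int) -> str:
    """Return the highest tier whose delay has elapsed without an ack."""
    current = ESCALATION_TIERS[0][1]
    for delay, tier in ESCALATION_TIERS:
        if minutes_unacknowledged >= delay:
            current = tier
    return current


if __name__ == "__main__":
    roster = ["alice", "bob", "carol"]   # placeholder names
    start = date(2024, 1, 1)             # assumed first day of the rotation
    print(on_call_for(date(2024, 1, 8), roster, start))
    print(tier_to_page(20))
```

A fixed weekly rotation spreads coverage evenly across the roster, and the tier table makes it explicit how long an unacknowledged alert waits before it reaches the next person.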
// tech_stack
Technology Stack
PagerDuty
On-call scheduling, escalation policies, incident management, and alert routing platform.
Grafana
Real-time system health dashboards, data visualization, and alert rule configuration.
Monitoring Integrations
Connected existing monitoring tools to PagerDuty via native integrations and webhooks.
Microsoft Teams
Integrated PagerDuty incident notifications into team channels for broad visibility.
Snipe-IT
Asset tracking integrated into incident context for faster hardware-related resolution.
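For the webhook-style integrations listed above, a monitoring tool typically sends PagerDuty a JSON event through the Events API v2. The sketch below builds such a trigger event in Python without sending it. The field names follow the Events API v2 format, but the mapping from P1 through P4 to PagerDuty severities, the helper name, and all example values are assumptions, not the configuration used in this project.

```python
import json

# Events API v2 endpoint; the event below would be sent here as an HTTP POST.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Assumed mapping from the project's priority labels to Events API severities.
SEVERITY_MAP = {"P1": "critical", "P2": "error", "P3": "warning", "P4": "info"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        priority: str, dedup_key: str) -> dict:
    """Build an Events API v2 'trigger' event as a plain dict."""
    return {
        "routing_key": routing_key,      # integration key of the PagerDuty service
        "event_action": "trigger",
        "dedup_key": dedup_key,          # repeated alerts with this key are grouped
        "payload": {
            "summary": summary,
            "source": source,
            "severity": SEVERITY_MAP[priority],
        },
    }


if __name__ == "__main__":
    event = build_trigger_event(
        routing_key="YOUR_INTEGRATION_KEY",      # placeholder, not a real key
        summary="Disk usage above 90% on file server",
        source="fileserver-01",                  # hypothetical host name
        priority="P2",
        dedup_key="fileserver-01/disk-usage",
    )
    print(json.dumps(event, indent=2))
```

Using a stable `dedup_key` per host and check keeps a flapping alert from opening a new incident each time it fires.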
// delivery_timeline
Delivery Timeline
Phase 1
Foundation
PagerDuty account setup, team onboarding, basic on-call schedule configuration.
Phase 2
Escalation
Multi-tier escalation policies, severity level definitions, SLA configuration.
Phase 3
Monitoring
Grafana dashboard build, alert rule configuration, monitoring tool integrations.
Phase 4
Process
Runbook documentation, post-mortem process establishment, full team training.
// key_outcomes
Key Outcomes
35% faster incident response, measured from alert to engineer acknowledgment
20% reduction in overall system downtime through proactive monitoring and early detection
Fair on-call distribution eliminated the burnout caused by ad-hoc escalation to the same people
Clear severity classification ended the all-hands-for-minor-issues anti-pattern
Grafana dashboards gave the team operational visibility they had never had before
Documented runbooks reduced mean time to resolution for the most common incident types
500+ users across M2 directly benefited from the improvement in system reliability and uptime