Incident Management
Incidents are the cornerstone of operational response when services fail. Pulsimo provides a comprehensive incident management system that tracks the entire lifecycle from detection through resolution.
Overview
When a monitored endpoint fails health checks beyond its configured threshold, Pulsimo automatically creates an incident. This incident becomes the central hub for all information, actions, and collaboration related to that outage. From automatic creation to post-mortem generation, the incident management system guides your team through effective incident response.
Automatic Detection
Incidents are created as soon as a threshold is exceeded, with no manual intervention required
Alert Management
Acknowledge an incident to stop repeat notifications and prevent alert fatigue
Collaboration
Multiple team members can work on the same incident and share investigation notes
Post-Mortems
Automatic report generation with timeline, metrics, and complete audit trail
Incident Lifecycle
Every incident progresses through four defined states:
OPEN
Red Badge
Triggered When:
- Endpoint fails consecutive health checks (threshold exceeded)
- Automatic incident creation, no manual intervention
System Actions:
- ✓ Creates incident record
- ✓ Sends alerts to notification channels
- ✓ Displays in Incidents page
- ✓ Records start time
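The threshold rule above can be sketched as a small counter. This is an illustrative sketch only, not Pulsimo's implementation: the `EndpointMonitor` class, its field names, and the choice to open an incident when the failure count reaches (rather than strictly exceeds) the threshold are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EndpointMonitor:
    """Hypothetical per-endpoint tracker (names are illustrative)."""
    failure_threshold: int                      # assumed: consecutive failures needed
    consecutive_failures: int = 0
    incident_started_at: Optional[datetime] = None

    def record_check(self, healthy: bool, checked_at: datetime) -> bool:
        """Record one health check. Returns True if this check opens an incident."""
        if healthy:
            # A passing check breaks the run of consecutive failures.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        threshold_hit = self.consecutive_failures >= self.failure_threshold
        if threshold_hit and self.incident_started_at is None:
            # Record the start time once; later failures do not open a second incident.
            self.incident_started_at = checked_at
            return True
        return False
```

Counting only consecutive failures means a single passing check resets the run, which keeps one-off blips from opening incidents.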
ACKNOWLEDGED
Yellow Badge
Triggered When:
- A team member with the Member or Admin role clicks the "Acknowledge" button
System Actions:
- ✓ Stops sending repeat alerts
- ✓ Records who acknowledged and when
- ✓ Notifies team of acknowledgement
Best Practice: Acknowledge immediately when starting work and add a note explaining what you're doing
INVESTIGATING
Blue Badge
What This Means:
- Active troubleshooting underway
- Root cause analysis in progress
- Fix being implemented
User Actions Available:
- ✓ Add detailed investigation notes
- ✓ Document findings and troubleshooting steps
- ✓ Attach screenshots or logs
- ✓ Update progress
RESOLVED
Green Badge
System Actions:
- ✓ Marks incident as resolved
- ✓ Calculates total downtime
- ✓ Calculates time-to-resolve (TTR)
- ✓ Updates MTTR metrics
- ✓ Sends recovery notification
User Actions Available:
- ✓ Add resolution notes (what fixed it)
- ✓ Generate post-mortem report
- ✓ Export incident data (JSON)
- ✓ View complete timeline
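The four states can be modeled as a small state machine. The sketch below is illustrative: the state names come from this page, but the allowed transitions are an assumption (in particular, that an incident can move to RESOLVED from any earlier state, for example when the service recovers on its own), and `transition` is a hypothetical helper.

```python
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


# Assumed transition table; the real product may allow other moves.
ALLOWED_TRANSITIONS = {
    IncidentState.OPEN: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED},
    IncidentState.ACKNOWLEDGED: {IncidentState.INVESTIGATING, IncidentState.RESOLVED},
    IncidentState.INVESTIGATING: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: set(),
}


def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Keeping the transitions in one table makes it easy to audit which moves are possible and to reject out-of-order updates.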
How to Acknowledge an Incident
When you receive an alert notification:
- Navigate to the Incidents page in the sidebar
- Click on the open incident
- Click the "Acknowledge" button
- Optionally, add notes about your investigation plan
Acknowledging the incident stops repeat notifications and prevents alert fatigue.
Key Features
MTTR Tracking
Mean Time To Resolution (MTTR) is updated automatically each time an incident is resolved
Complete Audit Trail
Every action timestamped and attributed to specific users
Incident Analytics
Track incident frequency, patterns, and affected services
Post-Mortem Generation
Automated reports with timeline, metrics, and resolution details
Incident Metrics
Pulsimo automatically calculates key metrics for every incident:
| Metric | Description | Example |
|---|---|---|
| MTTR | Mean Time To Resolution - Average time to resolve an incident | 23.5 minutes |
| MTTD | Mean Time To Detect - Average time until an incident is created | 2.3 minutes |
| Total Downtime | Complete duration service was unavailable | 1,410 seconds (23.5 min) |
| Affected Checks | Number of failed health check attempts | 47 failed checks |
| Incident Count (24h) | Total incidents in last 24 hours | 3 incidents |
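As a rough illustration of how the downtime and MTTR figures relate, the sketch below derives both from incident start and resolution timestamps. The function names are hypothetical, and it assumes downtime is measured from the recorded start time to resolution; Pulsimo's exact definitions may differ.

```python
from datetime import datetime, timedelta
from statistics import mean


def downtime_seconds(started_at: datetime, resolved_at: datetime) -> float:
    """Duration of one incident in seconds (assumed: start time to resolution)."""
    return (resolved_at - started_at).total_seconds()


def mttr_minutes(incidents: list[tuple[datetime, datetime]]) -> float:
    """Average resolution time in minutes over (started_at, resolved_at) pairs."""
    return mean(downtime_seconds(s, r) for s, r in incidents) / 60


# Example: a single incident lasting 1,410 seconds gives an MTTR of 23.5 minutes.
start = datetime(2024, 1, 1, 10, 0, 0)
single = [(start, start + timedelta(seconds=1410))]
print(mttr_minutes(single))  # 23.5
```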
Incident Linking & Relationships
Connect related incidents for better root cause analysis. Pulsimo automatically suggests relationships when it detects patterns.
Relationship Types
Duplicate
Same issue reported multiple times
Example: "Website slow" + "App loading slow" = Same database issue
Caused By
One incident is the root cause of another
Example: Database Down → Backend Degraded → Frontend Errors
Blocks
Resolving one requires fixing another first
Example: Database Migration Stuck blocks New Feature Deployment
Parent/Child
Breaking down complex incidents into sub-tasks
Example: System Outage → Database Recovery + Cache Rebuild
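One way to picture these relationships is as typed links between incident IDs. The sketch below is a hypothetical data model, not Pulsimo's schema: `LinkType`, `IncidentLink`, and `root_causes` are illustrative names. It follows "caused by" links from an incident back to the incidents that have no further cause.

```python
from dataclasses import dataclass
from enum import Enum


class LinkType(Enum):
    DUPLICATE = "duplicate"
    CAUSED_BY = "caused_by"
    BLOCKS = "blocks"
    PARENT_CHILD = "parent_child"


@dataclass(frozen=True)
class IncidentLink:
    source_id: str      # for CAUSED_BY: the incident that was caused
    target_id: str      # for CAUSED_BY: the incident that caused it
    link_type: LinkType


def root_causes(incident_id: str, links: list[IncidentLink]) -> set[str]:
    """Follow CAUSED_BY links from incident_id to incidents with no further cause."""
    caused_by: dict[str, set[str]] = {}
    for link in links:
        if link.link_type is LinkType.CAUSED_BY:
            caused_by.setdefault(link.source_id, set()).add(link.target_id)
    roots: set[str] = set()
    seen: set[str] = set()
    stack = [incident_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against accidental cycles in the links
        seen.add(current)
        causes = caused_by.get(current, set())
        if not causes and current != incident_id:
            roots.add(current)
        stack.extend(causes)
    return roots
```

For the cascade example above, linking "frontend" as caused by "backend" and "backend" as caused by "database" makes `root_causes("frontend", links)` return `{"database"}`.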
Automatic Root Cause Analysis
Pulsimo analyzes incidents and automatically detects patterns:
Cascade Failures
Detects dependency chain failures
- Visual graph showing cascade path
- Timeline of impact spread
- Automatic "caused by" links suggested
Common Endpoint
Multiple incidents on same service
- Tracks incident frequency per endpoint
- Suggests infrastructure investigation
- Identifies problematic services
Deployment Related
Incidents after deployments
- Correlates incidents with deploys
- Suggests rollback if needed
- 100% time correlation detection
Time Correlation
Incidents in same time window
- Detects simultaneous outages
- Suggests external factors
- Network/infrastructure investigation
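Time correlation can be approximated by grouping incidents whose start times fall close together. This page does not describe how Pulsimo detects these patterns, so the sketch below is only one plausible approach; `group_by_time_window` and the window size are assumptions.

```python
from datetime import datetime, timedelta


def group_by_time_window(starts: dict[str, datetime], window: timedelta) -> list[list[str]]:
    """Group incident IDs whose start time is within `window` of the previous one."""
    ordered = sorted(starts.items(), key=lambda item: item[1])
    groups: list[list[str]] = []
    previous_start = None
    for incident_id, started_at in ordered:
        if previous_start is not None and started_at - previous_start <= window:
            groups[-1].append(incident_id)   # close enough: same group
        else:
            groups.append([incident_id])     # gap too large: start a new group
        previous_start = started_at
    return groups
```

Groups with two or more incidents are candidates for a shared external cause, such as a network or infrastructure problem.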
Response Performance Metrics
Track how well your team responds to incidents:
Time to Detect
Time to Acknowledge
Time to Identify
Time to Resolve
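The four response metrics can be read as intervals between timestamps on an incident's timeline. The sketch below is a hypothetical illustration: the field names and the reference point for each interval (for example, measuring acknowledgement from incident creation) are assumptions, since this page does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimestamps:
    """Hypothetical timeline markers for one incident."""
    failure_began_at: datetime
    created_at: datetime
    acknowledged_at: datetime
    cause_identified_at: datetime
    resolved_at: datetime


def response_metrics(t: IncidentTimestamps) -> dict[str, timedelta]:
    """Compute the four response intervals under the assumed reference points."""
    return {
        "time_to_detect": t.created_at - t.failure_began_at,
        "time_to_acknowledge": t.acknowledged_at - t.created_at,
        "time_to_identify": t.cause_identified_at - t.created_at,
        "time_to_resolve": t.resolved_at - t.created_at,
    }
```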
Communication Timeline
Every incident has a detailed timeline showing system events and team updates:
Incident created - Backend API health check failed
Checking database logs for connection issues
Service recovered - Health check passed
Best Practices
For On-Call Engineers
- Acknowledge immediately
- Update timeline every 15 minutes
- Link related incidents
- Document as you go
For Team Leads
- Review weekly metrics
- Post-mortem critical incidents
- Validate automatic links
- Update response playbooks
For DevOps/SRE
- Use incident groups for mass outages
- Track cascade patterns
- Monitor false positive rate
- Automate common remediations