Incident Management
Incidents are the cornerstone of operational response when services fail. Pulsimo provides a comprehensive incident management system that tracks the entire lifecycle from detection through resolution.
Overview
When a monitored endpoint fails health checks beyond its configured threshold, Pulsimo automatically creates an incident. This incident becomes the central hub for all information, actions, and collaboration related to that outage. From automatic creation to post-mortem generation, the incident management system guides your team through effective incident response.
Automatic Detection
Incidents are created automatically as soon as failure thresholds are exceeded; no manual intervention is required
Alert Management
Acknowledging an incident stops repeat notifications and helps prevent alert fatigue
Collaboration
Multiple team members can work on the same incident and share investigation notes
Post-Mortems
Reports are generated automatically with a timeline, metrics, and a complete audit trail
Incident Lifecycle
Every incident progresses through four defined states:
OPEN
Red Badge
Triggered When:
- Endpoint fails consecutive health checks (threshold exceeded)
- Automatic incident creation, no manual intervention
System Actions:
- ✓ Creates incident record
- ✓ Sends alerts to notification channels
- ✓ Displays in Incidents page
- ✓ Records start time
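The trigger for the OPEN state can be sketched as a consecutive-failure counter. This is a minimal illustration, not Pulsimo's implementation: the `FailureTracker` class, its `threshold` field, and the reset-on-success behaviour are all assumptions. Whether the real system opens an incident at the threshold or only past it is not specified here; the sketch opens one once the failure count exceeds the threshold.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Counts consecutive failed health checks for one endpoint (hypothetical)."""

    threshold: int                 # assumed: the endpoint's configured failure threshold
    consecutive_failures: int = 0
    incident_open: bool = False

    def record_check(self, healthy: bool) -> bool:
        """Record one health check; return True if this check opens an incident."""
        if healthy:
            # Assumption: any successful check resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures > self.threshold and not self.incident_open:
            self.incident_open = True
            return True
        return False


tracker = FailureTracker(threshold=3)
checks = [False, False, True, False, False, False, False]
results = [tracker.record_check(ok) for ok in checks]
```

In this run the successful third check resets the streak, so only the final check (the fourth consecutive failure) opens an incident.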
ACKNOWLEDGED
Yellow Badge
Triggered When:
A team member with the Member or Admin role clicks the "Acknowledge" button
System Actions:
- ✓ Stops sending repeat alerts
- ✓ Records who acknowledged and when
- ✓ Notifies team of acknowledgement
Best Practice: Acknowledge immediately when starting work and add a note explaining what you're doing
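One way to picture how acknowledgement stops repeat alerts is a check that runs before each reminder is sent. The function below is a hedged sketch: the fixed reminder interval (`repeat_every`) and the function name are assumptions, since the repeat schedule is not described here.

```python
from datetime import datetime, timedelta


def should_send_repeat_alert(
    state: str,
    last_alert_at: datetime,
    now: datetime,
    repeat_every: timedelta,  # hypothetical reminder interval
) -> bool:
    """Return True only for unacknowledged incidents whose reminder is due."""
    if state != "OPEN":
        # Acknowledged, investigating, or resolved: no more repeat alerts.
        return False
    return now - last_alert_at >= repeat_every


last_alert = datetime(2024, 1, 1, 12, 0, 0)   # illustrative timestamps
now = datetime(2024, 1, 1, 12, 10, 0)
interval = timedelta(minutes=5)

open_due = should_send_repeat_alert("OPEN", last_alert, now, interval)
acked_due = should_send_repeat_alert("ACKNOWLEDGED", last_alert, now, interval)
```

Here `open_due` is `True` and `acked_due` is `False`: the same overdue reminder is suppressed once the incident has been acknowledged.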
INVESTIGATING
Blue Badge
What This Means:
- Active troubleshooting underway
- Root cause analysis in progress
- Fix being implemented
User Actions Available:
- ✓ Add detailed investigation notes
- ✓ Document findings and troubleshooting steps
- ✓ Attach screenshots or logs
- ✓ Update progress
RESOLVED
Green Badge
System Actions:
- ✓ Marks incident as resolved
- ✓ Calculates total downtime
- ✓ Calculates time-to-resolve (TTR)
- ✓ Updates MTTR metrics
- ✓ Sends recovery notification
User Actions Available:
- ✓ Add resolution notes (what fixed it)
- ✓ Generate post-mortem report
- ✓ Export incident data (JSON)
- ✓ View complete timeline
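The four states can be modelled as a small state machine. The sketch below assumes incidents move strictly forward, one state at a time; whether Pulsimo allows other transitions (for example, resolving directly from OPEN) is not stated here, so treat `NEXT_STATE` and `advance` as illustrative names only.

```python
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


# Assumption: strictly linear progression through the four states.
NEXT_STATE = {
    IncidentState.OPEN: IncidentState.ACKNOWLEDGED,
    IncidentState.ACKNOWLEDGED: IncidentState.INVESTIGATING,
    IncidentState.INVESTIGATING: IncidentState.RESOLVED,
}


def advance(state: IncidentState) -> IncidentState:
    """Return the state that follows `state`, or raise if it is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]


# Walk one incident through its whole lifecycle.
state = IncidentState.OPEN
history = [state.name]
while state in NEXT_STATE:
    state = advance(state)
    history.append(state.name)
```

Keeping the transitions in a single table makes it easy to see, and to change, which moves are legal.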
How to Acknowledge an Incident
When you receive an alert notification:
- Navigate to the Incidents page in the sidebar
- Click on the open incident
- Click the "Acknowledge" button
- Optionally, add notes about your investigation plan
Acknowledging the incident stops repeat notifications and helps prevent alert fatigue.
Key Features
MTTR Tracking
Time-to-resolve is calculated for every incident and rolled into Mean Time To Resolution (MTTR) automatically
Complete Audit Trail
Every action is timestamped and attributed to a specific user
Incident Analytics
Track incident frequency, patterns, and affected services
Post-Mortem Generation
Automated reports with timeline, metrics, and resolution details
Incident Metrics
Pulsimo automatically calculates key metrics for each incident and across incidents:
| Metric | Description | Example |
|---|---|---|
| MTTR | Mean Time To Resolution - Average resolution time across incidents | 23.5 minutes |
| MTTD | Mean Time To Detect - Time until the incident is created | 2.3 minutes |
| Total Downtime | Complete duration service was unavailable | 1,410 seconds (23.5 min) |
| Affected Checks | Number of failed health check attempts | 47 failed checks |
| Incident Count (24h) | Total incidents in last 24 hours | 3 incidents |
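The arithmetic behind the downtime and MTTR rows can be sketched as follows. The timestamps are made-up sample data, the function names are hypothetical, and the sketch assumes both downtime and time-to-resolve are measured from the recorded start time to the resolution time.

```python
from datetime import datetime
from statistics import mean


def downtime_seconds(started_at: datetime, resolved_at: datetime) -> float:
    """Total downtime for one incident, in seconds."""
    return (resolved_at - started_at).total_seconds()


def mttr_minutes(incidents: list[tuple[datetime, datetime]]) -> float:
    """Mean time-to-resolve across incidents, in minutes."""
    return mean(downtime_seconds(start, end) for start, end in incidents) / 60


# Illustrative (start, resolved) pairs, not real incident data.
incidents = [
    (datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 23, 30)),  # 1,410 s
    (datetime(2024, 1, 2, 9, 0, 0), datetime(2024, 1, 2, 9, 10, 0)),     # 600 s
    (datetime(2024, 1, 3, 18, 0, 0), datetime(2024, 1, 3, 18, 20, 0)),   # 1,200 s
]

first_downtime = downtime_seconds(*incidents[0])
average = mttr_minutes(incidents)
```

The first incident reproduces the table's example: 1,410 seconds divided by 60 is 23.5 minutes. MTTR is a mean, so it only matches a single incident's downtime when that incident is the only one in the window.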
Best Practices
📝 Document Everything: Add investigation notes as you troubleshoot. Future you (and your team) will thank you when writing the post-mortem.
⚡ Acknowledge Quickly: Acknowledge incidents immediately when you start working on them to stop alert spam and signal to others that it's being handled.
🔍 Review Patterns: Regularly review incident history to identify recurring issues and proactively address root causes.
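A pattern review can start with something as simple as counting incidents per endpoint. The sketch below works on a hypothetical JSON export; the field names (`endpoint`, `state`) and the endpoint names are illustrative and are not taken from Pulsimo's export schema.

```python
import json
from collections import Counter

# Hypothetical export: field names and values are assumptions for illustration.
exported = """
[
  {"endpoint": "checkout-api", "state": "RESOLVED"},
  {"endpoint": "auth-service", "state": "RESOLVED"},
  {"endpoint": "checkout-api", "state": "RESOLVED"},
  {"endpoint": "checkout-api", "state": "OPEN"}
]
"""

incidents = json.loads(exported)

# Count incidents per endpoint and flag those with two or more.
per_endpoint = Counter(item["endpoint"] for item in incidents)
recurring = [name for name, count in per_endpoint.most_common() if count >= 2]
```

Endpoints that show up in `recurring` are reasonable first candidates for root-cause work.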