Incident Management

Incidents are the cornerstone of operational response when services fail. Pulsimo provides a comprehensive incident management system that tracks the entire lifecycle from detection through resolution.

Overview

When a monitored endpoint fails health checks beyond its configured threshold, Pulsimo automatically creates an incident. This incident becomes the central hub for all information, actions, and collaboration related to that outage. From automatic creation to post-mortem generation, the incident management system guides your team through effective incident response.

Automatic Detection

Incidents are created as soon as failure thresholds are exceeded - no manual intervention required

Alert Management

Stop notification spam by acknowledging incidents - prevent alert fatigue

Collaboration

Multiple team members can work on the same incident and share investigation notes

Post-Mortems

Automatic report generation with timeline, metrics, and complete audit trail

Incident Lifecycle

Every incident progresses through four defined states:

OPEN

Red Badge

Triggered When:

• Endpoint fails consecutive health checks (threshold exceeded)
• Automatic incident creation, no manual intervention

System Actions:

✓ Creates incident record
✓ Sends alerts to notification channels
✓ Displays in Incidents page
✓ Records start time
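
The trigger above can be sketched as a counter of consecutive failed checks. This is only an illustration, not Pulsimo's implementation: the class `EndpointMonitor`, the field `failure_threshold`, and the rule "open once the streak reaches the threshold" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EndpointMonitor:
    """Tracks consecutive failed health checks for one endpoint (hypothetical model)."""

    failure_threshold: int  # assumed setting: failed checks in a row needed to open an incident
    consecutive_failures: int = 0
    incident_open: bool = False

    def record_check(self, passed: bool) -> bool:
        """Record one health check result; return True if this check opens an incident."""
        if passed:
            # A passing check resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.incident_open:
            self.incident_open = True
            return True
        return False


monitor = EndpointMonitor(failure_threshold=3)
checks = [False, False, True, False, False, False, False]
results = [monitor.record_check(passed) for passed in checks]
print(results)  # [False, False, False, False, False, True, False]
```

The single passing check resets the streak, so the incident opens only on the third failure in a row, and later failures do not open a second incident.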

ACKNOWLEDGED

Yellow Badge

Triggered When:

• A team member with the Member or Admin role clicks the "Acknowledge" button

System Actions:

✓ Stops sending repeat alerts
✓ Records who acknowledged and when
✓ Notifies team of acknowledgement

Best Practice: Acknowledge immediately when starting work and add a note explaining what you're doing

INVESTIGATING

Blue Badge

What This Means:

• Active troubleshooting underway
• Root cause analysis in progress
• Fix being implemented

User Actions Available:

✓ Add detailed investigation notes
✓ Document findings and troubleshooting steps
✓ Attach screenshots or logs
✓ Update progress

RESOLVED

Green Badge

System Actions:

✓ Marks incident as resolved
✓ Calculates total downtime
✓ Calculates time-to-resolve (TTR)
✓ Updates MTTR metrics
✓ Sends recovery notification

User Actions Available:

✓ Add resolution notes (what fixed it)
✓ Generate post-mortem report
✓ Export incident data (JSON)
✓ View complete timeline
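
The four states can be modeled as a small state machine. The sketch below is an assumption about how the states relate: the documentation names the states but not the permitted transitions, so `ALLOWED_TRANSITIONS` (forward-only, with any unresolved state able to move to RESOLVED) is hypothetical.

```python
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


# Assumed transitions: forward-only, and any unresolved state may be resolved directly.
ALLOWED_TRANSITIONS = {
    IncidentState.OPEN: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED},
    IncidentState.ACKNOWLEDGED: {IncidentState.INVESTIGATING, IncidentState.RESOLVED},
    IncidentState.INVESTIGATING: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: set(),
}


def can_transition(current: IncidentState, target: IncidentState) -> bool:
    """Return True if moving from `current` to `target` is allowed in this sketch."""
    return target in ALLOWED_TRANSITIONS[current]


def sends_repeat_alerts(state: IncidentState) -> bool:
    """Repeat alerts stop on acknowledgement, so only OPEN incidents keep alerting."""
    return state is IncidentState.OPEN


print(can_transition(IncidentState.OPEN, IncidentState.ACKNOWLEDGED))  # True
print(can_transition(IncidentState.RESOLVED, IncidentState.OPEN))      # False
print(sends_repeat_alerts(IncidentState.ACKNOWLEDGED))                 # False
```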

How to Acknowledge an Incident

When you receive an alert notification:

1. Navigate to the Incidents page in the sidebar
2. Click on the open incident
3. Click the "Acknowledge" button
4. Optionally, add notes about your investigation plan

Acknowledging stops repeat notifications and prevents alert fatigue.

Key Features

MTTR Tracking

Mean Time To Resolution calculated automatically for every incident

Complete Audit Trail

Every action timestamped and attributed to specific users

Incident Analytics

Track incident frequency, patterns, and affected services

Post-Mortem Generation

Automated reports with timeline, metrics, and resolution details

Incident Metrics

Pulsimo automatically calculates key metrics for every incident:

Metric	Description	Example
MTTR	Mean Time To Resolution - Average resolution time	23.5 minutes
MTTD	Mean Time To Detect - Time until incident created	2.3 minutes
Total Downtime	Complete duration service was unavailable	1,410 seconds (23.5 min)
Affected Checks	Number of failed health check attempts	47 failed checks
Incident Count (24h)	Total incidents in last 24 hours	3 incidents
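
As a worked example of the MTTR row, the sketch below averages time-to-resolve over three made-up incidents. The timestamps are hypothetical and chosen so the result matches the 23.5-minute example; the calculation assumes MTTR is the plain mean of (resolved - created) over the selected incidents.

```python
from datetime import datetime
from statistics import mean

# Hypothetical (created, resolved) timestamps for three resolved incidents.
incidents = [
    (datetime(2024, 1, 10, 9, 0, 0), datetime(2024, 1, 10, 9, 23, 30)),
    (datetime(2024, 1, 10, 15, 0, 0), datetime(2024, 1, 10, 15, 10, 0)),
    (datetime(2024, 1, 10, 20, 30, 0), datetime(2024, 1, 10, 21, 7, 0)),
]


def time_to_resolve_seconds(created: datetime, resolved: datetime) -> float:
    """Time-to-resolve (TTR) for a single incident, in seconds."""
    return (resolved - created).total_seconds()


ttr_seconds = [time_to_resolve_seconds(c, r) for c, r in incidents]
mttr_minutes = mean(ttr_seconds) / 60

print(ttr_seconds)   # [1410.0, 600.0, 2220.0]
print(mttr_minutes)  # 23.5
```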

Incident Linking & Relationships


Connect related incidents for better root cause analysis. Pulsimo automatically suggests relationships when it detects patterns.

Relationship Types

🔄

Duplicate

Same issue reported multiple times

Example: "Website slow" + "App loading slow" = Same database issue

⬅️

Caused By

One incident is the root cause of another

Example: Database Down → Backend Degraded → Frontend Errors

🚫

Blocks

Resolving one requires fixing another first

Example: Database Migration Stuck blocks New Feature Deployment

👨‍👧

Parent/Child

Breaking down complex incidents into sub-tasks

Example: System Outage → Database Recovery + Cache Rebuild
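
One way to picture these relationships is as typed links between incidents. The sketch below is illustrative only: `RelationshipType`, `IncidentLink`, and `find_root_cause` are hypothetical names, and it assumes each incident has at most one recorded "caused by" link. It walks the "Caused By" example above back to its root.

```python
from dataclasses import dataclass
from enum import Enum


class RelationshipType(Enum):
    DUPLICATE = "duplicate"
    CAUSED_BY = "caused_by"
    BLOCKS = "blocks"
    PARENT_CHILD = "parent_child"


@dataclass(frozen=True)
class IncidentLink:
    source: str  # incident the link is recorded on
    target: str  # related incident
    relationship: RelationshipType


def find_root_cause(incident: str, links: list[IncidentLink]) -> str:
    """Follow "caused by" links upstream until no further cause is recorded."""
    caused_by = {
        link.source: link.target
        for link in links
        if link.relationship is RelationshipType.CAUSED_BY
    }
    seen = {incident}
    while incident in caused_by:
        incident = caused_by[incident]
        if incident in seen:  # guard against accidental cycles
            break
        seen.add(incident)
    return incident


links = [
    IncidentLink("Frontend Errors", "Backend Degraded", RelationshipType.CAUSED_BY),
    IncidentLink("Backend Degraded", "Database Down", RelationshipType.CAUSED_BY),
]
print(find_root_cause("Frontend Errors", links))  # Database Down
```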

Automatic Root Cause Analysis

Pulsimo analyzes incidents and automatically detects patterns:

Cascade Failures

Detects dependency chain failures

• Visual graph showing cascade path
• Timeline of impact spread
• Automatic "caused by" links suggested

Common Endpoint

Multiple incidents on same service

• Tracks incident frequency per endpoint
• Suggests infrastructure investigation
• Identifies problematic services

Deployment Related

Incidents after deployments

• Correlates incidents with deploys
• Suggests rollback if needed
• 100% time correlation detection

Time Correlation

Incidents in same time window

• Detects simultaneous outages
• Suggests external factors
• Network/infrastructure investigation
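
Time correlation can be illustrated by grouping incidents whose start times fall close together. This is a minimal sketch, not the analysis Pulsimo runs: the function `group_by_time_window`, the incident IDs, and the 5-minute window are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def group_by_time_window(
    start_times: dict[str, datetime], window: timedelta
) -> list[list[str]]:
    """Group incident IDs that start within `window` of the first incident in a group."""
    groups: list[tuple[datetime, list[str]]] = []
    for incident_id, started in sorted(start_times.items(), key=lambda item: item[1]):
        if groups and started - groups[-1][0] <= window:
            groups[-1][1].append(incident_id)
        else:
            groups.append((started, [incident_id]))
    return [ids for _, ids in groups]


# Hypothetical incident start times.
starts = {
    "INC-1": datetime(2024, 1, 10, 11, 42, 15),
    "INC-2": datetime(2024, 1, 10, 11, 43, 0),
    "INC-3": datetime(2024, 1, 10, 11, 45, 30),
    "INC-4": datetime(2024, 1, 10, 16, 0, 0),
}
groups = group_by_time_window(starts, timedelta(minutes=5))
print(groups)  # [['INC-1', 'INC-2', 'INC-3'], ['INC-4']]
```

A group with more than one incident is a candidate for a shared external cause, such as a network or infrastructure problem.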

Response Performance Metrics

Track how well your team responds to incidents:

Metric	Target
Time to Detect	< 5 min
Time to Acknowledge	< 15 min
Time to Identify	< 30 min
Time to Resolve	< 2 hours
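
To show how the targets above might be applied, the sketch below compares measured response times for one incident against them. The target values come from this section; the measured values, the stage keys, and the function `missed_targets` are hypothetical.

```python
from datetime import timedelta

# Targets from this section: each stage should finish in strictly less than its target.
TARGETS = {
    "detect": timedelta(minutes=5),
    "acknowledge": timedelta(minutes=15),
    "identify": timedelta(minutes=30),
    "resolve": timedelta(hours=2),
}


def missed_targets(measured: dict[str, timedelta]) -> list[str]:
    """Return the response stages whose measured time did not beat the target."""
    return [
        stage
        for stage, target in TARGETS.items()
        if stage in measured and measured[stage] >= target
    ]


# Hypothetical measurements for a single incident.
measured = {
    "detect": timedelta(minutes=2, seconds=18),
    "acknowledge": timedelta(minutes=20),
    "identify": timedelta(minutes=28),
    "resolve": timedelta(minutes=45),
}
print(missed_targets(measured))  # ['acknowledge']
```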

Communication Timeline

Every incident has a detailed timeline showing system events and team updates:

11:42:15 AM - System

Incident created - Backend API health check failed

11:45:00 AM - John

Checking database logs for connection issues

12:05:45 PM - System

Service recovered - Health check passed
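
The example timeline can be represented as a list of timestamped events, from which total downtime follows directly. This sketch makes several assumptions: `TimelineEvent` is a hypothetical structure, the date is arbitrary because the example shows only times, and the recovery event is read as occurring shortly after noon.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    at: datetime
    actor: str    # "System" or a team member's name
    message: str


day = "2024-01-10"  # arbitrary date; the example timeline shows only times
timeline = [
    TimelineEvent(datetime.fromisoformat(f"{day} 11:42:15"), "System",
                  "Incident created - Backend API health check failed"),
    TimelineEvent(datetime.fromisoformat(f"{day} 11:45:00"), "John",
                  "Checking database logs for connection issues"),
    TimelineEvent(datetime.fromisoformat(f"{day} 12:05:45"), "System",
                  "Service recovered - Health check passed"),
]

# Keep events in chronological order, then measure creation to recovery.
timeline.sort(key=lambda event: event.at)
downtime = timeline[-1].at - timeline[0].at
print(downtime.total_seconds())  # 1410.0
```

The result, 1,410 seconds, equals the 23.5-minute downtime used as the example in the Incident Metrics table.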

Best Practices

For On-Call Engineers

• Acknowledge immediately
• Update timeline every 15 minutes
• Link related incidents
• Document as you go

For Team Leads

• Review weekly metrics
• Post-mortem critical incidents
• Validate automatic links
• Update response playbooks

For DevOps/SRE

• Use incident groups for mass outages
• Track cascade patterns
• Monitor false positive rate
• Automate common remediations

Frequently Asked Questions

Q: How do I link incidents together?

A: Open the primary incident, click the "Link Related" button, select the relationship type, choose the related incident, and save.

Q: Can I create incidents manually?

A: Yes! Go to Dashboard → Incidents → Create Incident. Fill in title, description, severity, and affected services.

Q: What happens when I acknowledge an incident?

A: Repeat alerts stop sending, your name is recorded as the responder, and the team is notified that someone is working on it.

Q: How is MTTR calculated?

A: Mean Time To Resolution is calculated from incident creation to resolution, averaged across all incidents in the selected timeframe.