Core Concepts
Master the fundamental building blocks of OpsKnight incident management
Understanding OpsKnight's core concepts is essential for building an effective incident management workflow. This section explains how each component works, why it exists, and how the components connect to form a complete system.
The Big Picture
OpsKnight is built around a simple but powerful model: Alerts become Incidents, Incidents trigger Escalations, Escalations notify People.
               YOUR MONITORING STACK
┌─────────────────────────────────────────────────┐
│   Datadog • Prometheus • CloudWatch • Sentry    │
│   GitHub Actions • Custom Webhooks • 20+ more   │
└─────────────────────────┬───────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────┐
│                    SERVICES                     │
│ Your monitored systems with ownership & policy  │
│    API Gateway • Database • Payment Service     │
└─────────────────────────┬───────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────┐
│                    INCIDENTS                    │
│  Actionable work items with lifecycle tracking  │
│          OPEN • ACKNOWLEDGED • SNOOZED          │
│              SUPPRESSED • RESOLVED              │
└─────────────────────────┬───────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────┐
│               ESCALATION POLICIES               │
│   Multi-step notification chains with delays    │
│        Step 1 → Step 2 → Step 3 → Repeat        │
└─────────────────────────┬───────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────┐
│                    SCHEDULES                    │
│            Who is on-call right now?            │
│         Layers • Rotations • Overrides          │
└─────────────────────────┬───────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────┐
│                  NOTIFICATIONS                  │
│    Multi-channel delivery until acknowledged    │
│      Email • SMS • Push • Slack • WhatsApp      │
└─────────────────────────────────────────────────┘
Concepts at a Glance
| Concept | What It Is | Why It Matters |
|---|---|---|
| Dashboard | Command center for real-time visibility | See the forest, not just the trees |
| Services | Systems you monitor with ownership | Alerts need context and routing |
| Incidents | Actionable work items from alerts | Track resolution from trigger to close |
| Escalation Policies | Who gets notified and when | Ensure issues never fall through the cracks |
| Schedules | On-call rotations and shifts | Fair distribution of on-call duty |
| Teams | User groups with shared responsibility | Organize people and permissions |
| Users | Individuals with roles and preferences | Authentication and personalization |
| Analytics | Metrics, SLAs, and trends | Measure and improve performance |
| Postmortems | Incident retrospectives | Learn from failures |
| Status Pages | Public service health communication | Transparency with customers |
| Integrations | Connections to monitoring tools | Route alerts from your stack |
| Scalability | Capacity targets and tuning | Understand scale limits and optimizations |
How Concepts Connect
The Alert-to-Resolution Flow
- Alert Received: A monitoring tool sends a webhook to OpsKnight
- Service Identified: The routing key maps the alert to a Service
- Incident Created: OpsKnight creates an Incident, or attaches the alert to an existing Incident when the dedup key matches
- Escalation Started: The Service's Escalation Policy is triggered
- On-Call Found: The Schedule determines who is currently on-call
- Notification Sent: The on-call person receives notifications via their configured channels
- Acknowledge: The responder acknowledges, stopping further escalation
- Resolve: The incident is fixed and marked resolved
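The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpsKnight's implementation: the `Alert` and `Incident` classes, the `ROUTES` table, and the `ingest`, `acknowledge`, and `resolve` functions are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    routing_key: str  # identifies the target Service
    dedup_key: str    # groups related alerts into one incident
    summary: str


@dataclass
class Incident:
    service: str
    dedup_key: str
    status: str = "OPEN"
    alerts: list = field(default_factory=list)


# Hypothetical routing table: routing key -> service name.
ROUTES = {"rk-payments": "Payment Service"}

# Unresolved incidents, indexed by (service, dedup_key).
active = {}


def ingest(alert):
    """Map an alert to a service, then create or reuse an incident."""
    service = ROUTES[alert.routing_key]
    key = (service, alert.dedup_key)
    incident = active.get(key)
    if incident is None:
        incident = Incident(service=service, dedup_key=alert.dedup_key)
        active[key] = incident
    incident.alerts.append(alert)
    return incident


def acknowledge(incident):
    """Acknowledging is what stops further escalation."""
    incident.status = "ACKNOWLEDGED"


def resolve(incident):
    """Close the incident so the next matching alert opens a new one."""
    incident.status = "RESOLVED"
    active.pop((incident.service, incident.dedup_key), None)
```

In this sketch, two alerts with the same routing key and dedup key land on the same incident until that incident is resolved; after that, a matching alert opens a fresh one.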
Key Relationships
Service ──────────────────┐
    │                     │
    ├── owns ────▶ Incidents
    │                     │
    └── uses ────▶ Escalation Policy
                          │
                          ├── Step 1 ──▶ Schedule ──▶ User (on-call)
                          ├── Step 2 ──▶ User (backup)
                          └── Step 3 ──▶ Team ──▶ All members
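One way to picture these relationships is as a small data model. The sketch below is illustrative only: the class names mirror the concepts in the diagram, but the fields and the `recipients` helper are assumptions made for the example, not OpsKnight's schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class User:
    name: str


@dataclass
class Team:
    name: str
    members: List[User]


@dataclass
class Schedule:
    name: str
    on_call: User  # whoever the schedule currently resolves to


# An escalation step can target a schedule, a single user, or a team.
Target = Union[Schedule, User, Team]


@dataclass
class EscalationPolicy:
    steps: List[Target]


@dataclass
class Service:
    name: str
    policy: EscalationPolicy
    incidents: List[str] = field(default_factory=list)


def recipients(target):
    """Expand one escalation step's target into the users to notify."""
    if isinstance(target, Schedule):
        return [target.on_call]
    if isinstance(target, Team):
        return list(target.members)
    return [target]


alice, bob, carol = User("alice"), User("bob"), User("carol")
policy = EscalationPolicy(steps=[
    Schedule("primary", on_call=alice),             # Step 1
    bob,                                            # Step 2
    Team("platform", members=[alice, bob, carol]),  # Step 3
])
api = Service("API Gateway", policy)
names = [[user.name for user in recipients(step)] for step in api.policy.steps]
```

Here `names` holds one list of recipients per step: the on-call user for the schedule, the backup user, then every team member.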
Terminology Quick Reference
| Term | Definition |
|---|---|
| Alert | Raw event from a monitoring tool (webhook payload) |
| Incident | Actionable work item created from one or more alerts |
| Dedup Key | Identifier to group related alerts into one incident |
| Urgency | Impact level: HIGH, MEDIUM, or LOW |
| Priority | Business priority: P1-P5 (most to least critical) |
| Escalation | Process of notifying additional people over time |
| On-Call | The person currently responsible for responding |
| Shift | A time period when someone is on-call |
| Layer | A rotation pattern within a schedule |
| Override | Temporary change to normal on-call assignment |
| MTTA | Mean Time To Acknowledge |
| MTTR | Mean Time To Resolve |
| SLA | Service Level Agreement (target response times) |
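As a worked example of the two time metrics, the sketch below computes MTTA and MTTR for two made-up incidents. It assumes both metrics are measured from incident creation; some teams measure MTTR from acknowledgment instead, and this page does not say which convention OpsKnight uses. The `INCIDENTS` records and the `mean_minutes` helper are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (created, acknowledged, resolved).
INCIDENTS = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 10), datetime(2024, 1, 1, 13, 30)),
]


def mean_minutes(intervals):
    """Average length, in minutes, of (start, end) datetime pairs."""
    return mean((end - start).total_seconds() for start, end in intervals) / 60


# Assumption: both metrics start the clock at incident creation.
mtta = mean_minutes((created, acked) for created, acked, _ in INCIDENTS)
mttr = mean_minutes((created, resolved) for created, _, resolved in INCIDENTS)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

The acknowledgment times are 4 and 10 minutes, so MTTA is 7 minutes; the resolution times are 30 and 90 minutes, so MTTR is 60 minutes.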
Recommended Reading Order
If you're new to OpsKnight, we recommend reading the concepts in this order:
- Services — Start here to understand the foundation
- Incidents — Learn the core workflow
- Escalation Policies — Understand notification routing
- Schedules — Configure on-call rotations
- Teams — Organize your responders
- Dashboard — Master the command center
- Analytics — Track and improve performance
Deep Dives
For Incident Responders
- Incident Lifecycle — Understand statuses and actions
- Bulk Actions — Manage alert storms efficiently
- Mobile Access — Respond from anywhere
For On-Call Managers
- Schedule Layers — Build complex coverage patterns
- Overrides — Handle vacations and swaps
- Fair Rotation — Balance on-call burden
For Operations Teams
- SLA Configuration — Set response time targets
- Custom Fields — Track additional metadata
- Integrations — Connect your monitoring stack
For Leadership
- Analytics Dashboard — Executive metrics
- Status Pages — Customer communication
- Postmortems — Organizational learning
- Scalability — Capacity planning and growth readiness
Common Patterns
Pattern 1: Simple Team Setup
For small teams with straightforward needs:
1 Service → 1 Escalation Policy → 1 Schedule (weekly rotation)
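To make the weekly rotation concrete, here is a minimal sketch of how the current on-call person could be derived from a rotation start date. The `on_call` function and its parameters are hypothetical; the actual schedule is configured in OpsKnight rather than computed by hand.

```python
from datetime import datetime, timedelta, timezone


def on_call(participants, rotation_start, now, shift=timedelta(weeks=1)):
    """Return whoever holds the shift that contains `now`."""
    if now < rotation_start:
        raise ValueError("rotation has not started yet")
    completed_shifts = (now - rotation_start) // shift  # whole shifts elapsed
    return participants[completed_shifts % len(participants)]


team = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # a Monday

week_one = on_call(team, start, datetime(2024, 1, 3, tzinfo=timezone.utc))
week_two = on_call(team, start, datetime(2024, 1, 8, tzinfo=timezone.utc))
week_four = on_call(team, start, datetime(2024, 1, 22, tzinfo=timezone.utc))
```

With three participants the rotation wraps around after three weeks, so the first person is on call again in week four.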
Pattern 2: Primary/Secondary Coverage
For better redundancy:
1 Service → 1 Escalation Policy:
  Step 1: Primary Schedule (immediate)
  Step 2: Secondary Schedule (5 min delay)
  Step 3: Manager (10 min delay)
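The timing of this pattern can be sketched as follows. The example assumes each delay is measured from incident creation (the pattern does not say whether delays are cumulative or relative to the previous step) and that acknowledging stops further escalation, as described in the alert-to-resolution flow. `STEPS` and `notified_steps` are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumption: delays count from incident creation, not from the previous step.
STEPS = [
    ("Primary Schedule", timedelta(minutes=0)),
    ("Secondary Schedule", timedelta(minutes=5)),
    ("Manager", timedelta(minutes=10)),
]


def notified_steps(created_at, now, acknowledged_at=None):
    """List the steps that have fired by `now`, stopping at acknowledgment."""
    cutoff = now if acknowledged_at is None else min(now, acknowledged_at)
    return [name for name, delay in STEPS if created_at + delay <= cutoff]


created = datetime(2024, 1, 1, 9, 0)
after_3_min = notified_steps(created, datetime(2024, 1, 1, 9, 3))
after_7_min = notified_steps(created, datetime(2024, 1, 1, 9, 7))
acked_at_6 = notified_steps(
    created,
    datetime(2024, 1, 1, 9, 30),
    acknowledged_at=datetime(2024, 1, 1, 9, 6),
)
```

In the last call the incident was acknowledged six minutes in, so the manager step at ten minutes never fires even though thirty minutes have passed.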
Pattern 3: Follow-the-Sun
For global teams:
1 Service → 1 Escalation Policy:
  Step 1: Schedule (auto-selects based on timezone)
    Layer 1: US Team (9am-5pm PT)
    Layer 2: EU Team (9am-5pm CET)
    Layer 3: APAC Team (9am-5pm JST)
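A rough sketch of how layer selection by local business hours might work is shown below. To stay dependency-free it uses fixed UTC offsets (PT as UTC-8, CET as UTC+1, JST as UTC+9), which ignores daylight saving time; `LAYERS` and `active_layers` are hypothetical names, not OpsKnight configuration.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: fixed offsets, so daylight saving time is ignored.
LAYERS = [
    ("US Team", timezone(timedelta(hours=-8))),   # PT, treated as UTC-8
    ("EU Team", timezone(timedelta(hours=1))),    # CET
    ("APAC Team", timezone(timedelta(hours=9))),  # JST
]
SHIFT_START, SHIFT_END = time(9, 0), time(17, 0)


def active_layers(now_utc):
    """Return the layers whose local 9am-5pm window contains `now_utc`."""
    active = []
    for name, tz in LAYERS:
        local_time = now_utc.astimezone(tz).time()
        if SHIFT_START <= local_time < SHIFT_END:
            active.append(name)
    return active


us_hours = active_layers(datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc))
eu_hours = active_layers(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc))
gap = active_layers(datetime(2024, 1, 1, 16, 30, tzinfo=timezone.utc))
```

Under these fixed offsets the three windows cover 00:00-08:00, 08:00-16:00, and 17:00-01:00 UTC, which leaves 16:00-17:00 UTC uncovered (`gap` is empty). A real follow-the-sun schedule would extend one layer or add a fallback to close that hour.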
Next Steps
Choose your path:
- New to incident management? Start with Services
- Setting up on-call? Jump to Schedules
- Connecting tools? See Integrations
- Measuring performance? Explore Analytics
Last updated for v1