Core Concepts

Understanding OpsKnight's core concepts is essential for building an effective incident management workflow. This section explains how each component works, why it exists, and how the components connect to form a complete system.

The Big Picture

OpsKnight is built around a simple but powerful model: Alerts become Incidents, Incidents trigger Escalations, Escalations notify People.

                    YOUR MONITORING STACK
    ┌─────────────────────────────────────────────────┐
    │  Datadog • Prometheus • CloudWatch • Sentry     │
    │  GitHub Actions • Custom Webhooks • 20+ more    │
    └─────────────────────────┬───────────────────────┘
                              │
                              ▼
    ┌─────────────────────────────────────────────────┐
    │                    SERVICES                      │
    │  Your monitored systems with ownership & policy  │
    │  API Gateway • Database • Payment Service        │
    └─────────────────────────┬───────────────────────┘
                              │
                              ▼
    ┌─────────────────────────────────────────────────┐
    │                    INCIDENTS                     │
    │  Actionable work items with lifecycle tracking   │
    │  OPEN • ACKNOWLEDGED • SNOOZED • SUPPRESSED •   │
    │  RESOLVED                                       │
    └─────────────────────────┬───────────────────────┘
                              │
                              ▼
    ┌─────────────────────────────────────────────────┐
    │              ESCALATION POLICIES                 │
    │  Multi-step notification chains with delays      │
    │  Step 1 → Step 2 → Step 3 → Repeat               │
    └─────────────────────────┬───────────────────────┘
                              │
                              ▼
    ┌─────────────────────────────────────────────────┐
    │                   SCHEDULES                      │
    │  Who is on-call right now?                       │
    │  Layers • Rotations • Overrides                  │
    └─────────────────────────┬───────────────────────┘
                              │
                              ▼
    ┌─────────────────────────────────────────────────┐
    │              NOTIFICATIONS                       │
    │  Multi-channel delivery until acknowledged       │
    │  Email • SMS • Push • Slack • WhatsApp          │
    └─────────────────────────────────────────────────┘

Concepts at a Glance

Concept	What It Is	Why It Matters
Dashboard	Command center for real-time visibility	See the forest, not just the trees
Services	Systems you monitor with ownership	Alerts need context and routing
Incidents	Actionable work items from alerts	Track resolution from trigger to close
Escalation Policies	Who gets notified and when	Ensure issues never fall through the cracks
Schedules	On-call rotations and shifts	Fair distribution of on-call duty
Teams	User groups with shared responsibility	Organize people and permissions
Users	Individuals with roles and preferences	Authentication and personalization
Analytics	Metrics, SLAs, and trends	Measure and improve performance
Postmortems	Incident retrospectives	Learn from failures
Status Pages	Public service health communication	Transparency with customers
Integrations	Connections to monitoring tools	Route alerts from your stack
Scalability	Capacity targets and tuning	Understand scale limits and optimizations

How Concepts Connect

The Alert-to-Resolution Flow

1. Alert Received: A monitoring tool sends a webhook to OpsKnight
2. Service Identified: The routing key maps the alert to a Service
3. Incident Created: OpsKnight creates an Incident, or groups the alert into an existing Incident that shares its dedup key
4. Escalation Started: The Service's Escalation Policy is triggered
5. On-Call Found: The Schedule determines who is currently on-call
6. Notification Sent: The on-call person receives notifications via their configured channels
7. Acknowledged: The responder acknowledges the incident, stopping further escalation
8. Resolved: The incident is fixed and marked resolved
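
The routing and deduplication part of this flow can be sketched in a few lines of Python. This is an illustrative model only, not OpsKnight's implementation: the `ROUTING_KEYS` table, the `ingest_alert` function, and the rule that a resolved incident stops absorbing new alerts are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    dedup_key: str
    status: str = "OPEN"
    alerts: list = field(default_factory=list)


# Hypothetical routing table: routing key -> service name.
ROUTING_KEYS = {"rk-payments": "Payment Service", "rk-db": "Database"}

# Unresolved incidents, indexed by (service, dedup_key).
active_incidents: dict = {}


def ingest_alert(routing_key: str, dedup_key: str, payload: dict) -> Incident:
    """Map an alert to a service, then create or reuse an incident."""
    service = ROUTING_KEYS[routing_key]        # Service Identified
    key = (service, dedup_key)
    incident = active_incidents.get(key)
    if incident is None:                       # Incident Created
        incident = Incident(service=service, dedup_key=dedup_key)
        active_incidents[key] = incident
    incident.alerts.append(payload)            # related alerts are grouped
    return incident


def acknowledge(incident: Incident) -> None:
    """Mark the incident acknowledged (escalation would stop here)."""
    incident.status = "ACKNOWLEDGED"


def resolve(incident: Incident) -> None:
    """Mark the incident resolved and stop grouping new alerts into it (assumed)."""
    incident.status = "RESOLVED"
    active_incidents.pop((incident.service, incident.dedup_key), None)


first = ingest_alert("rk-db", "disk-full", {"msg": "disk at 91%"})
second = ingest_alert("rk-db", "disk-full", {"msg": "disk at 95%"})
print(first is second, len(first.alerts))  # True 2
```

Two alerts with the same routing key and dedup key end up on one incident, which is the behavior the Dedup Key term describes.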

Key Relationships

Service ──────────────────┐
    │                     │
    ├── owns ────▶ Incidents
    │                     │
    └── uses ────▶ Escalation Policy
                          │
                          ├── Step 1 ──▶ Schedule ──▶ User (on-call)
                          ├── Step 2 ──▶ User (backup)
                          └── Step 3 ──▶ Team ──▶ All members
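
As a rough data-model sketch of the diagram above, each escalation step targets a Schedule, a User, or a Team, and each target expands to the users it would notify. The class and function names below are hypothetical and chosen for illustration; they are not OpsKnight's schema, and the Service-to-Incident ownership is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class User:
    name: str


@dataclass
class Team:
    name: str
    members: List[User]


@dataclass
class Schedule:
    name: str
    on_call: User  # whoever the schedule currently resolves to


Target = Union[User, Team, Schedule]


@dataclass
class EscalationPolicy:
    steps: List[Target]


@dataclass
class Service:
    name: str
    policy: EscalationPolicy


def recipients(target: Target) -> List[User]:
    """Expand one escalation step target into the users it notifies."""
    if isinstance(target, Schedule):
        return [target.on_call]
    if isinstance(target, Team):
        return list(target.members)
    return [target]


alice, bob, carol = User("alice"), User("bob"), User("carol")
policy = EscalationPolicy(steps=[
    Schedule("primary", on_call=alice),             # Step 1: on-call user
    bob,                                            # Step 2: backup user
    Team("platform", members=[alice, bob, carol]),  # Step 3: all members
])
service = Service("API Gateway", policy)

for number, target in enumerate(service.policy.steps, start=1):
    print(number, [user.name for user in recipients(target)])
```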

Terminology Quick Reference

Term	Definition
Alert	Raw event from a monitoring tool (webhook payload)
Incident	Actionable work item created from one or more alerts
Dedup Key	Identifier to group related alerts into one incident
Urgency	Impact level: HIGH, MEDIUM, or LOW
Priority	Business priority: P1-P5 (most to least critical)
Escalation	Process of notifying additional people over time
On-Call	The person currently responsible for responding
Shift	A time period when someone is on-call
Layer	A rotation pattern within a schedule
Override	Temporary change to normal on-call assignment
MTTA	Mean Time To Acknowledge
MTTR	Mean Time To Resolve
SLA	Service Level Agreement (target response times)
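
To make MTTA, MTTR, and an SLA target concrete, here is a small worked example on two made-up incidents. It assumes both metrics are measured from incident creation and uses a hypothetical 5-minute acknowledgement target; OpsKnight's exact definitions may differ.

```python
from datetime import datetime, timedelta
from statistics import mean

# Made-up sample data: (created, acknowledged, resolved) per incident.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 2, 2, 10), datetime(2024, 1, 2, 3, 30)),
]


def mean_minutes(deltas) -> float:
    """Average a collection of timedeltas and return the result in minutes."""
    return mean([delta.total_seconds() for delta in deltas]) / 60


# Both metrics are measured from creation time in this sketch (an assumption).
mtta = mean_minutes([ack - created for created, ack, _ in incidents])
mttr = mean_minutes([resolved - created for created, _, resolved in incidents])

# Hypothetical SLA: acknowledge within 5 minutes of creation.
ack_target = timedelta(minutes=5)
ack_breaches = sum(1 for created, ack, _ in incidents if ack - created > ack_target)

print(mtta, mttr, ack_breaches)  # 7.0 60.0 1
```

The acknowledgement times are 4 and 10 minutes (mean 7), the resolution times are 30 and 90 minutes (mean 60), and only the second incident misses the 5-minute target.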

Deep Dives

For Incident Responders

Incident Lifecycle — Understand statuses and actions
Bulk Actions — Manage alert storms efficiently
Mobile Access — Respond from anywhere

For On-Call Managers

Schedule Layers — Build complex coverage patterns
Overrides — Handle vacations and swaps
Fair Rotation — Balance on-call burden

For Operations Teams

SLA Configuration — Set response time targets
Custom Fields — Track additional metadata
Integrations — Connect your monitoring stack

For Leadership

Analytics Dashboard — Executive metrics
Status Pages — Customer communication
Postmortems — Organizational learning
Scalability — Capacity planning and growth readiness

Common Patterns

Pattern 1: Simple Team Setup

For small teams with straightforward needs:

1 Service → 1 Escalation Policy → 1 Schedule (weekly rotation)
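
A weekly rotation like this can be reasoned about with simple date arithmetic. The sketch below assumes a fixed handoff every 7 days from a chosen start date; the `on_call_for` helper and the names in `rotation` are hypothetical.

```python
from datetime import date


def on_call_for(day: date, rotation: list, start: date) -> str:
    """Return who is on call on `day` for a weekly rotation that began on `start`."""
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


rotation = ["alice", "bob", "carol"]
start = date(2024, 1, 1)  # a Monday; handoffs happen every 7 days from here

print(on_call_for(date(2024, 1, 7), rotation, start))   # alice (still week 1)
print(on_call_for(date(2024, 1, 8), rotation, start))   # bob
print(on_call_for(date(2024, 1, 22), rotation, start))  # alice (rotation wraps)
```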

Pattern 2: Primary/Secondary Coverage

For better redundancy:

1 Service → 1 Escalation Policy:
    Step 1: Primary Schedule (immediate)
    Step 2: Secondary Schedule (5 min delay)
    Step 3: Manager (10 min delay)
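
The timing of this policy can be modeled as follows. The sketch assumes each delay is measured from incident creation (the pattern does not say whether delays are cumulative) and that acknowledging stops any later steps; `Step` and `notified_so_far` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    target: str
    delay_min: int  # minutes after incident creation (assumed, not cumulative)


POLICY = [
    Step("Primary Schedule", 0),
    Step("Secondary Schedule", 5),
    Step("Manager", 10),
]


def notified_so_far(policy: List[Step], minutes_open: int,
                    acknowledged_at: Optional[int] = None) -> List[str]:
    """Targets notified after `minutes_open`; no new steps fire after acknowledgement."""
    cutoff = minutes_open if acknowledged_at is None else min(minutes_open, acknowledged_at)
    return [step.target for step in policy if step.delay_min <= cutoff]


print(notified_so_far(POLICY, 7))                      # ['Primary Schedule', 'Secondary Schedule']
print(notified_so_far(POLICY, 12, acknowledged_at=3))  # ['Primary Schedule']
```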

Pattern 3: Follow-the-Sun

For global teams:

1 Service → 1 Escalation Policy:
    Step 1: Schedule (auto-selects the layer whose local hours are currently active)
        Layer 1: US Team (9am-5pm PT)
        Layer 2: EU Team (9am-5pm CET)
        Layer 3: APAC Team (9am-5pm JST)
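
One way to check coverage for a layout like this is to compute which layers are active at a given UTC time. The sketch below uses fixed UTC offsets (PT as UTC-8, CET as UTC+1, JST as UTC+9) and ignores daylight saving time, which is a simplification; `LAYERS` and `active_layers` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for illustration only; real time zones shift with daylight saving.
LAYERS = [
    ("US Team", timezone(timedelta(hours=-8))),   # PT, taken as UTC-8
    ("EU Team", timezone(timedelta(hours=1))),    # CET, UTC+1
    ("APAC Team", timezone(timedelta(hours=9))),  # JST, UTC+9
]


def active_layers(now_utc: datetime) -> list:
    """Return the layers whose local time falls within 09:00-17:00."""
    return [
        name for name, tz in LAYERS
        if 9 <= now_utc.astimezone(tz).hour < 17
    ]


def utc(hour: int, minute: int = 0) -> datetime:
    """Build a timezone-aware UTC timestamp on an arbitrary sample day."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


print(active_layers(utc(12)))      # ['EU Team']
print(active_layers(utc(20)))      # ['US Team']
print(active_layers(utc(3)))       # ['APAC Team']
print(active_layers(utc(16, 30)))  # []
```

Under these fixed offsets, the three 9am-5pm windows leave 16:00-17:00 UTC without an active layer, so a check like this is useful for spotting gaps before relying on a follow-the-sun schedule.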

Next Steps

Choose your path:

New to incident management? Start with Services
Setting up on-call? Jump to Schedules
Connecting tools? See Integrations
Measuring performance? Explore Analytics