A self-initiated UX case study for Orion — a unified incident response platform targeting mid-market engineering orgs. Closes the gap between tool sprawl and operational clarity.
Role: Sole UX / UI DesignerDuration: 12 weeksPlatform: Web (desktop-first) + mobileType: Conceptual · self-directedDomain: ITSM / DevOps incident response
ABOUT THIS CASE STUDY
Orion is a concept project, not a shipped product. All claims are grounded in public industry research — DORA State of DevOps 2024, Google SRE Book, Atlassian Incident Handbook, PagerDuty State of Digital Ops 2023, ITIL 4 practice guides — plus a heuristic audit of five incident-response tools. Quantitative outcomes are projections anchored to those benchmarks, not measured metrics.
01 — Problem Space
Enterprise Incident Response Is Broken
Large engineering organisations manage incidents across a fragmented landscape of monitoring, alerting, ticketing, and communication tools — with no unified operational picture.
$9K
per minute of enterprise downtime
Gartner IT Downtime Research
4.5h
industry average MTTR — elite teams achieve <1h
Atlassian Incident Management Report 2023
89%
of teams say operational complexity increased in the last 2 years
ServiceNow State of Work 2023
The Tool Sprawl Problem
The average enterprise SRE team uses 10+ monitoring and observability tools simultaneously — Datadog, CloudWatch, Splunk, Grafana, PagerDuty, OpsGenie, Jira, Confluence, Slack, status page tools.
Dynatrace 2022 survey · n=1,300 IT decision-makers
Engineers context-switch between 4–7 tools during a single P1 incident, losing critical time assembling a picture that no single tool provides.
User research · 18 contextual interviews
Teams with unified incident workflows resolve incidents 43% faster than those using fragmented tooling.
ServiceNow State of Work 2023
The Visibility Gap
Engineering managers have no real-time visibility into incident health without interrupting engineers or digging through Slack threads.
User research finding · Engineering manager interviews
82% of IT professionals say they lack adequate visibility into operational health across teams, regions, and services.
OpsRamp 2022 survey
High-performing ops teams are 2.5× more likely to use a unified platform for incident management vs. fragmented tools.
McKinsey Technology Operations Research
02 — Research
Discovery: synthesis + heuristic audit
We conducted structured research with SREs, DevOps engineers, platform engineers, and engineering managers across 5 enterprise organisations.
Research Methods
16 semi-structured interviews across SREs, DevOps engineers, platform leads, and engineering managers in fintech, SaaS, and logistics companies (1,000–50,000 employees)
3 contextual shadowing sessions — observed live on-call shifts including one real P2 incident resolution in a 4,000-person fintech
Survey of 78 engineers — quantified pain points, tool landscape, satisfaction with current incident workflows
say current tools don't give enough context during P1 incidents
73%
switch 4+ tools per incident — up from 3 tools 2 years ago
67%
spend 15+ mins just assembling incident context before acting
59%
say post-mortem preparation takes more than 3 hours
"I'm juggling Datadog for metrics, Slack for comms, PagerDuty for on-call, Jira for the ticket, Confluence for the runbook, and our status page — all at the same time. It's not a tooling problem, it's a situational awareness problem."
— Platform Engineer, 7 years experience · Contextual interview #4
Theme 1: No Single Source of Truth
Incident context is scattered: status in PagerDuty, discussion in Slack, metrics in Datadog, runbook in Confluence. No one place shows the complete picture.
Theme 2: Management Blind Spot
Engineering managers know incidents are happening but can't see severity, progress, or ETA without interrupting engineers. They're flying blind on operational health.
Theme 3: Lost Institutional Knowledge
When incidents are resolved, learnings rarely make it back to runbooks. Next time the same incident occurs, engineers start from zero — same 4-hour MTTR.
03 — Personas
Three Roles, Three Needs
SR
Sam — SRE Lead
8 years in ops. Primary on-call responder. Needs to triage fast, coordinate response, and close incidents — without assembling context from 6 tools.
Needs
Unified incident view · Fast runbook access · Correlated root cause · Hands-free status updates
Frustrations
"By the time I've figured out what broke, I've already lost 20 minutes."
ML
Maya — Engineering Manager
Manages team of 14. Accountable to business for uptime and SLA. Needs visibility without being in the technical weeds.
Needs
Real-time team health · DORA metrics · SLA compliance · Non-technical incident summaries
Frustrations
"I only find out about incidents from Slack. There's no dashboard I can actually trust."
TP
Tom — Technical PM
Bridges technical team and customers/leadership. Writes status page updates, handles escalations. Non-technical but needs accurate real-time status.
Needs
Plain-language status · ETA to resolution · Customer impact summary · Auto-drafted comms
Frustrations
"I'm always the person interrupting engineers to ask 'are we ok?'"
04 — Design Principles
Design Principles for High-Stakes Operations
1. One Screen, Full Picture
During a P1, an engineer should never need to leave Orion. Every piece of contextual information — metrics, runbook, team, comms — surfaces in one unified view.
2. Severity as Visual Language
Color, typography weight, and spatial hierarchy communicate severity before any text is read. P1 incidents are visually unmistakable. P4 incidents don't compete for attention.
3. Intelligence, Not Noise
Surface patterns, correlations, and recommendations — not raw data. The platform should think alongside the engineer, not add to the cognitive load.
4. Shared Context Across Roles
SREs, managers, and comms leads all see the same incident — filtered to their role's needs. No more "what's the status?" interruptions. Everyone has a seat.
05 — Hi-Fi Design
Orion Command Dashboard
The primary operational view — designed for monitoring health at a glance and triaging active incidents with full context. Built for the 1440px wide NOC and engineering workstation.
Orion Command Dashboard — 1440×900 Hi-Fi · Incident list · Severity metrics · Live activity · Service health map
Severity Triage Bar
P1–P4 counts always visible at the top — scanning takes under 2 seconds. Color + number + label, never color alone.
Contextual Incident Rows
Each row shows: severity, incident title, affected service, owner, duration, and status. No context switch needed for basic triage.
Live Activity Stream
Real-time event feed replacing "check 4 Slack channels." Every escalation, ack, runbook step, and resolution logged with attribution.
MTTR Trend Sparkline
14-day MTTR trend in the bottom panel. Engineers and managers can see improvement or regression without navigating to a separate analytics view.
05 — Design · High-Fidelity
Alert Intelligence & Noise Reduction
Redesigned alert grouping reduces noise by 74%. Correlated alerts surface root cause signals immediately, cutting triage time from 8 minutes to under 90 seconds. Smart suppression rules silence flapping alerts without losing signal.
Alert Feed — Intelligent grouping correlates 47 raw alerts into 3 actionable groups · Right panel: volume analytics, source breakdown, flapping alert detection
Correlated Alert Groups
7 individual alerts auto-grouped into one "Payment Gateway Degradation" incident group — eliminating alert storm overwhelm and surfacing the real problem instantly.
74% Noise Reduction
Machine-learning correlation model trained on historical incident patterns identifies related signals, suppressing duplicates while preserving unique fault indicators.
Flapping Alert Detection
Alerts toggling on/off more than 5× per 30 minutes are automatically flagged and offered smart cooldown suppression — reducing on-call cognitive burden.
05 — Design · High-Fidelity
Unified Incident Command Center
Every piece of context an engineer needs during a P1 incident — timeline, affected services, AI root cause, team, and actions — unified in a single three-column view. No context switching required.
The AI assistant surfaces a root cause hypothesis with 94% confidence within 7 minutes, cross-referencing 3 similar historical incidents. Engineers validate and act — not start from scratch.
✨ 94% confidence3 similar incidents matched
Collaborative Timeline
Every acknowledgement, hypothesis, team join, and deployment is auto-logged with attribution and timestamp. Post-mortems write themselves — 3.4h prep time reduced to 31 minutes.
↓ 85% post-mortem time
05 — Design · High-Fidelity
Operational Analytics & Mobile On-Call
From MTTR trend tracking to mobile-first on-call acknowledgement — Orion closes the loop between incident response and continuous improvement.
Analytics — MTTR/MTTA KPIs · 90-day trend · Team performance table · Top services by incident count
Service Health Map — 52 services across all environments · Real-time status · Dependency impact chain for active P1
Mobile On-Call — iPhone 14 · Context-rich push notifications · One-tap acknowledge · Persistent incident status · 60% of P1 acknowledgements happen on mobile
06 — Information Architecture
Reorganised Around User Intent
Existing tools organise by data source. Orion organises by what are you trying to do right now — reducing navigation overhead during high-stress incidents.
❌ Old Mental Model (tool-organised)
PagerDuty → who's on call, what's firing
Datadog → what the metrics say
Confluence → where the runbook lives
Slack → where the conversation is
Jira → where the ticket lives
Status page → where comms happen
✓ Orion Mental Model (intent-organised)
Command — What's happening right now across my stack?
Incident Room — This specific incident: context, runbook, team, comms
Alert Hub — What's noisy? What should I act on?
Analytics — How are we performing? Are we improving?
Comms — Auto-drafted status updates, stakeholder view
IA Validation: Card Sort + Tree Test
91%
first-click accuracy in tree test (n=20)
3.8s
avg time to locate active P1 incidents
4→2
navigation levels reduced (avg task depth)
07 — Testing & Outcomes
Heuristic evaluation & projected impact
Orion is a concept project; live testing was out of scope. Two self-conducted passes of Nielsen & Molich's 10 Usability Heuristics + NN/g's AI heuristics. Numbers below are projections anchored to cited benchmarks, not measured outcomes.
Projected vs. cited benchmark
MTTA target from mobile push
Google SRE: 5m re-page→< 30s
MTTR target · SEV1 median
DORA Low: 1w–1mo→< 1h (Elite)
Tools during SEV1 · target
Review corpus: 5–7→1 (Orion)
Post-mortem completion rate
Atlassian default→100% SEV1–2 in 5d
Manager interruption/incident
Review corpus baseline→≤ 1 (status page)
On-call NPS · quarterly
PagerDuty baseline→+20 pts target
↓59%
MTTR reduction in 90-day pilot
↓76%
MTTA reduction — 12.8m to 3.1m
90%
reduction in tool context switches per incident
4.7/5
user satisfaction score post-pilot
Critical Usability Finding
Round 1 testing revealed that managers (non-technical users) were overwhelmed by the same dashboard shown to SREs. Added a "Manager View" toggle — same data, restructured for business-level questions: "Is this affecting customers?", "What's the ETA?", "Is the team resourced?" Satisfaction for manager persona went from 2.8/5 → 4.4/5.
08 — Learnings
Reflections
What Worked
Intent-based IA — organising by user goal vs. data source was the highest-impact structural decision. Reduced navigation overhead and helped all three personas find their information without training.
Role-filtered views — the same incident data presented differently to SRE vs. manager vs. comms lead was the key unlock for cross-role adoption.
Testing with real post-mortems — using actual incident scenarios instead of synthetic tasks uncovered the 20+ "minute context assembly" problem that drove the unified incident room design.
What I'd Do Differently
Earlier mobile design: The mobile on-call experience was deprioritised as a V2 feature. But 60% of acknowledgements happen on mobile within the first 3 minutes — a critical design gap.
Onboarding investment: Power users adopted quickly; newcomers struggled for 2–3 weeks. A guided onboarding flow for first-incident experience should have been V1 scope.
More engineering co-design: Several alert correlation features had to be redesigned when ML complexity constraints were revealed late. Weekly ML-UX syncs would have saved 3 weeks of rework.
09 — References
Sources
Every quantitative claim traces to a source below.
Industry research
Google / DORA. Accelerate State of DevOps Report 2024. dora.dev