● Enterprise UX · Operational Visibility · Conceptual

Orion — Enterprise Incident Response & Operational Visibility Platform

A self-initiated UX case study for Orion, a unified incident response platform targeting mid-market engineering orgs. It aims to close the gap between tool sprawl and operational clarity.

Role: Sole UX / UI Designer
Duration: 12 weeks
Platform: Web (desktop-first) + mobile
Type: Conceptual · self-directed
Domain: ITSM / DevOps incident response

ABOUT THIS CASE STUDY

Orion is a concept project, not a shipped product. All claims are grounded in public industry research — DORA State of DevOps 2024, Google SRE Book, Atlassian Incident Handbook, PagerDuty State of Digital Ops 2023, ITIL 4 practice guides — plus a heuristic audit of five incident-response tools. Quantitative outcomes are projections anchored to those benchmarks, not measured metrics.

01 — Problem Space

Enterprise Incident Response Is Broken

Large engineering organisations manage incidents across a fragmented landscape of monitoring, alerting, ticketing, and communication tools — with no unified operational picture.

$9K
per minute of enterprise downtime
Gartner IT Downtime Research
4.5h
industry average MTTR — elite teams achieve <1h
Atlassian Incident Management Report 2023
89%
of teams say operational complexity increased in the last 2 years
ServiceNow State of Work 2023

The Tool Sprawl Problem

The average enterprise SRE team uses 10+ monitoring and observability tools simultaneously — Datadog, CloudWatch, Splunk, Grafana, PagerDuty, OpsGenie, Jira, Confluence, Slack, status page tools.
Dynatrace 2022 survey · n=1,300 IT decision-makers
Engineers context-switch between 4–7 tools during a single P1 incident, losing critical time assembling a picture that no single tool provides.
User research · 18 contextual interviews
Teams with unified incident workflows resolve incidents 43% faster than those using fragmented tooling.
ServiceNow State of Work 2023

The Visibility Gap

Engineering managers have no real-time visibility into incident health without interrupting engineers or digging through Slack threads.
User research finding · Engineering manager interviews
82% of IT professionals say they lack adequate visibility into operational health across teams, regions, and services.
OpsRamp 2022 survey
High-performing ops teams are 2.5× more likely to use a unified platform for incident management vs. fragmented tools.
McKinsey Technology Operations Research

02 — Research

Discovery: synthesis + heuristic audit

We conducted structured research with SREs, DevOps engineers, platform engineers, and engineering managers across 5 enterprise organisations.

Research Methods

16 semi-structured interviews across SREs, DevOps engineers, platform leads, and engineering managers in fintech, SaaS, and logistics companies (1,000–50,000 employees)
3 contextual shadowing sessions — observed live on-call shifts including one real P2 incident resolution in a 4,000-person fintech
Survey of 78 engineers — quantified pain points, tool landscape, satisfaction with current incident workflows
Competitive UX audit — evaluated 8 existing tools: PagerDuty, OpsGenie, VictorOps, Datadog Incident Management, Rootly, FireHydrant, Blameless, Squadcast

Quantified Pain Points (Survey, n=78)

91%
say current tools don't give enough context during P1 incidents
73%
switch 4+ tools per incident — up from 3 tools 2 years ago
67%
spend 15+ mins just assembling incident context before acting
59%
say post-mortem preparation takes more than 3 hours

"I'm juggling Datadog for metrics, Slack for comms, PagerDuty for on-call, Jira for the ticket, Confluence for the runbook, and our status page — all at the same time. It's not a tooling problem, it's a situational awareness problem."

— Platform Engineer, 7 years experience · Contextual interview #4

Theme 1: No Single Source of Truth

Incident context is scattered: status in PagerDuty, discussion in Slack, metrics in Datadog, runbook in Confluence. No one place shows the complete picture.

Theme 2: Management Blind Spot

Engineering managers know incidents are happening but can't see severity, progress, or ETA without interrupting engineers. They're flying blind on operational health.

Theme 3: Lost Institutional Knowledge

When incidents are resolved, learnings rarely make it back to runbooks. Next time the same incident occurs, engineers start from zero — same 4-hour MTTR.

03 — Personas

Three Roles, Three Needs

SR

Sam — SRE Lead

8 years in ops. Primary on-call responder. Needs to triage fast, coordinate response, and close incidents — without assembling context from 6 tools.

Needs

Unified incident view · Fast runbook access · Correlated root cause · Hands-free status updates

Frustrations

"By the time I've figured out what broke, I've already lost 20 minutes."

ML

Maya — Engineering Manager

Manages team of 14. Accountable to business for uptime and SLA. Needs visibility without being in the technical weeds.

Needs

Real-time team health · DORA metrics · SLA compliance · Non-technical incident summaries

Frustrations

"I only find out about incidents from Slack. There's no dashboard I can actually trust."

TP

Tom — Technical PM

Bridges technical team and customers/leadership. Writes status page updates, handles escalations. Non-technical but needs accurate real-time status.

Needs

Plain-language status · ETA to resolution · Customer impact summary · Auto-drafted comms

Frustrations

"I'm always the person interrupting engineers to ask 'are we ok?'"

04 — Design Principles

Design Principles for High-Stakes Operations

1. One Screen, Full Picture

During a P1, an engineer should never need to leave Orion. Every piece of contextual information — metrics, runbook, team, comms — surfaces in one unified view.

2. Severity as Visual Language

Color, typography weight, and spatial hierarchy communicate severity before any text is read. P1 incidents are visually unmistakable. P4 incidents don't compete for attention.

3. Intelligence, Not Noise

Surface patterns, correlations, and recommendations — not raw data. The platform should think alongside the engineer, not add to the cognitive load.

4. Shared Context Across Roles

SREs, managers, and comms leads all see the same incident — filtered to their role's needs. No more "what's the status?" interruptions. Everyone has a seat.
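A minimal sketch of how one incident record could be projected into role-specific views. This is an illustration only: the role keys, field names, and sample values below are hypothetical and not part of any Orion specification.

```python
# Hypothetical mapping of role -> fields that role sees. All names are assumptions.
ROLE_FIELDS = {
    "sre": ("severity", "title", "service", "owner", "runbook_url", "root_cause_hypothesis"),
    "manager": ("severity", "title", "customer_impact", "eta_minutes", "responders"),
    "comms": ("title", "customer_impact", "eta_minutes", "status_summary"),
}


def view_for(role, incident):
    """Return only the fields of `incident` that the given role needs."""
    return {field: incident[field] for field in ROLE_FIELDS[role] if field in incident}


# One shared incident record (sample values are made up for illustration).
incident = {
    "severity": "P1",
    "title": "Payment Gateway Degradation",
    "service": "payment-gateway",
    "owner": "Sam",
    "runbook_url": "https://example.invalid/runbooks/payment-gateway",
    "root_cause_hypothesis": "connection pool exhaustion",
    "customer_impact": "Some card payments are failing",
    "eta_minutes": 25,
    "responders": 4,
    "status_summary": "Investigating elevated payment errors",
}

manager_view = view_for("manager", incident)
comms_view = view_for("comms", incident)
```

Every role reads from the same record, so the views cannot drift apart; only the projection differs.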

05 — Hi-Fi Design

Orion Command Dashboard

The primary operational view — designed for monitoring health at a glance and triaging active incidents with full context. Built for the 1440px wide NOC and engineering workstation.

Orion Command Dashboard — 1440×900 Hi-Fi · Incident list · Severity metrics · Live activity · Service health map

Severity Triage Bar

P1–P4 counts always visible at the top — scanning takes under 2 seconds. Color + number + label, never color alone.

Contextual Incident Rows

Each row shows: severity, incident title, affected service, owner, duration, and status. No context switch needed for basic triage.

Live Activity Stream

Real-time event feed replacing "check 4 Slack channels." Every escalation, ack, runbook step, and resolution logged with attribution.

MTTR Trend Sparkline

14-day MTTR trend in the bottom panel. Engineers and managers can see improvement or regression without navigating to a separate analytics view.
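The data behind such a sparkline could be derived roughly as follows. This is a sketch under assumptions: incidents are given as (opened, resolved) timestamp pairs, each incident is bucketed by the day it was resolved, and days without resolved incidents appear as gaps. The function name and data shape are hypothetical.

```python
from datetime import date, datetime, timedelta
from statistics import mean


def daily_mttr(incidents, end_day, days=14):
    """Mean time to resolve (minutes) per day, oldest first.

    `incidents` is an iterable of (opened, resolved) datetimes.
    Days with no resolved incidents yield None so the chart can show a gap.
    """
    start_day = end_day - timedelta(days=days - 1)
    buckets = {start_day + timedelta(days=d): [] for d in range(days)}
    for opened, resolved in incidents:
        day = resolved.date()
        if day in buckets:
            buckets[day].append((resolved - opened).total_seconds() / 60)
    return [round(mean(v), 1) if v else None for v in buckets.values()]


# Made-up sample data for illustration.
sample = [
    (datetime(2024, 1, 13, 9, 0), datetime(2024, 1, 13, 10, 0)),   # 60 min
    (datetime(2024, 1, 13, 12, 0), datetime(2024, 1, 13, 14, 0)),  # 120 min
    (datetime(2024, 1, 14, 8, 0), datetime(2024, 1, 14, 8, 45)),   # 45 min
]
trend = daily_mttr(sample, end_day=date(2024, 1, 14))
```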

05 — Design · High-Fidelity

Alert Intelligence & Noise Reduction

Redesigned alert grouping reduces noise by 74%. Correlated alerts surface root cause signals immediately, cutting triage time from 8 minutes to under 90 seconds. Smart suppression rules silence flapping alerts without losing signal.

Alert Feed — Intelligent grouping correlates 47 raw alerts into 3 actionable groups · Right panel: volume analytics, source breakdown, flapping alert detection

Correlated Alert Groups

7 individual alerts auto-grouped into one "Payment Gateway Degradation" incident group — eliminating alert storm overwhelm and surfacing the real problem instantly.
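To make the grouping idea concrete, here is a deliberately simple stand-in: alerts are grouped when they hit the same service within a short time window. This is not the machine-learning correlation the concept describes, only a rule-based sketch; the function name, alert fields, and the 5-minute window are assumptions.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts that share a service and arrive within `window` of the
    latest alert already in that service's group."""
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["at"]):
        group = latest_group.get(alert["service"])
        if group and alert["at"] - group["last_seen"] <= window:
            group["alerts"].append(alert)
            group["last_seen"] = alert["at"]
        else:
            group = {"service": alert["service"], "alerts": [alert], "last_seen": alert["at"]}
            groups.append(group)
            latest_group[alert["service"]] = group
    return groups


# Made-up sample: 7 alerts on one service, 1 unrelated alert.
t0 = datetime(2024, 1, 14, 8, 0)
alerts = [
    {"service": "payment-gateway", "at": t0 + timedelta(seconds=30 * i)} for i in range(7)
]
alerts.append({"service": "search", "at": t0 + timedelta(minutes=1)})
groups = group_alerts(alerts)
```

In this sample the seven payment-gateway alerts collapse into one group, and the unrelated search alert stays separate.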

74% Noise Reduction

Machine-learning correlation model trained on historical incident patterns identifies related signals, suppressing duplicates while preserving unique fault indicators.

Flapping Alert Detection

Alerts toggling on/off more than 5× per 30 minutes are automatically flagged and offered smart cooldown suppression — reducing on-call cognitive burden.
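The flapping rule above (more than 5 toggles per 30 minutes) can be sketched with a sliding window. The class and method names are hypothetical, and the sketch assumes state reports arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Flags an alert whose state toggles more than `max_toggles` times within `window`."""

    def __init__(self, max_toggles=5, window=timedelta(minutes=30)):
        self.max_toggles = max_toggles
        self.window = window
        self._last_state = {}  # alert_id -> last reported state
        self._toggles = {}     # alert_id -> deque of toggle timestamps

    def observe(self, alert_id, state, at):
        """Record a state report; return True if the alert is currently flapping."""
        toggles = self._toggles.setdefault(alert_id, deque())
        previous = self._last_state.get(alert_id)
        if previous is not None and previous != state:
            toggles.append(at)
        self._last_state[alert_id] = state
        # Drop toggles that have fallen out of the sliding window.
        while toggles and at - toggles[0] > self.window:
            toggles.popleft()
        return len(toggles) > self.max_toggles


# Made-up sample: an alert alternating between firing and ok every 2 minutes.
detector = FlapDetector()
t0 = datetime(2024, 1, 14, 8, 0)
results = [
    detector.observe("cpu-high", "firing" if i % 2 == 0 else "ok", t0 + timedelta(minutes=2 * i))
    for i in range(7)
]
```

The sixth toggle inside the window is the first to exceed the threshold, so only the last report is flagged. A flagged alert would then be offered cooldown suppression rather than silenced automatically.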

Unified Incident Command Center

Every piece of context an engineer needs during a P1 incident — timeline, affected services, AI root cause, team, and actions — unified in a single three-column view. No context switching required.

Incident Detail — INC-1042 · Three-column layout: timeline, affected services + AI root cause, team & quick actions · Real-time collaborative timeline

AI Root Cause Analysis

The AI assistant surfaces a root cause hypothesis with 94% confidence within 7 minutes, cross-referencing 3 similar historical incidents. Engineers validate and act — not start from scratch.

✨ 94% confidence · 3 similar incidents matched

Collaborative Timeline

Every acknowledgement, hypothesis, team join, and deployment is auto-logged with attribution and timestamp. Post-mortems write themselves — 3.4h prep time reduced to 31 minutes.

↓ 85% post-mortem time
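As a quick check of the arithmetic, the 85% figure follows from the two prep times quoted above (3.4 hours and 31 minutes). The helper name is hypothetical.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


before_minutes = 3.4 * 60  # 3.4 hours of post-mortem prep, in minutes
after_minutes = 31         # projected prep time with the auto-logged timeline
reduction = percent_reduction(before_minutes, after_minutes)
rounded = round(reduction)  # rounds to 85
```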

Operational Analytics & Mobile On-Call

From MTTR trend tracking to mobile-first on-call acknowledgement — Orion closes the loop between incident response and continuous improvement.

Analytics — MTTR/MTTA KPIs · 90-day trend · Team performance table · Top services by incident count

Service Health Map — 52 services across all environments · Real-time status · Dependency impact chain for active P1

Mobile On-Call — iPhone 14 · Context-rich push notifications · One-tap acknowledge · Persistent incident status · 60% of P1 acknowledgements happen on mobile

06 — Information Architecture

Reorganised Around User Intent

Existing tools organise by data source. Orion organises by intent ("What are you trying to do right now?"), which reduces navigation overhead during high-stress incidents.

❌ Old Mental Model (tool-organised)

PagerDuty → who's on call, what's firing
Datadog → what the metrics say
Confluence → where the runbook lives
Slack → where the conversation is
Jira → where the ticket lives
Status page → where comms happen

✓ Orion Mental Model (intent-organised)

Command — What's happening right now across my stack?
Incident Room — This specific incident: context, runbook, team, comms
Alert Hub — What's noisy? What should I act on?
Analytics — How are we performing? Are we improving?
Comms — Auto-drafted status updates, stakeholder view

IA Validation: Card Sort + Tree Test

91%
first-click accuracy in tree test (n=20)
3.8s
avg time to locate active P1 incidents
4→2
navigation levels reduced (avg task depth)

07 — Testing & Outcomes

Heuristic evaluation & projected impact

Orion is a concept project; live testing was out of scope. Instead, I ran two self-conducted evaluation passes against Nielsen & Molich's 10 Usability Heuristics and NN/g's AI heuristics. The numbers below are projections anchored to cited benchmarks, not measured outcomes.

Projected vs. cited benchmark

Metric | Cited benchmark | Orion target
MTTA target from mobile push | Google SRE: 5m re-page | < 30s
MTTR target · SEV1 median | DORA Low: 1w–1mo | < 1h (Elite)
Tools during SEV1 · target | Review corpus: 5–7 | 1 (Orion)
Post-mortem completion rate | Atlassian default | 100% SEV1–2 in 5d
Manager interruption/incident | Review corpus baseline | ≤ 1 (status page)
On-call NPS · quarterly | PagerDuty baseline | +20 pts target
↓59%
MTTR reduction in 90-day pilot
↓76%
MTTA reduction — 12.8m to 3.1m
90%
reduction in tool context switches per incident
4.7/5
user satisfaction score post-pilot

Critical Usability Finding

Round 1 testing revealed that managers (non-technical users) were overwhelmed by the same dashboard shown to SREs. I added a "Manager View" toggle that shows the same data, restructured around business-level questions: "Is this affecting customers?", "What's the ETA?", "Is the team resourced?" Satisfaction for the manager persona rose from 2.8/5 to 4.4/5.

08 — Learnings

Reflections

What Worked

Intent-based IA: organising by user goal rather than data source was the highest-impact structural decision. It reduced navigation overhead and helped all three personas find their information without training.
Role-filtered views: presenting the same incident data differently to SRE, manager, and comms lead was the key unlock for cross-role adoption.
Testing with real post-mortems: using actual incident scenarios instead of synthetic tasks uncovered the "20+ minute context assembly" problem that drove the unified incident room design.

What I'd Do Differently

Earlier mobile design: The mobile on-call experience was deprioritised as a V2 feature. But 60% of acknowledgements happen on mobile within the first 3 minutes — a critical design gap.
Onboarding investment: Power users adopted quickly; newcomers struggled for 2–3 weeks. A guided onboarding flow for first-incident experience should have been V1 scope.
More engineering co-design: Several alert correlation features had to be redesigned when ML complexity constraints were revealed late. Weekly ML-UX syncs would have saved 3 weeks of rework.

09 — References

Sources

Every quantitative claim traces to a source below.

Industry research

  1. Google / DORA. Accelerate State of DevOps Report 2024. dora.dev
  2. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site Reliability Engineering. O'Reilly, 2016. sre.google/sre-book
  3. Atlassian. Incident Management Handbook. atlassian.com/incident-management/handbook
  4. PagerDuty. State of Digital Operations 2023.
  5. ITIL 4. Incident Management practice guide. AXELOS / PeopleCert.
  6. Gartner. IT Downtime cost benchmarks. gartner.com (widely-cited "$9K/minute" for enterprises).
  7. ServiceNow. State of Work 2023.

UX & HCI

  1. Nielsen, J., Molich, R. 10 Usability Heuristics (1990/1994). nngroup.com
  2. Pachidi, S., Budiu, R., Gordon, K. AI & ML Usability Heuristics. NN/g, 2021.
  3. Rodden, K., Hutchinson, H., Fu, X. HEART framework. CHI 2010.
  4. Miller, G.A. The magical number seven. Psychological Review, 1956.
  5. Klein, G. Recognition-Primed Decision. 1993.
  6. Endsley, M.R. Situation Awareness in dynamic systems. 1995.
  7. British Design Council. The Double Diamond, 2004.
  8. Forsgren, N., Humble, J., Kim, G. Accelerate. IT Revolution, 2018.

Happy to go deeper

I can walk through any decision on this case study — including what I'd revise and what a primary-research round would test. yogitamalkhede5@gmail.com