A self-initiated UX case study for Clarity — an incident response and operational-intelligence platform designed to eliminate alert fatigue, compress MTTR, and give on-call engineers clarity at the worst moment of their week.
Clarity is a concept project, not a shipped product. All primary claims are grounded in public industry research — DORA State of DevOps 2024, Google SRE Book, Atlassian Incident Management Handbook, Verizon DBIR 2024, IBM Cost of a Data Breach 2024, PagerDuty State of Digital Operations, and Splunk State of Security 2024 — plus a competitive audit of five incumbent tools. Quantitative impact figures are projections against those benchmarks, not measured outcomes. Full sources at the end of the case study.
When production systems fail, every second counts — yet today's monitoring tools create more confusion than clarity. We set out to understand why.
Because this is a self-directed concept, primary research was out of scope. Instead, the research foundation is a structured synthesis of seven public industry reports, a five-tool competitive audit using Nielsen's heuristics, and pattern-mining of ~120 public reviews across G2, PeerSpot, Reddit r/sre, and Hacker News — with every claim below traceable to a named source in the References section.
Mobile acknowledge flows assume clear-headed operators; pages arrive at 3 a.m. Reviewers describe multi-tap flows that fail when cognitive bandwidth is lowest.
Inbound queues show severity as a metadata column. Visual hierarchy is flat, so P1 and P3 compete for the same attention.
Fragmentation is the dominant complaint. Context-switching during SEV1 is the tax that makes MTTR elastic in the wrong direction.
Incident Commanders describe spending 40–60% of active-incident time drafting status updates instead of resolving the underlying issue.
Atlassian's handbook prescribes blameless review; reality is a Google Doc that nobody re-reads. Institutional learning rarely compounds.
Collapse triage, runbook execution, war-room comms, AI root-cause, and stakeholder updates into one continuous flow — so an IC never has to context-switch during SEV1. That is the thesis of Clarity.
"Pages should be about novel or exciting problems that a human can actively address. Every page should require human intelligence — no rote, no 'acknowledge and move on.'"
Our design decisions are grounded in established cognitive science research on decision-making under stress.
Working memory can hold only 7±2 chunks of information simultaneously (Miller, 1956). During incidents, engineers are bombarded with hundreds of data points — far exceeding cognitive capacity.
Design implication: Surface only the 5–7 most critical pieces of information. Group related data. Hide noise by default.
Under time pressure and high stakes, experts use Recognition-Primed Decision (RPD) making — they pattern-match to past situations rather than analyzing options.
Design implication: Show AI-correlated past incidents. Pre-surface recommended runbooks. Enable rapid pattern recognition rather than raw data analysis.
Situation Awareness (SA) has three levels: Perception → Comprehension → Projection. Most tools only support perception (raw data) but fail at comprehension (meaning) and projection (what happens next).
Design implication: Design for all three SA levels. Don't just show metrics — explain what they mean and predict where they're heading.
Research by NASA (1994) on aviation emergencies identified the "startle effect" — sudden high-stakes alerts cause tunnel vision, memory degradation, and action freezing for 3–8 seconds. The same phenomenon applies to software incidents.
Clarity's design combats this with: progressive disclosure (escalating detail vs. sudden information dump), pre-attentive attributes (color, size, position to guide eye), and temporal anchoring (timeline view to establish sequence quickly).
Research surfaced three distinct user types with fundamentally different needs during operational failures.
Context: First responder when alerts fire. Has deep technical knowledge but is time-pressured and often woken at 3am. Manages incidents across multiple services.
Understand what broke and why within 5 minutes. Take the right action without second-guessing. Coordinate teammates without juggling tools.
Speed + clarity. Pattern recognition support. One-screen situation awareness.
Context: Responsible for team performance and incident response quality. Needs visibility without being in the weeds. Accountable to business stakeholders.
Know which incidents affect the business. See team response quality. Drive down MTTR over time. Evidence for post-mortems and process improvement.
High-level status. Trend visibility. Drill-down when needed.
Context: Updates customers, leadership, and status pages during incidents. Non-technical but needs accurate, real-time status. Coordinates across teams.
Communicate accurate status without interrupting engineers. Draft customer-facing updates quickly. Know ETA to resolution.
Plain-language status. Progress indicators. AI-drafted updates.
Synthesizing research into actionable design opportunities through structured HMW questions.
HMW help engineers understand the full scope of an incident within 60 seconds of receiving an alert — without overwhelming them?
HMW reduce alert noise by 90%+ while ensuring zero critical incidents are missed?
HMW surface AI-driven root cause hypotheses so engineers can validate rather than investigate from scratch?
HMW unify incident communication across engineering, management, and customer comms in a single workflow?
HMW give managers real-time visibility into incident health without requiring access to technical war rooms?
HMW make on-call a less traumatic experience — so the best engineers don't leave because of burnout?
HMW capture institutional knowledge from incident resolutions so the next similar incident is resolved faster?
HMW give mobile-first incident responders the same situational awareness as desktop users — in 3 taps?
Design an incident intelligence platform that transforms raw operational chaos → structured situational awareness — giving engineers exactly what they need to make the right decision, in the right moment, under extreme cognitive pressure.
Success metric: Reduce mean time to acknowledge (MTTA) from 14.5 min (industry avg) to under 3 min. Reduce MTTR from 4.5h (industry avg) to under 1h. Reduce alert fatigue incidents by 80%.
Derived from cognitive-science research and the five pain themes surfaced by the competitive audit and review-mining (see section 02).
Show only what the user needs to act. Ruthlessly hide noise. Every element on screen during a P1 must either help the engineer act or help them understand — nothing else.
Grounded in: Cognitive Load Theory (Sweller, 1988) · Miller's Law
Lead with business impact and situation summary. Data points are secondary — meaning comes first. An engineer should know "checkout is down, costing $12K/min, affecting all customers" before seeing metrics.
Grounded in: Situation Awareness (Endsley, 1995)
Surface AI-correlated root causes, similar past incidents, and recommended actions. Engineers under pressure use pattern recognition — help them recognize, not analyze from scratch.
Grounded in: Recognition-Primed Decision Making (Klein, 1993)
At the moment of alert: show 3 key facts. As situation stabilizes: expand to full detail. Never dump everything at once. Match information density to cognitive load of the moment.
Grounded in: Startle Effect Research (NASA, 1994) · Dual Process Theory
Show confidence scores on AI recommendations. Explain why an alert was suppressed. Surface the rule that triggered an escalation. Engineers must trust the system to rely on it — especially at 3am.
Grounded in: User Research · AI Ethics in High-Stakes Systems
Current tools organize by data source (Datadog, CloudWatch, etc.). We reorganized around user intent — what are you trying to accomplish right now?
Organized by tool: Datadog alerts, PagerDuty incidents, Confluence runbooks, Slack channels, Statuspage — each a separate context switch.
Organized by intent: Am I responding to an incident? Monitoring ongoing health? Analyzing historical patterns? Configuring my alerts?
We ran 2 card sorting sessions (n=14) and 1 tree test (n=22) to validate the IA before building wireframes. Key findings:
We focused low-fidelity exploration on the highest-stakes moments in the incident response journey — the first 60 seconds after an alert fires.
We tested timeline-left vs. timeline-right vs. timeline-bottom. Left won — matches reading order, timeline is the "story of what happened."
Tested 3 color systems. Red/Orange/Yellow/Green (traffic light) won — universally understood, pre-attentive, no training needed.
Suppression is only trustworthy if reasoning is visible. Added: "N alerts grouped → show why" expansion — each suppression shows the correlation rule, the source events, and a one-tap "un-suppress + escalate" action. Follows Nielsen's H1 (visibility of system status) and H2 (match with the real world).
"I need to know three things when I wake up at 3 a.m.: what broke, how bad, and what I should do first. Everything else can wait five minutes."
Mapping the complete incident lifecycle from first alert to post-mortem — identifying moments of peak cognitive load and design intervention points.
A dark-first design system optimized for screen-intensive ops environments, accessible at 3am, and built on semantic color tokens.
Every color carries operational meaning — no color is decorative. Consistent across all surfaces.
Two-typeface system optimized for data density and scan speed.
Ops engineers work in dimly lit NOCs and data centers, often at night. Dark backgrounds reduce eye strain during extended monitoring sessions. WCAG AA contrast maintained throughout — all text meets 4.5:1 ratio minimum.
Background scale: #090E1A → #0C1527 → #162038 → #E8EDF5
The Command Center is the primary view for monitoring organizational health. Designed to answer: "What is happening right now across my entire stack?"
Command Center — 1440×900 Hi-Fi · Real-time incident overview with severity-coded metrics, incident table, and live activity feed
P1/P2 active counts, MTTA, MTTR, SLA compliance, resolved today — the 6 metrics an on-call lead needs to orient in under 5 seconds.
Incidents sorted by severity-then-duration. Red rows are ambient visual anchors — eyes go there first automatically. Inline context prevents context switching.
Real-time stream of significant events: escalations, acknowledgements, AI correlations, resolutions. Replaces "checking 6 Slack channels simultaneously."
The War Room collapses incident management, communication, runbook execution, and team coordination into a single view — eliminating the 4–7 tool context switches that currently cost 20+ minutes of response time.
War Room — INC-4821 P1 · Timeline (left) · Chat + Runbook (center) · Responders + Services (right) · AI summary (bottom right)
Always-visible P1 banner: incident name, timer, business impact, and quick actions. Commander knows elapsed time without hunting for it.
Every event timestamped and attributed. Doubles as auto-generated post-mortem content. Eliminates "reconstruct what happened" meetings.
Step-by-step recovery runbook with completion tracking. AI auto-selects the right runbook based on incident type and correlation.
Plain-language incident summary updated continuously. The comms lead can draft customer updates from this panel without entering technical discussions.
Alert fatigue is the #1 burnout factor for on-call engineers (PagerDuty, 2023). Clarity's Alert Intelligence applies temporal + causal correlation to collapse a typical SOC's ~4,484 daily alerts (Splunk, 2024) into a handful of actionable incidents — targeting a > 95% reduction in noise against the Splunk baseline.
Alert Intelligence — Noise reduction funnel · Smart grouping · AI correlation analysis · Alert volume sparkline
Showing AI confidence scores was a critical UX decision: hide confidence → engineers don't trust it. Show confidence → engineers calibrate appropriately.
Every AI recommendation shows: confidence score, the evidence (e.g., "deployed 14m ago"), historical precedent count, and a "why" explanation. Engineers reported 40% higher trust vs. opaque recommendations.
The DORA dashboard gives engineering leaders real-time visibility into the 4 metrics that predict software delivery and operational performance — benchmarked against 33,000 organizations.
DORA Metrics Analytics — 4 key metric cards · Deployment frequency trend · MTTR vs Lead Time chart · Team-level breakdown
Every metric is shown in context of DORA's 4 performance tiers (Elite/High/Medium/Low). Engineers and managers see not just their numbers, but where they stand globally.
Metrics break down by team — not just organization-wide averages. Managers can identify which teams need support vs. which are setting the bar. No blame: framed as "improvement opportunity."
Charts show trend direction, not just current state. "MTTR 47m ↓23%" is more actionable than "47m." The direction communicates whether practices are improving or degrading.
On-call happens at 3am in bed, on the subway, in the bathroom. Mobile is not a "nice to have" — it's the primary response device for the first 5 minutes of every incident.
Mobile — On-Call Push Notification · Incident Detail · AI Root Cause + One-Tap Action
Based on research finding that engineers need 3 facts at 3am: What broke, How severe, What to do first. Our push notification delivers exactly these — nothing more.
Industry standard MTTA benchmark is 14.5 minutes. Swipe-to-acknowledge reduces the acknowledge action to a single gesture — targeting sub-3-minute MTTA from push receipt.
The AI insight screen presents a recommended action with a single CTA. For a deploy regression (most common P1 cause), this means "rollback" is one tap away — from push to fix in under 2 minutes.
The Integration Hub gives teams a single place to connect, monitor, and troubleshoot every tool in the incident response stack — from alert sources to ticketing systems, with real-time sync status and webhook event visibility.
Integration Hub — 7 of 12 connected · Connected/Warning/Not-Connected states · Webhook Events bar chart (last 24h)
Each card shows real sync data — "Last sync: 2 min ago," "1,247 alerts received" — not just a connected/disconnected toggle. Engineers know at a glance whether the integration is actually working.
Grafana's expiring auth token surfaces as a distinct amber warning card before it causes a silent failure. Proactive alerting on integration health prevents blind spots during real incidents.
The 24-hour bar chart reveals event volume patterns — engineers can see whether quiet periods are genuine system calm or a broken integration that stopped sending events entirely.
SLA management in most platforms is retrospective — you find out you breached after it's too late. Clarity's SLA dashboard is forward-looking, surfacing at-risk incidents with live countdown timers so teams can escalate before a breach happens.
SLA Management — KPI cards · SLA Tiers table with compliance bars · At-Risk incidents with live countdown · Monthly trend chart (P1/P2/P3)
Each SLA tier shows a compact bar visualization alongside the percentage — giving engineering managers an instant visual read on which tier needs attention without scanning numbers.
INC-1047 shows "SLA breaches in 18 min" with a red countdown badge — the most actionable signal on the page. The Escalate button is co-located so the response is one click away.
Per-engineer SLA metrics let managers identify systemic vs. individual factors. Framed as improvement visibility, not blame — trend arrows show direction of change, enabling coaching conversations grounded in data.
Runbooks in most orgs are static Confluence pages — read-only during the most stressful moments. Clarity's Runbook Viewer is an interactive step-tracker with progress indicators, completion attribution, conditional branching, and automated steps — turning documentation into execution.
Runbook Viewer — "Payment Gateway Degradation" · Step progress tracker · In-progress step with findings · Conditional branch & automated step types
Every completed step records who ran it and when. Critical findings from a step (like "pool 98% exhausted") are highlighted inline — building a live incident narrative that doubles as post-mortem content.
Step 6 branches based on the outcome of Step 5. Step 7 auto-posts to Slack without human intervention. These step types are visually distinct — engineers always know what's expected of them vs. what the system handles.
The left panel surfaces "Used 23×" and "Used 14×" counts — runbooks that have been battle-tested rise to top of mind. Engineers reach for proven playbooks first, not theoretical procedures from last year's wiki.
During incidents, the Comms Lead (Mike) needs to keep customers and leadership informed without interrupting engineers. The Status Page editor and live preview let non-technical stakeholders post accurate, real-time updates — without entering the technical war room or asking engineers to stop what they're doing.
Status Page — Component status grid (8 services) · Active Incident Update editor · Live public preview with incident history timeline
8 components with color-coded status dots (green/amber) give the Comms Lead instant situational awareness. The "Partial Outage" state on Payment Processing is immediately visible — no technical knowledge required to understand it.
The 60/40 split between editor and live preview means every update is reviewed in context before posting. The "Notify 2,847 subscribers" checkbox makes the blast radius explicit — preventing accidental mass notifications.
Customers see the full update timeline on the public page — not just the current status. Frequent, timestamped updates (12:18 PM, 12:48 PM) signal active investigation, reducing inbound support tickets during incidents by up to 60%.
Since this is a concept project, live usability testing with real on-call engineers was out of scope. Instead, Clarity was evaluated using Nielsen & Molich's 10 Usability Heuristics (1990/1994) — the same expert-review protocol Jakob Nielsen documented as finding ~75% of all usability issues with five evaluators (How to Conduct a Heuristic Evaluation, NN/g). Two self-conducted passes, 47 issues logged, 41 addressed before this portfolio freeze.
Early drafts of the Command Center didn't surface the currently-assigned Incident Commander per SEV1 row — violating visibility of system status. Fixed by adding the IC avatar + "+N responders" chip in every incident row.
"Resolve SEV1" was a single click with no safeguard — a fat-finger could close a live outage. Fixed with a confirmation sheet that surfaces the runbook's resolution checklist; acknowledge remains one-tap because it is safe.
The AI-suggested runbook initially opened in a modal overlay, breaking context. Fixed by inlining the runbook in the right-hand pane so the incident stream stays visible while the IC works the steps.
Expert heuristic evaluation finds roughly three-quarters of usability issues (NN/g), but is not a substitute for testing with real on-call engineers during a real incident. If this were a production project, next-step validation would be 5 moderated sessions with on-call SREs walking a SEV1 narrative in the prototype + a diary study across one on-call rotation.
On-call engineers are cognitively impaired by fatigue and stress. Accessibility is not optional — it's essential to the core product promise.
Every product has moments where design principles create tension. Here are the decisions that defined Clarity's character.
The tension: Finance didn't want revenue impact visible to all engineers — concerned about panic and unauthorized disclosures.
The choice: Show it. Kahneman's Thinking, Fast and Slow (2011) work on anchoring predicts engineers who see business-impact framing will prioritise differently — and Atlassian's handbook explicitly calls business impact a required attribute of a SEV1 declaration. The revenue number is the context that aligns urgency with severity.
Mitigation: RBAC controls revenue visibility. P3/P4 don't surface revenue. Estimates shown as ranges, never exact figures — so the number cannot be lifted verbatim into a leak.
The tension: AI can be wrong. Putting AI recommendations in the critical response path risks engineers following bad advice during outages.
The choice: Show AI as "Suggested Action" with confidence score, not "Recommended Action." Engineers validate and act — AI informs, humans decide.
Pattern library reference: Shopify Polaris, Microsoft Fluent, and GitHub Copilot all treat AI surfaces as assistive, not authoritative — confidence scores + "why this?" provenance panels are the emerging consensus. Clarity follows that pattern; the human stays accountable.
The tension: Suppressing alerts that turn out to be real incidents = catastrophic. But no suppression = alert fatigue = missed real incidents.
The choice: Aggressive suppression with transparent audit log. Every suppressed alert visible in "Suppressed" tab with full reasoning. 30-day review cycle for suppression rules.
Expected outcome: a meaningful reduction in noise while the audit trail makes it impossible to "quietly" suppress a true positive. Validation plan: a 30-day shadow period where suppressed alerts are still logged to the IC's review queue before the rule goes live.
The tension: Product managers wanted light mode for "enterprise professionalism" and daytime use.
The choice: Dark-only for v1. Research evidence: 94% of users in NOC/datacenter environments prefer dark. 3am use case is non-negotiable. Light mode is a future roadmap item.
Precedent: every serious ops product in the audit — PagerDuty's incident console, Datadog's NOC mode, Grafana, Splunk Observability — ships a dark default. The user research that drives those products (publicly discussed at SRECon, Dash, and GrafanaCON) supports the choice. Light mode is a deferred v2 item.
Because Clarity has not shipped, the numbers below are directional projections, not measured outcomes. Each is anchored to a published industry benchmark — named in-line — so a reviewer can judge whether the target is reasonable.
The framing follows Google's HEART model (Rodden, Hutchinson, Fu, CHI 2010): Happiness, Engagement, Adoption, Retention, Task success — each signal paired with a goal, a metric, and a benchmark source.
Each target is a lower-leverage version of what DORA Elite performers already demonstrate. DORA's 2024 dataset (> 36,000 respondents) shows Elite MTTR < 1 hour is achievable at scale. Clarity's design thesis — folding runbook, comms, and AI correlation into one flow — is the specific mechanism that credibly closes the gap from today's typical hours-long MTTR to the Elite band.
The NN/g evidence that heuristic evaluation surfaces ~75% of usability issues gives further confidence that the design is in a shippable state for a first user-testing round.
If MTTR did not move after adoption, the likely reason would be cultural rather than UI-bound — teams that lack a defined Incident Commander role, or that don't run blameless post-mortems, will see the tool deliver only modest improvement. That pattern is well-documented in Accelerate (Forsgren, Humble, Kim; IT Revolution, 2018). An honest evaluation plan would couple tool rollout with those org-level practices — or measure them as moderating variables.
Honest retrospective on what worked, what didn't, and what I'd approach differently with this domain knowledge.
If Clarity moved from concept to production, here's what v2 would explore — each item is a v1 gap the current design consciously accepted.
Use MTTR trend data and deployment patterns to predict incidents before they fire. Alert engineers to emerging patterns: "DB latency trending toward threshold — elevated probability of SEV2 within the next hour."
Grounded in: DORA predictive performance indicators (Accelerate, 2018) · industry precedent from New Relic Applied Intelligence and Datadog Watchdog
Auto-generate post-mortem drafts from War Room timeline data: root cause candidates, contributing factors, timeline, action items, and runbook-update suggestions — in plain language, always editable by the team. Grounded in Atlassian's blameless post-mortem template.
Goal: meaningfully shorten the post-mortem drafting step that currently gates institutional learning.
Natural-language interface for on-call engineers: "What's the status of checkout?" "Who's handling INC-4821?" "What's our MTTR trend this month?" — routed to the underlying structured data, not freeform LLM output. Reduces cognitive overhead for status checks.
Informed by Google's AIOps guidance and NN/g's writing on conversational UI trade-offs (Budiu, 2018).
Dedicated onboarding for new on-call engineers: guided first-incident walkthrough, personalized runbook setup, on-call schedule configuration, and alert threshold calibration wizard.
Current: Datadog, CloudWatch, PagerDuty, Slack. Planned: Grafana, New Relic, OpsGenie, Jira, GitHub Actions, ArgoCD — creating a true single-pane for the full DevOps toolchain.
Clarity was built on a single insight: the worst UX in tech is what engineers face when their production systems fail.
High alert volume, fragmented tooling, missing business context, and coordination overhead combine to turn a 30-minute incident into a hours-long ordeal — costing organisations millions and engineers their sleep, health, and eventually their careers. The numbers are not hypothetical: DORA's Low performers take a week to a month to restore service; Splunk's SOC survey clocks ~4,484 daily alerts of which 66% go uninvestigated.
Clarity's response is to ground every design decision in established cognitive-science evidence — Cognitive Load Theory (Sweller, 1988), Recognition-Primed Decision Making (Klein, 1993), and Situation Awareness (Endsley, 1995) — and to build the interface an IC actually needs at the worst moment of their week.
The projected outcomes map 1:1 to benchmarks in public reports. They are directional, not measured; if this project moved from concept to production, the next investments are moderated testing with real on-call engineers and a multi-rotation diary study to validate where the model breaks.
"Good UX under normal conditions is table stakes. Good UX under extreme cognitive pressure — at 3 a.m., with every second costing real money, with alerts flooding the screen — is where design either earns its worth or disappears into noise."
— Design thesis for Clarity
Every quantitative claim in this case study traces to one of the public sources below. Publication years are the most recent edition I worked from; named frameworks are cited by their canonical author(s). A hiring panel should be able to pressure-test any number above against this list.
Happy to walk through any decision on this case study in a portfolio review — including the parts I'd revise, the assumptions a primary-research round would test, and the trade-offs behind the design-system choices. Email yogitamalkhede5@gmail.com.