Designing AI That Earns User Trust,
Not Demands It

Read this to see how I:

  • Used continuous discovery to build a case that became central to our strategy for retaining two at-risk enterprise accounts
  • Led the design of Balto's first generative AI product, making decisions across configuration, coverage, and a new user segment
  • Established a design principle, verification over blind trust, that reshaped how Balto approaches every GenAI feature since
QA analyst workflow prototype

The Problem: A Legacy Approach in a Changing Market

Quality assurance (QA) is the process contact centers use to verify that agents follow regulations and meet service expectations on every call.

The catalyst was urgency around two important customers who were unlikely to renew. Competitors had pitched them a compelling vision:

  • 100% QA call coverage
  • Simpler, self-maintained configuration
  • Fully automated AI-driven evaluation

Through continuous discovery, UX had already been building a case to address the configuration challenges and the shortcomings of our existing product. That overlap turned my research findings into part of our strategy, and UX became central to the effort to save those accounts.

Spreadsheet showing QA evaluation criteria mapped across multiple customer playbooks
Customer scorecards — shared evaluation criteria

Research: Listening to the People Evaluating Us

I established relationships with the QA managers who would use the product daily and began to understand their pain points.

During these initial conversations, the frustrations piled up quickly. Configuration was painful. Maintaining scorecards required ongoing support, and each customer had its own way of running its QA department, which directly shaped how they expected to configure and use their tooling.

A recurring theme emerged that resonated with our earlier research: QA analysts, the people conducting daily evaluations, were working across Excel sheets, other software, and Balto to check for call events, but could not bring their workflow into our software. This confirmed that an earlier product strategy had completely overlooked this user segment.

As that understanding solidified, the path forward became clearer. We needed to introduce GenAI and make it easy to self-serve to stay competitive. We needed to answer the coverage promises our competitors had made. And we needed to design for a user segment our platform had never served.

User flow diagrams mapping the QA analyst ecosystem, jobs to be done, and end-to-end analyst workflow
Mapping user flows while mocking up with the team

Three Decisions That Changed How Users Work

Decision 1: Rethinking Configuration for the Age of GenAI

Configuration was one of the most consistent pain points across our user base. Our existing solution relied on "playbook events," essentially nested query builders that required users to construct complex conditional logic to evaluate calls. The system was rigid, broke when any connected part of the platform changed, and required Customer Success to help configure and maintain.

Our engineering team had already been experimenting with prompt-driven call evaluation, but dropping a prompt field into a broken experience wouldn't solve what users were actually struggling with. If we were rethinking evaluation, we needed to rethink scorecard configuration entirely.

The design challenge was that a scorecard criterion needed to serve two audiences at once: it had to be human-readable for analysts performing evaluations, and it had to give the LLM what it needed to evaluate the call accurately. I worked with the AI engineering team to merge both priorities into a single configuration flow. The top half of each criterion is where users write their assessment guidelines in plain language. The bottom half is where they configure the prompt that drives the AI evaluation, with tabs for prompt writing, model settings, and testing.
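One way to picture that dual-audience criterion is as a single record carrying both halves. A minimal sketch, assuming a hypothetical schema (these field and class names are illustrative, not Balto's actual data model):

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One scorecard criterion serving two audiences at once."""
    # Top half: plain-language guidelines the analyst reads while evaluating.
    title: str
    guidelines: str
    # Bottom half: configuration that drives the AI evaluation,
    # mirroring the tabs for prompt writing and model settings.
    prompt: str                 # what the LLM is asked about the call
    model: str = "default"      # model setting chosen in the settings tab
    max_score: int = 1

# Example criterion: both halves describe the same expectation,
# one for the human evaluator and one for the LLM.
intro = Criterion(
    title="Introduction",
    guidelines="The agent should give their name and the company "
               "within the first 30 seconds of the call.",
    prompt="Did the agent introduce themselves by name and state "
           "the company within the first 30 seconds? Answer yes or no.",
)
```

Keeping both halves in one record is what makes the single configuration flow possible: an edit to the criterion keeps the human-readable guidance and the AI prompt in the same place.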

Configuration went from a technical task requiring support to something a QA manager could set up independently. Evaluations became more accurate because the LLM was inferring meaning from conversation context rather than matching rigid event patterns, which meant that changes across other parts of the platform would no longer break QA workflows.

Legacy scorecard configuration UI with complex nested event builders
Before: event-based configuration
After: plain-language criteria and AI prompting UI

Decision 2: Reframing "Evaluate Everything" into "Evaluate What Matters"

Competitors were promising 100% AI-powered evaluation, every single call scored automatically. Our customers were told this was the standard for modern QA.

Our engineering team had raised earlier that year that the compute costs of full coverage would be unsustainable, and as a smaller company we had a clear expectation to keep costs manageable. That meant we needed to rethink not just how we evaluated, but what we evaluated.

I facilitated a series of workshops and conversations bringing together engineers and product leadership. Using a design thinking approach, I structured the sessions around sense-making first: giving each team space to share what mattered most to them and the constraints they were working within. From that shared understanding, an idea around layered eligibility emerged.

The solution was smart eligibility. Users could define filtering criteria like call length, teams of agents, and other legacy filters, while our system added a layer of LLM-powered eligibility checks on top. Only calls that matched both layers were evaluated. For edge cases, an override let analysts manually pull in calls that the filters missed.
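In pipeline terms, the layering can be sketched roughly as follows. This is an illustrative sketch, not Balto's implementation: the function names are hypothetical, and the LLM check is stubbed with a keyword test standing in for a prompt-powered call.

```python
def metadata_filter(call, min_len=120, max_len=3600, excluded_agents=()):
    """Layer 1: cheap metadata checks (call length, agent filters)."""
    return (min_len <= call["duration_s"] <= max_len
            and call["agent"] not in excluded_agents)

def llm_eligibility(call):
    """Layer 2: prompt-powered check, run only on calls that pass layer 1.
    Stubbed here; in practice this would ask an LLM whether the
    conversation meets the eligibility criteria."""
    return "disclosure" in call["transcript"].lower()

def eligible_calls(calls, overrides=frozenset()):
    """A call must pass both layers, unless an analyst manually overrides."""
    for call in calls:
        if call["id"] in overrides:
            yield call  # manual pull-in for edge cases the filters missed
        elif metadata_filter(call) and llm_eligibility(call):
            yield call
```

Ordering the layers this way is what keeps compute costs manageable: the expensive LLM check never runs on calls that metadata alone already disqualifies.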

The layered approach improved accuracy significantly. We built a proof of concept with one of our customers to validate a list of eligible calls for evaluation, which took our eligibility accuracy from the low 60s to around 90%. Evaluating every call meant drowning in low-value, noisy results. Targeted evaluation surfaced the calls that actually needed attention, more accurately and at a fraction of the cost.

Eligibility configuration UI: step 1 defines call metadata filters (call length between 2 and 60 minutes, agents included or excluded, tags, playbook), with a note that a single call may be assessed by several scorecards concurrently; step 2 adds prompt-powered eligibility criteria that disqualify the call if they are not met during the conversation
Metadata and prompt-powered eligibility

Decision 3: Designing for Verification, Not Blind Trust

QA analysts spent their days context-switching, jumping between internal documentation, call transcripts, and audio recordings, trying to find the specific moments where something happened on a call. Every evaluation meant hunting through an entire conversation to locate the evidence. It was slow, repetitive, and mentally numbing.

I proposed a question to engineering and product: what if we could identify those moments automatically? What if the interface could point analysts directly to the evidence they needed, criterion by criterion?

Through early usability testing, we learned analysts needed this at the criterion level, where they were already in context and making a judgment. So we built a proof of concept: a "play from here" button embedded inside each criterion that jumped the analyst directly to the relevant moment in both the audio and the transcript.
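Conceptually, each AI evaluation carries the timestamp of its supporting evidence, and "play from here" simply seeks both views to that offset. A hypothetical sketch (the result shape, field name, and player/transcript interfaces are all assumptions for illustration):

```python
def jump_to_moment(criterion_result, audio_player, transcript_view):
    """Seek both the audio and the transcript to the evidence
    behind one criterion, keeping the analyst in context."""
    t = criterion_result.get("evidence_ts")  # seconds into the call
    if t is None:
        # No AI-identified moment for this criterion;
        # the analyst reviews it manually.
        return False
    audio_player.seek(t)
    transcript_view.scroll_to(t)
    return True
```

Returning a flag for the no-evidence case matters for the verification principle: the interface can show a "Jump to moment" link only when there is actually a moment to verify, rather than pretending the AI always has an answer.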

QA analyst interface: scorecards (e.g., "Medicare Supplement Outbound" at 90%, "Compliance Disclosure Review" pending) with per-section scores, criterion-level questions such as "Did the agent introduce themselves by name and state the company within the first 30 seconds?", and "Jump to moment" links into the call audio and transcript
QA analyst interface elements

We moved away from assuming users would blindly trust the AI. Instead, we gave them a way to work with it on their own terms. If an AI evaluation was right, they confirmed it and moved on. If it wasn't, they overrode it. Either way, the user stayed in control.

The early results signaled that QA analysts were able to evaluate twice as many calls in the same amount of time compared to their previous workflow, without sacrificing the accuracy that compliance required.

The Impact: From Churn Risk to Champion

We lost one of the two accounts; their timeline moved faster than ours. But the customer who stayed remains one of our strongest partners to this day.

Slack message celebrating 91.9% accuracy with the customer who stayed
CRM record showing the closed lost account
Sometimes timelines do not align

After launching QA Copilot, I tightened up our process for gathering feedback and measuring engagement and user retention. UX-Lite scores went from 2.3 to a consistent 4 out of 5 quarter over quarter.

Outside of metrics, this was the start of a new approach to our product. Balto has been an AI-native company since its inception, but for most of its history we kept the intelligence in the background and assumed users would trust the results. That never fully worked.

QA Copilot changed that. For the first time, the user and the machine visibly worked side by side in the same workflow. One surfaced evidence. The other made the call. Trust wasn't something we expected. It was something users built for themselves, one evaluation at a time.