How to Improve Your AI Solutions So Customers Become Obsessed

How to build and optimize your AI solutions fast with data, metrics, and feedback.

You've shipped a demo, raised funding, maybe even landed your first few customers, but:

  • Your AI system doesn’t always work

  • Customers are getting frustrated

  • You’re not sure how to debug or improve things

  • And you're stuck in “maybe we should change the model again?” mode

Most companies respond by making big changes and hoping for the best.

By using data and feedback instead, you can improve your AI solution because you will know where the root cause of the problem is.

How To Build a Simple Evaluation System That Actually Works

The rest of this post walks you through building one that:

  • Helps you find and fix real problems

  • Works whether you're using RAG or just a pure LLM

  • Gives you signals that lead to real improvements

  • Can be run weekly without stress

Step 1: Start With Your Objective

Before metrics, models, or dashboards — ask the real questions:

  • What does this AI system do?

  • Who is it for?

  • What does a “good” response look like?

  • What are the failure modes we want to avoid?

Example:

If you're building a support bot, a “good” answer is fast, correct, and solves the user's problem. A bad one is slow, made-up, or off-topic.

Step 2: Create a Synthetic Test Data Set

You don’t need a million rows of data. You need 20–50 realistic, representative test cases.

Each entry should include:

  • A realistic user query

  • The ideal answer you want

  • (For RAG systems) The documents the answer should come from

Use GPT to generate drafts, then manually edit for quality.

This becomes your golden dataset — your test bed for all future changes.
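
Here's a minimal sketch of what two entries could look like (the field names, example answers, and file paths are illustrative assumptions, not a required schema):

golden_set = [
    {
        "query": "How do I reset my password?",
        "ideal_answer": "Go to Settings > Security > Reset Password and follow the emailed link.",
        "source_docs": ["kb/password-reset.md"],  # only needed for RAG systems
    },
    {
        "query": "Can I get a refund after 45 days?",
        "ideal_answer": "Refunds are only available within 30 days of purchase, so this order does not qualify.",
        "source_docs": ["kb/refund-policy.md"],
    },
]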

Step 3: Build a Simple CLI or Script to Run Evals

You don’t need any expensive software. Just write a script that:

  • Takes the test queries

  • Sends them through your system (prompt → retrieval → model)

  • Logs the:

    • Query

    • Retrieved documents

    • Generated answer

    • Time to respond

    • Evaluation scores

Now you can test your AI system with every update — and compare results over time.
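
Here's a rough sketch in Python, assuming a golden_set like the one above and an answer_query function that wraps your own prompt → retrieval → model pipeline:

import json
import time

def run_evals(golden_set, answer_query, out_path="eval_run.json"):
    """Run every test query through the system and log the results."""
    results = []
    for case in golden_set:
        start = time.time()
        # answer_query is your own pipeline; assumed to return the generated
        # answer plus the documents it retrieved.
        answer, retrieved_docs = answer_query(case["query"])
        results.append({
            "query": case["query"],
            "retrieved_docs": retrieved_docs,
            "answer": answer,
            "latency_s": round(time.time() - start, 2),
            "scores": {},  # filled in by the evaluation steps below
        })
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
    return results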

Step 4: Start With Basic Retrieval Metrics (If Using RAG)

If your AI depends on retrieved documents (like in RAG), you must start by checking retrieval.

Two essential metrics:

  • Retrieval Precision@k – What % of retrieved docs are relevant?

  • Retrieval Recall@k – What % of all relevant docs were retrieved?

If your system is pulling garbage, your answers will be garbage too, no matter how good your LLM is.

Precision@k = (relevant docs among the top k retrieved) / k

Recall@k = (relevant docs among the top k retrieved) / (total relevant docs for the query)
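
And a minimal sketch for computing both from document IDs, assuming your golden dataset lists the relevant documents for each query:

def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and Recall@k from lists of document IDs."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall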

Step 5: Add Core Answer Quality Metrics

Now score the actual answers. Keep it simple:

  • Accuracy – Is it factually correct? Measure with an LLM-as-a-judge or a human reviewer.

  • Helpfulness – Is it useful to the user? Measure with an LLM-as-a-judge or a user vote.

  • Faithfulness – Does it stick to the retrieved content? Compare the answer to the retrieved context.

  • Toxicity/Bias – Is it safe and appropriate? Check with Perspective API or Detoxify (see the sketch below).

Start here. These will give you 80% of what you need.
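
For the safety check, one lightweight option is the open-source Detoxify package; here's a minimal sketch, assuming it's installed (pip install detoxify):

from detoxify import Detoxify

# Returns a dict of probabilities per category (toxicity, insult, threat, ...)
# for the generated answer.
scores = Detoxify("original").predict("the generated answer text")
print(scores["toxicity"])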

Step 6: Use LLM-as-a-Judge to Automate Scoring

You don’t need a QA team or constant human review of every answer.

Feed GPT or Claude:

  • The user query

  • The context (retrieved docs)

  • The answer

  • Evaluation instructions (e.g. “Rate faithfulness 1–5”)

It returns scores like:

{
  "Faithfulness": 4,
  "Helpfulness": 5,
  "Acceptable": "Yes"
}

Validate a few by hand.

Then scale it across your test set.
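
Here's a minimal judge sketch using the OpenAI Python client (the model name, prompt wording, and scoring scale are assumptions; any capable model works):

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
JUDGE_PROMPT = (
    "You are grading an AI assistant's answer.\n"
    "User query: {query}\n"
    "Retrieved context: {context}\n"
    "Answer: {answer}\n"
    "Rate Faithfulness and Helpfulness from 1 to 5 and say whether the answer "
    "is Acceptable (Yes/No). Reply with JSON only, e.g. "
    '{{"Faithfulness": 4, "Helpfulness": 5, "Acceptable": "Yes"}}'
)

def judge(query, context, answer, model="gpt-4o"):
    """Score one answer with an LLM judge; returns a dict of scores."""
    prompt = JUDGE_PROMPT.format(query=query, context=context, answer=answer)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; add parsing guards in production.
    return json.loads(response.choices[0].message.content)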

Step 7: Track Product Metrics (Optional)

Once your core quality metrics are solid, track real-world UX impact:

  • Latency / TTFT – Time to First Token

  • Acceptable Rate – % of good answers

  • Escalation Rate – % of answers that needed human backup

  • Resolution Rate – % of issues resolved fully by AI

These connect AI quality to customer outcomes.
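
Here's a quick aggregation sketch over your logged results (the field names are assumptions; adapt them to whatever your logs actually record):

def product_metrics(results):
    """Aggregate product-level metrics from a list of logged results."""
    n = len(results)
    return {
        "avg_latency_s": sum(r["latency_s"] for r in results) / n,
        "acceptable_rate": sum(r["acceptable"] for r in results) / n,
        "escalation_rate": sum(r["escalated"] for r in results) / n,
        "resolution_rate": sum(r["resolved"] for r in results) / n,
    }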

Step 8: Save Versioned Snapshots

Every time you make changes (new prompt, model, retriever):

  • Re-run your synthetic set

  • Save:

    • Retrieval metrics

    • Answer quality metrics

    • Logs and traces

    • Notes on what changed

Store them in Git, Notion, or even a Google Sheet. You'll build a track record over time.
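
Here's a minimal sketch that appends each run to a JSONL history file (the file name and fields are placeholders):

import datetime
import json

def save_snapshot(metrics, notes, path="eval_history.jsonl"):
    """Append one versioned snapshot of an eval run to a history file."""
    snapshot = {
        "date": datetime.date.today().isoformat(),
        "metrics": metrics,  # retrieval + answer quality + product metrics
        "notes": notes,      # what changed: prompt, model, retriever, ...
    }
    with open(path, "a") as f:
        f.write(json.dumps(snapshot) + "\n")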

Summary

You don’t need:

  • Fancy MLOps stacks and expensive tools

  • 50+ metrics

  • Model switching every week

You need:

  • A clear objective

  • A small but sharp test set

  • A way to trace and debug

  • The right metrics

  • User feedback

  • Gradual improvements

Ship faster. Learn faster. Get your AI system back on track.

Want to ship or improve your AI solution as fast as possible? - BOOK FREE AI AUDIT CALL