Straw Prompting

A demo is not an eval. That is the AI rollout mistake hiding in plain sight.

Jun 04, 2026

An AI demo is not an eval.

That is the mistake behind a lot of brittle enterprise AI rollouts. The demo proves the model can answer one polished question in one controlled room. An eval — a test set of real cases, scored against a written standard — tells you whether the system can survive actual users.

Most teams still ship the demo.

That is what I am calling straw prompting: building an AI feature out of a hand-tuned prompt, testing it on the cases that already work, and launching before the messy cases arrive.

It looks like a house.

It is not ready for weather.

The happy path is not the product

Air Canada learned this in public.

In 2022, a customer used Air Canada’s website chatbot while booking travel after a death in the family. The chatbot told him he could buy a full-price ticket, travel, and then apply later for the airline’s bereavement fare.

That was wrong.

The airline’s actual bereavement policy said the discount could not be requested after travel had already happened. The chatbot even linked to the page with the correct policy, but the answer it gave in the conversation was still misleading.

The customer relied on the chatbot, travelled, and later asked for the fare difference. Air Canada refused. In 2024, British Columbia’s Civil Resolution Tribunal found Air Canada liable for negligent misrepresentation and ordered it to pay damages, interest, and fees.

The product lesson is not “chatbots are bad.”

It is sharper than that.

Air Canada had the correct information on its own website. The failure was that the interactive answer surface did not reliably behave like the policy it was supposed to represent.

That is the straw-prompting pattern in the wild.

The company did not just need a chatbot that could answer bereavement questions. It needed a system that could notice policy conflict, ground answers in the current source of truth, and escalate when the answer affected money, rights, or obligations.

A working answer was not enough.

The answer had to be governed like part of the product.

The wolf is variance

The wolf is not a smarter model from a competitor.

The wolf is variance: the messy spread of real-world inputs that your demo never saw.

Users misspell things. They paste half a policy. They use internal acronyms. They ask two questions at once. They omit context. They ask in anger. They ask in formats your prompt writer did not imagine.

OpenAI’s eval guidance makes the point plainly: AI systems need task-specific tests that reflect real-world conditions, edge cases, and continuous change. “Looks good to me” is an anti-pattern, not a launch gate.

That is the trap. A demo that works ten times in a row does not tell you how the system behaves on the eleven-thousandth input.
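One rough way to put a number on that gap: if a system passes n trials with zero failures, the highest failure rate still consistent with that record at 95% confidence is 1 - 0.05^(1/n). The sketch below assumes the trials are independent and drawn from the same mix of inputs as real traffic, which a demo rarely is, so treat the result as optimistic. The function name is illustrative.

```python
def max_failure_rate(clean_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after `clean_trials` passes and zero failures.

    Solves (1 - p) ** n = 1 - confidence for p. Assumes independent trials
    sampled from the same distribution as production inputs.
    """
    if clean_trials < 1:
        raise ValueError("need at least one trial")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_trials)


for n in (10, 100, 300):
    bound = max_failure_rate(n)
    print(f"{n:>4} clean runs: failure rate could still be as high as {bound:.1%}")
```

Ten clean runs are still consistent with a system that fails roughly one input in four. It takes a few hundred clean cases before the bound drops to around one percent.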

The launch review should expose the failure mode

You do not need a 40-point governance checklist to spot straw prompting.

You need a launch review that separates symptoms from causes. Run it against five patterns:

Polished examples, not error rates — the team tested the demo, not the system. Require a real-input eval set before launch.
No one can name the top failure modes — the risk model is missing. Write the failure list before tuning the prompt.
Output quality judged by vibes — the team has no shared standard. Create a rubric before reviewing answers.
“Guardrails later” in the plan — controls are being treated as cleanup. Define refusal, escalation, and source-of-truth rules now.
The prompt writer also wrote the tests — the eval is likely overfit to the happy path. Add messy cases from real usage, edge conditions, policy boundaries, and expert review.

The point is not to punish the team for moving quickly. It is to find the root cause while the cost of fixing it is still low.

Build the test before the prompt

The fix is not a giant governance program. It is a different order of operations.

Start with the test.

For any AI feature, collect 100 to 300 real or realistic inputs from the hardest part of the workflow. Include short requests, long requests, ambiguous requests, boundary cases, policy conflicts, and cases where the right answer is “I do not know.”
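A minimal sketch of what that collection can look like as data, with a check that flags missing coverage before anyone touches the prompt. The category names mirror the list above; the class, field, and function names are assumptions made for illustration, not a standard schema.

```python
from collections import Counter
from dataclasses import dataclass

# Categories taken from the paragraph above; the identifiers are illustrative.
REQUIRED_CATEGORIES = {
    "short", "long", "ambiguous", "boundary", "policy_conflict", "unknown_answer",
}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str
    user_input: str


def coverage_gaps(cases: list[EvalCase], min_total: int = 100) -> list[str]:
    """Return readable problems with the eval set; an empty list means none found."""
    problems = []
    if len(cases) < min_total:
        problems.append(f"only {len(cases)} cases, want at least {min_total}")
    counts = Counter(case.category for case in cases)
    for category in sorted(REQUIRED_CATEGORIES - set(counts)):
        problems.append(f"no cases in category '{category}'")
    return problems


sample = [
    EvalCase("c1", "short", "refund?"),
    EvalCase("c2", "policy_conflict", "Your site says 30 days but the email says 14. Which is it?"),
]
for problem in coverage_gaps(sample):
    print(problem)
```

The check is deliberately dumb. It cannot tell you whether the cases are hard enough, only whether a whole class of mess is absent.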

Then write the rubric.

What counts as correct? What counts as unsafe? When should the model refuse? When should it escalate to a human? What evidence should it cite? What must it never invent?
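Those questions can be written down as data instead of left as opinion. The sketch below is one possible shape, assuming each eval case carries an expected action, an optional required citation, and a list of claims the system must never make. Every name in it is hypothetical, and the substring match for forbidden claims is a crude stand-in for expert or model-assisted grading.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ANSWER = "answer"
    REFUSE = "refuse"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Expectation:
    """What the rubric says should happen for one eval case."""
    action: Action
    required_source: Optional[str] = None     # evidence the answer must cite
    forbidden_claims: tuple[str, ...] = ()     # statements the system must never make


@dataclass(frozen=True)
class Output:
    """What the system actually did."""
    action: Action
    text: str
    cited_sources: tuple[str, ...] = ()


def grade(expected: Expectation, actual: Output) -> list[str]:
    """Return rubric violations; an empty list is a pass."""
    violations = []
    if actual.action is not expected.action:
        violations.append(f"expected {expected.action.value}, got {actual.action.value}")
    if expected.required_source and expected.required_source not in actual.cited_sources:
        violations.append(f"missing citation: {expected.required_source}")
    lowered = actual.text.lower()
    for claim in expected.forbidden_claims:
        if claim.lower() in lowered:
            violations.append(f"forbidden claim present: {claim!r}")
    return violations


# Illustrative case: a fare question that touches money should go to a person.
expected = Expectation(
    action=Action.ESCALATE,
    required_source="bereavement-policy",
    forbidden_claims=("apply after travel",),
)
actual = Output(Action.ANSWER, "Buy the ticket now and apply after travel for the reduced fare.")
for violation in grade(expected, actual):
    print(violation)
```

The value is less in the code than in the argument it forces: two reviewers who disagree about an answer now have to disagree about a specific field.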

Only then tune the prompt.

The prompt has to earn its way through the mess. Otherwise you are not improving the product. You are decorating the demo.

The brick house has gates, owners, and reruns

Once the root cause is clear, prevention is boring on purpose.

Brick-house AI rollouts put four controls in place before launch:

A golden set. The small, trusted collection of examples that defines what “good” looks like. It includes normal cases, edge cases, and failures you never want to see twice.
A named owner. Someone owns the eval set, the rubric, and the launch threshold. Without an owner, quality becomes a meeting mood.
Escalation rules. The system knows when to refuse, when to cite the source of truth, and when to send the user to a person.
Scheduled reruns. Models change. Prompts change. Retrieval changes. Product policy changes. The system that passed Tuesday may not pass next Tuesday.
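The golden set and the reruns fit together in a few lines. This is a sketch under stated assumptions: the system is any function from input text to output text, each golden case carries its own pass/fail check, and the threshold is whatever number the named owner has set. None of these names come from a particular eval library.

```python
from typing import Callable

# A golden case pairs an input with a check that returns True for an acceptable output.
GoldenCase = tuple[str, Callable[[str], bool]]


def pass_rate(system: Callable[[str], str], golden_set: list[GoldenCase]) -> float:
    """Run every golden case through the system and return the fraction that pass."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(1 for user_input, check in golden_set if check(system(user_input)))
    return passed / len(golden_set)


def rerun_verdict(current: float, baseline: float, threshold: float) -> str:
    """Compare a rerun against the launch threshold and the last accepted run."""
    if current < threshold:
        return "block: below launch threshold"
    if current < baseline:
        return "review: regression against last accepted run"
    return "ok"


def stub_system(user_input: str) -> str:
    """Stand-in for the real feature; always hands off to a person."""
    return "Let me connect you with an agent."


golden = [
    ("Can I get the discount after I fly?", lambda out: "agent" in out.lower()),
    ("What are your opening hours?", lambda out: "9" in out),
]
rate = pass_rate(stub_system, golden)
print(rate, rerun_verdict(rate, baseline=0.9, threshold=0.8))
```

Run it on a schedule and on every change to the model, prompt, retrieval, or policy. The baseline comparison catches the quiet case where the system still clears the threshold but has become worse than the version that was approved.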

What changes Monday

Pick the next AI feature waiting for launch approval.

Do one thing: replace the demo review with an eval review.

The launch decision should not be: “Did the examples look good?”

It should be:

Did the system pass the real cases, against the written rubric, with clear escalation rules and an owner for reruns?

If not, do not approve the launch.

Approve the demo, if you want.

Just do not mistake it for a brick house.

Sources

OpenAI — Evals: https://platform.openai.com/docs/guides/evals
OpenAI Cookbook — Evals design guide: https://cookbook.openai.com/examples/evaluation/evals_design_guide
NIST — AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
Deeth Williams Wall — BC Tribunal Finds Air Canada Liable: https://www.dww.com/articles/bc-tribunal-finds-air-canada-liable-for-inaccurate-advice-given-by-website-chatbot
American Bar Association — BC Tribunal Confirms Companies Remain Liable: https://www.americanbar.org/groups/business_law/resources/business-law-today/2024-february/bc-tribunal-confirms-companies-remain-liable-information-provided-ai-chatbot/

Product Theatre
