Turns out 'are you sure about that?' is a professional skill.
The Human in the Loop
Can you outsmart a robot?
Sometimes the best way to understand what AI can do is to watch it fail, carefully.
Crafting creative, complex prompts that can catch significant errors in advanced AI models is a specific skill. So is knowing when the tool grading the accuracy of your findings has failed, too.
In this project, I designed a prompt specifically built to stress-test three frontier models. The case study covers the full arc: designing a prompt with a 5,000+ token context, running an evaluation that exposed hallucinated quotes, fabricated user personas, and a precision problem hiding inside a polished response, and then mounting a successful, evidence-based dispute against an automated validator that called my findings "demonstrably false" and basically told me that I couldn't read. Aggressively so.
The robots were wrong. I have receipts.
The most polished AI response in this evaluation contained the most errors. The smartest models were the ones that admitted what they didn't know.
What this case study actually reveals: fancy formatting isn't accuracy, confidence isn't correctness, and human oversight isn't optional. It's the whole point.
Case Study: Can AI Read the Room?
TESTING CULTURAL NUANCE AND EMPATHETIC REASONING
Project Goals & Constraints
The objective of this evaluation was to design a highly complex, single-turn prompt capable of eliciting significant failures in at least one of three specific areas: instruction following, factual truthfulness, and logical reasoning.
The parameters were:
a side-by-side comparison evaluating two highly capable "smart" models against one weaker baseline model
the task had to be grounded in the journalism domain
all source material had to be highly relevant and published within the last six months
a massive context window of at least 5,000 tokens was required; because a single news article rarely hits this length, the challenge meant creatively curating and seamlessly integrating multiple distinct articles into one cohesive scenario
the prompt had to be complex enough to force the models into logical traps and hallucinations in a single turn, without triggering standard safety or refusal mechanisms
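To make that token requirement concrete, here is a minimal sketch of the kind of check that confirms a curated, multi-article context clears the 5,000-token floor before the prompt is assembled. It assumes tiktoken's cl100k_base encoding as a stand-in for the models' actual tokenizers, and the article file names are placeholders rather than the real sources.

```python
# Minimal sketch: verify the curated, multi-article context meets the
# 5,000-token minimum. Assumes tiktoken's cl100k_base encoding as a
# stand-in for the target models' tokenizers; file names are placeholders.
from pathlib import Path

import tiktoken

ARTICLE_FILES = ["article_1.txt", "article_2.txt", "article_3.txt"]  # hypothetical paths
MIN_TOKENS = 5_000

encoding = tiktoken.get_encoding("cl100k_base")

# Stitch the separate articles into one cohesive context block.
context = "\n\n---\n\n".join(Path(f).read_text() for f in ARTICLE_FILES)
token_count = len(encoding.encode(context))

print(f"Curated context: {token_count} tokens across {len(ARTICLE_FILES)} articles")
if token_count < MIN_TOKENS:
    raise ValueError(f"Context falls short of the {MIN_TOKENS}-token requirement; add more source material.")
```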
The 'Why' Behind the Prompt
I based this evaluation on real-world scenarios. Lately, I've heard some passing, negative comments about AI. While my personal experiences with AI have been overwhelmingly positive, a strong evaluator needs to be completely objective. I curated several recent, localized articles to try and understand the differing viewpoints and current cultural anxieties. My goal wasn't just to break the models. I wanted to see if they could synthesize multiple complex sources to genuinely understand opposing viewpoints, and provide me with realistic, empathetic ways to respond to the people around me.
Prompt Architecture
The prompt was designed to test the model's ability to handle complex constraints and logical progression within that 5,000-token journalism context:
Context Grounding: The model was fed specific, real-world articles to prevent generalized hallucinations and force regional accuracy regarding the cultural climate in British Columbia.
Multi-Step Reasoning: The model had to execute a logical sequence. It needed to first assess the climate, compare it to the proposed solutions in Canada's national AI strategy, and then identify the specific gaps.
Persona and Constraints: The model was assigned a specific role as an industry professional. It was given strict tonal guardrails to ensure the output provided at least three empathetic, fact-based suggestions for a conversation with concerned friends and family.
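For illustration, here is a simplified sketch of how those three layers fit together into a single-turn prompt. The section labels, wording, and helper function are reconstructions meant to show the shape of the design, not the exact prompt used in the evaluation.

```python
# Simplified sketch of the three-layer prompt structure described above:
# context grounding, multi-step reasoning, and persona/tonal constraints.
# Wording and names are illustrative, not the exact prompt used.

def build_prompt(articles: list[str]) -> str:
    context_block = "\n\n---\n\n".join(articles)  # curated, BC-focused source articles
    return f"""You are an AI industry professional living in British Columbia.

SOURCE MATERIAL (use only these articles; do not introduce outside facts):
{context_block}

TASK (complete every step, in order):
1. Assess the current cultural climate around AI in British Columbia as described above.
2. Compare that climate to the proposed solutions in Canada's national AI strategy.
3. Identify the specific gaps between the two.
4. Provide at least three empathetic, fact-based suggestions I can use in a
   conversation with concerned friends and family.

TONE: conversational and validating, grounded in the source material; no lectures or essays."""
```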
Evaluation of Model Outputs
While all three models successfully formatted their responses, a careful evaluation revealed critical failures in reasoning, factuality, and instruction following.
Where Model A Failed:
Toxic positivity is not empathy: Instead of validating real stress, like the severe HR burnout mentioned in the text, the model forced a falsely optimistic narrative. Dismissing real anxiety with fake optimism totally fails the prompt's empathy requirement.
It fell for PR spin: It regurgitated the government's political rhetoric about power caps being a proactive environmental safeguard. It completely ignored the article's core fact that the province is simply running out of electricity.
It missed the big picture: It presented a $2 billion government fund as a major comfort. It completely missed the article's context that $2 billion is a drop in the bucket compared to what established tech giants spend continuously.
Where Model B Failed:
It ignored the audience constraint: The prompt explicitly asked for suggested responses to share with friends and family. Instead of an appropriate conversational tone, the model provided lengthy, academic essays.
The government saviour complex: It offered false platitudes by presenting government-run AI as the ultimate, transparent solution to everyone's anxieties. This completely ignored the current cultural climate and the reality of widespread public distrust regarding government oversight.
Where Model C Failed:
Misattribution of agency (Scapegoating the tool): It made the highly inappropriate and unfounded claim that AI is "causing real harm." It ignored the fact that in the Tumbler Ridge tragedy, the AI actually worked by flagging the threat while human executives made the decision to hide the data. Furthermore, blaming a tool for a user's poor financial choices, like stopping car payments despite the tool's disclaimers, is a massive reasoning failure.
Factuality error (The "new jobs" hallucination): It confidently claimed the articles show AI is "creating new roles." This is absolutely false. The texts explicitly describe a severe workload burden placed on existing workers who now have to meticulously verify AI-generated content. Nowhere does the text state new jobs are being created because of it.
Failed the empathy constraint (Toxic positivity): Telling anxious loved ones that AI is forcing us to become "more creative, and more human" is deeply unempathetic and as fake as a three-dollar bill. It invalidates the very real stress of professionals with forced, artificial optimism.
Reasoning failure (The $2 billion conflation & environmental blind spot): It fused the authors' theoretical idea of a public AI with the reality of the $2 billion fund, ignoring the text's warning that the fund is vastly insufficient. It also advocated reinventing the wheel by building massive new AI infrastructure, while completely failing to acknowledge the severe environmental and power rationing crisis detailed in the provided text.
What a Winning Response Should Have Looked Like
The response I was looking for would have been honest about what’s hard, and genuinely optimistic about what’s going right.
Accurately capture the BC cultural climate: Address real-world anxieties head-on, including the Tumbler Ridge murders (where safety concerns flagged by the AI were ignored, and the tragedy might have been prevented by better human judgement), job security fears, and the heavy strain on local environmental resources.
Highlight the national strategy's blind spots: Point out the logical disconnect of pushing for massive AI infrastructure without actually solving the power limitations.
Call out the human cost: Note that the current strategy completely fails to address the professional burnout caused by the overwhelming workload of verifying AI-generated content.
Deliver truly empathetic, conversational reassurances: Validate the audience's concerns first, then offer sincere, grounded optimism based on actual expertise.
Example: "Yeah, it's frustrating to see AI make some jobs harder than easier right now, but I have seen firsthand the massive efforts that are being put into training AI. The safety, accuracy, and helpfulness are getting better every day. If anything, this messy stage is proving that human expertise is shining brighter and becoming more valuable than ever."
The Verdict
A model can produce highly structured, articulate text while completely failing the core logic and safety requirements of a prompt. Without rigorous human evaluation, models default to sycophancy, toxic positivity, and the misattribution of agency - picking up on the emotional tone of a conversation and reflecting it back, rather than engaging critically with what the source material actually says. This case study demonstrates that true prompt evaluation requires looking past the formatting to test for accuracy, contextual awareness, and appropriate emotional intelligence.
The most valuable thing a human evaluator brings to this process isn’t just checking for errors. It’s asking whether the model actually understood what it was being asked to do, for whom, and in what context. A response that sounds confident and empathetic isn’t the same as one that is accurate and empathetic. The difference matters.
These models could format their way through the prompt. None of them could fully read the room.