Can AI Make Good Decisions About Itself (Part 2)?

A Vision for a New Kind of Shared Intelligence

In Part 1 of this blog series, we explored how to design an automated system capable of evaluating LLMs with rigor and scale. But along the way, a deeper realization emerged: Every technical challenge of AI evaluation is also a human one. Our pursuit of objectivity exposed the biases embedded not just in data or algorithms but in ourselves. In Part 2, we turn the lens inward, investigating how human subjectivity and perception influence the very standards we use to measure truth and quality.

No current LLM, regardless of sophistication, can fully supplant human judgment and accountability, especially in high-stakes domains. Trained on human-generated data, the models inherit our wisdom and our biases, yet they lack true understanding of context, ethical nuance and the lived experiences that inform human judgment. Humans, meanwhile, for all our intuitive brilliance, are inconsistent, biased and limited in processing capacity.

These reciprocal imperfections led us to integrate human oversight at critical checkpoints of the evaluation pipeline:

  1. The LLM generates an initial evaluation with consistency and scale.
  2. A human expert reviews and refines with contextual understanding and ethical sensitivity.
  3. Both assessments are preserved, creating a continuous learning loop between human and machine.

This hybrid approach doesn’t seek perfection from either humans or AI, but instead generates rigor from their complementary strengths. At Loka, rigor doesn’t equal an impossible quest for perfection; instead we aim to achieve the highest possible quality within real-world constraints. We deliver work that is thoughtful, reliable and impactful—grounded in human judgment yet elevated by the scale and speed of AI. We know perfection is unattainable, but excellence is not. It’s the human version of perfection: deliberate, rigorous and always evolving.

The LLM provides consistency, tireless processing and the ability to hold vast information in context. Humans contribute intuition, passion, ethical reasoning and the ability to consider subtle contextual factors that allow us to get a better read on the situation.

In our system, the LLM handled assessments across predefined dimensions, while human users reviewed and revised its output based on their own insight. The original scores were preserved and shared with the monitoring team, enabling us to refine our prompts and surface edge cases over time. This collaboration offers more than just a better evaluation system. It also proposes a model for how AI and humanity might productively collaborate across domains: neither replacing the other, but each compensating for the other's fundamental limitations to create something truly exceptional.
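As a rough sketch of this workflow (in Python, with hypothetical names and scores; not our production code), the record for a single dimension might preserve both judgments side by side so disagreements can be surfaced to the monitoring team:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for a single evaluated response on one dimension.
@dataclass
class EvaluationRecord:
    dimension: str                         # e.g. "clarity"
    llm_score: float                       # initial score from the LLM judge
    llm_explanation: str                   # the judge's reasoning for its score
    human_score: Optional[float] = None    # expert's revised score, if any
    human_comment: Optional[str] = None    # expert's contextual notes

    @property
    def disagreement(self) -> Optional[float]:
        """Gap between machine and human judgment, used to surface edge cases."""
        if self.human_score is None:
            return None
        return abs(self.llm_score - self.human_score)

# Both assessments are kept even when the human overrides the LLM, so prompt
# refinements can be driven by the disagreements the monitoring team observes.
record = EvaluationRecord("clarity", llm_score=4.5, llm_explanation="Concise and well structured.")
record.human_score, record.human_comment = 3.0, "Misses domain-specific terminology."
print(record.disagreement)  # 1.5 -> worth flagging for review
```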

Decoding Judgment

It’s important to acknowledge that perfect consensus in evaluation is unattainable, even among human evaluators. Research findings from Zheng et al. (2023) demonstrate that GPT-4 achieves an 85% agreement rate with human evaluators, slightly exceeding the 81% agreement rate observed among humans themselves. More revealing still, even when human evaluators initially disagreed with GPT-4's assessment, they nevertheless considered the AI's judgment reasonable in 75% of cases and were willing to revise their original evaluation in 34% of instances. To put these numbers in perspective: Out of 100 evaluated questions, human evaluators would agree with GPT-4 on about 85. In the remaining 15, humans would still see GPT-4’s answer as reasonable in roughly 12 cases, and in five of those cases, they were even willing to change their answer, leaving only about three instances of complete disagreement. The fact that GPT-4 showed a slightly higher agreement with humans than humans among themselves highlights an important truth about evaluation: It’s inherently subjective, whether performed by humans or machines.
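For readers who like to check the arithmetic, the breakdown above follows directly from the reported rates; the snippet below is a back-of-the-envelope illustration, not part of the study:

```python
# Rough breakdown of 100 evaluated questions from the rates reported in
# Zheng et al. (2023). All figures are approximate.
total = 100
agree = total * 0.85              # ~85 questions where human and GPT-4 agree
disagree = total - agree          # ~15 questions of initial disagreement
reasonable = disagree * 0.75      # ~11-12 still judged reasonable by the human
revised = disagree * 0.34         # ~5 where the human changed their answer
outright = disagree - reasonable  # ~3-4 cases of genuine disagreement
print(agree, disagree, reasonable, revised, outright)  # 85.0 15.0 11.25 5.1 3.75
```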

We built the evaluation output around three core pillars: 

  1. Evaluation Results
  2. Explanation
  3. Feedback

Each pillar plays a critical role in transforming raw assessment into actionable insight.
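For illustration, a single evaluator response organized around these pillars might look something like the following; the dimension names and wording are hypothetical, not our actual rubric:

```python
# Illustrative shape of a single evaluator response built around the three
# pillars. Dimension names and wording are hypothetical examples.
evaluation_output = {
    "results": {"accuracy": 3, "completeness": 4, "clarity": 4},  # pillar 1: scores per dimension
    "explanation": (                                              # pillar 2: reasoning behind the scores
        "The answer cites the correct regulation but omits a recent amendment, "
        "so accuracy is capped at 3."
    ),
    "feedback": (                                                 # pillar 3: actionable suggestions
        "Add the missing amendment reference and tighten the second paragraph, "
        "which repeats the introduction."
    ),
}
```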

In our early experiments, looking solely at evaluation results would’ve led us to a premature conclusion: Our system appeared flawless, consistently reaching near-maximum scores across all dimensions. This illusion of perfection masked a deeper truth: The evaluator simply wasn't being "judgy" enough. The experience highlighted why comprehensive explanations are essential: they expose the reasoning behind judgments, revealing biases and limitations that scores alone conceal. With this insight, we refined our approach. When we made the evaluator more stringent, scores went down, but the feedback became specific and actionable, and the explanations rigorously justified each score.
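To give a flavor of what "more stringent" means in practice, the contrast below is a hypothetical illustration of the direction of the change, not our production wording:

```python
# Hypothetical before/after showing how a judge prompt can be tightened; real
# judge prompts are longer and tuned to specific evaluation dimensions.
LENIENT_RUBRIC = "Rate the answer from 1 to 5 on accuracy, completeness and clarity."

STRICT_RUBRIC = (
    "Rate the answer from 1 to 5 on accuracy, completeness and clarity. "
    "Reserve a 5 for answers with no identifiable flaw. For every score below 5, "
    "quote the specific passage that loses points and explain what a stronger "
    "answer would do differently."
)
```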

Yet as with all things human, we discovered the importance of recognizing when to stop. Calibrating an evaluation system involves a delicate balance, comparable to finding the sweet spot between underfitting and overfitting in machine learning. An overly judgmental evaluator becomes rigid and unforgiving, penalizing as flaws the nuances that human eyes would recognize as correct.

Equally essential is resilience—not just in the evaluator itself, but in the broader evaluation system it belongs to. In our context, resilience means the ability to tolerate occasional errors, adapt to edge cases and continue delivering useful results even when individual judgments are imperfect. Striking this balance, being both critical and flexible, is what ultimately makes an evaluation process robust and trustworthy.

The Challenges: Mirrors of Human Imperfection

Crafting an automated evaluator that approaches human judgment isn't without profound challenges. We encountered—and confronted—several fundamental biases and limitations along the way.

  • Position Bias: LLMs, like humans, can be subtly influenced by sequence. We both tend to favor options presented earlier, a cognitive quirk that shapes our perception in ways we rarely notice; a simple mitigation is sketched just after this list.
  • Verbosity Bias: Longer isn't always better, yet both humans and LLMs have a natural tendency to associate quantity with quality.
  • Self-Enhancement Bias: One of the most fascinating challenges was self-enhancement bias, the tendency of LLMs to favor outputs generated by themselves. Studies (Zheng et al., 2023) show up to 25% higher win rates when models evaluate their own outputs versus human-generated content. This mirrors our own human tendency to prefer ideas that align with our existing patterns of thought.
  • Limited Mathematical and Reasoning Capabilities: LLMs demonstrate weaknesses in tasks requiring precise calculation or complex logical reasoning, limitations that remind us of the distinction between statistical pattern recognition and true understanding.
  • Domain Knowledge Gaps: LLMs are generalists in a world that more often than not favors specialists. Their assessments in highly specialized fields can miss critical nuances that experts would immediately recognize. This limitation highlights the fundamental value of human expertise—the deep, contextual understanding that comes from years of focused experience.
  • Diversity Bias: Perhaps most concerning are the judgment shifts based on identity-related markers. In healthcare contexts, studies (Zack et al., 2024) found GPT-4 was 9% less likely to recommend advanced imaging for Black patients than for white patients with identical symptoms, while women were 8% less likely than men with the same cardiac risk factors to receive stress test recommendations. These biases reflect not inherent flaws in AI but the historical biases embedded in human-generated training data, a sobering reminder of how our own imperfect judgments become encoded in our technological creations.
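As one example of how these biases can be handled mechanically, position bias in pairwise comparisons is often mitigated by querying the judge twice with the candidates swapped and accepting a verdict only when both orderings agree, a technique described in Zheng et al. (2023). A minimal sketch, assuming a hypothetical judge callable that returns "A" or "B":

```python
from typing import Callable, Optional

def position_debiased_verdict(
    judge: Callable[[str, str, str], str],  # hypothetical judge(question, first, second) -> "A" or "B"
    question: str,
    answer_1: str,
    answer_2: str,
) -> Optional[str]:
    """Query the judge twice with the candidates swapped; accept a winner only
    when both orderings agree, otherwise report no reliable preference."""
    first = judge(question, answer_1, answer_2)   # answer_1 shown in position "A"
    second = judge(question, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None  # inconsistent verdicts -> treat as a tie or escalate to a human
```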

These challenges reveal something essential about both AI and human judgment: neither is perfect and both work best when combined. The biases in our AI systems mirror our own cognitive limitations, while human judgment, though rich in context and experience, is inconsistent, unscalable and shaped by unconscious biases. This recognition brings us back to the central theme of our approach: The most robust evaluation emerges from the partnership between human and machine, each compensating for the other's inherent limitations, as we saw in our evaluator workflow, where model output was refined and iteratively improved through human oversight.

We envision a partnership where technology enhances our uniquely human abilities, while we humans guide and shape AI's vast processing power. This isn’t just a practical fix for evaluation challenges; it’s a vision for a new kind of shared intelligence, one that values human insight while embracing the improvement potential of technology, each filling in the gaps in the other's capabilities and creating together what neither could achieve alone.

Build, evaluate, refine, repeat—with enthusiasm for progress, humility about limitations, and hunger for deeper understanding. Maybe perfection will remain out of reach. But excellence lies in refining the questions—and in the breakthroughs that follow.
