The Strange Brew of AI: Using Humans to Reverse-Sear Intelligence

Content Is King Again — But This Time, So Are the Editors

Jul 26, 2025

Carter Adamson

#data

Remember when Netflix felt like a cheat code? One app. Every show. No commercials. No 12-second trailers starting just because your mouse accidentally flirted with the thumbnail. Then came Hulu, Prime Video, HBO Max, Tubi, Pluto, Shudder, Peacock (??), and now your TV is basically a UX multiple homicide.

Could we be witnessing the beginnings of an AI equivalent?

Today, we’re channel-surfing between OpenAI, Claude, Gemini, Mistral, Meta, and Groq, each model with its own vibe, strengths, personality, tone, and blind spots. Claude is the valedictorian coder who quotes Knuth. OpenAI can tell you how to reverse-sear a ribeye while explaining Hamlet. Gemini Deep Research gives you a 90-page peer-reviewed PDF and asks you to draw your own conclusions. Meta… Meta is open-sourcing its existential crisis. xAI (a.k.a. “Agentic Incel”) is still trying to DOGE questions about its political affiliations. And Siri, meanwhile, is still earnestly reciting Wikipedia summaries from 2016 and hoping for the best. Bless.

It’s a strange brew. The models are multiplying, the benchmarks are breaking, and users are still happy to channel surf (and pay the subscription prices for the privilege).

But eventually, novelty fades. Expectations rise. And when the stakes are high—health, lawsuits, capital decisions, industrial safety—users stop surfing. They start sticking. One or two models or services start to become embedded within key consumer and enterprise decision points, achieve vendor lock-in, and suddenly, like floorboards, become impossible to rip out.

As at the advent of the commercial Internet back in the ’90s, it certainly feels like “content is, once again, king” – or at least key grist for the mill of these large models. But perhaps this time, the editors of this “content” are too.

Foundational Models Are Becoming the Cable Bundles of AI

Foundational models are beginning to resemble the modern “cable bundles” of the AI era—everyone has access to the channels now. The real value is shifting to the guide: who surfaces the right answer, in the right moment, for the right domain. Whether this guide lives exclusively within the model, within the application layer, or both remains to be seen – e.g., OpenAI is both model and (primitive) application layer right now.

Model scale is no longer a differentiator on its own. It’s the editorial layer—context, feedback, domain-tuning—that makes a model useful, trustworthy, and retained.

Accuracy Benchmarks, Retention—and Why Reinforcement Learning (RL) Matters

Despite their general fluency, today's frontier models are far from universally reliable. And while they are improving (and even teaching each other) at speeds we’ve never seen before, there is little room for error in highly regulated sectors with life-or-death consequences.

  • Claude 3 Opus scores 86.8% on MMLU; GPT-4 Turbo sits at 88.7%—impressive, but still brittle under ambiguity or domain complexity.

  • On MedQA, a benchmark built from U.S. medical licensing exam questions, top models only recently crossed 90%, which still leaves roughly 1 in 10 high-stakes answers wrong.

What moves the needle? Reinforcement Learning from Human Feedback (RLHF).

Anthropic reported a 20–40% gain in factual accuracy after multi-round reinforcement on specialized tasks. OpenAI has emphasized that their GPT-4 Turbo product benefits from "continuous post-training optimization" based on RLHF loops.
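
What does that reinforcement step actually look like under the hood? Here’s a minimal, hypothetical sketch of its core ingredient, reward-model training on human preference pairs. The toy data, bag-of-words encoder, and tiny network are illustrative stand-ins, not how Anthropic or OpenAI actually do it:

```python
# Minimal sketch of the reward-modeling step at the heart of RLHF:
# human raters pick the better of two model outputs, and we fit a
# reward model with the pairwise (Bradley-Terry) loss:
#   loss = -log sigmoid(r_chosen - r_rejected)
# All data below is invented for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical preference pairs: (prompt, chosen_answer, rejected_answer)
pairs = [
    ("dosage for drug X",
     "Cite the label: 10mg daily; check renal function first.",
     "Probably 50mg? Not sure."),
    ("statute of limitations",
     "It varies by state; in NY it is three years for this claim.",
     "There is no time limit."),
]

vocab = {w: i for i, w in enumerate(
    sorted({w for p in pairs for text in p for w in text.lower().split()}))}

def encode(text: str) -> torch.Tensor:
    """Bag-of-words vector: a stand-in for a real LLM embedding."""
    v = torch.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

reward_model = nn.Sequential(nn.Linear(len(vocab), 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for epoch in range(50):
    losses = []
    for prompt, chosen, rejected in pairs:
        r_chosen = reward_model(encode(prompt + " " + chosen))
        r_rejected = reward_model(encode(prompt + " " + rejected))
        # The rater said `chosen` beats `rejected`; push their scores apart.
        losses.append(-F.logsigmoid(r_chosen - r_rejected).mean())
    opt.zero_grad()
    torch.stack(losses).mean().backward()
    opt.step()

# In production, this trained reward model scores candidate outputs during
# policy optimization (e.g., PPO), closing the human-feedback loop.
```

The human preferences are the scarce, expensive ingredient here, which is exactly why domain-expert feedback is becoming a business in its own right.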

Yet accuracy alone isn’t stickiness. ROI, retention, and product re-use are becoming even more telling:

  • In enterprise pilots, Glean’s search product sees >80% weekly active usage post-deployment.

  • OpenEvidence reports 50% faster diagnostic pathing compared to traditional clinical decision tools, driving repeat use among physicians.

  • PolyAI call agents show 35–50% CSAT uplift, which helps secure long-term contracts with service ops and BPOs.

According to Scale AI’s recent enterprise LLM report, more than 70% of companies experimenting with AI adoption cite RLHF as their key differentiator for fine-tuning model utility.

Retention follows relevance. Relevance follows reinforcement.

Is RLHF the New SEO?

(Or, at least a key component?)

It’s how you earn trust. Rank relevance. And become the default.

Crowdworker feedback doesn’t cut it anymore. Especially not in law, medicine, finance, or safety-critical workflows. You need domain-specific editors and reinforcement from real users in context.

And just like growth teams test landing pages and flows, models now need to A/B test their own prompts and outputs.

  • What earns trust in diagnostics?

  • What format accelerates litigation prep?

  • What structure increases usage in field ops?

This is cognitive UX. And it’s what will separate vertical leaders from horizontal noise.
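
Concretely, that testing borrows straight from the growth-team playbook. A toy sketch, with hypothetical prompt variants and a simulated “did the user accept the answer?” signal standing in for real telemetry:

```python
# Minimal sketch of A/B testing two prompt variants on a trust signal.
# Variant wording, acceptance rates, and routing are all hypothetical.
import random
from math import sqrt

PROMPTS = {
    "A": "Answer concisely. Cite the controlling source.",
    "B": "Walk through your reasoning step by step, then answer.",
}

def run_session(variant: str) -> bool:
    """Stand-in for a real session: did the user accept the output?
    Here we simulate variant B earning trust slightly more often."""
    return random.random() < (0.62 if variant == "B" else 0.55)

results = {"A": [], "B": []}
for _ in range(2000):
    variant = random.choice(list(PROMPTS))  # random assignment per session
    results[variant].append(run_session(variant))

p_a = sum(results["A"]) / len(results["A"])
p_b = sum(results["B"]) / len(results["B"])
n_a, n_b = len(results["A"]), len(results["B"])

# Two-proportion z-test: is B's acceptance rate reliably higher than A's?
p_pool = (sum(results["A"]) + sum(results["B"])) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  (|z| > 1.96 ~ significant)")
```

Swap the simulated signal for real in-product events (acceptance, edits, escalations) and the same loop becomes the reinforcement engine.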

The Editorial Infrastructure Stack

To reinforce trust and improve outputs, you need tools—not just talent. Here's who's powering that layer:

  • SurgeAI – Human feedback with domain fluency. Lawyers, clinicians, policymakers—not gig workers. Recently surpassed 1 million expert completions and has supported more than 500 live model deployments.

  • Labelbox – End-to-end data engine. Supporting 10 of the Fortune 50 for labeling, feedback iteration, QA, analytics—everything you need to tune in production.

  • Datasaur – Structured annotation optimized for team workflows. Backed by Y Combinator and used across dozens of applied vertical AI deployments.

  • OpenPipe, Humanloop, Contextual.ai – Managing prompt orchestration, multi-turn flows, feedback memory, and evals for some of the most accurate vertical systems in market.

  • StabilityAI – Open-sourcing RLHF tooling used in academic and commercial labs alike.

  • Arize – AI observability platform helping teams monitor and improve model performance in production, including drift detection and user feedback analysis.

  • Comet – Experiment tracking and model management infrastructure used by ML teams to organize, evaluate, and reproduce results at scale.

  • Patronus AI – Automates evaluation of model performance across tasks like factuality, hallucination, and reasoning. Helps teams debug and ship reliable GenAI products.

Together, they make up some of the editorial backend of the modern AI stack—turning raw models into usable and more accurate systems.
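
To make that “editorial backend” concrete, here’s a toy pre-deployment eval gate in the spirit of the tools above. The golden set, string-match grader, and model stub are all hypothetical simplifications:

```python
# A minimal sketch of an eval gate: run a model's answers against a
# small golden set and flag regressions before shipping. Real stacks
# would use an eval/observability platform for this instead.
GOLDEN_SET = [
    {"prompt": "Max daily dose of drug X?", "must_contain": "10mg"},
    {"prompt": "Filing deadline for form Y?", "must_contain": "30 days"},
]

def model(prompt: str) -> str:
    """Stand-in for the model under test."""
    return {
        "Max daily dose of drug X?": "The label says 10mg per day.",
        "Filing deadline for form Y?": "Within 60 days.",
    }[prompt]

failures = [case for case in GOLDEN_SET
            if case["must_contain"] not in model(case["prompt"])]
pass_rate = 1 - len(failures) / len(GOLDEN_SET)
print(f"pass rate: {pass_rate:.0%}; failing prompts: "
      f"{[c['prompt'] for c in failures]}")
# Gate deployment on pass_rate so feedback loops can't silently regress.
```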

Proprietary Data ≠ Moat. Reinforced Interpretation Is.

Everyone has the docs now. What matters is what you do with them.

The moat isn’t ingestion—it’s interpretation. And reinforcement is the wedge.

  • OpenEvidence is learning how physicians think and decide in very specific fields with very specific types of patients and treatments

  • Harvey understands the nuance and tempo of legal workflows, especially in litigation-heavy practices

  • Glean adapts to the language and logic of your org

Who Might Win on Proprietary Data?

Some categories are already seeing early traction:

  • Anonymized customer support logs – Parloa and PolyAI train on millions of real call transcripts to improve latency and resolution.

  • Health records – Global market projected at $39B by 2025, with decision support as the fastest-growing segment. Clinical notes, outcomes, and imaging data power OpenEvidence, Hippocratic, MedLM.

  • Legal filings – Harvey’s training included hundreds of thousands of annotated court records. Legal tech market projected at $37B by 2030.

  • Financial & market data – Canoe, Composer, and Nasdaq draw on 30+ petabytes of proprietary data. The market for financial data services is expected to reach $42B+ by 2031.

  • Maps and location data – A $53B market by 2030. Still underutilized, yet central to agentic reasoning and autonomy.

  • Music and audio metadata – Generative music tools leverage 10M+ tagged tracks. The broader AI music market expected to surpass $3.5B by 2028.

According to McKinsey, proprietary data can improve domain-specific AI performance by as much as 35%. But the business model remains fuzzy. Will this data be licensed, hoarded, shared, or simply scraped? With the Trump-era bill defining such training as “fair use” having just passed (for now), we’re in a gray zone.

At the Head of the Pack

Near-term:
  • OpenEvidence – 8.5M+ clinical consults/month. Partnered with UCSF, NYU Langone, and growing across 200+ hospital systems.

  • Harvey – Used by Latham, Allen & Overy, and a reported 45% of AmLaw 100 firms. Most traction in litigation, less so in contracts.

  • Glean – $100M+ ARR, 90%+ user satisfaction rate, and average 11h/month saved per user.

  • SurgeAI, Labelbox, Datasaur – The hidden force behind high-performing models. Surge alone supports dozens of vertical AI builders and is currently at $1B ARR.

  • Prewave, Cradle, Parloa, Canoe, Roshi, SpecterOps – Real traction. Real users. Real outcomes. Real revenue. Fast iteration and defensible UX wins.

Longer-term:

Those who collapse human feedback into the product loop—not the labeling backlog. Reinforcement can’t be a lagging task farm. It has to be an ambient part of the UX.

  • Example: Glean doesn’t ask you to label documents. It watches how you search, click, revise, and shares that signal across the org. That feedback improves relevance in real time—without a human in a separate loop.

  • Example: Harvey’s legal outputs are refined live by attorneys who correct and reframe outputs as they draft, not in some sandboxed feedback portal.
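
In code, that ambient loop might look something like the hypothetical sketch below. The event names, reward weights, and threshold are invented for illustration, not any vendor’s real schema:

```python
# Hypothetical sketch of "ambient" feedback: instead of asking users to
# label anything, instrument the product so ordinary actions (click,
# copy, revise, abandon) become reinforcement signal.
from dataclasses import dataclass, field
from collections import defaultdict

# How strongly each in-product action implies the output was good or bad.
IMPLICIT_REWARD = {"clicked": 0.3, "copied": 0.6, "kept_draft": 1.0,
                   "revised_heavily": -0.5, "abandoned": -1.0}

@dataclass
class FeedbackLog:
    """Accumulates per-output reward from ordinary user behavior."""
    scores: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, output_id: str, action: str) -> None:
        self.scores[output_id] += IMPLICIT_REWARD.get(action, 0.0)

    def training_batch(self, threshold: float = 0.5):
        """Outputs whose implicit score is decisive enough to train on."""
        return [(oid, s > 0) for oid, s in self.scores.items()
                if abs(s) >= threshold]

log = FeedbackLog()
log.record("draft-17", "copied")           # attorney pasted it into the brief
log.record("draft-17", "kept_draft")
log.record("draft-18", "revised_heavily")  # rewrote most of it
log.record("draft-18", "abandoned")
print(log.training_batch())  # [('draft-17', True), ('draft-18', False)]
```

No separate labeling queue, no sandboxed portal: the product itself is the annotation interface.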

Those who give up on “model quality” as the KPI and instead optimize for decision quality. Nobody cares if your model gets 89.3% vs. 90.1% on MMLU if it doesn’t make better decisions in the field.

  • Example: OpenEvidence doesn’t try to outscore GPT-4 on trivia. It helps doctors make faster, safer clinical calls—and reports a 50% reduction in diagnostic time. That’s the metric that matters.

  • Example: Prewave isn’t building a general language model. It’s reducing industrial downtime by predicting supply chain disruptions with multilingual inputs.

Those who act like platforms but think like editors. You win not by aggregating more input—but by shaping sharper output. Editorial taste becomes infra.

  • Example: SurgeAI doesn’t just label data. It curates expert reviewers (lawyers, clinicians, policymakers) who reinforce outputs with domain nuance—and helps clients build full-stack feedback ops.

  • Example: PolyAI isn’t selling a “call center GPT.” It trains its models on actual calls across logistics, telecom, and energy, making it sound less like an LLM and more like a high-performing rep.

Final Thought

We’re past “Wow, the model wrote a sonnet.”

Now it’s:
  • Did it help me?

  • Was it right?

  • Will I trust it again tomorrow?

In that world:
  • Models are the medium

  • Content is the differentiator

  • Editors are the wedge

Because yes—content is king again. But this time, content editors are too.