The training data heist nobody priced in
Here is the thing about AI that gets lost in every keynote, every Jensen Huang leather-jacket moment, every Sam Altman tour of the Gulf states: the model is not the product. The chip is not the product. The data is the product. A model is a function of the data it was trained on. Garbage in, garbage out. Brilliance in, GPT out.
For a decade, the data was free. Wikipedia, Reddit, the open web, Common Crawl, Books3 (which turned out to be pirated, but that's another story). OpenAI, Anthropic, Google, Meta: every frontier lab built its first generation of models on scraped data it did not pay for and, in many cases, was not licensed to use. The New York Times is still suing. Getty Images is still suing. Authors are still suing. The free buffet is being shut down one lawsuit at a time.
Which brings us to the constraint nobody at NVIDIA's GTC wants to talk about. The frontier models have eaten the internet. GPT-5, Claude 4, Gemini Ultra and Grok 4 have all been trained on something close to the entire publicly available text corpus of human civilisation. There is very little internet left to scrape. The next leg of AI capability does not come from more web data. It comes from data that does not yet exist in a usable form.
That data falls into three buckets. First, proprietary enterprise data sitting inside Fortune 500 companies (medical records, legal contracts, engineering schematics, customer support logs) that has never been digitised in a model-ready format. Second, synthetic data generated by other models, which sounds elegant until you realise it leads to model collapse if you do not curate it ruthlessly. Third, and most valuable, expert-labelled data created by humans who actually know what they are doing. PhD-level chemistry annotations. Doctor-reviewed clinical reasoning chains. Lawyer-tagged contract clauses. Software engineer feedback on code outputs.
This third bucket is where the real money is moving. Scale AI was selling this kind of work to Meta and OpenAI for hundreds of millions a quarter before Meta paid roughly $14 billion for a 49% stake in it in 2025. That single transaction told you everything about how the frontier labs view this problem. They will pay almost anything for high-quality, expert-labelled, defensibly licensed training data, because the alternative is a model that hallucinates in regulated industries and gets them sued into oblivion.
And here is the second-order play that the market has not woken up to. Scale AI is now effectively a Meta subsidiary, which means every other frontier lab (OpenAI, Anthropic, Google, xAI, the Chinese labs, the sovereign AI projects in the UAE and Saudi Arabia, the defence-grade work happening at Anduril and Palantir) needs an alternative. Needs one urgently. Needs one that is not owned by a direct competitor.
That is a market with roughly $15-20 billion of annual demand and one credible independent player at scale that is publicly listed, profitable, and has been doing this work for thirty-eight years, since long before anyone called it AI.
Most people have never heard of them. The ones who have, mostly remember them as a 1990s-era document-conversion outfit that turned dusty corporate archives into searchable text. That memory is wrong by about a decade. The company deliberately pivoted in 2018, retooled its entire delivery infrastructure around model training workflows, and by 2024 was supplying labelled data and model evaluation services to five of the seven largest generative AI companies in the world.
Revenue grew 96% year-over-year in fiscal 2024. Customer concentration risk is real. Margins are expanding. The stock has been volatile because the market keeps trying to value it like an outsourced services business when it is in fact one of the most strategically positioned independent data suppliers in the entire AI value chain.
The frontier labs have eaten the internet, and the next leg of AI capability belongs to whoever can manufacture the data that does not yet exist.