The attribution crisis: AI cites everyone except the source
Publishing has always depended on attribution. The byline, the masthead, the “as first reported by” credit — these are the currency of journalism. When a newsroom breaks a story, the expectation is that the original source gets cited. AI has shattered that expectation.
The problem is structural, not malicious. AI language models learn from the entire web corpus — billions of pages scraped across years. When The New York Times publishes an investigation and it gets syndicated through Apple News, reprinted on MSN, quoted in 200 blog posts, and discussed across Reddit threads, the AI “sees” the same information from dozens of sources. It has no reliable mechanism for identifying the original.
The Reuters Institute for the Study of Journalism at Oxford documented this in their 2024 Digital News Report: AI systems “struggle to distinguish original reporting from aggregation and commentary,” creating a landscape where the publication that invested in the journalism receives the least AI attribution (Reuters Institute, 2024). Pew Research Center found that only 26% of Americans who use AI chatbots trust them to provide accurate news information — yet usage continues to climb (Pew, 2024). The Columbia Journalism Review (CJR) described the situation as “a crisis of attribution at scale” (CJR, 2025).
This isn’t a future problem. It’s happening now. Ask ChatGPT about a breaking news event, and it will synthesize information from multiple sources into a single narrative paragraph. Ask Perplexity the same question, and it will cite 5–8 sources — but frequently those sources are aggregators, not the originating newsroom. The Tow Center for Digital Journalism at Columbia University found that AI-generated news summaries correctly attributed the original source less than 40% of the time across a sample of 500 recent news queries (Tow Center, 2025).
For publishers, this means the most expensive part of their business — original reporting — is the part AI values least. The economics of journalism are being inverted. To understand how AI attribution works across industries, see our guide on how brands show up in AI recommendations.
Who AI actually cites — and who it ignores
We tested AI citation patterns across ChatGPT, Perplexity, Gemini, Claude, and Grok using hundreds of news-related prompts. The results reveal a stark hierarchy:
| Publisher Tier | Examples | AI Citation Rate * | Attribution Accuracy ** |
|---|---|---|---|
| Wire services | Reuters, AP, AFP | High (70%+) | Low — syndicated copies cited 3x more than originals |
| Global prestige | NYT, Washington Post, The Guardian, BBC | High (65%+) | Moderate — but paywalled content gets hallucinated |
| Business / trade | Forbes, Bloomberg, Business Insider | Moderate (40–60%) | Mixed — contributor networks dilute brand authority |
| Magazine / long-form | The Atlantic, Condé Nast titles, The New Yorker | Moderate (30–50%) | Higher for analysis, lower for reporting |
| Digital-native | Vox, The Verge, Axios, Politico | Moderate (25–45%) | Stronger for explainer content than breaking news |
| Regional / local | Denver Post, Miami Herald, local TV stations | Low (5–15%) | Very low — national outlets cited for local stories |
| Independent / newsletter | Substack, Medium, indie blogs | Very low (<5%) | Rarely cited directly; ideas absorbed without attribution |
* AI citation rate = percentage of relevant queries where this publisher tier was explicitly named or linked. ** Attribution accuracy = whether the original source was correctly credited versus a syndicated copy or aggregator. Based on Metricus testing across 500+ queries (2026).
The pattern is clear: wire services get referenced frequently but almost never at their original URL. A Reuters exclusive will be cited as a Yahoo News article, an MSN reprint, or simply synthesized without any attribution. Prestige newspapers fare better on brand recognition but face the paywall problem — AI can’t read content behind a paywall, so it either hallucinates the content or relies on free summaries and social media excerpts.
Local and regional newspapers are virtually invisible to AI. When someone asks ChatGPT about a local news event, the response typically synthesizes information from national outlets that picked up the local story — not from the paper that broke it. The Nieman Journalism Lab documented this phenomenon extensively in their 2024 analysis of AI and local news (Nieman Lab, 2024).
The robots.txt paradox: 79% of publishers block AI crawlers
The publishing industry’s primary response to AI has been defensive: block the crawlers. And the numbers are striking.
Press Gazette’s ongoing tracking of robots.txt files across the world’s top 1,000 news websites found that by 2025, 79% of top news sites had blocked at least one AI training bot via robots.txt (Press Gazette, 2025). The breakdown reveals strategic differences:
- GPTBot (OpenAI): Blocked by approximately 48% of top news publishers — the most-blocked AI crawler.
- Google-Extended (Gemini training): Blocked by approximately 41%, though many publishers still allow Googlebot for search indexing.
- ClaudeBot (Anthropic): Blocked by approximately 32% of publishers.
- PerplexityBot: Blocked by approximately 28%, though Perplexity faced particular backlash from publishers after being accused of scraping content despite robots.txt blocks (Forbes, Condé Nast, 2024).
- CCBot (Common Crawl): Blocked by approximately 26%, affecting the open-source training datasets used by many AI systems.
The paradox: blocking AI crawlers protects your content from being used as training data, but it also makes your publication progressively less visible in AI responses. Your historical content still exists in older training snapshots, but new reporting — your competitive advantage — becomes invisible.
The New York Times illustrates this tension. The Times sued OpenAI and Microsoft in December 2023, alleging copyright infringement, and blocks GPTBot via robots.txt. Yet NYT content still appears in ChatGPT responses because older training data included Times articles. The result: AI references NYT reporting, sometimes inaccurately, while the Times has no mechanism to update or correct those references because current content is blocked. The American Press Institute noted that this creates a “worst of both worlds” scenario for publishers (API, 2024).
Condé Nast — publisher of Wired, Vogue, The New Yorker, Vanity Fair, and GQ — sent a cease-and-desist to Perplexity AI in 2024 after finding that Perplexity was surfacing Condé Nast content despite robots.txt blocks. Perplexity CEO Aravind Srinivas acknowledged the issue but defended the practice as “transformative use.” The incident highlighted a fundamental gap: robots.txt is a request, not enforcement. Not all AI companies honor it, and even those that do still have your older content in their models.
The strategic question: Is your publication better served by blocking AI crawlers (protecting paywall value, asserting copyright) or allowing them (maintaining AI visibility, ensuring accurate representation)? The answer depends on your revenue model. Subscription-first publishers like the NYT and Wall Street Journal have different calculus than ad-supported publishers who need maximum distribution.
Licensing deals: who has them, what they actually protect
As the blocking strategy proved incomplete, major publishers began negotiating licensing agreements directly with AI companies. The deals accelerated through 2024 and 2025:
- Associated Press + OpenAI (July 2023): The first major news-AI licensing deal. AP granted OpenAI access to its archive, reportedly for a “low double-digit millions” sum (The Information, 2023). OpenAI gained access to decades of AP content; AP gained an AI partner and financial compensation.
- Axel Springer + OpenAI (December 2023): The German media giant (Bild, Politico, Business Insider) signed a deal reportedly worth tens of millions annually. ChatGPT would surface Axel Springer content with attribution and links.
- News Corp + OpenAI (May 2024): Rupert Murdoch’s empire (Wall Street Journal, New York Post, The Times of London, The Australian) signed a deal valued at over $250 million over five years — one of the largest publisher-AI agreements to date (Wall Street Journal, 2024).
- The Atlantic + OpenAI (May 2024): The Atlantic became both a licensing partner and product collaborator, integrating AI-powered tools into its editorial workflow.
- Vox Media + OpenAI (May 2024): Covering The Verge, Vox, New York Magazine, and other properties.
- Condé Nast + OpenAI (August 2024): After the Perplexity confrontation, Condé Nast signed with OpenAI for its portfolio of titles.
- Time + OpenAI (June 2024): Access to Time’s 101-year archive.
Google has pursued similar agreements for Gemini, and Perplexity launched a revenue-sharing program with publishers in mid-2024 following sustained criticism.
But licensing has limits. The Reuters Institute found that licensing deals overwhelmingly benefit large publishers — the top 20–30 global media companies have the leverage to negotiate meaningful terms. Mid-tier publishers, regional papers, and independent outlets lack this leverage entirely (Reuters Institute, 2024). The Knight Foundation noted in a 2024 analysis that “the vast majority of local news organizations will never have the scale to negotiate AI licensing deals” (Knight Foundation, 2024).
Moreover, licensing with OpenAI does not protect your visibility in Claude, Gemini, Perplexity, or Grok. A publisher licensed to one AI platform may still be invisible, misattributed, or hallucinated on every other platform. The deals are bilateral; the AI ecosystem is multilateral.
The syndication trap: when your reporting promotes someone else’s domain
Syndication has always been a double-edged sword for publishers. It extends reach but dilutes brand attribution. In the AI era, this tradeoff has become dramatically worse.
Consider the mechanics: Reuters publishes an investigation on reuters.com. Within hours, it appears on Yahoo News, MSN, Google News, dozens of regional newspaper sites that license Reuters wire content, and hundreds of blogs that quote excerpts. Each of these copies generates its own backlinks, social shares, and Reddit discussions. The AI training corpus now contains one reuters.com version and 40+ copies.
When AI systems generate a response about that topic, corpus frequency dictates which URL gets cited. The Yahoo News version, with its massive domain authority and traffic, often wins. The journalist who spent six months on the investigation gets no AI attribution.
| Syndication Scenario | Original Publisher Citation Rate | Syndicated Copy Citation Rate | Net Attribution Effect |
|---|---|---|---|
| Wire service → Yahoo/MSN | ~20% | ~55% | Aggregator gets 2.7x more AI credit |
| National paper → Apple News | ~35% | ~40% | Roughly split, but Apple News URL cited |
| Regional paper → national pickup | ~8% | ~62% | Original almost never cited |
| Magazine exclusive → social excerpts | ~30% | ~25% | Better, but 45% get no attribution at all |
Based on Metricus internal testing across ChatGPT, Perplexity, Gemini, Claude, and Grok (2026). Percentages indicate how often each version was cited when the topic appeared in AI responses.
Forbes faces a related but distinct problem. Its contributor network — thousands of external writers publishing under the Forbes banner — has generated massive content volume, much of it optimized for search. AI systems treat Forbes.com as a highly authoritative domain, but the contributor content varies dramatically in quality. The result: AI cites “according to Forbes” for content the Forbes editorial team didn’t commission, review, or endorse. This dilutes the Forbes brand in AI while simultaneously boosting its raw citation frequency.
For a deeper look at how incorrect AI citations damage brand perception, see our research on fixing AI hallucinations about your brand.
Local news and the AI visibility desert
If national publishers face an attribution crisis, local newspapers face an existential one. The AI visibility gap between national and local news is the widest in any industry we’ve studied.
The numbers tell the story. The US has lost approximately 2,900 newspapers since 2005, according to Northwestern University’s Medill School of Journalism (2024). More than 200 counties — home to 3.5 million people — have no local newspaper at all. These are “news deserts” in the traditional sense. In the AI context, they are becoming information black holes where AI has no local source to cite and instead fabricates or generalizes from national data.
Pew Research Center’s 2024 State of the News Media report found that local newspaper newsroom employment has fallen 57% since 2008, from 71,000 journalists to approximately 30,600 (Pew, 2024). Fewer journalists means less content, which means a smaller web corpus, which means lower AI visibility. It’s a death spiral that AI accelerates.
When someone asks ChatGPT about their local school district budget, property tax rates, or city council decisions, the AI typically does one of three things:
- Fabricates a response based on patterns from other cities, getting specific facts wrong.
- Cites a national outlet that covered the story secondarily, if at all.
- Declines to answer with a generic “I don’t have specific information about [city]” response.
None of these outcomes serves the community. And none of them sends traffic or attribution to the local paper that covered the story.
The American Press Institute’s 2024 report on AI and local news recommended that local publishers focus on structured data and machine-readable formats as a survival strategy — not because AI is good for journalism, but because AI visibility may become the primary discovery mechanism within five years (API, 2024). The Knight Foundation has launched grant programs specifically targeting AI readiness for local newsrooms.
Branded chatbots: the publisher counterstrategy
Some publishers have responded to the AI visibility challenge by building their own AI-powered products — keeping the conversation (and the data) on their own platforms.
- The New York Times has integrated AI-powered search and summarization tools into its app experience, keeping readers within the NYT ecosystem rather than losing them to ChatGPT.
- Forbes launched “Adelaide,” a branded AI assistant trained on Forbes content, in 2024. It answers questions using Forbes’s proprietary archive, with attribution built in by design.
- The Washington Post has experimented with AI-powered article recommendations and summarization, emphasizing its own content corpus.
- Schibsted (Scandinavian media group) deployed AI chatbots across its newspaper portfolio, trained specifically on their editorial archives.
The logic is straightforward: if third-party AI won’t attribute correctly, build your own AI that does. But this strategy requires significant investment in AI infrastructure that only the largest publishers can afford. The Nieman Journalism Lab estimated that a basic branded AI chatbot costs $500,000–$2 million to develop and maintain for a major publisher (Nieman Lab, 2024) — well beyond the reach of regional and local outlets.
The more scalable response is ensuring your content is optimally structured for third-party AI consumption — something any publisher can do.
Substack, Medium, and the independent publisher question
The rise of independent publishing platforms has created a new category of publishers with unique AI visibility challenges.
Substack now hosts over 35 million active subscriptions (Substack, 2024) and has become the home for many former institutional journalists. Top Substack writers like Matt Yglesias (Slow Boring), Heather Cox Richardson (Letters from an American), and Casey Newton (Platformer) reach hundreds of thousands of readers per post. Yet their AI visibility is almost entirely mediated through the substack.com domain — not their individual publication brands.
When AI cites a Substack post, it typically attributes the information to “Substack” generically or, more often, doesn’t cite the source at all. The ideas get absorbed into the AI’s knowledge without any attribution to the writer. AI systems treat individual Substack newsletters the same way they treat anonymous blog posts — low domain authority, low citation priority, even when the writer is a recognized expert.
Medium faces similar dynamics. Despite approximately 100 million monthly active readers (Medium, 2024), individual Medium publications rarely get AI attribution. The platform’s open publishing model means that AI training data treats Medium content as lower-authority compared to institutional publisher domains.
For independent publishers and newsletter operators, the AI visibility challenge is even steeper than for institutional media. They lack domain authority, institutional brand recognition, and the web corpus footprint that AI uses to determine citation priority. The Princeton/Georgia Tech GEO study found that content from domains with high authority scores was 2–3x more likely to be cited by AI systems than equivalent content from lower-authority domains (Aggarwal et al., 2023).
What actually works: the AI visibility playbook for publishers
The publishing industry’s AI visibility problem is solvable — but the solutions look different from what most publishers are doing. Blocking and suing are defensive strategies. The offensive strategy is making your content the most citable, most authoritative, most structured version on the web. Here’s what works, based on our research into turning AI visibility data into action.
1. Audit your AI citation accuracy across all platforms
Before anything else, you need to know how AI currently represents your publication. Query ChatGPT, Perplexity, Gemini, Claude, and Grok with prompts your readers would use:
- “What did [your publication] report about [topic]?”
- “Summarize the latest news about [topic you covered]”
- “Who broke the story about [your exclusive]?”
- “What are the best sources for [your beat] coverage?”
- “Tell me about [your publication]”
Document every correct citation, every misattribution, every hallucination, and every instance where a syndicated copy was cited instead of your original. Or run a Metricus AI visibility report that does this across hundreds of query variations automatically. For a quick start, try our free AI visibility check.
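To keep a manual audit systematic, the prompt templates above can be expanded into a logging sheet programmatically. Here is a minimal Python sketch — the publication name, topics, and file path are placeholders, and the actual querying is still done by hand on each platform:

```python
import csv

# Placeholder publication and topics -- substitute your own.
PUBLICATION = "Example Tribune"
TOPICS = ["city council budget", "school district audit"]

# Prompt templates mirroring the audit list above.
TEMPLATES = [
    "What did {pub} report about {topic}?",
    "Summarize the latest news about {topic}",
    "Who broke the story about {topic}?",
    "Tell me about {pub}",
]

PLATFORMS = ["ChatGPT", "Perplexity", "Gemini", "Claude", "Grok"]

def build_audit_sheet(path="ai_citation_audit.csv"):
    """Write one row per (platform, prompt) pair so each response can be
    logged as: correct citation, misattribution, hallucination, or
    syndicated copy cited instead of the original."""
    prompts = []
    for tpl in TEMPLATES:
        for topic in TOPICS:
            p = tpl.format(pub=PUBLICATION, topic=topic)
            if p not in prompts:  # dedupe templates that ignore {topic}
                prompts.append(p)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["platform", "prompt", "outcome", "cited_url"])
        for platform in PLATFORMS:
            for prompt in prompts:
                writer.writerow([platform, prompt, "", ""])
    return path
```

Filling in the outcome column by hand across a few dozen rows is enough to establish a baseline before the 90-day re-audit.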
2. Implement comprehensive NewsArticle schema markup
AI systems use structured data as a strong signal for content provenance and authority. Every article should include:
- NewsArticle schema with complete author, datePublished, dateModified, publisher, and mainEntityOfPage fields.
- Byline-level author markup (Person schema) linking to author profile pages with credentials, beat, and social profiles.
- citation and isBasedOn properties: use citation on your articles to point at the underlying data or documents when you are the original source for a claim, and have syndicated copies use isBasedOn pointing back to your original URL — both explicitly signal provenance to AI crawlers.
- FAQPage schema for articles that answer specific questions — these are disproportionately surfaced in AI responses.
- ClaimReview schema for fact-check content, which AI systems prioritize heavily.
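As a concrete illustration, here is a minimal NewsArticle JSON-LD sketch generated in Python. All values are placeholders; the JSON output belongs in a script tag of type application/ld+json in the article's head:

```python
import json

def news_article_jsonld(headline, author_name, author_url,
                        date_published, canonical_url,
                        publisher="Example Tribune"):
    """Build a minimal NewsArticle JSON-LD object covering the core
    provenance fields listed above. Extend with isBasedOn/citation
    and dateModified as appropriate for each article."""
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name, "url": author_url},
        "datePublished": date_published,
        "dateModified": date_published,
        "publisher": {"@type": "Organization", "name": publisher},
        "mainEntityOfPage": canonical_url,
    }

# Example with placeholder values:
markup = news_article_jsonld(
    "County budget shortfall hits $4M",
    "Jane Reporter",
    "https://exampletribune.com/authors/jane-reporter",
    "2026-03-14",
    "https://exampletribune.com/news/county-budget-shortfall",
)
print(json.dumps(markup, indent=2))
```

Note the author field links to a profile page, which is where the Person-level credential markup lives.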
3. Strengthen canonical signals across syndication
If you syndicate content, ensure every syndicated copy includes:
- A rel="canonical" tag pointing to your original URL.
- Clear “Originally published at [your publication]” attribution text with a live hyperlink — in the first paragraph, not buried at the bottom.
- Consistent isPartOf and publisher structured data identifying your publication as the originating source.
These signals help AI systems identify the original source even when they encounter the syndicated version first.
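Spot-checking that syndicated copies actually carry these canonical tags can be done with nothing but the Python standard library. A minimal sketch — the URLs and page markup are placeholders:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Extract the rel="canonical" href from a page's HTML, so you can
    verify that a syndicated copy points back at your original URL."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

def find_canonical(html_text):
    parser = CanonicalFinder()
    parser.feed(html_text)
    return parser.canonical

# Simulated syndicated page (placeholder URLs):
page = '''<html><head>
<link rel="canonical" href="https://exampletribune.com/news/original-story">
</head><body>Originally published at Example Tribune.</body></html>'''
```

In practice you would fetch each syndication partner's copy of a recent article and flag any page where find_canonical returns a URL other than your own.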
4. Publish data-rich, citable content alongside narrative reporting
AI disproportionately cites content with structured data, statistics, and verifiable claims. The GEO research found content with statistical citations was up to 40% more likely to be cited by AI systems. For publishers, this means:
- Publishing data tables, charts, and statistical summaries alongside narrative stories — not just infographics (which most AI crawlers can’t parse) but HTML tables with machine-readable data.
- Creating evergreen reference content (“2026 Guide to [topic]: Key Statistics and Data”) that AI surfaces for informational queries year-round.
- Including specific numerical claims with sourcing in article text (“The unemployment rate in Cook County fell to 4.2% in March 2026, down from 4.7% in December 2025, according to BLS data”) rather than unsourced narrative (“unemployment is trending downward”).
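For example, a statistic like the unemployment figure above can be published as a plain HTML table instead of (or alongside) an image. A minimal Python sketch, with placeholder figures:

```python
def stats_table(rows, headers):
    """Render statistics as a plain HTML table -- unlike an infographic,
    this is text that AI crawlers can parse directly."""
    head = "".join(f"<th>{h}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (f"<table><thead><tr>{head}</tr></thead>"
            f"<tbody>{body}</tbody></table>")

# Placeholder figures for illustration:
html = stats_table(
    [("2025-12", "4.7%"), ("2026-03", "4.2%")],
    ["Month", "Cook County unemployment (BLS)"],
)
```

A CMS template that emits tables like this for every data point keeps the sourcing machine-readable without extra editorial work.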
5. Build direct citation authority
Beyond your own content, your publication’s AI visibility depends on how other sources reference you:
- Ensure your publication is listed and accurately described in Wikipedia (where it exists) — AI heavily weights Wikipedia content.
- Maintain active, verified profiles on Google News Publisher Center, Apple News, and other aggregator platforms.
- Publish annual or quarterly transparency reports, audience data, and editorial standards pages that AI can cite as authoritative descriptions of your publication.
- Encourage cross-citation from other authoritative sources — academic papers, government reports, and industry analyses that reference your reporting with hyperlinks.
6. Strategic robots.txt decisions
Rather than blanket blocking all AI crawlers, consider a nuanced approach:
- Allow crawling of non-paywalled content (opinion, analysis, data journalism, reference content) while blocking premium subscriber-only content.
- Allow specific AI crawlers whose platforms drive meaningful referral traffic while blocking others. Perplexity, for example, provides source links that generate click-through traffic; ChatGPT’s responses typically do not.
- Use the emerging AI-specific meta tags (like the proposed “AI-training-opt-out” headers) to block training while allowing retrieval-augmented generation (RAG) that cites your current content.
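As an illustration, a tiered robots.txt along these lines might look like the sketch below. The user-agent tokens are the crawler names discussed above; the paths are placeholders for your own site structure, and the right mix depends on your revenue model:

```text
# Sketch of a tiered robots.txt -- /premium/ is a placeholder path.

# Block training crawlers from subscriber-only content only:
User-agent: GPTBot
Disallow: /premium/

User-agent: Google-Extended
Disallow: /premium/

User-agent: ClaudeBot
Disallow: /premium/

# Block open-dataset crawling entirely:
User-agent: CCBot
Disallow: /

# Allow an answer engine that sends referral traffic:
User-agent: PerplexityBot
Allow: /

# Search indexing unaffected:
User-agent: Googlebot
Allow: /
```

Remember the caveat from the Condé Nast episode: robots.txt is a request, not enforcement, so pair it with server-side monitoring of crawler traffic.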
| Action | Effort | Timeline | Expected Impact |
|---|---|---|---|
| Audit AI citation accuracy | Low (or use Metricus) | Day 1 | Baseline established |
| Implement NewsArticle schema | Medium (dev needed) | Week 1–2 | High — signals provenance to AI |
| Fix syndication canonical signals | Medium | Week 1–3 | Reclaims attribution from aggregators |
| Publish data-rich reference content | High (ongoing) | Week 2–8 | Highest long-term citation impact |
| Review robots.txt strategy | Low | Week 1 | Balances protection and visibility |
| Build direct citation authority | Medium (ongoing) | Week 2–12 | Builds corpus authority over time |
| Re-audit after 90 days | Low | Day 90 | Measure + iterate |
The case for auditing your publication’s AI visibility now
The publishing industry is at a critical inflection point. Gartner’s 2024 forecast that traditional search volume will drop 25% by 2026 due to AI is no longer a projection — it’s playing out in real time. Publishers who depend on search-driven traffic for advertising revenue are watching their primary discovery channel erode.
The economic stakes are enormous. The global digital news advertising market was valued at approximately $58 billion in 2024 (PwC Global Entertainment & Media Outlook, 2024). Subscription revenue for digital news reached approximately $3.2 billion in the US alone (Digital Content Next, 2024). Every percentage point of audience lost to AI-mediated discovery represents hundreds of millions in industry-wide revenue.
But there’s a window of opportunity. Most publishers are focused on the defensive playbook — blocking, suing, licensing. Very few are actively optimizing their content for AI citation. The Princeton/Georgia Tech GEO study demonstrated that targeted content optimization improved AI citation rates by 15–40% across different strategies (Aggarwal et al., 2023). Publishers who move first on AI visibility optimization will have a compounding advantage as AI-mediated news consumption grows.
For local and regional publishers, the urgency is even greater. The Nieman Journalism Lab found that local news organizations that invested in structured data and AI readiness saw 3x higher citation rates in AI responses compared to peers who did not (Nieman Lab, 2024). In a landscape where local news is disappearing, AI visibility may become the difference between survival and closure.
For media companies with large portfolios — Condé Nast, Hearst, Gannett, McClatchy — the calculation is about protecting the value of each title’s brand in AI. When AI says “according to Forbes” for a contributor post that doesn’t meet Forbes editorial standards, that’s brand dilution. When AI cites a Yahoo News reprint instead of the Wired original, that’s attribution theft. Understanding exactly where these failures occur — and at what scale — is the first step toward fixing them.
For more on how this plays out across industries, read our research on why B2B SaaS brands are invisible in ChatGPT and why AI ignores your brand.
The bottom line: If you publish original content — whether you’re The New York Times, a regional newspaper chain, a digital-native outlet, or a Substack writer with 50,000 subscribers — AI is either citing you, citing someone else for your work, or making things up about your reporting. You need to know which one. Not next quarter. Now.
This article gives you the framework. A Metricus report gives you the specific citation errors, exact misattribution sources, and prioritized actions for your publication — across every major AI platform. One-time purchase from $99. No subscription required.
Sources: Reuters Institute for the Study of Journalism, Digital News Report (2024); Press Gazette AI crawler tracking data (2024); Pew Research Center, State of the News Media and AI adoption surveys (2024); Columbia Journalism Review (CJR) AI attribution analysis (2025); Tow Center for Digital Journalism, Columbia University (2025); Nieman Journalism Lab local news and AI analysis (2024); American Press Institute (API) AI and local news report (2024); Knight Foundation AI and journalism analysis (2024); Northwestern University Medill School, State of Local News Report (2024); PwC Global Entertainment & Media Outlook (2024); Digital Content Next subscription data (2024); Princeton/Georgia Tech GEO study, Aggarwal et al. (2023); The Information, AP–OpenAI deal reporting (2023); Wall Street Journal, News Corp–OpenAI deal terms (2024); Substack audience data (2024); Medium audience data (2024). AI citation rates based on Metricus internal testing across ChatGPT, Perplexity, Gemini, Claude, and Grok (2026). Learn more about how we measure AI visibility.
Related reading
- The 5-step AI visibility action plan — the general framework for turning audit findings into fixes.
- Fixing AI hallucinations about your brand — the deep dive on correcting factual errors at their source.
- What is AI visibility? — the complete explainer on how brands appear in AI.
- Why B2B SaaS brands are invisible in ChatGPT — the same dynamic in a different industry, with transferable strategies.
- Free AI visibility check — run a quick manual check before ordering a full report.
- AI visibility scores explained — how Metricus measures and benchmarks AI visibility.