Why most scores are unreliable
If you're new to the topic, start with our complete guide to AI visibility. This article goes deeper into the scoring methodology.
Ask ChatGPT "what's the best CRM for small teams?" right now. Then ask it again in a new session. You'll likely get a different answer. AI chatbots are nondeterministic — they don't give the same response every time.
This means any tool that runs a query once and reports a score is giving you noise, not signal. Yet that's exactly what most free AI visibility checkers do: one query, one answer, one score. It's like checking the weather by looking out the window once and declaring "it never rains here."
The problem in numbers: In our testing, running the same query 10 times on ChatGPT produced mention rates ranging from 20% to 80% for the same brand. A single query captured less than 15% of the actual pattern. You need 50+ queries per platform to get a statistically meaningful result.
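To see why volume matters, treat each run as a draw from the brand's underlying mention probability: the margin of error on a measured mention rate shrinks roughly with the square root of the number of runs. A minimal sketch, where the 50% rate and the run counts are illustrative assumptions (not Metricus parameters) and the normal approximation is only a rough guide at very small n:

```python
import math

def mention_rate_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a measured mention rate.

    p is the observed mention rate (0..1) and n the number of query runs.
    Uses the normal approximation to the binomial, so it is a rough
    illustration rather than a formal power analysis.
    """
    return z * math.sqrt(p * (1 - p) / n)

# How the uncertainty shrinks for a brand with a ~50% underlying mention rate:
for n in (1, 10, 50, 200):
    print(f"{n:>3} runs -> ±{mention_rate_margin(0.5, n):.0%}")

#   1 runs -> ±98%   (a single query is essentially a coin flip)
#  10 runs -> ±31%
#  50 runs -> ±14%   (enough to separate a 20% brand from a 60% brand)
# 200 runs -> ±7%
```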
Why Do Different Tools Report Different AI Visibility Scores?
If you have tested your brand's AI visibility with more than one tool, you have probably noticed something frustrating: the scores don't match. One tool says you have 45% visibility, another says 22%, and a third says 61%. This isn't a bug — it reflects fundamental differences in how these tools measure.
The biggest variable is API vs real UI measurement. Many tools query AI models through developer APIs, which can return different results than what a real user sees in the ChatGPT or Perplexity interface. API responses often lack the web-search grounding that real sessions include, producing a completely different set of recommendations. Tools that simulate actual user sessions — opening a browser, typing a query, reading the response — capture what buyers actually experience.
Timing differences compound the problem. AI chatbots are nondeterministic — they change answers between runs, sometimes within minutes. A tool that measured your visibility on Monday morning may report a different score than one that measured on Tuesday afternoon, even using identical queries.
Then there is the prompt set. Different tools use different queries. One might ask "best CRM for startups" while another asks "CRM software recommendations for small business." These are subtly different prompts that produce different AI responses. The choice of prompts shapes the score more than most users realize.
This is why methodology matters more than the headline number. When evaluating any AI visibility tool, ask: how do they query, how often do they run each query, and do they disclose their prompt set? Learn about our measurement methodology →
The nondeterminism problem
According to data from Search Engine Land (tracking 2,500 prompts monthly), AI source citations change 40–60% month over month. Temperature settings, session context, server-side A/B tests, and model updates all contribute. A brand visible today might be invisible next month — or vice versa.
This has three implications for scoring:
- Volume matters more than precision. Running 200 queries gives you a distribution, not a point estimate. A "41% mention rate" from 200 queries is meaningful. A "mentioned" or "not mentioned" from 1 query is not.
- Cross-platform comparison reveals real patterns. If you're mentioned on most platforms, that's signal. If you're mentioned on only one, that's a problem regardless of the exact percentage.
- Trends matter more than absolutes. A score of 35% that was 20% last month is excellent progress. A score of 60% that was 80% last month is a problem. (For a framework on tracking this over time, see the 90-day AI visibility playbook.)
How a meaningful score is calculated
Here's how Metricus calculates AI visibility scores — and what to look for in any tool you evaluate, whether you're running audits in-house or selecting a tool for agency client work:
1. Query design. We generate 20–50 target queries based on your category, brand name, and competitors. These mirror real buyer queries: "best [category]," "[brand] vs [competitor]," "[brand] pricing," "[brand] alternatives."
2. Multi-run execution. Each query runs multiple times per platform to account for nondeterminism. For the Deep Dive tier, that means comprehensive coverage across the AI platforms we track.
3. Scoring. For each query-platform combination, we track: Was the brand mentioned? In what position? Was it recommended or just listed? Were the facts accurate? Was a source cited?
4. Aggregation. The overall visibility score is the percentage of query-platform combinations where your brand was mentioned. Sub-scores break this down by platform, by query type, and by mention quality (recommended vs. listed vs. compared unfavorably).
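As an illustration of step 4, here is a minimal sketch of the aggregation, assuming each run is logged as a (query, platform, mentioned) record. The schema, the example queries and brand, and the convention that a combination counts as mentioned if any of its runs mention the brand are assumptions made for the example, not Metricus's actual implementation:

```python
from collections import defaultdict

# One record per run of one query on one platform (hypothetical schema and data).
runs = [
    {"query": "best crm for small teams", "platform": "chatgpt",    "mentioned": True},
    {"query": "best crm for small teams", "platform": "chatgpt",    "mentioned": False},
    {"query": "best crm for small teams", "platform": "perplexity", "mentioned": True},
    {"query": "acme crm alternatives",    "platform": "chatgpt",    "mentioned": False},
    {"query": "acme crm alternatives",    "platform": "perplexity", "mentioned": True},
]

def visibility_scores(runs):
    """Overall score = share of query-platform combinations where the brand was
    mentioned at least once; sub-scores repeat the calculation per platform."""
    by_combo = defaultdict(list)
    for r in runs:
        by_combo[(r["query"], r["platform"])].append(r["mentioned"])

    mentioned = {combo for combo, flags in by_combo.items() if any(flags)}
    overall = len(mentioned) / len(by_combo)

    per_platform = {}
    for platform in {combo[1] for combo in by_combo}:
        combos = [c for c in by_combo if c[1] == platform]
        per_platform[platform] = sum(c in mentioned for c in combos) / len(combos)
    return overall, per_platform

overall, per_platform = visibility_scores(runs)
print(f"overall: {overall:.0%}", per_platform)
# overall: 75% (3 of 4 combinations), chatgpt 50%, perplexity 100%
```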
What to look for in any tool
Regardless of which tool you use, ask:
- How many queries per platform? If the answer is under 30, the data is noise.
- Do they disclose their methodology? If not, you can't validate the results.
- Do they differentiate between RAG-based retrieval and parametric knowledge? If not, they're conflating two fundamentally different visibility channels.
The 5 metrics that matter
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Mention rate | % of queries where your brand appears | Your basic visibility — are you in the conversation? |
| Positioning | Where you appear in the response (1st, 2nd, 3rd+) | AI responses have steep attention decay: the first-mentioned brand captures the majority of follow-up queries. Position 3+ is functionally invisible. |
| Recommendation rate | % of mentions where AI actively recommends you | AI chatbots often name 4–5 brands but only "recommend" 1–2. Users treat those recommendations like a trusted referral, not a search result. |
| Accuracy | Number of factual errors across platforms | AI hallucinations about pricing or features spread across platforms. One wrong fact in ChatGPT often appears in Perplexity and Copilot too. |
| Sentiment | The tone and framing when your brand is discussed | AI models often hedge with qualifiers like "but some users report..." — these caveats shape perception even when you're technically mentioned. |
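Mention rate is covered by the aggregation sketch above; the per-mention quality metrics (positioning, recommendation rate, accuracy) can be derived from the same logged runs. A rough illustration with hypothetical fields and data (sentiment is omitted because scoring tone typically needs a separate classifier):

```python
# Per-run records where the brand was mentioned (hypothetical data); in practice
# these come out of step 3 of the scoring pipeline above.
mentions = [
    {"platform": "chatgpt",    "position": 1, "recommended": True,  "factual_errors": 0},
    {"platform": "chatgpt",    "position": 3, "recommended": False, "factual_errors": 1},
    {"platform": "perplexity", "position": 2, "recommended": True,  "factual_errors": 0},
]

# Positioning: how often the brand is the first name the AI gives.
first_position_share = sum(m["position"] == 1 for m in mentions) / len(mentions)

# Recommendation rate: mentions where the AI actively recommends, not just lists.
recommendation_rate = sum(m["recommended"] for m in mentions) / len(mentions)

# Accuracy: total factual errors observed across platforms.
factual_errors = sum(m["factual_errors"] for m in mentions)

print(f"first position in {first_position_share:.0%} of mentions, "
      f"recommended in {recommendation_rate:.0%}, "
      f"{factual_errors} factual error(s)")
```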
What's a "good" score?
Based on 500+ audits across SaaS, e-commerce, and professional services (see also our benchmark study of 182 LLM prompts for the underlying data):
| Score Range | Interpretation | Action |
|---|---|---|
| 0–15% | Invisible. AI doesn't know you exist. | Start with structured data and third-party listings |
| 15–35% | Occasional mentions. Inconsistent presence. | Fix errors, create comparison content |
| 35–60% | Visible. Mentioned in most relevant queries. | Optimize positioning and recommendation rate (see the 5-step action plan) |
| 60–80% | Strong. Consistently mentioned and recommended. | Maintain and defend position |
| 80%+ | Dominant. The default recommendation in your category. | Monitor for competitor catch-up |
28%: the average AI visibility score across our audits. Most brands are far less visible in AI than they assume. The median "expected" score is 55%. Reality is usually half that.
The gap between perceived and actual AI visibility is roughly 2x. Brands that assume they're visible typically score far lower when measured. The first step isn't optimization — it's accurate measurement. If you haven't audited your actual AI presence, any optimization work is guesswork.
Source: Search Engine Land, which tracks 2,500 prompts monthly, for the 40–60% monthly change in AI source citations.