Why most scores are unreliable

AI chatbots are nondeterministic: they do not give the same response every time. In our AI visibility report testing, running the same query 10 times on ChatGPT produced mention rates ranging from 20% to 80% for the same brand. Any tool that runs a query once and reports a score is giving you noise, not signal. You need 50+ queries per platform to get a statistically meaningful result.

The problem in numbers: a single query captures less than 15% of the actual pattern. Tools also disagree with one another because they differ in measurement method, prompt set, timing, and scoring methodology.
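To make the sampling argument concrete, here is a minimal sketch of how wide the uncertainty around a measured mention rate is at different sample sizes. It uses the Wilson score interval, a standard confidence interval for a proportion. This is an illustrative choice, not a description of how Metricus or any other tool computes its scores, and the function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(mentions: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval (about 95% by default) for a mention rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= mentions <= runs:
        raise ValueError("mentions must be between 0 and runs")
    rate = mentions / runs
    denom = 1 + z**2 / runs
    center = (rate + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(rate * (1 - rate) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: the brand is mentioned in half of the runs each time.
for mentions, runs in [(5, 10), (25, 50), (100, 200)]:
    low, high = wilson_interval(mentions, runs)
    print(f"{runs:>3} runs: observed {mentions / runs:.0%}, "
          f"plausible range {low:.0%} to {high:.0%}")
```

Under these assumptions, an observed 50% mention rate from 10 runs is compatible with a true rate anywhere from roughly 24% to 76%. At 50 runs the range narrows to roughly 37% to 63%, which is why larger samples per platform matter.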

API vs. real UI measurement

The biggest variable between tools is whether they measure through an API or through the real UI. Many tools query AI models through developer APIs, which return different results from what users see in the actual ChatGPT or Perplexity interface. API responses often lack the web-search grounding that real sessions include, producing a completely different set of recommendations. In our testing, tools that simulate actual user sessions (opening a browser, typing a query, reading the response) capture what buyers actually experience.

The 5 metrics that matter

Based on our analysis across hundreds of audits, five metrics provide a meaningful picture of AI visibility:

  • Mention rate: How often your brand appears across all tested queries and platforms.
  • Positioning: Where your brand appears in the response — first recommendation, listed among alternatives, or mentioned as an afterthought.
  • Recommendation quality: Whether you are actively recommended versus merely listed or compared unfavorably.
  • Factual accuracy: Whether AI gets your details right — pricing, features, positioning, founding date.
  • Citation sources: Which URLs and sources feed the AI’s answer about your brand.

The overall score hides what matters. A brand might score 90% on broad queries and 0% on industry-specific ones. The breakdown by query type matters more than the headline number.
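As a rough illustration of how a headline number can hide the breakdown, the sketch below groups query results by type and compares the per-type mention rates with the overall rate. The record layout (a query type paired with a mentioned/not-mentioned flag), the function names, and the sample data are all hypothetical and chosen only to mirror the 90% versus 0% example.

```python
from collections import defaultdict


def mention_rate_by_type(results):
    """Map each query type to the share of its queries that mentioned the brand."""
    counts = defaultdict(lambda: [0, 0])  # query type -> [mentions, total queries]
    for query_type, mentioned in results:
        counts[query_type][0] += int(mentioned)
        counts[query_type][1] += 1
    return {qtype: hits / total for qtype, (hits, total) in counts.items()}


def overall_rate(results):
    """Share of all queries, regardless of type, that mentioned the brand."""
    return sum(int(mentioned) for _, mentioned in results) / len(results)


# Hypothetical sample: 9 of 10 broad queries mention the brand,
# 0 of 10 industry-specific queries do.
sample = (
    [("broad", True)] * 9
    + [("broad", False)]
    + [("industry-specific", False)] * 10
)

print(f"overall: {overall_rate(sample):.0%}")
for qtype, rate in mention_rate_by_type(sample).items():
    print(f"{qtype}: {rate:.0%}")
```

In this made-up sample the overall figure is 45%, which describes neither the 90% on broad queries nor the 0% on industry-specific ones.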

Cross-platform score variation

When we tested the same brand across multiple AI platforms, scores varied significantly. A brand might score 75% visibility on ChatGPT and 40% on Perplexity for identical queries. This happens because each platform uses different training data, different retrieval mechanisms, and different synthesis logic. Perplexity, which relies heavily on real-time web search, tends to surface brands with a strong recent web presence. ChatGPT, which leans more on training data, favors brands with an established parametric presence, meaning brands that are already well represented in what the model learned during training.

This cross-platform variation is why any meaningful AI visibility assessment must test across all major platforms rather than relying on a single one. A score from ChatGPT alone tells you less than half the story. Metricus AI visibility reports test across all major AI platforms specifically because buyer behavior is distributed — different buyers use different AI tools, and your visibility needs to be consistent across all of them.

How scores change over time

AI visibility scores are not static. Model updates, competitor content changes, and shifts in a platform's retrieval behavior can move your score by 20+ points between quarters. In our audits, the most common cause of score drops is competitor activity: when a competitor publishes strong new content or earns significant third-party coverage, AI models shift their recommendations accordingly. Conversely, we have seen brands increase their scores by 15–30 points after addressing the specific gaps identified in an initial audit.
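Because individual measurements are noisy, it helps to check whether a quarter-over-quarter change is larger than sampling variation alone would explain. The sketch below uses a pooled two-proportion z-test for that purpose. It is an assumption about how one might do this, not Metricus's stated method, and the function name `two_proportion_z` and the counts are hypothetical.

```python
import math


def two_proportion_z(mentions_before: int, runs_before: int,
                     mentions_after: int, runs_after: int) -> float:
    """z statistic for the change in mention rate between two measurement periods."""
    rate_before = mentions_before / runs_before
    rate_after = mentions_after / runs_after
    # Pooled rate under the assumption that nothing actually changed.
    pooled = (mentions_before + mentions_after) / (runs_before + runs_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / runs_before + 1 / runs_after))
    if std_err == 0:
        return 0.0  # Both periods are all mentions or all misses.
    return (rate_after - rate_before) / std_err


# Hypothetical: the same 20-point rise (40% to 60%) at two sample sizes.
z_large = two_proportion_z(20, 50, 30, 50)
z_small = two_proportion_z(4, 10, 6, 10)
print(f"50 runs per quarter: z = {z_large:.2f}")
print(f"10 runs per quarter: z = {z_small:.2f}")
```

With these made-up counts, the 20-point rise measured on 50 runs per quarter gives z = 2.00, just past the conventional 1.96 threshold for a 95% two-sided test. The same rise measured on 10 runs gives z of about 0.89, which is indistinguishable from noise.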

What is a “good” score?

In our B2B SaaS audits, scores ranged from 50% to 83%. Above 60% is competitive in most categories. The category matters enormously, however: niche categories with fewer competitors tend to produce higher scores, while crowded categories like CRM or project management produce lower scores even for market leaders. A Metricus AI visibility report breaks down your score by query type, platform, and competitor so you can see exactly where gaps exist.

Scores should also be interpreted in the context of query type distribution. A brand scoring 80% overall might achieve that by dominating broad queries while being invisible for the industry-specific and use-case-specific queries that often represent the highest-intent buyers. Conversely, a brand scoring 55% overall might be the top recommendation for the most commercially valuable queries in its category. The score breakdown by query type (broad, comparative, use-case-specific, and industry-specific) reveals where the real opportunities and risks lie.

Last updated: April 2026