The spot-check illusion

You typed your company name into ChatGPT. The answer looked reasonable — maybe it got your product category right, mentioned a couple of features, listed you alongside competitors. You closed the tab feeling fine about it. Then your colleague ran the same query from their laptop. Different answer. Different competitors listed. Your brand mentioned third instead of first — or not mentioned at all.

This is not a bug. This is how large language models work. Every response is generated fresh, and the output varies based on built-in randomness, context, timing, and infrastructure factors that no user controls. What you saw in your one query was one sample from a distribution of possible answers — and that distribution is far wider than most people realize.

The core problem: A single ChatGPT query about your brand is like checking the weather by glancing out the window once. You might catch sunshine, you might catch rain. Neither tells you the climate.

Why ChatGPT gives different answers every time

ChatGPT does not retrieve pre-written answers from a database. It generates every response from scratch by predicting what words should come next, based on patterns learned during training. This prediction process is inherently probabilistic — the model assigns probability weights to thousands of possible next tokens and samples from that distribution. The result is that two identical prompts can produce meaningfully different outputs.
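
As a rough illustration of that sampling step, here is a minimal sketch in Python. The candidate words, their scores, and the temperature value are invented for illustration; they are not taken from any real model.

```python
import math
import random

def sample_next_token(scores, temperature=0.8):
    """Sample one next token from a softmax distribution over scores.

    Lower temperature sharpens the distribution, higher temperature
    flattens it, but any setting above zero leaves room for a
    different pick on every run.
    """
    scaled = [s / temperature for s in scores.values()]
    max_s = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - max_s) for s in scaled]
    return random.choices(list(scores.keys()), weights=weights, k=1)[0]

# Invented scores for candidate continuations of one brand-related prompt.
scores = {"Brand A": 2.1, "Brand B": 1.9, "Brand C": 1.7, "Brand D": 1.4}

# The "same prompt" sampled five times rarely yields the same sequence of picks.
print([sample_next_token(scores) for _ in range(5)])
```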

Multiple technical factors compound this variability:

  • Temperature and sampling: A randomness parameter controls how creative or conservative the model is. Even at low settings, the output is not fully deterministic. Higher settings — common in conversational products like ChatGPT — introduce more variation by design.
  • Floating-point arithmetic: LLMs perform billions of mathematical operations per response. Tiny rounding differences in floating-point calculations can cascade through the model, producing different final outputs from identical inputs. A tiny example of this order sensitivity follows the list.
  • Hardware parallelism: GPUs process operations in parallel, and the order in which those operations complete can vary between runs. This introduces nondeterminism at the infrastructure level that exists independently of any model setting.
  • Request batching: Cloud inference systems batch multiple users’ requests together. The result for one request can depend on which other requests were in the same batch — a factor that changes from second to second.
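
The order sensitivity behind the second and third bullets comes down to one fact: floating-point addition is not associative, so when parallel hardware or batching changes the order in which partial results are combined, the final value can change too. The toy example below is vastly smaller than anything inside a real model, but it shows the mechanism.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c    # 0.6000000000000001
right_first = a + (b + c)   # 0.6

print(left_first == right_first)   # False
print(left_first, right_first)
```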

The net effect is that even when OpenAI sets temperature to zero internally, outputs are not guaranteed to be identical. A 2026 study documented that even under the most deterministic settings possible, LLM outputs still varied across runs. For consumer-facing products like ChatGPT, where temperature is set higher to make conversations feel natural, the variation is substantially greater.

What the research shows

The scale of this variability is not theoretical. Multiple studies have now quantified it.

SparkToro: 2,961 queries, less than 1% list repetition

SparkToro recruited 600 volunteers who ran 2,961 individual AI queries across 12 different prompts on three major AI platforms. Each prompt was run 60 to 100 times per platform to generate statistically meaningful samples. The finding: AI tools returned the same brand recommendation list less than 1% of the time. When it came to ordering, researchers estimated you would need roughly 1,000 runs before seeing two identical ordered lists.

Despite this extreme per-query variability, the research did find that AI platforms drew from a relatively consistent consideration set. For headphone recommendations, brands like Bose, Sony, Sennheiser, and Apple appeared in 55% to 77% of the 994 responses — but their position, the surrounding recommendations, and the framing changed nearly every time.

Washington State University: 73% consistency rate

A March 2026 WSU study tested ChatGPT by feeding it the same prompts 10 times each. The model produced consistent answers only about 73% of the time. It frequently contradicted itself, sometimes flipping answers on the same question across runs. When adjusted for random chance, the accuracy was just 60% better than guessing.

Metricus internal testing: 20% to 80% mention rate swings

In our own AI visibility report testing, running the same brand query 10 times on ChatGPT produced mention rates ranging from 20% to 80% for the same brand. This is why any tool that runs a query once and reports a score is giving you noise, not signal. A meaningful AI visibility score requires 50+ queries per platform to reach statistical significance.
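
To see why ten runs can swing that widely, here is a minimal simulation. It assumes, purely for illustration, a brand whose true mention rate is exactly 50%; the numbers are not drawn from our report data.

```python
import random

def observed_mention_rate(true_rate, n_queries):
    """Simulate n_queries independent AI responses and return the
    fraction in which the brand happened to be mentioned."""
    mentions = sum(random.random() < true_rate for _ in range(n_queries))
    return mentions / n_queries

random.seed(7)
TRUE_RATE = 0.5  # assumed true mention rate, for illustration only

for n in (10, 50):
    trials = sorted(observed_mention_rate(TRUE_RATE, n) for _ in range(1000))
    low, high = trials[int(0.05 * len(trials))], trials[int(0.95 * len(trials))]
    print(f"{n:3d} queries per trial: middle 90% of trials saw {low:.0%} to {high:.0%}")
```

With 10 queries per trial, the middle 90% of trials spans roughly 20% to 80% — exactly the swing described above — while 50-query trials cluster much more tightly around the true rate.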

What this means for your brand

The implications for brand visibility in AI are significant. When you spot-checked ChatGPT and saw your brand mentioned, you were seeing one outcome from a slot machine that spins differently every time. Your colleague who got a different answer was seeing another equally valid pull.

This creates several concrete problems:

  • False confidence: A favorable single query makes you think your AI presence is fine. You deprioritize it. Meanwhile, the majority of actual buyer queries may be returning answers where your brand is absent or mispositioned.
  • False alarm: Conversely, one bad query can trigger panic. You saw a competitor listed first and your brand missing entirely. But that might have been a low-probability outcome that occurs in only 15% of queries.
  • No baseline for improvement: Without knowing your actual aggregate visibility, you cannot measure whether any action you take — updating your website, publishing content, earning press — actually moved the needle.
  • Inconsistent buyer experience: Some percentage of buyers asking AI about your category will see your brand prominently recommended. Others will never see it mentioned. You have no visibility into the ratio.

The real question is not “what did ChatGPT say?” It is “across 100 buyer queries on multiple AI platforms, how often does my brand appear, where is it positioned, and is the information accurate?”

How many queries it takes to see the real picture

SparkToro’s research suggests a minimum of 60 to 100 queries per prompt per platform to establish a statistically meaningful pattern. Our own testing at Metricus confirms this range: at fewer than 50 queries, the variance is too high to distinguish real patterns from noise.
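
A back-of-the-envelope way to see why fewer than 50 queries is mostly noise is the standard margin of error for a proportion, roughly 1.96 × sqrt(p(1 − p) / n). The sketch below uses p = 0.5 as the worst case; that choice is an assumption for illustration, not a figure from either study.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p observed
    over n independent queries (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# p = 0.5 gives the widest interval (worst case), used here for illustration.
for n in (10, 30, 60, 100):
    print(f"{n:3d} queries -> mention rate pinned down only to ±{margin_of_error(0.5, n):.0%}")
```

At 10 queries the uncertainty is around ±31 percentage points; it takes 60 to 100 queries to bring it down to roughly ±13 to ±10 points, which is why that range keeps appearing.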

But the number of queries per prompt is only one dimension. To understand your actual AI visibility, you also need:

  • Multiple prompt variations: Buyers do not all phrase their questions the same way. “Best CRM for startups” and “what CRM should a 10-person company use” are functionally the same question but can produce different AI responses. Our benchmark study of 182 prompts found that prompt phrasing significantly affects which brands surface.
  • Multiple platforms: A brand might score 75% visibility on ChatGPT and 40% on Perplexity for identical queries. Each platform uses different training data, retrieval mechanisms, and synthesis logic. Testing one platform gives you less than half the picture.
  • Real user interface testing: Many measurement tools use developer APIs, which return different results from the actual consumer interface. API responses often lack the web-search grounding that real sessions include, producing a completely different set of recommendations. As we explain in our analysis of how AI visibility scores work, the measurement method matters as much as the measurement itself.

The math adds up quickly. If you need 60+ queries across 20+ prompt variations across 3+ platforms, you are looking at 3,600+ individual AI queries to build a reliable picture of your brand’s AI visibility. This is not something you can do by hand over lunch.

What to do instead of spot-checking

The first step is accepting that your single query told you almost nothing. That is not a comfortable realization, but it is the necessary starting point.

From there, you need systematic measurement. Not one query, not five, not even twenty — but a statistically valid sample across the prompts real buyers use, on the platforms they actually use, through the interfaces they actually interact with.

What systematic measurement reveals that spot-checking cannot (a small aggregation sketch follows this list):

  • Your actual mention rate: Not whether you appeared once, but what percentage of buyer queries surface your brand.
  • Your positioning distribution: Are you typically the first recommendation, one of several alternatives, or an afterthought? The distribution matters more than any single instance.
  • Accuracy patterns: Is AI getting your pricing, features, or positioning wrong? A single query might show correct information while the majority of responses contain errors. If AI is saying wrong things about your brand, you need to know the pattern, not a snapshot.
  • Competitive context: Which competitors consistently appear alongside or instead of you, and in what framing?
  • Platform gaps: Where you are visible versus invisible across different AI systems, and which gaps matter most based on where your buyers are.
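
As a rough sketch of what that aggregation looks like in practice, here is a minimal example. The response records, brand names, and field layout below are invented for illustration and are not Metricus's actual data model.

```python
from collections import Counter

# Invented sample: each record is the ordered list of brands one AI response recommended.
responses = [
    ["Acme CRM", "RivalSoft", "PipeMaster"],
    ["RivalSoft", "PipeMaster"],
    ["Acme CRM", "PipeMaster", "RivalSoft"],
    ["RivalSoft", "Acme CRM"],
]

BRAND = "Acme CRM"

# Mention rate: share of responses in which the brand appears at all.
mention_rate = sum(BRAND in r for r in responses) / len(responses)

# Positioning distribution: where the brand lands when it does appear (1 = first).
positions = Counter(r.index(BRAND) + 1 for r in responses if BRAND in r)

# Competitive context: which other brands co-occur with ours, and how often.
co_mentions = Counter(b for r in responses if BRAND in r for b in r if b != BRAND)

print(f"Mention rate: {mention_rate:.0%}")
print(f"Position counts: {dict(positions)}")
print(f"Co-mentioned competitors: {dict(co_mentions)}")
```

Scaled from four invented responses to thousands of real ones, this same handful of aggregates is what separates a visibility score from a lucky or unlucky spot-check.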

A Metricus AI visibility report does exactly this: 50+ queries per platform, across multiple AI systems, through real user interfaces, with statistical analysis of mention rates, positioning, accuracy, and competitive context. The result is not a single number but a detailed map of how AI actually represents your brand to the people asking about your category.

Your one ChatGPT query was not wrong. It was just one data point out of thousands. The question is what the other thousands look like.

Last updated: April 2026