Imagine you ran a Google ranking check and the result randomly varied by 30 positions every time you refreshed. You would not trust the tool. Yet that is exactly what most "AI search rank trackers" expose without telling you. They send a single query to ChatGPT, see your business mentioned, write down "rank 2", and ship it as a dashboard number. Run the same query an hour later and you might see "rank 7" or "not mentioned at all". Both readings are equally true. Neither is a measurement.
LLMSight is built on a different premise. AI engines are probabilistic, so visibility is probabilistic. Our job is not to pretend otherwise. It is to make probabilistic measurement honest, useful, and actionable. This article walks through the seven techniques we use to do that.
The Problem No One Talks About
Generative AI models are stochastic. That is a polite way of saying they roll dice. When ChatGPT generates an answer, it is sampling from a probability distribution over possible next words. Even at low randomness settings, two identical queries on the same model can produce different outputs. The variation gets bigger when web search is involved (because the underlying search results shift), when the user has a personal history (because memory features bias responses), and when the time of day or serving infrastructure changes (because requests can be routed to different model versions).
This means three uncomfortable things for any AI visibility tracker:
- A single query result is a sample of size 1. It tells you almost nothing about typical behaviour; the short simulation after this list shows why.
- Your "rank" can swing wildly between identical scans even if nothing about your business changed.
- Reporting that single result as a number is misleading by construction.
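To make that first bullet concrete, here is a toy simulation rather than LLMSight code: assume a business that a given AI engine mentions 60 percent of the time for some question. A single-shot tracker re-running the same check flips between "mentioned" and "absent" at random, while a presence rate over many samples settles near the true figure.

```python
import random

random.seed(7)

TRUE_MENTION_RATE = 0.6  # assumed for illustration: the engine mentions this business 60% of the time

def run_query() -> bool:
    """Simulate one AI response: True if the business is mentioned."""
    return random.random() < TRUE_MENTION_RATE

# A single-sample "rank check" repeated five times: the answer flips at random.
print([run_query() for _ in range(5)])    # e.g. [True, False, True, True, False]

# A presence rate over 25 samples lands close to the underlying 60 percent.
samples = [run_query() for _ in range(25)]
print(sum(samples) / len(samples))        # e.g. 0.64
```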
Most tools paper over this. We made it the centre of our design.
We Sample Every Question Multiple Times
For every question we track on every AI source, LLMSight does not just ask once. We ask multiple times in a single scan and look at how often your business is mentioned across the samples. The result is a presence rate, not a yes-or-no answer.
If your business shows up in 4 out of 5 samples, your presence rate is 80 percent for that question on that source. If it shows up in 0 out of 5, you can be reasonably sure you are absent. If it shows up in 2 out of 5, the picture is fundamentally uncertain and we treat it that way.
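The arithmetic itself is simple. As a minimal sketch (the class and field names below are our own illustration, not LLMSight's internal schema), each question-and-source pair reduces to a count of mentions over a count of samples:

```python
from dataclasses import dataclass

@dataclass
class CellResult:
    """One (question, AI source) cell in a scan."""
    question: str
    source: str
    mentions: int    # samples in which the business appeared
    samples: int     # total samples run for this cell

    @property
    def presence_rate(self) -> float:
        return self.mentions / self.samples if self.samples else 0.0

cell = CellResult("best CRM for small law firms", "chatgpt-api", mentions=4, samples=5)
print(f"{cell.presence_rate:.0%}")   # 80%
```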
How many samples we run depends on your plan, because every additional sample costs us a real call to a real AI source. Our Starter plan runs 3 samples per API source. Growth and Scale run 5 samples per API source by default. Browser-based sources (the kind that scrape ChatGPT, Gemini, or Perplexity through their actual web interfaces) cost dramatically more per call, so we run fewer of those, typically 1 to 3 samples. The trade-off is intentional: more samples mean tighter confidence, but also higher cost. We expose the trade-off honestly rather than hiding it behind a single artificial number.
Run a free scan with multi-sample probing
Every paid scan on LLMSight uses multi-sample probing. See real presence rates and confidence intervals on your own business in 60 seconds.
Check Your Visibility Free
Confidence Intervals, Not Rankings
Once we have a presence rate, the next question is: how confident are we in this number? A presence rate of 4 out of 5 (80 percent) sounds great until you realise the statistical confidence interval around it spans roughly 38 percent to 96 percent. Five samples is simply not enough data to make that 80 percent feel solid.
LLMSight reports a Wilson 95 percent confidence interval next to every presence rate. Wilson is a well-established statistical method for binomial proportions that behaves correctly at the extremes (when something always happens or never happens) where simpler approximations break down. In plain language: we tell you not just "your business appeared 80 percent of the time" but also "we are 95 percent confident the true rate is somewhere between 38 and 96 percent."
That second number is the honest one. If a competing tool tells you "your rank is 2", they are giving you a single point on a distribution. We give you the distribution. When the band is tight (say 75 to 85 percent), you can make decisions on it. When the band is wide, you know to wait for more data, not to chase noise.
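For readers who want to check the arithmetic, here is a self-contained sketch of the Wilson score interval. It is a standard formula rather than anything proprietary, and it reproduces the 4-out-of-5 example above:

```python
import math

def wilson_interval(mentions: int, samples: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95% confidence)."""
    if samples == 0:
        return (0.0, 1.0)
    p = mentions / samples
    denom = 1 + z**2 / samples
    centre = (p + z**2 / (2 * samples)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / samples + z**2 / (4 * samples**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

low, high = wilson_interval(4, 5)
print(f"4/5 mentioned -> 80% presence, 95% CI {low:.0%} to {high:.0%}")
# 4/5 mentioned -> 80% presence, 95% CI 38% to 96%
```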
Adaptive Sampling: Spend More Where It Matters
Sampling more is expensive. Sampling enough is essential. Our solution is adaptive sampling: we start with a base number of samples per question, then automatically run more samples on the questions where the result is genuinely uncertain.
If a question shows your business in every single sample (presence rate 100 percent), more samples will not change much. The signal is clear. We stop. If a question never mentions your business in any sample (presence rate 0 percent), same logic. We stop. But if a question shows your business in roughly half the samples, the result is on the edge and another five or ten samples could meaningfully sharpen the confidence interval. So we keep going, up to a per-plan ceiling.
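In outline, that loop looks something like the sketch below. It reuses the `wilson_interval` function from the previous example, and the base, ceiling, and width threshold shown here are placeholders, not our production settings:

```python
def adaptive_sample(run_query, base: int = 5, ceiling: int = 15,
                    target_width: float = 0.30) -> tuple[int, int]:
    """Keep sampling a question until the result is clear or the ceiling is hit.
    `run_query` performs one probe and returns True if the business is mentioned."""
    mentions, samples = 0, 0
    while samples < ceiling:
        mentions += int(run_query())
        samples += 1
        if samples < base:
            continue                 # always run the base number of samples first
        if mentions in (0, samples):
            break                    # 0% or 100% so far: more samples change little
        low, high = wilson_interval(mentions, samples)
        if high - low <= target_width:
            break                    # the interval is tight enough to act on
    return mentions, samples
```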
This concentrates measurement effort exactly where it pays off. The questions you most need clarity on (the contested ones, where the AI is genuinely indecisive about your category) get the deepest analysis. The questions where you are clearly in or clearly out get measured efficiently. You see the same trustworthy results across every cell, but the underlying spend matches the underlying uncertainty.
Decomposed Judging for Reliable Classification
Once we have a response from an AI source, we still have to read it carefully. Did the answer mention your business by name? Did it cite your domain as a source? Did it describe you indirectly (the "Toyota dealer near downtown")? Was the mention favourable, neutral, or critical?
Asking a single AI judge to do all of that at once turns out to be unreliable. Judges are biased toward longer answers, they latch onto whichever signal they read first, and they are inconsistent on edge cases like neutral mentions versus mildly positive ones. So we split the work.
Our judge runs in two stages. The first stage focuses purely on structure: it identifies every brand and product in the response, classifies each one as cited, named, or implicitly referenced, and extracts the position of each mention. This is a tightly scoped task and judges are extremely reliable at it. The second stage runs only when something was actually mentioned and it asks a single, narrow question: was the portrayal positive, neutral, or negative? Decomposed like this, each judgment is more reliable than asking everything in one shot.
A side benefit: when an AI source gives a long detailed response that does not mention your business at all, we skip the sentiment stage entirely. That keeps the cost of measuring "not present" cells low, which matters a lot when those cells make up most of a typical scan for a less-established business.
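The shape of that pipeline looks roughly like the sketch below. The two LLM judge calls are reduced to naive stand-ins so the example runs on its own, and the names and fields are illustrative rather than our actual implementation:

```python
from dataclasses import dataclass
from typing import Literal, Optional

MentionType = Literal["cited", "named", "implicit"]
Sentiment = Literal["positive", "neutral", "negative"]

@dataclass
class Mention:
    brand: str
    mention_type: MentionType
    position: int                        # character offset of the mention in the response
    sentiment: Optional[Sentiment] = None

def extract_mentions(response: str, brands: list[str]) -> list[Mention]:
    """Stand-in for the stage-1 structural judge (an LLM call in practice).
    Here: a naive substring search that tags every hit as 'named'."""
    hits = []
    for brand in brands:
        pos = response.lower().find(brand.lower())
        if pos >= 0:
            hits.append(Mention(brand, "named", pos))
    return hits

def judge_sentiment(response: str, mention: Mention) -> Sentiment:
    """Stand-in for the stage-2 sentiment judge (an LLM call in practice)."""
    return "neutral"

def judge_response(response: str, business: str, brands: list[str]) -> list[Mention]:
    """Decomposed judging: structure first, sentiment only when the business appears."""
    mentions = extract_mentions(response, brands)
    for m in mentions:
        if m.brand.lower() == business.lower():   # sentiment stage skipped if absent
            m.sentiment = judge_sentiment(response, m)
    return mentions
```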
Quality Flags: When We Refuse to Publish a Number
Sometimes things go wrong. An AI source rate-limits us. A web scrape hits a captcha. A response comes back empty because the model refused to answer. A network blip eats one of our samples. Any of these can leave a cell with fewer successful samples than we planned for.
Most trackers silently average over whatever data they got. We refuse to. Every cell is tagged with a quality flag:
- Ok: at least 80 percent of attempts succeeded and there is no suspicious failure pattern. The number is publishable.
- Degraded: 50 to 80 percent succeeded, but failures look random. We publish the number with a wider confidence interval and a small warning indicator.
- Suspect: failures are concentrated in a way that suggests the surviving samples are biased (for example, captchas firing only on prompts that produce certain answers). We do not publish a number; we flag the cell for review.
- Insufficient: fewer than 50 percent of attempts succeeded. We do not publish; we surface the failure rate and what failed.
This is how we avoid the failure mode where a vendor's number looks confident but is silently wrong because half the data was missing. Quality flags are visible in your dashboard on every metric. You always know what you are looking at.
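A minimal sketch of how those thresholds map to flags, with the "suspicious failure pattern" check reduced to a single boolean placeholder, might look like this:

```python
def quality_flag(succeeded: int, attempted: int, failures_look_biased: bool) -> str:
    """Map a cell's sample outcomes to a quality flag, following the rules above.
    In practice the bias check is a real heuristic, not a precomputed boolean."""
    if attempted == 0:
        return "insufficient"
    success_rate = succeeded / attempted
    if failures_look_biased and success_rate < 1.0:
        return "suspect"          # surviving samples may be skewed: publish nothing
    if success_rate >= 0.8:
        return "ok"               # publishable as-is
    if success_rate >= 0.5:
        return "degraded"         # publish with a wider interval and a warning
    return "insufficient"         # too little data: surface what failed instead
```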
Every Number Has Evidence Behind It
For every sample we run, LLMSight stores the verbatim response from the AI source, along with metadata about which model version produced it, at what time, and in what region. You can click any number on your dashboard and see the actual sentences that produced it. No black box.
This evidence layer matters for two reasons. First, when you dispute a result (and you will), we can show you exactly why we classified things the way we did. The judge's decision is auditable down to the character span where your business was mentioned. Second, when we improve our judging methodology in the future, we can replay the new judge on historical data and compare. Every extraction is tagged with a version fingerprint so the dashboard can tell the difference between "your visibility actually changed" and "we got better at measuring."
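In data-model terms, each stored sample amounts to something like the record below. The field names are our own illustration rather than LLMSight's actual schema, but they show why replaying a newer judge over historical data is possible:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class EvidenceRecord:
    """One stored sample: the verbatim response plus enough metadata to audit it
    and to replay an improved judge against it later."""
    question: str
    source: str                              # e.g. "chatgpt-api", "perplexity-browser"
    model_version: str                       # which model version produced the response
    region: str
    captured_at: datetime
    response_text: str                       # verbatim response, kept in full
    mention_span: Optional[tuple[int, int]]  # character span of the mention, if any
    judge_version: str                       # fingerprint of the judging methodology
```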
That is the bargain we offer: probabilistic measurement, but with full evidence. You do not have to trust us. You can verify.
Why This Honesty Matters
You could imagine a competing tool that just reports "rank 2" everywhere. The dashboard would feel cleaner. The quarterly report would feel more decisive. But the moment a customer A/B-tests that tool against ground truth, or runs the same scan twice and gets different rankings, the credibility evaporates.
Probabilistic measurement is not a weakness in a product like ours. It is the only honest answer to the underlying technology. AI engines are stochastic. Visibility is a distribution. Pretending otherwise is the choice every other tool in this category makes, and it is exactly why their numbers do not survive contact with reality.
LLMSight gives you confidence intervals, presence rates, decomposed mention classifications, quality flags, and the verbatim evidence behind every result. The reason we do all of this is the same reason a doctor measures your blood pressure instead of guessing it: when the underlying signal is noisy, the right response is more measurement, more rigor, and more transparency. Not less.
If you have been told AI visibility is too unstable to track usefully, you have been told a half-truth. It is too unstable to track naively. With the right techniques, it is one of the most actionable signals you can build a strategy on.
See your AI visibility with confidence intervals
Run a free scan and get a real presence rate, Wilson CI, and the verbatim AI responses behind every number. No more single-shot rank theatre.
Scan Your Business Free