Identity: Does the Corpus Know You Exist?

Darshpreet Singh · May 12, 2026 · 16 min read

A note on the examples in this piece: the businesses referenced throughout are drawn from real diagnostics conducted using The Visibility Code Diagnostic methodology. Specific names are anonymized to protect the businesses audited; the patterns, numbers, and visibility outcomes are real.

Open ChatGPT. Type one short question: "What is [your business name]." Hit enter.

Read what comes back. The next thirty seconds will tell you more about your AEO position than any audit, dashboard, or schema validator ever will.

There are four things the AI can do with that prompt, and only four. Each one tells you a different truth about where your business sits in the corpus.

It can describe you accurately and confidently, with details that match reality. Your category, your location, your size, your differentiators. The AI knows who you are. We will call this Resolved Identity.

It can describe you partially. The AI recognizes the name. It places you roughly in your category. But the details are vague, hedged, or only correct in narrow contexts. "I think this is a regional fitness studio in Florida, though I am not certain about specific services." The AI knows of you, but does not know you. We will call this Partial Identity.

It can fail to recognize you at all. The AI either admits it does not know the business, asks for clarification, or - worse - confabulates plausible-sounding details that have no relationship to your actual operation. "I am not sure about a specific company by that name, but a business called [yours] in your industry might offer..." Everything after "might" is invented. We will call this Absent Identity.

Or it can return the wrong business entirely. Your name collides with another company, a generic phrase, or an unrelated concept, and the AI confidently describes that other thing while using your name. We will call this Confused Identity, and as we will see later in this piece, it is often worse than Absent Identity.

Four states. Resolved, Partial, Absent, Confused. You are in one of them right now. Most readers of this framework have never tested which one. The rest of this piece is about why that test matters more than anything else you might be working on, and why the work to move between states is fundamentally different from the AEO work most operators are currently doing.

The Iceberg

We first described Identity as binary. Either the AI knew you, or it did not. Twelve months of running The Visibility Code Diagnostic across hundreds of businesses taught us that framing was almost right and slightly wrong in a way that matters.

Here is the more accurate picture. Identity operates on a threshold plus a gradient. There is a line below which Identity might as well not exist. Below that line, your business is invisible to AI engines no matter what else you do. No amount of schema markup, structured data, or content optimization changes the answer when someone asks "what is [your business name]." The AI does not know. Period.

Above the line, Identity becomes a gradient. It can be weak. It can be moderate. It can be dominant. The higher you sit above the line, the more consistently the AI can describe you, the more contexts in which it can place you, and the more likely you are to surface in answers to category-level questions where your name is not in the prompt.

Think of it as an iceberg. The waterline is the threshold. Below the waterline, you are invisible. Above the waterline, your visibility is determined by how far above you sit. The iceberg has elevation. The threshold is real.

We can see this directly in a focused study we ran across eleven businesses, approximately fifteen hundred prompts, and the five major AI engines - one of the empirical foundations for the framework you are reading. The patterns held consistently across hundreds of additional audits we have conducted since.

Take a regional personal training studio in southwest Florida. Organization schema, LocalBusiness schema, FAQPage schema, perfect meta tags, clean and well-structured site. Across one hundred and thirty-seven prompts to the five major AI engines about senior fitness training, semi-private personal training, and southwest Florida fitness options - their exact category, their exact city - they appeared zero times. Zero hits across all five engines. They are below the waterline.

Take a regional Honda dealership chain in upstate New York. Over thirty years of operating history. Multiple locations. Established reputation. Real customer base. Across the same kind of prompt set, they appear in roughly three to ten percent of relevant queries, depending on the engine. Never as the AI's first recommendation. Never on broad questions like "how do I choose a car dealership." But they show up. They are above the waterline, but only just.

Take a national fitness chain - HIIT-based, heart-rate-driven, thousands of studios. They appear in fourteen to sixty-eight percent of relevant prompts depending on the engine. The AI describes them with consistent details across engines: heart-rate-based training, interval workouts, branded scoring vocabulary, proprietary technology. They are well above the waterline, with substantial elevation.

Same diagnostic test. Three different positions on the iceberg. The work to move between those positions is not the same work most AEO content tells you to do.

What Corpus-Level Identity Actually Is

To understand why on-site optimization does not move Identity, you need to understand where Identity actually lives. It does not live on your website. It lives in something we will call the corpus.

When ChatGPT, Gemini, Claude, Perplexity, and Grok answer questions about your business, they are usually not reading your website in real time. They are drawing on a representation of your business that was built during their training process and, in some cases, updated through real-time retrieval of web sources. The training process exposed each model to enormous quantities of text from across the open web: news articles, blog posts, Wikipedia, Reddit threads, forum discussions, podcast transcripts, academic papers, product reviews, directory listings, social media. From that exposure, the model built compressed representations of entities mentioned in that text - including, possibly, your business.

The corpus is not any single one of those engines. It is the underlying pattern of representations that exists across all of them, built from substantially overlapping training data. When third-party sources mentioned your business consistently, with consistent identifying information, the model built a representation. When they did not mention you, or mentioned you sparsely, or mentioned you with conflicting details, the model either built no representation or built a fragmented one.

This is why Identity behaves the way it does. It is not built from your claims about yourself. It is built from what other people, on other websites, in other contexts, have written about you. Your website is a destination. The corpus is built from sources.

The implication is one of the hardest pills to swallow for many readers, especially those who have spent the last two years optimizing schema markup and producing AI-friendly content. Schema markup affects whether search engines and AI crawlers can parse your site. It does not change the corpus's representation of your business. The corpus already exists. It already has a representation of you, or no representation, or the wrong representation. Your site cannot reach into that representation and edit it.

What can edit the corpus's representation of you is more text about you, on third-party authoritative sources, written consistently over time. That is the mechanism. Everything here, and in the pieces that follow, flows from that mechanism.

The Four Identity States in Detail

Resolved Identity

Multiple authoritative third-party sources have written about your business consistently, with consistent identifying information, over enough time and in enough volume for the corpus to have absorbed a coherent representation. National brands. Major chains. Public companies. Established BigLaw firms. Universities. Brands with Wikipedia presence and substantial press coverage.

The diagnostic signature: ask any of the five major engines about your business and they all describe you accurately, with consistent details. They know your category. They know your size. They know your differentiators. They might disagree on minor specifics. They agree on the core picture.

Take a top-tier global law firm - PE-dominant, headquartered in Chicago, founded in the early twentieth century. Ask any AI engine about it and you get the same picture: large international law firm, dominant in private equity, top revenue tier, specific founding date, specific scale numbers attached. Confident description. No hedging. The corpus has absorbed the firm completely.

What Resolved Identity enables: visibility on category-level questions where your name is not in the prompt. When someone asks "what are the top private equity law firms," the firm appears without being mentioned by the user. The AI reaches for them because the corpus has them placed in that category. This is the goal state. Identity at this level converts into citation rates that smaller, less-resolved brands cannot achieve regardless of how much on-site work they do.

Partial Identity

Some corpus exposure, but inconsistent or thin. The AI knows the name and roughly the category but hedges on details, varies in confidence across engines, or surfaces you only in narrow contexts that match your specific domain.

The diagnostic signature: AI engines describe you accurately when asked directly by name, but you do not appear in category-level answers unless the prompt is very specific. Ask about "car dealerships in Albany" and the regional Honda chain we mentioned earlier might appear. Ask broader questions about "how to choose a car dealership in upstate New York" and they vanish.

Most regional businesses with real operating history live here. Established. Real revenue. Real customers. Real reputation in their local market. But the corpus exposure is uneven - perhaps a few Wikipedia mentions, some local press, some directory listings, some Reddit discussion. Not nothing. Not enough.

What Partial Identity enables: name-recognition queries work. Hyper-specific category-and-location queries sometimes work. Broad category questions almost never work. You are visible to people who already know about you and are searching for more information. You are invisible to the much larger group asking general category questions.

This is the position many businesses find hardest to recognize, because the diagnostic on direct-name queries looks fine. The AI describes the business. The owner concludes Identity is solved. Then they look at category-level prompts and see they appear in zero of them. The gap between direct-name visibility and category-level visibility is the diagnostic for Partial Identity. If both work, you are Resolved. If only direct-name works, you are Partial.

Absent Identity

The corpus has no meaningful representation of your business. Either you are too new, too small, or insufficiently mentioned by third parties for the corpus to have absorbed you.

The diagnostic signature: ask any AI engine about your business by name, and the response is one of three things. Vague generalizations that could apply to anyone in your category. An admission that the AI does not know the company. Or - and this is the dangerous one - confabulation, where the AI invents plausible-sounding details to fill the gap.

Confabulation is dangerous because it looks like recognition to readers who do not look closely. The AI says: "Your Business is a fitness training company that focuses on personalized programs and community-driven training experiences." The reader thinks: the AI knows us. Look closer. Those words could describe ten thousand fitness businesses. They are not specific to you. They are pattern completion based on the category your name suggests. The AI has filled in plausible category-defaults because it has no real information.

The southwest Florida fitness studio we discussed earlier lives here. So do most of the regional service businesses, new SaaS ventures, and specialized B2B firms we audit. A regional irrigation company in Ontario, despite proper FAQPage schema and a real local operation, registered two hits across two hundred and eighty-five prompts - both on Perplexity, both essentially saying "this business exists in the area but lacks specific reviews." A B2B research firm in the imaging vertical, despite decades of operating history in its narrow domain, registered zero across one hundred and forty-seven prompts. A new AI tool startup, despite extensive technical content and clear category positioning, registered zero across one hundred and forty-four prompts. Below the waterline, regardless of what their websites do.
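The counts in this section reduce to a simple hit rate: appearances divided by prompts run. Here is a minimal Python sketch of that arithmetic, using only the counts quoted in this piece. The dictionary labels and the `hit_rate` helper are illustrative names we introduce here, not part of any diagnostic tooling.

```python
# Hit rate = appearances / prompts, using the counts quoted in this piece.
# The keys are descriptive labels, not real business names.
audits = {
    "southwest Florida fitness studio": (0, 137),
    "Ontario irrigation company": (2, 285),
    "B2B imaging research firm": (0, 147),
    "new AI tool startup": (0, 144),
}


def hit_rate(hits: int, prompts: int) -> float:
    """Return the share of prompts in which the business appeared."""
    if prompts <= 0:
        raise ValueError("prompts must be positive")
    return hits / prompts


rates = {name: hit_rate(h, p) for name, (h, p) in audits.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2%}")
```

Two hits in two hundred and eighty-five prompts works out to well under one percent, which is why we treat that result as below the waterline rather than as weak visibility.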

What Absent Identity enables: nothing. Below-threshold Identity means you are invisible to AI engines. The other pillars cannot compound on top of nothing. Authority work and Specificity work, which we cover in the pieces that follow, are wasted effort if Absent Identity is the underlying state. Solve Identity first or in parallel. Never defer it.

Confused Identity

The corpus has the wrong belief about you, usually because your name collides with a generic phrase or another business. The AI confidently describes a different entity while using your name.

The diagnostic signature: ask about your business and the AI describes another company, uses your name as a generic descriptor (treating an adjective phrase as a brand name, or treating a brand name as a generic phrase), or oscillates between treating it as a brand name and a category description. Different engines may produce different wrong answers. The signature is not absence - it is wrong presence.

Take a real example. An early-stage automotive SaaS startup whose name collides with an extremely common adjective phrase appearing across millions of unrelated contexts in training data. When we tested four hundred and forty-one prompts about cars, automotive products, and related categories, the brand appeared to surface five times. Closer reading told a different story. One hit was Gemini using the brand name as an adjective in a sentence about car upgrades for beginners - nothing to do with the brand. Two hits were Perplexity and Grok describing a different company with a similar name in a different country and a different category. Two hits were Claude explicitly stating it did not recognize the brand: "This is not a specific company I can provide exact information for."

Five apparent hits. Zero true visibility. The corpus has no representation of the startup, but it has strong representations of the adjective phrase its name collides with and the unrelated other-brand. The startup's signal is being filtered through that existing noise.
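The gap between apparent hits and true hits only shows up when each hit is read and labeled by hand. A minimal Python sketch of that tally follows, mirroring the five hits in the example above. The label names are hypothetical, and the labeling itself is a human judgment the code does not perform.

```python
from collections import Counter

# Each apparent hit is labeled by hand after reading the full response.
# Label names are hypothetical; the counts mirror the example in this piece.
apparent_hits = [
    ("Gemini", "name_used_as_adjective"),
    ("Perplexity", "different_company"),
    ("Grok", "different_company"),
    ("Claude", "explicit_non_recognition"),
    ("Claude", "explicit_non_recognition"),
]

# Only a response that describes the actual business counts as a true hit.
TRUE_HIT_LABEL = "correct_entity"

by_kind = Counter(kind for _, kind in apparent_hits)
true_hits = by_kind[TRUE_HIT_LABEL]

print(f"apparent hits: {len(apparent_hits)}, true hits: {true_hits}")
print(dict(by_kind))
```

A raw string match on the brand name would have reported five. The labeled tally reports zero.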

This is why Confused Identity is often worse than Absent Identity. Absent businesses are invisible. Confused businesses are competing against existing wrong-beliefs in the corpus. Even good signal you produce gets attributed to the wrong entity, or interpreted through the wrong frame, or flagged as ambiguous and discarded. Your work has to first overcome the existing confusion, then build the correct representation. Two jobs instead of one.

If your business has a name that collides with generic phrases or established other-brands, your Identity work is harder, and you should know that going in. Sometimes the right move is rebranding. Sometimes it is adding consistent disambiguating modifiers everywhere - your brand name plus a specific category descriptor used uniformly across all communications - so that the corpus eventually associates the modifier with the correct entity. Sometimes it is accepting that this pillar will require longer time horizons than for clean-name brands.
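If you go the modifier route, uniformity is the hard part. One way to check your own drafts before they go out is to count brand mentions that appear without the modifier. The Python sketch below assumes an invented brand ("Fresh Start") and an invented modifier ("Auto Software"); neither refers to any business we audited, and `bare_mentions` is an illustrative helper.

```python
import re


def bare_mentions(text: str, brand: str, modifier: str) -> int:
    """Count mentions of `brand` not immediately followed by `modifier`."""
    pattern = re.compile(
        # Negative lookahead: match the brand only when the modifier is absent.
        rf"\b{re.escape(brand)}\b(?!\s+{re.escape(modifier)}\b)",
        flags=re.IGNORECASE,
    )
    return len(pattern.findall(text))


# Invented example copy: one mention with the modifier, one without.
draft = "Fresh Start Auto Software launched today. Fresh Start is hiring."
print(bare_mentions(draft, "Fresh Start", "Auto Software"))
```

The check only covers text you control. Whether third-party sources pick up the modifier is something the diagnostic has to measure over time.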

Why Identity Is the Upstream Pillar

We have asserted, but not yet demonstrated, that Identity sits upstream of Authority and Specificity. The diagnostic data makes the case directly.

In every business in the focused study with Absent Identity, the visibility outcome was zero or indistinguishable from it. Not low. Not occasional. The southwest Florida fitness studio: zero hits across one hundred and thirty-seven prompts despite optimized schema and category-relevant content. The B2B imaging research firm: zero hits across one hundred and forty-seven prompts despite decades of operating history. The new AI tool startup: zero hits across one hundred and forty-four prompts despite technical content and clear positioning. The Ontario irrigation company: two hits across two hundred and eighty-five prompts, both essentially passive acknowledgments that the business exists rather than meaningful recommendations.

These are businesses that have done what most AEO content tells you to do. Schema. Structured data. Optimized content. The result is identical to having done nothing. They are below the waterline, and below the waterline, on-site optimization does not produce visibility because visibility is not built from the site.

Above the threshold, the picture changes. Authority and Specificity start producing differential visibility. The PE-dominant law firm's Authority - the corpus has absorbed quantitative ranking claims about them as the leading firm in their category - converts into roughly twenty percent visibility on category-level prompts about top corporate law firms. The national fitness chain's Specificity - the corpus has absorbed their branded vocabulary - converts into engine-dependent visibility ranging from fourteen to sixty-eight percent on group fitness prompts. These pillars do real work, and we will go deep on them in the pieces that follow.

But you cannot get differential visibility from a state of total absence. You cannot rank in a list you do not appear in. You cannot have the AI reach for your branded vocabulary if the AI does not know you exist.

Identity is the gate. Authority and Specificity are what happens after the gate is open.

This is why most AEO advice underdelivers in practice. The advice is not wrong, but it is sequenced wrong. Schema, structured content, vocabulary discipline, ranking placements - these all matter. They matter once you are above the threshold. Below it, they are scaffolding around an empty lot.

What Does and Does Not Fix Identity

We can now state plainly what moves Identity and what does not. The list is shorter than most readers expect, and the items on it are different from what most AEO checklists prioritize.

What does not fix Identity

Schema markup of any kind, including Organization, LocalBusiness, FAQPage, HowTo, Service, and Product schema.
Meta tags, title optimization, and on-page SEO.
FAQ pages, knowledge bases, and other on-site content.
More content on your site, written better, in more depth.
Better keyword targeting.
Faster page load times and Core Web Vitals improvements.
Sitemap submission, robots.txt tuning, llms.txt files.

These are SEO and Distribution-layer tactics. They affect how search engines and AI crawlers index your site. They make you more findable when an engine is looking for you specifically. They do not change the corpus's representation of your business, because the corpus is not built from your site. It is built from what others have written about you.

We are not saying these tactics do not matter. They matter. They matter for Distribution-layer outcomes, which we cover in a later piece. They are the floor. Identity work is the ceiling, and you cannot reach the ceiling by making the floor more polished.

What fixes Identity

Wikipedia presence, where your business meets notability requirements.
Industry-authoritative directory listings, especially the canonical directories for your category.
Trade press coverage, particularly from publications dedicated to your specific industry vertical.
Podcast appearances and long-form interviews where you are the named subject.
Reddit and forum presence in communities relevant to your category.
Customer review aggregators with substantial volume and consistent identifying information.
Consistent third-party citations from authoritative sites across the web.

The full operational treatment of how to do each of these lives in the next piece. This piece establishes the principle. The principle is that Identity is built off-site, through accumulated mention density on third-party authoritative sources, over time. It is slower than on-site optimization. It produces no visible results in the first six months. It compounds in years two and three. It is the work most AEO operators are not doing because most AEO advice does not point them at it.

The Diagnostic Exercise

Before you do any Identity build work, you need to know your starting state. The diagnostic takes about fifteen minutes and produces a baseline you can revisit quarterly to measure progress. The progress will be slow. The baseline matters because slow progress invisible to you is indistinguishable from no progress, and that is how most Identity work gets abandoned.

Three steps.

First, ask each of the five major engines - ChatGPT, Gemini, Claude, Perplexity, Grok - the same direct-name question. "What is [your business name]." One question, five engines, five responses. Capture each response. Screenshot it. Save it dated.

Second, compare the responses. Are they consistent across engines? Are they accurate to your actual business? Are the details specific to you, or generic to your category? Look especially for confabulation - plausible-sounding details that do not match reality. Confabulation is the signature of Absent Identity dressed up as recognition.

Third, place yourself in one of the four states. Resolved if all five engines describe you accurately and consistently with specific, true details. Partial if direct-name questions work but the descriptions are thin, hedged, or vary substantially across engines. Absent if the engines do not recognize you or produce generic confabulation. Confused if the engines describe a different entity using your name.
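The placement rules in step three can be written down as a small decision function. The Python sketch below assumes you have already read each response and assigned it one hand-chosen label; the label names are hypothetical. The precedence for mixed results (any wrong-entity answer counts as Confused) is our assumption for this sketch, since real results rarely fall cleanly into one bucket.

```python
from enum import Enum


class State(Enum):
    RESOLVED = "Resolved"
    PARTIAL = "Partial"
    ABSENT = "Absent"
    CONFUSED = "Confused"


# Hand-assigned label per engine response (label names are hypothetical):
#   "accurate"      specific, true details
#   "hedged"        recognizes the name, thin or uncertain details
#   "unknown"       admits it does not know the business
#   "confabulated"  generic category filler presented as fact
#   "wrong_entity"  describes something else under your name
VALID = {"accurate", "hedged", "unknown", "confabulated", "wrong_entity"}


def place_state(labels: dict[str, str]) -> State:
    """Map engine -> label onto one of the four Identity states."""
    values = list(labels.values())
    invalid = set(values) - VALID
    if not values or invalid:
        raise ValueError(f"need at least one valid label; invalid: {sorted(invalid)}")
    # Assumed precedence: any wrong-entity answer is treated as Confused.
    if "wrong_entity" in values:
        return State.CONFUSED
    if all(v == "accurate" for v in values):
        return State.RESOLVED
    if any(v in ("accurate", "hedged") for v in values):
        return State.PARTIAL
    return State.ABSENT


example = {
    "ChatGPT": "accurate",
    "Gemini": "hedged",
    "Claude": "hedged",
    "Perplexity": "accurate",
    "Grok": "unknown",
}
print(place_state(example).value)
```

The function does not check consistency across engines. Fold that judgment into the labels: if two engines give specific but conflicting details, label at least one of them "hedged" rather than "accurate".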

Then run a second test. Pick three to five category-level prompts that someone in your target market might actually type. "Best [your category] in [your city]." "What should I look for in a [your service]." "How do I choose a [your category]." Do not include your business name in any of these prompts. Run them on all five engines. Count how often you appear. If you are in Resolved Identity, you should appear in some of these answers without being named in the prompts. If you do not, your direct-name diagnostic might have been generous - you may actually be in Partial Identity, with recognition only on direct queries.
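Counting appearances in the second test is mechanical once the responses are saved as text. A minimal Python sketch follows, assuming responses are stored per engine and per prompt. The business name "Harbor Row Fitness" and the sample responses are invented for illustration, and the case-insensitive substring match is a simplification.

```python
def appearance_count(
    responses: dict[str, dict[str, str]], business_name: str
) -> dict[str, int]:
    """Per engine, count category prompts whose response names the business."""
    needle = business_name.casefold()
    return {
        engine: sum(needle in text.casefold() for text in by_prompt.values())
        for engine, by_prompt in responses.items()
    }


# Invented sample data: engine -> {prompt: pasted response text}.
saved = {
    "ChatGPT": {
        "Best fitness studio in [your city]": "Try Harbor Row Fitness or a local gym.",
        "How do I choose a fitness studio": "Look for certified trainers.",
    },
    "Claude": {
        "Best fitness studio in [your city]": "I would compare several studios.",
        "How do I choose a fitness studio": "Consider class size and coaching.",
    },
}
print(appearance_count(saved, "Harbor Row Fitness"))
```

If your name collides with a common phrase or another brand, a substring match will over-count. In that case every match still needs to be read by hand before it counts as an appearance.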

This baseline is the most important measurement in your AEO program. It will tell you, faster than any tooling, what state you are in. It will tell you, when you re-run it in three or six or twelve months, whether the work you are doing is moving the needle. And it will tell you, by the pattern of which engines see you and which do not, which pillars are weakest and need attention first. We will return to this diagnostic when we treat Citation as the integrated readout of the whole framework.
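A baseline is only useful if the next run can be compared against it. The Python sketch below shows one way to store a dated run and compute the per-engine change in appearance rate between two runs. The `Baseline` fields and the numbers in the example are illustrative assumptions, not results from any audit.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class Baseline:
    run_on: str                  # ISO date of the diagnostic run
    state: str                   # Resolved, Partial, Absent, or Confused
    prompts_per_engine: int      # category-level prompts run on each engine
    appearances: dict[str, int]  # engine -> prompts in which you appeared


def save(baseline: Baseline) -> str:
    """Serialize one run so it can be filed next to the dated screenshots."""
    return json.dumps(asdict(baseline), indent=2)


def change(earlier: Baseline, later: Baseline) -> dict[str, float]:
    """Per-engine change in appearance rate between two runs."""
    engines = sorted(set(earlier.appearances) | set(later.appearances))
    return {
        e: later.appearances.get(e, 0) / later.prompts_per_engine
        - earlier.appearances.get(e, 0) / earlier.prompts_per_engine
        for e in engines
    }


# Illustrative numbers only.
first = Baseline(date(2026, 5, 12).isoformat(), "Absent", 5,
                 {"ChatGPT": 0, "Perplexity": 0})
second = Baseline(date(2026, 11, 12).isoformat(), "Absent", 5,
                  {"ChatGPT": 0, "Perplexity": 1})
print(save(first))
print(change(first, second))
```

Comparing rates rather than raw counts keeps the comparison meaningful if the number of prompts changes between runs, though keeping the prompt set fixed is the cleaner choice.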

Operators who want to run this at scale - across many businesses, many prompts, many engines, with structured tracking over time - can do it manually with spreadsheets and screenshots. The methodology is what matters; the tooling is downstream of it. Tools like AppDaddy implement The Visibility Code Diagnostic at scale for operators who do not want to build the tracking themselves. The framework holds either way.

What to Take From This Piece

Three things should stay with you when you put this piece down.

First, Identity is not binary. It has a threshold below which nothing else matters and a gradient above the threshold where elevation determines visibility level. The four states - Resolved, Partial, Absent, Confused - are diagnosable in fifteen minutes. Most readers of this framework have not run the diagnostic. Run it before you read the next piece.

Second, Identity is built off-site. The corpus is not your website. It is built from third-party mentions about you. On-site optimization, no matter how thorough, does not address Identity. If your diagnostic shows Absent or Confused Identity, no amount of schema markup will fix this. You are working on the wrong thing.

Third, Identity is the upstream pillar. Authority and Specificity, which we cover in the pieces that follow, are powerful but require Identity to be at least Partial before they can compound. Below the threshold, work on the other pillars produces no measurable lift. Sequence matters. Identity first, or Identity in parallel. Never Identity deferred.

The next piece is the operational counterpart to this one. We have established what Identity is and why on-site work does not address it. Now we get to what does. Wikipedia. Trade press. Podcasts. Directories. Reddit. The work of building corpus presence over the time horizons it actually takes. It is slower than most readers expect. It is more durable than most readers expect. It is the foundation everything else in the framework rests on.