Technical AEO

The 512-Token Rule: How to Write for Vector Databases

Jul 14, 2026 · 9 min read

AI parsers strip your DOM and chunk your text into 300–800 token blocks. Paragraphs that depend on previous paragraphs for meaning fail in RAG retrieval. The Independent Paragraph rule fixes this.

Infographic: 512-Token Rule (Chunk Sizes and the Bleed Problem)

Vector DB Chunk Sizes by Embedding Model

  • OpenAI Ada-002: 512 tokens
  • Perplexity (internal): 400 tokens
  • Gemini embedding: 768 tokens
  • Cohere embed: 512 tokens

How a Document Gets Chunked

Chunk 1: Intro para · Para 2 · Para 3 ★ Answer
Chunk 2: Para 4 · Para 5 · Para 6 · Conclusion
Chunk boundary ≈ 512 tokens

★ Para 3 lands in chunk 1; Para 4 lands in chunk 2. Para 4 cannot reference Para 3, because they end up in different chunks.

Chunk Bleed (Bad)

"This technique (introduced in the section above) works best when combined with the method described earlier. As we showed, the results depend heavily on prior context."

References 'section above' and 'prior context' — meaningless in isolation

Independent Paragraph (Good)

"FAQPage Schema boosts AI citation rates by 2.4×. Each Question/Answer pair is embedded as a standalone unit, so the AI retrieves exactly the relevant answer without surrounding context."

Fully self-contained — the RAG chunk reads perfectly in isolation

Source: RankAsAnswer vector database chunking research · 2025

What is text chunking?

Text chunking is the process by which RAG (Retrieval-Augmented Generation) pipelines split a document into smaller, manageable units before embedding them into a vector database. Each chunk gets independently embedded as a high-dimensional vector and stored. When a user asks a question, the system retrieves the most semantically similar chunks — not the full document.

The dominant chunk size across major AI systems is 300–800 tokens. OpenAI's text-embedding-3-large model accepts a maximum input of 8,191 tokens, but practical RAG systems use much smaller chunks to maximize retrieval precision. Perplexity's internal chunking uses approximately 400 tokens per chunk, and Gemini's grounding retrieval operates in a similar range.
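Retrieval over chunks can be sketched with a toy index. A bag-of-words vector stands in here for a real dense embedding, and cosine similarity ranks chunks against the query; the chunk texts and the "Acme CRM" product are illustrative, not from any real system.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Toy 'embedding': a bag-of-words vector (real systems use dense vectors)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each chunk is embedded and stored independently -- no shared context.
chunks = [
    "Acme CRM pricing starts at 29 dollars per month for the Starter tier.",
    "Acme CRM integrates with Salesforce, HubSpot, and Zapier.",
    "Vector databases store each chunk as an independent embedding.",
]
index = [(chunk, bow(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """Return the k chunks most similar to the query -- not the full document."""
    qv = bow(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("how much does acme crm cost per month"))
```

The pricing chunk wins because it shares the most query terms; the other two chunks never get sent to the model at all.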

Why 512 tokens specifically?

512 tokens (approximately 380 words) is the empirically validated sweet spot for RAG retrieval precision. Longer chunks reduce semantic specificity — a 2,000-token chunk about "CRM software" matches fewer queries precisely than three separate 512-token chunks about pricing, integrations, and use cases respectively.

How AI parsers strip your DOM

Before your content reaches a vector database, an ingestion parser must extract clean text from your HTML. Most major AI crawlers use variants of Mozilla's Readability.js — the same algorithm that powers Firefox Reader View — or tools like Jina.ai's reader. These tools aggressively strip the DOM.

What gets stripped: navigation, sidebars, footers, advertisements, cookie banners, JavaScript-rendered content, and any element not inside <main>, <article>, or similar semantic containers. What survives: headings, paragraphs, lists, tables, and Schema markup in JSON-LD blocks.
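The strip-versus-survive behavior can be sketched with Python's standard-library html.parser: keep text only inside <main> or <article>, drop everything inside nav, aside, footer, script, or style. The tag lists and the MainTextExtractor class are illustrative, not the actual Readability.js heuristics.

```python
from html.parser import HTMLParser

STRIP_TAGS = {"nav", "aside", "footer", "header", "script", "style"}
KEEP_CONTAINERS = {"main", "article"}

class MainTextExtractor(HTMLParser):
    """Collect text that sits inside a kept container and outside stripped tags."""
    def __init__(self):
        super().__init__()
        self.in_main = 0   # nesting depth inside <main>/<article>
        self.skip = 0      # nesting depth inside stripped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_CONTAINERS:
            self.in_main += 1
        if tag in STRIP_TAGS:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in KEEP_CONTAINERS and self.in_main:
            self.in_main -= 1
        if tag in STRIP_TAGS and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.in_main and not self.skip:
            text = data.strip()
            if text:
                self.parts.append(text)

page = """<html><body>
<nav>Home | About</nav>
<main><article>
  <h1>512-Token Rule</h1>
  <p>Chunks are about 512 tokens.</p>
</article></main>
<footer>Copyright</footer>
</body></html>"""

parser = MainTextExtractor()
parser.feed(page)
print(parser.parts)  # nav and footer text are gone; heading and paragraph survive
```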

The stripped plain text gets split at natural boundaries — paragraph breaks, heading boundaries, list ends — into chunks of approximately 512 tokens each. Two paragraphs that appear visually adjacent on your page may end up in completely different chunks in the vector database, separated by thousands of other documents.
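Splitting at natural boundaries can be sketched as a greedy packer: whole paragraphs are accumulated until the next one would push the chunk past the token budget. The 0.75 words-per-token heuristic and the `chunk_paragraphs` helper name are illustrative, not any particular system's chunker.

```python
def estimate_tokens(text):
    """Rough heuristic: one token is about 0.75 English words."""
    return max(1, round(len(text.split()) / 0.75))

def chunk_paragraphs(paragraphs, max_tokens=512):
    """Greedily pack whole paragraphs into chunks of at most max_tokens,
    splitting only at paragraph boundaries, never mid-paragraph."""
    chunks, current, used = [], [], 0
    for para in paragraphs:
        t = estimate_tokens(para)
        if current and used + t > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ["Acme CRM has three pricing tiers."] * 3
print(len(chunk_paragraphs(doc)))  # 1 -- tiny paragraphs pack into one chunk
```

Note that two adjacent paragraphs land in different chunks whenever the budget runs out between them, which is exactly how visually adjacent text ends up separated in the vector database.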

The chunk bleed problem

Chunk bleed occurs when a paragraph requires the previous paragraph to be understood. It is the single most common GEO mistake in long-form content, and it is invisible to the writer because it only manifests at the retrieval stage.

Example of chunk bleed: Paragraph 1 introduces a concept — "There are three pricing tiers." Paragraph 2 elaborates — "The first tier includes five users and 10GB storage." When the chunker splits these into separate chunks, Paragraph 2 is semantically incomplete. It refers to "the first tier" without defining what system or product the tier belongs to. Any query about that product's pricing may retrieve Paragraph 1 but not Paragraph 2, losing the detailed facts entirely.

Chunk bleed is especially severe with: pronouns ("it," "they," "this"), continuation phrases ("additionally," "furthermore," "as mentioned above"), references to numbered lists established in a previous paragraph, and "the following table shows..." when the table is in a separate chunk.

Chunk bleed patterns to eliminate

  • Pronoun reference: "It supports up to 100 users" → "Acme CRM supports up to 100 users"
  • Forward reference: "The following three points..." → "CRM selection involves three factors: cost, integrations, and..."
  • Continuation phrase: "Additionally, the platform..." → "Acme CRM additionally offers..."
  • Table orphan: "The table below shows pricing" → give the table self-describing headers and a caption
  • List dependency: "The third option above..." → restate the option name in full
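The patterns above can be turned into a rough lint pass over your paragraphs. The regexes below cover only the examples in this table, and `find_chunk_bleed` is a hypothetical helper name, not a published tool.

```python
import re

# (pattern, label) pairs for the bleed patterns listed above
BLEED_PATTERNS = [
    (r"^\s*(it|they|this|these|those)\b", "pronoun reference"),
    (r"\bthe following\b", "forward reference"),
    (r"^\s*(additionally|furthermore|moreover)\b", "continuation phrase"),
    (r"\bthe table below\b", "table orphan"),
    (r"\b(above|as mentioned|as shown)\b", "backward reference"),
]

def find_chunk_bleed(paragraph):
    """Return the labels of every bleed pattern the paragraph triggers."""
    return [label for pattern, label in BLEED_PATTERNS
            if re.search(pattern, paragraph, flags=re.IGNORECASE)]

print(find_chunk_bleed("It supports up to 100 users."))        # ['pronoun reference']
print(find_chunk_bleed("Acme CRM supports up to 100 users."))  # []
```

Anything the function flags is a paragraph that will read as incomplete once it lands alone in a chunk.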

The Independent Paragraph rule

The Independent Paragraph rule states: every paragraph must be fully comprehensible in isolation, without any surrounding context. If you were to extract a single paragraph from your article and show it to someone who had never seen the rest of the content, they should be able to understand exactly what claim is being made, about what subject, with what supporting evidence.

Applying this rule changes how you start sentences. Instead of "It offers three integrations," you write "Acme CRM offers three integrations: Salesforce, HubSpot, and Zapier." Instead of "This makes it the fastest option," you write "Acme CRM's sub-100ms API response time makes it the fastest CRM integration option in the mid-market segment."

Every sentence should name its subject. Every claim should contain its quantitative context. Every list item should be interpretable standalone. This feels redundant when writing for humans who read linearly — but it is mandatory when writing for vector databases that retrieve non-linearly.

The redundancy payoff

Writing that feels "repetitive" to human readers produces the highest cosine similarity scores in vector retrieval. Repeating the subject noun, restating the product name, and quantifying every claim all increase the information signal density of each chunk — which is exactly what vector similarity rewards.
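The redundancy payoff shows up even in a toy bag-of-words cosine similarity (a crude stand-in for dense embedding similarity): the sentence that repeats the product name and states the numbers scores higher against a pricing query than the pronoun-laden version. Sentences and query are invented for illustration.

```python
from collections import Counter
from math import sqrt

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = bow("acme crm starter tier pricing per month")
vague = bow("It is the most popular plan and comes with the tools we described earlier")
specific = bow("Acme CRM Starter tier is the most popular plan at 29 dollars per month")

print(cosine(query, specific) > cosine(query, vague))  # True
```

The "vague" sentence shares zero content words with the query, so it is effectively unretrievable no matter how relevant it was in context.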

Token math: how long is 512 tokens?

One token is approximately 0.75 words in English. 512 tokens is approximately 384 words — roughly 3–5 solid paragraphs. This means a standard 1,500-word blog post creates 3–4 independent chunks. A 4,000-word post creates 8–10 chunks.
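The token math above reduces to two conversions, using the rough 0.75 words-per-token average for English:

```python
WORDS_PER_TOKEN = 0.75  # rough average for English text

def tokens_to_words(tokens):
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words):
    return round(words / WORDS_PER_TOKEN)

print(tokens_to_words(512))                    # 384 words per 512-token chunk
print(round(words_to_tokens(1500) / 512, 1))   # 3.9 chunks in a 1,500-word post
```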

The implication: write each H2 section as if it were a standalone document. Each section will likely become 1–2 chunks. If the section requires its introduction to understand its conclusion, that conclusion chunk will fail in retrieval.

Before and after rewrite examples

Before (chunk bleed): "As we mentioned in the previous section, the three pricing tiers each have distinct features. The first one targets startups, and it comes with the basic set of tools we described earlier. This is the most popular option among our customers."

After (chunk-independent): "Acme CRM's Starter tier is the most popular plan, chosen by 61% of customers. It targets startups and solo consultants, offering contact management, pipeline tracking, and email integration for up to 5 users at $29 per month."

The rewrite eliminates all pronoun references, removes dependency on earlier context, and embeds every key fact — product name, tier name, popularity data, feature list, user limit, and price — directly into the chunk.

GEO chunking checklist

  • Every paragraph names its subject in the first sentence
  • No pronouns whose antecedents are in a different paragraph
  • No phrases like 'as mentioned,' 'as shown above,' or 'the following'
  • Every table has a self-describing caption and column headers
  • Every list item is interpretable without the surrounding paragraph
  • Each H2 section answers one specific question completely
  • Quantitative claims are embedded in the sentence, not introduced in a prior sentence