Technical AEO

Bypassing the Boilerplate: The Semantic HTML Rule for AI Crawlers

Aug 18, 2026 · 7 min read

Many LLM ingestion pipelines use Readability.js and similar tools to strip "div soup" (layers of non-semantic <div> wrappers) from web pages before indexing. If your core content is not wrapped in semantic HTML containers, it may be treated as boilerplate and excluded from the vector database entirely.

What Readability.js does to your page

Mozilla's Readability.js, the library behind Firefox Reader View and a significant portion of AI content ingestion tools, processes a web page by trying to identify the main article content and removing everything else. It is aggressive, opinionated, and stateless: each page is judged on its own markup alone.

Readability.js scores candidate content blocks on several signals: link density (a high share of link text suggests navigation, so the block is removed), tag and class-name semantics (article and main count in a block's favor; a div with generic classes has to be evaluated on its contents), text length (short text blocks are likely UI and are removed), and the context of the parent container. A content block inside a clearly semantic container like <article> is preserved with high confidence. A content block inside a <div class="content-wrapper"> is evaluated heuristically and can be discarded.
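The link-density signal is the easiest of these to illustrate. The sketch below is not Readability.js itself. It is a simplified Python approximation that divides the number of text characters inside <a> elements by the number of text characters in the whole fragment. The names LinkDensityParser and link_density are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # greater than 0 while inside an <a> element
        self.link_chars = 0   # characters of text found inside links
        self.total_chars = 0  # characters of all text in the fragment

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        self.total_chars += length
        if self.link_depth > 0:
            self.link_chars += length


def link_density(html: str) -> float:
    """Return the share of text characters that sit inside links (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# Hypothetical fragments: a navigation bar and a paragraph of body text.
nav_html = '<div><a href="/">Home</a> <a href="/blog">Blog</a> <a href="/about">About</a></div>'
article_html = (
    "<div><p>Semantic containers help parsers find the main text. "
    'See the <a href="/guide">guide</a>.</p></div>'
)

print(f"nav: {link_density(nav_html):.2f}")
print(f"article: {link_density(article_html):.2f}")
```

In this sketch the navigation fragment is all link text, while the article fragment is mostly plain text. That gap is the kind of difference a density heuristic uses to separate navigation from content.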

Jina.ai's reader, a tool used by many AI pipelines (reportedly including some Perplexity integrations), takes a similar approach. The result: AI ingestion pipelines can silently fail to index content that is not wrapped in semantic HTML, with no error and no visibility for the content author.

The silent indexing failure

When Readability.js cannot confidently identify your main content block, it can fall back to the largest block of text it finds on the page. If your navigation or footer contains more text than a sparse content section, the parser may index your navigation as your "content." You would never know, and your citation rate could drop to zero even though the page was crawled.

The div soup problem

Modern JavaScript-heavy web applications frequently produce deeply nested <div> structures with utility class names: <div class="flex flex-col max-w-3xl mx-auto px-4">. These class names tell a human developer about layout but tell Readability.js nothing about content meaning. The algorithm cannot distinguish between a div.flex.flex-col wrapping a blog article and a div.flex.flex-col wrapping a sidebar advertisement.
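To make the point concrete, here is a minimal sketch of class-name weighting of the kind content extractors apply. The two patterns are simplified, illustrative subsets and the 25-point adjustment is an assumption for the example. None of this is the exact behavior of any particular library version, and the function name class_weight is hypothetical.

```python
import re

# Illustrative subsets only: assumptions for this sketch, not the exact
# patterns shipped by any particular extraction library.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text|blog|story", re.I)
NEGATIVE = re.compile(
    r"banner|comment|footer|masthead|promo|related|share|sidebar|sponsor|widget", re.I
)


def class_weight(class_attr: str) -> int:
    """Return a score adjustment derived only from the class attribute."""
    weight = 0
    if NEGATIVE.search(class_attr):
        weight -= 25
    if POSITIVE.search(class_attr):
        weight += 25
    return weight


for classes in ("flex flex-col max-w-3xl mx-auto px-4", "post-body", "sidebar-promo"):
    print(f"{classes!r}: {class_weight(classes)}")
```

Under these assumptions the utility-class string scores 0: it neither helps nor hurts, so the element has to be judged on other signals alone. A descriptive class such as post-body or sidebar-promo moves the score in one direction or the other.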

Next.js, React, and Vue applications built with Tailwind CSS or similar utility frameworks are particularly susceptible. Utility-first CSS encourages div-based layouts with utility class names, a pattern that produces excellent visual results but poor readability for AI parsers.

The semantic container hierarchy

The correct semantic HTML hierarchy for AI-parseable content has three levels. At the page level: <main> wraps the primary content area of the page. Every page should have exactly one <main> element. At the content level: <article> wraps a self-contained piece of content (a blog post, a product page, a guide). At the section level: <section> wraps a thematic grouping within the article.

This hierarchy gives Readability.js a clear signal: everything inside <main><article> is primary content. Strip everything outside.
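The hierarchy can be checked mechanically. The following is a minimal sketch, using only Python's standard library, that reports whether a page has exactly one <main> and at least one <article> nested inside it. The names HierarchyChecker and check_hierarchy, and both sample pages, are hypothetical.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HierarchyChecker(HTMLParser):
    """Tracks open elements to see where <main> and <article> appear."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open element names
        self.main_count = 0
        self.articles_in_main = 0
        self.articles_outside_main = 0

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_count += 1
        elif tag == "article":
            if "main" in self.stack:
                self.articles_in_main += 1
            else:
                self.articles_outside_main += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def check_hierarchy(html: str) -> dict:
    checker = HierarchyChecker()
    checker.feed(html)
    checker.close()
    return {
        "one_main": checker.main_count == 1,
        "article_in_main": checker.articles_in_main > 0,
        "stray_articles": checker.articles_outside_main,
    }


good_page = """
<body>
  <nav><a href="/">Home</a></nav>
  <main>
    <article>
      <h1>Title</h1>
      <section><h2>First topic</h2><p>Body text.</p></section>
    </article>
  </main>
  <footer>Footer links</footer>
</body>
"""
div_soup = '<body><div class="flex flex-col"><div class="px-4"><p>Body text.</p></div></div></body>'

print(check_hierarchy(good_page))
print(check_hierarchy(div_soup))
```

The check only confirms that the containers exist and are nested as described. It does not predict how any specific parser will score the page.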

Semantic HTML parser survival rates

| Container | Parser outcome | Citation impact |
| --- | --- | --- |
| <main><article> | Always preserved | Full indexing |
| <main> alone | Usually preserved | Good indexing |
| <article> alone | Usually preserved | Good indexing |
| <section> | Context-dependent | Partial indexing |
| <div class='content'> | Probabilistic | Unreliable |
| <div class='flex'> | Likely stripped | Minimal indexing |

Tags that survive AI parsers

These HTML elements are typically preserved by Readability.js, Jina.ai reader, and equivalent parsers: <main>, <article>, <section> (within article), <h1> through <h4>, <p>, <ul>/<ol>/<li>, <table> with semantic markup, <figure>/<figcaption>, and <blockquote>. <script type="application/ld+json"> (JSON-LD Schema) is a special case: it does not appear in the extracted text, but parsers such as Readability.js can read it for page metadata such as title and author.

Tags that get stripped

These elements are consistently stripped by AI parsers: <nav>, <header>, <footer>, <aside>, <form>, <button>, <input>, <script> (non-JSON-LD), <style>, <canvas>, <iframe>, and <div> elements classified as navigation, advertisements, or boilerplate by density heuristics.
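A rough way to preview the effect is to drop everything inside these elements and look at what text is left. The sketch below does only that. It is a deliberately naive stand-in, not the algorithm used by Readability.js or Jina.ai's reader, and the names BoilerplateStripper and extract_text are hypothetical.

```python
from html.parser import HTMLParser

# Container elements from the list above. <input> is omitted because it is a
# void element with no text content of its own.
STRIPPED_TAGS = {"nav", "header", "footer", "aside", "form", "button",
                 "script", "style", "canvas", "iframe"}


class BoilerplateStripper(HTMLParser):
    """Collects text that is not inside any of the stripped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a stripped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    stripper = BoilerplateStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join(stripper.chunks)


# Hypothetical page with one content-critical line placed in the footer.
page = (
    '<header><a href="/">Acme</a></header>'
    '<nav><a href="/pricing">Pricing</a></nav>'
    "<main><article><h1>Guide</h1><p>Key fact lives here.</p></article></main>"
    "<aside>Related posts</aside>"
    "<footer>Contact: sales@example.com</footer>"
)

print(extract_text(page))
```

In this example the contact line in the footer does not survive, which is why content-critical text should not live in <nav>, <header>, or <footer>.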

Implementation checklist

Every page has exactly one <main> element wrapping the primary content area
Blog posts and long-form pages wrap content in <article> inside <main>
Section headings are inside <section> elements, not bare <div> containers
No content-critical text is inside <nav>, <header>, or <footer>
JSON-LD Schema blocks use <script type="application/ld+json">, not class-based injection
Tables use <table>, <thead>, <tbody>, <th>, <td> — not CSS grid layouts pretending to be tables
Lists use <ul>/<ol>/<li> — not div-based flex layouts

How to audit your semantic HTML

The fastest audit method: open your page in Firefox and toggle Reader View (Ctrl+Alt+R on Windows and Linux). What Reader View shows approximates what a Readability.js-based pipeline extracts before the text is embedded in a vector database. If Reader View strips your content, Perplexity and ChatGPT may well miss it too.

Alternatively, run your URL through Jina.ai's reader (r.jina.ai/[your-url]) to see what its plain-text extraction produces. Content missing from that output is unlikely to be indexed by pipelines that rely on this reader.
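This audit can be scripted. The sketch below builds the reader URL by prefixing the page URL with https://r.jina.ai/ and then checks which key phrases are absent from the extracted text. The function names, the sample text, and the phrases are hypothetical. The fetch function makes a live network request, and the service may apply rate limits or require extra headers, so it is defined here but not called.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the reader URL for a page by prefixing it."""
    return READER_PREFIX + page_url


def fetch_reader_text(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the reader's extraction of page_url (performs a network request)."""
    with urllib.request.urlopen(reader_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def missing_phrases(extracted_text: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the extracted text.

    Comparison ignores case and collapses runs of whitespace.
    """
    normalized = " ".join(extracted_text.split()).lower()
    return [p for p in phrases if " ".join(p.split()).lower() not in normalized]


# Offline example with a hypothetical extraction result.
sample_extraction = "Title: Guide\n\nKey fact lives   here.\nPricing starts at 10 USD."
print(reader_url("https://example.com/guide"))
print(missing_phrases(sample_extraction, ["Key fact lives here", "Refund policy"]))
```

For a real audit, pass the result of fetch_reader_text to missing_phrases along with a few sentences you expect to be cited. Any phrase it returns is worth tracing back to the container it sits in.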

The React/Next.js fix

In the Next.js App Router, wrap your page content in a semantic structure: <main><article>...your content...</article></main>. This single change can restore indexing for pages that were silently failing parser extraction, once those pages are crawled again.

