Bypassing the Boilerplate: The Semantic HTML Rule for AI Crawlers
LLM ingestion pipelines use Readability.js and similar tools to strip div soup from web pages before indexing. If your core content is not wrapped in semantic HTML containers, it may be treated as boilerplate and excluded from the vector database entirely.
What Readability.js does to your page
Mozilla's Readability.js — the algorithm powering Firefox Reader View, Pocket, and a significant portion of AI content ingestion tools — processes a web page by attempting to identify the main article content and remove everything else. It is aggressive, opinionated, and stateless.
Readability.js works by scoring candidate content blocks on link density (high link density suggests navigation, so the block is removed), tag semantics (article, main = keep; div with generic classes = evaluate), text length (short text blocks are likely UI, so they are removed), and parent container context. A content block inside a clearly semantic container like <article> is preserved with high confidence. A content block inside a <div class="content-wrapper"> has to earn its place through those heuristic scores, and is frequently discarded.
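As a rough illustration of the link-density signal described above, the following Python sketch computes the share of a block's text that sits inside links. It is a simplified stand-in, not Readability.js's actual scoring code; the names LinkDensityParser and link_density and the two sample blocks are invented for this example.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts all text characters and the characters that sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # greater than 0 while inside an <a> element
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        self.total_chars += length
        if self.anchor_depth:
            self.link_chars += length


def link_density(html_fragment: str) -> float:
    """Return link text length divided by total text length (0.0 for empty blocks)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# Hypothetical sample blocks: a navigation list and an article paragraph.
nav_block = '<ul><li><a href="/">Home</a></li><li><a href="/blog">Blog</a></li></ul>'
article_block = (
    "<p>Semantic containers help extractors find the main text. "
    '<a href="/more">More</a></p>'
)

print(link_density(nav_block))      # 1.0, every character is link text
print(link_density(article_block))  # a small fraction, mostly plain prose
```

A block whose density is close to 1.0 looks like navigation to this kind of heuristic, while prose with an occasional link stays well below any plausible cutoff.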
Jina.ai's reader — the tool used by many AI pipelines including some Perplexity integrations — uses a similar approach. The result: AI ingestion pipelines often silently fail to index content that is not wrapped in semantic HTML, with no error and no visibility for the content author.
The silent indexing failure
The div soup problem
Modern JavaScript-heavy web applications frequently produce deeply nested <div> structures with utility class names: <div class="flex flex-col max-w-3xl mx-auto px-4">. These class names tell a human developer about layout but tell Readability.js nothing about content meaning. The algorithm cannot distinguish between a div.flex.flex-col wrapping a blog article and a div.flex.flex-col wrapping a sidebar advertisement.
Next.js, React, and Vue applications built with Tailwind CSS or similar utility frameworks are particularly susceptible. The framework encourages div-based layouts with utility class names — a pattern that produces excellent visual results but terrible AI parser readability.
The semantic container hierarchy
The correct semantic HTML hierarchy for AI-parseable content has three levels:
- Page level: <main> wraps the primary content area of the page. Every page should have exactly one <main> element.
- Content level: <article> wraps a self-contained piece of content (a blog post, a product page, a guide).
- Section level: <section> wraps a thematic grouping within the article.
This hierarchy gives Readability.js a clear signal: everything inside <main><article> is primary content. Strip everything outside.
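The "keep what is inside <main><article>, strip the rest" rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not how Readability.js is implemented; MainArticleExtractor, extract_primary_text, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class MainArticleExtractor(HTMLParser):
    """Keeps only text found inside an <article> that is nested in <main>."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0
        self.article_depth = 0  # counts only <article> tags opened inside <main>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag == "article" and self.main_depth:
            self.article_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag == "main" and self.main_depth:
            self.main_depth -= 1
            if self.main_depth == 0:
                # Leaving <main> ends any unclosed <article> as well.
                self.article_depth = 0

    def handle_data(self, data):
        text = data.strip()
        if text and self.article_depth:
            self.chunks.append(text)


def extract_primary_text(html: str) -> list[str]:
    parser = MainArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: only the <article> inside <main> should survive.
page = """
<header><nav><a href="/">Home</a></nav></header>
<main>
  <article>
    <h1>Guide title</h1>
    <p>Primary content.</p>
  </article>
  <aside>Related links</aside>
</main>
<footer>Copyright notice</footer>
"""

print(extract_primary_text(page))  # ['Guide title', 'Primary content.']
```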
Semantic HTML parser survival rates
| Container | Parser treatment | Indexing outcome |
| --- | --- | --- |
| <main><article> | Always preserved | Full indexing |
| <main> alone | Usually preserved | Good indexing |
| <article> alone | Usually preserved | Good indexing |
| <section> | Context-dependent | Partial indexing |
| <div class='content'> | Probabilistic | Unreliable |
| <div class='flex'> | Likely stripped | Minimal indexing |

Tags that survive AI parsers
These HTML elements are consistently preserved by Readability.js, Jina.ai reader, and equivalent parsers: <main>, <article>, <section> (within article), <h1> through <h4>, <p>, <ul>/<ol>/<li>, <table> with semantic markup, <figure>/<figcaption>, <blockquote>, and <script type="application/ld+json"> (JSON-LD Schema).
Tags that get stripped
These elements are consistently stripped by AI parsers: <nav>, <header>, <footer>, <aside>, <form>, <button>, <input>, <script> (non-JSON-LD), <style>, <canvas>, <iframe>, and <div> elements classified as navigation, advertisements, or boilerplate by density heuristics.
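To make the strip list concrete, here is a small Python sketch that discards text inside the elements named above and sets JSON-LD aside separately. It is an assumption-laden toy, not any parser's real implementation: the STRIPPED_TAGS set mirrors this article's list (void elements such as <input> carry no text, so they are omitted), and BoilerplateStripper and strip_boilerplate are invented names.

```python
from html.parser import HTMLParser

# Mirrors the list in this article; an assumption, not a parser's real config.
STRIPPED_TAGS = {
    "nav", "header", "footer", "aside", "form",
    "button", "script", "style", "canvas", "iframe",
}


class BoilerplateStripper(HTMLParser):
    """Drops text inside stripped elements; collects JSON-LD separately."""

    def __init__(self):
        super().__init__()
        self.open_stripped = []  # stack of currently open stripped tags
        self.in_jsonld = False
        self.text = []
        self.jsonld = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
        elif tag in STRIPPED_TAGS:
            self.open_stripped.append(tag)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
        elif tag in self.open_stripped:
            # Pop until the matching tag is closed (tolerates unclosed children).
            while self.open_stripped.pop() != tag:
                pass

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        if self.in_jsonld:
            self.jsonld.append(chunk)
        elif not self.open_stripped:
            self.text.append(chunk)


def strip_boilerplate(html: str):
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text, parser.jsonld


# Hypothetical page mixing content, UI elements, and scripts.
page = (
    '<header><nav><a href="/">Home</a></nav></header>'
    "<main><article><h1>Title</h1><p>Body text.</p>"
    "<form><button>Subscribe</button></form></article></main>"
    "<script>trackPageView()</script>"
    '<script type="application/ld+json">{"@type": "Article"}</script>'
    "<footer>Legal</footer>"
)

text, jsonld = strip_boilerplate(page)
print(text)    # ['Title', 'Body text.']
print(jsonld)  # ['{"@type": "Article"}']
```

Note that the "Subscribe" label disappears even though it sits inside <article>, which is why content-critical text should not live in form controls.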
Implementation checklist
- Every page has exactly one <main> element wrapping the primary content area
- Blog posts and long-form pages wrap content in <article> inside <main>
- Section headings are inside <section> elements, not bare <div> containers
- No content-critical text is inside <nav>, <header>, or <footer>
- JSON-LD Schema blocks use <script type="application/ld+json">, not class-based injection
- Tables use <table>, <thead>, <tbody>, <th>, <td> — not CSS grid layouts pretending to be tables
- Lists use <ul>/<ol>/<li> — not div-based flex layouts
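A few of the checklist items above can be checked automatically. The Python sketch below covers only three of them (one <main>, an <article> inside <main>, and subheadings inside <section>); ChecklistAuditor, audit, and the two sample pages are hypothetical, and treating <h2> through <h4> as "section headings" is an assumption made for this example.

```python
from html.parser import HTMLParser

SUBHEADINGS = {"h2", "h3", "h4"}  # assumed to be the "section headings"


class ChecklistAuditor(HTMLParser):
    """Collects the counts needed for three of the checklist items."""

    def __init__(self):
        super().__init__()
        self.main_count = 0
        self.main_depth = 0
        self.section_depth = 0
        self.articles_in_main = 0
        self.subheadings_outside_section = 0

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_count += 1
            self.main_depth += 1
        elif tag == "article" and self.main_depth:
            self.articles_in_main += 1
        elif tag == "section":
            self.section_depth += 1
        elif tag in SUBHEADINGS and not self.section_depth:
            self.subheadings_outside_section += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag == "section" and self.section_depth:
            self.section_depth -= 1


def audit(html: str) -> list[str]:
    """Return a list of human-readable problems; empty means all checks pass."""
    auditor = ChecklistAuditor()
    auditor.feed(html)
    auditor.close()
    problems = []
    if auditor.main_count != 1:
        problems.append(f"expected exactly one <main>, found {auditor.main_count}")
    if auditor.articles_in_main == 0:
        problems.append("no <article> inside <main>")
    if auditor.subheadings_outside_section:
        problems.append(
            f"{auditor.subheadings_outside_section} subheading(s) outside <section>"
        )
    return problems


# Hypothetical pages: one div-only layout and one semantic layout.
soup_page = '<div class="flex flex-col"><h1>Title</h1><h2>Setup</h2><p>Text</p></div>'
good_page = (
    "<main><article><h1>Title</h1>"
    "<section><h2>Setup</h2><p>Text</p></section>"
    "</article></main>"
)

for problem in audit(soup_page):
    print(problem)
print(audit(good_page))  # []
```

The remaining items (tables, lists, JSON-LD placement) could be added as further counters in the same class.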
How to audit your semantic HTML
The fastest audit method: open your page in Firefox and toggle Reader View (Ctrl+Alt+R on Windows and Linux, Cmd+Option+R on macOS). Whatever Firefox Reader View shows is approximately what Readability.js sends to the vector database. If Reader View strips your content, Perplexity and ChatGPT likely will too.
Alternatively, run your URL through Jina.ai's reader (r.jina.ai/[your-url]) to see what the plain-text extraction produces. Any content missing from that output is unlikely to be indexed by AI retrieval pipelines that rely on this kind of extraction.
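The Jina.ai reader check can also be scripted. The sketch below assumes the reader is reachable by prefixing the page URL with https://r.jina.ai/ and that it returns plain text over a normal GET request; reader_url, fetch_reader_text, missing_phrases, and the User-Agent string are all hypothetical names chosen for this example. The network call is wrapped in a function so the phrase comparison can be tried offline.

```python
from urllib.request import Request, urlopen


def reader_url(page_url: str) -> str:
    """Build the reader URL (assumed form: prefix the full page URL)."""
    return "https://r.jina.ai/" + page_url


def fetch_reader_text(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the extracted text for a page. Requires network access."""
    request = Request(reader_url(page_url), headers={"User-Agent": "semantic-html-audit"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def missing_phrases(extracted_text: str, expected_phrases: list[str]) -> list[str]:
    """Return the expected phrases that do not appear in the extracted text.

    Whitespace is collapsed and case is ignored so that line wrapping in the
    extraction does not cause false alarms.
    """
    normalized = " ".join(extracted_text.split()).lower()
    return [
        phrase
        for phrase in expected_phrases
        if " ".join(phrase.split()).lower() not in normalized
    ]


# Offline demonstration with a made-up extraction result.
sample_extraction = "Guide title\n\nPrimary   content about semantic HTML."
print(missing_phrases(sample_extraction, ["primary content", "Pricing table"]))
# ['Pricing table']
```

In practice you would pass fetch_reader_text("https://your-site.example/post") as the first argument and list a few sentences that must be indexed; anything returned is content the extraction dropped.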
The React/Next.js fix
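Wrap the page's primary content in <main><article>...your content...</article></main>, typically in the layout or page component that renders the post body. Existing utility classes can stay on these elements in place of the outer <div> wrappers. This single change can restore indexing for pages that were silently failing parser extraction.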