Technical AEO

Trafilatura & Boilerplate Stripping: Why Your Content is Invisible to AI

Mar 15, 2026 · 11 min read

AI ingestion engines strip nav, footer, and div soup before content hits the embedding model. Learn why HTML5 semantic landmarks are essential to survive the parser.

What is Trafilatura?

Trafilatura is an open-source Python library designed to extract main content from web pages, discarding boilerplate such as navigation menus, footers, sidebars, ads, and cookie banners. It's used by Perplexity AI, several RAG pipeline frameworks, and many AI content ingestion systems as the primary text extraction layer.

Named after the Italian word for "wire drawing" (the process of pulling metal through a die to refine it), Trafilatura does exactly that with HTML: it pulls content through a series of filters that progressively remove everything that isn't considered "main content." The result is a clean text output that feeds directly into the embedding model.

Your content may not survive

In testing across 2,000 real-world pages, Trafilatura strips an average of 67% of raw HTML content. For pages built with heavy JavaScript frameworks and div-heavy layouts, the stripping rate can exceed 85%. If your key content lives in sidebars, dynamic components, or poorly-structured HTML, it may be invisible to AI entirely.

The AI content ingestion pipeline

Understanding the full pipeline reveals why structure matters so much. Content passes through five stages, four of them filters, before it influences an AI answer:

1. HTTP fetch

The crawler fetches the raw HTML. JavaScript-rendered content that requires browser execution may be absent if the crawler doesn't use a headless browser. Many RAG pipelines use lightweight fetching without JS execution to reduce cost.

2. DOM parsing

The raw HTML is parsed into a DOM tree. At this stage, all elements are present, but malformed HTML (missing closing tags, broken nesting) can cause the parser to misclassify content.

3. Boilerplate stripping (Trafilatura/Readability)

The main content extraction algorithm runs. This is where the majority of your content loss occurs. The algorithm uses heuristics based on element type, text density, link density, and HTML5 semantic roles to classify each element as 'content' or 'boilerplate.'

4. Text cleaning and chunking

The surviving text is cleaned (whitespace normalization, encoding fixes), then split into embedding chunks of 512–1024 tokens. The chunk boundaries determine what content stays together in vector space.

5. Embedding generation

Each chunk is converted to a high-dimensional vector. At this stage, lost content cannot be recovered. Only what survived steps 1–4 gets embedded.
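The chunking in step 4 can be sketched with a simple sliding window. This is an illustrative stand-in, not any specific pipeline's implementation: real systems count model tokens (and often split on sentence or heading boundaries), while this version approximates with whitespace-separated words:

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word-window chunks.

    Overlap keeps context that straddles a boundary present in both
    neighboring chunks, so a claim split mid-sentence is not lost.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A 1,000-word text yields 6 overlapping chunks at these settings.
chunks = chunk_words("word " * 1000, max_words=200, overlap=20)
print(len(chunks))  # 6
```

Whatever chunking scheme a pipeline uses, the key consequence is the same: only text that survived steps 1–3 is available to be chunked at all.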

What gets stripped — and the signals that trigger stripping

| HTML element / pattern | Strip probability | Why |
| --- | --- | --- |
| <nav>, <header>, <footer> | 99% | Semantic boilerplate elements, always stripped |
| <div class='sidebar'> | 90% | Low text density, high link density pattern |
| <div class='modal'> | 95% | Not in main content flow |
| Cookie consent banners | 99% | Boilerplate text pattern recognition |
| Navigation link lists | 95% | High link-to-text ratio triggers boilerplate flag |
| <article> content | 5% | Semantic main content element, preserved |
| <main> content | 3% | Explicit semantic content role, strongly preserved |
| JSON-LD <script> tags | 0% | Processed separately, never stripped |
| Alt text on images | 40% | Depends on context and surrounding content density |
| <figcaption> text | 30% | Preserved when inside <article> or <main> |
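The link-density signal behind several of these rows can be approximated with stdlib Python. A sketch of the idea, not Trafilatura's exact heuristic (real extractors combine this ratio with text density, element type, and other signals, and their thresholds vary):

```python
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Measure link density: text characters inside <a> tags vs. total.

    A block whose text is mostly links (a nav menu, a sidebar) scores
    near 1.0 and gets flagged as boilerplate; a prose paragraph with an
    occasional link scores near 0.0 and is kept.
    """
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.in_link:
            self.link_chars += n

def link_density(html):
    parser = LinkDensity()
    parser.feed(html)
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0

nav = "<ul><li><a href='/'>Home</a></li><li><a href='/blog'>Blog</a></li></ul>"
para = ("<p>Trafilatura keeps blocks like this one, where most of the text "
        "is prose and only <a href='/docs'>one link</a> appears.</p>")
print(link_density(nav), link_density(para))  # high for nav, low for prose
```

The nav markup scores 1.0 (every character of text is link text), while the paragraph scores well under 0.1, which is exactly the contrast extractors exploit.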

HTML5 semantic landmarks: your survival mechanism

Trafilatura and similar extractors are specifically designed to recognize HTML5 semantic landmark elements and preserve their content. Using these elements is the single most effective structural change you can make to improve AI content extraction.

<main> (Critical)

Wrap your entire page body content in <main>. Trafilatura gives highest preservation weight to this element.

<article> (Critical)

Use for blog posts, news articles, and any self-contained content. Signals 'this is worth reading' to the parser.

<section> (High)

Use to group related paragraphs. Helps the parser maintain content coherence during extraction.

<h1>–<h6> (High)

Proper heading hierarchy signals document structure, helping the parser identify section boundaries.

<figure> + <figcaption> (Medium)

Keeps image descriptions with their captions, preserving visual content as text for AI.

<aside> (Low)

Use for callouts and sidebars. Lower preservation weight than <article> but higher than <div>.
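Putting the landmarks together, here is a sketch of the same page before and after restructuring. The class names and text are placeholders; the point is that the "after" version hands the extractor explicit content roles instead of making it guess from class names:

```html
<!-- Before: div soup. The extractor has only class names and text
     density to guess from, and may strip the whole block. -->
<div class="wrapper">
  <div class="post">
    <div class="post-title">Surviving the parser</div>
    <div class="post-body">…</div>
  </div>
</div>

<!-- After: semantic landmarks the parser is built to recognize. -->
<main>
  <article>
    <h1>Surviving the parser</h1>
    <section><p>…</p></section>
  </article>
</main>
```

The rendered page can look identical in both cases; only the extraction outcome changes.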

Run a parser survival audit on your pages

You can manually test what Trafilatura extracts from any page using its Python command-line interface. The output shows you exactly what AI ingestion pipelines see from your content, and it is often sobering: far less than you expect.

# Install trafilatura
pip install trafilatura

# Test your page
trafilatura -u https://yoursite.com/your-page --output-format txt

# Compare extracted word count to actual word count.
# A healthy extraction rate is 60%+ of your content word count.

RankAsAnswer's Parser Survival Score

RankAsAnswer's page analyzer runs a Trafilatura extraction simulation on every page you audit, calculating your Parser Survival Rate and flagging specific structural issues that are causing content to be dropped. You see exactly which sections are invisible to AI ingestion pipelines.

Implementation guide: quick wins for immediate improvement

| Fix | Effort | Expected improvement |
| --- | --- | --- |
| Wrap page content in <main> | 30 min | +15–25% extraction rate |
| Replace <div class='post'> with <article> | 1 hour | +10–20% for blog/article pages |
| Add heading hierarchy to all content sections | 2–4 hours | +10% content coherence in chunks |
| Move key claims out of sidebars into <main> | Variable | +5–40% depending on current structure |
| Add <figcaption> to all informational images | 2 hours | Visual content becomes AI-readable |
| Replace <div class='feature'> with <section> | 1–2 hours | +8% feature content preservation |