Trafilatura & Boilerplate Stripping: Why Your Content is Invisible to AI
AI ingestion engines strip nav, footer, and div soup before content hits the embedding model. Learn why HTML5 semantic landmarks are essential to survive the parser.
What is Trafilatura?
Trafilatura is an open-source Python library designed to extract main content from web pages, discarding boilerplate such as navigation menus, footers, sidebars, ads, and cookie banners. It's used by Perplexity AI, several RAG pipeline frameworks, and many AI content ingestion systems as the primary text extraction layer.
Named after the Italian word for "wire drawing" (the process of pulling metal through a die to refine it), Trafilatura does exactly that with HTML: it pulls content through a series of filters that progressively remove everything that isn't considered "main content." The result is a clean text output that feeds directly into the embedding model.
Your content may not survive
The AI content ingestion pipeline
Understanding the full pipeline reveals why structure matters so much. Content passes through at least four filtering stages before it can influence an AI answer:
HTTP fetch
The crawler fetches the raw HTML. JavaScript-rendered content that requires browser execution may be absent if the crawler doesn't use a headless browser. Many RAG pipelines use lightweight fetching without JS execution to reduce cost.
DOM parsing
The raw HTML is parsed into a DOM tree. At this stage, all elements are present, but malformed HTML (missing closing tags, broken nesting) can cause the parser to misclassify content.
Boilerplate stripping (Trafilatura/Readability)
The main content extraction algorithm runs. This is where the majority of your content loss occurs. The algorithm uses heuristics based on element type, text density, link density, and HTML5 semantic roles to classify each element as 'content' or 'boilerplate.'
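Link density (the share of an element's text that sits inside anchor tags) is one of the simplest of these heuristics to illustrate: navigation blocks are nearly all link text, while genuine paragraphs are nearly none. The sketch below is a simplified stand-in built on Python's stdlib parser; the thresholds and tag handling are illustrative, not Trafilatura's actual values.

```python
from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    """Accumulates total text length and the portion inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._in_link = 0  # depth counter, handles nested anchors

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        self.total_chars += len(text)
        if self._in_link:
            self.link_chars += len(text)

def link_density(html_fragment: str) -> float:
    """Fraction of visible text that is link text (0.0 to 1.0)."""
    p = LinkDensityParser()
    p.feed(html_fragment)
    return p.link_chars / p.total_chars if p.total_chars else 0.0

# A nav menu is almost all link text; a paragraph is almost none.
nav = '<nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>'
para = '<p>Long explanatory sentence with one <a href="#">link</a> inside.</p>'
```

An extractor scoring these two fragments would see a link density near 1.0 for the nav and under 0.1 for the paragraph, which is why menus are classified as boilerplate even when they contain keyword-rich text.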
Text cleaning and chunking
The surviving text is cleaned (whitespace normalization, encoding fixes), then split into embedding chunks of 512–1024 tokens. The chunk boundaries determine what content stays together in vector space.
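The chunking step can be sketched as a greedy fixed-size splitter with overlap. This is a minimal illustration only: it approximates tokens with whitespace-separated words, whereas a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Greedy fixed-size chunking with overlap between consecutive chunks.

    Words stand in for tokens here; assumes overlap < max_tokens.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covered the tail
    return chunks
```

The overlap matters: a sentence that straddles a hard chunk boundary would otherwise be split across two vectors, and neither chunk would embed its full meaning.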
Embedding generation
Each chunk is converted to a high-dimensional vector. At this stage, lost content cannot be recovered. Only what survived steps 1–4 gets embedded.
What gets stripped — and the signals that trigger stripping
HTML5 semantic landmarks: your survival mechanism
Trafilatura and similar extractors are specifically designed to recognize HTML5 semantic landmark elements and preserve their content. Using these elements is the single most effective structural change you can make to improve AI content extraction.
<main> (Critical): Wrap your entire page body content in <main>. Trafilatura gives this element the highest preservation weight.
<article> (Critical): Use for blog posts, news articles, and any self-contained content. Signals "this is worth reading" to the parser.
<section> (High): Use to group related paragraphs. Helps the parser maintain content coherence during extraction.
<h1>–<h6> (High): Proper heading hierarchy signals document structure, helping the parser identify section boundaries.
<figure> + <figcaption> (Medium): Keeps image descriptions with their captions, preserving visual content as text for AI.
<aside> (Low): Use for callouts and sidebars. Lower preservation weight than <article>, but higher than <div>.
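A quick way to check whether a page uses these landmarks at all is to count them against generic <div>s. The stdlib sketch below does exactly that; the tag list and the audit output format are this example's own choices, not part of any standard tool.

```python
from html.parser import HTMLParser

LANDMARKS = ("main", "article", "section", "aside", "figure", "figcaption")

class LandmarkAudit(HTMLParser):
    """Counts HTML5 semantic landmark elements versus generic <div>s."""
    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in LANDMARKS}
        self.divs = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        elif tag == "div":
            self.divs += 1

def audit(html_doc: str) -> dict:
    p = LandmarkAudit()
    p.feed(html_doc)
    return {"landmarks": p.counts, "divs": p.divs}

page = """
<body>
  <main>
    <article>
      <h1>Title</h1>
      <section><p>Body text.</p></section>
    </article>
  </main>
  <div class="sidebar"><div>Related links</div></div>
</body>
"""
```

A page that reports zero landmarks and dozens of <div>s is relying entirely on the extractor's density heuristics, with no structural signals to protect its content.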
Run a parser survival audit on your pages
You can manually test what Trafilatura extracts from any page using the command-line interface that ships with the Python package. The output shows you exactly what AI ingestion pipelines see from your content, which is often far less than you expect.
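The real audit is simply `pip install trafilatura` followed by `trafilatura -u <url>` in a shell. If you want a rough, dependency-free preview of the same idea, the stdlib sketch below drops text inside common boilerplate containers; it is far cruder than Trafilatura's actual heuristics and the container list is this example's own assumption.

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "header", "footer", "aside", "script", "style"}

class SurvivingText(HTMLParser):
    """Collects text that sits outside common boilerplate containers."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def surviving_text(html_doc: str) -> str:
    """Approximate the text an extraction pass might keep."""
    p = SurvivingText()
    p.feed(html_doc)
    return " ".join(p.parts)

page = ('<body><nav><a href="/">Home</a></nav>'
        '<main><p>Real content.</p></main>'
        '<footer>Copyright 2024</footer></body>')

# The equivalent real-tool audit (run in a shell, requires the package):
#   trafilatura -u "https://example.com/your-page"
```

Running this on your own rendered HTML, then diffing the output against what you intended to publish, gives a quick sense of how much of the page survives boilerplate stripping.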