Trafilatura & Boilerplate Stripping: Why Your Content is Invisible to AI
AI ingestion engines strip nav, footer, and div soup before content hits the embedding model. Learn why HTML5 semantic landmarks are essential to survive the parser.
What is Trafilatura?
Trafilatura is an open-source Python library designed to extract main content from web pages, discarding boilerplate such as navigation menus, footers, sidebars, ads, and cookie banners. It's used by Perplexity AI, several RAG pipeline frameworks, and many AI content ingestion systems as the primary text extraction layer.
Named after the Italian word for "wire drawing" (the process of pulling metal through a die to refine it), Trafilatura does exactly that with HTML: it pulls content through a series of filters that progressively remove everything that isn't considered "main content." The result is a clean text output that feeds directly into the embedding model.
Your content may not survive
The AI content ingestion pipeline
Understanding the full pipeline reveals why structure matters so much. Content passes through at least four filtering stages before it can influence an AI answer:
HTTP fetch
The crawler fetches the raw HTML. JavaScript-rendered content that requires browser execution may be absent if the crawler doesn't use a headless browser. Many RAG pipelines use lightweight fetching without JS execution to reduce cost.
DOM parsing
The raw HTML is parsed into a DOM tree. At this stage, all elements are present, but malformed HTML (missing closing tags, broken nesting) can cause the parser to misclassify content.
Boilerplate stripping (Trafilatura/Readability)
The main content extraction algorithm runs. This is where the majority of your content loss occurs. The algorithm uses heuristics based on element type, text density, link density, and HTML5 semantic roles to classify each element as 'content' or 'boilerplate.'
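Link density (the share of an element's text that sits inside anchor tags) is one of the simplest of these heuristics to illustrate: navigation blocks are nearly all link text, while genuine paragraphs are nearly none. The sketch below is a simplified stand-in built on Python's stdlib parser; the thresholds and tag handling are illustrative, not Trafilatura's actual values.

```python
from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    """Accumulates total text length and the portion inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._in_link = 0  # depth counter, handles nested anchors

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        self.total_chars += len(text)
        if self._in_link:
            self.link_chars += len(text)

def link_density(html_fragment: str) -> float:
    """Fraction of visible text that is link text (0.0 to 1.0)."""
    p = LinkDensityParser()
    p.feed(html_fragment)
    return p.link_chars / p.total_chars if p.total_chars else 0.0

# A nav menu is almost all link text; a paragraph is almost none.
nav = '<nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>'
para = '<p>Long explanatory sentence with one <a href="#">link</a> inside.</p>'
```

An extractor scoring these two fragments would see a link density near 1.0 for the nav and under 0.1 for the paragraph, which is why menus are classified as boilerplate even when they contain keyword-rich text.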
Text cleaning and chunking
The surviving text is cleaned (whitespace normalization, encoding fixes), then split into embedding chunks of 512–1024 tokens. The chunk boundaries determine what content stays together in vector space.
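The chunking step can be sketched as a greedy fixed-size splitter with overlap. This is a minimal illustration only: it approximates tokens with whitespace-separated words, whereas a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Greedy fixed-size chunking with overlap between consecutive chunks.

    Words stand in for tokens here; assumes overlap < max_tokens.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covered the tail
    return chunks
```

The overlap matters: a sentence that straddles a hard chunk boundary would otherwise be split across two vectors, and neither chunk would embed its full meaning.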
Embedding generation
Each chunk is converted to a high-dimensional vector. At this stage, lost content cannot be recovered. Only what survived steps 1–4 gets embedded.
What gets stripped — and the signals that trigger stripping
HTML5 semantic landmarks: your survival mechanism
Trafilatura and similar extractors are specifically designed to recognize HTML5 semantic landmark elements and preserve their content. Using these elements is the single most effective structural change you can make to improve AI content extraction.
<main> (Critical): Wrap your entire page body content in <main>. Trafilatura gives this element the highest preservation weight.
<article> (Critical): Use for blog posts, news articles, and any self-contained content. Signals "this is worth reading" to the parser.
<section> (High): Use to group related paragraphs. Helps the parser maintain content coherence during extraction.
<h1>–<h6> (High): Proper heading hierarchy signals document structure, helping the parser identify section boundaries.
<figure> + <figcaption> (Medium): Keeps image descriptions with their captions, preserving visual content as text for AI.
<aside> (Low): Use for callouts and sidebars. Lower preservation weight than <article>, but higher than <div>.
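A quick way to check whether a page uses these landmarks at all is to count them against generic <div>s. The stdlib sketch below does exactly that; the tag list and the audit output format are this example's own choices, not part of any standard tool.

```python
from html.parser import HTMLParser

LANDMARKS = ("main", "article", "section", "aside", "figure", "figcaption")

class LandmarkAudit(HTMLParser):
    """Counts HTML5 semantic landmark elements versus generic <div>s."""
    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in LANDMARKS}
        self.divs = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        elif tag == "div":
            self.divs += 1

def audit(html_doc: str) -> dict:
    p = LandmarkAudit()
    p.feed(html_doc)
    return {"landmarks": p.counts, "divs": p.divs}

page = """
<body>
  <main>
    <article>
      <h1>Title</h1>
      <section><p>Body text.</p></section>
    </article>
  </main>
  <div class="sidebar"><div>Related links</div></div>
</body>
"""
```

A page that reports zero landmarks and dozens of <div>s is relying entirely on the extractor's density heuristics, with no structural signals to protect its content.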
Run a parser survival audit on your pages
You can manually test what Trafilatura extracts from any page using the command-line interface that ships with the Python package. The output shows you exactly what AI ingestion pipelines see from your content, which is often far less than you expect.
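The real audit is simply `pip install trafilatura` followed by `trafilatura -u <url>` in a shell. If you want a rough, dependency-free preview of the same idea, the stdlib sketch below drops text inside common boilerplate containers; it is far cruder than Trafilatura's actual heuristics and the container list is this example's own assumption.

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "header", "footer", "aside", "script", "style"}

class SurvivingText(HTMLParser):
    """Collects text that sits outside common boilerplate containers."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def surviving_text(html_doc: str) -> str:
    """Approximate the text an extraction pass might keep."""
    p = SurvivingText()
    p.feed(html_doc)
    return " ".join(p.parts)

page = ('<body><nav><a href="/">Home</a></nav>'
        '<main><p>Real content.</p></main>'
        '<footer>Copyright 2024</footer></body>')

# The equivalent real-tool audit (run in a shell, requires the package):
#   trafilatura -u "https://example.com/your-page"
```

Running this on your own rendered HTML, then diffing the output against what you intended to publish, gives a quick sense of how much of the page survives boilerplate stripping.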