
Teracrawl: The Web Scraper That Topped 14 Competitors by Rethinking Browser Reuse

Hook

A TypeScript web scraper just scored 84.2% on the scrape-evals benchmark—beating Firecrawl, Crawl4AI, and 12 other established tools. Its secret? Knowing when to skip the full browser render.

Context

Large language models have created an insatiable appetite for web content, but feeding them is harder than it looks. Modern websites use JavaScript frameworks, implement sophisticated bot detection, and bury actual content under navigation chrome, ads, and tracking pixels. Traditional HTTP scrapers fail on React apps. Headless browsers work but burn through resources. And even when you successfully scrape a page, you’re left with a bloated HTML soup that costs a fortune in tokens.

This is where Teracrawl enters. It's positioned as a Firecrawl alternative—a hosted API that converts websites into clean Markdown optimized for LLM consumption. But unlike tools that treat every page the same way, Teracrawl implements a two-phase crawling strategy that intelligently chooses between fast static scraping and full browser rendering. Written in TypeScript and built on Browser.cash's managed Chrome infrastructure, it's designed specifically for AI agents and RAG pipelines that need reliable, token-efficient web content at scale.

Technical Insight

Teracrawl’s architecture centers on what the maintainers call “smart crawling”—a two-phase approach that optimizes for both speed and success rate. When you request a URL, the system first attempts a fast mode scrape that reuses browser contexts and aggressively blocks images, fonts, and stylesheets. This works beautifully for server-rendered pages and static sites, delivering results in a fraction of the time a full browser render would take. If the content detection algorithm determines the page is incomplete (likely due to client-side JavaScript rendering), Teracrawl automatically falls back to dynamic mode, which launches a fresh browser session and waits for the JavaScript hydration to complete.
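
Teracrawl's actual detection logic isn't documented publicly, but the fast-then-fallback flow can be sketched as follows. The `looksIncomplete` heuristic, its thresholds, and the function names are illustrative assumptions, not Teracrawl internals:

```typescript
// Hypothetical sketch of the two-phase "smart crawling" flow; the
// completeness heuristic and all names here are assumptions.

interface ScrapeResult {
  html: string;
  textLength: number; // characters of visible text extracted in this pass
}

// Heuristic: a page that yielded almost no text, or that contains a bare
// SPA mount point (an empty #root/#app/#__next div), probably needs
// client-side JavaScript rendering to produce its content.
function looksIncomplete(result: ScrapeResult): boolean {
  const spaShell = /<div[^>]+id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i;
  return result.textLength < 200 || spaShell.test(result.html);
}

// Phase one: fast mode (reused context, resources blocked). Phase two:
// a full dynamic render, triggered only when phase one looks incomplete.
async function smartScrape(
  url: string,
  fastScrape: (u: string) => Promise<ScrapeResult>,
  dynamicScrape: (u: string) => Promise<ScrapeResult>
): Promise<ScrapeResult> {
  const fast = await fastScrape(url);
  return looksIncomplete(fast) ? dynamicScrape(url) : fast;
}
```

The key design point is that the expensive path is opt-in per page, decided by the content itself rather than by configuration.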

The API design is refreshingly simple. Here’s a basic scraping request:

```typescript
import Teracrawl from 'teracrawl';

const crawler = new Teracrawl({
  apiKey: process.env.BROWSERCASH_API_KEY
});

const result = await crawler.scrape({
  url: 'https://example.com/article',
  formats: ['markdown'],
  waitFor: 'networkidle' // optional, for SPA content
});

console.log(result.markdown);
// Clean, LLM-ready content without navigation, ads, or clutter
```

What makes this particularly powerful for LLM workflows is the built-in content cleaning. Teracrawl doesn’t just convert HTML to Markdown—it intelligently identifies the main content area, strips out navigation menus, removes base64-encoded images (which would otherwise bloat your token count), and preserves semantic structure. The output is optimized for chunking and embedding, not for visual fidelity.
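
One piece of that cleaning pipeline, the removal of base64-encoded images, can be sketched as a simple post-processing pass over the Markdown output. This is an illustrative assumption about how such a step might work, not Teracrawl's actual implementation:

```typescript
// Hypothetical post-processing pass: drop inline base64 images, which can
// each add thousands of tokens, while keeping images that reference a URL.
function stripBase64Images(markdown: string): string {
  // Matches Markdown image syntax whose target is a data:image/... URI.
  return markdown.replace(/!\[([^\]]*)\]\(data:image\/[^)]+\)/g, '');
}
```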

The tool truly shines when combined with search. Teracrawl integrates with browser-serp to let you query Google and scrape the top results in a single API call:

```typescript
const searchResults = await crawler.search({
  query: 'machine learning best practices',
  numResults: 5,
  scrapeResults: true
});

// Returns array of objects with URL, title, snippet, AND full markdown content
searchResults.forEach(result => {
  console.log(`${result.title}\n${result.markdown.slice(0, 500)}...`);
});
```

Under the hood, this kicks off parallel scraping operations for each search result, leveraging the browser session pool to handle concurrency efficiently. For an AI agent doing research, this is transformative—one API call replaces what would otherwise be a search request followed by 5-10 individual scraping operations.
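
The pool-bounded fan-out can be sketched with a fixed number of workers pulling URLs from a shared queue. The `poolSize` default and the function names are assumptions for illustration, not Teracrawl's internals:

```typescript
// Sketch of pool-bounded parallel scraping: at most `poolSize` scrapes run
// concurrently, mimicking a fixed-size browser session pool.
async function scrapeAll<T>(
  urls: string[],
  scrape: (url: string) => Promise<T>,
  poolSize = 3
): Promise<T[]> {
  const results: T[] = new Array(urls.length);
  let next = 0;
  // Each worker repeatedly claims the next unclaimed URL; results are
  // written by index so output order matches input order.
  const worker = async () => {
    while (next < urls.length) {
      const i = next++;
      results[i] = await scrape(urls[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(poolSize, urls.length) }, worker)
  );
  return results;
}
```

Because JavaScript is single-threaded, the `next++` claim is race-free: no await sits between the bounds check and the increment.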

The browser context reuse strategy deserves special attention. Instead of spinning up a new browser instance for every request (expensive and slow), Teracrawl maintains a pool of active sessions. In fast mode, it reuses these contexts with aggressive resource blocking:

```typescript
// Simplified internal approach
const context = await browserPool.acquire();
await context.route('**/*.{png,jpg,jpeg,gif,svg,css,woff2}', route => route.abort());
const page = await context.newPage();
await page.goto(url, { waitUntil: 'domcontentloaded' });
```

This is why Teracrawl scored 84.2% on scrape-evals while maintaining competitive performance—it’s fast when it can be, and thorough when it needs to be. The dynamic mode fallback ensures you don’t sacrifice reliability for speed.

The content extraction logic uses DOM analysis to identify the primary content container, looking for article tags, main elements, and content density patterns. This is more sophisticated than simple CSS selectors but less brittle than hard-coded rules for specific sites. It’s a pragmatic middle ground that works across diverse web architectures without requiring per-site configuration.
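A content-density approach like the one described can be sketched as a scoring function over candidate containers. The weights, the link-density penalty, and the shape of `Candidate` are assumptions for illustration; Teracrawl's real extraction is more involved:

```typescript
// Illustrative content-density heuristic, not Teracrawl's actual scorer.
interface Candidate {
  tag: string;            // e.g. 'article', 'main', 'div'
  textLength: number;     // characters of visible text in the container
  linkTextLength: number; // characters of that text sitting inside links
}

// Prefer semantic containers and penalize link-heavy regions: navigation
// menus and footers have a high link-to-text ratio, article bodies do not.
function scoreCandidate(c: Candidate): number {
  const tagBonus = c.tag === 'article' ? 2 : c.tag === 'main' ? 1.5 : 1;
  const linkDensity = c.textLength === 0 ? 1 : c.linkTextLength / c.textLength;
  return c.textLength * tagBonus * (1 - linkDensity);
}

function pickMainContent(candidates: Candidate[]): Candidate {
  return candidates.reduce((best, c) =>
    scoreCandidate(c) > scoreCandidate(best) ? c : best
  );
}
```

A scorer like this needs no per-site rules: a nav bar full of links scores near zero even when it contains a lot of text.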

Gotcha

The biggest limitation is architectural: Teracrawl is completely dependent on Browser.cash for browser provisioning. This isn’t just a recommended integration—it’s a hard requirement. You cannot run Teracrawl without a Browser.cash API key, which means you’re locked into their pricing model (pay-per-request) and subject to their infrastructure availability. For organizations with strict data residency requirements, on-premise compliance needs, or simply a preference for self-hosted tooling, this is a non-starter. The repository doesn’t include fallback support for local Playwright or Puppeteer instances.

The search functionality introduces another dependency: browser-serp must be running as a separate service. While the scraping itself is a simple API call, search queries require you to deploy and maintain an additional component. The documentation doesn't make this entirely clear upfront, and developers expecting a fully self-contained solution will hit this wall when they try to use search features.

The content extraction, while generally effective, offers limited customization. There's no way to specify custom CSS selectors, execute arbitrary JavaScript, or define extraction rules for specific use cases. If the automatic content detection doesn't capture what you need, your options are limited.

Verdict

Use if: You're building LLM applications (RAG systems, AI agents, research tools) that need production-grade web scraping with minimal engineering effort, you're comfortable with pay-per-request pricing and external API dependencies, you're dealing with JavaScript-heavy sites or anti-bot protection where simple HTTP requests fail, or you value proven reliability (that 84.2% benchmark score) over configuration flexibility. The unified search-and-scrape API alone justifies adoption for AI agents doing web research.

Skip if: You need on-premise deployment without external dependencies, require granular control over extraction logic or custom scraping rules, want to avoid per-request API costs in favor of fixed infrastructure expenses, or you're only scraping simple static sites where Playwright + Turndown.js would suffice.

The Browser.cash lock-in is real; make sure you're comfortable with that trade-off before committing production workloads.