Screaming Frog is the go-to crawler for most SEOs, but you’ve probably hit its walls: the 500-URL cap on the free version, RAM maxing out on large sites, or no clean way to automate crawls without babysitting a GUI. Scrapy is the open-source Python framework that removes those limits.

If you can run npm install or git clone, you can run Scrapy. The learning curve is real but manageable, especially if you’re already getting comfortable with CLI tools through agentic coding workflows.

Why Scrapy?

Key Benefits

Screaming Frog works great for quick audits. But it has limits:

| Limitation | Impact |
|---|---|
| 500 URL free limit | Requires $259/year license for larger sites |
| Memory-hungry | Large crawls can consume 8GB+ RAM |
| GUI-dependent | Difficult to automate or schedule |
| Limited customization | Configuration options are fixed |

Scrapy solves these:

| Scrapy | What You Get |
|---|---|
| Free and open-source | No URL limits, no license fees |
| Lower memory footprint | Disk-backed queues keep RAM in check |
| CLI-native | Scriptable, cron-able, CI/CD-ready |
| Full Python customization | Extract what you need, filter how you want |
| Pause/resume | Stop and continue large crawls anytime |

Scrapy won't replace Screaming Frog for everything. Quick audits are still faster in a GUI. But for large-scale crawls, automation, and custom extraction, it's worth having in your toolkit.

Installation

Setup

Scrapy runs on Python. Use a virtual environment to keep things clean:

Debian/Ubuntu:

sudo apt install python3.11-venv
python3 -m venv venv
source venv/bin/activate
pip install scrapy

macOS:

python3 -m venv venv
source venv/bin/activate
pip install scrapy

Windows:

python -m venv venv
venv\Scripts\activate
pip install scrapy

Virtual Environments Matter
Always use a venv. Installing globally causes dependency conflicts and breaks reproducibility.

Creating a Project

With Scrapy installed:

scrapy startproject myproject
cd myproject
scrapy genspider sitename example.com

This creates:

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            sitename.py

Spider code goes in spiders/sitename.py. Configuration lives in settings.py.
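
The generated spider is just a skeleton. Depending on your Scrapy version it looks roughly like this:

import scrapy


class SitenameSpider(scrapy.Spider):
    name = "sitename"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Replace this stub with your extraction logic
        pass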

Settings for Polite Crawling

Critical

Configure settings.py before running anything. Getting blocked wastes more time than crawling slowly.

# Polite crawling
CONCURRENT_REQUESTS_PER_DOMAIN = 5
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True

# AutoThrottle - adjusts speed based on server response
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = True

# Safety limits
CLOSESPIDER_PAGECOUNT = 10000

# Output
FEED_EXPORT_ENCODING = "utf-8"

With AutoThrottle enabled, 5 concurrent requests is a reasonable starting point. AutoThrottle will back off automatically if the server struggles. Without AutoThrottle, start lower at 1-3.

AutoThrottle

AutoThrottle monitors server response times and adjusts crawl speed automatically:

  • Fast responses → speeds up
  • Slow responses → backs off
  • Errors/timeouts → slows down significantly

Unlike Screaming Frog’s fixed delays, it adapts to actual server conditions.

Status Code Handling

By default, Scrapy follows redirects transparently and its HttpErrorMiddleware silently drops other non-2xx responses. That means 404s and 500s never reach your callback, and 301s are followed rather than recorded. Your crawl might show 100% 200 status codes, not because the site is perfect, but because errors are being filtered out.

Add this to your spider class to capture all status codes:

handle_httpstatus_list = [200, 301, 302, 403, 404, 500, 502, 503]

Screaming Frog captures all status codes by default; this setting brings Scrapy in line with that behavior. Note that listing 301 and 302 also stops Scrapy from following those redirects automatically, so the redirect response itself is what reaches your callback.
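
Once error responses reach your callback, decide what to do with them there. A minimal sketch, using the same parse_page callback pattern as the examples later in this post:

def parse_page(self, response):
    # With handle_httpstatus_list set, error responses reach this callback too;
    # record them as rows instead of trying to parse their (often empty) bodies
    if response.status >= 300:
        yield {"url": response.url, "status": response.status}
        return
    yield {
        "url": response.url,
        "status": response.status,
        "title": response.css("title::text").get(),
    }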

Real-World Performance

Benchmarks

Actual numbers from a test crawl with 5 concurrent requests and AutoThrottle enabled:

| Crawl Progress | Pages/Minute | Notes |
|---|---|---|
| 0-200 pages | 14-22 | Ramp-up phase |
| 200-500 pages | 10-12 | Stabilizing |
| 500-1,000 pages | 7-10 | AutoThrottle adjusting |
| 1,000+ pages | 5-7 | Steady state |

Speed vs. Reliability
These speeds look slow. That's the point. AutoThrottle prioritizes server health over raw speed. Getting blocked and restarting wastes more time than a methodical crawl.

Feature Comparison

| Feature | Screaming Frog | Scrapy |
|---|---|---|
| Cost | Free <500 URLs, ~$259/year | Free, open source |
| Max crawl size | Memory-limited | Disk-backed queues |
| Customization | Limited config options | Full Python code |
| Scheduling | Manual or third-party | Native CLI, cron-able |
| Pause/Resume | Yes | Yes (with JOBDIR) |
| Learning curve | Low (GUI) | Medium (code) |
| Rate limiting | Basic fixed delays | AutoThrottle (adaptive) |
| JavaScript rendering | Optional (Chrome) | Optional (playwright/splash) |
| Status codes | All by default | Requires configuration |
| Subdomain filtering | GUI checkboxes | Code (flexible regex) |
| Export formats | CSV, Excel, etc. | JSON, CSV, XML, custom |
| CI/CD integration | Difficult | Native |

URL Filtering

Precise Control

Screaming Frog uses checkboxes. Scrapy uses code. The tradeoff is a steeper learning curve in exchange for precision.

Excluding international paths:

import re
import scrapy
from urllib.parse import urlparse

class MySiteSpider(scrapy.Spider):
    name = "mysite"
    allowed_domains = ["example.com", "www.example.com"]
    start_urls = ["https://www.example.com/"]

    # Skip international paths like /uk/, /fr/, /de/
    EXCLUDED_PATTERNS = re.compile(
        r"/(in|au|th|es|hk|sg|ph|my|ca|cn|uk|kr|id|fr|vn|de|jp|nl|it|tw)/"
    )

    # Drop links that point off-domain or match the excluded patterns
    def filter_links(self, links):
        filtered = []
        for link in links:
            hostname = urlparse(link.url).hostname or ""
            if hostname not in ("example.com", "www.example.com"):
                continue
            if self.EXCLUDED_PATTERNS.search(link.url):
                continue
            filtered.append(link)
        return filtered

You can filter by URL patterns, query parameters, response headers, page content, or any combination.
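
For example, here is a variant of the spider above that also drops links carrying tracking or session query parameters. The parameter names are hypothetical; adjust them for your site:

import scrapy
from urllib.parse import parse_qs, urlparse


class MySiteSpider(scrapy.Spider):
    name = "mysite"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    # Hypothetical parameter blocklist
    BLOCKED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sort"}

    def filter_links(self, links):
        filtered = []
        for link in links:
            query = parse_qs(urlparse(link.url).query)
            # Skip any URL whose query string contains a blocked parameter
            if self.BLOCKED_PARAMS & query.keys():
                continue
            filtered.append(link)
        return filtered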

Pause and Resume

Essential

For crawls over 1,000 pages, enable pause/resume with JOBDIR:

scrapy crawl myspider -o output.json -s JOBDIR=crawl_state

Scrapy saves state to crawl_state/. Hit Ctrl+C once to pause gracefully (a second Ctrl+C forces an unclean stop). Run the same command to resume.

Always use JOBDIR for production crawls. It protects against network issues, system restarts, or just needing to stop for the day.

State includes pending URLs, seen URLs, and the request queue. Because it’s written to disk continuously as the crawl runs, it survives crashes and system restarts without a manual save step, which makes it more robust than Screaming Frog’s save/load feature.
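
If you’d rather not pass -s on every run, JOBDIR is also a regular setting, so it can live in settings.py. A minimal sketch:

# settings.py - persist scheduler state to disk so crawls can be paused and resumed
JOBDIR = "crawl_state"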

JavaScript Rendering

Out of the box, Scrapy fetches raw HTML and doesn’t render JavaScript. You get the same markup curl would return.

For most SEO crawls, this is fine:

  • Meta tags, canonicals, and h1s are usually in the initial HTML
  • Search engines primarily index server-rendered content
  • Most e-commerce and content sites are server-rendered

If your target site renders content client-side, you have options:

| Package | Notes |
|---|---|
| scrapy-playwright | Uses Chromium/Firefox/WebKit. Recommended for modern JS sites |
| scrapy-splash | Lightweight, Docker-based renderer |
| scrapy-selenium | Older approach, still works |

JS rendering is significantly slower and more resource-intensive. Only add it if the site requires it.
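
If you do need rendering, scrapy-playwright hooks in through Scrapy’s download handlers. A minimal sketch based on its documented setup (check the project README for current details):

# settings.py - route requests through Playwright instead of the default handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into rendering:
#   yield scrapy.Request(url, meta={"playwright": True})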

Screaming Frog has a similar tradeoff. Enabling JavaScript rendering uses Chrome under the hood and slows crawls considerably.

Memory Management

At ~1,300 pages with full field extraction:

  • Memory: ~265 MB
  • CPU: ~4%

Using JOBDIR moves request queues to disk, keeping memory low. For very large crawls (100k+ URLs), add these settings:

MEMUSAGE_LIMIT_MB = 1024
MEMUSAGE_WARNING_MB = 800
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

MEMUSAGE_LIMIT_MB shuts the crawl down gracefully if memory use exceeds the limit (with a warning logged at MEMUSAGE_WARNING_MB), and the scheduler queue settings keep pending requests on disk rather than in RAM when JOBDIR is set.

Output Data

Customizable

Basic spider output:

{
    "url": "https://www.example.com/page/",
    "title": "Page Title Here",
    "status": 200
}

For SEO crawls, you’ll want fields similar to what Screaming Frog exports:

def parse_page(self, response):
    yield {
        "url": response.url,
        "status": response.status,
        "title": response.css("title::text").get(),
        "meta_description": response.css("meta[name='description']::attr(content)").get(),
        "meta_robots": response.css("meta[name='robots']::attr(content)").get(),
        "h1": response.css("h1::text").get(),
        "canonical": response.css("link[rel='canonical']::attr(href)").get(),
        "og_title": response.css("meta[property='og:title']::attr(content)").get(),
        "og_description": response.css("meta[property='og:description']::attr(content)").get(),
        "word_count": len(response.text.split()) if response.status == 200 else None,
        "content_type": response.headers.get("Content-Type", b"").decode("utf-8", errors="ignore"),
    }

Add or remove fields based on what you need. CSS selectors work for any on-page element.
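
For instance, here are two hypothetical multi-valued fields using .getall(), which returns every match instead of just the first:

def parse_page(self, response):
    yield {
        "url": response.url,
        # .getall() collects every match rather than only the first
        "hreflang_values": response.css("link[rel='alternate']::attr(hreflang)").getall(),
        "image_alt_texts": response.css("img::attr(alt)").getall(),
    }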

Export formats: JSON (-o output.json), JSON Lines (-o output.jsonl), CSV (-o output.csv), XML (-o output.xml).

JSON Lines is best for large crawls. Files are valid line-by-line during the crawl, so you can monitor with tail -f. Standard JSON isn’t valid until the crawl completes.
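
Output can also be configured in settings.py via the FEEDS setting (Scrapy 2.1+), which is handy for scheduled runs where you don’t want to repeat -o flags. A sketch:

# settings.py - write JSON Lines output without passing -o on every run
FEEDS = {
    "output.jsonl": {"format": "jsonlines", "encoding": "utf-8"},
}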

Screaming Frog → Scrapy

Translation Guide

Mapping SF workflows to Scrapy:

| Screaming Frog Action | Scrapy Equivalent |
|---|---|
| Start new crawl | scrapy crawl spidername |
| Set crawl delay | DOWNLOAD_DELAY in settings |
| Limit concurrent threads | CONCURRENT_REQUESTS_PER_DOMAIN |
| Respect robots.txt | ROBOTSTXT_OBEY = True |
| Export to CSV | -o output.csv |
| Save/Load crawl | -s JOBDIR=crawl_state |
| Filter subdomains | Code in spider (regex) |
| Custom extraction | CSS/XPath selectors in parse() |

Mindset shifts:

  1. Configuration is code. Edit settings.py instead of clicking checkboxes, or override settings per spider (see the sketch after this list).
  2. Extraction is explicit. You write what data to capture.
  3. Scheduling is native. Add commands to cron or CI/CD.
  4. Debugging is logs. Enable AUTOTHROTTLE_DEBUG to see what’s happening.
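
On the first point: settings don’t have to be global. Spiders can carry their own overrides via custom_settings, so each site’s politeness config lives next to the spider that crawls it. A minimal sketch (spider name and values are illustrative):

import scrapy


class MySiteSpider(scrapy.Spider):
    name = "mysite"
    start_urls = ["https://www.example.com/"]

    # Per-spider overrides of settings.py
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }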

Full Workflow

With the standard settings above, you can have Scrapy installed and crawling in under 15 minutes:

python3 -m venv venv
source venv/bin/activate  # venv\Scripts\activate on Windows
pip install scrapy
scrapy startproject urlcrawler
cd urlcrawler
scrapy genspider mysite example.com
# Edit settings.py with polite crawling config
# Edit spiders/mysite.py with your parse logic
scrapy crawl mysite -o urls.jsonl -s JOBDIR=crawl_state

Scrapy Shell

As you build custom configurations, use Scrapy Shell to test your selectors and settings interactively:

scrapy shell "https://example.com"

This opens an interactive Python console with the response already loaded. Test CSS and XPath selectors in real-time before adding them to your spider:

>>> response.css('title::text').get()
'Example Domain'
>>> response.xpath('//h1/text()').get()
'Example Domain'

Scrapy Shell cuts iteration time significantly. Validate extraction logic without running full crawls.

Complete Spider Template

A production-ready spider with URL filtering, status code handling, and full SEO field extraction:

import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse


class SEOSpider(CrawlSpider):
    name = "seospider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com"]

    # Capture all HTTP status codes, not just 2xx
    handle_httpstatus_list = [200, 301, 302, 403, 404, 500, 502, 503]

    # URL patterns to exclude
    EXCLUDED_PATTERNS = re.compile(
        r"/(in|au|th|es|hk|sg|ph|my|ca|cn|uk|kr|id|fr|vn|de|jp|nl|it|tw)/"
    )

    rules = (
        Rule(
            LinkExtractor(allow=()),
            callback="parse_page",
            follow=True,
            process_links="filter_links",
        ),
    )

    def filter_links(self, links):
        filtered = []
        for link in links:
            parsed = urlparse(link.url)
            hostname = parsed.hostname or ""

            if hostname not in ("example.com", "www.example.com"):
                continue

            if self.EXCLUDED_PATTERNS.search(link.url):
                continue

            filtered.append(link)
        return filtered

    def parse_page(self, response):
        yield {
            "url": response.url,
            "status": response.status,
            "title": response.css("title::text").get(),
            "meta_description": response.css("meta[name='description']::attr(content)").get(),
            "meta_robots": response.css("meta[name='robots']::attr(content)").get(),
            "h1": response.css("h1::text").get(),
            "canonical": response.css("link[rel='canonical']::attr(href)").get(),
            "og_title": response.css("meta[property='og:title']::attr(content)").get(),
            "og_description": response.css("meta[property='og:description']::attr(content)").get(),
            "word_count": len(response.text.split()) if response.status == 200 else None,
            "content_type": response.headers.get("Content-Type", b"").decode("utf-8", errors="ignore"),
        }

Replace example.com with your target domain. Adjust EXCLUDED_PATTERNS for your site’s URL structure.

When to Use Which

Screaming Frog:

  • Quick, one-off audits
  • Smaller sites, where the 500-URL free limit or a single license covers it
  • When you want a GUI and a minimal learning curve

Scrapy:

  • Sites over 10,000 URLs
  • Automated, scheduled crawls
  • Custom extraction needs
  • CI/CD integration
  • Memory constraints
  • Version-controlled configs

The Takeaway

Scrapy has a steeper setup curve than Screaming Frog, but it removes the practical limits GUI crawlers impose. No URL caps, no license fees, lower memory usage, and native automation.

Start small. Crawl a site you know. Use conservative settings. Compare output to Screaming Frog. The data will match, but you’ll have a tool that scales.