Screaming Frog is the go-to crawler for most SEOs, but you’ve probably hit its walls: the 500-URL cap on the free version, RAM maxing out on large sites, or wanting to automate crawls without babysitting a GUI. Scrapy is the open-source Python framework that removes those limits.
If you can run npm install or git clone, you can run Scrapy. The learning curve is real but manageable, especially if you’re already getting comfortable with CLI tools through agentic coding workflows.
Why Scrapy?
Screaming Frog works great for quick audits. But it has limits:
| Limitation | Impact |
|---|---|
| 500 URL free limit | Requires $259/year license for larger sites |
| Memory-hungry | Large crawls can consume 8GB+ RAM |
| GUI-dependent | Difficult to automate or schedule |
| Limited customization | Configuration options are fixed |
Scrapy solves these:
| Scrapy | What You Get |
|---|---|
| Free and open-source | No URL limits, no license fees |
| Lower memory footprint | Disk-backed queues keep RAM in check |
| CLI-native | Scriptable, cron-able, CI/CD-ready |
| Full Python customization | Extract what you need, filter how you want |
| Pause/resume | Stop and continue large crawls anytime |
Installation
Scrapy runs on Python. Use a virtual environment to keep things clean:
Debian/Ubuntu:
sudo apt install python3.11-venv  # match your installed Python 3.x version
python3 -m venv venv
source venv/bin/activate
pip install scrapy
macOS:
python3 -m venv venv
source venv/bin/activate
pip install scrapy
Windows:
python -m venv venv
venv\Scripts\activate
pip install scrapy
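To confirm the install, run the CLI once; it should print the installed release:
scrapy version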
Creating a Project
With Scrapy installed:
scrapy startproject myproject
cd myproject
scrapy genspider sitename example.com
This creates:
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            sitename.py
Spider code goes in spiders/sitename.py. Configuration lives in settings.py.
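The generated spiders/sitename.py starts out as a bare stub, roughly this (the exact boilerplate varies slightly between Scrapy versions):
import scrapy


class SitenameSpider(scrapy.Spider):
    name = "sitename"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        pass
The rest of this guide is about filling in that class and the settings around it.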
Settings for Polite Crawling
Configure settings.py before running anything. Getting blocked wastes more time than crawling slowly.
# Polite crawling
CONCURRENT_REQUESTS_PER_DOMAIN = 5
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
# AutoThrottle - adjusts speed based on server response
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = True
# Safety limits
CLOSESPIDER_PAGECOUNT = 10000
# Output
FEED_EXPORT_ENCODING = "utf-8"
AutoThrottle
AutoThrottle monitors server response times and adjusts crawl speed automatically:
- Fast responses → speeds up
- Slow responses → backs off
- Errors/timeouts → slows down significantly
Unlike Screaming Frog’s fixed delays, it adapts to actual server conditions.
Status Code Handling
By default, Scrapy’s HttpErrorMiddleware silently drops responses outside the 2xx range before they reach your callback, and the RedirectMiddleware follows 3xx responses transparently. That means 404s and 500s simply vanish from your output, and the redirecting URLs themselves are never recorded. Your crawl might show 100% 200 status codes, not because the site is perfect, but because everything else is being filtered out.
Add this to your spider class to capture all status codes:
handle_httpstatus_list = [200, 301, 302, 403, 404, 500, 502, 503]
Screaming Frog captures all status codes by default. This setting brings Scrapy in line with that behavior.
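One side effect worth knowing: once 301 and 302 appear in handle_httpstatus_list, Scrapy's RedirectMiddleware stops following those redirects and hands them to your callback instead, which is exactly what lets you log them. A minimal sketch of recording the redirect target in a callback (field names are illustrative):
def parse_page(self, response):
    item = {"url": response.url, "status": response.status}
    if 300 <= response.status < 400:
        # The Location header holds the redirect target
        item["redirect_to"] = response.headers.get("Location", b"").decode("utf-8", errors="ignore")
    yield item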
Real-World Performance
Actual numbers from a test crawl with 5 concurrent requests and AutoThrottle enabled:
| Crawl Progress | Pages/Minute | Notes |
|---|---|---|
| 0-200 pages | 14-22 | Ramp-up phase |
| 200-500 pages | 10-12 | Stabilizing |
| 500-1,000 pages | 7-10 | AutoThrottle adjusting |
| 1,000+ pages | 5-7 | Steady state |
Feature Comparison
| Feature | Screaming Frog | Scrapy |
|---|---|---|
| Cost | Free <500 URLs, ~$259/year | Free, open source |
| Max crawl size | Memory-limited | Disk-backed queues |
| Customization | Limited config options | Full Python code |
| Scheduling | Manual or third-party | Native CLI, cron-able |
| Pause/Resume | Yes | Yes (with JOBDIR) |
| Learning curve | Low (GUI) | Medium (code) |
| Rate limiting | Basic fixed delays | AutoThrottle (adaptive) |
| JavaScript rendering | Optional (Chrome) | Optional (playwright/splash) |
| Status codes | All by default | Requires configuration |
| Subdomain filtering | GUI checkboxes | Code (flexible regex) |
| Export formats | CSV, Excel, etc. | JSON, CSV, XML, custom |
| CI/CD integration | Difficult | Native |
URL Filtering
Screaming Frog uses checkboxes; Scrapy uses code. You trade some learning curve for far more precise control.
Excluding international paths:
import re
from urllib.parse import urlparse

import scrapy


class MySiteSpider(scrapy.Spider):
    name = "mysite"
    allowed_domains = ["example.com", "www.example.com"]
    start_urls = ["https://www.example.com/"]

    # Skip international paths like /uk/, /fr/, /de/
    EXCLUDED_PATTERNS = re.compile(
        r"/(in|au|th|es|hk|sg|ph|my|ca|cn|uk|kr|id|fr|vn|de|jp|nl|it|tw)/"
    )

    # Wire this up through a CrawlSpider Rule's process_links argument
    # (see the complete template later in this guide)
    def filter_links(self, links):
        filtered = []
        for link in links:
            hostname = urlparse(link.url).hostname or ""
            if hostname not in ("example.com", "www.example.com"):
                continue
            if self.EXCLUDED_PATTERNS.search(link.url):
                continue
            filtered.append(link)
        return filtered
You can filter by URL patterns, query parameters, response headers, page content, or any combination.
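For example, a small sketch of dropping faceted-navigation and tracking parameters (the parameter names are illustrative, tune them for your site):
from urllib.parse import parse_qs, urlparse

# Query parameters that usually produce duplicate or low-value URLs
SKIP_PARAMS = {"sort", "filter", "sessionid", "utm_source", "utm_medium"}

def has_skip_params(url: str) -> bool:
    query = parse_qs(urlparse(url).query)
    return any(param in SKIP_PARAMS for param in query)
A has_skip_params(link.url) check slots into filter_links right next to the hostname and pattern checks.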
Pause and Resume
For crawls over 1,000 pages, enable pause/resume with JOBDIR:
scrapy crawl myspider -o output.json -s JOBDIR=crawl_state
Scrapy saves state to crawl_state/. Hit Ctrl+C to pause. Run the same command to resume.
State includes the request queue of pending URLs and the fingerprints of URLs already seen, all stored as plain files, so a paused crawl survives a system restart. That makes it more robust than Screaming Frog's manual save/load workflow.
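If you peek inside crawl_state/ mid-crawl, it typically looks something like this:
crawl_state/
    requests.queue/    # pending requests, serialized to disk
    requests.seen      # fingerprints of URLs already crawled (the dupe filter)
    spider.state       # any custom state your spider chooses to persist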
JavaScript Rendering
Scrapy fetches raw HTML and does not execute JavaScript; you get the same markup curl would return.
For most SEO crawls, this is fine:
- Meta tags, canonicals, and h1s are usually in the initial HTML
- Search engines primarily index server-rendered content
- Most e-commerce and content sites are server-rendered
If your target site renders content client-side, you have options:
| Package | Notes |
|---|---|
| scrapy-playwright | Uses Chromium/Firefox/WebKit. Recommended for modern JS sites |
| scrapy-splash | Lightweight, Docker-based renderer |
| scrapy-selenium | Older approach, still works |
JS rendering is significantly slower and more resource-intensive. Only add it if the site requires it.
Screaming Frog has a similar tradeoff. Enabling JavaScript rendering uses Chrome under the hood and slows crawls considerably.
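If you do go the scrapy-playwright route, the wiring is a couple of settings plus a per-request flag. A minimal sketch, assuming the package is installed (check its README for your version):
# settings.py: hand downloads to Playwright's browser
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into rendering:
# yield scrapy.Request(url, meta={"playwright": True})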
Memory Management
At ~1,300 pages with full field extraction:
- Memory: ~265 MB
- CPU: ~4%
Using JOBDIR moves request queues to disk, keeping memory low. For very large crawls (100k+ URLs), add these settings:
MEMUSAGE_LIMIT_MB = 1024
MEMUSAGE_WARNING_MB = 800
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
MEMUSAGE_WARNING_MB logs a warning as memory climbs, MEMUSAGE_LIMIT_MB shuts the crawl down if it crosses the cap, and the queue settings select FIFO (breadth-first) queues; the disk queue only kicks in when JOBDIR is set.
Output Data
Basic spider output:
{
"url": "https://www.example.com/page/",
"title": "Page Title Here",
"status": 200
}
For SEO crawls, you’ll want fields similar to what Screaming Frog exports:
def parse_page(self, response):
    yield {
        "url": response.url,
        "status": response.status,
        "title": response.css("title::text").get(),
        "meta_description": response.css("meta[name='description']::attr(content)").get(),
        "meta_robots": response.css("meta[name='robots']::attr(content)").get(),
        "h1": response.css("h1::text").get(),
        "canonical": response.css("link[rel='canonical']::attr(href)").get(),
        "og_title": response.css("meta[property='og:title']::attr(content)").get(),
        "og_description": response.css("meta[property='og:description']::attr(content)").get(),
        # Rough count of whitespace-separated tokens in the raw HTML, not visible words
        "word_count": len(response.text.split()) if response.status == 200 else None,
        "content_type": response.headers.get("Content-Type", b"").decode("utf-8", errors="ignore"),
    }
Add or remove fields based on what you need. CSS selectors work for any on-page element.
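Two additions come up often; a sketch of extra fields you could merge into the yielded dict inside parse_page (field names and selectors are illustrative):
extras = {
    "h2": response.css("h2::text").getall(),  # every h2 heading on the page
    "images_missing_alt": response.xpath("//img[not(@alt)]/@src").getall(),  # images with no alt attribute
}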
Export formats: JSON (-o output.json), JSON Lines (-o output.jsonl), CSV (-o output.csv), XML (-o output.xml).
JSON Lines is best for large crawls. Files are valid line-by-line during the crawl, so you can monitor with tail -f. Standard JSON isn’t valid until the crawl completes.
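Because every line is a complete JSON object, a short Python script can summarize a crawl that is still running (this assumes the output file is named urls.jsonl, as in the workflow below):
import json
from collections import Counter

# Tally status codes from a JSON Lines export, even mid-crawl
counts = Counter()
with open("urls.jsonl", encoding="utf-8") as f:
    for line in f:
        try:
            counts[json.loads(line)["status"]] += 1
        except ValueError:
            pass  # skip a partially written last line
print(counts.most_common())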
Screaming Frog → Scrapy
Mapping SF workflows to Scrapy:
| Screaming Frog Action | Scrapy Equivalent |
|---|---|
| Start new crawl | scrapy crawl spidername |
| Set crawl delay | DOWNLOAD_DELAY in settings |
| Limit concurrent threads | CONCURRENT_REQUESTS_PER_DOMAIN |
| Respect robots.txt | ROBOTSTXT_OBEY = True |
| Export to CSV | -o output.csv |
| Save/Load crawl | -s JOBDIR=crawl_state |
| Filter subdomains | Code in spider (regex) |
| Custom extraction | CSS/XPath selectors in parse() |
Mindset shifts:
- Configuration is code. Edit settings.py instead of clicking checkboxes.
- Extraction is explicit. You write what data to capture.
- Scheduling is native. Add commands to cron or CI/CD.
- Debugging is logs. Enable AUTOTHROTTLE_DEBUG to see what’s happening.
Full Workflow
With the standard settings above, you can have Scrapy installed and crawling in under 15 minutes:
python3 -m venv venv
source venv/bin/activate # venv\Scripts\activate on Windows
pip install scrapy
scrapy startproject urlcrawler
cd urlcrawler
scrapy genspider mysite example.com
# Edit settings.py with polite crawling config
# Edit spiders/mysite.py with your parse logic
scrapy crawl mysite -o urls.jsonl -s JOBDIR=crawl_state
Scrapy Shell
As you build custom configurations, use Scrapy Shell to test your selectors and settings interactively:
scrapy shell "https://example.com"
This opens an interactive Python console with the response already loaded. Test CSS and XPath selectors in real-time before adding them to your spider:
>>> response.css('title::text').get()
'Example Domain'
>>> response.xpath('//h1/text()').get()
'Example Domain'
Scrapy Shell cuts iteration time significantly. Validate extraction logic without running full crawls.
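The shell also has helpers for moving around without restarting it: fetch() downloads a new URL into response, and view(response) opens what Scrapy actually received in your browser.
>>> fetch("https://example.com/another-page/")
>>> response.css("link[rel='canonical']::attr(href)").get()
>>> view(response)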
Complete Spider Template
A production-ready spider with URL filtering, status code handling, and full SEO field extraction:
import re
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SEOSpider(CrawlSpider):
    name = "seospider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com"]

    # Capture all HTTP status codes, not just 2xx
    handle_httpstatus_list = [200, 301, 302, 403, 404, 500, 502, 503]

    # URL patterns to exclude
    EXCLUDED_PATTERNS = re.compile(
        r"/(in|au|th|es|hk|sg|ph|my|ca|cn|uk|kr|id|fr|vn|de|jp|nl|it|tw)/"
    )

    rules = (
        Rule(
            LinkExtractor(allow=()),
            callback="parse_page",
            follow=True,
            process_links="filter_links",
        ),
    )

    def filter_links(self, links):
        filtered = []
        for link in links:
            parsed = urlparse(link.url)
            hostname = parsed.hostname or ""
            if hostname not in ("example.com", "www.example.com"):
                continue
            if self.EXCLUDED_PATTERNS.search(link.url):
                continue
            filtered.append(link)
        return filtered

    def parse_page(self, response):
        yield {
            "url": response.url,
            "status": response.status,
            "title": response.css("title::text").get(),
            "meta_description": response.css("meta[name='description']::attr(content)").get(),
            "meta_robots": response.css("meta[name='robots']::attr(content)").get(),
            "h1": response.css("h1::text").get(),
            "canonical": response.css("link[rel='canonical']::attr(href)").get(),
            "og_title": response.css("meta[property='og:title']::attr(content)").get(),
            "og_description": response.css("meta[property='og:description']::attr(content)").get(),
            "word_count": len(response.text.split()) if response.status == 200 else None,
            "content_type": response.headers.get("Content-Type", b"").decode("utf-8", errors="ignore"),
        }
Replace example.com with your target domain. Adjust EXCLUDED_PATTERNS for your site’s URL structure.
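Run it the same way as in the workflow above (the output filename is just an example):
scrapy crawl seospider -o audit.jsonl -s JOBDIR=crawl_state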
When to Use Which
Screaming Frog:
- Quick audits under 500 URLs
- Results needed in minutes
- Visual site exploration
- Not comfortable with CLI
- Using Screaming Frog data with Redirects.net
Scrapy:
- Sites over 10,000 URLs
- Automated, scheduled crawls
- Custom extraction needs
- CI/CD integration
- Memory constraints
- Version-controlled configs
The Takeaway
Scrapy has a steeper setup curve than Screaming Frog, but it removes the practical limits GUI crawlers impose. No URL caps, no license fees, lower memory usage, and native automation.
Start small. Crawl a site you know. Use conservative settings. Compare output to Screaming Frog. The data will match, but you’ll have a tool that scales.