Screaming Frog is the go-to crawler for most SEOs, but you’ve probably hit its walls: the 500-URL cap on the free version, RAM maxing out on large sites, or wanting to automate crawls without babysitting a GUI. Scrapy is the open-source Python framework that removes those limits.
If you can run npm install or git clone, you can run Scrapy. The learning curve is real but manageable, especially if you’re already getting comfortable with CLI tools through agentic coding workflows.
Why Scrapy?
Screaming Frog works great for quick audits. But it has limits:
| Limitation | Impact |
|---|---|
| 500 URL free limit | Requires $259/year license for larger sites |
| Memory-hungry | Large crawls can consume 8GB+ RAM |
| GUI-dependent | Difficult to automate or schedule |
| Limited customization | Configuration options are fixed |
Scrapy solves these:
| Scrapy | What You Get |
|---|---|
| Free and open-source | No URL limits, no license fees |
| Lower memory footprint | Disk-backed queues keep RAM in check |
| CLI-native | Scriptable, cron-able, CI/CD-ready |
| Full Python customization | Extract what you need, filter how you want |
| Pause/resume | Stop and continue large crawls anytime |
Installation
Scrapy runs on Python. Use a virtual environment to keep things clean:
Debian/Ubuntu:
sudo apt install python3.11-venv  # match your installed Python 3.x version
python3 -m venv venv
source venv/bin/activate
pip install scrapy
macOS:
python3 -m venv venv
source venv/bin/activate
pip install scrapy
Windows:
python -m venv venv
venv\Scripts\activate
pip install scrapy
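To confirm the install, run the CLI once; it should print the installed release:
scrapy version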
Creating a Project
With Scrapy installed:
scrapy startproject myproject
cd myproject
scrapy genspider sitename example.com
This creates:
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            sitename.py
Spider code goes in spiders/sitename.py. Configuration lives in settings.py.
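The generated spiders/sitename.py starts out as a bare stub, roughly this (the exact boilerplate varies slightly between Scrapy versions):
import scrapy


class SitenameSpider(scrapy.Spider):
    name = "sitename"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        pass
The rest of this guide is about filling in that class and the settings around it.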
Settings for Polite Crawling
Configure settings.py before running anything. Getting blocked wastes more time than crawling slowly.
# Polite crawling
CONCURRENT_REQUESTS_PER_DOMAIN = 5
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
# AutoThrottle - adjusts speed based on server response
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = True
# Safety limits
CLOSESPIDER_PAGECOUNT = 10000
# Output
FEED_EXPORT_ENCODING = "utf-8"
AutoThrottle
AutoThrottle monitors server response times and adjusts crawl speed automatically:
- Fast responses → speeds up
- Slow responses → backs off
- Errors/timeouts → slows down significantly
Unlike Screaming Frog’s fixed delays, it adapts to actual server conditions.
Status Code Handling
By default, Scrapy’s HttpErrorMiddleware silently drops responses outside the 2xx range before they reach your callback, and the RedirectMiddleware follows 3xx responses transparently. That means 404s and 500s simply vanish from your output, and the redirecting URLs themselves are never recorded. Your crawl might show 100% 200 status codes, not because the site is perfect, but because everything else is being filtered out.
Add this to your spider class to capture all status codes:
handle_httpstatus_list = [200, 301, 302, 403, 404, 500, 502, 503]
Screaming Frog captures all status codes by default. This setting brings Scrapy in line with that behavior.
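One side effect worth knowing: once 301 and 302 appear in handle_httpstatus_list, Scrapy's RedirectMiddleware stops following those redirects and hands them to your callback instead, which is exactly what lets you log them. A minimal sketch of recording the redirect target in a callback (field names are illustrative):
def parse_page(self, response):
    item = {"url": response.url, "status": response.status}
    if 300 <= response.status < 400:
        # The Location header holds the redirect target
        item["redirect_to"] = response.headers.get("Location", b"").decode("utf-8", errors="ignore")
    yield item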
Real-World Performance
Actual numbers from a test crawl with 5 concurrent requests and AutoThrottle enabled:
| Crawl Progress | Pages/Minute | Notes |
|---|---|---|
| 0-200 pages | 14-22 | Ramp-up phase |
| 200-500 pages | 10-12 | Stabilizing |
| 500-1,000 pages | 7-10 | AutoThrottle adjusting |
| 1,000+ pages | 5-7 | Steady state |
Feature Comparison
| Feature | Screaming Frog | Scrapy |
|---|---|---|
| Cost | Free <500 URLs, ~$259/year | Free, open source |
| Max crawl size | Memory-limited | Disk-backed queues |
| Customization | Limited config options | Full Python code |
| Scheduling | Manual or third-party | Native CLI, cron-able |
| Pause/Resume | Yes | Yes (with JOBDIR) |
| Learning curve | Low (GUI) | Medium (code) |
| Rate limiting | Basic fixed delays | AutoThrottle (adaptive) |
| JavaScript rendering | Optional (Chrome) | Optional (playwright/splash) |
| Status codes | All by default | Requires configuration |
| Subdomain filtering | GUI checkboxes | Code (flexible regex) |
| Export formats | CSV, Excel, etc. | JSON, CSV, XML, custom |
| CI/CD integration | Difficult | Native |
URL Filtering
Screaming Frog uses checkboxes; Scrapy uses code. You trade some learning curve for far more precise control.
Excluding international paths:
import re
from urllib.parse import urlparse

import scrapy


class MySiteSpider(scrapy.Spider):
    name = "mysite"
    allowed_domains = ["example.com", "www.example.com"]
    start_urls = ["https://www.example.com/"]

    # Skip international paths like /uk/, /fr/, /de/
    EXCLUDED_PATTERNS = re.compile(
        r"/(in|au|th|es|hk|sg|ph|my|ca|cn|uk|kr|id|fr|vn|de|jp|nl|it|tw)/"
    )

    # Wire this up through a CrawlSpider Rule's process_links argument
    # (see the complete template later in this guide)
    def filter_links(self, links):
        filtered = []
        for link in links:
            hostname = urlparse(link.url).hostname or ""
            if hostname not in ("example.com", "www.example.com"):
                continue
            if self.EXCLUDED_PATTERNS.search(link.url):
                continue
            filtered.append(link)
        return filtered
You can filter by URL patterns, query parameters, response headers, page content, or any combination.
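For example, a small sketch of dropping faceted-navigation and tracking parameters (the parameter names are illustrative, tune them for your site):
from urllib.parse import parse_qs, urlparse

# Query parameters that usually produce duplicate or low-value URLs
SKIP_PARAMS = {"sort", "filter", "sessionid", "utm_source", "utm_medium"}

def has_skip_params(url: str) -> bool:
    query = parse_qs(urlparse(url).query)
    return any(param in SKIP_PARAMS for param in query)
A has_skip_params(link.url) check slots into filter_links right next to the hostname and pattern checks.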
Pause and Resume
For crawls over 1,000 pages, enable pause/resume with JOBDIR:
scrapy crawl myspider -o output.json -s JOBDIR=crawl_state
Scrapy saves state to crawl_state/. Hit Ctrl+C to pause. Run the same command to resume.
State includes the request queue of pending URLs and the fingerprints of URLs already seen, all stored as plain files, so a paused crawl survives a system restart. That makes it more robust than Screaming Frog's manual save/load workflow.
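If you peek inside crawl_state/ mid-crawl, it typically looks something like this:
crawl_state/
    requests.queue/    # pending requests, serialized to disk
    requests.seen      # fingerprints of URLs already crawled (the dupe filter)
    spider.state       # any custom state your spider chooses to persist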
JavaScript Rendering
Scrapy fetches raw HTML and does not execute JavaScript; you get the same markup curl would return.
For most SEO crawls, this is fine:
- Meta tags, canonicals, and h1s are usually in the initial HTML
- Search engines primarily index server-rendered content
- Most e-commerce and content sites are server-rendered
If your target site renders content client-side, you have options:
| Package | Notes |
|---|---|
| scrapy-playwright | Uses Chromium/Firefox/WebKit. Recommended for modern JS sites |
| scrapy-splash | Lightweight, Docker-based renderer |
| scrapy-selenium | Older approach, still works |
JS rendering is significantly slower and more resource-intensive. Only add it if the site requires it.
Screaming Frog has a similar tradeoff. Enabling JavaScript rendering uses Chrome under the hood and slows crawls considerably.
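If you do go the scrapy-playwright route, the wiring is a couple of settings plus a per-request flag. A minimal sketch, assuming the package is installed (check its README for your version):
# settings.py: hand downloads to Playwright's browser
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into rendering:
# yield scrapy.Request(url, meta={"playwright": True})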
Memory Management
At ~1,300 pages with full field extraction:
- Memory: ~265 MB
- CPU: ~4%
Using JOBDIR moves request queues to disk, keeping memory low. For very large crawls (100k+ URLs), add these settings:
MEMUSAGE_LIMIT_MB = 1024
MEMUSAGE_WARNING_MB = 800
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
MEMUSAGE_WARNING_MB logs a warning as memory climbs, MEMUSAGE_LIMIT_MB shuts the crawl down if it crosses the cap, and the queue settings select FIFO (breadth-first) queues; the disk queue only kicks in when JOBDIR is set.
Output Data
Basic spider output:
{
"url": "https://www.example.com/page/",
"title": "Page Title Here",
"status": 200
}
For SEO crawls, you’ll want fields similar to what Screaming Frog exports:
def parse_page(self, response):
    yield {
        "url": response.url,
        "status": response.status,
        "title": response.css("title::text").get(),
        "meta_description": response.css("meta[name='description']::attr(content)").get(),
        "meta_robots": response.css("meta[name='robots']::attr(content)").get(),
        "h1": response.css("h1::text").get(),
        "canonical": response.css("link[rel='canonical']::attr(href)").get(),
        "og_title": response.css("meta[property='og:title']::attr(content)").get(),
        "og_description": response.css("meta[property='og:description']::attr(content)").get(),
        # Rough count of whitespace-separated tokens in the raw HTML, not visible words
        "word_count": len(response.text.split()) if response.status == 200 else None,
        "content_type": response.headers.get("Content-Type", b"").decode("utf-8", errors="ignore"),
    }
Add or remove fields based on what you need. CSS selectors work for any on-page element.
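Two additions come up often; a sketch of extra fields you could merge into the yielded dict inside parse_page (field names and selectors are illustrative):
extras = {
    "h2": response.css("h2::text").getall(),  # every h2 heading on the page
    "images_missing_alt": response.xpath("//img[not(@alt)]/@src").getall(),  # images with no alt attribute
}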
Export formats: JSON (-o output.json), JSON Lines (-o output.jsonl), CSV (-o output.csv), XML (-o output.xml).
JSON Lines is best for large crawls. Files are valid line-by-line during the crawl, so you can monitor with tail -f. Standard JSON isn’t valid until the crawl completes.
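Because every line is a complete JSON object, a short Python script can summarize a crawl that is still running (this assumes the output file is named urls.jsonl, as in the workflow below):
import json
from collections import Counter

# Tally status codes from a JSON Lines export, even mid-crawl
counts = Counter()
with open("urls.jsonl", encoding="utf-8") as f:
    for line in f:
        try:
            counts[json.loads(line)["status"]] += 1
        except ValueError:
            pass  # skip a partially written last line
print(counts.most_common())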
Screaming Frog → Scrapy
Mapping SF workflows to Scrapy:
| Screaming Frog Action | Scrapy Equivalent |
|---|---|
| Start new crawl | scrapy crawl spidername |
| Set crawl delay | DOWNLOAD_DELAY in settings |
| Limit concurrent threads | CONCURRENT_REQUESTS_PER_DOMAIN |
| Respect robots.txt | ROBOTSTXT_OBEY = True |
| Export to CSV | -o output.csv |
| Save/Load crawl | -s JOBDIR=crawl_state |
| Filter subdomains | Code in spider (regex) |
| Custom extraction | CSS/XPath selectors in parse() |
Mindset shifts:
- Configuration is code. Edit settings.py instead of clicking checkboxes.
- Extraction is explicit. You write what data to capture.
- Scheduling is native. Add commands to cron or CI/CD.
- Debugging is logs. Enable AUTOTHROTTLE_DEBUG to see what’s happening.
Full Workflow
With the standard settings above, you can have Scrapy installed and crawling in under 15 minutes:
python3 -m venv venv
source venv/bin/activate # venv\Scripts\activate on Windows
pip install scrapy
scrapy startproject urlcrawler
cd urlcrawler
scrapy genspider mysite example.com
# Edit settings.py with polite crawling config
# Edit spiders/mysite.py with your parse logic
scrapy crawl mysite -o urls.jsonl -s JOBDIR=crawl_state
Scrapy Shell
As you build custom configurations, use Scrapy Shell to test your selectors and settings interactively:
scrapy shell "https://example.com"
This opens an interactive Python console with the response already loaded. Test CSS and XPath selectors in real-time before adding them to your spider:
>>> response.css('title::text').get()
'Example Domain'
>>> response.xpath('//h1/text()').get()
'Example Domain'
Scrapy Shell cuts iteration time significantly. Validate extraction logic without running full crawls.
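The shell also has helpers for moving around without restarting it: fetch() downloads a new URL into response, and view(response) opens what Scrapy actually received in your browser.
>>> fetch("https://example.com/another-page/")
>>> response.css("link[rel='canonical']::attr(href)").get()
>>> view(response)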
Complete Spider Template
A production-ready spider with URL filtering, status code handling, and full SEO field extraction:
import re
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SEOSpider(CrawlSpider):
    name = "seospider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com"]

    # Capture all HTTP status codes, not just 2xx
    handle_httpstatus_list = [200, 301, 302, 403, 404, 500, 502, 503]

    # URL patterns to exclude
    EXCLUDED_PATTERNS = re.compile(
        r"/(in|au|th|es|hk|sg|ph|my|ca|cn|uk|kr|id|fr|vn|de|jp|nl|it|tw)/"
    )

    rules = (
        Rule(
            LinkExtractor(allow=()),
            callback="parse_page",
            follow=True,
            process_links="filter_links",
        ),
    )

    def filter_links(self, links):
        filtered = []
        for link in links:
            parsed = urlparse(link.url)
            hostname = parsed.hostname or ""
            if hostname not in ("example.com", "www.example.com"):
                continue
            if self.EXCLUDED_PATTERNS.search(link.url):
                continue
            filtered.append(link)
        return filtered

    def parse_page(self, response):
        yield {
            "url": response.url,
            "status": response.status,
            "title": response.css("title::text").get(),
            "meta_description": response.css("meta[name='description']::attr(content)").get(),
            "meta_robots": response.css("meta[name='robots']::attr(content)").get(),
            "h1": response.css("h1::text").get(),
            "canonical": response.css("link[rel='canonical']::attr(href)").get(),
            "og_title": response.css("meta[property='og:title']::attr(content)").get(),
            "og_description": response.css("meta[property='og:description']::attr(content)").get(),
            "word_count": len(response.text.split()) if response.status == 200 else None,
            "content_type": response.headers.get("Content-Type", b"").decode("utf-8", errors="ignore"),
        }
Replace example.com with your target domain. Adjust EXCLUDED_PATTERNS for your site’s URL structure.
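Run it the same way as in the workflow above (the output filename is just an example):
scrapy crawl seospider -o audit.jsonl -s JOBDIR=crawl_state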
When to Use Which
Screaming Frog:
- Quick audits under 500 URLs
- Results needed in minutes
- Visual site exploration
- Not comfortable with CLI
- Using Screaming Frog data with Redirects.net
Scrapy:
- Sites over 10,000 URLs
- Automated, scheduled crawls
- Custom extraction needs
- CI/CD integration
- Memory constraints
- Version-controlled configs
The Takeaway
Scrapy has a steeper setup curve than Screaming Frog, but it removes the practical limits GUI crawlers impose. No URL caps, no license fees, lower memory usage, and native automation.
Start small. Crawl a site you know. Use conservative settings. Compare output to Screaming Frog. The data will match, but you’ll have a tool that scales.