Cloudflare Turnstile clearances
Some publications are protected by Cloudflare Turnstile. Our approach is to solve the challenge manually and reuse the cf_clearance cookie (which can last a year) in Scrapy crawls.
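To illustrate what "reusing" the clearance means at the HTTP level, here is a minimal sketch that builds the two request headers that have to travel together. The function name `clearance_headers` and the placeholder values are hypothetical. How the project's Scrapy code actually attaches the cookie is not shown here, so treat this as an illustration of the idea only.

```python
def clearance_headers(cf_clearance: str, user_agent: str) -> dict[str, str]:
    """Build headers that present a previously obtained clearance.

    Both values must come from the same browser session: the cookie is
    only honoured alongside the User-Agent that solved the challenge.
    """
    if not cf_clearance or not user_agent:
        raise ValueError("both cf_clearance and user_agent are required")
    return {
        "User-Agent": user_agent,
        "Cookie": f"cf_clearance={cf_clearance}",
    }


# Placeholder values, not real credentials.
headers = clearance_headers("EXAMPLE_TOKEN", "Mozilla/5.0 (example)")
print(headers)
```

Sending the cookie with a different User-Agent, or from a different IP, would be expected to trigger the challenge again.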
Solve the challenge from the server’s IP
Note
You must solve the challenge using a browser and operating system for which curl_cffi has a fingerprint. The browser version need not match.
The clearance is bound to the browser and IP that solved the challenge, so you need to proxy your browser through the server and then solve the challenge.
Open a SOCKS proxy through the server that runs Scrapyd:
ssh -D 1080 ocp29.open-contracting.org
Launch a Chrome instance using this proxy server under an isolated profile. On macOS:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --proxy-server="socks5://localhost:1080" \
  --user-data-dir="/tmp/cf-solve" \
  --no-first-run \
  https://www.cloudflare.com/cdn-cgi/trace
On the trace page, confirm the ip= value is the server’s IP, not yours.
In the same window, open https://opentender.eu/download (or the relevant source) and solve the “Verify you are human” challenge. Then, collect:
Cookie: DevTools > Application > Cookies > the site’s domain > copy the cf_clearance value
User-Agent: DevTools > Console > run navigator.userAgent > copy the User-Agent value
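The trace page check above can also be done programmatically. The sketch below parses the `key=value` lines that the trace page returns and compares the `ip` field with the expected server address. The helper names and the sample address `203.0.113.10` (a documentation-only IP) are hypothetical, and the sample body is abbreviated.

```python
def parse_trace(body: str) -> dict[str, str]:
    """Parse the key=value lines of a cdn-cgi/trace response."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            fields[key.strip()] = value.strip()
    return fields


def exits_via_server(body: str, server_ip: str) -> bool:
    """Return True if the trace reports the server's IP as the client IP."""
    return parse_trace(body).get("ip") == server_ip


# Abbreviated sample response pasted from the browser window.
sample = "h=www.cloudflare.com\nip=203.0.113.10\nloc=XX\n"
print(exits_via_server(sample, "203.0.113.10"))
```

If the check fails, the browser is probably not using the SOCKS proxy, and a clearance obtained in that window would be bound to the wrong IP.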
Create or update the settings bundle
In the Django admin under Scrapy settings bundles, add or edit a bundle named after the source with these settings:
| Key | Value |
|---|---|
| CF_CLEARANCE | The |
| CF_USER_AGENT | The |
| CURL_IMPERSONATE | The closest target name not newer than |
| CURL_IP_VERSION | |
Then, link each relevant publication to the settings bundle.
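As a rough aid for choosing CURL_IMPERSONATE, the sketch below derives the Chrome major version from the collected User-Agent and picks the closest target that is not newer. It assumes that "not newer than" refers to the major version of the browser that solved the challenge. The target list is an illustrative subset, not the authoritative list, so check the installed curl_cffi release for the names it supports. All function names are hypothetical, and CURL_IP_VERSION is left out because its value is not specified here.

```python
import re

# Illustrative subset of Chrome targets (assumption); consult curl_cffi itself.
KNOWN_CHROME_TARGETS = ["chrome110", "chrome116", "chrome120", "chrome124", "chrome131"]


def chrome_major(user_agent: str) -> int:
    """Extract the Chrome major version from a User-Agent string."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    if match is None:
        raise ValueError("User-Agent does not look like Chrome")
    return int(match.group(1))


def closest_target(major: int, targets=KNOWN_CHROME_TARGETS) -> str:
    """Pick the highest-versioned target that is not newer than `major`."""
    eligible = [
        (int(name.removeprefix("chrome")), name)
        for name in targets
        if int(name.removeprefix("chrome")) <= major
    ]
    if not eligible:
        raise ValueError(f"no target is old enough for Chrome {major}")
    return max(eligible)[1]


def build_bundle(cf_clearance: str, user_agent: str) -> dict[str, str]:
    """Assemble the bundle settings that can be derived from the collected values."""
    return {
        "CF_CLEARANCE": cf_clearance,
        "CF_USER_AGENT": user_agent,
        "CURL_IMPERSONATE": closest_target(chrome_major(user_agent)),
    }


ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)
print(build_bundle("EXAMPLE_TOKEN", ua))
```

Picking an older target rather than a newer one is consistent with the note that the browser version need not match exactly.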