Cloudflare Turnstile clearances

Some publications are protected by Cloudflare Turnstile. Our approach is to solve the challenge manually and reuse the cf_clearance cookie (which can last a year) in Scrapy crawls.

Solve the challenge from the server’s IP

Note

You must solve the challenge using a browser and operating system for which curl_cffi has a fingerprint. The browser version need not match.

The clearance is bound to the browser and IP address that solved the challenge, so you must proxy your browser through the server before solving the challenge.

  1. Open a SOCKS proxy through the server that runs Scrapyd:

    ssh -D 1080 ocp29.open-contracting.org
    
  2. Launch a Chrome instance using this proxy server under an isolated profile. On macOS:

    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
      --proxy-server="socks5://localhost:1080" \
      --user-data-dir="/tmp/cf-solve" \
      --no-first-run \
      https://www.cloudflare.com/cdn-cgi/trace
    
  3. On the trace page, confirm the ip= value is the server’s IP, not yours.

  4. In the same window, open https://opentender.eu/download (or the relevant source) and solve the “Verify you are human” challenge. Then, collect:

    • Cookie: DevTools > Application > Cookies > the site’s domain > copy the cf_clearance value

    • User-Agent: DevTools > Console > run navigator.userAgent > copy the User-Agent value
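
The check in step 3 can also be scripted. This is a minimal sketch, assuming you paste the text of the trace page into a string. The function names and the sample trace text (which uses a documentation-only address) are hypothetical, not part of the project. It returns the IP version, which is needed later for CURL_IP_VERSION.

```python
import ipaddress


def parse_trace(body: str) -> dict[str, str]:
    """Parse the key=value lines of a /cdn-cgi/trace page into a dict."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without "="
            fields[key.strip()] = value.strip()
    return fields


def check_trace(body: str, server_ip: str) -> int:
    """Return the IP version (4 or 6) if ip= matches server_ip, else raise."""
    seen = ipaddress.ip_address(parse_trace(body)["ip"])
    expected = ipaddress.ip_address(server_ip)
    if seen != expected:
        raise ValueError(f"trace shows {seen}, expected {expected}: is the proxy in use?")
    return seen.version


# Made-up sample, not real output; 203.0.113.7 is a documentation-only address.
sample = "fl=0f0\nh=www.cloudflare.com\nip=203.0.113.7\nts=1700000000.0\n"
print(check_trace(sample, "203.0.113.7"))  # prints 4
```

Comparing parsed addresses rather than raw strings avoids false mismatches between equivalent spellings of the same IPv6 address.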

Create or update the settings bundle

In the Django admin under Scrapy settings bundles, add or edit a bundle named after the source with these settings:

Key               Value
CF_CLEARANCE      The cf_clearance value
CF_USER_AGENT     The User-Agent value
CURL_IMPERSONATE  The closest curl_cffi target name that is not newer than the Chrome/<version> in the User-Agent
CURL_IP_VERSION   4 or 6, matching the IP version of the ip= value

Then, link each relevant publication to the settings bundle.
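
Choosing the CURL_IMPERSONATE value can be sketched as follows. This is only an illustration of the "closest target not newer than Chrome/<version>" rule: the helper names are hypothetical, and KNOWN_TARGETS is an illustrative subset of Chrome target names, so check the list supported by the curl_cffi version installed on the server before relying on it.

```python
import re

# Illustrative subset only (an assumption); consult the installed curl_cffi
# for the targets it actually supports.
KNOWN_TARGETS = [
    "chrome110", "chrome116", "chrome119", "chrome120",
    "chrome123", "chrome124", "chrome131",
]


def chrome_major(user_agent: str) -> int:
    """Extract the major version from the Chrome/<version> token."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        raise ValueError("no Chrome/<version> token in User-Agent")
    return int(match.group(1))


def closest_target(user_agent: str, targets=KNOWN_TARGETS) -> str:
    """Return the highest plain chromeNNN target that is not newer than the browser."""
    major = chrome_major(user_agent)
    candidates = []
    for name in targets:
        # Only plain "chromeNNN" names are considered; suffixed variants are skipped.
        m = re.fullmatch(r"chrome(\d+)", name)
        if m and int(m.group(1)) <= major:
            candidates.append((int(m.group(1)), name))
    if not candidates:
        raise ValueError(f"no target at or below Chrome {major}")
    return max(candidates)[1]


ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)
print(closest_target(ua))  # prints chrome124
```

With this subset, a Chrome 126 User-Agent maps to chrome124 because chrome131 would be newer than the browser that solved the challenge.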