Scrape The Web With Crawlee
Written by Nikos Vaggalis   
Tuesday, 18 October 2022

Crawlee is an open source web scraping and browser automation library for Node.js, designed for productivity. It is made by Apify, the popular web scraping and automation platform.

Crawlee is the successor to the Apify SDK and escaped Apify's labs after four years in development. While the Apify SDK was always open source, the library's name caused users to think its features were restricted to the Apify platform, which was not true. For that reason, the Apify SDK was split into two libraries, Crawlee and the Apify SDK. Crawlee will retain all the crawling and scraping-related tools, while the Apify SDK will continue to exist, keeping only the Apify-specific features.

A lot of work has gone into making Crawlee a customizable library. For instance, you can start with simple HTTP-based scraping and later switch to browser-based automation, with Playwright or Puppeteer doing the work under the covers. You can also avoid getting blocked by using auto-generated human-like fingerprints, headless browsers, and smart proxy rotation.
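As a taster, a minimal HTTP-based crawler built on Crawlee's CheerioCrawler looks roughly like this (the start URL is just an example):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Called for every page; $ is a Cheerio handle over the parsed HTML
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('title').text();
        console.log(`${request.loadedUrl}: ${title}`);
        // Follow links found on the page and add them to the crawl queue
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);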

Crawlee's features

- Single interface for HTTP and headless browser crawling
- Persistent queue for URLs to crawl (breadth & depth first)
- Pluggable storage of both tabular data and files
- Automatic scaling with available system resources
- Integrated proxy rotation and session management
- Lifecycles customizable with hooks
- CLI to bootstrap your projects
- Configurable routing, error handling and retries
- Dockerfiles ready to deploy
- Written in TypeScript with generics

HTTP crawling

- Zero config HTTP2 support, even for proxies
- Automatic generation of browser-like headers
- Replication of browser TLS fingerprints
- Integrated fast HTML parsers: Cheerio and JSDOM
- Yes, you can scrape JSON APIs as well
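The proxy support mentioned above is wired in through a ProxyConfiguration object that the crawler rotates for you. A short sketch, with placeholder proxy URLs:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// The proxy URLs here are placeholders - supply your own
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $ }) {
        console.log(`Fetched ${request.url} through a rotated proxy`);
    },
});

await crawler.run(['https://crawlee.dev']);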

Real browser crawling

- JavaScript rendering and screenshots
- Headless and headful support
- Zero-config generation of human-like fingerprints
- Automatic browser management
- Use Playwright and Puppeteer with the same interface
- Chrome, Firefox, WebKit and many others
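Because HTTP and browser crawlers share the same interface, switching the Cheerio example above to a real browser is mostly a matter of swapping the class. A rough sketch using PlaywrightCrawler:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // page is a Playwright Page, so JavaScript-rendered content is available
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`${request.loadedUrl}: ${title}`);
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);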

If you have Node.js installed, you can try Crawlee by running the command below and choosing one of the available templates for your crawler.

npx crawlee create my-crawler

Then choose from the drop-down list of TypeScript or JavaScript templates and hit Enter on the one you like.

 

After that it will automatically generate some boilerplate code and install all the dependencies you need to get started.

After the install is done, cd into your new project's folder and you'll notice that a number of files are already there: a Dockerfile, a package.json, and a tsconfig.json if you're using TypeScript. Do

npm start

and you're good to go.
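Whichever template you pick, your request handlers will typically persist results with Crawlee's storage helpers - tabular records go to a dataset and files to a key-value store. A brief sketch (the record and key names are made up for illustration):

import { Dataset, KeyValueStore } from 'crawlee';

// Tabular records end up under ./storage/datasets/default by default
await Dataset.pushData({ url: 'https://crawlee.dev', title: 'Crawlee' });

// Files and other blobs go to a key-value store instead
await KeyValueStore.setValue('sample.json', { note: 'illustrative value' });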

Crawlee is open-source and runs anywhere, but since it's developed by Apify, it's easy to set up on the Apify platform and run in the cloud, which is Apify's primary endeavor.


More Information

Crawlee.dev

Crawlee GitHub

Related Articles

Headless Chrome and the Puppeteer Library for Scraping and Testing the Web 

 



Last Updated ( Tuesday, 18 October 2022 )