How to automate crawling through the WWW to fetch HTML and data, using Beautiful Soup and other parsers
This is a step-by-step hands-on tutorial explaining how to scrape websites for information.
NOTE: The content here reflects my personal opinions and is not intended to represent any employer (past or present). “PROTIP:” highlights information I haven’t seen elsewhere on the internet because it is hard-won, little-known but significant, based on my personal research and experience.
VIDEO: Brandon Hancock (aiwithbrandon) shows “Scrape Any Website for FREE Using DeepSeek & Crawl4AI” using https://brandonhancock.io/deepseek-scraper (code at https://github.com/bhancockio/) on https://skool.com/ai-developer-accelerator (FREE).
PROTIP: If an API is not available, scrape (extract/mine) specific information by parsing HTML from websites using the Scrapy web scraping (Spider) framework. See blog.
Install by
pip install Scrapy
Verify by running the scrapy command without parameters. The response:
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Notice that there are more commands when the command is run inside a Scrapy folder.
Manually verify that the sample websites provided by the Scrapy framework developers (books.toscrape.com and quotes.toscrape.com) still operate.
Download a sample spider project, assembled from a Pluralsight video tutorial:
git clone https://github.com/wilsonmar/scrapy.git
cd scrapy
ls
The repo contains several projects (books-export, quoting).
PROTIP: __pycache__ (cache) folders are created by the Python3 compiler to make subsequent executions a little faster in production code. In each such folder, a .pyc file contains the compiled bytecode for each module imported by the code. They are specified in .gitignore for the repo so they don’t get stored in GitHub.
PROTIP: On a Mac, hide all such folders with this command:
find . -name '__pycache__' -exec chflags hidden {} \;
On Windows, attrib does not read piped input, so loop over the folders instead (in a Command Prompt):
for /f "delims=" %d in ('dir /s /b /ad ^| findstr __pycache__') do attrib +h +s +r "%d"
See what commands are available when in an active project folder:
cd books-export
scrapy
Additional commands are:
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
List what crawlers Scrapy recognizes:
scrapy list
Still in folder books-export, run the crawl script defined in the spiders subfolder:
scrapy crawl BookCrawler
The output from the command is a series of console messages ending with something like this:
2019-12-25 14:22:53 [scrapy.extensions.feedexport] INFO: Stored json feed (1807 items) in: books.json
2019-12-25 14:22:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 47252,
 'downloader/request_count': 145,
 'downloader/request_method_count/GET': 145,
 'downloader/response_bytes': 786302,
 'downloader/response_count': 145,
 'downloader/response_status_count/200': 144,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 7372,
 'elapsed_time_seconds': 23.466027,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 21, 22, 53, 201722),
 'item_dropped_count': 453,
 'item_dropped_reasons_count/DropItem': 453,
 'item_scraped_count': 1807,
 'log_count/DEBUG': 1953,
 'log_count/INFO': 11,
 'log_count/WARNING': 453,
 'memusage/max': 52436992,
 'memusage/startup': 52436992,
 'request_depth_max': 51,
 'response_received_count': 145,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 144,
 'scheduler/dequeued/memory': 144,
 'scheduler/enqueued': 144,
 'scheduler/enqueued/memory': 144,
 'start_time': datetime.datetime(2019, 12, 25, 21, 22, 29, 735695)}
2019-12-25 14:22:53 [scrapy.core.engine] INFO: Spider closed (finished)
Switch to a text editor to see books.json.
This contains each book’s title, price, imageurl, bookurl.
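To peek at the exported feed from Python, a minimal sketch (the field names come from the crawler above; the 1807 count is from the run shown earlier):

```python
import json

# books.json is a JSON array of items exported by the BookCrawler run above.
with open("books.json") as f:
    books = json.load(f)

print(len(books), "books scraped")  # 1807 items in the run shown above
print(books[0])  # one dict with keys such as title, price, imageurl, bookurl
```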
View the file BookCrawler.py in the spiders folder.
Functions (from the bottom up) are: parsepage, extractData, writeTxt.
These are the result of edits after a template was generated.
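The repo’s BookCrawler.py splits its work into those helper functions; as a rough, hypothetical sketch (not the repo’s exact code), a spider for books.toscrape.com has this general shape:

```python
import scrapy

class BookCrawler(scrapy.Spider):
    """Illustrative sketch only; names and selectors are not the repo's exact code."""
    name = "BookCrawler"
    start_urls = ["http://books.toscrape.com/"]
    # Scrapy 1.8-style feed settings so "scrapy crawl BookCrawler" writes books.json:
    custom_settings = {"FEED_FORMAT": "json", "FEED_URI": "books.json"}

    def parse(self, response):
        # One item per book listed on the page:
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "imageurl": response.urljoin(book.css("img::attr(src)").get()),
                "bookurl": response.urljoin(book.css("h3 a::attr(href)").get()),
            }
        # Follow pagination until there is no "next" link:
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)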
Run:
cd ../quoting
scrapy crawl QuoteCrawler
2019-12-25 04:09:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 34936,
 'downloader/request_count': 122,
 'downloader/request_method_count/GET': 122,
 'downloader/response_bytes': 176221,
 'downloader/response_count': 122,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1897,
 'elapsed_time_seconds': 6.066887,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 11, 9, 38, 225122),
 'log_count/DEBUG': 123,
 'log_count/INFO': 10,
 'memusage/max': 52887552,
 'memusage/startup': 52887552,
 'request_depth_max': 4,
 'response_received_count': 122,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 121,
 'scheduler/dequeued/memory': 121,
 'scheduler/enqueued': 121,
 'scheduler/enqueued/memory': 121,
 'start_time': datetime.datetime(2019, 12, 25, 11, 9, 32, 158235)}
2019-12-25 04:09:38 [scrapy.core.engine] INFO: Spider closed (finished)
Switch to a text editor to view the file created: “quotes.toscrape.txt”.
Running the above avoids using these commands to generate the project:
scrapy startproject quotes
cd quotes
scrapy genspider QuoteSpider quotes.toscrape.com
The response:
Created spider 'QuoteSpider' using template 'basic' in module: quotes.spiders.QuoteSpider
… and then edit the generated code.
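For reference, the generated spider file starts out as little more than a stub, something like this (the exact class name and comments vary by Scrapy version):

```python
import scrapy

class QuotespiderSpider(scrapy.Spider):
    name = 'QuoteSpider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass  # editing starts here: extract quotes and follow pagination
```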
Now let’s examine the Python code.
Scrapy uses the Twisted Python networking engine to visit multiple URLs asynchronously (processing each request in a non-blocking way, without waiting for one request to finish before sending another).
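A sketch of what that looks like in practice: every Request yielded is handed to Twisted’s event loop, so many downloads are in flight at once (the URLs and the concurrency setting below are just illustrative):

```python
import scrapy

class ManyUrlsSpider(scrapy.Spider):
    """Illustrative only: Scrapy/Twisted downloads these concurrently, not one at a time."""
    name = "many_urls"
    custom_settings = {"CONCURRENT_REQUESTS": 16}  # 16 is the default; shown here for visibility

    def start_requests(self):
        urls = [f"http://quotes.toscrape.com/page/{n}/" for n in range(1, 11)]
        for url in urls:
            # Yielding schedules the request; Scrapy does not block waiting for each response.
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```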
Scrapy can set and rotate proxy, User Agent, and other HTTP headers dynamically.
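For example, a per-request User-Agent (or proxy) can be set through the Request’s headers and meta; a real rotation scheme would usually live in a downloader middleware. The values below are placeholders:

```python
import random
import scrapy

USER_AGENTS = [  # placeholder strings; substitute real browser User-Agent values
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleUA/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleUA/2.0",
]

class RotatingSpider(scrapy.Spider):
    name = "rotating"
    start_urls = ["http://quotes.toscrape.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={"User-Agent": random.choice(USER_AGENTS)},
                meta={"proxy": "http://proxy.example.com:8080"},  # hypothetical proxy address
            )

    def parse(self, response):
        yield {"url": response.url}
```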
Scrapy automatically handles cookies passed between the crawler and the server.
Scrapy’s Spiders extract “items” (attributes of a website) that flow through a pipeline for processing, such as pushing data to a Neo4j or MySQL database.
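A minimal sketch of that flow: the spider yields items, and a pipeline class (enabled in settings.py) receives each one in process_item(). The database call here is only a placeholder:

```python
import scrapy

# items.py-style definition:
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()

# pipelines.py-style pipeline; the storage step is a placeholder, not a real client:
class SaveToDatabasePipeline:
    def process_item(self, item, spider):
        # e.g. INSERT into MySQL or MERGE into Neo4j here
        spider.logger.info("Would store: %s", dict(item))
        return item  # returning the item passes it along to the next pipeline

# settings.py-style activation (module path is hypothetical):
# ITEM_PIPELINES = {"myproject.pipelines.SaveToDatabasePipeline": 300}
```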
Scrapy selectors use lxml, which is faster than the Python Beautiful Soup (bs4) library, to parse data from inside HTML and XML markup scraped from websites.
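To illustrate the difference, both snippets below extract the same title; the Beautiful Soup usage is shown only for comparison:

```python
from scrapy.selector import Selector
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = "<html><head><title>Quotes to Scrape</title></head><body></body></html>"

# Scrapy's lxml-backed Selector (same API as response.css / response.xpath in a spider):
sel = Selector(text=html)
print(sel.css("title::text").get())       # 'Quotes to Scrape'
print(sel.xpath("//title/text()").get())  # same result via XPath

# The Beautiful Soup equivalent:
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                  # 'Quotes to Scrape'
```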
Scrapy can export data in various formats (CSV, JSON, jsonlines, XML).
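For example, the same QuoteCrawler spider can export to any of those formats from the command line simply by changing the output file extension (file names here are just examples):

scrapy crawl QuoteCrawler -o quotes.csv
scrapy crawl QuoteCrawler -o quotes.json
scrapy crawl QuoteCrawler -o quotes.jl
scrapy crawl QuoteCrawler -o quotes.xml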
https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
“Automate the Boring Stuff” (free at https://inventwithpython.com) was among the most popular of all tech books. Its author Al Sweigart (@AlSweigart), in VIDEO: “Automating Your Browser and Desktop Apps” [deck], shows Selenium for web browsers. He also shows, in a VIDEO, his pyautogui library (pip install pyautogui), open-sourced on GitHub, automating MS Paint and Calc on Windows, and Flash apps (non-browser apps). Moving the mouse to the top left corner (0,0) raises the FailSafeException to stop the script running, since there is no hotkey recognition yet.
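A minimal pyautogui sketch (the coordinates and text are placeholders); slinging the mouse to (0, 0) at any time aborts the run via the fail-safe:

```python
import pyautogui

pyautogui.FAILSAFE = True  # default: moving the mouse to (0, 0) raises FailSafeException

pyautogui.moveTo(200, 300, duration=0.5)  # placeholder coordinates
pyautogui.click()                         # click at the current position
pyautogui.write("Hello from pyautogui")   # type into whatever window has focus
pyautogui.hotkey("ctrl", "s")             # e.g. send Ctrl+S
```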
VIDEO: “Web Scraping 101: A Million Dollar SaaS Idea” by Tech With Tim identifies sponsors of YouTube influencer videos by obtaining transcripts from YouTube videos (using Selenium or Playwright) and loading them into an LLM.
AgentQL is a robust query language that identifies elements on a webpage using natural language with the help of AI.
VIDEO: “This is how I scrape 99% websites via LLM” (https://www.youtube.com/watch?v=7kbQnLN2y_I) by AI Jason of https://www.skool.com/ai-builder-club/about ($37/month), using https://www.agentql.com/?utm_source=YouTube&utm_medium=Creator&utm_id=AIJason_102024
* Firecrawl
* 6:09 >>> prefixing a URL with https://r.jina.ai/ (e.g., https://r.jina.ai/https://openai.com) turns the website into markdown format (see the sketch after this list)
* Spider cloud supports 50,000 per minute
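A quick sketch of that Jina Reader pattern from the bullet above (the target URL is just an example):

```python
import requests

# Prefix any public URL with https://r.jina.ai/ to get a markdown rendering of the page.
target = "https://openai.com"  # example target from the video
resp = requests.get(f"https://r.jina.ai/{target}", timeout=30)
resp.raise_for_status()
print(resp.text[:500])  # first 500 characters of the returned markdown
```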
VIDEO: “How to Actually Scrape using LLMs (Free Local Deepseek R1 + crawl4ai + Knowledge Graph)” by Leonardo Grigorio | The AI Forge: https://www.youtube.com/watch?v=_Y_1ojMSNdg
VIDEO: “Free Scraper Turns ANY WEBSITE into LLM Knowledge INSTANTLY” by Income Stream Surfers: https://www.youtube.com/watch?v=oikVfYUEeS8
VIDEO: “You Don’t NEED AI to Scrape Data (it’s simple do this)” by John Watson Rooney: https://www.youtube.com/watch?v=_X9pS57BFJw
This is one of a series about Python: