How to automate crawling through the WWW to fetch HTML and data, using Beautiful Soup and other parsers
This is a step-by-step hands-on tutorial explaining how to scrape websites for information.
NOTE: The content here reflects my personal opinions and is not intended to represent any employer (past or present). “PROTIP:” highlights information I haven’t seen elsewhere on the internet because it is hard-won, little-known but significant, based on my personal research and experience.
VIDEO: Brandon Hancock (aiwithbrandon) shows “Scrape Any Website for FREE Using DeepSeek & Crawl4AI” using https://brandonhancock.io/deepseek-scraper (code at https://github.com/bhancockio/) on https://skool.com/ai-developer-accelerator (FREE).
PROTIP: If an API is not available, scrape (extract/mine) specific information by parsing HTML from websites using the Scrapy web scraping (Spider) framework. See blog.
Install by
pip install Scrapy
Verify by running the scrapy command without parameters. The response:
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Notice that there are more commands when the command is run inside a Scrapy folder.
Manually verify that the sample websites provided by the Scrapy framework developers (books.toscrape.com and quotes.toscrape.com) still operate.
Download a sample spider project, assembled from a Pluralsight video tutorial:
git clone https://github.com/wilsonmar/scrapy.git
cd scrapy
ls
The repo contains several projects (books-export, quoting).
PROTIP: __pycache__ (cache) folders are created by the Python3 compiler to make subsequent executions a little faster in production code. In each such folder, a .pyc file contains the compiled bytecode for each module imported by the code. They are specified in .gitignore for the repo so they don’t get stored in GitHub.
PROTIP: On a Mac, hide all such folders with this command:
find . -name '__pycache__' -exec chflags hidden {} \;
On Windows, attrib does not read piped input, so loop over the folders instead (in a Command Prompt):
for /f "delims=" %d in ('dir /s /b /ad ^| findstr __pycache__') do attrib +h +s +r "%d"
See what commands are available when in an active project folder:
cd books-export
scrapy
Additional commands are:
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
List what crawlers Scrapy recognizes:
scrapy list
Still in folder books-export, run the crawl script defined in the spiders subfolder:
scrapy crawl BookCrawler
The output from the command is a series of console messages ending with something like this:
2019-12-25 14:22:53 [scrapy.extensions.feedexport] INFO: Stored json feed (1807 items) in: books.json
2019-12-25 14:22:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 47252,
 'downloader/request_count': 145,
 'downloader/request_method_count/GET': 145,
 'downloader/response_bytes': 786302,
 'downloader/response_count': 145,
 'downloader/response_status_count/200': 144,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 7372,
 'elapsed_time_seconds': 23.466027,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 21, 22, 53, 201722),
 'item_dropped_count': 453,
 'item_dropped_reasons_count/DropItem': 453,
 'item_scraped_count': 1807,
 'log_count/DEBUG': 1953,
 'log_count/INFO': 11,
 'log_count/WARNING': 453,
 'memusage/max': 52436992,
 'memusage/startup': 52436992,
 'request_depth_max': 51,
 'response_received_count': 145,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 144,
 'scheduler/dequeued/memory': 144,
 'scheduler/enqueued': 144,
 'scheduler/enqueued/memory': 144,
 'start_time': datetime.datetime(2019, 12, 25, 21, 22, 29, 735695)}
2019-12-25 14:22:53 [scrapy.core.engine] INFO: Spider closed (finished)
Switch to a text editor to see books.json.
This contains each book’s title, price, imageurl, bookurl.
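To peek at the exported feed from Python, a minimal sketch (the field names come from the crawler above; the 1807 count is from the run shown earlier):

```python
import json

# books.json is a JSON array of items exported by the BookCrawler run above.
with open("books.json") as f:
    books = json.load(f)

print(len(books), "books scraped")  # 1807 items in the run shown above
print(books[0])  # one dict with keys such as title, price, imageurl, bookurl
```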
View the file BookCrawler.py in the spiders folder.
Functions (from the bottom up) are: parsepage, extractData, writeTxt.
These are the result of edits after a template was generated.
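The repo’s BookCrawler.py splits its work into those helper functions; as a rough, hypothetical sketch (not the repo’s exact code), a spider for books.toscrape.com has this general shape:

```python
import scrapy

class BookCrawler(scrapy.Spider):
    """Illustrative sketch only; names and selectors are not the repo's exact code."""
    name = "BookCrawler"
    start_urls = ["http://books.toscrape.com/"]
    # Scrapy 1.8-style feed settings so "scrapy crawl BookCrawler" writes books.json:
    custom_settings = {"FEED_FORMAT": "json", "FEED_URI": "books.json"}

    def parse(self, response):
        # One item per book listed on the page:
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "imageurl": response.urljoin(book.css("img::attr(src)").get()),
                "bookurl": response.urljoin(book.css("h3 a::attr(href)").get()),
            }
        # Follow pagination until there is no "next" link:
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)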
Run:
cd ../quoting
scrapy crawl QuoteCrawler
2019-12-25 04:09:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 34936,
 'downloader/request_count': 122,
 'downloader/request_method_count/GET': 122,
 'downloader/response_bytes': 176221,
 'downloader/response_count': 122,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1897,
 'elapsed_time_seconds': 6.066887,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 11, 9, 38, 225122),
 'log_count/DEBUG': 123,
 'log_count/INFO': 10,
 'memusage/max': 52887552,
 'memusage/startup': 52887552,
 'request_depth_max': 4,
 'response_received_count': 122,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 121,
 'scheduler/dequeued/memory': 121,
 'scheduler/enqueued': 121,
 'scheduler/enqueued/memory': 121,
 'start_time': datetime.datetime(2019, 12, 25, 11, 9, 32, 158235)}
2019-12-25 04:09:38 [scrapy.core.engine] INFO: Spider closed (finished)
Switch to a text editor to view the file created: “quotes.toscrape.txt”.
Running the above avoids using these commands to generate the project:
scrapy startproject quotes
cd quotes
scrapy genspider QuoteSpider quotes.toscrape.com
The response:
Created spider 'QuoteSpider' using template 'basic' in module: quotes.spiders.QuoteSpider
… and then edit the generated code.
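For reference, the generated spider file starts out as little more than a stub, something like this (the exact class name and comments vary by Scrapy version):

```python
import scrapy

class QuotespiderSpider(scrapy.Spider):
    name = 'QuoteSpider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass  # editing starts here: extract quotes and follow pagination
```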
Now let’s examine the Python code.
Scrapy uses the Twisted Python networking engine to visit multiple URLs asynchronously (processing each request in a non-blocking way, without waiting for one request to finish before sending another).
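A sketch of what that looks like in practice: every Request yielded is handed to Twisted’s event loop, so many downloads are in flight at once (the URLs and the concurrency setting below are just illustrative):

```python
import scrapy

class ManyUrlsSpider(scrapy.Spider):
    """Illustrative only: Scrapy/Twisted downloads these concurrently, not one at a time."""
    name = "many_urls"
    custom_settings = {"CONCURRENT_REQUESTS": 16}  # 16 is the default; shown here for visibility

    def start_requests(self):
        urls = [f"http://quotes.toscrape.com/page/{n}/" for n in range(1, 11)]
        for url in urls:
            # Yielding schedules the request; Scrapy does not block waiting for each response.
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```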
Scrapy can set and rotate proxy, User Agent, and other HTTP headers dynamically.
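For example, a per-request User-Agent (or proxy) can be set through the Request’s headers and meta; a real rotation scheme would usually live in a downloader middleware. The values below are placeholders:

```python
import random
import scrapy

USER_AGENTS = [  # placeholder strings; substitute real browser User-Agent values
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleUA/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleUA/2.0",
]

class RotatingSpider(scrapy.Spider):
    name = "rotating"
    start_urls = ["http://quotes.toscrape.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={"User-Agent": random.choice(USER_AGENTS)},
                meta={"proxy": "http://proxy.example.com:8080"},  # hypothetical proxy address
            )

    def parse(self, response):
        yield {"url": response.url}
```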
Scrapy automatically handles cookies passed between the crawler and the server.
Scrapy’s Spiders extract “items” (attributes of a website) that flow through a pipeline for processing, such as pushing data to a Neo4j or MySQL database.
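A minimal sketch of that flow: the spider yields items, and a pipeline class (enabled in settings.py) receives each one in process_item(). The database call here is only a placeholder:

```python
import scrapy

# items.py-style definition:
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()

# pipelines.py-style pipeline; the storage step is a placeholder, not a real client:
class SaveToDatabasePipeline:
    def process_item(self, item, spider):
        # e.g. INSERT into MySQL or MERGE into Neo4j here
        spider.logger.info("Would store: %s", dict(item))
        return item  # returning the item passes it along to the next pipeline

# settings.py-style activation (module path is hypothetical):
# ITEM_PIPELINES = {"myproject.pipelines.SaveToDatabasePipeline": 300}
```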
Scrapy selectors use lxml, which is faster than the Python Beautiful Soup (bs4) library, to parse data from inside HTML and XML markup scraped from websites.
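To illustrate the difference, both snippets below extract the same title; the Beautiful Soup usage is shown only for comparison:

```python
from scrapy.selector import Selector
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = "<html><head><title>Quotes to Scrape</title></head><body></body></html>"

# Scrapy's lxml-backed Selector (same API as response.css / response.xpath in a spider):
sel = Selector(text=html)
print(sel.css("title::text").get())       # 'Quotes to Scrape'
print(sel.xpath("//title/text()").get())  # same result via XPath

# The Beautiful Soup equivalent:
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                  # 'Quotes to Scrape'
```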
Scrapy can export data in various formats (CSV, JSON, jsonlines, XML).
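For example, the same QuoteCrawler spider can export to any of those formats from the command line simply by changing the output file extension (file names here are just examples):

scrapy crawl QuoteCrawler -o quotes.csv
scrapy crawl QuoteCrawler -o quotes.json
scrapy crawl QuoteCrawler -o quotes.jl
scrapy crawl QuoteCrawler -o quotes.xml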
https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
“Automate the Boring Stuff” (free at https://inventwithpython.com) was among the most popular of all tech books. Its author Al Sweigart (@AlSweigart), in VIDEO: “Automating Your Browser and Desktop Apps” [deck], shows Selenium for web browsers. He also shows, in a VIDEO, his pyautogui library (pip install pyautogui), open-sourced on GitHub, automating MS Paint and Calc on Windows, and Flash apps (non-browser apps). Moving the mouse to the top left corner (0,0) raises the FailSafeException to stop the script running, since there is no hotkey recognition yet.
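A minimal pyautogui sketch (the coordinates and text are placeholders); slinging the mouse to (0, 0) at any time aborts the run via the fail-safe:

```python
import pyautogui

pyautogui.FAILSAFE = True  # default: moving the mouse to (0, 0) raises FailSafeException

pyautogui.moveTo(200, 300, duration=0.5)  # placeholder coordinates
pyautogui.click()                         # click at the current position
pyautogui.write("Hello from pyautogui")   # type into whatever window has focus
pyautogui.hotkey("ctrl", "s")             # e.g. send Ctrl+S
```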
VIDEO: “Web Scraping 101: A Million Dollar SaaS Idea” by Tech With Tim identifies sponsors of YouTube influencer videos by obtaining transcripts from YouTube videos (using Selenium or Playwright) and loading them into an LLM.
AgentQL is a robust query language that identifies elements on a webpage using natural language with the help of AI.
VIDEO: “This is how I scrape 99% websites via LLM” (https://www.youtube.com/watch?v=7kbQnLN2y_I) by AI Jason of https://www.skool.com/ai-builder-club/about ($37/month), using https://www.agentql.com/?utm_source=YouTube&utm_medium=Creator&utm_id=AIJason_102024
* Firecrawl
* 6:09 >>> prefixing a URL with https://r.jina.ai/ (e.g., https://r.jina.ai/https://openai.com) turns the website into markdown format (see the sketch after this list)
* Spider cloud supports 50,000 per minute
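A quick sketch of that Jina Reader pattern from the bullet above (the target URL is just an example):

```python
import requests

# Prefix any public URL with https://r.jina.ai/ to get a markdown rendering of the page.
target = "https://openai.com"  # example target from the video
resp = requests.get(f"https://r.jina.ai/{target}", timeout=30)
resp.raise_for_status()
print(resp.text[:500])  # first 500 characters of the returned markdown
```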
VIDEO: “How to Actually Scrape using LLMs (Free Local Deepseek R1 + crawl4ai + Knowledge Graph)” by Leonardo Grigorio | The AI Forge: https://www.youtube.com/watch?v=_Y_1ojMSNdg
VIDEO: “Free Scraper Turns ANY WEBSITE into LLM Knowledge INSTANTLY” by Income Stream Surfers: https://www.youtube.com/watch?v=oikVfYUEeS8
VIDEO: “You Don’t NEED AI to Scrape Data (it’s simple do this)” by John Watson Rooney: https://www.youtube.com/watch?v=_X9pS57BFJw
This is one of a series about Python: