Download Entire Website Text
Background
I previously worked on a GenAI RAG system where the reference data came from a public website.
For that project, I built a simple Selenium-based crawler that opened a page, followed its sub-pages, and then followed the links on those sub-pages down to a predetermined depth. This crawler quickly proved inadequate: every depth level had a different page structure, so I effectively had to rewrite the extraction logic for each level. Worse, whenever the site structure changed, for example when a first-level page linked directly to a third-level page, the crawler broke.
So in the second version I applied a single, unified crawling routine to every page and let it recurse through the site with no fixed depth limit. This worked at first, but as more and more pages were opened, the Chrome driver controlled by Selenium would eventually crash with the error message "lost connection to the driver."
Writing and maintaining a dedicated Selenium crawler for this one task was wasteful, and when no login is required, driving a browser just to fetch public pages is overkill. So I built a more general and lightweight solution: using BeautifulSoup to extract the text of web pages (including text from PDFs).
Development Process
Step 1: Technology Selection and Architecture Design
Considering the issues with the previous Selenium solution, I chose a more lightweight technology stack (see the import sketch after this list):
- BeautifulSoup for HTML parsing, replacing Selenium's browser automation
- requests for HTTP requests, which is more stable and uses far fewer resources than a driven browser
- pdfplumber for PDF text extraction, with PyPDF2 as a fallback
- JSON for result storage, convenient for subsequent data processing
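As a rough sketch, the stack above maps onto the following imports (the module names correspond to the PyPI packages requests, beautifulsoup4, pdfplumber, and PyPDF2; PdfReader assumes PyPDF2 2.x):

import json                    # result storage
import requests                # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing
import pdfplumber              # primary PDF text extraction
from PyPDF2 import PdfReader   # fallback PDF text extraction (PyPDF2 >= 2.0)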
Step 2: Implementing Core Crawling Logic
The entire crawling process is divided into two phases:
Phase 1: Web Content Crawling
I designed a queue-based breadth-first search: starting from the initial URL, it extracts each page's content and links, and adds newly discovered same-domain links to the crawling queue (a sketch follows the list below). This ensures:
- Systematic traversal of the entire website
- Avoidance of duplicate visits to the same page
- Control of crawling scope within the same domain
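A minimal sketch of that breadth-first loop, assuming the hypothetical function name crawl and the parameters max_pages and delay (the repository's actual implementation may differ in detail):

import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(base_url, max_pages=500, delay=0.1):
    """Breadth-first crawl of same-domain pages; returns (page_texts, pdf_links)."""
    domain = urlparse(base_url).netloc
    queue = deque([base_url])
    visited = set()
    page_texts, pdf_links = [], []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # graceful degradation on network errors
        soup = BeautifulSoup(resp.text, "html.parser")
        # Keep the visible text together with its source URL
        page_texts.append([soup.get_text(separator=" ", strip=True), url])
        # Collect PDF links for phase 2; queue same-domain HTML links
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.lower().endswith(".pdf"):
                pdf_links.append(link)
            elif urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
        time.sleep(delay)  # request delay: be polite to the server
    return page_texts, pdf_links

The visited set provides the duplicate-visit protection, and the netloc comparison keeps the crawl inside the starting domain.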
Phase 2: PDF File Processing
After all PDF links have been collected in the first phase, the PDF files are processed separately (see the sketch after this list):
- Download PDFs to temporary files
- Prioritize pdfplumber for text extraction, fallback to PyPDF2 if it fails
- Clean and format the extracted text content
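A sketch of that download-and-fallback logic, using a hypothetical extract_pdf_text helper and assuming PyPDF2 2.x (where the reader class is PdfReader):

import os
import tempfile

import pdfplumber
import requests
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_url):
    """Download a PDF and extract its text, preferring pdfplumber over PyPDF2."""
    resp = requests.get(pdf_url, timeout=30)
    resp.raise_for_status()
    # Write to a temporary file so both parsers can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(resp.content)
        path = tmp.name
    try:
        try:
            with pdfplumber.open(path) as pdf:
                text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        except Exception:
            # Fallback parser (PdfReader exists in PyPDF2 >= 2.0)
            reader = PdfReader(path)
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
        return " ".join(text.split())  # basic cleanup: collapse whitespace
    finally:
        os.remove(path)  # resource cleanup: delete the temporary file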
Step 3: Optimization and Error Handling
Several optimization measures were added:
- Request delays: Avoid putting pressure on the target server
- URL normalization: Handle relative paths and query parameters (see the helper sketch after this list)
- Error handling: Graceful degradation when network exceptions or parsing errors occur
- Progress display: Real-time display of crawling progress and statistics
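As an illustration of the URL normalization step, a minimal helper with the hypothetical name normalize_url; the real project may treat query strings differently, here they are stripped along with fragments:

from urllib.parse import urljoin, urlparse, urlunparse

def normalize_url(base_url, href):
    """Resolve a relative link, then drop the query string and fragment."""
    absolute = urljoin(base_url, href)  # handle relative paths
    parts = urlparse(absolute)._replace(query="", fragment="")
    return urlunparse(parts).rstrip("/")

# Example: normalize_url("https://example.com/docs/", "../about?page=2#team")
# -> "https://example.com/about"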
Step 4: Data Storage and Output
All extracted text content and corresponding URLs are saved in JSON format:
[
  ["Page text content", "Page URL"],
  ["PDF text content", "PDF file URL"]
]
This format is convenient for subsequent data analysis and processing, and is particularly well suited to the vectorization step of a RAG pipeline.
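A sketch of how such a list of [text, url] pairs might be written out; the save_to_json signature and filename pattern here are illustrative, not necessarily what the repository uses:

import json
from datetime import datetime

def save_to_json(results, filename=None):
    """Write the [[text, url], ...] pairs to a UTF-8 JSON file."""
    if filename is None:
        filename = f"website_text_{datetime.now():%Y%m%d_%H%M%S}.json"
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return filename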
Code Highlights
Several key features of this tool:
- Two-phase processing: First crawl all web pages, then process PDFs centrally, avoiding the complexity of mixed processing
- Intelligent URL filtering: Automatically exclude CSS, JS, images, and other non-content files (see the sketch after this list)
- Dual PDF parsing: pdfplumber with a PyPDF2 fallback maximizes the text extraction success rate
- Domain restriction: Prevent the crawler from "wandering" to other websites
- Resource cleanup: Temporary PDF files are deleted automatically so they do not accumulate on disk
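The URL-filtering idea can be expressed as a small predicate; the exclusion list below is an assumption, not the repository's exact set:

from urllib.parse import urlparse

# Illustrative exclusion list; the actual project may filter a different set.
SKIP_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg",
                   ".ico", ".woff", ".woff2", ".zip", ".mp4"}

def is_content_url(url):
    """Return True for URLs worth crawling, False for static assets."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in SKIP_EXTENSIONS)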
Usage Example
# Create a downloader instance
downloader = WebsiteDownloader(
    base_url="https://example.com",
    delay=0.1,  # 100 ms delay between requests
)

# Start crawling (at most 500 pages)
result = downloader.crawl_website(max_pages=500)

# Save the results to a JSON file
filename = downloader.save_to_json()
Summary
Compared to the previous Selenium solution, this BeautifulSoup-based website downloader has clear advantages:
- Better stability: Avoids browser driver crash issues
- Lower resource consumption: No need to launch a full browser
- Stronger processing capability: Supports content extraction from large-scale websites
- Better versatility: Can handle any publicly accessible website
This tool is particularly suitable for preparing reference data for RAG systems, or for any scenario that requires batch extraction of website text. The whole project is just over 300 lines of code, yet fully functional and practical.
GitHub Repository: https://github.com/Harvey-Labs/download-entire-website-text