Web Scraper Application in Python (Requests + BeautifulSoup)
This project demonstrates how to build a web scraper in Python to extract data from websites. You will learn how to fetch webpage content using requests and parse HTML using BeautifulSoup.
The scraper in this project extracts book titles from a practice site, books.toscrape.com, but you can modify it to collect other types of data, such as article titles from a blog. It is a beginner-friendly way to practice web scraping, error handling, and working with HTML structures.
How the Web Scraper Works
- The program sends a GET request to the target URL using requests.
- The HTML content of the page is parsed with BeautifulSoup.
- Specific elements (like article titles) are extracted using find_all or CSS selectors.
- The extracted data is printed to the terminal or can be stored for further use.
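In code, that whole flow fits in a few lines. Here is a minimal sketch (the full script below adds error handling and structure):

import requests
from bs4 import BeautifulSoup

response = requests.get("http://books.toscrape.com/", timeout=10)  # 1. send a GET request
soup = BeautifulSoup(response.content, "html.parser")              # 2. parse the HTML
titles = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]   # 3. extract elements
print(titles)                                                      # 4. print the data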
How to Use This Program
To use this web scraper, you'll first need to install the two required libraries: requests and beautifulsoup4 (which provides BeautifulSoup). Open your terminal or command prompt and run:
pip install requests beautifulsoup4
Once the libraries are installed:
- Save the code: Save the code below as a Python file (e.g., web_scraper.py).
- Run the script: Open your terminal, navigate to that directory, and run python web_scraper.py.
The script as written targets books.toscrape.com, a site built for scraping practice, where each book title sits in an h3 tag. To scrape a different website, adjust the parameters passed to find_all; many WordPress blogs, for instance, wrap article titles in h2 tags with the class entry-title, as shown in the sketch below.
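For example, here is a minimal sketch of scraping a hypothetical WordPress-style blog; the URL and the h2.entry-title selector are placeholder assumptions you would replace with the real site's values:

import requests
from bs4 import BeautifulSoup

# Hypothetical blog URL -- substitute the site you actually want to scrape.
response = requests.get("https://example-blog.com/", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

# CSS-selector equivalent of find_all('h2', class_='entry-title')
for heading in soup.select("h2.entry-title"):
    print(heading.get_text(strip=True))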
Project Level: Intermediate
You can copy the code snippet below, paste it into any Python editor you have, and run it.
Steps:
Step 1: Copy the code below.
Step 2: Paste the code into your chosen editor.
Step 3: Save the code with a filename and the .py extension.
Step 4: Run it (press F5 if using Python IDLE).
# web_scraper.py
import requests
from bs4 import BeautifulSoup


def scrape_website(url):
    """
    Fetches a webpage, parses its HTML, and extracts titles.

    Args:
        url (str): The URL of the website to scrape.
    """
    try:
        # Send a GET request to the URL
        response = requests.get(url, timeout=10)

        # Raise an exception for bad status codes (e.g., 404, 500)
        response.raise_for_status()

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # On books.toscrape.com, each book title sits in an <h3> tag.
        # Many blogs use different markup (e.g., <h2 class="entry-title">),
        # so adjust the parameters below for the site you are scraping.
        titles = soup.find_all('h3')

        if not titles:
            print("No titles found. The website structure may have changed.")
            return

        print("\n--- Scraped Titles ---")
        for title in titles:
            # Extract the text from the 'a' tag within the heading.
            # books.toscrape.com truncates the visible link text, so we
            # prefer the full title stored in the 'title' attribute.
            link = title.find('a')
            if link:
                print(link.get('title') or link.text.strip())
        print("----------------------")

    except requests.exceptions.RequestException as e:
        print(f"Network error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


def main():
    """
    Main function to run the Web Scraper.
    """
    print("--- Python Web Scraper ---")

    # We will use a sample site that is explicitly meant for scraping practice.
    url_to_scrape = "http://books.toscrape.com/"
    print(f"Scraping data from: {url_to_scrape}")
    scrape_website(url_to_scrape)

    # You can easily prompt the user for a URL to make it more dynamic:
    # user_url = input("\nEnter the URL to scrape: ").strip()
    # if user_url:
    #     scrape_website(user_url)
    # else:
    #     print("URL cannot be empty.")


# This ensures that main() is called only when the script is executed directly.
if __name__ == "__main__":
    main()
What You Will Learn From This Project
- How to send HTTP requests in Python
- How to parse HTML with BeautifulSoup
- How to extract structured data from webpages
- Error handling for network requests
- Basics of web scraping ethics and website structure
Limitations and Notes
This scraper is intended for educational purposes and may break if the website's HTML structure changes. It does not handle JavaScript-rendered content. Always check a website's robots.txt and terms of use before scraping.
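Python's standard library can help with the robots.txt check; here is a minimal sketch using urllib.robotparser:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser("http://books.toscrape.com/robots.txt")
robots.read()  # fetch and parse the robots.txt file

# can_fetch() reports whether the given user agent may request the URL
print(robots.can_fetch("*", "http://books.toscrape.com/"))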
Ideas to Extend This Project
- Make the scraper dynamic by asking the user for a URL
- Save the scraped data to a CSV or JSON file (see the sketch after this list)
- Scrape multiple pages using pagination
- Add parsing for images, links, or other HTML elements
- Build a GUI to display scraped data interactively
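As a starting point for the CSV idea, here is a minimal sketch; it assumes you have collected the titles into a list of strings rather than printing them:

import csv

titles = ["Example Title 1", "Example Title 2"]  # placeholder data from the scraper

with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])             # header row
    writer.writerows([t] for t in titles)  # one row per title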
