Web Scraper in Python
About the project: This is a Python project for a Web Scraper.
This program will connect to a website, extract specific data (in this case, the book titles listed on the practice site http://books.toscrape.com/), and display them.
For this to work, we'll use two libraries:
requests to get the webpage's content and
BeautifulSoup to parse the HTML.
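As a quick illustration of how these two libraries work together, here is a minimal sketch that fetches a page and prints the text of its <title> tag (it assumes the page actually has one; the URL is the same practice site used later in this project):

import requests
from bs4 import BeautifulSoup

# Minimal sketch: fetch a page and print the text of its <title> tag.
response = requests.get("http://books.toscrape.com/", timeout=10)
response.raise_for_status()  # stop here on a 4xx/5xx status code
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.get_text(strip=True))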
How to use this program:
To use this Web Scraper, you'll first need to install the two required libraries: requests and BeautifulSoup. Open your terminal or command prompt and run:
pip install requests beautifulsoup4
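If you want to confirm that both libraries installed correctly before running the scraper, an optional quick check from Python is:

# Optional check: both imports succeed and report their versions.
import requests
import bs4

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)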
Once the libraries are installed, you can run the Python script:
- Save the code: Save the code as a Python file (e.g., web_scraper.py).
- Run the script: Open your terminal, navigate to that directory, and run python web_scraper.py.
The selectors in this code are written for the sample site used below (http://books.toscrape.com/), where each title sits inside an h3 tag. To scrape a different website, adjust the find_all method's parameters to match that site's HTML, for example 'h2' tags with class_='entry-title' on many WordPress-style blogs, as in the sketch below.
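For instance, if the target were a WordPress-style blog whose post headings are rendered as h2 elements with the class entry-title, the extraction step might look like this sketch (the URL is only a placeholder and the tag/class names are assumptions about that hypothetical layout, not a tested site):

import requests
from bs4 import BeautifulSoup

# Hypothetical layout: post titles rendered as <h2 class="entry-title"><a>...</a></h2>.
# "https://example.com/blog/" is a placeholder, not a real endpoint to scrape.
response = requests.get("https://example.com/blog/", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

for heading in soup.find_all("h2", class_="entry-title"):
    link = heading.find("a")
    if link:
        print(link.get_text(strip=True))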
Project Level: Intermediate
You can copy the code snippet below, paste it into any Python editor you have, and run it.
Steps: Follow these steps:
Step 1: Copy the code below.
Step 2: Paste the code into your chosen editor.
Step 3: Save the code with a filename and the .py extension.
Step 4: Run it (press F5 if you are using Python IDLE).
# web_scraper.py

import requests
from bs4 import BeautifulSoup


def scrape_website(url):
    """
    Fetches a webpage, parses its HTML, and extracts the titles listed on it.

    Args:
        url (str): The URL of the website to scrape.
    """
    try:
        # Send a GET request to the URL
        response = requests.get(url, timeout=10)

        # Raise an exception for bad status codes (e.g., 404, 500)
        response.raise_for_status()

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # On http://books.toscrape.com/ every book title sits in an <h3> tag
        # whose <a> element carries the full title in its 'title' attribute.
        # This is specific to that site and must be adjusted for different
        # websites (e.g. 'h2' with class_='entry-title' on many WordPress blogs).
        title_tags = soup.find_all('h3')

        if not title_tags:
            print("No titles found. The website structure may have changed.")
            return

        print("\n--- Scraped Titles ---")
        for tag in title_tags:
            # Extract the title from the 'a' tag within the 'h3' tag, preferring
            # the 'title' attribute over the (possibly truncated) link text.
            link = tag.find('a')
            if link:
                print((link.get('title') or link.get_text()).strip())
        print("----------------------")

    except requests.exceptions.RequestException as e:
        print(f"Network error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


def main():
    """
    Main function to run the Web Scraper.
    """
    print("--- Python Web Scraper ---")

    # We will use a sample site built for scraping practice for this example.
    url_to_scrape = "http://books.toscrape.com/"
    print(f"Scraping data from: {url_to_scrape}")
    scrape_website(url_to_scrape)

    # You can easily prompt the user for a URL to make it more dynamic:
    # user_url = input("\nEnter the URL to scrape: ").strip()
    # if user_url:
    #     scrape_website(user_url)
    # else:
    #     print("URL cannot be empty.")


# This ensures that main() is called only when the script is executed directly.
if __name__ == "__main__":
    main()
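As an optional extension (not part of the script above), you could take the URL from the command line instead of editing the file. This sketch assumes you saved the code as web_scraper.py, so its scrape_website function can be imported; the filename scrape_cli.py is just a suggestion:

# scrape_cli.py -- hypothetical companion script kept in the same directory as web_scraper.py
import sys

from web_scraper import scrape_website  # the __main__ guard above keeps main() from running on import


def main():
    """Run the scraper on a URL given on the command line, or on the practice site."""
    print("--- Python Web Scraper ---")
    url_to_scrape = sys.argv[1] if len(sys.argv) > 1 else "http://books.toscrape.com/"
    print(f"Scraping data from: {url_to_scrape}")
    scrape_website(url_to_scrape)


if __name__ == "__main__":
    main()

Run it as python scrape_cli.py followed by the URL you want to scrape, or with no argument to fall back to the practice site.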