Last modified: Jan 19, 2026 by Alexander Williams
BeautifulSoup News Scraping Tutorial
Web scraping is a powerful skill. It lets you gather data from websites. News sites are a common target. You can track stories and trends.
This guide teaches you how to scrape headlines and links. We will use Python's BeautifulSoup library. It is perfect for beginners.
What You Need to Start
You need Python installed on your computer. Basic Python knowledge is helpful. We will use two main libraries.
The first is requests, which fetches web page content. The second is beautifulsoup4, which parses the HTML.
Install them using pip. Open your terminal or command prompt. Run the command below.
pip install requests beautifulsoup4
Understanding the Basics of BeautifulSoup
BeautifulSoup turns HTML into a tree. This tree is easy to navigate. You can search for specific tags.
Think of a news website. Headlines are often in <h2> or <h3> tags. Links are inside <a> tags.
Our goal is to find these tags. Then we extract the text and URLs. For a deeper foundation, read our Web Scraping Guide with BeautifulSoup for Beginners.
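As a quick illustration of the tree idea, here is a minimal sketch using a tiny hand-written HTML snippet (not a real news page) with the same headline pattern we target later:

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet standing in for a news page
html = """
<html><body>
  <h2 class="headline"><a href="/story/1">First Story</a></h2>
  <h2 class="headline"><a href="/story/2">Second Story</a></h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The parsed tree can be searched by tag name and attributes
for h2 in soup.find_all("h2", class_="headline"):
    print(h2.get_text(strip=True), "->", h2.find("a")["href"])
```

The same two calls, find_all() and get_text(), carry the whole tutorial.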
Step 1: Fetch the Web Page
We start by getting the HTML. Use the requests.get() function. Provide the URL of the news page.
Always check if the request was successful. Use the .status_code attribute. A 200 code means success.
import requests
url = 'https://example-news-site.com/latest'
response = requests.get(url)
if response.status_code == 200:
    html_content = response.text
    print("Page fetched successfully!")
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
Step 2: Parse HTML with BeautifulSoup
Now we parse the HTML. Create a BeautifulSoup object using Python's built-in 'html.parser'.
The BeautifulSoup() constructor does this. It creates a searchable object.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
print(type(soup)) # Confirms the object is created
Step 3: Find Headlines and Links
This is the core step. We need to inspect the website's structure. Use your browser's developer tools.
Right-click on a headline. Select "Inspect". Look for a common pattern.
Let's assume headlines are in <h2 class="headline"> tags. We use soup.find_all().
# Find all headline elements
headline_tags = soup.find_all('h2', class_='headline')
for h2_tag in headline_tags:
    # Extract the text
    headline_text = h2_tag.get_text(strip=True)

    # Find the link inside the headline (common pattern)
    link_tag = h2_tag.find('a')
    if link_tag and link_tag.has_attr('href'):
        link_url = link_tag['href']

        # Print the result
        print(f"Headline: {headline_text}")
        print(f"Link: {link_url}")
        print("-" * 50)
Important: Website structures change. Your selectors might need updates. Always test your code.
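BeautifulSoup also supports CSS selectors via soup.select(), which can be a more compact way to express the same search. A sketch on an inline snippet, assuming the same hypothetical "headline" class as above:

```python
from bs4 import BeautifulSoup

# Inline sample markup; a real page's structure will differ
html = """
<h2 class="headline"><a href="/story/energy">Energy News</a></h2>
<h2 class="headline"><a href="/story/markets">Market News</a></h2>
<h2 class="other">Not a headline</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# 'h2.headline a' matches <a> tags inside <h2 class="headline"> only
for a_tag in soup.select("h2.headline a"):
    print(a_tag.get_text(strip=True), a_tag["href"])
```

If you already know CSS, select() often reads more naturally than nested find_all() calls.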
Step 4: Handle Relative Links
News sites often use relative links. A link might be "/story/123". This is not a full web address.
We need to convert it to an absolute URL. Use Python's urllib.parse.urljoin() function.
from urllib.parse import urljoin
base_url = 'https://example-news-site.com'
for h2_tag in headline_tags:
    link_tag = h2_tag.find('a')
    if link_tag and link_tag.has_attr('href'):
        relative_url = link_tag['href']
        # Convert to absolute URL
        absolute_url = urljoin(base_url, relative_url)
        print(f"Full Link: {absolute_url}")
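It helps to see how urljoin() behaves on the link shapes you will actually meet. A few quick cases, using the tutorial's example domain:

```python
from urllib.parse import urljoin

base_url = "https://example-news-site.com/latest"

# Relative path: resolved against the site root
print(urljoin(base_url, "/story/123"))  # https://example-news-site.com/story/123

# Already-absolute URL: returned unchanged
print(urljoin(base_url, "https://other-site.com/story/9"))

# Protocol-relative URL: inherits the base URL's scheme
print(urljoin(base_url, "//cdn.example-news-site.com/img.png"))
```

Because absolute URLs pass through unchanged, you can safely run every scraped href through urljoin() without checking its form first.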
Step 5: Store the Scraped Data
Printing data is fine for testing. For real use, save it. Use a list of dictionaries or a file.
Here we store data in a list. Then we can save it as JSON or CSV.
import json
scraped_data = []
for h2_tag in headline_tags:
    headline_text = h2_tag.get_text(strip=True)
    link_tag = h2_tag.find('a')
    if link_tag and link_tag.has_attr('href'):
        relative_url = link_tag['href']
        absolute_url = urljoin(base_url, relative_url)
        article_info = {
            'headline': headline_text,
            'url': absolute_url
        }
        scraped_data.append(article_info)

# Save to a JSON file
with open('news_headlines.json', 'w', encoding='utf-8') as f:
    json.dump(scraped_data, f, indent=4, ensure_ascii=False)

print(f"Saved {len(scraped_data)} articles to 'news_headlines.json'")
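If you prefer a spreadsheet-friendly format, the same list of dictionaries can be written as CSV with the standard library's csv.DictWriter. A sketch using hypothetical sample rows in place of live scraped results:

```python
import csv

# Hypothetical sample rows; in the full script this would be scraped_data
scraped_data = [
    {"headline": "Major Breakthrough in Renewable Energy",
     "url": "https://example-news-site.com/story/energy-breakthrough"},
    {"headline": "Global Markets React to New Policy",
     "url": "https://example-news-site.com/story/markets-react"},
]

# newline="" prevents blank lines between rows on Windows
with open("news_headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["headline", "url"])
    writer.writeheader()
    writer.writerows(scraped_data)

print(f"Saved {len(scraped_data)} articles to 'news_headlines.csv'")
```

The resulting file opens directly in Excel or Google Sheets.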
Example Output
After running the script, your JSON file might look like this.
[
    {
        "headline": "Major Breakthrough in Renewable Energy",
        "url": "https://example-news-site.com/story/energy-breakthrough"
    },
    {
        "headline": "Global Markets React to New Policy",
        "url": "https://example-news-site.com/story/markets-react"
    }
]
Common Challenges and Solutions
Many websites load content dynamically with JavaScript. BeautifulSoup alone cannot handle this, because it only sees the HTML returned by the initial request.
You may need tools like Selenium. Our guide on Scrape AJAX Content with BeautifulSoup can help.
Websites also block scrapers. You must be respectful. Use delays between requests.
Rotate user agents. Follow Avoid Getting Blocked While Scraping BeautifulSoup for tips.
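The delay-and-rotate pattern can be sketched as follows. The user-agent strings and URLs here are illustrative placeholders, and the actual request line is commented out so the sketch runs without network access:

```python
import random
import time

# A small pool of user-agent strings to rotate through
# (illustrative examples, not guaranteed-current browser strings)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/124.0",
]

# Hypothetical pages to scrape in sequence
urls = [
    "https://example-news-site.com/latest",
    "https://example-news-site.com/politics",
]

for url in urls:
    # Pick a different user agent for each request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # response = requests.get(url, headers=headers)  # real request goes here
    print(f"Fetching {url}")
    time.sleep(1)  # polite delay between requests
```

In a real scraper you would uncomment the requests.get() call and pass headers=headers; the sleep keeps your request rate gentle on the server.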
Conclusion
You now know how to scrape news headlines and links. The process is simple. Fetch, parse, find, and store.
Remember to respect website terms of service. Do not overload their servers. Use scraped data ethically.
This is a foundation. You can scrape other data like dates or summaries. Practice on different sites to improve.
For advanced patterns like extracting structured data, see Extract Microdata & JSON-LD with BeautifulSoup.
Happy scraping!