Last modified: Jan 12, 2026 By Alexander Williams
Scrape JSON Data in HTML with BeautifulSoup
Web scraping often targets structured data. This data is sometimes hidden within HTML. Modern websites frequently embed JSON in their source code. This JSON holds product details, user info, or dynamic content. BeautifulSoup is a popular Python library for parsing HTML. It can help you find and extract this embedded JSON data. This guide will show you how.
Why JSON is Embedded in HTML
Websites use JavaScript to create dynamic content. Data is often loaded from an API. This data is formatted as JSON. For performance or simplicity, this JSON is sometimes placed directly into the HTML. It is often inside a <script> tag. This lets the page render quickly. It also makes the data available to client-side scripts. As a scraper, you can access this raw data directly. This is more efficient than parsing the rendered HTML.
Finding the JSON in the HTML
First, inspect the webpage's source with your browser's developer tools and look for <script> tags. The JSON might be assigned to a JavaScript variable, with common patterns including `window.__INITIAL_STATE__` or `var data =`. It could also sit in a script tag with a data-only type, such as application/ld+json or application/json. Your goal is to locate the exact tag containing the JSON string.
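As a rough sketch of that inspection step done in code, the snippet below lists every inline script with its type and a short preview, so you can spot which tag holds the data. The sample HTML and the helper name describe_scripts are made up for illustration; a real page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, used only for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Desk Lamp"}</script>
<script>window.__INITIAL_STATE__ = {"cart": {"items": 2}};</script>
<script src="/static/app.js"></script>
</head><body></body></html>
"""

def describe_scripts(html):
    """Return (type, preview) pairs for every inline script in the page."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for tag in soup.find_all("script"):
        text = tag.string
        if not text:  # external scripts (src=...) have no inline content
            continue
        # Scripts without a type attribute are treated as plain JavaScript.
        script_type = tag.get("type", "text/javascript")
        found.append((script_type, text.strip()[:40]))
    return found

candidates = describe_scripts(SAMPLE_HTML)
for script_type, preview in candidates:
    print(script_type, "->", preview)
```

A tag typed application/ld+json or application/json can usually be passed to json.loads() as a whole, while a plain JavaScript tag needs the JSON isolated first.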
Basic Setup with BeautifulSoup
Start by installing the necessary libraries. You will need beautifulsoup4 and requests. Use pip to install them if you haven't already.
pip install beautifulsoup4 requests
Then, import the libraries in your Python script. Use requests to fetch the HTML content. Use BeautifulSoup to parse it.
import requests
from bs4 import BeautifulSoup
import json
# Fetch the webpage
url = 'https://example.com/product-page'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
html_content = response.text
# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
Extracting the JSON String
Once the HTML is parsed, find the script tag. Use BeautifulSoup's find() or find_all() methods. You may need to search by tag type or an identifying string. Once you have the script's text content, you need to isolate the JSON part.
# Find all script tags
script_tags = soup.find_all('script')
json_string = ''  # stays empty if no matching script is found
# Loop through tags to find the one containing your target data
for script in script_tags:
    if script.string and 'window.productData' in script.string:
        json_string = script.string
        # Further processing needed to extract just the JSON
        break
The script's string might contain JavaScript code. You only want the JSON object. You might need to use string manipulation or regular expressions to isolate it. For more on using patterns, see our guide on Use Regex with BeautifulSoup for Web Scraping.
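One possible alternative to a regex, shown here as a sketch, is to find the variable name with plain string search and let the standard library's json.JSONDecoder.raw_decode() read exactly one JSON value from that position. The helper name extract_json_after and the sample script text are hypothetical; window.productData is the variable name used in the example above.

```python
import json

def extract_json_after(script_text, marker):
    """Decode the first JSON value that follows `marker` in `script_text`."""
    start = script_text.find(marker)
    if start == -1:
        return None
    # Skip ahead to the first '{' or '[' after the marker.
    pos = start + len(marker)
    while pos < len(script_text) and script_text[pos] not in "{[":
        pos += 1
    if pos == len(script_text):
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        value, _end = json.JSONDecoder().raw_decode(script_text, pos)
    except json.JSONDecodeError:
        return None
    return value

# Made-up script content; note the "};" hidden inside a string value.
sample_script = 'window.productData = {"name": "Mug", "note": "a; b};", "tags": ["x"]}; init();'
product = extract_json_after(sample_script, "window.productData")
print(product)
```

Because the decoder understands JSON strings, a `};` inside a value does not cut the match short, which is a case where a non-greedy regex can fail. This only works when the embedded value is strict JSON, not a JavaScript object literal with unquoted keys.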
Parsing the JSON Data
After extracting the raw JSON string, parse it with Python's json module. Use the json.loads() function. This converts the string into a Python dictionary or list. You can then work with the data normally.
import re
# Example: JSON is assigned to a variable like `window.productData = {...};`
# The non-greedy match stops at the first `};` after the opening brace
pattern = r'window\.productData\s*=\s*(\{.*?\});'
match = re.search(pattern, json_string, re.DOTALL)
if match:
    json_data_str = match.group(1)
    try:
        product_data = json.loads(json_data_str)
        print("Successfully parsed JSON data.")
    except json.JSONDecodeError as e:
        print(f"Failed to parse JSON: {e}")
Successfully parsed JSON data.
Handling Complex and Nested JSON
The extracted JSON can be complex. It may have nested dictionaries and lists. Use standard Python dictionary and list access methods. Loop through the data to extract the specific fields you need.
# Accessing data from the parsed JSON dictionary
product_name = product_data.get('name')
price = product_data.get('price', {}).get('value')
sku_list = product_data.get('variants', [])
print(f"Product: {product_name}")
print(f"Price: {price}")
for sku in sku_list:
    print(f"SKU: {sku.get('id')}")
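When the field you need sits at an unknown depth, a small recursive walk can collect it. The sketch below assumes the parsed data contains only dictionaries, lists, and scalar values, which is what json.loads() produces. The helper name find_values and the sample data (including the stock field) are hypothetical.

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending, since the value may hold further matches.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)

# Made-up data shaped like the product example above.
sample = {
    "name": "Desk Lamp",
    "price": {"value": 19.99, "currency": "USD"},
    "variants": [
        {"id": "LAMP-BLK", "stock": {"id": "wh-1", "qty": 4}},
        {"id": "LAMP-WHT"},
    ],
}
ids = list(find_values(sample, "id"))
print(ids)
```

Note that this returns every match for the key, including the nested stock id, so it suits exploration better than precise extraction.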
Common Challenges and Solutions
You might face encoding issues, so make sure the response is decoded with the correct character set. Sometimes the JSON is pretty-printed across several lines. A regex pattern that uses . will not match past the first newline unless you pass re.DOTALL. Minified JSON sits on one long line, which matches easily but is hard to read by eye. The JSON might also be broken or invalid, so wrap your json.loads() call in a try-except block. For help with encoding, read BeautifulSoup Unicode Encoding Issues Guide.
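A minimal sketch of a defensive loader, assuming the raw data arrives either as bytes or as text, could look like this. The helper name load_json_or_none and the UTF-8 default are assumptions made for illustration.

```python
import json

def load_json_or_none(raw, encoding="utf-8"):
    """Decode bytes if needed, then parse; return None instead of raising."""
    if isinstance(raw, bytes):
        # errors="replace" substitutes bad bytes rather than raising an error.
        raw = raw.decode(encoding, errors="replace")
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        print(f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}")
        return None

good = load_json_or_none('{"name": "Café"}'.encode("utf-8"))
bad = load_json_or_none('{"name": "Mug",}')  # trailing comma is not valid JSON
print(good, bad)
```

Returning None keeps a long scraping run alive when one page carries malformed data, at the cost of having to check for None afterwards.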
Some sites load data via XHR after the initial page load. In these cases, BeautifulSoup alone may not be enough. You might need to combine it with a tool like Selenium. Learn more in our article Combine BeautifulSoup & Selenium for Web Scraping.
Complete Working Example
Here is a full example. It scrapes a mock product page. It finds JSON in a script tag and extracts the product info.
import requests, json, re
from bs4 import BeautifulSoup
url = 'https://example.mock/product-123'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
# Target script tag with specific id
target_script = soup.find('script', {'id': '__NEXT_DATA__'})
if target_script:
    # Parse the entire script content as JSON (common in Next.js apps)
    page_data = json.loads(target_script.string)
    # Navigate to the product info within the nested JSON
    product_info = page_data['props']['pageProps']['product']
    print(f"Product Title: {product_info['title']}")
    print(f"In Stock: {product_info['inStock']}")
else:
    print("Target script tag not found.")
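Since the URL above is a mock, the same steps can be tried offline against an inline HTML string. This is a sketch under the assumption that the page follows the Next.js-style layout used in the example; the sample markup and the helper name extract_product are invented for illustration.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical Next.js-style page; the field names mirror the example above.
MOCK_HTML = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"product": {"title": "Desk Lamp", "inStock": true}}}}
</script>
</body></html>
"""

def extract_product(html):
    """Return the product dict from the __NEXT_DATA__ script, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None or not tag.string:
        return None
    try:
        page_data = json.loads(tag.string)
    except json.JSONDecodeError:
        return None
    # Chained .get() calls avoid a KeyError when a level is missing.
    return page_data.get("props", {}).get("pageProps", {}).get("product")

product_info = extract_product(MOCK_HTML)
print(product_info)
```

Keeping the extraction in a function that takes HTML text makes it easy to test without network access and to reuse with content fetched by requests.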
Conclusion
Scraping JSON embedded in HTML is a powerful technique. It allows you to access clean, structured data directly. Use BeautifulSoup to locate the correct <script> tag. Then, use string methods or regex to isolate the JSON string. Finally, parse it with Python's json module. This method is often faster and more reliable than scraping rendered HTML. Remember to check the website's robots.txt and terms of service. Always scrape responsibly. For your next steps, learn how to Save Scraped Data to CSV with BeautifulSoup to store your results.