Last modified: Nov 22, 2024 By Alexander Williams

Getting Page Source with Python Selenium: Step-by-Step Guide

Retrieving a webpage's source code is a fundamental task in web automation and scraping. Python Selenium provides powerful tools to access and analyze HTML content efficiently.

Basic Setup and Requirements

Before getting started, ensure you have Selenium and the appropriate WebDriver installed. Here's a basic setup example:


from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run in headless mode (optional)

# Initialize the driver
driver = webdriver.Chrome(options=chrome_options)

Getting Page Source Using Selenium

The most straightforward way to get a page's source code is using the page_source property. Here's how to implement it:


# Navigate to the website
driver.get('https://example.com')

# Get the page source
source_code = driver.page_source

# Print first 500 characters of source code
print(source_code[:500])

# Save source code to file
with open('page_source.html', 'w', encoding='utf-8') as f:
    f.write(source_code)
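
If you only need the markup for part of the page rather than the whole document, you can read a single element's HTML instead. Here is a minimal sketch, assuming the page contains an element with the id 'data' (a hypothetical id used for illustration):

from selenium.webdriver.common.by import By

# Locate a single element and read only its HTML (the element id is hypothetical)
element = driver.find_element(By.ID, 'data')
outer_html = element.get_attribute('outerHTML')  # the element including its own tag
inner_html = element.get_attribute('innerHTML')  # only the element's children

print(outer_html[:200])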

Advanced Source Code Analysis

For more complex scenarios, you can combine Selenium with BeautifulSoup, which makes searching and extracting elements from the retrieved HTML much easier. Keep in mind that page_source only returns the document of the current frame, so content inside iframes requires switching into the frame first (see the sketch after the example below).


from bs4 import BeautifulSoup

# Get page source and create BeautifulSoup object
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')

# Find specific elements
titles = soup.find_all('h1')
for title in titles:
    print(title.text)
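
Because page_source only covers the current frame, HTML inside an iframe is not included until you switch into that frame. Here is a minimal sketch, assuming the page contains an iframe named 'content-frame' (a hypothetical name used for illustration):

# Switch into the iframe before reading its source (the frame name is hypothetical)
driver.switch_to.frame('content-frame')

# Parse only the iframe's document
frame_soup = BeautifulSoup(driver.page_source, 'html.parser')
print(frame_soup.find('h1'))

# Return to the top-level document when finished
driver.switch_to.default_content()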

Dynamic Content Handling

When dealing with dynamic websites, you may need to wait for content to load, because page_source reflects the DOM only as it exists at the moment you read it. Blocking pop-ups such as JavaScript alerts also have to be dismissed before you can read the page (see the alert sketch after the example below).


from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the dynamic element to appear
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, 'dynamic-content')))

# Get updated page source after dynamic content loads
updated_source = driver.page_source
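
JavaScript alert() pop-ups block the page until they are dismissed, so accept or dismiss them before reading the source. Here is a minimal sketch, assuming the page raises an alert shortly after loading:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the alert to appear, accept it, then read the page source
WebDriverWait(driver, 10).until(EC.alert_is_present())
driver.switch_to.alert.accept()

source_after_alert = driver.page_source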

Error Handling and Best Practices

Implement proper error handling for robust automation: wrap navigation and source retrieval in a try/except block and always release the browser in a finally clause. Adding logging on top of this makes failures much easier to debug (a logging sketch follows the example below).


try:
    driver.get('https://example.com')
    source = driver.page_source
except Exception as e:
    print(f"Error retrieving page source: {str(e)}")
finally:
    driver.quit()  # Always close the browser
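
For longer-running jobs, Python's standard logging module is more useful than bare print statements because it adds timestamps and severity levels. Here is a minimal, self-contained sketch that logs the URL and the size of the retrieved source (the format string and log level are just reasonable defaults):

import logging

from selenium import webdriver
from selenium.common.exceptions import WebDriverException

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

driver = webdriver.Chrome()
url = 'https://example.com'
try:
    driver.get(url)
    source = driver.page_source
    logging.info("Fetched %s (%d characters of source)", url, len(source))
except WebDriverException:
    logging.exception("Failed to retrieve page source from %s", url)
finally:
    driver.quit()  # Always close the browser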

Conclusion

Getting page source with Python Selenium is a powerful technique for web scraping and automation. Remember to handle dynamic content, implement proper error handling, and follow best practices for optimal results.

Key takeaways: use explicit waits so the DOM is in the state you expect before reading page_source, handle frames, alerts, and other blocking content first, and always clean up the driver in a finally block.