Selenium

Selenium is a portable framework for testing web applications. It also provides a test domain-specific language (Selenese) to write tests in a number of popular programming languages.

Web driver backends

Selenium can be used with many browsers, such as Firefox, Chrome or PhantomJS. But first, install selenium:

pip install selenium

Firefox

Assuming Firefox is already installed, download geckodriver, unpack the tar and add the geckodriver binary somewhere in your PATH.

from selenium import webdriver

driver = webdriver.Firefox()

driver.get("https://duckduckgo.com/")
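
If webdriver.Firefox() fails with an error about the driver executable, a quick check is whether geckodriver is actually reachable through PATH. The following is a minimal sketch that uses only the standard library; find_driver is a hypothetical helper name, not part of Selenium.

```python
import shutil
from typing import Optional


def find_driver(name: str = "geckodriver") -> Optional[str]:
    """Return the path of a driver binary found in PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("geckodriver is not in PATH")
    else:
        print(f"geckodriver found at {path}")
```

The same helper works for the Chromium driver by passing "chromedriver" as the name.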

If you need the status code of the responses, use Chrome instead

Firefox has an open issue that prevents it from supporting this feature.

Chrome

We're going to use Chromium instead of Chrome. Download the chromedriver version that matches your Chromium, unpack the archive and add the chromedriver binary somewhere in your PATH.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.binary_location = '/usr/bin/chromium'
driver = webdriver.Chrome(options=opts)

driver.get("https://duckduckgo.com/")

If you don't want to see the browser, you can run it in headless mode by adding the following line when defining the options:

opts.add_argument("--headless")

PhantomJS

PhantomJS is abandoned -> Don't use it

The development stopped in 2018

PhantomJS is a headless WebKit browser. In conjunction with Selenium WebDriver, it can be used to run tests directly from the command line. Since PhantomJS eliminates the need for a graphical browser, tests run much faster.

Don't install phantomjs from the official repos, as it's not a working release. npm install -g phantomjs didn't work either, and neither did the tar from the downloads page. The project is abandoned, so don't use it.

Usage

Assuming that you've got a configured driver, use the driver.current_url attribute to get the url you're in after JavaScript has done its magic. To get the HTML of the page, use driver.page_source.

Close the browser

driver.close()
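
Note that close() closes the current window, while quit() ends the whole driver session. To make sure the browser is shut down even when the scraping code raises, one option is to wrap the driver in a context manager. This is only a sketch: managed_driver is a hypothetical helper, and it assumes the factory returns an object with a quit() method, as Selenium drivers do.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_driver(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create a driver with factory() and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # quit() ends the session and closes every window it opened
        driver.quit()


# Usage, assuming Selenium and geckodriver are installed:
#
# from selenium import webdriver
#
# with managed_driver(webdriver.Firefox) as driver:
#     driver.get("https://duckduckgo.com/")
#     print(driver.current_url)
```

Passing the factory instead of an already created driver keeps the creation inside the helper, so there is a single place that owns the browser's lifetime.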

Set timeout of a response

For Firefox and Chromedriver:

driver.set_page_load_timeout(30)

The rest:

driver.implicitly_wait(30)

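With set_page_load_timeout, a TimeoutException is raised whenever the page load takes more than 30 seconds. Keep in mind that implicitly_wait controls how long element lookups wait before failing, rather than the page load itself.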

Get the status code of a response

Surprisingly, this is not as easy as with requests. There is no status_code attribute on the driver, so you need to dive into the browser log to get it. Firefox has had an open issue since 2016 that prevents you from getting this information. Use Chromium if you need this functionality.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Recent Selenium 4 releases no longer accept desired_capabilities,
# so set the logging capability on the options instead.
opts = Options()
opts.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=opts)

driver.get("https://duckduckgo.com/")
logs = driver.get_log("performance")
status_code = get_status(driver.current_url, logs)

Where get_status is:

import json
from contextlib import suppress
from typing import Any, Dict, List

def get_status(url: str, logs: List[Dict[str, Any]]) -> int:
    """Get the url response status code.

    Args:
        url: url to search
        logs: Browser driver logs
    Returns:
        The status code.
    """
    for log in logs:
        if log["message"]:
            data = json.loads(log["message"])
            with suppress(KeyError):
                if data["message"]["params"]["response"]["url"] == url:
                    return data["message"]["params"]["response"]["status"]
    raise ValueError(f"Error retrieving the status code for url {url}")

Pass driver.current_url rather than the url you requested, so that urls that redirect to other urls are handled correctly.
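
To see what get_status is parsing, it helps to look at the shape of a performance log entry. The sketch below runs a compact copy of get_status against two hand-written entries. The sample url and values are made up for illustration, and real entries carry many more fields.

```python
import json
from contextlib import suppress
from typing import Any, Dict, List


def get_status(url: str, logs: List[Dict[str, Any]]) -> int:
    """Return the status code of the logged response whose url matches."""
    for log in logs:
        if log["message"]:
            data = json.loads(log["message"])
            with suppress(KeyError):
                response = data["message"]["params"]["response"]
                if response["url"] == url:
                    return response["status"]
    raise ValueError(f"Error retrieving the status code for url {url}")


# Hand-written sample: each entry's "message" is itself a JSON string.
sample_logs = [
    {
        "level": "INFO",
        "message": json.dumps(
            {
                "message": {
                    "method": "Network.requestWillBeSent",
                    "params": {"request": {"url": "https://example.com/"}},
                }
            }
        ),
    },
    {
        "level": "INFO",
        "message": json.dumps(
            {
                "message": {
                    "method": "Network.responseReceived",
                    "params": {
                        "response": {
                            "url": "https://example.com/",
                            "status": 200,
                        }
                    },
                }
            }
        ),
    },
]

print(get_status("https://example.com/", sample_logs))  # 200
```

The first entry has no response key, so the KeyError is suppressed and the loop moves on to the second entry, which matches.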

If your url is not caught and you get a ValueError, add the following snippet inside the with suppress(KeyError) block.

content_type = (
    "text/html"
    in data["message"]["params"]["response"]["headers"]["content-type"]
)
response_received = (
    data["message"]["method"] == "Network.responseReceived"
)
if content_type and response_received:
    __import__("pdb").set_trace()  # XXX BREAKPOINT

Then try to see why url != data["message"]["params"]["response"]["url"]. Sometimes servers redirect the user to a url without the www. prefix.
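
When the mismatch comes only from a leading www. or a trailing slash, a looser comparison can be swapped in for the strict == check. This is a minimal sketch and not part of the original get_status: same_page is a hypothetical helper, and it deliberately ignores the scheme, query string and fragment, which may be too permissive for some sites.

```python
from urllib.parse import urlsplit


def _normalize(url: str) -> tuple:
    """Reduce a url to (host without 'www.', path without trailing slash)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/")
    return host, path


def same_page(first: str, second: str) -> bool:
    """Compare two urls, ignoring scheme, 'www.' prefix and trailing slash."""
    return _normalize(first) == _normalize(second)


print(same_page("https://www.example.com/", "https://example.com"))  # True
print(same_page("https://example.com/a", "https://example.com/b"))  # False
```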

Troubleshooting

Chromedriver hangs up unexpectedly

Some say that setting the DBUS_SESSION_BUS_ADDRESS environment variable fixes it:

import os

os.environ["DBUS_SESSION_BUS_ADDRESS"] = "/dev/null"

But it still hangs for me. Right now the only solution I see is to assume it's going to hang and add functionality in your program to resume the work instead of starting from scratch. Ugly I know...
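
One way to resume instead of starting from scratch is to record each finished url on disk and skip it on the next run. The sketch below uses only the standard library. The names process_urls, load_done and handler are hypothetical, and handler stands in for whatever Selenium work you do per url.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List, Set


def load_done(checkpoint: Path) -> Set[str]:
    """Read the set of already processed urls, if the file exists."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def process_urls(
    urls: Iterable[str],
    handler: Callable[[str], None],
    checkpoint: Path,
) -> List[str]:
    """Run handler on every url not yet recorded in the checkpoint file."""
    done = load_done(checkpoint)
    processed = []
    for url in urls:
        if url in done:
            continue
        handler(url)  # May hang or raise; finished urls are already saved
        done.add(url)
        checkpoint.write_text(json.dumps(sorted(done)))
        processed.append(url)
    return processed
```

If the run is killed while handler is stuck, calling process_urls again with the same checkpoint file skips the urls that were already finished.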


Last update: 2021-06-25