Selenium
Selenium is a portable framework for testing web applications. It also provides a test domain-specific language (Selenese) to write tests in a number of popular programming languages.
Web driver backends⚑
Selenium can be used with many browsers, such as Firefox, Chrome or PhantomJS. But first, install selenium:
pip install selenium
Firefox⚑
Assuming you've got Firefox already installed, you need to download the geckodriver, unpack the tar and add the geckodriver binary somewhere in your PATH.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://duckduckgo.com/")
If you need to get the status code of the requests, use Chrome instead. There is an open issue with Firefox that prevents it from supporting this feature.
Chrome⚑
We're going to use Chromium instead of Chrome. Download the chromedriver of the same version as your Chromium, unpack the tar and add the chromedriver binary somewhere in your PATH.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.binary_location = '/usr/bin/chromium'
driver = webdriver.Chrome(options=opts)
driver.get("https://duckduckgo.com/")
If you don't want to see the browser, you can run it in headless mode by adding the following line when defining the options:
opts.add_argument("--headless")
PhantomJS⚑
PhantomJS is abandoned, so don't use it.
PhantomJS is a headless WebKit browser. In conjunction with Selenium WebDriver, it can be used to run tests directly from the command line. Since PhantomJS eliminates the need for a graphical browser, tests run much faster.
Don't install phantomjs from the official repos as it's not a working release -.-. npm install -g phantomjs didn't work either. I then downloaded the tar from the downloads page, which didn't work either. The project is abandoned, so don't use this.
Usage⚑
Assuming that you've got a configured driver, to get the url you're in after javascript has done its magic use the driver.current_url attribute. To return the HTML of the page use driver.page_source.
Open a URL⚑
driver.get("https://duckduckgo.com/")
Get page source⚑
driver.page_source
Get current url⚑
driver.current_url
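Since JavaScript may change the url some time after the page loads, a minimal sketch of polling driver.current_url until it changes could look like this. The helper name wait_for_url_change and its timeout and poll parameters are assumptions made for illustration, not part of Selenium; the function only relies on the driver exposing a current_url attribute.

```python
import time


def wait_for_url_change(
    driver, old_url: str, timeout: float = 10.0, poll: float = 0.2
) -> str:
    """Poll driver.current_url until it differs from old_url.

    Hypothetical helper: `driver` only needs a `current_url` attribute,
    which Selenium drivers provide.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = driver.current_url
        if current != old_url:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"URL is still {old_url!r} after {timeout} seconds")
        time.sleep(poll)
```

You'd call it right after the action that triggers the redirect, for example new_url = wait_for_url_change(driver, url_before_click).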
Click on element⚑
Once you've opened the page you want to interact with using driver.get(), you need to get the XPath of the element to click on. You can do that with your browser inspector: select the element and, once on its code, right click and choose "Copy XPath".
Once that is done you should have something like this when you paste it:
//*[@id="react-root"]/section/main/article/div[2]/div[2]/p/a
The process is the same for the username and password input fields and the login button.
We can go ahead and do that on the current page, storing these XPaths as strings in our code to keep it readable.
We should have three XPaths from this page and one from the initial login.
first_login = '//*[@id="react-root"]/section/main/article/div[2]/div[2]/p/a'
username_input = '//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[2]/div/label/input'
password_input = '//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[3]/div/label/input'
login_submit = '//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[4]/button/div'
Now that we have the XPaths defined, we can tell the Selenium webdriver to click the elements and send some keys to the input fields.
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, first_login).click()
driver.find_element(By.XPATH, username_input).send_keys("username")
driver.find_element(By.XPATH, password_input).send_keys("password")
driver.find_element(By.XPATH, login_submit).click()
Note
Many pages suggest using methods like find_element_by_name, find_element_by_xpath or find_element_by_id. These were deprecated in Selenium 4 and removed in later 4.x releases. You should use find_element() with a By locator instead. So, instead of:
driver.find_element_by_xpath("your_xpath")
It should now be:
driver.find_element(By.XPATH, "your_xpath")
Where By is imported with from selenium.webdriver.common.by import By.
Solve element isn't clickable in headless mode⚑
There are many things you can try to fix this issue. The first is to configure the driver to use the full screen. Assuming you're using undetected-chromedriver:
import undetected_chromedriver.v2 as uc
options = uc.ChromeOptions()
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--start-maximized")
options.add_argument("--window-size=1920,1080")
driver = uc.Chrome(options=options)
If that doesn't solve the issue use the next function:
from typing import Optional

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def click(driver: uc.Chrome, xpath: str, mode: Optional[str] = None) -> None:
    """Click the element marked by the XPATH.

    Args:
        driver: Object to interact with selenium.
        xpath: Identifier of the element to click.
        mode: Type of click. It needs to be one of [None, position, wait]

    The different ways to click are:

    * None: The normal click of the driver.
    * wait: Wait until the element is clickable and then click it.
    * position: Find the element and then click it with a javascript script.
    """
    if mode is None:
        driver.find_element(By.XPATH, xpath).click()
    elif mode == 'wait':
        # https://stackoverflow.com/questions/59808158/element-isnt-clickable-in-headless-mode
        WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH, xpath))
        ).click()
    elif mode == 'position':
        # https://stackoverflow.com/questions/16807258/selenium-click-at-certain-position
        element = driver.find_element(By.XPATH, xpath)
        driver.execute_script("arguments[0].click();", element)
Close the browser⚑
driver.close()  # Closes the current window, driver.quit() closes every window and ends the session
Change browser configuration⚑
You can pass options to the initialization of the chromedriver to tweak how the browser behaves. To get a list of the actual prefs you can go to chrome://prefs-internals, where you can find the keys you need to tweak.
Disable loading of images⚑
from selenium.webdriver import ChromeOptions

options = ChromeOptions()
options.add_experimental_option(
    "prefs",
    {
        # 2 means "block" for a content setting
        "profile.default_content_setting_values.images": 2,
    },
)
Disable site cookies⚑
from selenium.webdriver import ChromeOptions

options = ChromeOptions()
options.add_experimental_option(
    "prefs",
    {
        "profile.default_content_setting_values.cookies": 2,
    },
)
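If you need to combine several of these prefs, a small builder keeps the dictionary in one place. This is only a sketch: build_prefs and the BLOCK constant are hypothetical names, and the pref keys are the two used in this section. The result is what you'd pass as the second argument of options.add_experimental_option("prefs", ...).

```python
from typing import Dict

# Content-setting value used in this section to block a content type.
BLOCK = 2


def build_prefs(block_images: bool = False, block_cookies: bool = False) -> Dict[str, int]:
    """Build the Chrome prefs dictionary for the requested restrictions.

    Hypothetical helper, it only assembles the keys shown in this section.
    """
    prefs: Dict[str, int] = {}
    if block_images:
        prefs["profile.default_content_setting_values.images"] = BLOCK
    if block_cookies:
        prefs["profile.default_content_setting_values.cookies"] = BLOCK
    return prefs


# With Selenium installed you would then do:
# options.add_experimental_option("prefs", build_prefs(block_images=True))
```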
Bypass Selenium detectors⚑
Sometimes web servers react differently if they notice that you're using selenium. Browsers can be detected in different ways. Some commonly used mechanisms are:
- Implementing captcha / recaptcha to detect the automatic bots.
- Non-human behaviour (browsing too fast, not scrolling to the visible elements, ...)
- Using an IP that's flagged as suspicious (VPN, VPS, Tor...)
- Detecting the term HeadlessChrome within the headless Chrome user agent
- Using a Bot Management service from Distil Networks, Akamai, Datadome.
You can try to avoid the detection in different ways:
- Use undetected-chromedriver
- Use Selenium stealth
- Rotate the user agent
- Changing browser properties
- Predefined Javascript variables
- Don't use selenium
If you've already been detected, you might get blocked for a plethora of other reasons even after using these methods. So you may have to try accessing the site that was detecting you using a VPN, different user-agent, etc.
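To make the HeadlessChrome check from the list above concrete, here is a minimal sketch of what such a detector might do, and of how the marker can be masked before passing the user agent to the browser. Both function names are hypothetical, and the sample user agent only illustrates the usual format of a headless Chrome string.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the user agent carries the HeadlessChrome marker."""
    return "HeadlessChrome" in user_agent


def mask_headless(user_agent: str) -> str:
    """Return the user agent with the HeadlessChrome marker replaced by Chrome."""
    return user_agent.replace("HeadlessChrome", "Chrome")


# Illustrative user agent in the format used by headless Chrome.
sample = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/83.0.4103.97 Safari/537.36"
)
masked = mask_headless(sample)
```

The masked string can then be set with the user-agent argument described in the user agent rotation section. It only removes this one signal, the rest of the checks in the list still apply.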
Use undetected-chromedriver⚑
undetected-chromedriver is a python library that uses an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distil Networks / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.
Installation⚑
pip install undetected-chromedriver
Usage⚑
import undetected_chromedriver.v2 as uc
driver = uc.Chrome()
driver.get('https://nowsecure.nl')  # test site of the library's author, with max anti-bot protection
If you want to specify the path to the browser use uc.Chrome(browser_executable_path="/path/to/your/file").
Use Selenium Stealth⚑
selenium-stealth is a python package that tries to prevent detection by making selenium more stealthy (it does most of the steps of this guide for you).
Note
It's less maintained than undetected-chromedriver, so I'd use that one instead. I leave the section in case it's helpful if the other fails for you.
Installation⚑
pip install selenium-stealth
Usage⚑
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium_stealth import stealth
import time

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Recent Selenium 4 releases no longer accept executable_path,
# so the driver path is passed through a Service object.
service = Service(r"C:\Users\DIPRAJ\Programming\adclick_bot\chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

url = "https://bot.sannysoft.com/"
driver.get(url)
time.sleep(5)
driver.quit()
You can test it with antibot.
Rotate the user agent⚑
Rotate the user agent in every execution of your test suite using the fake_useragent module as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
options = Options()
ua = UserAgent()
userAgent = ua.random
print(userAgent)
options.add_argument(f'user-agent={userAgent}')
driver = webdriver.Chrome(options=options)  # chrome_options is the old, deprecated name of this argument
driver.get("https://www.google.co.in")
driver.quit()
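You can also rotate it with execute_cdp_cmd:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Recent Selenium 4 releases no longer accept executable_path, use a Service instead
driver = webdriver.Chrome(service=Service(r'C:\WebDrivers\chromedriver.exe'))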
print(driver.execute_script("return navigator.userAgent;"))
# Setting user agent as Chrome/83.0.4103.97
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
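If you'd rather not depend on fake_useragent, a minimal sketch of the same rotation with a fixed pool could look like the following. The USER_AGENTS pool and the user_agent_argument helper are assumptions for illustration. The first string is the one used above, the second is a made-up Linux variant of it, so replace them with user agents that match the browsers you want to imitate.

```python
import random
from typing import Optional, Sequence

# Illustrative pool, not an authoritative list of current user agents.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
)


def user_agent_argument(
    pool: Sequence[str] = USER_AGENTS, rng: Optional[random.Random] = None
) -> str:
    """Return a 'user-agent=...' browser argument with a randomly chosen agent.

    Pass a seeded random.Random as `rng` to get reproducible choices.
    """
    chooser = rng if rng is not None else random
    return f"user-agent={chooser.choice(pool)}"


# With Selenium installed you would then do:
# options.add_argument(user_agent_argument())
```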
Changing browser properties⚑
- Changing the property value of navigator for webdriver to undefined as follows:
  driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", { "source": """ Object.defineProperty(navigator, 'webdriver', { get: () => undefined }) """ })
  You can find a relevant detailed discussion in Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection.
- Changing the values of navigator.plugins, navigator.languages, WebGL, hairline feature, missing image, etc. You can find a relevant detailed discussion in Is there a version of selenium webdriver that is not detectable?
- Changing the conventional Viewport. You can find a relevant detailed discussion in How to bypass Google captcha with Selenium and python?
Predefined Javascript variables⚑
One way of detecting Selenium is by checking for predefined JavaScript variables which appear when running with Selenium. The bot detection scripts usually look for anything containing the words selenium or webdriver in any of the variables (on the window object), and also for document variables called $cdc_ and $wdc_. Of course, all of this depends on which browser you are on. All the different browsers expose different things.
In Chrome, what people had to do was to ensure that $cdc_ didn't exist as a document variable.
You don't need to compile the chromedriver yourself: open the file with vim and execute :%s/cdc_/dog_/g, where dog can be any three characters. With perl you can achieve the same result with:
perl -pi -e 's/cdc_/dog_/g' /path/to/chromedriver
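The same substitution can be sketched in Python if you prefer to keep the patch inside your project. The patch_cdc name is hypothetical. The function enforces a replacement of the same length as cdc_ so that the size and offsets of the binary don't change, which is why the replacement uses three characters plus the underscore.

```python
from pathlib import Path


def patch_cdc(path: str, replacement: bytes = b"dog_") -> int:
    """Replace every b"cdc_" in the file at `path` and return how many were found.

    Hypothetical helper mirroring the vim and perl one-liners above.
    """
    original = b"cdc_"
    if len(replacement) != len(original):
        raise ValueError("The replacement must be exactly 4 bytes long")
    binary = Path(path)
    data = binary.read_bytes()
    count = data.count(original)
    if count:
        binary.write_bytes(data.replace(original, replacement))
    return count
```

Keep a copy of the original chromedriver before patching it, as the change is done in place.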
Don't use selenium⚑
Even with undetected-chromedriver, sometimes servers are able to detect that you're using selenium.
An uglier but maybe effective way to go is not using selenium at all: work directly with the Chrome DevTools Protocol through pycdp (using this maintained fork) and do the clicks with pyautogui. See an example on this answer.
Keep in mind though that these tools don't look to be actively maintained, and that the approach is quite brittle to site changes. Is there really no other way to achieve what you want?
Set timeout of a response⚑
For Firefox and Chromedriver:
driver.set_page_load_timeout(30)
The rest:
driver.implicitly_wait(30)
set_page_load_timeout will throw a TimeoutException whenever the page load takes more than 30 seconds.
Get the status code of a response⚑
Surprisingly this is not as easy as with requests: there is no status_code attribute on the driver, so you need to dive into the browser log to get it. Firefox has an open issue since 2016 that prevents you from getting this information. Use Chromium if you need this functionality.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Recent Selenium 4 releases no longer accept desired_capabilities,
# so the logging capability is set on the options object instead.
options = Options()
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://duckduckgo.com/")
logs = driver.get_log("performance")
status_code = get_status(driver.current_url, logs)
Where get_status is:
import json
from contextlib import suppress
from typing import Any, Dict, List


def get_status(url: str, logs: List[Dict[str, Any]]) -> int:
    """Get the url response status code.

    Args:
        url: url to search
        logs: Browser driver logs

    Returns:
        The status code.
    """
    for log in logs:
        if log["message"]:
            data = json.loads(log["message"])
            with suppress(KeyError):
                if data["message"]["params"]["response"]["url"] == url:
                    return data["message"]["params"]["response"]["status"]
    raise ValueError(f"Error retrieving the status code for url {url}")
You have to use driver.current_url to correctly handle urls that redirect to other urls.
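To make the structure that get_status walks through more concrete, here is a sketch that runs the same lookup over hand-written log entries. The entries are trimmed, made-up samples shaped like the ones driver.get_log("performance") returns (real ones carry many more fields), and make_entry and find_status are hypothetical names used only for this illustration.

```python
import json
from contextlib import suppress
from typing import Any, Dict, List


def make_entry(method: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Build a trimmed log entry: the interesting data is a JSON string under "message"."""
    return {
        "level": "INFO",
        "message": json.dumps({"message": {"method": method, "params": params}}),
    }


# Made-up sample: one request event without a response and one response event.
sample_logs = [
    make_entry("Network.requestWillBeSent", {"request": {"url": "https://duckduckgo.com/"}}),
    make_entry(
        "Network.responseReceived",
        {"response": {"url": "https://duckduckgo.com/", "status": 200}},
    ),
]


def find_status(url: str, logs: List[Dict[str, Any]]) -> int:
    """Return the status of the first response event whose url matches."""
    for log in logs:
        data = json.loads(log["message"])
        # Events without a response (such as requests) raise KeyError and are skipped.
        with suppress(KeyError):
            response = data["message"]["params"]["response"]
            if response["url"] == url:
                return response["status"]
    raise ValueError(f"Error retrieving the status code for url {url}")


status = find_status("https://duckduckgo.com/", sample_logs)
```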
If your url is not caught and you get a ValueError, use the following snippet inside the with suppress(KeyError) statement.
content_type = (
    "text/html"
    in data["message"]["params"]["response"]["headers"]["content-type"]
)
response_received = (
    data["message"]["method"] == "Network.responseReceived"
)
if content_type and response_received:
    __import__("pdb").set_trace()  # XXX BREAKPOINT
    pass
Then check why url != data["message"]["params"]["response"]["url"]. Sometimes servers redirect the user to a url without the www. prefix.
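If the only difference turns out to be the www. prefix, one option is to compare the urls in a normalized form instead of with ==. The same_page helper below is a hypothetical sketch: it ignores a leading www. in the host and a trailing slash in the path, and nothing else, so adapt it if your server also changes the scheme or the query.

```python
from typing import Tuple
from urllib.parse import urlsplit


def _key(url: str) -> Tuple[str, str, str, str]:
    """Reduce a url to the parts compared by same_page."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return parts.scheme, host, parts.path.rstrip("/"), parts.query


def same_page(url_a: str, url_b: str) -> bool:
    """Compare two urls ignoring a leading www. and a trailing slash."""
    return _key(url_a) == _key(url_b)
```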
Troubleshooting⚑
Chromedriver hangs up unexpectedly⚑
Some say that setting the DBUS_SESSION_BUS_ADDRESS environment variable fixes it:
import os
os.environ["DBUS_SESSION_BUS_ADDRESS"] = "/dev/null"
But it still hangs for me. Right now the only solution I see is to assume it's going to hang and add functionality in your program to resume the work instead of starting from scratch. Ugly I know...
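A minimal sketch of that resume functionality, assuming your work can be split per url: record each finished url in a checkpoint file and skip the recorded ones on the next run. The process_with_checkpoint name, the JSON checkpoint format and the handle callback are all assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def process_with_checkpoint(
    urls: Iterable[str], handle: Callable[[str], None], checkpoint: Path
) -> List[str]:
    """Run handle(url) for every url not yet recorded in the checkpoint file.

    Returns the urls processed in this run. If handle() raises, or the process
    is killed because the driver hung, the urls finished so far stay recorded.
    """
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed: List[str] = []
    for url in urls:
        if url in done:
            continue
        handle(url)
        done.add(url)
        # Save after every url so a later hang loses at most the url in progress.
        checkpoint.write_text(json.dumps(sorted(done)))
        processed.append(url)
    return processed
```

Here handle would be the function that drives the browser for one url. After a hang, kill the program and run it again with the same checkpoint path to continue where it stopped.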
Issues⚑
- Firefox driver doesn't have access to the log: once it's solved, update the section above and start using Firefox instead of Chrome when you need to get the status code of the responses.