BeautifulSoup
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
import requests
from bs4 import BeautifulSoup
response = requests.get('{{ url }}')
soup = BeautifulSoup(response.text, "html.parser")
Here are some simple ways to navigate that data structure:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Installation⚑
pip install beautifulsoup4
The default parser, `html.parser`, doesn't handle HTML5 well, so you'll probably want to use the `html5lib` parser. It isn't included by default, so you may need to install it as well:
pip install html5lib
Usage⚑
Kinds of objects⚑
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: `Tag`, `NavigableString`, `BeautifulSoup`, and `Comment`.
Tag⚑
A `Tag` object corresponds to an XML or HTML tag in the original document:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
The most important features of a tag are its name and its attributes.
Name⚑
Every tag has a name, accessible as `.name`:
tag.name
# u'b'
If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
Attributes⚑
A tag may have any number of attributes. The tag `<b id="boldest">` has an attribute `id` whose value is `boldest`. You can access a tag's attributes by treating the tag like a dictionary:
tag['id']
# u'boldest'
You can access that dictionary directly as `.attrs`:
tag.attrs
# {u'id': 'boldest'}
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <b another-attribute="1" id="verybold"></b>
del tag['id']
del tag['another-attribute']
tag
# <b></b>
tag['id']
# KeyError: 'id'
print(tag.get('id'))
# None
Multi-valued attributes⚑
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>', "html.parser")
css_soup.p['class']
# ["body"]
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.p['class']
# ["body", "strikeout"]
If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_soup.p['id']
# 'my id'
When you turn a tag back into a string, multiple attribute values are consolidated:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
If you parse a document as XML, there are no multi-valued attributes: an attribute like `class` comes back as a single string.
NavigableString⚑
A string corresponds to a bit of text within a tag. Beautiful Soup uses the `NavigableString` class to contain these bits of text:
tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A `NavigableString` is just like a Python string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a `NavigableString` to a plain string with `str()` (in Python 2 this was `unicode()`):
plain_string = str(tag.string)
plain_string
# 'Extremely bold'
type(plain_string)
# <class 'str'>
You can't edit a string in place, but you can replace one string with another, using `replace_with()`:
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
BeautifulSoup⚑
The `BeautifulSoup` object represents the parsed document as a whole. For most purposes, you can treat it as a `Tag` object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
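For example (a minimal sketch with made-up markup), the `BeautifulSoup` object answers the same navigation calls as a `Tag`, and its `.name` is the special value `[document]`:

```python
from bs4 import BeautifulSoup

# Throwaway markup, just to show that the BeautifulSoup object behaves
# like a Tag: it supports find() and has the special name "[document]".
doc = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

print(soup.name)            # [document]
print(soup.find("p").text)  # Hi
```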
Navigating the tree⚑
Going down⚑
Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.
Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.
Navigating using tag names⚑
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first `<b>` tag beneath the `<body>` tag:
soup.body.b
# <b>The Dormouse's story</b>
Using a tag name as an attribute will give you only the first tag by that name:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
If you need to get all the `<a>` tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as `find_all()`:
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
.contents and .children⚑
A tag's children are available in a list called `.contents`:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
Instead of getting them as a list, you can iterate over a tag's children using the `.children` generator:
for child in title_tag.children:
print(child)
# The Dormouse's story
.descendants⚑
The `.contents` and `.children` attributes only consider a tag's direct children. For instance, the `<head>` tag has a single direct child, the `<title>` tag:
head_tag.contents
# [<title>The Dormouse's story</title>]
But the `<title>` tag itself has a child: the string "The Dormouse's story". There's a sense in which that string is also a child of the `<head>` tag. The `.descendants` attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on:
for child in head_tag.descendants:
print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
.string⚑
If a tag has only one child, and that child is a `NavigableString`, the child is made available as `.string`:
title_tag.string
# u'The Dormouse's story'
If a tag's only child is another tag, and that tag has a `.string`, then the parent tag is considered to have the same `.string` as its child:
head_tag.contents
# [<title>The Dormouse's story</title>]
head_tag.string
# u'The Dormouse's story'
If a tag contains more than one thing, then it's not clear what `.string` should refer to, so `.string` is defined to be `None`:
print(soup.html.string)
# None
.strings and .stripped_strings⚑
If there's more than one thing inside a tag, you can still look at just the strings. Use the `.strings` generator:
for string in soup.strings:
print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:
for string in soup.stripped_strings:
print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
Going up⚑
Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.
.parent⚑
You can access an element's parent with the `.parent` attribute.
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
.parents⚑
You can iterate over all of an element's parents with `.parents`.
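A short sketch (with throwaway markup) of walking from a tag up to the root:

```python
from bs4 import BeautifulSoup

# Walk from an <a> tag up through every enclosing element;
# the last "parent" is the BeautifulSoup object itself, named "[document]".
soup = BeautifulSoup(
    '<html><body><p><a href="http://example.com/">link</a></p></body></html>',
    "html.parser",
)
parent_names = [parent.name for parent in soup.a.parents]
print(parent_names)  # ['p', 'body', 'html', '[document]']
```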
Going sideways⚑
When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.
.next_sibling and .previous_sibling⚑
You can use `.next_sibling` and `.previous_sibling` to navigate between page elements that are on the same level of the parse tree:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
The `<b>` tag has a `.next_sibling`, but no `.previous_sibling`, because there's nothing before the `<b>` tag on the same level of the tree. For the same reason, the `<c>` tag has a `.previous_sibling` but no `.next_sibling`:
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
In real documents, the `.next_sibling` or `.previous_sibling` of a tag will usually be a string containing whitespace. Consider these three `<a>` tags:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
You might think that the `.next_sibling` of the first `<a>` tag would be the second `<a>` tag. But actually, it's a string: the comma and newline that separate the first `<a>` tag from the second:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
# u',\n'
The second `<a>` tag is actually the `.next_sibling` of the comma:
link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
.next_siblings and .previous_siblings⚑
You can iterate over a tag's siblings with `.next_siblings` or `.previous_siblings`:
for sibling in soup.a.next_siblings:
print(repr(sibling))
# u',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# u'; and they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
print(repr(sibling))
# u' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# u'Once upon a time there were three little sisters; and their names were\n'
Searching the tree⚑
By passing a filter into a method like `find_all()`, you can zoom in on the parts of the document you're interested in.
Kinds of filters⚑
A string⚑
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the `<b>` tags in the document:
soup.find_all('b')
# [<b>The Dormouse's story</b>]
A regular expression⚑
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its `search()` method. This code finds all the tags whose names start with the letter "b"; in this case, the `<body>` tag and the `<b>` tag:
import re
for tag in soup.find_all(re.compile("^b")):
print(tag.name)
# body
# b
A list⚑
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the `<a>` tags and all the `<b>` tags:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
A function⚑
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return `True` if the argument matches, and `False` otherwise.
Here's a function that returns `True` if a tag defines the `class` attribute but doesn't define the `id` attribute:
def has_class_but_no_id(tag):
return tag.has_attr('class') and not tag.has_attr('id')
Pass this function into `find_all()` and you'll pick up all the `<p>` tags:
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]
find_all()⚑
The `find_all()` method looks through a tag's descendants and retrieves all descendants that match your filters.
soup.find_all("title")
# [<title>The Dormouse's story</title>]
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
import re
soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'
The name argument⚑
Pass in a value for `name` and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.
This is the simplest usage:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
The keyword arguments⚑
Any argument that's not recognized will be turned into a filter on one of a tag's attributes. If you pass in a value for an argument called `id`, Beautiful Soup will filter against each tag's `id` attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
You can filter an attribute based on a string, a regular expression, a list, a function, or the value `True`.
You can filter multiple attributes at once by passing in more than one keyword argument:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
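For instance, `id=True` matches any tag that defines an `id` attribute at all, whatever its value (throwaway markup for illustration):

```python
from bs4 import BeautifulSoup

# id=True matches tags that define an id attribute, regardless of its value.
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a> <a>no id</a> <a id="link2">Lacie</a></p>',
    "html.parser",
)
ids = [tag["id"] for tag in soup.find_all(id=True)]
print(ids)  # ['link1', 'link2']
```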
Searching by CSS class⚑
It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, `class`, is a reserved word in Python. Using `class` as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument `class_`:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The string argument⚑
With `string` you can search for strings instead of tags.
soup.find_all(string="Elsie")
# [u'Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
"""Return True if this string is the only child of its parent tag."""
return (s == s.parent.string)
soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
Although `string` is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose `.string` matches your value for `string`.
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
Searching by attribute and value⚑
soup = BeautifulSoup(html, "html.parser")  # html holds the markup to search
results = soup.find_all("td", {"valign": "top"})
The limit argument⚑
`find_all()` returns all the tags and strings that match your filters. This can take a while if the document is large. If you don't need all the results, you can pass in a number for `limit`.
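A minimal sketch: with three links in the document, `limit=2` makes the search stop after the first two matches:

```python
from bs4 import BeautifulSoup

# Three <a> tags, but limit=2 stops find_all() after two matches.
soup = BeautifulSoup(
    '<a href="/1">one</a><a href="/2">two</a><a href="/3">three</a>',
    "html.parser",
)
links = soup.find_all("a", limit=2)
print([link.text for link in links])  # ['one', 'two']
```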
The recursive argument⚑
If you call `mytag.find_all()`, Beautiful Soup will examine all the descendants of `mytag`. If you only want Beautiful Soup to consider direct children, you can pass in `recursive=False`.
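A quick sketch of the difference: `<title>` is a descendant of `<html>` but not a direct child, so with `recursive=False` the search finds nothing:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
# The default search covers all descendants...
print(soup.html.find_all("title"))
# [<title>The Dormouse's story</title>]

# ...but recursive=False only looks at direct children of <html>.
print(soup.html.find_all("title", recursive=False))
# []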
Calling a tag is like calling find_all()⚑
Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the `BeautifulSoup` object or a `Tag` object as though it were a function, then it's the same as calling `find_all()` on that object. These two lines of code are equivalent:
soup.find_all("a")
soup("a")
find()⚑
`find()` is like `find_all()`, but returns just one result.
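A small sketch of the two behaviors worth remembering: `find()` returns a `Tag` rather than a list, and returns `None` (not an empty list) when nothing matches:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first = soup.find("p")
print(first)  # <p>one</p>

# When nothing matches, find() returns None instead of an empty list.
print(soup.find("nosuchtag"))  # None
```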
find_parent() and find_parents()⚑
These methods work their way up the tree, looking at a tag’s (or a string’s) parents.
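For example (throwaway markup): starting from a link, `find_parent()` fetches the nearest enclosing tag of a given name:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p class="story"><a id="link1">Elsie</a></p></body>',
    "html.parser",
)
link = soup.find(id="link1")

# Nearest enclosing <p> (class is multi-valued, so it comes back as a list):
print(link.find_parent("p")["class"])  # ['story']
# Further up the tree:
print(link.find_parent("body").name)   # body
```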
find_next_siblings() and find_next_sibling()⚑
These methods use `.next_siblings` to iterate over the rest of an element's siblings in the tree. The `find_next_siblings()` method returns all the siblings that match, and `find_next_sibling()` only returns the first one:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
To go in the other direction, you can use `find_previous_siblings()` and `find_previous_sibling()`.
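A sketch: starting from the last link, `find_previous_siblings()` walks backwards through the earlier ones:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>',
    "html.parser",
)
last_link = soup.find(id="link3")

# Matches come back in the order they are encountered, i.e. reversed.
previous = [a["id"] for a in last_link.find_previous_siblings("a")]
print(previous)  # ['link2', 'link1']
```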
Modifying the tree⚑
replace_with⚑
`PageElement.replace_with()` removes a tag or string from the tree, and replaces it with the tag or string of your choice:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
Sometimes `replace_with()` doesn't do what you want. In that case, you can clear the tag's contents and append the new tag instead:
a_tag.clear()
a_tag.append(new_tag)
Tips⚑
Show content beautified / prettified⚑
Use `print(soup.prettify())`.
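For example, with a tiny made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b></p>", "html.parser")
# prettify() returns the markup with one tag per line, indented by nesting depth.
print(soup.prettify())
# <p>
#  <b>
#   bold
#  </b>
# </p>
```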
Cleaning escaped HTML code⚑
# s holds HTML with backslash-escaped quotes and slashes, e.g. extracted from a JSON payload
soup = BeautifulSoup(s.replace(r"\"", '"').replace(r"\/", "/"), "html.parser")