Last modified: Feb 15, 2023 By Alexander Williams
Understand How to Use get_text() in Beautiful Soup
get_text() is a Beautiful Soup method that returns all of an element's child strings, concatenated using the given separator. In this tutorial, we will learn how to use get_text() with examples, and we'll also look at the difference between get_text() and the .string property.
Let's get started.
get_text() Syntax
get_text(separator, strip)
Arguments:
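- separator: the string inserted between the pieces of text when they are joined.
- strip: if True, removes whitespace from the beginning and the end of each piece of text and skips pieces that are empty afterwards.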
Arguments by default:
- separator=u""
- strip=False
Both arguments are optional.
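As a quick check of the defaults listed above, here is a minimal sketch that compares a bare call with one that spells out both arguments. The markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
soup = BeautifulSoup("<div><p>a</p> <p>b</p></div>", "html.parser")
el = soup.find("div")

# Calling with no arguments should match passing the defaults explicitly
implicit = el.get_text()
explicit = el.get_text(separator="", strip=False)

print(implicit == explicit)  # True
print(repr(implicit))        # 'a b'
```

The space in the result is the whitespace that sits between the two <p> tags in the markup; it is kept because strip is False.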
How to use get_text()
Let's see an example to understand how to use the get_text() method. In the following example, we'll get the text of all the children of the <div>.
from bs4 import BeautifulSoup # Import BeautifulSoup
# HTML source
html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''
soup = BeautifulSoup(html_source, 'html.parser') # Parse the HTML
el = soup.find("div") # Find the <div> tag
g_txt = el.get_text() # Get the text of the <div>
print(g_txt) # Print the output
Output:
child 1
child 2
child 3
As you can see in the code, we've used get_text() with no arguments.
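To see exactly what that call returns, including whitespace, it can help to print the repr() of the result. The sketch below reuses the same HTML; the variable name raw_text is only for illustration.

```python
from bs4 import BeautifulSoup

html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, 'html.parser')
el = soup.find("div")

# repr() makes the newline characters visible instead of rendering them
raw_text = el.get_text()
print(repr(raw_text))  # '\nchild 1\nchild 2\nchild 3\n'
```

The newline characters come from the line breaks between the tags in the HTML source.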
If you want to remove the newlines \n from the output, pass strip=True as in the example below. It strips the whitespace around each piece of text and drops the pieces that contain only whitespace.
from bs4 import BeautifulSoup # Import BeautifulSoup
# HTML source
html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''
soup = BeautifulSoup(html_source, 'html.parser') # Parse the HTML
el = soup.find("div") # Find the <div> tag
g_txt = el.get_text(strip=True) # Get the text of the <div> and remove the newlines from the output
print(g_txt) # Print the output
Output:
child 1child 2child 3
To add a space between the strings, set the separator parameter as in the example below.
from bs4 import BeautifulSoup # Import BeautifulSoup
# HTML source
html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''
soup = BeautifulSoup(html_source, 'html.parser') # Parse the HTML
el = soup.find("div") # Find the <div> tag
g_txt = el.get_text(strip=True, separator=" ") # Set separator and strip
print(g_txt) # Print the output
Output:
child 1 child 2 child 3
Now, we'll strip the strings and use \n as the separator, so each string is printed on its own line.
from bs4 import BeautifulSoup # Import BeautifulSoup
# HTML source
html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''
soup = BeautifulSoup(html_source, 'html.parser') # Parse the HTML
el = soup.find("div") # Find the <div> tag
g_txt = el.get_text(strip=True, separator="\n") # Set separator and strip
print(g_txt) # Print the output
Output:
child 1
child 2
child 3
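If a list of strings is more convenient than one joined string, the .stripped_strings generator is a related option. The sketch below is only an illustration of how it lines up with get_text(strip=True, separator="\n") on the same HTML; the names joined and pieces are made up for this example.

```python
from bs4 import BeautifulSoup

html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, 'html.parser')
el = soup.find("div")

# One string, with the stripped pieces joined by newlines
joined = el.get_text(strip=True, separator="\n")

# The same stripped pieces, collected into a list
pieces = list(el.stripped_strings)

print(pieces)
print(joined.split("\n") == pieces)  # True
```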
The difference between get_text() and .string
Let's see some examples to figure out the difference between the get_text() method and the .string property.
Example 1:
from bs4 import BeautifulSoup # Import BeautifulSoup
# HTML source
html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''
soup = BeautifulSoup(html_source, 'html.parser') # Parse the HTML
el = soup.find("div") # Find the <div> tag
print(el.get_text()) # Get the content of the <div> using get_text()
print(el.string) # Get the content of the <div> using .string
Output of get_text() :
child 1
child 2
child 3
Output of .string :
None
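As you can see, get_text() returns the text of the <div>'s children, while the .string property returns None. That is because .string only returns a value when a tag has a single child: either a string, or one tag that itself has a .string. Here the <div> contains several children, so .string is None.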
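For contrast, here is a minimal sketch of a case where .string does return text: a tag with exactly one child. The markup is hypothetical and deliberately has no whitespace between the tags, so the <div> has a single child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: no whitespace between tags, so each tag has one child
soup = BeautifulSoup("<div><p>only child</p></div>", "html.parser")

p = soup.find("p")
div = soup.find("div")

print(p.string)        # only child (the <p> has a single string child)
print(div.string)      # only child (the <div>'s single child has a .string)
print(div.get_text())  # only child
```

As soon as the <div> gains a second child, even a newline between tags, .string goes back to None while get_text() keeps working.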
Example 2:
from bs4 import BeautifulSoup # Import BeautifulSoup
# HTML source
html_source = '''
<div></div>
'''
soup = BeautifulSoup(html_source, 'html.parser') # Parse the HTML
el = soup.find("div") # Find the <div> tag
print(el.get_text()) # Get the content of the empty <div> using get_text()
print(el.string) # Get the content of the empty <div> using .string
Output of get_text() (an empty line):
Output of .string :
None
When we try to get the text of an empty tag:
- get_text() returns an empty string
- .string returns None
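The difference matters when you call string methods on the result. The sketch below shows both return values for an empty tag, plus one possible way to guard against None; the `or ""` fallback and the variable names are assumptions for illustration, not something Beautiful Soup requires.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")
el = soup.find("div")

text_from_get_text = el.get_text()
text_from_string = el.string

print(repr(text_from_get_text))  # ''
print(text_from_string is None)  # True

# Hypothetical guard: fall back to an empty string before calling str methods,
# since calling .strip() directly on None would raise an AttributeError
safe_text = (el.string or "").strip()
print(repr(safe_text))  # ''
```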
Conclusion
To summarize, use the get_text() method when you want all of the text inside an element, and keep in mind that .string returns None unless the tag has a single child.
For more articles about Beautiful Soup, scroll down, and happy learning.