Understand How to Use get_text() in BeautifulSoup

get_text() is a BeautifulSoup method that returns all the text inside an element (the strings of its children and their descendants), concatenated using the given separator. In this tutorial, we will learn how to use get_text() with examples, and we'll also look at the difference between get_text() and the .string property.

Let's get started.

get_text() Syntax

get_text(separator, strip)

Arguments:

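  • separator: the string inserted between the pieces of text when they are joined.
  • strip: if True, removes whitespace from the beginning and end of each piece of text and skips pieces that end up empty.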

Arguments by default:

  • separator=u""
  • strip=False

Both arguments are optional.
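As a quick check of the defaults described above, here is a minimal sketch. The one-line HTML snippet and the "|" separator are made up for illustration; the point is only that calling get_text() with no arguments behaves like passing the defaults explicitly, and that the arguments can be given by position or by keyword.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not taken from the examples in this tutorial
html_source = "<div><p>child 1</p> <p>child 2</p></div>"

soup = BeautifulSoup(html_source, "html.parser")
el = soup.find("div")

# No arguments is the same as passing the defaults explicitly
default_text = el.get_text()
explicit_text = el.get_text(separator="", strip=False)

# Positional and keyword arguments give the same result
positional_text = el.get_text("|", True)
keyword_text = el.get_text(separator="|", strip=True)

print(repr(default_text))
print(repr(keyword_text))
```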

How to use get_text()

Let's see an example to understand how to use the get_text() method. In the following example, we'll get all the text inside the <div>.

from bs4 import BeautifulSoup  # 👉️ Import BeautifulSoup

# HTML source
html_source = ''' 
<div>
<p>child 1</p>  
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, 'html.parser')  # 👉️ Parsing

el = soup.find("div") # 👉️ Find <div> TAG

g_txt = el.get_text() # 👉️ Get text of the <div>

print(g_txt) # 👉️ Print output

Output:


child 1
child 2
child 3

As you can see in the code, we've used get_text() with no arguments.

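If you want to remove the newlines \n from the output, set strip=True, as in the example below. With strip=True, each piece of text is stripped of leading and trailing whitespace, and pieces that end up empty (such as the newlines between the <p> tags) are skipped.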

from bs4 import BeautifulSoup  # 👉️ Import BeautifulSoup

# HTML source
html_source = ''' 
<div>
<p>child 1</p>  
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, 'html.parser')  # 👉️ Parsing

el = soup.find("div") # 👉️ Find <div> TAG

g_txt = el.get_text(strip=True) # 👉️ Get Text of the <div> and Remove newline from the output

print(g_txt) # 👉️ Print output

Output:

child 1child 2child 3
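One detail that is easy to miss: strip=True works on each piece of text separately, which is not the same as calling Python's str.strip() on the final result. The sketch below uses a made-up HTML snippet with extra spaces to show the difference.

```python
from bs4 import BeautifulSoup

# Illustrative HTML with extra spaces around and inside the text
html_source = "<div> <p> a  b </p> <p>c</p> </div>"

soup = BeautifulSoup(html_source, "html.parser")
el = soup.find("div")

# strip=True: each piece is stripped, whitespace-only pieces are dropped
per_string = el.get_text(strip=True)

# str.strip(): only the two ends of the joined result are trimmed
whole_result = el.get_text().strip()

print(repr(per_string))
print(repr(whole_result))
```

In both cases the spaces inside "a  b" are kept, because stripping only touches the ends of a string.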

To add a space between the strings, set the separator parameter, as in the example below.

from bs4 import BeautifulSoup  # 👉️ Import BeautifulSoup

# HTML source
html_source = ''' 
<div>
<p>child 1</p>  
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, 'html.parser')  # 👉️ Parsing

el = soup.find("div") # 👉️ Find <div> TAG

g_txt = el.get_text(strip=True, separator=" ") # 👉️ Set separator and strip

print(g_txt) # 👉️ Print output
 

Output:

child 1 child 2 child 3

Now, we'll strip each string and join the strings with \n as the separator.

from bs4 import BeautifulSoup  # 👉️ Import BeautifulSoup

# HTML source
html_source = ''' 
<div>
<p>child 1</p>  
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, 'html.parser')  # 👉️ Parsing

el = soup.find("div") # 👉️ find <div> TAG

g_txt = el.get_text(strip=True, separator="\n") # 👉️ Set separator and strip

print(g_txt) # 👉️ Print output

Output:

child 1
child 2
child 3
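If you need the pieces as a Python list rather than one string, here is a minimal sketch of two ways to get it. It assumes none of the pieces contains a newline of its own; otherwise splitting on "\n" would cut that piece in two, and the .stripped_strings generator is the safer choice.

```python
from bs4 import BeautifulSoup

html_source = '''
<div>
<p>child 1</p>
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, "html.parser")
el = soup.find("div")

# Option 1: join with "\n", then split the result back into a list
lines = el.get_text(separator="\n", strip=True).split("\n")

# Option 2: iterate over the stripped strings directly
same_lines = list(el.stripped_strings)

print(lines)
```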

The difference between get_text() and .string

Let's see some examples to figure out the difference between the get_text() method and the .string property.

Example 1:

from bs4 import BeautifulSoup  # 👉️ Import BeautifulSoup

# HTML source
html_source = ''' 
<div>
<p>child 1</p>  
<p>child 2</p>
<p>child 3</p>
</div>
'''

soup = BeautifulSoup(html_source, 'html.parser')  # 👉️ Parsing

el = soup.find("div") # 👉️ Find <div> TAG

print(el.get_text())  # 👉️ Get content of div using get_text()

print(el.string) # 👉️ Get Content of <div> using .string

Output of get_text() :

child 1
child 2
child 3

Output of .string :

None

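As you can see, get_text() returns the text of the div's children, while the .string property returns None. That is because .string only returns a value when the element has exactly one child: either a single string, or a single tag that itself has a .string. Here the div has several children, so .string is None.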
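To see when .string does return a value, here is a small sketch with made-up HTML snippets. A tag with a single string child returns that string, a tag whose only child is such a tag returns the same string, and a tag with more than one child returns None.

```python
from bs4 import BeautifulSoup

# A <div> whose only child is a <p> with a single string
single = BeautifulSoup("<div><p>child 1</p></div>", "html.parser")
p_string = single.find("p").string
div_string = single.find("div").string

# A <div> with two children
multi = BeautifulSoup("<div><p>a</p><p>b</p></div>", "html.parser")
multi_string = multi.find("div").string

print(p_string)      # string of the <p>
print(div_string)    # same string, reached through the single child
print(multi_string)  # None, because the <div> has two children
```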

Example 2:

from bs4 import BeautifulSoup  # 👉️ Import BeautifulSoup

# 👇️ HTML source
html_source = ''' 
<div></div>
'''

soup = BeautifulSoup(html_source, 'html.parser')  # 👉️ Parsing

el = soup.find("div") # 👉️ Find <div> TAG

print(el.get_text()) # 👉️ Get content of empty <div> using get_text()

print(el.string) # 👉️ Get content of empty <div> using .string

Output of get_text() :


Output of .string :

None

When we try to get the text of an empty tag:

  • get_text() returns an empty string ("")
  • .string returns None
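This difference matters when you call string methods on the result. The sketch below shows one way to guard against None; text_or_empty is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")
el = soup.find("div")


def text_or_empty(tag):
    """Hypothetical helper: return tag.string as a str, or "" if it is None."""
    value = tag.string
    return str(value) if value is not None else ""


from_get_text = el.get_text()       # empty string, safe to call .upper() on
from_string = el.string             # None, calling .upper() would raise an error
from_helper = text_or_empty(el)     # empty string

print(repr(from_get_text), repr(from_string), repr(from_helper))
```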

Conclusion

To summarize, use the get_text() method when you want all the text inside an element, including the text of its children. Use .string only when the element contains a single string; otherwise it returns None.

Happy learning!