Last modified: Jan 13, 2023 By Alexander Williams
How to read a string word by word in Python
In Python, strings are sequences of characters, and it's often necessary to process them word by word. For example, you might want to extract the individual words from a sentence or keywords from a paragraph. This article will look at several techniques for reading strings word by word in Python.
If you are ready, let's get started.
1. Using the string.split() Method
One of the simplest ways to read a string word by word is to use the built-in split() method.
By default, split() splits a string at whitespace characters (spaces, tabs, and newlines), but you can specify a different delimiter. Here's the syntax:
str.split(sep, maxsplit)
sep: Specifies the delimiter to use for splitting the string. If this parameter is not specified (or is None), runs of consecutive whitespace (spaces, tabs, newlines, etc.) are treated as a single delimiter.
maxsplit: Specifies the maximum number of splits to perform. The default value is -1, which means no limit.
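To make the two parameters concrete, here is a minimal sketch of splitting with a custom delimiter and with a split limit. The sample strings and variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical sample data, chosen only to illustrate the two parameters.
csv_line = "apple,banana,cherry,date"

# Split on a comma instead of whitespace.
fruits = csv_line.split(",")
print(fruits)  # ['apple', 'banana', 'cherry', 'date']

# Limit the number of splits: at most 2 splits, so at most 3 items.
limited = csv_line.split(",", 2)
print(limited)  # ['apple', 'banana', 'cherry,date']

# With no delimiter, runs of whitespace count as a single separator,
# and leading or trailing whitespace produces no empty strings.
spaced = "  several   spaces\tand\ttabs\n"
print(spaced.split())  # ['several', 'spaces', 'and', 'tabs']
```

Note that the remainder of the string stays unsplit in the last item once the limit is reached.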
Now let's see how to use split() to read a string word by word.
# Define a string of text
text = "This is a simple sentence."
# Split the string on whitespaces and store the result in the variable "result"
result = text.split()
# Print the list of substrings
print(result)
Output:
['This', 'is', 'a', 'simple', 'sentence.']
As you can see, we obtained the text as a list. Now, by using a for loop, we can print the list items one by one.
# Define a sentence as a string
sentence = "This is a simple sentence."
# Split the sentence into a list of words
words = sentence.split()
# Iterate over the list of words
for word in words:
    # print each word
    print(word)
Output:
This
is
a
simple
sentence.
Voila! The string has been read word by word.
2. Using the re.split() Method
Another way to read a string word by word is to use the re module's split() function. This function takes a regular expression as its delimiter, which gives you more control over the split.
For example, you can use it to split a string at any non-word character (anything other than a letter, digit, or underscore):
import re # importing regular expression library
sentence = "This, is a simple! sentence." # original sentence
words = re.split(r'[^\w]', sentence) # using re.split() to split the sentence into words by removing non-alphanumeric characters
print(words)
Output:
['This', '', 'is', 'a', 'simple', '', 'sentence', '']
Here is what the code does:
- Import the regular expression module.
- Define the original sentence as a string variable.
- Use the re.split() function to split the sentence at every non-word character, using the regular expression r'[^\w]'.
- Print the output, which is a list of words and empty strings.
Now we can iterate through the list and access the items one by one using the following code.
for word in words:
    print(word)
Output:
This

is
a
simple

sentence

The blank lines in the output come from the empty strings in the list of words. re.split() produces an empty string wherever two delimiters are adjacent (", " and "! " in this sentence) and after the trailing period at the end of the original text.
To remove these empty strings, you can use a list comprehension or the filter() function:
List comprehension
A list comprehension is a concise way of creating a new list in Python. It consists of an expression followed by a for clause and zero or more if clauses.
The expression is evaluated for each item in the for clause, and the resulting value is added to the new list if it meets the conditions specified by the if clauses.
import re
sentence = "This, is a simple! sentence."
words = re.split(r'[^\w]', sentence)
words = [word for word in words if word] # list comprehension
for word in words:
    print(word)
Output:
This
is
a
simple
sentence
Let me explain what we've done:
- Create a new list using a list comprehension.
- For each word in the words list, check if the word is truthy (not empty).
- If the word is truthy, add it to the new list.
- The new list now contains only the truthy words from the original list, with all empty strings removed.
filter() function
filter() is a built-in Python function that returns an iterator over the items of an iterable for which a test function returns a truthy value.
It takes two arguments: a function and an iterable.
The function is applied to each element of the iterable, and only the elements for which it returns a truthy value are yielded. If None is passed instead of a function, the elements themselves are tested, so falsy values such as empty strings are dropped.
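As a small illustration of how filter() applies a test function, the sketch below uses a hypothetical predicate, is_long_word, which is not part of the article's example. It also shows that the result is a lazy iterator that has to be wrapped in list() to get a list.

```python
# Hypothetical predicate, named here only for illustration.
def is_long_word(word):
    """Return True for words longer than two characters."""
    return len(word) > 2

words = ["This", "is", "a", "simple", "sentence"]

# filter() returns a lazy iterator, not a list.
long_words = filter(is_long_word, words)

# Wrap the iterator in list() to materialize the results.
long_words_list = list(long_words)
print(long_words_list)  # ['This', 'simple', 'sentence']
```

Because the iterator is consumed as it is read, calling list(long_words) a second time would return an empty list.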
Let's see how to remove the empty strings using the filter() function.
import re
sentence = "This, is a simple! sentence."
words = re.split(r'[^\w]', sentence)
words = list(filter(None, words))  # Remove empty strings
for word in words:
    print(word)
Here's what the words = list(filter(None, words)) line does:
- Use the filter() function to create a filtered iterator.
- Use None as the function argument to filter out any falsy values from the original list.
- The iterable passed to filter() is the words list, so the result contains only its truthy elements.
- Convert the filtered iterator to a list using the list() function.
- The new list now contains only the truthy elements from the original list, with the empty strings removed.
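As an alternative to filtering after the split, most of the empty strings can be avoided at the regular-expression level. The sketch below is not the article's original approach: it swaps in two other patterns, r"\W+" with re.split() and r"\w+" with re.findall(), and the variable names are chosen only for illustration.

```python
import re

sentence = "This, is a simple! sentence."

# Option 1: split on runs of non-word characters, so ", " counts as one
# delimiter. A trailing empty string can still appear, so filter it out.
split_words = [w for w in re.split(r"\W+", sentence) if w]
print(split_words)  # ['This', 'is', 'a', 'simple', 'sentence']

# Option 2: match the words directly instead of splitting on delimiters.
found_words = re.findall(r"\w+", sentence)
print(found_words)  # ['This', 'is', 'a', 'simple', 'sentence']
```

With re.findall() no cleanup step is needed, because only the matched words are returned.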
3. Using the TextBlob Library
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
TextBlob is built on top of the Natural Language Toolkit (NLTK) library and is easy to use and install.
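To install it with pip, run the following command:
pip install textblob
TextBlob's tokenizers rely on NLTK data. If the example below raises a missing-corpus error, download the data once with:
python -m textblob.download_corpora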
Let's see how to use the library to read a string word by word.
from textblob import TextBlob # Import the TextBlob library
sentence = "This is a simple sentence." # Define a sentence to be processed
words = TextBlob(sentence).words # Create a TextBlob object and use the words attribute to extract the words in the sentence
print(words) # Print the extracted words
Output:
['This', 'is', 'a', 'simple', 'sentence']
TextBlob(sentence).words returns a WordList, a list-like object containing the words of the text with punctuation stripped. You can use a for loop to print the words one by one.
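4. Using a for Loop
We can also loop over the string character by character and build each word manually. This method is not recommended, because split() already does the same job and the manual version below only handles single spaces between words.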
# String
sentence = "This is a simple sentence."
# Initialize an empty list called 'words' to store the individual words from the sentence
words = []
# Initialize an empty string called 'word' to store the current word being built
word = ""
# Iterate through each character in the sentence
for char in sentence:
    # If the current character is a space, append the current 'word' to the 'words' list and reset 'word' to an empty string
    if char == " ":
        words.append(word)
        word = ""
    # If the current character is not a space, add it to the current 'word'
    else:
        word += char
# Append the last word of the sentence to the 'words' list
words.append(word)
# Read word one by one
for word in words:
    print(word)
Output:
This
is
a
simple
sentence.
Here are the steps of the code:
- Initialize a string variable called sentence to hold the sentence "This is a simple sentence."
- Initialize an empty list called words to store the individual words from the sentence.
- Initialize an empty string called word to store the current word being built.
- Iterate through each character in the sentence.
- Within the loop, check if the current character is a space.
- If the current character is a space, append the current word to the words list and reset word to an empty string.
- If the current character is not a space, add it to the current word.
- Append the last word of the sentence to the words list.
- Iterate through each word in the words list.
- Within the loop, print each word.
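The loop above treats every single space as a word boundary, so consecutive spaces, tabs, or a trailing space would produce empty or merged words. A minimal sketch of one way to harden it is shown here; the helper name split_words_manually is hypothetical, and the function is a variation on the steps above rather than the article's original code.

```python
def split_words_manually(text):
    """Split text on any whitespace, skipping empty words (hypothetical helper)."""
    words = []
    word = ""
    for char in text:
        if char.isspace():
            # Only store the word if something has been collected,
            # so consecutive whitespace does not add empty strings.
            if word:
                words.append(word)
                word = ""
        else:
            word += char
    # Store the final word if the text did not end with whitespace.
    if word:
        words.append(word)
    return words


for word in split_words_manually("This  is\ta simple sentence. "):
    print(word)
```

For this input the helper returns the same words as "This is a simple sentence.".split(), which is why the built-in method is usually the better choice.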
Conclusion
In this article, we have covered different techniques for reading strings word by word in Python, including the built-in split() method, the re module's split() function, the TextBlob library, and a manual for loop.
You can choose the method that best fits your needs. All of these methods return a list of words (a list-like WordList in the case of TextBlob), so you can manipulate and access them easily.