Last modified: Feb 16, 2026 by Alexander Williams
Bag of Words Python Tutorial for NLP
Natural Language Processing (NLP) helps computers understand human language. A fundamental step is converting text into numbers. The Bag of Words (BoW) model is a simple and powerful technique for this task.
It is a cornerstone for many text-based machine learning projects.
What is the Bag of Words Model?
The Bag of Words model represents text based on word frequency. It creates a "bag" or collection of words from a document. The order and grammar of words are ignored.
Only the presence and count of words matter. This simplification makes it computationally efficient. It is excellent for tasks like sentiment analysis and spam detection.
Imagine you have two sentences: "The cat sat" and "The dog ran". The combined vocabulary is ["the", "cat", "sat", "dog", "ran"]. Each sentence is then represented as a vector counting these words.
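Here is a minimal sketch of that idea in plain Python, using the two sentences and the combined vocabulary above (the variable names are just for illustration):
# Count each vocabulary word in each sentence
sentences = ["The cat sat", "The dog ran"]
vocab = ["the", "cat", "sat", "dog", "ran"]
for sentence in sentences:
    words = sentence.lower().split()
    print(sentence, "->", [words.count(term) for term in vocab])
The cat sat -> [1, 1, 1, 0, 0]
The dog ran -> [1, 0, 0, 1, 1]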
How Bag of Words Works: A Simple Example
Let's break down the process with a basic Python example. We will start without using any libraries.
# Step 1: Define our corpus (collection of documents)
corpus = [
    "I love Python programming",
    "Python is great for data science",
    "I love data science and Python"
]
# Step 2: Create a vocabulary from all unique words
vocabulary = set()
for document in corpus:
    for word in document.lower().split():
        vocabulary.add(word)
vocabulary = sorted(list(vocabulary))
print("Vocabulary:", vocabulary)
Vocabulary: ['and', 'data', 'for', 'great', 'i', 'is', 'love', 'programming', 'python', 'science']
Now, we create vectors. Each document becomes a list of numbers. Each number corresponds to the count of a vocabulary word in that document.
# Step 3: Create Bag of Words vectors manually
bow_vectors = []
for document in corpus:
    word_counts = {word: 0 for word in vocabulary}
    for word in document.lower().split():
        if word in word_counts:
            word_counts[word] += 1
    # Create vector in the order of the vocabulary
    vector = [word_counts[word] for word in vocabulary]
    bow_vectors.append(vector)
    print(f"Document: '{document}'")
    print(f"BoW Vector: {vector}")
Document: 'I love Python programming'
BoW Vector: [0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
Document: 'Python is great for data science'
BoW Vector: [0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
Document: 'I love data science and Python'
BoW Vector: [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
This output shows our text as numerical vectors. The first vector [0,0,0,0,1,0,1,1,1,0] means: for "I love Python programming", the words "i", "love", "programming", and "python" each appear once.
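If you want to see which count belongs to which word, you can pair the vocabulary with a vector. This short check simply reuses the variables defined above:
# Map each vocabulary word to its count in the first document
print(dict(zip(vocabulary, bow_vectors[0])))
{'and': 0, 'data': 0, 'for': 0, 'great': 0, 'i': 1, 'is': 0, 'love': 1, 'programming': 1, 'python': 1, 'science': 0}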
Implementing Bag of Words with Scikit-Learn
Doing this manually is educational but impractical. The scikit-learn library provides robust, optimized tools. We use the CountVectorizer class.
This is part of a broader toolkit for feature extraction in machine learning.
from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus
documents = [
    "The weather is sunny and warm",
    "I enjoy sunny weather",
    "Today is a warm sunny day"
]
# Step 1: Initialize the CountVectorizer
# For clarity, we will not remove stop words in this example.
vectorizer = CountVectorizer()
# Step 2: Fit the model and transform the documents
# 'fit' learns the vocabulary, 'transform' creates the vectors.
X = vectorizer.fit_transform(documents)
# Step 3: Inspect the results
print("Vocabulary (Feature Names):")
print(vectorizer.get_feature_names_out())
print("\nDense Matrix Representation:")
print(X.toarray())
print("\nShape of the matrix (documents, vocabulary size):", X.shape)
Vocabulary (Feature Names):
['and' 'day' 'enjoy' 'is' 'sunny' 'the' 'today' 'warm' 'weather']
Dense Matrix Representation:
[[1 0 0 1 1 1 0 1 1]
 [0 0 1 0 1 0 0 0 1]
 [0 1 0 1 1 0 1 1 0]]
Shape of the matrix (documents, vocabulary size): (3, 9)
The output is a matrix. Each row is a document. Each column is a word from the vocabulary. The number is the count of that word in the document.
Scikit-learn's CountVectorizer handles tokenization, lowercasing, and building the vocabulary automatically.
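Once fitted, the same vectorizer can encode new text with transform. Words it has never seen are simply ignored, as this quick sketch shows (the sentence and variable name are just examples; "but" and "rainy" are not in the learned vocabulary):
# Encode a new sentence using the vocabulary learned above
new_doc = ["Sunny but rainy weather"]
print(vectorizer.transform(new_doc).toarray())
[[0 0 0 0 1 0 0 0 1]]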
Improving the Basic Bag of Words Model
The basic model has limitations. It treats all words as equally important. Common words like "the" or "is" can dominate. We can improve it.
1. Removing Stop Words
Stop words are frequent, low-meaning words. Removing them focuses on meaningful content. CountVectorizer can do this.
vectorizer_stop = CountVectorizer(stop_words='english')
documents = ["This is a sample sentence with some stop words."]
X_stop = vectorizer_stop.fit_transform(documents)
print("Vocabulary without stop words:", vectorizer_stop.get_feature_names_out())
Vocabulary without stop words: ['sample' 'sentence' 'stop' 'words']
2. Using N-grams
Single words (unigrams) lose context. N-grams are sequences of N words. They can capture phrases like "not good".
vectorizer_ngram = CountVectorizer(ngram_range=(1, 2)) # Unigrams and Bigrams
documents = ["machine learning is fun"]
X_ngram = vectorizer_ngram.fit_transform(documents)
print("Vocabulary with n-grams:", vectorizer_ngram.get_feature_names_out())
Vocabulary with n-grams: ['fun' 'is' 'is fun' 'learning' 'learning is' 'machine' 'machine learning']
3. Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a more advanced weighting scheme. It reduces the weight of words that appear in many documents. This highlights unique, important words.
Scikit-learn provides TfidfVectorizer for this. It's a direct upgrade from simple word counts.
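Here is a minimal sketch of TfidfVectorizer on the weather corpus from earlier. The API mirrors CountVectorizer, but the matrix holds weights instead of raw counts (the variable names are just for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "The weather is sunny and warm",
    "I enjoy sunny weather",
    "Today is a warm sunny day"
]
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(documents)
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray().round(2))
Because "sunny" appears in every document, it receives a lower weight than words that appear in only one document, such as "enjoy".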
Practical Application: Simple Text Classification
Bag of Words vectors are perfect as input for classifiers. Here's a minimal example for sentiment classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Simple labeled data: 0=Negative, 1=Positive
texts = [
    "I love this product", "This is terrible",
    "Great experience", "Worst purchase ever",
    "Highly recommend", "Poor quality"
]
labels = [1, 0, 1, 0, 1, 0]  # Corresponding sentiments
# Create Bag of Words features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
# Split data and train a classifier
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
# Predict on a new sentence
new_text = ["This is a great product"]
new_vector = vectorizer.transform(new_text)
prediction = classifier.predict(new_vector)
print(f"Prediction for '{new_text[0]}': {'Positive' if prediction[0]==1 else 'Negative'}")
Prediction for 'This is a great product': Positive
This shows the core pipeline: text -> BoW vector -> machine learning model -> prediction.
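Scikit-learn can bundle those stages into a single object with make_pipeline, so you fit and predict directly on raw text. A quick sketch reusing the texts and labels above (for brevity it trains on all six examples, without a train/test split):
from sklearn.pipeline import make_pipeline
# Vectorizer and classifier chained into one estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["This is a great product"]))
[1]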
Limitations of the Bag of Words Model
Despite its utility, BoW has significant drawbacks. Understanding these is key.
Loss of Word Order: "Dog bites man" and "Man bites dog" have the same vector, as the sketch after this list demonstrates. All semantic context from word sequence is lost.
High Dimensionality: With a large vocabulary, vectors become very long and sparse (full of zeros). This can be inefficient.
No Semantic Understanding: It does not understand word meaning. "Happy," "joyful," and "glad" are treated as completely different, unrelated words.
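The word-order problem is easy to demonstrate. Both sentences in this small sketch produce identical vectors:
from sklearn.feature_extraction.text import CountVectorizer
pair = ["Dog bites man", "Man bites dog"]
vec = CountVectorizer()
vectors = vec.fit_transform(pair).toarray()
print(vec.get_feature_names_out())
print(vectors)
['bites' 'dog' 'man']
[[1 1 1]
 [1 1 1]]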
For these reasons, more advanced models like Word2Vec, GloVe, or transformer-based embeddings (e.g., BERT) are often used in modern NLP. However, they build upon the foundational concept of vectorizing text.
Conclusion
The Bag of Words model is a vital first step into NLP with Python. It transforms raw text into a numerical format that machines can process. Using CountVectorizer from scikit-learn makes implementation straightforward.
Remember its strengths: simplicity, speed, and effectiveness for many tasks. Also, be aware of its weaknesses: ignoring word order and semantics. Start with BoW for baseline models in text classification or clustering.
Then, explore more sophisticated techniques like TF-IDF and word embeddings as your projects grow. Mastering Bag of Words gives you the essential skill of text vectorization, a foundation for any journey in machine learning and data science.