Last modified: Dec 09, 2024 By Alexander Williams
Python Pandas sample() Explained
The sample() method in Pandas allows you to randomly select rows or columns from a DataFrame or Series. It’s useful for testing, validation, and visualization.
What is the Pandas sample() Method?
The sample() method is a powerful tool for randomly selecting a subset of data from a DataFrame or Series.
It supports sampling with or without replacement, making it ideal for various data tasks.
Syntax of sample()
The syntax for sample() is as follows:
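DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Parameters:
- n: Number of items to return. Mutually exclusive with frac. Defaults to 1 if frac is also omitted.
- frac: Fraction of items to return. Mutually exclusive with n.
- replace: Whether to sample with replacement. Default is False.
- weights: Relative sampling weights for each item. Values are normalized to sum to 1; by default every item is equally likely.
- random_state: Seed or random number generator. Set it to make the sample reproducible.
- axis: Axis to sample from (0 for rows, 1 for columns). Defaults to rows.
- ignore_index: If True, the result is relabeled 0, 1, ..., n - 1 instead of keeping the original index labels (pandas 1.3 and later).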
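As a minimal sketch of how n and frac relate, the snippet below uses a hypothetical ten-row DataFrame (not one of this article's examples). It assumes a recent pandas version and shows that the two parameters are alternatives, not companions:

```python
import pandas as pd

# Hypothetical ten-row frame used only for this illustration
numbers = pd.DataFrame({"value": range(10)})

# n asks for an absolute count, frac for a share of the rows
by_count = numbers.sample(n=3, random_state=0)
by_fraction = numbers.sample(frac=0.3, random_state=0)
print(len(by_count), len(by_fraction))  # 3 3

# Passing both n and frac is rejected with a ValueError
try:
    numbers.sample(n=3, frac=0.3)
    both_accepted = True
except ValueError as exc:
    both_accepted = False
    print("ValueError:", exc)
```

Pick whichever parameter matches how you think about the subset: a fixed size (n) or a proportion of the data (frac).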
Basic Example of sample()
Here’s a simple example demonstrating how to randomly sample rows from a DataFrame:
import pandas as pd
# Create a sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"Age": [25, 30, 35, 40, 45],
"Score": [85.5, 90.3, 88.7, 92.1, 84.0]
}
df = pd.DataFrame(data)
# Sample 2 random rows (no random_state, so the rows chosen vary between runs)
sampled_df = df.sample(n=2)
print("Random Sample:")
print(sampled_df)
Random Sample:
Name Age Score
1 Bob 30 90.3
4 Eve 45 84.0
Sampling a Fraction of Rows
You can sample a fraction of rows with the frac parameter. The resulting row count is rounded to a whole number, so 50% of 5 rows yields 2 rows:
# Sample 50% of rows
frac_sample = df.sample(frac=0.5, random_state=42)
print("50% Sample:")
print(frac_sample)
50% Sample:
Name Age Score
1 Bob 30 90.3
4 Eve 45 84.0
Setting the random_state parameter makes the sample reproducible: the same seed returns the same rows on every run.
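A quick way to convince yourself of this is to draw twice with the same seed and compare. The sketch below rebuilds a smaller version of the example DataFrame so it runs on its own:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 40, 45],
})

# Two independent calls with the same seed
first_draw = people.sample(n=2, random_state=42)
second_draw = people.sample(n=2, random_state=42)

# Same seed, same rows in the same order
print(first_draw.equals(second_draw))  # True
```

Without random_state, the two draws would usually differ, which is fine for exploration but awkward for tests and shared notebooks.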
Sampling Columns Instead of Rows
By default, sample() operates on rows. To sample columns, set axis=1:
# Sample one random column (no random_state, so the column chosen varies between runs)
column_sample = df.sample(n=1, axis=1)
print("Random Column:")
print(column_sample)
Random Column:
Age
0 25
1 30
2 35
3 40
4 45
Sampling with Replacement
Set replace=True to sample rows or columns with replacement, so the same item can be selected more than once:
# Sample 3 rows with replacement
replace_sample = df.sample(n=3, replace=True, random_state=10)
print("Sample with Replacement:")
print(replace_sample)
Sample with Replacement:
Name Age Score
1 Bob 30 90.3
3 David 40 92.1
1 Bob 30 90.3
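One consequence of sampling with replacement is that n may exceed the number of rows, which is the basis of bootstrap-style resampling. The following is a small sketch under that assumption, using only the Score column from the example data. The ignore_index argument requires pandas 1.3 or later:

```python
import pandas as pd

scores = pd.DataFrame({"Score": [85.5, 90.3, 88.7, 92.1, 84.0]})

# With replace=True, n can be larger than the 5 available rows
oversample = scores.sample(n=8, replace=True, random_state=0)
print(len(oversample))               # 8
print(oversample.index.is_unique)    # False: 8 draws from 5 rows must repeat

# ignore_index=True relabels the result 0..n-1 instead of keeping
# the repeated original labels
relabeled = scores.sample(n=8, replace=True, random_state=0, ignore_index=True)
print(list(relabeled.index))         # [0, 1, 2, 3, 4, 5, 6, 7]
```

Relabeling is often convenient here because duplicate index labels can make later lookups with .loc return more than one row.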
Weighted Sampling
The weights parameter lets you assign probabilities for each row or column to influence selection:
# Sample with weights
weighted_sample = df.sample(n=2, weights=[0.1, 0.2, 0.4, 0.2, 0.1], random_state=15)
print("Weighted Sample:")
print(weighted_sample)
Weighted Sample:
Name Age Score
2 Charlie 35 88.7
3 David 40 92.1
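Weights do not have to be a list that sums to 1. When sampling rows, pandas also accepts the name of a column and normalizes its values. The sketch below assumes a hypothetical Tickets column, added purely for illustration, in which a weight of 0 means the row can never be drawn:

```python
import pandas as pd

raffle = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Tickets": [0, 1, 5, 1, 3],  # hypothetical weight column
})

# Pass the column name; pandas rescales the values so they sum to 1
winners = raffle.sample(n=2, weights="Tickets", random_state=1)
print(winners)

# Alice holds 0 tickets, so she is never selected
print("Alice" in winners["Name"].values)  # False
```

Keeping the weights in a column keeps them aligned with the rows they describe, which is harder to guarantee with a separate list.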
Real-World Use Cases
The sample() method is commonly used to split data for testing machine learning models, thin out large datasets before plotting, or create small test datasets.
For more data sampling and transformation, explore our guide on Python Pandas apply() Simplified.
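As one illustration of the model-testing use case, a simple train/test split can be built from sample() and drop(). This is a sketch on a made-up ten-row dataset with hypothetical column names, not a substitute for a dedicated splitting utility:

```python
import pandas as pd

# Made-up dataset for illustration only
dataset = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Draw 80% of the rows for training
train_df = dataset.sample(frac=0.8, random_state=42)

# Everything not drawn becomes the held-out set
test_df = dataset.drop(train_df.index)

print(len(train_df), len(test_df))  # 8 2
```

This relies on the index labels being unique. If they are not, drop() would remove every row sharing a sampled label.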
Common Errors
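With replace=False (the default), sample() raises a ValueError if n exceeds the number of rows or columns available, or if frac is greater than 1. Passing both n and frac also raises a ValueError. Check the size of your data before sampling, or set replace=True if you need more items than exist.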
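The sketch below reproduces the oversized-n error and shows one possible guard. The safe_sample function is a hypothetical helper written for this example, not part of pandas:

```python
import pandas as pd

ages = pd.DataFrame({"Age": [25, 30, 35, 40, 45]})

# Asking for 10 rows out of 5 without replacement fails
try:
    ages.sample(n=10)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)


def safe_sample(frame, n, **kwargs):
    """Hypothetical helper: cap n at the row count unless replace=True."""
    if not kwargs.get("replace", False):
        n = min(n, len(frame))
    return frame.sample(n=n, **kwargs)


print(len(safe_sample(ages, 10)))                # 5
print(len(safe_sample(ages, 10, replace=True)))  # 10
```

Whether silently capping n is acceptable depends on the task. In a test suite it is usually better to let the ValueError surface.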
Conclusion
The Pandas sample() method provides a simple way to randomly select data for analysis, testing, or visualization.
Mastering sample() enables efficient handling of random sampling tasks in DataFrames and Series.