10 minutes
Pandas
Intro
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It was created by Wes McKinney in 2008.
The name has nothing to do with the cute animal; it’s derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals. It is also a play on the phrase “Python data analysis”.
A good resource to learn more about Pandas is the book “Python for Data Analysis” by Wes McKinney.
The environment
Make sure you can run the following commands in your terminal, without errors:
python3.12 --version
Python 3.12.4
Inside Python’s REPL (Read-Eval-Print Loop), which you enter by typing python3.12 in the terminal:
import numpy as np
import pandas as pd
print(np.__version__)
1.26.4
print(pd.__version__)
2.2.2
If you get similar results, you are ready to go.
Data Structures
The main data structures in Pandas are the Series and DataFrame objects.
The Series object is a one-dimensional array-like object containing a sequence of values of the same type and an associated array of data labels, called its index.
The DataFrame object is a two-dimensional array with both row and column indices. The name comes from the R data.frame object.
Operations on Series
Below are some examples of using the Series object.
Create a Series with a default index
The string representation of a Series shows the index on the left and the values on the right.
s = pd.Series([1,2,3,4,5,6,7])
s
0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64
s.array
<NumpyExtensionArray>
[1, 2, 3, 4, 5, 6, 7]
Length: 7, dtype: int64
s.index
RangeIndex(start=0, stop=7, step=1)
Create a Series with a custom index
s = pd.Series([7,77,777], index= ['a', 'b', 'c'])
s
a 7
b 77
c 777
dtype: int64
Selecting values from a Series
s = pd.Series([7,77,777], index= ['a', 'b', 'c'])
s['b']
77
s[['c','b','a']]
c 777
b 77
a 7
dtype: int64
From Python dictionary to Series and back
data = {1:'a', 2:'b', 3:'c'}
s = pd.Series(data)
s
1 a
2 b
3 c
dtype: object
s.to_dict()
{1: 'a', 2: 'b', 3: 'c'}
# passing `index` changes the order of the elements
s = pd.Series(data, index=[3,2,1])
s
3 c
2 b
1 a
dtype: object
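If the index contains a label that is not a key in the dictionary, the corresponding value becomes NaN. A quick sketch with the same data dictionary:
# labels missing from the dictionary become NaN
s = pd.Series(data, index=[1, 2, 4])
s
1 a
2 b
4 NaN
dtype: object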
Data alignment for arithmetic operations
Exemplified for addition, but it works for other operations as well.
s1 = pd.Series({'a':1, 'b':2})
s2 = pd.Series({'a':9, 'b':8})
s1 + s2
a 10
b 10
dtype: int64
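When the two indices do not fully overlap, the result is indexed by the union of the labels, and labels present in only one Series produce NaN. A small sketch:
s1 = pd.Series({'a':1, 'b':2})
s3 = pd.Series({'b':8, 'c':9})
s1 + s3
a NaN
b 10.0
c NaN
dtype: float64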
Operations on DataFrame
Creating a DataFrame from a dictionary of equal-length lists
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
# to change the order of the columns use the `columns` parameter
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]}, columns=['prime','odd','even'])
frame
prime odd even
0 1 1 0
1 2 3 2
2 3 5 4
3 5 7 6
4 7 9 8
Retrieving a column from a DataFrame as a Series
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
primes = frame['prime']
primes
0 1
1 2
2 3
3 5
4 7
Name: prime, dtype: int64
type(primes)
<class 'pandas.core.series.Series'>
Retrieving a row from a DataFrame as a Series
loc selects rows by label, while iloc selects them by integer position; here the two coincide because the default index runs from 0 to 4.
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
frame.loc[4]
even 8
odd 9
prime 7
Name: 4, dtype: int64
frame.iloc[4]
even 8
odd 9
prime 7
Name: 4, dtype: int64
Assigning values to a column
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame.even = [2,4,6,8,10]
frame
even odd prime
0 2 1 1
1 4 3 2
2 6 5 3
3 8 7 5
4 10 9 7
Adding a new column
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame['prime_is_odd'] = frame['prime'] % 2 != 0
frame
even odd prime prime_is_odd
0 0 1 1 True
1 2 3 2 False
2 4 5 3 True
3 6 7 5 True
4 8 9 7 True
Deleting a column
frame.columns
Index(['even', 'odd', 'prime', 'prime_is_odd'], dtype='object')
del frame['prime_is_odd']
frame.columns
Index(['even', 'odd', 'prime'], dtype='object')
Transposing a DataFrame
i.e. swapping rows and columns
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame.T
0 1 2 3 4
even 0 2 4 6 8
odd 1 3 5 7 9
prime 1 2 3 5 7
From DataFrame to Numpy’s ndarray
NumPy arrays will be covered in another post. For now you can think of them as a more powerful version of Python’s lists.
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame.to_numpy()
array([[0, 1, 1],
[2, 3, 2],
[4, 5, 3],
[6, 7, 5],
[8, 9, 7]])
The Index object
- is immutable
- can be shared between data structures
- is array-like and also behaves like a fixed-size set
- can contain duplicate labels
index = pd.Index(['a','a','b','c'])
index
Index(['a', 'a', 'b', 'c'], dtype='object')
index[1:]
Index(['a', 'b', 'c'], dtype='object')
# index[0] = 'x' -> TypeError: Index does not support mutable operations
s = pd.Series([1,1,2,3], index=index)
s
a 1
a 1
b 2
c 3
dtype: int64
s.index is index
True
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
frame.columns
Index(['even', 'odd', 'prime'], dtype='object')
frame.index
RangeIndex(start=0, stop=5, step=1)
'odd' in frame.columns
True
3 in frame.index
True
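Because an Index behaves like a set, it also supports the usual set operations, for example union, intersection, and difference. A small sketch:
a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])
a.union(b)
Index(['a', 'b', 'c', 'd'], dtype='object')
a.intersection(b)
Index(['b', 'c'], dtype='object')
a.difference(b)
Index(['a'], dtype='object')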
Essential functionality
Reindexing
i.e. creating a new object with the values rearranged according to the new index.
obj = pd.Series([1,2,3], index=['a', 'b', 'c'])
obj
a 1
b 2
c 3
dtype: int64
obj.reindex(['c','b','a'])
c 3
b 2
a 1
dtype: int64
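Labels that were not in the original object are introduced as NaN; the fill_value parameter supplies a default instead. A quick sketch with the same obj:
obj.reindex(['a', 'b', 'c', 'd'])
a 1.0
b 2.0
c 3.0
d NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd'], fill_value=0)
a 1
b 2
c 3
d 0
dtype: int64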
Dropping entries from an axis
of a Series:
obj = pd.Series([1,2,3], index=['a', 'b', 'c'])
obj.drop('b')
a 1
c 3
dtype: int64
of a DataFrame:
The expression np.arange(4).reshape((2,2)) creates a 2x2 matrix with values from 0 to 3.
frame = pd.DataFrame(np.arange(4).reshape((2,2)), index=['a', 'b'], columns=['c', 'd'])
frame
c d
a 0 1
b 2 3
frame.drop(index=['b'], columns=['d'])
c
a 0
Indexing, selection, and filtering
obj = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
obj['b']
2
obj.loc[['b','a','d']]
b 2
a 1
d 4
dtype: int64
obj[1] # Future warning: integer keys will always be treated as labels
2
obj.iloc[1]
2
obj[2:4]
c 3
d 4
dtype: int64
obj[['b','a','d']]
b 2
a 1
d 4
dtype: int64
obj[obj > 2]
c 3
d 4
dtype: int64
frame = pd.DataFrame(np.arange(4).reshape((2,2)), index=['a', 'b'], columns=['c', 'd'])
frame
c d
a 0 1
b 2 3
frame.loc['b']
c 2
d 3
Name: b, dtype: int64
frame.iloc[1]
c 2
d 3
Name: b, dtype: int64
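Boolean filtering works on a DataFrame too: rows where the condition holds are kept. A brief sketch on the same frame:
frame[frame['d'] > 1]
c d
b 2 3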
Sorting
obj = pd.Series([4,3,2,1], index=['d', 'c', 'b', 'a'])
obj.sort_index()
a 1
b 2
c 3
d 4
dtype: int64
frame = pd.DataFrame(np.arange(4).reshape((2,2)), index=['b', 'a'], columns=['d', 'c'])
frame
d c
b 0 1
a 2 3
frame.sort_index()
d c
a 2 3
b 0 1
frame.sort_index(axis='columns')
c d
b 1 0
a 3 2
frame.sort_index(axis='columns', ascending=False)
d c
b 0 1
a 2 3
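To sort by values instead of by index there is sort_values; for a DataFrame, the by parameter picks the column to sort on. A short sketch:
obj.sort_values()
a 1
b 2
c 3
d 4
dtype: int64
frame.sort_values(by='d', ascending=False)
d c
a 2 3
b 0 1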
Reading and writing data
Pandas can read and write data from and to a variety of formats, such as CSV, JSON, Excel, SQL databases, and more.
In this example, we will be focusing on reading and writing CSV files.
Reading a CSV file
cat data.csv
col1,col2,col3
1,2,foo
3,4,bar
5,6,baz
pd.read_csv('data.csv')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
In case the file does not have a header row, read_csv has some parameters to help with that.
cat data_no_header.csv
1,2,foo
3,4,bar
5,6,baz
pd.read_csv('data_no_header.csv', header=None)
0 1 2
0 1 2 foo
1 3 4 bar
2 5 6 baz
pd.read_csv('data_no_header.csv', names=['c1','c2','c3'])
c1 c2 c3
0 1 2 foo
1 3 4 bar
2 5 6 baz
We can also specify the index column to be used instead of the default one.
pd.read_csv('data.csv')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
pd.read_csv('data.csv', index_col='col3')
col1 col2
col3
foo 1 2
bar 3 4
baz 5 6
In case the CSV file uses another separator, we can specify it using the sep parameter.
pd.read_csv('data_with_spaces.csv', sep='\\s+')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
read_csv also allows us to skip specific rows, or to only read a specific number of rows.
pd.read_csv('data_with_spaces.csv', sep='\\s+')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
pd.read_csv('data_with_spaces.csv', sep='\\s+', skiprows=[1,2])
col1 col2 col3
0 5 6 baz
# two more rows (bax and bay) have since been appended to data_with_spaces.csv
pd.read_csv('data_with_spaces.csv', sep='\\s+')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
3 7 8 bax
4 9 10 bay
pd.read_csv('data_with_spaces.csv', sep='\\s+', nrows=3)
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
Writing a CSV file
to_csv writes a DataFrame to a CSV file. It offers a variety of parameters to customize the output, among them sep, header, and columns.
data = pd.read_csv('data_with_spaces.csv', sep='\\s+', nrows=3)
data
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
data.to_csv('new_data.csv', sep='|', header=True, columns=['col2','col3'])
cat new_data.csv
|col2|col3
0|2|foo
1|4|bar
2|6|baz
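The row index is written by default (the unnamed first column above); passing index=False leaves it out. A quick sketch:
data.to_csv('new_data.csv', sep='|', columns=['col2','col3'], index=False)
cat new_data.csv
col2|col3
2|foo
4|bar
6|baz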
Data cleaning
Missing data
Python’s None and NumPy’s np.nan (np comes from import numpy as np) are considered missing data in Pandas. na stands for “not available”.
s = pd.Series([1,2,np.nan,None,3,4])
s
0 1.0
1 2.0
2 NaN
3 NaN
4 3.0
5 4.0
dtype: float64
s.isna()
0 False
1 False
2 True
3 True
4 False
5 False
dtype: bool
s.notna()
0 True
1 True
2 False
3 False
4 True
5 True
dtype: bool
notna() is the opposite of isna().
Filtering out missing data
s = pd.Series([1,2,np.nan,None,3,4])
s
0 1.0
1 2.0
2 NaN
3 NaN
4 3.0
5 4.0
dtype: float64
s.dropna()
0 1.0
1 2.0
4 3.0
5 4.0
dtype: float64
data = pd.DataFrame([[1,2,3],[4,None,5],[None,None,np.nan]])
data
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
data.dropna()
0 1 2
0 1.0 2.0 3.0
# to drop only rows that are all NA
data.dropna(how='all')
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
For dropping columns we need to specify axis='columns'.
data
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
# add a new column with all NA values
data[len(data.columns)] = [None,None,None]
data
0 1 2 3
0 1.0 2.0 3.0 None
1 4.0 NaN 5.0 None
2 NaN NaN NaN None
data.dropna(axis='columns', how='all')
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
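The thresh parameter keeps only the rows that have at least the given number of non-NA values. A small sketch on the same frame:
# keep rows with at least two non-NA values
data.dropna(thresh=2)
0 1 2 3
0 1.0 2.0 3.0 None
1 4.0 NaN 5.0 None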
Filling in missing data
data
0 1 2 3
0 1.0 2.0 3.0 None
1 4.0 NaN 5.0 None
2 NaN NaN NaN None
data.fillna(0)
0 1 2 3
0 1.0 2.0 3.0 0
1 4.0 0.0 5.0 0
2 0.0 0.0 0.0 0
data.fillna({0:-1, 1:-2, 2:-3, 3:-4})
0 1 2 3
0 1.0 2.0 3.0 -4
1 4.0 -2.0 5.0 -4
2 -1.0 -2.0 -3.0 -4
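Instead of a fixed constant, a common choice is a statistic computed from the data itself, for example the mean. A brief sketch on a small Series:
s = pd.Series([1.0, np.nan, 3.0])
s.fillna(s.mean())
0 1.0
1 2.0
2 3.0
dtype: float64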
Removing duplicates
data = pd.DataFrame([[1,2,3],[4,2,5],[1,2,3]])
data
0 1 2
0 1 2 3
1 4 2 5
2 1 2 3
# returns a boolean Series indicating whether each row is a duplicate
data.duplicated()
0 False
1 False
2 True
dtype: bool
data.drop_duplicates()
0 1 2
0 1 2 3
1 4 2 5
By default, drop_duplicates looks at all columns, but we can specify a subset of columns to consider.
# rebuild the filled-in frame from the fillna example above
data = pd.DataFrame([[1.0, 2.0, 3.0, -4], [4.0, -2.0, 5.0, -4], [-1.0, -2.0, -3.0, -4]])
data
0 1 2 3
0 1.0 2.0 3.0 -4
1 4.0 -2.0 5.0 -4
2 -1.0 -2.0 -3.0 -4
data.drop_duplicates(subset=[3])
0 1 2 3
0 1.0 2.0 3.0 -4
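drop_duplicates keeps the first occurrence; the keep parameter changes that, e.g. keep='last' keeps the last occurrence instead. A quick sketch:
data.drop_duplicates(subset=[3], keep='last')
0 1 2 3
2 -1.0 -2.0 -3.0 -4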
Applying functions or mapping
data = pd.DataFrame([[1,2,3],[4,2,5],[1,2,3]])
data
0 1 2
0 1 2 3
1 4 2 5
2 1 2 3
def f(x):
... return x**2
...
data[3] = data[2].map(f)
data
0 1 2 3
0 1 2 3 9
1 4 2 5 25
2 1 2 3 9
digits = pd.DataFrame([[1,2],[3,4]])
digits
0 1
0 1 2
1 3 4
d = {
... 1: 'one',
... 2: 'two',
... 3: 'three',
... 4: 'four'
... }
digits[0] = digits[0].map(d)
digits[1] = digits[1].map(d)
digits
0 1
0 one two
1 three four
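For applying a function along a whole axis of a DataFrame (per column or per row) there is apply. A short sketch on the data frame from the map example above:
# per column (the default axis)
data.apply(lambda col: col.max() - col.min())
0 3
1 0
2 2
3 16
dtype: int64
# per row
data.apply(sum, axis='columns')
0 15
1 36
2 15
dtype: int64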
Outro
This post covered the basics of Pandas, a powerful data manipulation tool built on top of Python. We learned about the main data structures in Pandas, the Series and DataFrame objects, and how to perform operations on them.
This is just the tip of the iceberg; Pandas has a lot more to offer, and I encourage you to explore the official documentation and other resources to learn more about it.