10 minutes
Pandas
Intro
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It was created by Wes McKinney in 2008.
The name has nothing to do with the cute animal; it’s derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals. It is also a play on the phrase “Python data analysis”.
A good resource to learn more about Pandas is the book “Python for Data Analysis” by Wes McKinney.
The environment
Make sure you can run the following commands in your terminal, without errors:
python3.12 --version
Python 3.12.4
Inside Python’s REPL (Read-Eval-Print Loop), which you enter by typing python3.12 in the terminal:
import numpy as np
import pandas as pd
print(np.__version__)
1.26.4
print(pd.__version__)
2.2.2
If you get similar results, you are ready to go.
Data Structures
The main data structures in Pandas are the Series and DataFrame objects.
The Series object is a one-dimensional array-like object containing a sequence of values of the same type and an associated array of data labels, called its index.
The DataFrame object is a two-dimensional array with both row and column indices. The name comes from the R data.frame object.
Operations on Series
Below are some examples of using the Series object.
Create a Series with a default index
The string representation of a Series shows the index on the left and the values on the right.
s = pd.Series([1,2,3,4,5,6,7])
s
0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64
s.array
<NumpyExtensionArray>
[1, 2, 3, 4, 5, 6, 7]
Length: 7, dtype: int64
s.index
RangeIndex(start=0, stop=7, step=1)
Create a Series with a custom index
s = pd.Series([7,77,777], index= ['a', 'b', 'c'])
s
a 7
b 77
c 777
dtype: int64
Selecting values from a Series
s = pd.Series([7,77,777], index= ['a', 'b', 'c'])
s['b']
77
s[['c','b','a']]
c 777
b 77
a 7
dtype: int64
From Python dictionary to Series and back
data = {1:'a', 2:'b', 3:'c'}
s = pd.Series(data)
s
1 a
2 b
3 c
dtype: object
s.to_dict()
{1: 'a', 2: 'b', 3: 'c'}
# passing `index` changes the order of the elements
s = pd.Series(data, index=[3,2,1])
s
3 c
2 b
1 a
dtype: object
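If the index contains a label that is not a key in the dictionary, the corresponding value becomes NaN. A quick sketch with the same data dictionary:
# labels missing from the dictionary become NaN
s = pd.Series(data, index=[1, 2, 4])
s
1 a
2 b
4 NaN
dtype: object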
Data alignment for arithmetic operations
Exemplified for addition, but it works for other operations as well.
s1 = pd.Series({'a':1, 'b':2})
s2 = pd.Series({'a':9, 'b':8})
s1 + s2
a 10
b 10
dtype: int64
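When the two indices do not fully overlap, the result is indexed by the union of the labels, and labels present in only one Series produce NaN. A small sketch:
s1 = pd.Series({'a':1, 'b':2})
s3 = pd.Series({'b':8, 'c':9})
s1 + s3
a NaN
b 10.0
c NaN
dtype: float64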
Operations on DataFrame
Creating a DataFrame from a dictionary of equal-length lists
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
# to change the order of the columns use the `columns` parameter
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]}, columns=['prime','odd','even'])
frame
prime odd even
0 1 1 0
1 2 3 2
2 3 5 4
3 5 7 6
4 7 9 8
Retrieving a column from a DataFrame as a Series
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
primes = frame['prime']
primes
0 1
1 2
2 3
3 5
4 7
Name: prime, dtype: int64
type(primes)
<class 'pandas.core.series.Series'>
Retrieving a row from a DataFrame as a Series
loc selects rows by label, while iloc selects them by integer position; here the two coincide because the default index runs from 0 to 4.
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
frame.loc[4]
even 8
odd 9
prime 7
Name: 4, dtype: int64
frame.iloc[4]
even 8
odd 9
prime 7
Name: 4, dtype: int64
Assigning values to a column
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame.even = [2,4,6,8,10]
frame
even odd prime
0 2 1 1
1 4 3 2
2 6 5 3
3 8 7 5
4 10 9 7
Adding a new column
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame['prime_is_odd'] = frame['prime'] % 2 != 0
frame
even odd prime prime_is_odd
0 0 1 1 True
1 2 3 2 False
2 4 5 3 True
3 6 7 5 True
4 8 9 7 True
Deleting a column
frame.columns
Index(['even', 'odd', 'prime', 'prime_is_odd'], dtype='object')
del frame['prime_is_odd']
frame.columns
Index(['even', 'odd', 'prime'], dtype='object')
Transposing a DataFrame
i.e. swapping rows and columns
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame.T
0 1 2 3 4
even 0 2 4 6 8
odd 1 3 5 7 9
prime 1 2 3 5 7
From DataFrame to Numpy’s ndarray
NumPy arrays will be covered in another post. For now you can think of them as a more powerful version of Python’s lists.
frame
even odd prime
0 0 1 1
1 2 3 2
2 4 5 3
3 6 7 5
4 8 9 7
frame.to_numpy()
array([[0, 1, 1],
[2, 3, 2],
[4, 5, 3],
[6, 7, 5],
[8, 9, 7]])
The Index object
- is immutable
- can be shared between data structures
- is array-like and also behaves like a fixed-size set
- can contain duplicate labels
index = pd.Index(['a','a','b','c'])
index
Index(['a', 'a', 'b', 'c'], dtype='object')
index[1:]
Index(['a', 'b', 'c'], dtype='object')
# index[0] = 'x' -> TypeError: Index does not support mutable operations
s = pd.Series([1,1,2,3], index=index)
s
a 1
a 1
b 2
c 3
dtype: int64
s.index is index
True
frame = pd.DataFrame({'even':[0,2,4,6,8], 'odd':[1,3,5,7,9], 'prime':[1,2,3,5,7]})
frame.columns
Index(['even', 'odd', 'prime'], dtype='object')
frame.index
RangeIndex(start=0, stop=5, step=1)
'odd' in frame.columns
True
3 in frame.index
True
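Because an Index behaves like a set, it also supports the usual set operations, for example union, intersection, and difference. A small sketch:
a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])
a.union(b)
Index(['a', 'b', 'c', 'd'], dtype='object')
a.intersection(b)
Index(['b', 'c'], dtype='object')
a.difference(b)
Index(['a'], dtype='object')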
Essential functionality
Reindexing
i.e. creating a new object with the values rearranged according to the new index.
obj = pd.Series([1,2,3], index=['a', 'b', 'c'])
obj
a 1
b 2
c 3
dtype: int64
obj.reindex(['c','b','a'])
c 3
b 2
a 1
dtype: int64
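Labels that were not in the original object are introduced as NaN; the fill_value parameter supplies a default instead. A quick sketch with the same obj:
obj.reindex(['a', 'b', 'c', 'd'])
a 1.0
b 2.0
c 3.0
d NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd'], fill_value=0)
a 1
b 2
c 3
d 0
dtype: int64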
Dropping entries from an axis
of a Series:
obj = pd.Series([1,2,3], index=['a', 'b', 'c'])
obj.drop('b')
a 1
c 3
dtype: int64
of a DataFrame:
The expression np.arange(4).reshape((2,2)) creates a 2x2 matrix with values from 0 to 3.
frame = pd.DataFrame(np.arange(4).reshape((2,2)), index=['a', 'b'], columns=['c', 'd'])
frame
c d
a 0 1
b 2 3
frame.drop(index=['b'], columns=['d'])
c
a 0
Indexing, selection, and filtering
obj = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
obj['b']
2
obj.loc[['b','a','d']]
b 2
a 1
d 4
dtype: int64
obj[1] # Future warning: integer keys will always be treated as labels
2
obj.iloc[1]
2
obj[2:4]
c 3
d 4
dtype: int64
obj[['b','a','d']]
b 2
a 1
d 4
dtype: int64
obj[obj > 2]
c 3
d 4
dtype: int64
frame = pd.DataFrame(np.arange(4).reshape((2,2)), index=['a', 'b'], columns=['c', 'd'])
frame
c d
a 0 1
b 2 3
frame.loc['b']
c 2
d 3
Name: b, dtype: int64
frame.iloc[1]
c 2
d 3
Name: b, dtype: int64
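Boolean filtering works on a DataFrame too: rows where the condition holds are kept. A brief sketch on the same frame:
frame[frame['d'] > 1]
c d
b 2 3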
Sorting
obj = pd.Series([4,3,2,1], index=['d', 'c', 'b', 'a'])
obj.sort_index()
a 1
b 2
c 3
d 4
dtype: int64
frame = pd.DataFrame(np.arange(4).reshape((2,2)), index=['b', 'a'], columns=['d', 'c'])
frame
d c
b 0 1
a 2 3
frame.sort_index()
d c
a 2 3
b 0 1
frame.sort_index(axis='columns')
c d
b 1 0
a 3 2
frame.sort_index(axis='columns', ascending=False)
d c
b 0 1
a 2 3
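To sort by values instead of by index there is sort_values; for a DataFrame, the by parameter picks the column to sort on. A short sketch:
obj.sort_values()
a 1
b 2
c 3
d 4
dtype: int64
frame.sort_values(by='d', ascending=False)
d c
a 2 3
b 0 1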
Reading and writing data
Pandas can read and write data from and to a variety of formats, such as CSV, JSON, Excel, SQL databases, and more.
In this example, we will be focusing on reading and writing CSV files.
Reading a CSV file
cat data.csv
col1,col2,col3
1,2,foo
3,4,bar
5,6,baz
pd.read_csv('data.csv')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
In case the file does not have a header row, read_csv has some parameters to help with that.
cat data_no_header.csv
1,2,foo
3,4,bar
5,6,baz
pd.read_csv('data_no_header.csv', header=None)
0 1 2
0 1 2 foo
1 3 4 bar
2 5 6 baz
pd.read_csv('data_no_header.csv', names=['c1','c2','c3'])
c1 c2 c3
0 1 2 foo
1 3 4 bar
2 5 6 baz
We can also specify the index column to be used instead of the default one.
pd.read_csv('data.csv')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
pd.read_csv('data.csv', index_col='col3')
col1 col2
col3
foo 1 2
bar 3 4
baz 5 6
In case the CSV file uses another separator, we can specify it using the sep parameter.
pd.read_csv('data_with_spaces.csv', sep='\\s+')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
read_csv also allows us to skip specific rows, or to only read a specific number of rows.
pd.read_csv('data_with_spaces.csv', sep='\\s+')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
pd.read_csv('data_with_spaces.csv', sep='\\s+', skiprows=[1,2])
col1 col2 col3
0 5 6 baz
# two more rows (bax and bay) have since been appended to data_with_spaces.csv
pd.read_csv('data_with_spaces.csv', sep='\\s+')
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
3 7 8 bax
4 9 10 bay
pd.read_csv('data_with_spaces.csv', sep='\\s+', nrows=3)
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
Writing a CSV file
to_csv writes a DataFrame to a CSV file. It offers a variety of parameters to customize the output, among them sep, header, and columns.
data = pd.read_csv('data_with_spaces.csv', sep='\\s+', nrows=3)
data
col1 col2 col3
0 1 2 foo
1 3 4 bar
2 5 6 baz
data.to_csv('new_data.csv', sep='|', header=True, columns=['col2','col3'])
cat new_data.csv
|col2|col3
0|2|foo
1|4|bar
2|6|baz
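The row index is written by default (the unnamed first column above); passing index=False leaves it out. A quick sketch:
data.to_csv('new_data.csv', sep='|', columns=['col2','col3'], index=False)
cat new_data.csv
col2|col3
2|foo
4|bar
6|baz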
Data cleaning
Missing data
Python’s None and NumPy’s np.nan (np comes from import numpy as np) are considered missing data in Pandas. na stands for “not available”.
s = pd.Series([1,2,np.nan,None,3,4])
s
0 1.0
1 2.0
2 NaN
3 NaN
4 3.0
5 4.0
dtype: float64
s.isna()
0 False
1 False
2 True
3 True
4 False
5 False
dtype: bool
s.notna()
0 True
1 True
2 False
3 False
4 True
5 True
dtype: bool
notna() is the opposite of isna().
Filtering out missing data
s = pd.Series([1,2,np.nan,None,3,4])
s
0 1.0
1 2.0
2 NaN
3 NaN
4 3.0
5 4.0
dtype: float64
s.dropna()
0 1.0
1 2.0
4 3.0
5 4.0
dtype: float64
data = pd.DataFrame([[1,2,3],[4,None,5],[None,None,np.nan]])
data
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
data.dropna()
0 1 2
0 1.0 2.0 3.0
# to drop only rows that are all NA
data.dropna(how='all')
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
For dropping columns we need to specify axis='columns'.
data
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
# add a new column with all NA values
data[len(data.columns)] = [None,None,None]
data
0 1 2 3
0 1.0 2.0 3.0 None
1 4.0 NaN 5.0 None
2 NaN NaN NaN None
data.dropna(axis='columns', how='all')
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
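The thresh parameter keeps only the rows that have at least the given number of non-NA values. A small sketch on the same frame:
# keep rows with at least two non-NA values
data.dropna(thresh=2)
0 1 2 3
0 1.0 2.0 3.0 None
1 4.0 NaN 5.0 None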
Filling in missing data
data
0 1 2 3
0 1.0 2.0 3.0 None
1 4.0 NaN 5.0 None
2 NaN NaN NaN None
data.fillna(0)
0 1 2 3
0 1.0 2.0 3.0 0
1 4.0 0.0 5.0 0
2 0.0 0.0 0.0 0
data.fillna({0:-1, 1:-2, 2:-3, 3:-4})
0 1 2 3
0 1.0 2.0 3.0 -4
1 4.0 -2.0 5.0 -4
2 -1.0 -2.0 -3.0 -4
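Instead of a fixed constant, a common choice is a statistic computed from the data itself, for example the mean. A brief sketch on a small Series:
s = pd.Series([1.0, np.nan, 3.0])
s.fillna(s.mean())
0 1.0
1 2.0
2 3.0
dtype: float64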
Removing duplicates
data = pd.DataFrame([[1,2,3],[4,2,5],[1,2,3]])
data
0 1 2
0 1 2 3
1 4 2 5
2 1 2 3
# returns a boolean Series indicating whether each row is a duplicate
data.duplicated()
0 False
1 False
2 True
dtype: bool
data.drop_duplicates()
0 1 2
0 1 2 3
1 4 2 5
By default, drop_duplicates looks at all columns, but we can specify a subset of columns to consider.
# rebuild the filled-in frame from the fillna example above
data = pd.DataFrame([[1.0, 2.0, 3.0, -4], [4.0, -2.0, 5.0, -4], [-1.0, -2.0, -3.0, -4]])
data
0 1 2 3
0 1.0 2.0 3.0 -4
1 4.0 -2.0 5.0 -4
2 -1.0 -2.0 -3.0 -4
data.drop_duplicates(subset=[3])
0 1 2 3
0 1.0 2.0 3.0 -4
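drop_duplicates keeps the first occurrence; the keep parameter changes that, e.g. keep='last' keeps the last occurrence instead. A quick sketch:
data.drop_duplicates(subset=[3], keep='last')
0 1 2 3
2 -1.0 -2.0 -3.0 -4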
Applying functions or mapping
data = pd.DataFrame([[1,2,3],[4,2,5],[1,2,3]])
data
0 1 2
0 1 2 3
1 4 2 5
2 1 2 3
def f(x):
... return x**2
...
data[3] = data[2].map(f)
data
0 1 2 3
0 1 2 3 9
1 4 2 5 25
2 1 2 3 9
digits = pd.DataFrame([[1,2],[3,4]])
digits
0 1
0 1 2
1 3 4
d = {
... 1: 'one',
... 2: 'two',
... 3: 'three',
... 4: 'four'
... }
digits[0] = digits[0].map(d)
digits[1] = digits[1].map(d)
digits
0 1
0 one two
1 three four
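For applying a function along a whole axis of a DataFrame (per column or per row) there is apply. A short sketch on the data frame from the map example above:
# per column (the default axis)
data.apply(lambda col: col.max() - col.min())
0 3
1 0
2 2
3 16
dtype: int64
# per row
data.apply(sum, axis='columns')
0 15
1 36
2 15
dtype: int64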
Outro
This post covered the basics of Pandas, a powerful data manipulation tool built on top of Python. We learned about the main data structures in Pandas, the Series and DataFrame objects, and how to perform operations on them.
This is just the tip of the iceberg; Pandas has a lot more to offer, and I encourage you to explore the official documentation and other resources to learn more about it.