Getting started with Pandas in Python

Import Pandas into Python

Enter the following line in Python. If you don’t get an error, Pandas is installed on your system. Otherwise, install Pandas first (for example, with pip install pandas).

import pandas as pd

Read the input csv file

The call below assumes that the input file is in your current working directory. If it is somewhere else, pass the relative or absolute path to the file instead.

data = pd.read_csv("coffees.csv")
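As a minimal sketch of reading from an explicit path: the temporary folder and the two-row file below are stand-ins made up for illustration (the rows are copied from the dataset shown in this post), and the result is stored in a separate variable, sample, so it does not replace data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: write a tiny stand-in for coffees.csv into a temporary folder.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "coffees.csv"
csv_path.write_text(
    "timestamp,coffees,contributor\n"
    "2011-10-03 08:22:00,397.0,Quentin\n"
    "2011-10-04 11:48:00,410.0,Quentin\n"
)

# read_csv accepts a plain string or a Path object, relative or absolute.
sample = pd.read_csv(csv_path)
print(sample.shape)  # (2, 3)
```

Building the path with pathlib avoids hard-coding separators, which tends to matter when the same script runs on Windows and Linux.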

Let us print the dataset that we just read –

print(data)
timestamp coffees contributor
0 2011-10-03 08:22:00 397.0 Quentin
1 2011-10-04 11:48:00 410.0 Quentin
2 2011-10-05 07:02:00 testing Anthony
3 2011-10-05 08:25:00 NaN Quentin
4 2011-10-05 10:47:00 464.0 Quentin
5 2011-10-05 13:15:00 481.0 Quentin
6 2011-10-06 07:21:00 503.0 Anthony
7 2011-10-06 10:04:00 513.0 Quentin
8 2011-10-06 12:14:00 539.0 Mike
9 2011-10-06 12:49:00 540.0 Quentin
10 2011-10-06 14:52:00 563.0 Ben
11 2011-10-07 07:34:00 581.0 Anthony
12 2011-10-07 08:37:00 587.0 Quentin
13 2011-10-07 11:09:00 605.0 Quentin
14 2011-10-07 13:14:00 616.0 Mike
15 2011-10-07 14:10:00 NaN Ben
16 2011-10-07 15:20:00 626.0 Mike M
17 2011-10-07 16:50:00 635.0 Mike M
18 2011-10-09 16:53:00 650.0 Colm
19 2011-10-10 07:29:00 656.0 Anthony
20 2011-10-10 10:13:00 673.0 Quentin
21 2011-10-10 13:41:00 694.0 Mike M
22 2011-10-10 14:02:00 699.0 Quentin
23 2011-10-10 15:23:00 713.0 Quentin
24 2011-10-11 14:09:00 770.0 Mike M
25 2011-10-12 08:11:00 790.0 Quentin
26 2011-10-12 09:57:00 799.0 Mike M
27 2011-10-12 10:06:00 805.0 Sergio
28 2011-10-12 12:01:00 818.0 Mike M
29 2011-10-12 12:30:00 819.0 Quentin
.. ... ... ...
641 2013-01-28 10:43:00 NaN Sergio
642 2013-01-28 13:13:00 NaN Quentin
643 2013-01-28 14:01:00 16195.0 Quentin
644 2013-01-29 13:43:00 16237.0 Quentin
645 2013-01-29 15:06:00 16257.0 Quentin
646 2013-02-04 13:25:00 16513.0 Sergio
647 2013-02-06 17:33:00 16659.0 Quentin
648 2013-02-07 13:30:00 16714.0 Sergio
649 2013-02-12 08:36:00 16891.0 Sergio
650 2013-02-12 11:39:00 16909.0 Quentin
651 2013-02-13 13:58:00 16977.0 Quentin
652 2013-02-16 11:55:00 17104.0 Quentin
653 2013-02-18 12:04:00 NaN Quentin
654 2013-02-18 13:46:00 17165.0 Quentin
655 2013-02-21 13:44:00 17345.0 Quentin
656 2013-02-21 15:02:00 17354.0 Quentin
657 2013-02-25 13:33:00 17468.0 Quentin
658 2013-02-25 17:25:00 17489.0 Quentin
659 2013-02-27 09:33:00 17564.0 Quentin
660 2013-03-04 10:46:00 17789.0 Sergio
661 2013-03-04 11:12:00 17793.0 Quentin
662 2013-03-04 16:43:00 17824.0 Quentin
663 2013-03-05 10:42:00 17852.0 Quentin
664 2013-03-05 13:29:00 17868.0 Quentin
665 2013-03-08 10:28:00 18062.0 Quentin
666 2013-03-12 08:28:00 18235.0 Sergio
667 2013-04-05 11:20:00 18942.0 Sergio
668 2013-04-27 11:04:00 19698.0 Sergio
669 2013-09-12 15:38:00 24450.0 Quentin
670 2013-09-13 10:28:00 24463.0 Quentin

Printing the entire dataset might not be too helpful. We might need to view only the first few rows of the dataset –

print(data.head())
timestamp coffees contributor
0 2011-10-03 08:22:00 397.0 Quentin
1 2011-10-04 11:48:00 410.0 Quentin
2 2011-10-05 07:02:00 testing Anthony
3 2011-10-05 08:25:00 NaN Quentin
4 2011-10-05 10:47:00 464.0 Quentin

By default, head prints the first five rows of the dataset. If we need a different number of rows (say, 15), we can pass it as follows –

print(data.head(n=15))

Print a particular row

One way to think about a Pandas DataFrame is as a combination of an array and a hashmap – you can access its rows using a key, and also like an array using a positional index.

To access a row by key, we use loc with the value of the key. Here, the key (the index label) happens to be the same as the positional index, so running data.loc[2] gives us the third row in the dataset. If the timestamp column were the index instead, we would have used data.loc['2011-10-05 07:02:00']. On the other hand, if we want to access a row like an array element, by its position, we use iloc instead. Running data.iloc[2] always returns the third row in the dataset, whatever the index labels are.

print(data.loc[2])
print(data.iloc[2])
timestamp      2011-10-05 07:02:00
coffees                    testing
contributor                Anthony
Name: 2, dtype: object
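The difference between the two only shows up once the index is no longer 0, 1, 2, and so on. The sketch below rebuilds three rows from the dataset above by hand (the sample frame is an assumption made for illustration, not the full file) and uses set_index to make the timestamp strings the key.

```python
import pandas as pd

# Three rows copied from the printed dataset above.
sample = pd.DataFrame(
    {
        "timestamp": [
            "2011-10-03 08:22:00",
            "2011-10-04 11:48:00",
            "2011-10-05 07:02:00",
        ],
        "coffees": ["397.0", "410.0", "testing"],
        "contributor": ["Quentin", "Quentin", "Anthony"],
    }
)

# Default index: label 2 and position 2 refer to the same row.
by_label = sample.loc[2]
by_position = sample.iloc[2]

# With the timestamp strings as the index, loc needs a timestamp label,
# while iloc still takes a position.
indexed = sample.set_index("timestamp")
same_row_by_label = indexed.loc["2011-10-05 07:02:00"]
same_row_by_position = indexed.iloc[2]

print(same_row_by_label["contributor"])  # Anthony
```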

Print the total number of rows in the dataset –

print("Dataset length: " + str(len(data)))

Printing just the total number of rows might not always be useful. We may sometimes need to have more detailed information about the columns.

print(data.describe())
timestamp coffees contributor
count 671 658 671
unique 671 654 9
top 2012-09-26 16:02:00 12358.0 Quentin
freq 1 2 367

There is a problem in the dataset – there is a row with the value testing and several rows with NaN in the coffees column. Let us view all the rows with NaN in the coffees column.

print(data[data.coffees.isnull()])
timestamp coffees contributor
3 2011-10-05 08:25:00 NaN Quentin
15 2011-10-07 14:10:00 NaN Ben
72 2011-10-28 10:53:00 NaN Mike M
95 2011-11-11 11:13:00 NaN Quentin
323 2012-06-10 16:10:00 NaN Sergio
370 2012-07-13 13:59:00 NaN Mike
394 2012-08-03 14:35:00 NaN Sergio
479 2012-09-21 10:15:00 NaN Sergio
562 2012-11-01 09:45:00 NaN Quentin
606 2012-11-30 13:11:00 NaN Quentin
641 2013-01-28 10:43:00 NaN Sergio
642 2013-01-28 13:13:00 NaN Quentin
653 2013-02-18 12:04:00 NaN Quentin
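To get a count instead of the rows themselves, and to catch entries such as testing that are present but not numeric, something along these lines may help. It is a sketch on a small hand-made frame (sample is an assumption, not the real file), and it borrows pd.to_numeric, which this post uses further down, purely as a test of whether a value parses.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the first few rows of the coffees column.
sample = pd.DataFrame(
    {
        "coffees": ["397.0", "410.0", "testing", np.nan, "464.0"],
        "contributor": ["Quentin", "Quentin", "Anthony", "Quentin", "Quentin"],
    }
)

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# Entries that are present but cannot be parsed as numbers.
parsed = pd.to_numeric(sample["coffees"], errors="coerce")
not_numeric = sample[parsed.isnull() & sample["coffees"].notnull()]

print(missing_per_column["coffees"])    # 1
print(not_numeric["coffees"].tolist())  # ['testing']
```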

Let us now look at the datatype of each column –

print(data.dtypes)
timestamp      object
coffees        object
contributor    object
dtype: object

That didn’t give us much information – let us dig a bit deeper. Pandas reports columns of strings as object, so maybe the timestamp column actually holds strings. Let us look at the first element of the timestamp column and its datatype.

print(data.timestamp[0])
print(type(data.timestamp[0]))
2011-10-03 08:22:00
<class 'str'>

Changing strings to datetime values –

We see that the values of timestamp are strings, but they should be of datetime datatype. Let us convert them to datetime –

data.timestamp = pd.to_datetime(data.timestamp)
print(data.dtypes)
timestamp      datetime64[ns]
coffees                object
contributor            object
dtype: object
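As an alternative sketch, the conversion can also happen while the file is being read, using the parse_dates parameter of read_csv. The two-row CSV text below is a stand-in built from rows shown earlier, and io.StringIO is used only so that the example does not need a file on disk.

```python
import io

import pandas as pd

# Stand-in for the file contents, copied from the first two rows above.
csv_text = (
    "timestamp,coffees,contributor\n"
    "2011-10-03 08:22:00,397.0,Quentin\n"
    "2011-10-04 11:48:00,410.0,Quentin\n"
)

# parse_dates converts the listed columns to datetime while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

# Datetime columns expose their parts through the .dt accessor.
years = sample["timestamp"].dt.year.tolist()
weekdays = sample["timestamp"].dt.day_name().tolist()

print(years)     # [2011, 2011]
print(weekdays)  # ['Monday', 'Tuesday']
```

Once a column holds datetime values, operations such as extracting the weekday or sorting by time work directly, which is the main reason to convert it.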

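Similarly, we find that the coffees column holds strings, although its values are clearly meant to be numbers. Let us convert them to numeric values. Passing errors="coerce" replaces anything that cannot be parsed, such as the testing entry in row 2, with NaN. Because the column still contains NaN, the result is stored as floats for now –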

data.coffees = pd.to_numeric(data.coffees, errors="coerce")
print(data.head())
timestamp coffees contributor
0 2011-10-03 08:22:00 397.0 Quentin
1 2011-10-04 11:48:00 410.0 Quentin
2 2011-10-05 07:02:00 NaN Anthony
3 2011-10-05 08:25:00 NaN Quentin
4 2011-10-05 10:47:00 464.0 Quentin
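To see what errors="coerce" changes, the sketch below runs to_numeric on a small hand-made Series (raw is an assumption for illustration) both with and without it.

```python
import pandas as pd

# Hand-made values: two numbers as strings, one bad entry, one missing entry.
raw = pd.Series(["397.0", "410.0", "testing", None])

# errors="coerce" turns entries that cannot be parsed into NaN.
coerced = pd.to_numeric(raw, errors="coerce")

# The default, errors="raise", stops at the first bad value instead.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

print(coerced.dtype)  # float64
print(raised)         # True
```

Coercing is convenient for cleaning, but it hides bad values silently, so it is worth checking how many NaN entries appear after the conversion.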

Removing rows with NaN values –

Let us also remove all the rows that have NaN in the coffees column –

data.dropna(subset=["coffees"], inplace=True)
print(data.head())
timestamp coffees contributor
0 2011-10-03 08:22:00 397.0 Quentin
1 2011-10-04 11:48:00 410.0 Quentin
4 2011-10-05 10:47:00 464.0 Quentin
5 2011-10-05 13:15:00 481.0 Quentin
6 2011-10-06 07:21:00 503.0 Anthony
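Notice that the index in the output jumps from 1 to 4, because the dropped rows take their labels with them. A minimal sketch of the non-inplace form of dropna, followed by reset_index to renumber the rows, is shown below on a hand-made frame (sample is an assumption, not the real file).

```python
import numpy as np
import pandas as pd

# Hand-made stand-in with two missing values in the coffees column.
sample = pd.DataFrame(
    {
        "coffees": [397.0, 410.0, np.nan, np.nan, 464.0],
        "contributor": ["Quentin", "Quentin", "Anthony", "Quentin", "Quentin"],
    }
)

# Without inplace=True, dropna returns a new DataFrame and leaves sample unchanged.
cleaned = sample.dropna(subset=["coffees"])

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = cleaned.reset_index(drop=True)

print(cleaned.index.tolist())     # [0, 1, 4]
print(renumbered.index.tolist())  # [0, 1, 2]
```

Whether to renumber is a judgment call: keeping the original labels makes it easy to trace a row back to the input file.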

Changing floats to integers –

All the values in the coffees column are now floats. Let us change them to integers –

data.coffees = data.coffees.astype(int)
print(data.dtypes)
timestamp      datetime64[ns]
coffees                 int64
contributor            object
dtype: object
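The order of the steps matters here: astype(int) only works because the NaN rows were removed first. The sketch below shows the failure on a small hand-made Series (with_gaps is an assumption for illustration) and one possible alternative, the nullable "Int64" dtype, which keeps missing values instead of requiring them to be dropped.

```python
import numpy as np
import pandas as pd

# Hand-made values with one missing entry.
with_gaps = pd.Series([397.0, np.nan, 464.0])

# A plain int column cannot represent NaN, so this conversion fails.
try:
    with_gaps.astype(int)
    failed = False
except (ValueError, TypeError):
    failed = True

# The nullable "Int64" dtype (capital I) stores the gap as <NA>.
nullable = with_gaps.astype("Int64")

print(failed)          # True
print(nullable.dtype)  # Int64
```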

The dataset now has all the columns in their appropriate datatypes. Let us look at the first few rows of the dataset.

print(data.head())
timestamp coffees contributor
0 2011-10-03 08:22:00 397 Quentin
1 2011-10-04 11:48:00 410 Quentin
4 2011-10-05 10:47:00 464 Quentin
5 2011-10-05 13:15:00 481 Quentin
6 2011-10-06 07:21:00 503 Anthony