Getting Started with Python¶

Data Analysis With Python Pandas¶

Author : Waweru Kennedy¶

Date: 14/3/2022¶

Outline¶

  • Definition

  • Types of Statistical Data

  • Indexing

  • Sorting

  • Filtering

  • Cleaning Data

  • Add or Remove rows and columns

  • Grouping and Aggregating

  • Dates and Time Series Data

  • Joining Data

  • Write to File

  • Summary

  • Challenges

Course Material¶

To follow along this course please download the necessary material

- Student Workbook¶

Vehicle Dataset

What is Pandas?¶

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.

Pandas is most useful when working with tabular data such as data from relational databases or spreadsheets. Pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

Pandas DataFrame Object¶

A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.

A DataFrame can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.

Installation¶

To install pandas simply use pip

pip install pandas

Once you have successfully installed pandas, you need to import it.

import pandas as pd

Pandas Data Structures¶

Let's discuss some fundamental data structures.

The Pandas DataFrame Object

Creating a DataFrame¶

We can create a DataFrame in two ways,

  1. we can create a dataframe from a variety of structures, such as from lists or a list of lists or from a dictionary.

  2. by reading in an external file.

When converting a Python object into a dataframe we use the DataFrame keyword.

Syntax¶

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Create a DataFrame From Lists¶

In [7]:
import pandas as pd

# create lists

countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]

# create a list of column names
columns = ['countries','capitals','density']

# create a DataFrame of lists
# remember to use the zip function
pd.DataFrame(zip(countries,capitals,density), columns=columns)
Out[7]:
countries capitals density
0 China Beijing 153
1 India New Delhi 464
2 Brazil Brasil 25
3 USA Washington DC 36
4 Ethiopia Addis Ababa 115
5 Egypt Cairo 103

Create a DataFrame from a Dictionary¶

When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame.

In [8]:
import pandas as pd
# create lists

countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]

# create a dictionary of lists
my_dict = {'countries':countries, 'capitals':capitals, 'density':density}

# convert the dictionary of lists into a DataFrame
df = pd.DataFrame(my_dict)

# display the DataFrame
df
Out[8]:
countries capitals density
0 China Beijing 153
1 India New Delhi 464
2 Brazil Brasil 25
3 USA Washington DC 36
4 Ethiopia Addis Ababa 115
5 Egypt Cairo 103

Create a DataFrame From an External File¶

We can extract the data from an existing datasource, such as a csv (comma seperated variables) file using pd.read_csv. Inside the file each column is seperated by a comma as the name suggests. The file is read and the data converted to a data frame which is then displayed.

I want to analyse the motor vehicle dataset that is available as a CSV file.

The Motor Vehicle Dataset is available for download. Follow this link.

Vehicle Dataset

After Downloading make sure you move the file to the folder where your Python Notebook is located. (ie) move it to your current working directory

pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json,…), each of them with the prefix read_*.

Make sure to always have a check on the data after reading in the data.

In [1]:
import pandas as pd

# import data saved in file
df = pd.read_csv('vehicle_data.csv')

# display the first 5 rows
df.head()
Out[1]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 NaN NaN NaN Automatic NaN Yes 970000
2 Clean NV300 Caravan 2014 Model Dielsel 16 Seater Buses & Microbuses Kilimani Nairobi Foreign Used Nissan Nissan NaN Caravan (Urvan) 2014.0 180000.0 NaN NaN NaN NaN NaN NaN 2550000
3 Toyota Crown 2014 Pearl Cars Kilimani Nairobi Foreign Used No faults Toyota Pearl Crown 2014.0 75000.0 NaN NaN NaN Automatic NaN No 2100000
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 NaN NaN NaN Automatic NaN Yes 880000

Pandas Series Object¶

Each column in a DataFrame is a Series

It is a one-dimensional array holding data of any type.

The Pandas Series Object

Create a Pandas Series from Scratch¶

You can create a Series from scratch

In [10]:
import pandas as pd
# create lists

countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
density = [153,464,25,36,115,103]

# create pandas series of each of the lists
countries = pd.Series(countries)
density = pd.Series(density)
In [11]:
countries
Out[11]:
0       China
1       India
2      Brazil
3         USA
4    Ethiopia
5       Egypt
dtype: object
In [12]:
density
Out[12]:
0    153
1    464
2     25
3     36
4    115
5    103
dtype: int64
In [13]:
# create Series with index
import pandas as pd
# create lists

countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
density = [153,464,25,36,115,103]

# create pandas series of each of the lists
density = pd.Series(density,index=countries)
density
Out[13]:
China       153
India       464
Brazil       25
USA          36
Ethiopia    115
Egypt       103
dtype: int64
In [14]:
# create Series with index and name
import pandas as pd
# create lists

countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
density = [153,464,25,36,115,103]

# create pandas series of each of the lists
density = pd.Series(density,index=countries,name="density")
density
Out[14]:
China       153
India       464
Brazil       25
USA          36
Ethiopia    115
Egypt       103
Name: density, dtype: int64
In [15]:
# indexing
# access specific country value

density['Ethiopia']
Out[15]:
115
In [16]:
# slicing
density['India':'Egypt']
Out[16]:
India       464
Brazil       25
USA          36
Ethiopia    115
Egypt       103
Name: density, dtype: int64
In [17]:
# accessing multiple rows
density[['Ethiopia','Egypt']]
Out[17]:
Ethiopia    115
Egypt       103
Name: density, dtype: int64

When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label in between square brackets [].

In [18]:
import pandas as pd
# create lists

countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]

# create a dictionary of lists
my_dict = {'countries':countries, 'capitals':capitals, 'density':density}

# convert the dictionary of lists into a DataFrame
df = pd.DataFrame(my_dict)

# display the DataFrame
df['density']
Out[18]:
0    153
1    464
2     25
3     36
4    115
5    103
Name: density, dtype: int64
In [19]:
type(df['density'])
Out[19]:
pandas.core.series.Series

Using the Vehicle Dataset¶

We'll be using the real dataset with more values to learn about DataFrames. Load in the vehicle dataset.

In [2]:
import pandas as pd

# import data saved in file
df = pd.read_csv('vehicle_data.csv')

Exploring a DataFrame¶

Display the first 10 rows of a DataFrame¶

To see the first N rows of a DataFrame, use the head() method with the required number of rows (in this case 10) as argument.

In [22]:
df.head(10)
Out[22]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 NaN NaN NaN Automatic NaN Yes 970000
2 Clean NV300 Caravan 2014 Model Dielsel 16 Seater Buses & Microbuses Kilimani Nairobi Foreign Used Nissan Nissan NaN Caravan (Urvan) 2014.0 180000.0 NaN NaN NaN NaN NaN NaN 2550000
3 Toyota Crown 2014 Pearl Cars Kilimani Nairobi Foreign Used No faults Toyota Pearl Crown 2014.0 75000.0 NaN NaN NaN Automatic NaN No 2100000
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 NaN NaN NaN Automatic NaN Yes 880000
5 Mitsubishi Delica 2013 White Cars Mvita Mombasa Foreign Used First registration, No faults, Unpainted Mitsubishi White Delica 2013.0 88000.0 NaN NaN NaN Automatic NaN Yes 630000
6 New Toyota Premio 2013 Red Cars Mvita Mombasa Brand New No faults, First registration Toyota Red Premio 2013.0 45000.0 NaN NaN NaN Automatic NaN No 1500000
7 Toyota Sienta 2014 1.5 AWD Gray Cars Ganjoni Mombasa Foreign Used No faults Toyota Gray Sienta 2014.0 51000.0 Minivan Petrol All Wheel Automatic 7.0 Yes 1200000
8 BMW X4 2015 xDrive35i Black Cars Mombasa CBD Mombasa Foreign Used No faults BMW Black X4 2015.0 63128.0 NaN NaN NaN Automatic NaN No 5800000
9 Mitsubishi Outlander 2015 White Cars Lavington Nairobi Foreign Used Unpainted, Original parts, No faults Mitsubishi White Outlander 2015.0 40382.0 SUV Petrol Front Wheel Automatic NaN Yes 2900000

Interested in the last N rows instead? pandas also provides a tail() method. For example, df.tail(10) will return the last 10 rows of the DataFrame.

In [23]:
df.tail(10)
Out[23]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
290 Lexus LS 2011 460 AWD Black Cars Lavington Nairobi Kenyan Used Lexus Lexus Black LS 2011.0 88201.0 Sedan Petrol All Wheel Automatic 5.0 NaN 3100000
291 Toyota Ractis 2015 Black Cars Mombasa CBD Mombasa Foreign Used Original parts Toyota Black Ractis 2015.0 58964.0 NaN NaN NaN Automatic NaN Yes 1049000
292 Toyota Wish 2011 Silver Cars Ridgeways Nairobi Kenyan Used First owner, Unpainted, Original parts Toyota Silver Wish 2011.0 107809.0 NaN NaN NaN Automatic NaN Yes 1030000
293 Mitsubishi L200 2014 Gold Cars Roysambu Nairobi Kenyan Used First owner, Original parts, First registration Mitsubishi Gold L200 2014.0 NaN NaN NaN NaN Manual NaN Yes 1200000
294 Subaru Outback 2014 White Cars Mvita Mombasa Foreign Used No faults Subaru White Outback 2014.0 63142.0 NaN NaN NaN Automatic NaN NaN 2500000
295 Toyota Allion 2008 Silver Cars Ganjoni Mombasa Kenyan Used No faults Toyota Silver Allion 2008.0 176549.0 NaN NaN NaN Automatic NaN Yes 860000
296 Mitsubishi Hd Trucks & Trailers Kisauni Mombasa Used Used Mitsubishi NaN NaN 2006.0 NaN NaN NaN NaN NaN NaN NaN 1300000
297 Mitsubishi Fuso Refrigerated Trucks & Trailers Kisauni Mombasa Used Used Mitsubishi NaN Canter 2014.0 NaN NaN Diesel NaN NaN NaN NaN 2200000
298 Toyota Ractis 2009 Black Cars Ridgeways Nairobi Kenyan Used First owner, No faults, Original parts Toyota Black Ractis 2009.0 120120.0 NaN NaN NaN Automatic NaN Yes 550000
299 Subaru Forester 2015 Matt Black Cars Mombasa CBD Mombasa Foreign Used Unpainted, Original parts, First registration Subaru Matt Black Forester 2015.0 86910.0 NaN NaN NaN Automatic NaN Yes 2450000
In [7]:
df.shape
Out[7]:
(300, 18)

Pandas - Analyzing DataFrames¶

info()¶

I’m interested in a technical summary of a DataFrame

The DataFrames object has a method called info(), that gives you more information about the data set.

In [24]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          300 non-null    object 
 1   category       300 non-null    object 
 2   region         300 non-null    object 
 3   parent_region  294 non-null    object 
 4   condition      295 non-null    object 
 5   attrs          299 non-null    object 
 6   brand          299 non-null    object 
 7   color          244 non-null    object 
 8   model          283 non-null    object 
 9   yom            297 non-null    float64
 10  mileage        209 non-null    float64
 11  body_type      42 non-null     object 
 12  fuel           82 non-null     object 
 13  drive_train    41 non-null     object 
 14  trans          249 non-null    object 
 15  seat           31 non-null     float64
 16  registered     175 non-null    object 
 17  price          300 non-null    int64  
dtypes: float64(3), int64(1), object(14)
memory usage: 42.3+ KB
In [2]:
df.dtypes
Out[2]:
title             object
category          object
region            object
parent_region     object
condition         object
attrs             object
brand             object
color             object
model             object
yom              float64
mileage          float64
body_type         object
fuel              object
drive_train       object
trans             object
seat             float64
registered        object
price              int64
dtype: object
In [4]:
df['yom'].isnull().sum()
Out[4]:
3
In [5]:
df[df['yom'].isnull()]
Out[5]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
109 Isuzu Truck . Trucks & Trailers Donholm Nairobi NaN Isuzu Isuzu NaN N Series NaN NaN NaN NaN NaN NaN NaN NaN 810000
201 Subaru Outback 2014 black Cars Nairobi NaN NaN Outback Subaru Black Outback NaN 180097.0 NaN NaN NaN Automatic NaN NaN 1649999
261 Single Diff Volve FL10 Trucks & Trailers Embakasi Nairobi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 400000

Results Explained¶

The result tells us there are 300 rows and 18 columns

RangeIndex: 300 entries, 0 to 299

Data columns (total 18 columns)

And the name of each column, with the data type:

Types of Statistical Data¶

When working with statistics, it’s important to recognize the different types of data: numerical (discrete and continuous), categorical, and ordinal.

Data are the actual pieces of information that you collect through your study. Most data fall into one of two groups: numerical or categorical.

  1. Numeric : Continous, Discrete

  2. Categorical : Nominal, Ordinal

Discrete data represent items that can be counted; they take on possible values that can be listed out. For example, the number of customer complaints or the number of flaws or defects.

Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line. Continuous variable can be numeric or date/time. For example, the length of a part or the date and time a payment is received.

Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. Categorical variables are qualitative data in which the values are assigned to a set of distinct groups or categories. These groups may consist of alphabetic (e.g., male, female) or numeric labels (e.g., male = 0, female = 1) that do not contain mathematical information beyond the frequency counts related to group membership.

Ordinal data mixes numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical, where the groups are ordered when graphs and charts are made. However, unlike categorical data, the numbers do have mathematical meaning.

Let's breakdown the columns in our dataframe into their statistical type:

  • title : Each vehicle has a unique name.
  • category : Categorical Variable
  • region : Categorical Variable
  • parent_region : Categorical Variable
  • condition : Categorical Variable
  • attrs : Categorical Variable
  • brand : Categorical Variable
  • color : Categorical Variable
  • model : Categorical Variable
  • yom : Discrete
  • mileage : Continous
  • body_type : Categorical Variable
  • fuel : Categorical Variable
  • drive_train : Categorical Variable
  • trans : Categorical Variable
  • seat : Discrete
  • registered : Categorical Variable
  • price : Continous

Null Values

The info() method also tells us how many Non-Null values there are present in each column, and in our data set it seems like there are 283 of 300 Non-Null values in the "Calories" column.

Which means that there are 5 rows with no value at all, in the "model" column, for whatever reason.

Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data, and you will learn more about that in the next chapters.

A handy way of checking for null values in your DataFrame is using

DataFrame.isnull().sum()

In [9]:
df['parent_region'].value_counts(normalize=True)
Out[9]:
Mombasa        0.500000
Nairobi        0.431973
Kiambu         0.023810
Nakuru         0.013605
Machakos       0.006803
Kisumu         0.003401
Uasin Gishu    0.003401
Meru           0.003401
Kirinyaga      0.003401
Kajiado        0.003401
Mvita          0.003401
Nyali          0.003401
Name: parent_region, dtype: float64
In [25]:
df.isnull().sum()
Out[25]:
title              0
category           0
region             0
parent_region      6
condition          5
attrs              1
brand              1
color             56
model             17
yom                3
mileage           91
body_type        258
fuel             218
drive_train      259
trans             51
seat             269
registered       125
price              0
dtype: int64

Pandas Indexing¶

There are primarily two ways in which we can access a row or column in a data frame:

  1. By labels
  2. By position

The Pandas Index Object¶

Both the Series and DataFrame objects contain an explicit index that lets you reference and modify data.

In [4]:
df.index.values
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
       208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
       221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,
       234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,
       247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,
       260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,
       273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,
       286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,
       299], dtype=int64)
In [5]:
df.columns
Out[5]:
Index(['title', 'category', 'region', 'parent_region', 'condition', 'attrs',
       'brand', 'color', 'model', 'yom', 'mileage', 'body_type', 'fuel',
       'drive_train', 'trans', 'seat', 'registered', 'price'],
      dtype='object')

Indexers: loc, iloc, at and iat¶

  • .loc selects data only by labels

  • .iloc selects data only by integer location

  • .at: get scalar values. It's a very fast loc

  • .iat Get scalar values. It's a very fast iloc

at and iat are meant to access a scalar, that is, a single element in the dataframe, while loc and iloc are meant to access several elements at the same time, potentially to perform vectorized operations.

.loc¶

Access a group of rows and columns by label(s) or a boolean array.

  1. A single label (returns a Series)
In [13]:
# lets create a toy dataset

import pandas as pd
# create lists

countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]

# create a dictionary of lists
my_dict = {'countries':countries, 'capitals':capitals, 'density':density}

# convert the dictionary of lists into a DataFrame
toy = pd.DataFrame(my_dict,index=countries)
In [14]:
toy
Out[14]:
countries capitals density
China China Beijing 153
India India New Delhi 464
Brazil Brazil Brasil 25
USA USA Washington DC 36
Ethiopia Ethiopia Addis Ababa 115
Egypt Egypt Cairo 103
In [16]:
toy.index.values
Out[16]:
array(['China', 'India', 'Brazil', 'USA', 'Ethiopia', 'Egypt'],
      dtype=object)
In [18]:
toy.loc['Brazil']
Out[18]:
countries    Brazil
capitals     Brasil
density          25
Name: Brazil, dtype: object
In [19]:
toy.loc[:,'capitals']
Out[19]:
China             Beijing
India           New Delhi
Brazil             Brasil
USA         Washington DC
Ethiopia      Addis Ababa
Egypt               Cairo
Name: capitals, dtype: object

2. A list or array of labels¶

While accessing multiple rows and columns using .loc, represent the row and column labels in separate square brackets, preferably.

In [20]:
toy.loc[['India','Ethiopia','Egypt']]
Out[20]:
countries capitals density
India India New Delhi 464
Ethiopia Ethiopia Addis Ababa 115
Egypt Egypt Cairo 103
In [21]:
toy.loc[:,['countries','density']]
Out[21]:
countries density
China China 153
India India 464
Brazil Brazil 25
USA USA 36
Ethiopia Ethiopia 115
Egypt Egypt 103
In [22]:
toy.loc[
    ['Brazil','USA'],
    ['countries','density']
]
Out[22]:
countries density
Brazil Brazil 25
USA USA 36

3. Slicing¶

Remember in slicing operation on data frames, the end is inclusive, Also, use only single square bracket.

In [23]:
toy.loc['Brazil':'Ethiopia']
Out[23]:
countries capitals density
Brazil Brazil Brasil 25
USA USA Washington DC 36
Ethiopia Ethiopia Addis Ababa 115
In [24]:
toy.loc[:,'capitals':'density']
Out[24]:
capitals density
China Beijing 153
India New Delhi 464
Brazil Brasil 25
USA Washington DC 36
Ethiopia Addis Ababa 115
Egypt Cairo 103
In [25]:
toy.loc[
    'Brazil':'Ethiopia',
    'capitals':'density'
]
Out[25]:
capitals density
Brazil Brasil 25
USA Washington DC 36
Ethiopia Addis Ababa 115

.iloc (integer location)¶

The operation of .iloc is the same as .loc except for the fact that we use integer positions of the rows and columns instead of labels. The letter i stands for integer.

1. Indexing¶

Similar to what we have seen before, we can mention single index positions or multiple index positions of the row/column that we want to select.

In [26]:
toy.iloc[1] # the entry at row 1
Out[26]:
countries        India
capitals     New Delhi
density            464
Name: India, dtype: object
In [27]:
toy.iloc[-1] # negative indexing
Out[27]:
countries    Egypt
capitals     Cairo
density        103
Name: Egypt, dtype: object
In [28]:
toy.iloc[[1,3,4]] # index multiple rows
Out[28]:
countries capitals density
India India New Delhi 464
USA USA Washington DC 36
Ethiopia Ethiopia Addis Ababa 115
In [31]:
# index all rows and first and third columns
toy.iloc[:, [0,2]]
Out[31]:
countries density
China China 153
India India 464
Brazil Brazil 25
USA USA 36
Ethiopia Ethiopia 115
Egypt Egypt 103

1. Slicing¶

Note that the slicing operation in .iloc shifts back to the traditional way of python where the start is inclusive and the end is exclusive.

In [32]:
# slice from first row to the second row
toy.iloc[0:2]
Out[32]:
countries capitals density
China China Beijing 153
India India New Delhi 464
In [33]:
toy.iloc[0:4,0:2]
Out[33]:
countries capitals
China China Beijing
India India New Delhi
Brazil Brazil Brasil
USA USA Washington DC

.at¶

The operation of .at is similar to .loc but it is capable of selecting a single cell or value.

Consider we want to know the density of Egypt

In [36]:
toy.at['Egypt','density']
Out[36]:
103

.iat¶

The operation of .iat is similar to .iloc selecting using integer location but it is capable of selecting a single cell or value just like .at.

Suppose we want to know the density of Egypt. We know Egypt is the last row, and density if the last column.

In [37]:
toy.iat[-1,-1]
Out[37]:
103

Indexing Operator []¶

The indexing operator [ ]can select rows and columns but not simultaneously. This is the major difference between indexing operators and other methods.

Selecting Columns¶

To Select one column

In [38]:
toy['countries']
Out[38]:
China          China
India          India
Brazil        Brazil
USA              USA
Ethiopia    Ethiopia
Egypt          Egypt
Name: countries, dtype: object

To select multiple columns use double square brackets [[]]

Mention the column labels that you want to select in the brackets.

Also, slicing of columns is not possible using index operators. You have to explicitly mention the column labels to be selected.

In [39]:
toy[['countries','density']]
Out[39]:
countries density
China China 153
India India 464
Brazil Brazil 25
USA USA 36
Ethiopia Ethiopia 115
Egypt Egypt 103

Selecting Rows¶

The only way to select rows is to use the slicing operation in single square brackets [ ] . You can either mention row labels or index positions.

In [42]:
toy[0:2]
Out[42]:
countries capitals density
China China Beijing 153
India India New Delhi 464
In [43]:
toy["China":'Brazil']
Out[43]:
countries capitals density
China China Beijing 153
India India New Delhi 464
Brazil Brazil Brasil 25
In [ ]:
 

Sorting¶

sort values¶

Sort by the values along either axis.

The columns to sort by is a required argument.

Default axis is the rows (0). You can specify index=1 or index='columns'

pd.DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)

sort index¶

Sort object by labels (along an axis).

Returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)

sort multiple columns¶

Specify the columns to sort by in order by which sorting should follow

pd.DataFrame.sort_values(by=[], axis=0, ascending=True, inplace=False)

In [45]:
# sort the toy dataset in alphabetical order by name of country
toy.sort_values(by="countries")
Out[45]:
countries capitals density
Brazil Brazil Brasil 25
China China Beijing 153
Egypt Egypt Cairo 103
Ethiopia Ethiopia Addis Ababa 115
India India New Delhi 464
USA USA Washington DC 36
In [46]:
# sort the toy dataset such that the country with highest density is on top
toy.sort_values(by='density',ascending=False)
Out[46]:
countries capitals density
India India New Delhi 464
China China Beijing 153
Ethiopia Ethiopia Addis Ababa 115
Egypt Egypt Cairo 103
USA USA Washington DC 36
Brazil Brazil Brasil 25
In [47]:
toy
Out[47]:
countries capitals density
China China Beijing 153
India India New Delhi 464
Brazil Brazil Brasil 25
USA USA Washington DC 36
Ethiopia Ethiopia Addis Ababa 115
Egypt Egypt Cairo 103

Update the original DataFrame¶

as can be seen from above results, the sorting is temporary and does not affect the original DataFrame.

To update the DataFrame and keep the sorted order, specify inplace=True in the function call.

In [48]:
toy.sort_values(by='density',ascending=False,inplace=True)
In [49]:
toy
Out[49]:
countries capitals density
India India New Delhi 464
China China Beijing 153
Ethiopia Ethiopia Addis Ababa 115
Egypt Egypt Cairo 103
USA USA Washington DC 36
Brazil Brazil Brasil 25

Sort by index¶

In [50]:
toy.sort_index()
Out[50]:
countries capitals density
Brazil Brazil Brasil 25
China China Beijing 153
Egypt Egypt Cairo 103
Ethiopia Ethiopia Addis Ababa 115
India India New Delhi 464
USA USA Washington DC 36
In [52]:
toy.sort_index(ascending=False)
Out[52]:
countries capitals density
USA USA Washington DC 36
India India New Delhi 464
Ethiopia Ethiopia Addis Ababa 115
Egypt Egypt Cairo 103
China China Beijing 153
Brazil Brazil Brasil 25
In [53]:
# keep the changes
toy.sort_index(inplace=True)
In [54]:
toy
Out[54]:
countries capitals density
Brazil Brazil Brasil 25
China China Beijing 153
Egypt Egypt Cairo 103
Ethiopia Ethiopia Addis Ababa 115
India India New Delhi 464
USA USA Washington DC 36
In [ ]:
 

Filtering¶

A common operation in data analysis is to filter values based on a condition or multiple conditions. Pandas provides a variety of ways to filter data points (i.e. rows).

1. Logical operators¶

We can use the logical operators on column values to filter rows.

In [56]:
# create a mask
mask = toy['density'] > 100
mask
Out[56]:
Brazil      False
China        True
Egypt        True
Ethiopia     True
India        True
USA         False
Name: density, dtype: bool
In [57]:
toy[mask]
Out[57]:
countries capitals density
China China Beijing 153
Egypt Egypt Cairo 103
Ethiopia Ethiopia Addis Ababa 115
India India New Delhi 464

Only the countries with density more than 100 are selected

Using the Vehicle Dataset¶

In [67]:
df.head()
Out[67]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 NaN NaN NaN Automatic NaN Yes 970000
2 Clean NV300 Caravan 2014 Model Dielsel 16 Seater Buses & Microbuses Kilimani Nairobi Foreign Used Nissan Nissan NaN Caravan (Urvan) 2014.0 180000.0 NaN NaN NaN NaN NaN NaN 2550000
3 Toyota Crown 2014 Pearl Cars Kilimani Nairobi Foreign Used No faults Toyota Pearl Crown 2014.0 75000.0 NaN NaN NaN Automatic NaN No 2100000
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 NaN NaN NaN Automatic NaN Yes 880000

Select only the vehicles that year of make is after 2015¶

In [70]:
df[df['yom'] > 2015]
Out[70]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
11 Lexus NX 2016 200t FWD Beige Cars Kilimani Nairobi Foreign Used No faults, Unpainted, Original parts Lexus Beige NX 2016.0 35000.0 SUV Petrol Front Wheel Automatic 5.0 No 4750000
22 Lexus RX 2016 Black Cars Mombasa CBD Mombasa Foreign Used No faults Lexus Black RX 2016.0 NaN NaN NaN NaN Automatic NaN Yes 14500000
41 Isuzu Nkr. Yr 2019 Trucks & Trailers Ridgeways Nairobi Used Used Isuzu White N Series 2019.0 NaN NaN Diesel NaN Manual NaN NaN 3050000
52 Mazda CX-5 2016 Red Cars Mombasa CBD Mombasa Foreign Used First registration Mazda Red CX-5 2016.0 51238.0 NaN NaN NaN Automatic NaN No 2900000
94 Mazda CX-3 2016 Red Cars Mombasa CBD Mombasa Foreign Used First registration Mazda Red CX-3 2016.0 76000.0 NaN NaN NaN Automatic NaN No 1800000
101 Mitsubishi Outlander 2016 ES FWD White Cars Mombasa CBD Mombasa Foreign Used First owner, First registration, Original parts Mitsubishi White Outlander 2016.0 56390.0 SUV Petrol Front Wheel Automatic 7.0 NaN 3100000
103 Shacman F2000 Tipper Trucks & Trailers Municipality Meru Used Used Shacman NaN NaN 2018.0 NaN NaN NaN NaN NaN NaN NaN 5100000
107 Toyota Wish 2016 Black Cars Mombasa CBD Mombasa Foreign Used No faults Toyota Black Wish 2016.0 NaN NaN NaN NaN Automatic NaN Yes 1500000
222 Tata Signa LPK-1618 Tipper 10 Ton Trucks & Trailers Nairobi Central Nairobi Brand New Brand New Tata NaN M&HCV Rigid Truck 2021.0 NaN NaN Diesel NaN NaN NaN NaN 6000000
224 Toyota Hilux 2016 Black Cars Mombasa CBD Mombasa Foreign Used First registration Toyota Black Hilux 2016.0 36000.0 NaN NaN NaN Automatic NaN No 9000000
236 Isuzu NPR Clean Unit Trucks & Trailers Nairobi Central Nairobi Used Used Isuzu Other N Series 2016.0 NaN NaN Diesel NaN Manual NaN NaN 2850000
265 New Hyundai Palisade 2021 White Cars Mombasa Road Nairobi Brand New No faults Hyundai White Palisade 2021.0 200.0 NaN NaN NaN Automatic NaN Yes 9500000

2. Multiple logical operators¶

Select only vehicles that were made after 2015 and are category car¶

In [78]:
df[(df['yom']>2015) & (df['category'] == 'Cars')]
Out[78]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
11 Lexus NX 2016 200t FWD Beige Cars Kilimani Nairobi Foreign Used No faults, Unpainted, Original parts Lexus Beige NX 2016.0 35000.0 SUV Petrol Front Wheel Automatic 5.0 No 4750000
22 Lexus RX 2016 Black Cars Mombasa CBD Mombasa Foreign Used No faults Lexus Black RX 2016.0 NaN NaN NaN NaN Automatic NaN Yes 14500000
52 Mazda CX-5 2016 Red Cars Mombasa CBD Mombasa Foreign Used First registration Mazda Red CX-5 2016.0 51238.0 NaN NaN NaN Automatic NaN No 2900000
94 Mazda CX-3 2016 Red Cars Mombasa CBD Mombasa Foreign Used First registration Mazda Red CX-3 2016.0 76000.0 NaN NaN NaN Automatic NaN No 1800000
101 Mitsubishi Outlander 2016 ES FWD White Cars Mombasa CBD Mombasa Foreign Used First owner, First registration, Original parts Mitsubishi White Outlander 2016.0 56390.0 SUV Petrol Front Wheel Automatic 7.0 NaN 3100000
107 Toyota Wish 2016 Black Cars Mombasa CBD Mombasa Foreign Used No faults Toyota Black Wish 2016.0 NaN NaN NaN NaN Automatic NaN Yes 1500000
224 Toyota Hilux 2016 Black Cars Mombasa CBD Mombasa Foreign Used First registration Toyota Black Hilux 2016.0 36000.0 NaN NaN NaN Automatic NaN No 9000000
265 New Hyundai Palisade 2021 White Cars Mombasa Road Nairobi Brand New No faults Hyundai White Palisade 2021.0 200.0 NaN NaN NaN Automatic NaN Yes 9500000

3. isin¶

The isin method is another way of applying multiple condition for filtering.

For instance, we can filter the vehicles whose color is in a given list.

Suppose we only want to select vehicles that are either Black, White or Red

In [79]:
colors = ['Black','White','Red']

df[df['color'].isin(colors)]
Out[79]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 NaN NaN NaN Automatic NaN Yes 880000
5 Mitsubishi Delica 2013 White Cars Mvita Mombasa Foreign Used First registration, No faults, Unpainted Mitsubishi White Delica 2013.0 88000.0 NaN NaN NaN Automatic NaN Yes 630000
6 New Toyota Premio 2013 Red Cars Mvita Mombasa Brand New No faults, First registration Toyota Red Premio 2013.0 45000.0 NaN NaN NaN Automatic NaN No 1500000
8 BMW X4 2015 xDrive35i Black Cars Mombasa CBD Mombasa Foreign Used No faults BMW Black X4 2015.0 63128.0 NaN NaN NaN Automatic NaN No 5800000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
289 Nissan Teana 2015 Black Cars Nairobi Central Nairobi Foreign Used First registration Nissan Black Teana 2015.0 NaN NaN NaN NaN Automatic NaN Yes 1420000
290 Lexus LS 2011 460 AWD Black Cars Lavington Nairobi Kenyan Used Lexus Lexus Black LS 2011.0 88201.0 Sedan Petrol All Wheel Automatic 5.0 NaN 3100000
291 Toyota Ractis 2015 Black Cars Mombasa CBD Mombasa Foreign Used Original parts Toyota Black Ractis 2015.0 58964.0 NaN NaN NaN Automatic NaN Yes 1049000
294 Subaru Outback 2014 White Cars Mvita Mombasa Foreign Used No faults Subaru White Outback 2014.0 63142.0 NaN NaN NaN Automatic NaN NaN 2500000
298 Toyota Ractis 2009 Black Cars Ridgeways Nairobi Kenyan Used First owner, No faults, Original parts Toyota Black Ractis 2009.0 120120.0 NaN NaN NaN Automatic NaN Yes 550000

136 rows × 18 columns

4. Str Accessor¶

Pandas is a highly efficient library on textual data as well. The functions and methods under the str accessor provide flexible ways to filter rows based on strings.

for example, we can select all the vehicles where the title begins with letter 'M'

In [83]:
df[df['title'].str.startswith('M')]
Out[83]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 NaN NaN NaN Automatic NaN Yes 970000
5 Mitsubishi Delica 2013 White Cars Mvita Mombasa Foreign Used First registration, No faults, Unpainted Mitsubishi White Delica 2013.0 88000.0 NaN NaN NaN Automatic NaN Yes 630000
9 Mitsubishi Outlander 2015 White Cars Lavington Nairobi Foreign Used Unpainted, Original parts, No faults Mitsubishi White Outlander 2015.0 40382.0 SUV Petrol Front Wheel Automatic NaN Yes 2900000
15 Mazda CX-5 2014 Sport FWD Blue Cars Ridgeways Nairobi Foreign Used No faults Mazda Blue CX-5 2014.0 NaN SUV Petrol Front Wheel Automatic 5.0 NaN 2060000
17 Mitsubishi RVR 2014 Blue Cars Mombasa CBD Mombasa Foreign Used Unpainted, Original parts, No faults Mitsubishi Blue RVR 2014.0 142000.0 NaN NaN NaN Automatic NaN No 1980000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
279 Mitsubishi Fuso Trucks & Trailers Runda Nairobi Brand New Brand New Mitsubishi NaN NaN 2014.0 NaN NaN NaN NaN NaN NaN NaN 2100000
288 Mitsubishi RVR 2015 Black Cars Mombasa CBD Mombasa Brand New No faults Mitsubishi Black RVR 2015.0 45555.0 NaN NaN NaN Automatic NaN NaN 2150000
293 Mitsubishi L200 2014 Gold Cars Roysambu Nairobi Kenyan Used First owner, Original parts, First registration Mitsubishi Gold L200 2014.0 NaN NaN NaN NaN Manual NaN Yes 1200000
296 Mitsubishi Hd Trucks & Trailers Kisauni Mombasa Used Used Mitsubishi NaN NaN 2006.0 NaN NaN NaN NaN NaN NaN NaN 1300000
297 Mitsubishi Fuso Refrigerated Trucks & Trailers Kisauni Mombasa Used Used Mitsubishi NaN Canter 2014.0 NaN NaN Diesel NaN NaN NaN NaN 2200000

63 rows × 18 columns

The contains function under the str accessor returns the values that contain a given set of characters.

For instance select all vehicles that contain the word Benz in their title:

In [85]:
df[df['title'].str.contains('Benz')]
Out[85]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
78 Mercedes-Benz E250 2014 SV Premium Black Cars Nairobi Central Nairobi Kenyan Used Unpainted, No faults, Original parts Mercedes-Benz Black E250 2014.0 100555.0 NaN NaN NaN Automatic NaN Yes 3200000
179 Mercedes-Benz B-Class 2009 White Cars Lavington Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White B-Class 2009.0 154155.0 NaN Petrol NaN Automatic NaN NaN 699999
195 Mercedes-Benz Actros Trucks & Trailers Thome Nairobi Used Used Mercedes-Benz NaN NaN 2013.0 NaN NaN NaN NaN NaN NaN NaN 7500000
207 Mercedes-Benz C180 2014 Black Cars Kilimani Nairobi Foreign Used No faults Mercedes-Benz Black C180 2014.0 82000.0 NaN NaN NaN Automatic NaN Yes 2500000
210 Mercedes-Benz C200 2008 White Cars Langata Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White C200 2008.0 132489.0 NaN Petrol NaN Automatic NaN NaN 1400000
219 Mercedes-Benz C-Class 2011 C 200 (C204) White Cars Lavington Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White C-Class 2011.0 76000.0 Coupe Petrol Rear Wheel Automatic 5.0 NaN 1650000
238 Mercedes-Benz B-Class 2014 Silver Cars Mvita Mombasa Foreign Used First owner, No faults Mercedes-Benz Silver B-Class 2014.0 90000.0 NaN NaN NaN Automatic NaN NaN 1650000
247 Mercedes-Benz E250 2014 Black Cars Nairobi Central Nairobi Foreign Used No faults Mercedes-Benz Black E250 2014.0 49000.0 NaN NaN NaN Automatic NaN No 3900000
274 Mercedes-Benz C200 2015 Black Cars Mombasa CBD Mombasa Foreign Used No faults Mercedes-Benz Black C200 2015.0 56800.0 NaN NaN NaN Automatic NaN Yes 3850000

5. Tilde ~¶

The tilde operator is used for not logic in filtering. If we add the tilde operator before the filter expression, the rows that do not fit the condition are returned.

for instance select all the vehicles that do not have the word Toyota in their title.

In [87]:
df[~df['title'].str.contains('Toyota')]
Out[87]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 NaN NaN NaN Automatic NaN Yes 970000
2 Clean NV300 Caravan 2014 Model Dielsel 16 Seater Buses & Microbuses Kilimani Nairobi Foreign Used Nissan Nissan NaN Caravan (Urvan) 2014.0 180000.0 NaN NaN NaN NaN NaN NaN 2550000
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 NaN NaN NaN Automatic NaN Yes 880000
5 Mitsubishi Delica 2013 White Cars Mvita Mombasa Foreign Used First registration, No faults, Unpainted Mitsubishi White Delica 2013.0 88000.0 NaN NaN NaN Automatic NaN Yes 630000
8 BMW X4 2015 xDrive35i Black Cars Mombasa CBD Mombasa Foreign Used No faults BMW Black X4 2015.0 63128.0 NaN NaN NaN Automatic NaN No 5800000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
293 Mitsubishi L200 2014 Gold Cars Roysambu Nairobi Kenyan Used First owner, Original parts, First registration Mitsubishi Gold L200 2014.0 NaN NaN NaN NaN Manual NaN Yes 1200000
294 Subaru Outback 2014 White Cars Mvita Mombasa Foreign Used No faults Subaru White Outback 2014.0 63142.0 NaN NaN NaN Automatic NaN NaN 2500000
296 Mitsubishi Hd Trucks & Trailers Kisauni Mombasa Used Used Mitsubishi NaN NaN 2006.0 NaN NaN NaN NaN NaN NaN NaN 1300000
297 Mitsubishi Fuso Refrigerated Trucks & Trailers Kisauni Mombasa Used Used Mitsubishi NaN Canter 2014.0 NaN NaN Diesel NaN NaN NaN NaN 2200000
299 Subaru Forester 2015 Matt Black Cars Mombasa CBD Mombasa Foreign Used Unpainted, Original parts, First registration Subaru Matt Black Forester 2015.0 86910.0 NaN NaN NaN Automatic NaN Yes 2450000

222 rows × 18 columns

6. Query¶

The query function offers a little more flexibility at writing the conditions for filtering.

We can pass the conditions as a string.

For instance, the following code returns the rows that belong to the Cars category and have a value higher than 5,000,000 in the price column.

In [89]:
df.query("category == 'Cars' and price > 5_000_000")
Out[89]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
8 BMW X4 2015 xDrive35i Black Cars Mombasa CBD Mombasa Foreign Used No faults BMW Black X4 2015.0 63128.0 NaN NaN NaN Automatic NaN No 5800000
13 BMW 520i 2014 Black Cars Mombasa CBD Mombasa Foreign Used First registration BMW Black 520i 2014.0 80000.0 NaN NaN NaN Automatic NaN No 5700000
22 Lexus RX 2016 Black Cars Mombasa CBD Mombasa Foreign Used No faults Lexus Black RX 2016.0 NaN NaN NaN NaN Automatic NaN Yes 14500000
40 Volkswagen Touareg 2014 Black Cars Mombasa CBD Mombasa Foreign Used First registration Volkswagen Black Touareg 2014.0 56000.0 NaN NaN NaN Automatic NaN No 5200000
53 Toyota Land Cruiser Prado 2015 2.7 VVT-i Brown Cars Mvita Mombasa Foreign Used No faults Toyota Brown Land Cruiser Prado 2015.0 45260.0 SUV Petrol All Wheel Automatic 5.0 No 6500000
73 Toyota Land Cruiser Prado 2014 2.7 VVT-i Gold Cars Mombasa CBD Mombasa Foreign Used First owner, First registration Toyota Gold Land Cruiser Prado 2014.0 NaN SUV Petrol All Wheel Automatic 5.0 No 5600000
156 Toyota Land Cruiser 2010 4.6 V8 ZX Black Cars Runda Nairobi Foreign Used No faults Toyota Black Land Cruiser 2010.0 NaN SUV Petrol 4x4 Automatic 8.0 No 8799999
177 Toyota Land Cruiser Prado 2014 Blue Cars Nairobi Central Nairobi Foreign Used First owner, Unpainted, Original parts Toyota Blue Land Cruiser Prado 2014.0 42000.0 NaN NaN NaN Automatic NaN No 6150000
220 Volkswagen Touareg 2015 TDI Executive AWD 4MOT... Cars Tudor Mombasa Foreign Used No faults Volkswagen Black Touareg 2015.0 111285.0 SUV Diesel All Wheel Automatic 5.0 NaN 5500000
224 Toyota Hilux 2016 Black Cars Mombasa CBD Mombasa Foreign Used First registration Toyota Black Hilux 2016.0 36000.0 NaN NaN NaN Automatic NaN No 9000000
241 BMW X5 2015 White Cars Mombasa CBD Mombasa Foreign Used First registration BMW White X5 2015.0 68000.0 SUV Petrol All Wheel Automatic NaN No 6300000
249 Toyota Land Cruiser 2014 4.6 V8 ZX Black Cars Karen Nairobi Foreign Used Unpainted, Original parts, No faults Toyota Black Land Cruiser 2014.0 30000.0 SUV Petrol 4x4 Automatic 8.0 No 8199999
265 New Hyundai Palisade 2021 White Cars Mombasa Road Nairobi Brand New No faults Hyundai White Palisade 2021.0 200.0 NaN NaN NaN Automatic NaN Yes 9500000
276 Lexus NX 2014 Red Cars Mombasa CBD Mombasa Foreign Used No faults Lexus Red NX 2014.0 NaN NaN NaN NaN Automatic NaN No 5400000

7. nlargest and nsmallest¶

In some cases, we do not have a specific range for filtering but just need the largest or smallest values.

The nlargest and nsmallest functions allow for selecting rows that have the largest or smallest values in a column, respectively.

We specify the number of largest or smallest values to be selected and the name of the column.

Example: lets select 5 vehicles with the highest price

In [92]:
# 5 vehicles with highest price

df.nlargest(5,'price')
Out[92]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
22 Lexus RX 2016 Black Cars Mombasa CBD Mombasa Foreign Used No faults Lexus Black RX 2016.0 NaN NaN NaN NaN Automatic NaN Yes 14500000
148 Mazda Bongo Buses & Microbuses Ridgeways Nairobi Foreign Used Unpainted, No faults Mazda White Bongo 2014.0 127800.0 NaN NaN NaN Automatic NaN NaN 11200000
265 New Hyundai Palisade 2021 White Cars Mombasa Road Nairobi Brand New No faults Hyundai White Palisade 2021.0 200.0 NaN NaN NaN Automatic NaN Yes 9500000
224 Toyota Hilux 2016 Black Cars Mombasa CBD Mombasa Foreign Used First registration Toyota Black Hilux 2016.0 36000.0 NaN NaN NaN Automatic NaN No 9000000
156 Toyota Land Cruiser 2010 4.6 V8 ZX Black Cars Runda Nairobi Foreign Used No faults Toyota Black Land Cruiser 2010.0 NaN SUV Petrol 4x4 Automatic 8.0 No 8799999
Select 5 cars with the lowest price¶
In [93]:
# 5 vehicles with the lowest prices
df.nsmallest(5,'price')
Out[93]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
111 Mitsubishi Lancer / Cedia 2002 White Cars Syokimau Machakos Kenyan Used No faults Mitsubishi White Lancer / Cedia 2002.0 NaN NaN NaN NaN Automatic NaN Yes 250000
85 Volkswagen Golf 2012 Blue Cars Ridgeways Nairobi Kenyan Used First owner, After crash, Original parts Volkswagen Blue Golf 2012.0 49252.0 NaN NaN NaN Automatic NaN Yes 260000
246 Mazda Familia 1998 Black Cars Ridgeways Nairobi Kenyan Used Mazda Mazda Black Familia 1998.0 202745.0 NaN Petrol NaN Manual NaN NaN 300000
38 Land Rover Range Rover 1979 Green Cars Kisumu Central Kisumu Kenyan Used First owner Land Rover Green Range Rover 1979.0 NaN NaN NaN NaN Manual NaN Yes 400000
261 Single Diff Volve FL10 Trucks & Trailers Embakasi Nairobi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 400000

Challenges¶

Challenge 1: use .loc to retrieve all the rows for columns:¶

  • title

  • region

  • brand

  • price

In [95]:
# solution to challenge 1
df.loc[:, ['title','region','brand','price']]
Out[95]:
title region brand price
0 Toyota Land Cruiser Prado 2016 Black Mvita Toyota 6500000
1 Mazda Demio 2014 Brown Langata Mazda 970000
2 Clean NV300 Caravan 2014 Model Dielsel 16 Seater Kilimani Nissan 2550000
3 Toyota Crown 2014 Pearl Kilimani Toyota 2100000
4 Honda Fit 2014 Black Mvita Honda 880000
... ... ... ... ...
295 Toyota Allion 2008 Silver Ganjoni Toyota 860000
296 Mitsubishi Hd Kisauni Mitsubishi 1300000
297 Mitsubishi Fuso Refrigerated Kisauni Mitsubishi 2200000
298 Toyota Ractis 2009 Black Ridgeways Toyota 550000
299 Subaru Forester 2015 Matt Black Mombasa CBD Subaru 2450000

300 rows × 4 columns

Challenge 2: Use a filter to select all the vehicles that are Toyota brand and are within the region kilimani¶

In [97]:
# solution
# first create a mask using the conditions
mask = (df['brand'] == "Toyota") & (df['region'] == "Kilimani")

# apply the mask on the dataframe to filter only True values
df[mask]
Out[97]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
3 Toyota Crown 2014 Pearl Cars Kilimani Nairobi Foreign Used No faults Toyota Pearl Crown 2014.0 75000.0 NaN NaN NaN Automatic NaN No 2100000
31 Toyota Mark X 2014 Silver Cars Kilimani Nairobi Foreign Used No faults Toyota Silver Mark X 2014.0 53580.0 NaN NaN NaN Automatic NaN Yes 1980000
42 Toyota Harrier 2014 Black Cars Kilimani Nairobi Foreign Used First registration, Original parts, No faults Toyota Black Harrier 2014.0 48592.0 NaN NaN NaN Automatic NaN No 3450000
74 Toyota Corolla Rumion 2014 Blue Cars Kilimani Nairobi Foreign Used No faults Toyota Blue Corolla Rumion 2014.0 54000.0 NaN NaN NaN Automatic NaN Yes 1330000
126 Toyota RAV4 2014 SV Premium Black Cars Kilimani Nairobi Foreign Used Original parts Toyota Black RAV4 2014.0 NaN NaN NaN NaN Automatic NaN Yes 2800000
262 Toyota Hiace 2014 White Buses & Microbuses Kilimani Nairobi Foreign Used Toyota Toyota NaN HiAce 2014.0 100000.0 NaN NaN NaN NaN NaN NaN 2100000

Challenge 3. Select all the vehicles that contain the word 'CX' in their model¶

In [99]:
### solution to challenge 3

df[df['model'].str.contains('CX', na=False)]
Out[99]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
15 Mazda CX-5 2014 Sport FWD Blue Cars Ridgeways Nairobi Foreign Used No faults Mazda Blue CX-5 2014.0 NaN SUV Petrol Front Wheel Automatic 5.0 NaN 2060000
24 Mazda CX-5 2014 Red Cars Nairobi Central Nairobi Foreign Used First registration Mazda Red CX-5 2014.0 85239.0 NaN NaN NaN Automatic NaN Yes 2250000
52 Mazda CX-5 2016 Red Cars Mombasa CBD Mombasa Foreign Used First registration Mazda Red CX-5 2016.0 51238.0 NaN NaN NaN Automatic NaN No 2900000
58 Mazda CX-5 2014 Grand Touring FWD Red Cars Mombasa CBD Mombasa Foreign Used No faults Mazda Red CX-5 2014.0 NaN SUV Petrol NaN Automatic 5.0 No 2300000
94 Mazda CX-3 2016 Red Cars Mombasa CBD Mombasa Foreign Used First registration Mazda Red CX-3 2016.0 76000.0 NaN NaN NaN Automatic NaN No 1800000
183 New Mazda CX-5 2014 Grand Touring AWD Blue Cars Nairobi Central Nairobi Brand New No faults Mazda Blue CX-5 2014.0 115000.0 SUV Petrol All Wheel Automatic 5.0 No 2100000

Challenge 4. Use a filter to return only the vehicles:¶

  • that are of brand Mercedes-Benz

  • The color should not be Red

  • Sort the results by price in descending order

In [107]:
## Challenge 4 solution

df[(df['brand']=="Mercedes-Benz") & (df['color'] != 'Red')].sort_values(by='price',ascending=False)
Out[107]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
195 Mercedes-Benz Actros Trucks & Trailers Thome Nairobi Used Used Mercedes-Benz NaN NaN 2013.0 NaN NaN NaN NaN NaN NaN NaN 7500000
247 Mercedes-Benz E250 2014 Black Cars Nairobi Central Nairobi Foreign Used No faults Mercedes-Benz Black E250 2014.0 49000.0 NaN NaN NaN Automatic NaN No 3900000
274 Mercedes-Benz C200 2015 Black Cars Mombasa CBD Mombasa Foreign Used No faults Mercedes-Benz Black C200 2015.0 56800.0 NaN NaN NaN Automatic NaN Yes 3850000
78 Mercedes-Benz E250 2014 SV Premium Black Cars Nairobi Central Nairobi Kenyan Used Unpainted, No faults, Original parts Mercedes-Benz Black E250 2014.0 100555.0 NaN NaN NaN Automatic NaN Yes 3200000
207 Mercedes-Benz C180 2014 Black Cars Kilimani Nairobi Foreign Used No faults Mercedes-Benz Black C180 2014.0 82000.0 NaN NaN NaN Automatic NaN Yes 2500000
219 Mercedes-Benz C-Class 2011 C 200 (C204) White Cars Lavington Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White C-Class 2011.0 76000.0 Coupe Petrol Rear Wheel Automatic 5.0 NaN 1650000
238 Mercedes-Benz B-Class 2014 Silver Cars Mvita Mombasa Foreign Used First owner, No faults Mercedes-Benz Silver B-Class 2014.0 90000.0 NaN NaN NaN Automatic NaN NaN 1650000
210 Mercedes-Benz C200 2008 White Cars Langata Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White C200 2008.0 132489.0 NaN Petrol NaN Automatic NaN NaN 1400000
179 Mercedes-Benz B-Class 2009 White Cars Lavington Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White B-Class 2009.0 154155.0 NaN Petrol NaN Automatic NaN NaN 699999

Challenge 5: Write a filter to select vehicles¶

  • that are of brand Mercedes-Benz or BMW

  • That mention the colors 'Black' or 'White' in the title

  • Are in either of the following regions: Kilimani, Lavington, Langata or Westlands

  • Sort them in alphabetical order by brand, then by price with most expensive on top

In [134]:
#### Challenge 5 solution
df[
    ((df['brand'] == 'Mercedes-Benz') | (df['brand'] == 'BMW')) &
    (df['title'].str.contains('Black') | (df['title'].str.contains('White'))) &
    (df['region'].isin(['Kilimani','Lavington','Langata','Westlands']))
].sort_values(by=['brand','price'],ascending=[True,False])
Out[134]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
282 BMW 7 Series 2013 White Cars Westlands Nairobi Foreign Used No faults BMW White 7 Series 2013.0 40000.0 NaN NaN NaN Automatic NaN Yes 3800000
207 Mercedes-Benz C180 2014 Black Cars Kilimani Nairobi Foreign Used No faults Mercedes-Benz Black C180 2014.0 82000.0 NaN NaN NaN Automatic NaN Yes 2500000
219 Mercedes-Benz C-Class 2011 C 200 (C204) White Cars Lavington Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White C-Class 2011.0 76000.0 Coupe Petrol Rear Wheel Automatic 5.0 NaN 1650000
210 Mercedes-Benz C200 2008 White Cars Langata Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White C200 2008.0 132489.0 NaN Petrol NaN Automatic NaN NaN 1400000
179 Mercedes-Benz B-Class 2009 White Cars Lavington Nairobi Kenyan Used Mercedes-Benz Mercedes-Benz White B-Class 2009.0 154155.0 NaN Petrol NaN Automatic NaN NaN 699999

Challenge 6: Write a filter to select vehicles¶

  • that are of brand Toyota or Mazda or Subaru

  • That mention the colors 'Black' or 'White' in the title

  • Are in either of the following regions: Kilimani, Lavington, Langata, Westlands, Ridgeways

  • Sort them in alphabetical order by brand, then by price with most expensive on top

In [ ]:
### Challenge 6 Solution

# your solution here



#

Cleaning Data : Casting Datatypes and Handling Missing values¶

Data cleaning means fixing bad data in your data set.

Bad data could be:

  • Empty cells

  • Data in wrong format

  • Wrong data

  • Duplicates

Earlier we say the command to see the number of missing values per column:

In [144]:
df.isnull().sum()
Out[144]:
title              0
category           0
region             0
parent_region      6
condition          5
attrs              1
brand              1
color             56
model             17
yom                3
mileage           91
body_type        258
fuel             218
drive_train      259
trans             51
seat             269
registered       125
price              0
dtype: int64

Cleaning Empty Cells¶

Empty cells can potentially give you a wrong result when you analyze data.

One way of dealing with empty cells is to insert a new value instead.

Since only 3 rows are missing the values for yom, here's how we'll handle the missing values:

  • Fill the yom empty cell for Isuzu Truck with the modal class (most frequent) for vehicles with category Trucks & Trailers and brand Isuzu

  • Fill in the yom empty cell for Subaru Outback 2014 black with the modal class for vehicles with category Cars and model Outback

  • Since most of the entries are missing for Single Diff Volve we remove the row. Drop/discard the record.

In [146]:
df[df['yom'].isnull()]
Out[146]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
109 Isuzu Truck . Trucks & Trailers Donholm Nairobi NaN Isuzu Isuzu NaN N Series NaN NaN NaN NaN NaN NaN NaN NaN 810000
201 Subaru Outback 2014 black Cars Nairobi NaN NaN Outback Subaru Black Outback NaN 180097.0 NaN NaN NaN Automatic NaN NaN 1649999
261 Single Diff Volve FL10 Trucks & Trailers Embakasi Nairobi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 400000
In [153]:
# Isuzu trucks modal yom

# filter 
df[(df['category']=='Trucks & Trailers')&(df['brand']=='Isuzu')].head()
Out[153]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
14 Very Clean Isuzu FRR Truck 2015 Model Trucks & Trailers Thome Nairobi Used Used Isuzu White F SERIES 2015.0 NaN NaN Diesel NaN Manual NaN NaN 3520000
36 Isuzu Elf,Year 2014manual Transmission Trucks & Trailers Mombasa CBD Mombasa Brand New Brand New Isuzu NaN NaN 2013.0 NaN NaN NaN NaN NaN NaN NaN 2150000
41 Isuzu Nkr. Yr 2019 Trucks & Trailers Ridgeways Nairobi Used Used Isuzu White N Series 2019.0 NaN NaN Diesel NaN Manual NaN NaN 3050000
46 Isuzu Frr Kbw Trucks & Trailers Eldoret CBD Uasin Gishu Used Used Isuzu White F SERIES 2013.0 NaN NaN Diesel NaN Manual NaN NaN 2600000
62 Isuzu Elf,Year 2015 Manual Trucks & Trailers Mombasa CBD Mombasa Brand New Brand New Isuzu NaN NaN 2014.0 NaN NaN NaN NaN NaN NaN NaN 3650000
In [160]:
df[(df['category']=='Trucks & Trailers')&(df['brand']=='Isuzu')]['yom'].value_counts()
Out[160]:
2015.0    4
2013.0    3
2012.0    2
2019.0    1
2014.0    1
2006.0    1
2007.0    1
1994.0    1
1990.0    1
1999.0    1
2016.0    1
2010.0    1
Name: yom, dtype: int64

You can fill the yom value for Isuzu Truck with the 2014 which is the modal class for Isuzu Trucks. We know that the Isuzu Truck is on index 109 of our DataFrame, we will use the .at attribute to index the row.

In [164]:
df.at[109,'yom'] = 2014
In [165]:
df.at[109,'yom']
Out[165]:
2014.0

Repeat the process for the Subaru

In [172]:
df[(df['category']=='Cars')&(df['model']=='Outback')]['yom'].value_counts().index.tolist()
Out[172]:
[2015.0, 2013.0, 2014.0]
In [173]:
df.at[201,'yom']
Out[173]:
nan
In [175]:
df.at[201,'yom'] = 2015

Discard a specific row from the DataFrame¶

To drop a row from the DataFrame, simply use df.drop() then you specify the index and axis.

We know that the entry is on row 261.

the df.dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containg NULL values from the original DataFrame.

In [177]:
df.drop(261,inplace=True)
In [179]:
df.isnull().sum()
Out[179]:
title              0
category           0
region             0
parent_region      6
condition          4
attrs              0
brand              0
color             55
model             16
yom                0
mileage           90
body_type        257
fuel             217
drive_train      258
trans             50
seat             268
registered       124
price              0
dtype: int64

Now you can see that the column yom has 0 null values

The fillna() method allows us to replace empty cells with a value.

To only replace empty values for one column, specify the column name for the DataFrame

Let's fill in the Missing values for color of all missing rows with Not Specified

In [190]:
df['color'].value_counts()
Out[190]:
White         68
Black         57
Silver        27
Blue          25
Gray          16
Red           11
Brown         10
Pink           5
Purple         5
Green          4
Pearl          3
Gold           3
Other          3
Beige          2
Off white      2
Grey           1
Burgandy       1
Matt Black     1
Name: color, dtype: int64
In [192]:
df['color'].fillna('Not Specified', inplace=True)
In [193]:
df['color'].value_counts()
Out[193]:
White            68
Black            57
Not Specified    55
Silver           27
Blue             25
Gray             16
Red              11
Brown            10
Purple            5
Pink              5
Green             4
Gold              3
Pearl             3
Other             3
Beige             2
Off white         2
Grey              1
Burgandy          1
Matt Black        1
Name: color, dtype: int64

Replace with mean, median or mode¶

A common way to replace empty cells, is to calculate the mean, median or mode value of the column.

Pandas uses the mean(), median() and mode() methods to calculate the respective values for a specified column

  • Mean = the average value (the sum of all values divided by number of values).
  • Median = the value in the middle, after you have sorted all values ascending.
  • Mode = the value that appears most frequently.
In [197]:
df[df['condition'].isnull()]
Out[197]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
109 Isuzu Truck . Trucks & Trailers Donholm Nairobi NaN Isuzu Isuzu Not Specified N Series 2014.0 NaN NaN NaN NaN NaN NaN NaN 810000
161 Tipper In Exellent Condition. Trucks & Trailers Mombasa CBD Mombasa NaN Isuzu Isuzu Not Specified NaN 2012.0 NaN NaN NaN NaN NaN NaN NaN 1850000
201 Subaru Outback 2014 black Cars Nairobi Nairobi NaN Outback Subaru Black Outback 2015.0 180097.0 NaN NaN NaN Automatic NaN NaN 1649999
202 Mazda Bongo 2013 Trucks & Trailers Ngong Kajiado NaN Mazda Mazda Not Specified NaN 2013.0 NaN NaN NaN NaN NaN NaN NaN 1650000
In [198]:
df['condition'].value_counts()
Out[198]:
Foreign Used    170
Kenyan Used      65
Used             35
Brand New        25
Name: condition, dtype: int64
In [204]:
# the modal condition for vehicles under the Trucks & Trailers category
trucks_mode = df[df['category'] == "Trucks & Trailers"]['condition'].mode()
trucks_mode
Out[204]:
0    Used
Name: condition, dtype: object
In [228]:
df['condition'].replace(np.nan,'Used',inplace=True)

So we choose to fill the missing values for condition with Used

The other way to deal with rows with missing values is to drop them using a criteria.

Suppose we want to drop all the vehicles where all the following details are missing:

  • model
  • mileage
  • body_type
  • fuel
  • drive_train
  • trans
  • seat
  • registered

we need to specify these columns in as a subset

syntax¶

df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

The how argument determines if row or column is removed from DataFrame, when we have at least one NA or all NA.

  • any : If any NA values are present, drop that row or column.

  • all : If all values are NA, drop that row or column.

In [240]:
# drop when any columns in cols list are missing
cols=['model','mileage','body_type',
      'fuel','drive_train','trans',
      'seat','registered']
df.dropna(axis='index', how='any', subset=cols).head()
Out[240]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
7 Toyota Sienta 2014 1.5 AWD Gray Cars Ganjoni Mombasa Foreign Used No faults Toyota Gray Sienta 2014.0 51000.0 Minivan Petrol All Wheel Automatic 7.0 Yes 1200000
11 Lexus NX 2016 200t FWD Beige Cars Kilimani Nairobi Foreign Used No faults, Unpainted, Original parts Lexus Beige NX 2016.0 35000.0 SUV Petrol Front Wheel Automatic 5.0 No 4750000
18 Toyota Harrier 2010 2.4 Black Cars Ridgeways Nairobi Kenyan Used First registration, First owner Toyota Black Harrier 2010.0 110643.0 SUV Petrol Front Wheel Automatic 5.0 Yes 1850000
19 Volkswagen Touareg 2006 3.0 V6 TDi Automatic B... Cars Lavington Nairobi Kenyan Used Original parts Volkswagen Black Touareg 2006.0 180000.0 SUV Diesel All Wheel Automatic 5.0 Yes 1850000
53 Toyota Land Cruiser Prado 2015 2.7 VVT-i Brown Cars Mvita Mombasa Foreign Used No faults Toyota Brown Land Cruiser Prado 2015.0 45260.0 SUV Petrol All Wheel Automatic 5.0 No 6500000
In [239]:
# drop only when any 5 columns in cols list are missing
cols=['model','mileage','body_type',
      'fuel','drive_train','trans',
      'seat','registered']

df.dropna(axis='index', how='any', subset=cols,thresh=5).head()
Out[239]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
7 Toyota Sienta 2014 1.5 AWD Gray Cars Ganjoni Mombasa Foreign Used No faults Toyota Gray Sienta 2014.0 51000.0 Minivan Petrol All Wheel Automatic 7.0 Yes 1200000
9 Mitsubishi Outlander 2015 White Cars Lavington Nairobi Foreign Used Unpainted, Original parts, No faults Mitsubishi White Outlander 2015.0 40382.0 SUV Petrol Front Wheel Automatic NaN Yes 2900000
10 Toyota Ractis 2012 1.3 FWD Silver Cars Lavington Nairobi Kenyan Used Toyota Toyota Silver Ractis 2012.0 167704.0 Hatchback Petrol Front Wheel Automatic 5.0 NaN 800000
11 Lexus NX 2016 200t FWD Beige Cars Kilimani Nairobi Foreign Used No faults, Unpainted, Original parts Lexus Beige NX 2016.0 35000.0 SUV Petrol Front Wheel Automatic 5.0 No 4750000
15 Mazda CX-5 2014 Sport FWD Blue Cars Ridgeways Nairobi Foreign Used No faults Mazda Blue CX-5 2014.0 NaN SUV Petrol Front Wheel Automatic 5.0 NaN 2060000
In [238]:
# drop only when all columns in cols list are missing
cols=['model','mileage','body_type',
      'fuel','drive_train','trans',
      'seat','registered']
df.dropna(axis='index', how='all', subset=cols).head()
Out[238]:
title category region parent_region condition attrs brand color model yom mileage body_type fuel drive_train trans seat registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 NaN NaN NaN Automatic NaN NaN 6500000
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 NaN NaN NaN Automatic NaN Yes 970000
2 Clean NV300 Caravan 2014 Model Dielsel 16 Seater Buses & Microbuses Kilimani Nairobi Foreign Used Nissan Nissan Not Specified Caravan (Urvan) 2014.0 180000.0 NaN NaN NaN NaN NaN NaN 2550000
3 Toyota Crown 2014 Pearl Cars Kilimani Nairobi Foreign Used No faults Toyota Pearl Crown 2014.0 75000.0 NaN NaN NaN Automatic NaN No 2100000
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 NaN NaN NaN Automatic NaN Yes 880000

To persist the changes specify inplace=True¶

In [241]:
# drop only when all columns in cols list are missing
cols=['model','mileage','body_type',
      'fuel','drive_train','trans',
      'seat','registered']
df.dropna(axis='index', how='all', subset=cols,inplace=True)
In [243]:
df.isnull().sum()
Out[243]:
title              0
category           0
region             0
parent_region      0
condition          0
attrs              0
brand              0
color              0
model              1
yom                0
mileage           75
body_type        242
fuel             202
drive_train      243
trans             35
seat             253
registered       109
price              0
dtype: int64

Drop columns if majority of values are missing¶

Keep only the columns with at least 60% non-NA values.

In [258]:
threshold = int(0.6* len(df))
threshold
Out[258]:
170
In [260]:
cols=['model','body_type',
      'fuel','drive_train','trans',
      'seat','registered']
df.dropna(axis='columns',thresh=threshold,inplace=True)
In [261]:
df.isnull().sum()
Out[261]:
title              0
category           0
region             0
parent_region      0
condition          0
attrs              0
brand              0
color              0
model              1
yom                0
mileage           75
trans             35
registered       109
price              0
dtype: int64
In [264]:
df.groupby('condition')['registered'].value_counts()
Out[264]:
condition     registered
Brand New     No             6
              Yes            6
Foreign Used  Yes           80
              No            49
Kenyan Used   Yes           34
Name: registered, dtype: int64
In [284]:
df[df['condition'] == "Kenyan Used"]['registered'].value_counts()
Out[284]:
Yes    65
Name: registered, dtype: int64

For all vehicles that indicate they are Kenyan Used, fill in all missing values for column registered with 'Yes'

In [282]:
reg_df = df[df['condition'] == "Kenyan Used"]['registered'].apply(lambda x: 'Yes' if x is np.nan else x)

df.loc[ df['condition'] == "Kenyan Used", 'registered'] = reg_df
In [285]:
df.groupby('condition')['registered'].value_counts()
Out[285]:
condition     registered
Brand New     No             6
              Yes            6
Foreign Used  Yes           80
              No            49
Kenyan Used   Yes           65
Name: registered, dtype: int64
In [283]:
df.isnull().sum()
Out[283]:
title             0
category          0
region            0
parent_region     0
condition         0
attrs             0
brand             0
color             0
model             1
yom               0
mileage          75
trans            35
registered       78
price             0
dtype: int64

Let's say we want to generate a scatter plot of car_price vs car_mileage¶

We first drop all rows that are missing the car_mileage

In [184]:
import matplotlib.pyplot as plt
In [357]:
cars_df = df[df['category']=='Cars'].copy()
cars_df.head()
Out[357]:
title category region parent_region condition attrs brand color model yom mileage trans registered price
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 Automatic NaN 6500000
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 Automatic Yes 970000
3 Toyota Crown 2014 Pearl Cars Kilimani Nairobi Foreign Used No faults Toyota Pearl Crown 2014.0 75000.0 Automatic No 2100000
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 Automatic Yes 880000
5 Mitsubishi Delica 2013 White Cars Mvita Mombasa Foreign Used First registration, No faults, Unpainted Mitsubishi White Delica 2013.0 88000.0 Automatic Yes 630000
In [358]:
plt.scatter("yom","price",data=cars_df)
plt.title("Year Against Price")
plt.xlabel("Year of Make")
plt.ylabel("Price in Kshs Millions")
plt.grid()

Create a new Column from an Existing Column¶

Let's create a new column age_years which is the number of years that have pased for a vehicle of category Car since it's year of make.

To come up with this column, we use the column yom to arrive at the number of years from the current year.

We will simply deduct the current year with the entry for each row.

age_years = current_year - yom

In [359]:
# ensure that the yom column is numeric 
cars_df.dtypes
Out[359]:
title             object
category          object
region            object
parent_region     object
condition         object
attrs             object
brand             object
color             object
model             object
yom              float64
mileage          float64
trans             object
registered        object
price              int64
dtype: object
In [360]:
# get the current year
from datetime import date

cur_year = date.today().year
cur_year
Out[360]:
2022
In [361]:
cur_year - cars_df['yom']
Out[361]:
0       6.0
1       8.0
3       8.0
4       8.0
5       9.0
       ... 
293     8.0
294     8.0
295    14.0
298    13.0
299     7.0
Name: yom, Length: 221, dtype: float64
In [362]:
# to create a new column simply put the name of the new column withing square brackets
cars_df['age_years'] = cur_year - cars_df['yom']
cars_df.head()
Out[362]:
title category region parent_region condition attrs brand color model yom mileage trans registered price age_years
0 Toyota Land Cruiser Prado 2016 Black Cars Mvita Mombasa Foreign Used First registration, No faults Toyota Black Land Cruiser Prado 2016.0 87000.0 Automatic NaN 6500000 6.0
1 Mazda Demio 2014 Brown Cars Langata Nairobi Foreign Used First owner, No faults Mazda Brown Demio 2014.0 92000.0 Automatic Yes 970000 8.0
3 Toyota Crown 2014 Pearl Cars Kilimani Nairobi Foreign Used No faults Toyota Pearl Crown 2014.0 75000.0 Automatic No 2100000 8.0
4 Honda Fit 2014 Black Cars Mvita Mombasa Foreign Used No faults Honda Black Fit 2014.0 58000.0 Automatic Yes 880000 8.0
5 Mitsubishi Delica 2013 White Cars Mvita Mombasa Foreign Used First registration, No faults, Unpainted Mitsubishi White Delica 2013.0 88000.0 Automatic Yes 630000 9.0
In [363]:
cars_df['age_years'].astype(int)
Out[363]:
0       6
1       8
3       8
4       8
5       9
       ..
293     8
294     8
295    14
298    13
299     7
Name: age_years, Length: 221, dtype: int32
In [364]:
# casting the dtype from float to int
cars_df['age_years'] = cars_df['age_years'].astype(int)
In [365]:
cars_df['age_years'].mean()
Out[365]:
9.027149321266968

Create a simple scatter plot of age_years vs price¶

In [366]:
cars_df[['age_years','price']].corr()
Out[366]:
age_years price
age_years 1.000000 -0.286765
price -0.286765 1.000000
In [347]:
# there is a weak negative linear relationship between age_years and price
In [367]:
from matplotlib import pyplot as plt

plt.scatter('age_years', 'price', data=cars_df)
Out[367]:
<matplotlib.collections.PathCollection at 0x1a4655b5b80>
In [368]:
cars_df['price_log'] = np.log(cars_df['price'])
In [369]:
cars_df.corr()
Out[369]:
yom mileage price age_years price_log
yom 1.000000 -0.290489 0.286765 -1.000000 0.430495
mileage -0.290489 1.000000 -0.188173 0.290489 -0.198183
price 0.286765 -0.188173 1.000000 -0.286765 0.879782
age_years -1.000000 0.290489 -0.286765 1.000000 -0.430495
price_log 0.430495 -0.198183 0.879782 -0.430495 1.000000
In [370]:
from matplotlib import pyplot as plt

plt.scatter('yom', 'price_log', data=cars_df)
Out[370]:
<matplotlib.collections.PathCollection at 0x1a465623400>
In [ ]:
 
In [ ]:
 
In [ ]:
 

Previous : Just Enough Numpy¶

Next : [Data Visualization with Python Matplotlib]¶

References¶

  1. https://vikingpathak.medium.com/pandas-loc-v-s-iloc-v-s-at-v-s-iat-v-s-ix-8d82eada663a

  2. https://stackoverflow.com/questions/28757389/pandas-loc-vs-iloc-vs-at-vs-iat

  3. https://towardsdatascience.com/8-ways-to-filter-pandas-dataframes-d34ba585c1b8

  4. https://www.dummies.com/article/academics-the-arts/math/statistics/types-of-statistical-data-numerical-categorical-and-ordinal-169735