Definition
Types of Statistical Data
Indexing
Sorting
Filtering
Cleaning Data
Add or Remove rows and columns
Grouping and Aggregating
Dates and Time Series Data
Joining Data
Write to File
Summary
Challenges
To follow along this course please download the necessary material
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Pandas is most useful when working with tabular data such as data from relational databases or spreadsheets. Pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.
A DataFrame can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.
To install pandas simply use pip
pip install pandas
Once you have successfully installed pandas, you need to import it.
import pandas as pd
Let's discuss some fundamental data structures.
We can create a DataFrame in two ways,
we can create a dataframe from a variety of structures, such as from lists or a list of lists or from a dictionary.
by reading in an external file.
When converting a Python object into a dataframe we use the DataFrame
keyword.
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
import pandas as pd
# create lists
countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]
# create a list of column names
columns = ['countries','capitals','density']
# create a DataFrame of lists
# remember to use the zip function
pd.DataFrame(zip(countries,capitals,density), columns=columns)
countries | capitals | density | |
---|---|---|---|
0 | China | Beijing | 153 |
1 | India | New Delhi | 464 |
2 | Brazil | Brasil | 25 |
3 | USA | Washington DC | 36 |
4 | Ethiopia | Addis Ababa | 115 |
5 | Egypt | Cairo | 103 |
When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame.
import pandas as pd
# create lists
countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]
# create a dictionary of lists
my_dict = {'countries':countries, 'capitals':capitals, 'density':density}
# convert the dictionary of lists into a DataFrame
df = pd.DataFrame(my_dict)
# display the DataFrame
df
countries | capitals | density | |
---|---|---|---|
0 | China | Beijing | 153 |
1 | India | New Delhi | 464 |
2 | Brazil | Brasil | 25 |
3 | USA | Washington DC | 36 |
4 | Ethiopia | Addis Ababa | 115 |
5 | Egypt | Cairo | 103 |
We can extract the data from an existing datasource, such as a csv (comma seperated variables) file using pd.read_csv
. Inside the file each column is seperated by a comma as the name suggests. The file is read and the data converted to a data frame which is then displayed.
I want to analyse the motor vehicle dataset that is available as a CSV file.
The Motor Vehicle Dataset is available for download. Follow this link.
After Downloading make sure you move the file to the folder where your Python Notebook is located. (ie) move it to your current working directory
pandas provides the read_csv()
function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json,…), each of them with the prefix read_*
.
Make sure to always have a check on the data after reading in the data.
import pandas as pd
# import data saved in file
df = pd.read_csv('vehicle_data.csv')
# display the first 5 rows
df.head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 970000 |
2 | Clean NV300 Caravan 2014 Model Dielsel 16 Seater | Buses & Microbuses | Kilimani | Nairobi | Foreign Used | Nissan | Nissan | NaN | Caravan (Urvan) | 2014.0 | 180000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2550000 |
3 | Toyota Crown 2014 Pearl | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Pearl | Crown | 2014.0 | 75000.0 | NaN | NaN | NaN | Automatic | NaN | No | 2100000 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 880000 |
import pandas as pd
# create lists
countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
density = [153,464,25,36,115,103]
# create pandas series of each of the lists
countries = pd.Series(countries)
density = pd.Series(density)
countries
0 China 1 India 2 Brazil 3 USA 4 Ethiopia 5 Egypt dtype: object
density
0 153 1 464 2 25 3 36 4 115 5 103 dtype: int64
# create Series with index
import pandas as pd
# create lists
countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
density = [153,464,25,36,115,103]
# create pandas series of each of the lists
density = pd.Series(density,index=countries)
density
China 153 India 464 Brazil 25 USA 36 Ethiopia 115 Egypt 103 dtype: int64
# create Series with index and name
import pandas as pd
# create lists
countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
density = [153,464,25,36,115,103]
# create pandas series of each of the lists
density = pd.Series(density,index=countries,name="density")
density
China 153 India 464 Brazil 25 USA 36 Ethiopia 115 Egypt 103 Name: density, dtype: int64
# indexing
# access specific country value
density['Ethiopia']
115
# slicing
density['India':'Egypt']
India 464 Brazil 25 USA 36 Ethiopia 115 Egypt 103 Name: density, dtype: int64
# accessing multiple rows
density[['Ethiopia','Egypt']]
Ethiopia 115 Egypt 103 Name: density, dtype: int64
When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label in between square brackets []
.
import pandas as pd
# create lists
countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]
# create a dictionary of lists
my_dict = {'countries':countries, 'capitals':capitals, 'density':density}
# convert the dictionary of lists into a DataFrame
df = pd.DataFrame(my_dict)
# display the DataFrame
df['density']
0 153 1 464 2 25 3 36 4 115 5 103 Name: density, dtype: int64
type(df['density'])
pandas.core.series.Series
We'll be using the real dataset with more values to learn about DataFrames. Load in the vehicle dataset.
import pandas as pd
# import data saved in file
df = pd.read_csv('vehicle_data.csv')
df.head(10)
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 970000 |
2 | Clean NV300 Caravan 2014 Model Dielsel 16 Seater | Buses & Microbuses | Kilimani | Nairobi | Foreign Used | Nissan | Nissan | NaN | Caravan (Urvan) | 2014.0 | 180000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2550000 |
3 | Toyota Crown 2014 Pearl | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Pearl | Crown | 2014.0 | 75000.0 | NaN | NaN | NaN | Automatic | NaN | No | 2100000 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 880000 |
5 | Mitsubishi Delica 2013 White | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults, Unpainted | Mitsubishi | White | Delica | 2013.0 | 88000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 630000 |
6 | New Toyota Premio 2013 Red | Cars | Mvita | Mombasa | Brand New | No faults, First registration | Toyota | Red | Premio | 2013.0 | 45000.0 | NaN | NaN | NaN | Automatic | NaN | No | 1500000 |
7 | Toyota Sienta 2014 1.5 AWD Gray | Cars | Ganjoni | Mombasa | Foreign Used | No faults | Toyota | Gray | Sienta | 2014.0 | 51000.0 | Minivan | Petrol | All Wheel | Automatic | 7.0 | Yes | 1200000 |
8 | BMW X4 2015 xDrive35i Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | BMW | Black | X4 | 2015.0 | 63128.0 | NaN | NaN | NaN | Automatic | NaN | No | 5800000 |
9 | Mitsubishi Outlander 2015 White | Cars | Lavington | Nairobi | Foreign Used | Unpainted, Original parts, No faults | Mitsubishi | White | Outlander | 2015.0 | 40382.0 | SUV | Petrol | Front Wheel | Automatic | NaN | Yes | 2900000 |
Interested in the last N rows instead? pandas also provides a tail() method. For example, df.tail(10) will return the last 10 rows of the DataFrame.
df.tail(10)
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
290 | Lexus LS 2011 460 AWD Black | Cars | Lavington | Nairobi | Kenyan Used | Lexus | Lexus | Black | LS | 2011.0 | 88201.0 | Sedan | Petrol | All Wheel | Automatic | 5.0 | NaN | 3100000 |
291 | Toyota Ractis 2015 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | Original parts | Toyota | Black | Ractis | 2015.0 | 58964.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 1049000 |
292 | Toyota Wish 2011 Silver | Cars | Ridgeways | Nairobi | Kenyan Used | First owner, Unpainted, Original parts | Toyota | Silver | Wish | 2011.0 | 107809.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 1030000 |
293 | Mitsubishi L200 2014 Gold | Cars | Roysambu | Nairobi | Kenyan Used | First owner, Original parts, First registration | Mitsubishi | Gold | L200 | 2014.0 | NaN | NaN | NaN | NaN | Manual | NaN | Yes | 1200000 |
294 | Subaru Outback 2014 White | Cars | Mvita | Mombasa | Foreign Used | No faults | Subaru | White | Outback | 2014.0 | 63142.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 2500000 |
295 | Toyota Allion 2008 Silver | Cars | Ganjoni | Mombasa | Kenyan Used | No faults | Toyota | Silver | Allion | 2008.0 | 176549.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 860000 |
296 | Mitsubishi Hd | Trucks & Trailers | Kisauni | Mombasa | Used | Used | Mitsubishi | NaN | NaN | 2006.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1300000 |
297 | Mitsubishi Fuso Refrigerated | Trucks & Trailers | Kisauni | Mombasa | Used | Used | Mitsubishi | NaN | Canter | 2014.0 | NaN | NaN | Diesel | NaN | NaN | NaN | NaN | 2200000 |
298 | Toyota Ractis 2009 Black | Cars | Ridgeways | Nairobi | Kenyan Used | First owner, No faults, Original parts | Toyota | Black | Ractis | 2009.0 | 120120.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 550000 |
299 | Subaru Forester 2015 Matt Black | Cars | Mombasa CBD | Mombasa | Foreign Used | Unpainted, Original parts, First registration | Subaru | Matt Black | Forester | 2015.0 | 86910.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 2450000 |
df.shape
(300, 18)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 300 entries, 0 to 299 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 300 non-null object 1 category 300 non-null object 2 region 300 non-null object 3 parent_region 294 non-null object 4 condition 295 non-null object 5 attrs 299 non-null object 6 brand 299 non-null object 7 color 244 non-null object 8 model 283 non-null object 9 yom 297 non-null float64 10 mileage 209 non-null float64 11 body_type 42 non-null object 12 fuel 82 non-null object 13 drive_train 41 non-null object 14 trans 249 non-null object 15 seat 31 non-null float64 16 registered 175 non-null object 17 price 300 non-null int64 dtypes: float64(3), int64(1), object(14) memory usage: 42.3+ KB
df.dtypes
title object category object region object parent_region object condition object attrs object brand object color object model object yom float64 mileage float64 body_type object fuel object drive_train object trans object seat float64 registered object price int64 dtype: object
df['yom'].isnull().sum()
3
df[df['yom'].isnull()]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
109 | Isuzu Truck . | Trucks & Trailers | Donholm | Nairobi | NaN | Isuzu | Isuzu | NaN | N Series | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 810000 |
201 | Subaru Outback 2014 black | Cars | Nairobi | NaN | NaN | Outback | Subaru | Black | Outback | NaN | 180097.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 1649999 |
261 | Single Diff Volve FL10 | Trucks & Trailers | Embakasi | Nairobi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 400000 |
The result tells us there are 300 rows and 18 columns
RangeIndex: 300 entries, 0 to 299
Data columns (total 18 columns)
And the name of each column, with the data type:
When working with statistics, it’s important to recognize the different types of data: numerical (discrete and continuous), categorical, and ordinal.
Data are the actual pieces of information that you collect through your study. Most data fall into one of two groups: numerical or categorical.
Numeric : Continous, Discrete
Categorical : Nominal, Ordinal
Discrete data represent items that can be counted; they take on possible values that can be listed out. For example, the number of customer complaints or the number of flaws or defects.
Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line. Continuous variable can be numeric or date/time. For example, the length of a part or the date and time a payment is received.
Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. Categorical variables are qualitative data in which the values are assigned to a set of distinct groups or categories. These groups may consist of alphabetic (e.g., male, female) or numeric labels (e.g., male = 0, female = 1) that do not contain mathematical information beyond the frequency counts related to group membership.
Ordinal data mixes numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical, where the groups are ordered when graphs and charts are made. However, unlike categorical data, the numbers do have mathematical meaning.
Let's breakdown the columns in our dataframe into their statistical type:
Null Values
The info()
method also tells us how many Non-Null values there are present in each column, and in our data set it seems like there are 283 of 300 Non-Null values in the "Calories" column.
Which means that there are 5 rows with no value at all, in the "model" column, for whatever reason.
Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data, and you will learn more about that in the next chapters.
A handy way of checking for null values in your DataFrame is using
DataFrame.isnull().sum()
df['parent_region'].value_counts(normalize=True)
Mombasa 0.500000 Nairobi 0.431973 Kiambu 0.023810 Nakuru 0.013605 Machakos 0.006803 Kisumu 0.003401 Uasin Gishu 0.003401 Meru 0.003401 Kirinyaga 0.003401 Kajiado 0.003401 Mvita 0.003401 Nyali 0.003401 Name: parent_region, dtype: float64
df.isnull().sum()
title 0 category 0 region 0 parent_region 6 condition 5 attrs 1 brand 1 color 56 model 17 yom 3 mileage 91 body_type 258 fuel 218 drive_train 259 trans 51 seat 269 registered 125 price 0 dtype: int64
df.index.values
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299], dtype=int64)
df.columns
Index(['title', 'category', 'region', 'parent_region', 'condition', 'attrs', 'brand', 'color', 'model', 'yom', 'mileage', 'body_type', 'fuel', 'drive_train', 'trans', 'seat', 'registered', 'price'], dtype='object')
loc
, iloc
, at
and iat
¶.loc
selects data only by labels
.iloc
selects data only by integer location
.at
: get scalar values. It's a very fast loc
.iat
Get scalar values. It's a very fast iloc
at
andiat
are meant to access a scalar, that is, a single element in the dataframe, whileloc
andiloc
are meant to access several elements at the same time, potentially to perform vectorized operations.
.loc
¶Access a group of rows and columns by label(s) or a boolean array.
# lets create a toy dataset
import pandas as pd
# create lists
countries = ["China","India","Brazil","USA","Ethiopia","Egypt"]
capitals = ["Beijing","New Delhi","Brasil","Washington DC","Addis Ababa","Cairo"]
density = [153,464,25,36,115,103]
# create a dictionary of lists
my_dict = {'countries':countries, 'capitals':capitals, 'density':density}
# convert the dictionary of lists into a DataFrame
toy = pd.DataFrame(my_dict,index=countries)
toy
countries | capitals | density | |
---|---|---|---|
China | China | Beijing | 153 |
India | India | New Delhi | 464 |
Brazil | Brazil | Brasil | 25 |
USA | USA | Washington DC | 36 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
Egypt | Egypt | Cairo | 103 |
toy.index.values
array(['China', 'India', 'Brazil', 'USA', 'Ethiopia', 'Egypt'], dtype=object)
toy.loc['Brazil']
countries Brazil capitals Brasil density 25 Name: Brazil, dtype: object
toy.loc[:,'capitals']
China Beijing India New Delhi Brazil Brasil USA Washington DC Ethiopia Addis Ababa Egypt Cairo Name: capitals, dtype: object
While accessing multiple rows and columns using .loc
, represent the row and column labels in separate square brackets, preferably.
toy.loc[['India','Ethiopia','Egypt']]
countries | capitals | density | |
---|---|---|---|
India | India | New Delhi | 464 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
Egypt | Egypt | Cairo | 103 |
toy.loc[:,['countries','density']]
countries | density | |
---|---|---|
China | China | 153 |
India | India | 464 |
Brazil | Brazil | 25 |
USA | USA | 36 |
Ethiopia | Ethiopia | 115 |
Egypt | Egypt | 103 |
toy.loc[
['Brazil','USA'],
['countries','density']
]
countries | density | |
---|---|---|
Brazil | Brazil | 25 |
USA | USA | 36 |
Remember in slicing operation on data frames, the end is inclusive, Also, use only single square bracket.
toy.loc['Brazil':'Ethiopia']
countries | capitals | density | |
---|---|---|---|
Brazil | Brazil | Brasil | 25 |
USA | USA | Washington DC | 36 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
toy.loc[:,'capitals':'density']
capitals | density | |
---|---|---|
China | Beijing | 153 |
India | New Delhi | 464 |
Brazil | Brasil | 25 |
USA | Washington DC | 36 |
Ethiopia | Addis Ababa | 115 |
Egypt | Cairo | 103 |
toy.loc[
'Brazil':'Ethiopia',
'capitals':'density'
]
capitals | density | |
---|---|---|
Brazil | Brasil | 25 |
USA | Washington DC | 36 |
Ethiopia | Addis Ababa | 115 |
.iloc
(integer location)¶The operation of .iloc
is the same as .loc
except for the fact that we use integer positions of the rows and columns instead of labels. The letter i
stands for integer.
Similar to what we have seen before, we can mention single index positions or multiple index positions of the row/column that we want to select.
toy.iloc[1] # the entry at row 1
countries India capitals New Delhi density 464 Name: India, dtype: object
toy.iloc[-1] # negative indexing
countries Egypt capitals Cairo density 103 Name: Egypt, dtype: object
toy.iloc[[1,3,4]] # index multiple rows
countries | capitals | density | |
---|---|---|---|
India | India | New Delhi | 464 |
USA | USA | Washington DC | 36 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
# index all rows and first and third columns
toy.iloc[:, [0,2]]
countries | density | |
---|---|---|
China | China | 153 |
India | India | 464 |
Brazil | Brazil | 25 |
USA | USA | 36 |
Ethiopia | Ethiopia | 115 |
Egypt | Egypt | 103 |
Note that the slicing operation in .iloc shifts back to the traditional way of python where the start is inclusive and the end is exclusive.
# slice from first row to the second row
toy.iloc[0:2]
countries | capitals | density | |
---|---|---|---|
China | China | Beijing | 153 |
India | India | New Delhi | 464 |
toy.iloc[0:4,0:2]
countries | capitals | |
---|---|---|
China | China | Beijing |
India | India | New Delhi |
Brazil | Brazil | Brasil |
USA | USA | Washington DC |
.at
¶The operation of .at
is similar to .lo
c but it is capable of selecting a single cell or value.
Consider we want to know the density of Egypt
toy.at['Egypt','density']
103
The operation of .iat
is similar to .iloc
selecting using integer location but it is capable of selecting a single cell or value just like .at.
Suppose we want to know the density of Egypt. We know Egypt is the last row, and density if the last column.
toy.iat[-1,-1]
103
toy['countries']
China China India India Brazil Brazil USA USA Ethiopia Ethiopia Egypt Egypt Name: countries, dtype: object
To select multiple columns use double square brackets [[]]
Mention the column labels that you want to select in the brackets.
Also, slicing of columns is not possible using index operators. You have to explicitly mention the column labels to be selected.
toy[['countries','density']]
countries | density | |
---|---|---|
China | China | 153 |
India | India | 464 |
Brazil | Brazil | 25 |
USA | USA | 36 |
Ethiopia | Ethiopia | 115 |
Egypt | Egypt | 103 |
The only way to select rows is to use the slicing operation in single square brackets [ ] . You can either mention row labels or index positions.
toy[0:2]
countries | capitals | density | |
---|---|---|---|
China | China | Beijing | 153 |
India | India | New Delhi | 464 |
toy["China":'Brazil']
countries | capitals | density | |
---|---|---|---|
China | China | Beijing | 153 |
India | India | New Delhi | 464 |
Brazil | Brazil | Brasil | 25 |
Sort by the values along either axis.
The columns to sort by is a required argument.
Default axis is the rows (0).
You can specify index=1
or index='columns'
pd.DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)
Sort object by labels (along an axis).
Returns a new DataFrame sorted by label if inplace
argument is False, otherwise updates the original DataFrame and returns None.
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)
Specify the columns to sort by in order by which sorting should follow
pd.DataFrame.sort_values(by=[], axis=0, ascending=True, inplace=False)
# sort the toy dataset in alphabetical order by name of country
toy.sort_values(by="countries")
countries | capitals | density | |
---|---|---|---|
Brazil | Brazil | Brasil | 25 |
China | China | Beijing | 153 |
Egypt | Egypt | Cairo | 103 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
India | India | New Delhi | 464 |
USA | USA | Washington DC | 36 |
# sort the toy dataset such that the country with highest density is on top
toy.sort_values(by='density',ascending=False)
countries | capitals | density | |
---|---|---|---|
India | India | New Delhi | 464 |
China | China | Beijing | 153 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
Egypt | Egypt | Cairo | 103 |
USA | USA | Washington DC | 36 |
Brazil | Brazil | Brasil | 25 |
toy
countries | capitals | density | |
---|---|---|---|
China | China | Beijing | 153 |
India | India | New Delhi | 464 |
Brazil | Brazil | Brasil | 25 |
USA | USA | Washington DC | 36 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
Egypt | Egypt | Cairo | 103 |
as can be seen from above results, the sorting is temporary and does not affect the original DataFrame.
To update the DataFrame and keep the sorted order, specify inplace=True
in the function call.
toy.sort_values(by='density',ascending=False,inplace=True)
toy
countries | capitals | density | |
---|---|---|---|
India | India | New Delhi | 464 |
China | China | Beijing | 153 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
Egypt | Egypt | Cairo | 103 |
USA | USA | Washington DC | 36 |
Brazil | Brazil | Brasil | 25 |
toy.sort_index()
countries | capitals | density | |
---|---|---|---|
Brazil | Brazil | Brasil | 25 |
China | China | Beijing | 153 |
Egypt | Egypt | Cairo | 103 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
India | India | New Delhi | 464 |
USA | USA | Washington DC | 36 |
toy.sort_index(ascending=False)
countries | capitals | density | |
---|---|---|---|
USA | USA | Washington DC | 36 |
India | India | New Delhi | 464 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
Egypt | Egypt | Cairo | 103 |
China | China | Beijing | 153 |
Brazil | Brazil | Brasil | 25 |
# keep the changes
toy.sort_index(inplace=True)
toy
countries | capitals | density | |
---|---|---|---|
Brazil | Brazil | Brasil | 25 |
China | China | Beijing | 153 |
Egypt | Egypt | Cairo | 103 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
India | India | New Delhi | 464 |
USA | USA | Washington DC | 36 |
# create a mask
mask = toy['density'] > 100
mask
Brazil False China True Egypt True Ethiopia True India True USA False Name: density, dtype: bool
toy[mask]
countries | capitals | density | |
---|---|---|---|
China | China | Beijing | 153 |
Egypt | Egypt | Cairo | 103 |
Ethiopia | Ethiopia | Addis Ababa | 115 |
India | India | New Delhi | 464 |
Only the countries with density more than 100 are selected
df.head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 970000 |
2 | Clean NV300 Caravan 2014 Model Dielsel 16 Seater | Buses & Microbuses | Kilimani | Nairobi | Foreign Used | Nissan | Nissan | NaN | Caravan (Urvan) | 2014.0 | 180000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2550000 |
3 | Toyota Crown 2014 Pearl | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Pearl | Crown | 2014.0 | 75000.0 | NaN | NaN | NaN | Automatic | NaN | No | 2100000 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 880000 |
df[df['yom'] > 2015]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
11 | Lexus NX 2016 200t FWD Beige | Cars | Kilimani | Nairobi | Foreign Used | No faults, Unpainted, Original parts | Lexus | Beige | NX | 2016.0 | 35000.0 | SUV | Petrol | Front Wheel | Automatic | 5.0 | No | 4750000 |
22 | Lexus RX 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Lexus | Black | RX | 2016.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 14500000 |
41 | Isuzu Nkr. Yr 2019 | Trucks & Trailers | Ridgeways | Nairobi | Used | Used | Isuzu | White | N Series | 2019.0 | NaN | NaN | Diesel | NaN | Manual | NaN | NaN | 3050000 |
52 | Mazda CX-5 2016 Red | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Mazda | Red | CX-5 | 2016.0 | 51238.0 | NaN | NaN | NaN | Automatic | NaN | No | 2900000 |
94 | Mazda CX-3 2016 Red | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Mazda | Red | CX-3 | 2016.0 | 76000.0 | NaN | NaN | NaN | Automatic | NaN | No | 1800000 |
101 | Mitsubishi Outlander 2016 ES FWD White | Cars | Mombasa CBD | Mombasa | Foreign Used | First owner, First registration, Original parts | Mitsubishi | White | Outlander | 2016.0 | 56390.0 | SUV | Petrol | Front Wheel | Automatic | 7.0 | NaN | 3100000 |
103 | Shacman F2000 Tipper | Trucks & Trailers | Municipality | Meru | Used | Used | Shacman | NaN | NaN | 2018.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5100000 |
107 | Toyota Wish 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Toyota | Black | Wish | 2016.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 1500000 |
222 | Tata Signa LPK-1618 Tipper 10 Ton | Trucks & Trailers | Nairobi Central | Nairobi | Brand New | Brand New | Tata | NaN | M&HCV Rigid Truck | 2021.0 | NaN | NaN | Diesel | NaN | NaN | NaN | NaN | 6000000 |
224 | Toyota Hilux 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Toyota | Black | Hilux | 2016.0 | 36000.0 | NaN | NaN | NaN | Automatic | NaN | No | 9000000 |
236 | Isuzu NPR Clean Unit | Trucks & Trailers | Nairobi Central | Nairobi | Used | Used | Isuzu | Other | N Series | 2016.0 | NaN | NaN | Diesel | NaN | Manual | NaN | NaN | 2850000 |
265 | New Hyundai Palisade 2021 White | Cars | Mombasa Road | Nairobi | Brand New | No faults | Hyundai | White | Palisade | 2021.0 | 200.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 9500000 |
df[(df['yom']>2015) & (df['category'] == 'Cars')]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
11 | Lexus NX 2016 200t FWD Beige | Cars | Kilimani | Nairobi | Foreign Used | No faults, Unpainted, Original parts | Lexus | Beige | NX | 2016.0 | 35000.0 | SUV | Petrol | Front Wheel | Automatic | 5.0 | No | 4750000 |
22 | Lexus RX 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Lexus | Black | RX | 2016.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 14500000 |
52 | Mazda CX-5 2016 Red | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Mazda | Red | CX-5 | 2016.0 | 51238.0 | NaN | NaN | NaN | Automatic | NaN | No | 2900000 |
94 | Mazda CX-3 2016 Red | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Mazda | Red | CX-3 | 2016.0 | 76000.0 | NaN | NaN | NaN | Automatic | NaN | No | 1800000 |
101 | Mitsubishi Outlander 2016 ES FWD White | Cars | Mombasa CBD | Mombasa | Foreign Used | First owner, First registration, Original parts | Mitsubishi | White | Outlander | 2016.0 | 56390.0 | SUV | Petrol | Front Wheel | Automatic | 7.0 | NaN | 3100000 |
107 | Toyota Wish 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Toyota | Black | Wish | 2016.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 1500000 |
224 | Toyota Hilux 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Toyota | Black | Hilux | 2016.0 | 36000.0 | NaN | NaN | NaN | Automatic | NaN | No | 9000000 |
265 | New Hyundai Palisade 2021 White | Cars | Mombasa Road | Nairobi | Brand New | No faults | Hyundai | White | Palisade | 2021.0 | 200.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 9500000 |
isin
¶The isin method is another way of applying multiple condition for filtering.
For instance, we can filter the vehicles whose color is in a given list.
Suppose we only want to select vehicles that are either Black, White or Red
colors = ['Black','White','Red']
df[df['color'].isin(colors)]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 880000 |
5 | Mitsubishi Delica 2013 White | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults, Unpainted | Mitsubishi | White | Delica | 2013.0 | 88000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 630000 |
6 | New Toyota Premio 2013 Red | Cars | Mvita | Mombasa | Brand New | No faults, First registration | Toyota | Red | Premio | 2013.0 | 45000.0 | NaN | NaN | NaN | Automatic | NaN | No | 1500000 |
8 | BMW X4 2015 xDrive35i Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | BMW | Black | X4 | 2015.0 | 63128.0 | NaN | NaN | NaN | Automatic | NaN | No | 5800000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
289 | Nissan Teana 2015 Black | Cars | Nairobi Central | Nairobi | Foreign Used | First registration | Nissan | Black | Teana | 2015.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 1420000 |
290 | Lexus LS 2011 460 AWD Black | Cars | Lavington | Nairobi | Kenyan Used | Lexus | Lexus | Black | LS | 2011.0 | 88201.0 | Sedan | Petrol | All Wheel | Automatic | 5.0 | NaN | 3100000 |
291 | Toyota Ractis 2015 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | Original parts | Toyota | Black | Ractis | 2015.0 | 58964.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 1049000 |
294 | Subaru Outback 2014 White | Cars | Mvita | Mombasa | Foreign Used | No faults | Subaru | White | Outback | 2014.0 | 63142.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 2500000 |
298 | Toyota Ractis 2009 Black | Cars | Ridgeways | Nairobi | Kenyan Used | First owner, No faults, Original parts | Toyota | Black | Ractis | 2009.0 | 120120.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 550000 |
136 rows × 18 columns
Pandas is a highly efficient library on textual data as well. The functions and methods under the str accessor provide flexible ways to filter rows based on strings.
for example, we can select all the vehicles where the title begins with letter 'M'
df[df['title'].str.startswith('M')]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 970000 |
5 | Mitsubishi Delica 2013 White | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults, Unpainted | Mitsubishi | White | Delica | 2013.0 | 88000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 630000 |
9 | Mitsubishi Outlander 2015 White | Cars | Lavington | Nairobi | Foreign Used | Unpainted, Original parts, No faults | Mitsubishi | White | Outlander | 2015.0 | 40382.0 | SUV | Petrol | Front Wheel | Automatic | NaN | Yes | 2900000 |
15 | Mazda CX-5 2014 Sport FWD Blue | Cars | Ridgeways | Nairobi | Foreign Used | No faults | Mazda | Blue | CX-5 | 2014.0 | NaN | SUV | Petrol | Front Wheel | Automatic | 5.0 | NaN | 2060000 |
17 | Mitsubishi RVR 2014 Blue | Cars | Mombasa CBD | Mombasa | Foreign Used | Unpainted, Original parts, No faults | Mitsubishi | Blue | RVR | 2014.0 | 142000.0 | NaN | NaN | NaN | Automatic | NaN | No | 1980000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
279 | Mitsubishi Fuso | Trucks & Trailers | Runda | Nairobi | Brand New | Brand New | Mitsubishi | NaN | NaN | 2014.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2100000 |
288 | Mitsubishi RVR 2015 Black | Cars | Mombasa CBD | Mombasa | Brand New | No faults | Mitsubishi | Black | RVR | 2015.0 | 45555.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 2150000 |
293 | Mitsubishi L200 2014 Gold | Cars | Roysambu | Nairobi | Kenyan Used | First owner, Original parts, First registration | Mitsubishi | Gold | L200 | 2014.0 | NaN | NaN | NaN | NaN | Manual | NaN | Yes | 1200000 |
296 | Mitsubishi Hd | Trucks & Trailers | Kisauni | Mombasa | Used | Used | Mitsubishi | NaN | NaN | 2006.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1300000 |
297 | Mitsubishi Fuso Refrigerated | Trucks & Trailers | Kisauni | Mombasa | Used | Used | Mitsubishi | NaN | Canter | 2014.0 | NaN | NaN | Diesel | NaN | NaN | NaN | NaN | 2200000 |
63 rows × 18 columns
The contains
function under the str accessor returns the values that contain a given set of characters.
For instance select all vehicles that contain the word Benz in their title:
df[df['title'].str.contains('Benz')]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
78 | Mercedes-Benz E250 2014 SV Premium Black | Cars | Nairobi Central | Nairobi | Kenyan Used | Unpainted, No faults, Original parts | Mercedes-Benz | Black | E250 | 2014.0 | 100555.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 3200000 |
179 | Mercedes-Benz B-Class 2009 White | Cars | Lavington | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | B-Class | 2009.0 | 154155.0 | NaN | Petrol | NaN | Automatic | NaN | NaN | 699999 |
195 | Mercedes-Benz Actros | Trucks & Trailers | Thome | Nairobi | Used | Used | Mercedes-Benz | NaN | NaN | 2013.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7500000 |
207 | Mercedes-Benz C180 2014 Black | Cars | Kilimani | Nairobi | Foreign Used | No faults | Mercedes-Benz | Black | C180 | 2014.0 | 82000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 2500000 |
210 | Mercedes-Benz C200 2008 White | Cars | Langata | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | C200 | 2008.0 | 132489.0 | NaN | Petrol | NaN | Automatic | NaN | NaN | 1400000 |
219 | Mercedes-Benz C-Class 2011 C 200 (C204) White | Cars | Lavington | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | C-Class | 2011.0 | 76000.0 | Coupe | Petrol | Rear Wheel | Automatic | 5.0 | NaN | 1650000 |
238 | Mercedes-Benz B-Class 2014 Silver | Cars | Mvita | Mombasa | Foreign Used | First owner, No faults | Mercedes-Benz | Silver | B-Class | 2014.0 | 90000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 1650000 |
247 | Mercedes-Benz E250 2014 Black | Cars | Nairobi Central | Nairobi | Foreign Used | No faults | Mercedes-Benz | Black | E250 | 2014.0 | 49000.0 | NaN | NaN | NaN | Automatic | NaN | No | 3900000 |
274 | Mercedes-Benz C200 2015 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Mercedes-Benz | Black | C200 | 2015.0 | 56800.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 3850000 |
~
¶The tilde operator is used for not
logic in filtering. If we add the tilde operator before the filter expression, the rows that do not fit the condition are returned.
for instance select all the vehicles that do not have the word Toyota in their title.
df[~df['title'].str.contains('Toyota')]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 970000 |
2 | Clean NV300 Caravan 2014 Model Dielsel 16 Seater | Buses & Microbuses | Kilimani | Nairobi | Foreign Used | Nissan | Nissan | NaN | Caravan (Urvan) | 2014.0 | 180000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2550000 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 880000 |
5 | Mitsubishi Delica 2013 White | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults, Unpainted | Mitsubishi | White | Delica | 2013.0 | 88000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 630000 |
8 | BMW X4 2015 xDrive35i Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | BMW | Black | X4 | 2015.0 | 63128.0 | NaN | NaN | NaN | Automatic | NaN | No | 5800000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
293 | Mitsubishi L200 2014 Gold | Cars | Roysambu | Nairobi | Kenyan Used | First owner, Original parts, First registration | Mitsubishi | Gold | L200 | 2014.0 | NaN | NaN | NaN | NaN | Manual | NaN | Yes | 1200000 |
294 | Subaru Outback 2014 White | Cars | Mvita | Mombasa | Foreign Used | No faults | Subaru | White | Outback | 2014.0 | 63142.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 2500000 |
296 | Mitsubishi Hd | Trucks & Trailers | Kisauni | Mombasa | Used | Used | Mitsubishi | NaN | NaN | 2006.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1300000 |
297 | Mitsubishi Fuso Refrigerated | Trucks & Trailers | Kisauni | Mombasa | Used | Used | Mitsubishi | NaN | Canter | 2014.0 | NaN | NaN | Diesel | NaN | NaN | NaN | NaN | 2200000 |
299 | Subaru Forester 2015 Matt Black | Cars | Mombasa CBD | Mombasa | Foreign Used | Unpainted, Original parts, First registration | Subaru | Matt Black | Forester | 2015.0 | 86910.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 2450000 |
222 rows × 18 columns
The query
function offers a little more flexibility at writing the conditions for filtering.
We can pass the conditions as a string.
For instance, the following code returns the rows that belong to the Cars category and have a value higher than 5,000,000 in the price column.
df.query("category == 'Cars' and price > 5_000_000")
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
8 | BMW X4 2015 xDrive35i Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | BMW | Black | X4 | 2015.0 | 63128.0 | NaN | NaN | NaN | Automatic | NaN | No | 5800000 |
13 | BMW 520i 2014 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | BMW | Black | 520i | 2014.0 | 80000.0 | NaN | NaN | NaN | Automatic | NaN | No | 5700000 |
22 | Lexus RX 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Lexus | Black | RX | 2016.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 14500000 |
40 | Volkswagen Touareg 2014 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Volkswagen | Black | Touareg | 2014.0 | 56000.0 | NaN | NaN | NaN | Automatic | NaN | No | 5200000 |
53 | Toyota Land Cruiser Prado 2015 2.7 VVT-i Brown | Cars | Mvita | Mombasa | Foreign Used | No faults | Toyota | Brown | Land Cruiser Prado | 2015.0 | 45260.0 | SUV | Petrol | All Wheel | Automatic | 5.0 | No | 6500000 |
73 | Toyota Land Cruiser Prado 2014 2.7 VVT-i Gold | Cars | Mombasa CBD | Mombasa | Foreign Used | First owner, First registration | Toyota | Gold | Land Cruiser Prado | 2014.0 | NaN | SUV | Petrol | All Wheel | Automatic | 5.0 | No | 5600000 |
156 | Toyota Land Cruiser 2010 4.6 V8 ZX Black | Cars | Runda | Nairobi | Foreign Used | No faults | Toyota | Black | Land Cruiser | 2010.0 | NaN | SUV | Petrol | 4x4 | Automatic | 8.0 | No | 8799999 |
177 | Toyota Land Cruiser Prado 2014 Blue | Cars | Nairobi Central | Nairobi | Foreign Used | First owner, Unpainted, Original parts | Toyota | Blue | Land Cruiser Prado | 2014.0 | 42000.0 | NaN | NaN | NaN | Automatic | NaN | No | 6150000 |
220 | Volkswagen Touareg 2015 TDI Executive AWD 4MOT... | Cars | Tudor | Mombasa | Foreign Used | No faults | Volkswagen | Black | Touareg | 2015.0 | 111285.0 | SUV | Diesel | All Wheel | Automatic | 5.0 | NaN | 5500000 |
224 | Toyota Hilux 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Toyota | Black | Hilux | 2016.0 | 36000.0 | NaN | NaN | NaN | Automatic | NaN | No | 9000000 |
241 | BMW X5 2015 White | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | BMW | White | X5 | 2015.0 | 68000.0 | SUV | Petrol | All Wheel | Automatic | NaN | No | 6300000 |
249 | Toyota Land Cruiser 2014 4.6 V8 ZX Black | Cars | Karen | Nairobi | Foreign Used | Unpainted, Original parts, No faults | Toyota | Black | Land Cruiser | 2014.0 | 30000.0 | SUV | Petrol | 4x4 | Automatic | 8.0 | No | 8199999 |
265 | New Hyundai Palisade 2021 White | Cars | Mombasa Road | Nairobi | Brand New | No faults | Hyundai | White | Palisade | 2021.0 | 200.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 9500000 |
276 | Lexus NX 2014 Red | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Lexus | Red | NX | 2014.0 | NaN | NaN | NaN | NaN | Automatic | NaN | No | 5400000 |
nlargest
and nsmallest
¶In some cases, we do not have a specific range for filtering but just need the largest or smallest values.
The nlargest
and nsmallest
functions allow for selecting rows that have the largest or smallest values in a column, respectively.
We specify the number of largest or smallest values to be selected and the name of the column.
Example: lets select 5 vehicles with the highest price
# 5 vehicles with highest price
df.nlargest(5,'price')
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | Lexus RX 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Lexus | Black | RX | 2016.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 14500000 |
148 | Mazda Bongo | Buses & Microbuses | Ridgeways | Nairobi | Foreign Used | Unpainted, No faults | Mazda | White | Bongo | 2014.0 | 127800.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 11200000 |
265 | New Hyundai Palisade 2021 White | Cars | Mombasa Road | Nairobi | Brand New | No faults | Hyundai | White | Palisade | 2021.0 | 200.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 9500000 |
224 | Toyota Hilux 2016 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Toyota | Black | Hilux | 2016.0 | 36000.0 | NaN | NaN | NaN | Automatic | NaN | No | 9000000 |
156 | Toyota Land Cruiser 2010 4.6 V8 ZX Black | Cars | Runda | Nairobi | Foreign Used | No faults | Toyota | Black | Land Cruiser | 2010.0 | NaN | SUV | Petrol | 4x4 | Automatic | 8.0 | No | 8799999 |
# 5 vehicles with the lowest prices
df.nsmallest(5,'price')
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
111 | Mitsubishi Lancer / Cedia 2002 White | Cars | Syokimau | Machakos | Kenyan Used | No faults | Mitsubishi | White | Lancer / Cedia | 2002.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 250000 |
85 | Volkswagen Golf 2012 Blue | Cars | Ridgeways | Nairobi | Kenyan Used | First owner, After crash, Original parts | Volkswagen | Blue | Golf | 2012.0 | 49252.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 260000 |
246 | Mazda Familia 1998 Black | Cars | Ridgeways | Nairobi | Kenyan Used | Mazda | Mazda | Black | Familia | 1998.0 | 202745.0 | NaN | Petrol | NaN | Manual | NaN | NaN | 300000 |
38 | Land Rover Range Rover 1979 Green | Cars | Kisumu Central | Kisumu | Kenyan Used | First owner | Land Rover | Green | Range Rover | 1979.0 | NaN | NaN | NaN | NaN | Manual | NaN | Yes | 400000 |
261 | Single Diff Volve FL10 | Trucks & Trailers | Embakasi | Nairobi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 400000 |
# solution to challenge 1
df.loc[:, ['title','region','brand','price']]
title | region | brand | price | |
---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Mvita | Toyota | 6500000 |
1 | Mazda Demio 2014 Brown | Langata | Mazda | 970000 |
2 | Clean NV300 Caravan 2014 Model Dielsel 16 Seater | Kilimani | Nissan | 2550000 |
3 | Toyota Crown 2014 Pearl | Kilimani | Toyota | 2100000 |
4 | Honda Fit 2014 Black | Mvita | Honda | 880000 |
... | ... | ... | ... | ... |
295 | Toyota Allion 2008 Silver | Ganjoni | Toyota | 860000 |
296 | Mitsubishi Hd | Kisauni | Mitsubishi | 1300000 |
297 | Mitsubishi Fuso Refrigerated | Kisauni | Mitsubishi | 2200000 |
298 | Toyota Ractis 2009 Black | Ridgeways | Toyota | 550000 |
299 | Subaru Forester 2015 Matt Black | Mombasa CBD | Subaru | 2450000 |
300 rows × 4 columns
# solution
# first create a mask using the conditions
mask = (df['brand'] == "Toyota") & (df['region'] == "Kilimani")
# apply the mask on the dataframe to filter only True values
df[mask]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Toyota Crown 2014 Pearl | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Pearl | Crown | 2014.0 | 75000.0 | NaN | NaN | NaN | Automatic | NaN | No | 2100000 |
31 | Toyota Mark X 2014 Silver | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Silver | Mark X | 2014.0 | 53580.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 1980000 |
42 | Toyota Harrier 2014 Black | Cars | Kilimani | Nairobi | Foreign Used | First registration, Original parts, No faults | Toyota | Black | Harrier | 2014.0 | 48592.0 | NaN | NaN | NaN | Automatic | NaN | No | 3450000 |
74 | Toyota Corolla Rumion 2014 Blue | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Blue | Corolla Rumion | 2014.0 | 54000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 1330000 |
126 | Toyota RAV4 2014 SV Premium Black | Cars | Kilimani | Nairobi | Foreign Used | Original parts | Toyota | Black | RAV4 | 2014.0 | NaN | NaN | NaN | NaN | Automatic | NaN | Yes | 2800000 |
262 | Toyota Hiace 2014 White | Buses & Microbuses | Kilimani | Nairobi | Foreign Used | Toyota | Toyota | NaN | HiAce | 2014.0 | 100000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2100000 |
### solution to challenge 3
df[df['model'].str.contains('CX', na=False)]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | Mazda CX-5 2014 Sport FWD Blue | Cars | Ridgeways | Nairobi | Foreign Used | No faults | Mazda | Blue | CX-5 | 2014.0 | NaN | SUV | Petrol | Front Wheel | Automatic | 5.0 | NaN | 2060000 |
24 | Mazda CX-5 2014 Red | Cars | Nairobi Central | Nairobi | Foreign Used | First registration | Mazda | Red | CX-5 | 2014.0 | 85239.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 2250000 |
52 | Mazda CX-5 2016 Red | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Mazda | Red | CX-5 | 2016.0 | 51238.0 | NaN | NaN | NaN | Automatic | NaN | No | 2900000 |
58 | Mazda CX-5 2014 Grand Touring FWD Red | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Mazda | Red | CX-5 | 2014.0 | NaN | SUV | Petrol | NaN | Automatic | 5.0 | No | 2300000 |
94 | Mazda CX-3 2016 Red | Cars | Mombasa CBD | Mombasa | Foreign Used | First registration | Mazda | Red | CX-3 | 2016.0 | 76000.0 | NaN | NaN | NaN | Automatic | NaN | No | 1800000 |
183 | New Mazda CX-5 2014 Grand Touring AWD Blue | Cars | Nairobi Central | Nairobi | Brand New | No faults | Mazda | Blue | CX-5 | 2014.0 | 115000.0 | SUV | Petrol | All Wheel | Automatic | 5.0 | No | 2100000 |
that are of brand Mercedes-Benz
The color should not be Red
Sort the results by price in descending order
## Challenge 4 solution
df[(df['brand']=="Mercedes-Benz") & (df['color'] != 'Red')].sort_values(by='price',ascending=False)
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | Mercedes-Benz Actros | Trucks & Trailers | Thome | Nairobi | Used | Used | Mercedes-Benz | NaN | NaN | 2013.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7500000 |
247 | Mercedes-Benz E250 2014 Black | Cars | Nairobi Central | Nairobi | Foreign Used | No faults | Mercedes-Benz | Black | E250 | 2014.0 | 49000.0 | NaN | NaN | NaN | Automatic | NaN | No | 3900000 |
274 | Mercedes-Benz C200 2015 Black | Cars | Mombasa CBD | Mombasa | Foreign Used | No faults | Mercedes-Benz | Black | C200 | 2015.0 | 56800.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 3850000 |
78 | Mercedes-Benz E250 2014 SV Premium Black | Cars | Nairobi Central | Nairobi | Kenyan Used | Unpainted, No faults, Original parts | Mercedes-Benz | Black | E250 | 2014.0 | 100555.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 3200000 |
207 | Mercedes-Benz C180 2014 Black | Cars | Kilimani | Nairobi | Foreign Used | No faults | Mercedes-Benz | Black | C180 | 2014.0 | 82000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 2500000 |
219 | Mercedes-Benz C-Class 2011 C 200 (C204) White | Cars | Lavington | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | C-Class | 2011.0 | 76000.0 | Coupe | Petrol | Rear Wheel | Automatic | 5.0 | NaN | 1650000 |
238 | Mercedes-Benz B-Class 2014 Silver | Cars | Mvita | Mombasa | Foreign Used | First owner, No faults | Mercedes-Benz | Silver | B-Class | 2014.0 | 90000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 1650000 |
210 | Mercedes-Benz C200 2008 White | Cars | Langata | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | C200 | 2008.0 | 132489.0 | NaN | Petrol | NaN | Automatic | NaN | NaN | 1400000 |
179 | Mercedes-Benz B-Class 2009 White | Cars | Lavington | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | B-Class | 2009.0 | 154155.0 | NaN | Petrol | NaN | Automatic | NaN | NaN | 699999 |
that are of brand Mercedes-Benz or BMW
That mention the colors 'Black' or 'White' in the title
Are in either of the following regions: Kilimani, Lavington, Langata or Westlands
Sort them in alphabetical order by brand, then by price with most expensive on top
#### Challenge 5 solution
df[
((df['brand'] == 'Mercedes-Benz') | (df['brand'] == 'BMW')) &
(df['title'].str.contains('Black') | (df['title'].str.contains('White'))) &
(df['region'].isin(['Kilimani','Lavington','Langata','Westlands']))
].sort_values(by=['brand','price'],ascending=[True,False])
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
282 | BMW 7 Series 2013 White | Cars | Westlands | Nairobi | Foreign Used | No faults | BMW | White | 7 Series | 2013.0 | 40000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 3800000 |
207 | Mercedes-Benz C180 2014 Black | Cars | Kilimani | Nairobi | Foreign Used | No faults | Mercedes-Benz | Black | C180 | 2014.0 | 82000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 2500000 |
219 | Mercedes-Benz C-Class 2011 C 200 (C204) White | Cars | Lavington | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | C-Class | 2011.0 | 76000.0 | Coupe | Petrol | Rear Wheel | Automatic | 5.0 | NaN | 1650000 |
210 | Mercedes-Benz C200 2008 White | Cars | Langata | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | C200 | 2008.0 | 132489.0 | NaN | Petrol | NaN | Automatic | NaN | NaN | 1400000 |
179 | Mercedes-Benz B-Class 2009 White | Cars | Lavington | Nairobi | Kenyan Used | Mercedes-Benz | Mercedes-Benz | White | B-Class | 2009.0 | 154155.0 | NaN | Petrol | NaN | Automatic | NaN | NaN | 699999 |
that are of brand Toyota or Mazda or Subaru
That mention the colors 'Black' or 'White' in the title
Are in either of the following regions: Kilimani, Lavington, Langata, Westlands, Ridgeways
Sort them in alphabetical order by brand, then by price with most expensive on top
### Challenge 6 Solution
# your solution here
#
Data cleaning means fixing bad data in your data set.
Bad data could be:
Empty cells
Data in wrong format
Wrong data
Duplicates
Earlier we say the command to see the number of missing values per column:
df.isnull().sum()
title 0 category 0 region 0 parent_region 6 condition 5 attrs 1 brand 1 color 56 model 17 yom 3 mileage 91 body_type 258 fuel 218 drive_train 259 trans 51 seat 269 registered 125 price 0 dtype: int64
Empty cells can potentially give you a wrong result when you analyze data.
One way of dealing with empty cells is to insert a new value instead.
Since only 3 rows are missing the values for yom, here's how we'll handle the missing values:
Fill the yom empty cell for Isuzu Truck with the modal class (most frequent) for vehicles with category Trucks & Trailers and brand Isuzu
Fill in the yom empty cell for Subaru Outback 2014 black with the modal class for vehicles with category Cars and model Outback
Since most of the entries are missing for Single Diff Volve we remove the row. Drop/discard the record.
df[df['yom'].isnull()]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
109 | Isuzu Truck . | Trucks & Trailers | Donholm | Nairobi | NaN | Isuzu | Isuzu | NaN | N Series | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 810000 |
201 | Subaru Outback 2014 black | Cars | Nairobi | NaN | NaN | Outback | Subaru | Black | Outback | NaN | 180097.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 1649999 |
261 | Single Diff Volve FL10 | Trucks & Trailers | Embakasi | Nairobi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 400000 |
# Isuzu trucks modal yom
# filter
df[(df['category']=='Trucks & Trailers')&(df['brand']=='Isuzu')].head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | Very Clean Isuzu FRR Truck 2015 Model | Trucks & Trailers | Thome | Nairobi | Used | Used | Isuzu | White | F SERIES | 2015.0 | NaN | NaN | Diesel | NaN | Manual | NaN | NaN | 3520000 |
36 | Isuzu Elf,Year 2014manual Transmission | Trucks & Trailers | Mombasa CBD | Mombasa | Brand New | Brand New | Isuzu | NaN | NaN | 2013.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2150000 |
41 | Isuzu Nkr. Yr 2019 | Trucks & Trailers | Ridgeways | Nairobi | Used | Used | Isuzu | White | N Series | 2019.0 | NaN | NaN | Diesel | NaN | Manual | NaN | NaN | 3050000 |
46 | Isuzu Frr Kbw | Trucks & Trailers | Eldoret CBD | Uasin Gishu | Used | Used | Isuzu | White | F SERIES | 2013.0 | NaN | NaN | Diesel | NaN | Manual | NaN | NaN | 2600000 |
62 | Isuzu Elf,Year 2015 Manual | Trucks & Trailers | Mombasa CBD | Mombasa | Brand New | Brand New | Isuzu | NaN | NaN | 2014.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3650000 |
df[(df['category']=='Trucks & Trailers')&(df['brand']=='Isuzu')]['yom'].value_counts()
2015.0 4 2013.0 3 2012.0 2 2019.0 1 2014.0 1 2006.0 1 2007.0 1 1994.0 1 1990.0 1 1999.0 1 2016.0 1 2010.0 1 Name: yom, dtype: int64
You can fill the yom value for Isuzu Truck with the 2014 which is the modal class for Isuzu Trucks. We know that the Isuzu Truck is on index 109 of our DataFrame, we will use the .at
attribute to index the row.
df.at[109,'yom'] = 2014
df.at[109,'yom']
2014.0
Repeat the process for the Subaru
df[(df['category']=='Cars')&(df['model']=='Outback')]['yom'].value_counts().index.tolist()
[2015.0, 2013.0, 2014.0]
df.at[201,'yom']
nan
df.at[201,'yom'] = 2015
To drop a row from the DataFrame, simply use df.drop()
then you specify the index and axis.
We know that the entry is on row 261.
the
df.dropna(inplace = True)
will NOT return a new DataFrame, but it will remove all rows containg NULL values from the original DataFrame.
df.drop(261,inplace=True)
df.isnull().sum()
title 0 category 0 region 0 parent_region 6 condition 4 attrs 0 brand 0 color 55 model 16 yom 0 mileage 90 body_type 257 fuel 217 drive_train 258 trans 50 seat 268 registered 124 price 0 dtype: int64
Now you can see that the column yom has 0 null values
The fillna()
method allows us to replace empty cells with a value.
To only replace empty values for one column, specify the column name for the DataFrame
Let's fill in the Missing values for color of all missing rows with Not Specified
df['color'].value_counts()
White 68 Black 57 Silver 27 Blue 25 Gray 16 Red 11 Brown 10 Pink 5 Purple 5 Green 4 Pearl 3 Gold 3 Other 3 Beige 2 Off white 2 Grey 1 Burgandy 1 Matt Black 1 Name: color, dtype: int64
df['color'].fillna('Not Specified', inplace=True)
df['color'].value_counts()
White 68 Black 57 Not Specified 55 Silver 27 Blue 25 Gray 16 Red 11 Brown 10 Purple 5 Pink 5 Green 4 Gold 3 Pearl 3 Other 3 Beige 2 Off white 2 Grey 1 Burgandy 1 Matt Black 1 Name: color, dtype: int64
mean
, median
or mode
¶A common way to replace empty cells, is to calculate the mean, median or mode value of the column.
Pandas uses the mean()
, median()
and mode()
methods to calculate the respective values for a specified column
df[df['condition'].isnull()]
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
109 | Isuzu Truck . | Trucks & Trailers | Donholm | Nairobi | NaN | Isuzu | Isuzu | Not Specified | N Series | 2014.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 810000 |
161 | Tipper In Exellent Condition. | Trucks & Trailers | Mombasa CBD | Mombasa | NaN | Isuzu | Isuzu | Not Specified | NaN | 2012.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1850000 |
201 | Subaru Outback 2014 black | Cars | Nairobi | Nairobi | NaN | Outback | Subaru | Black | Outback | 2015.0 | 180097.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 1649999 |
202 | Mazda Bongo 2013 | Trucks & Trailers | Ngong | Kajiado | NaN | Mazda | Mazda | Not Specified | NaN | 2013.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1650000 |
df['condition'].value_counts()
Foreign Used 170 Kenyan Used 65 Used 35 Brand New 25 Name: condition, dtype: int64
# the modal condition for vehicles under the Trucks & Trailers category
trucks_mode = df[df['category'] == "Trucks & Trailers"]['condition'].mode()
trucks_mode
0 Used Name: condition, dtype: object
df['condition'].replace(np.nan,'Used',inplace=True)
So we choose to fill the missing values for condition with Used
The other way to deal with rows with missing values is to drop them using a criteria.
Suppose we want to drop all the vehicles where all the following details are missing:
we need to specify these columns in as a subset
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
The how
argument determines if row or column is removed from DataFrame, when we have at least one NA or all NA.
any : If any NA values are present, drop that row or column.
all : If all values are NA, drop that row or column.
# drop when any columns in cols list are missing
cols=['model','mileage','body_type',
'fuel','drive_train','trans',
'seat','registered']
df.dropna(axis='index', how='any', subset=cols).head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | Toyota Sienta 2014 1.5 AWD Gray | Cars | Ganjoni | Mombasa | Foreign Used | No faults | Toyota | Gray | Sienta | 2014.0 | 51000.0 | Minivan | Petrol | All Wheel | Automatic | 7.0 | Yes | 1200000 |
11 | Lexus NX 2016 200t FWD Beige | Cars | Kilimani | Nairobi | Foreign Used | No faults, Unpainted, Original parts | Lexus | Beige | NX | 2016.0 | 35000.0 | SUV | Petrol | Front Wheel | Automatic | 5.0 | No | 4750000 |
18 | Toyota Harrier 2010 2.4 Black | Cars | Ridgeways | Nairobi | Kenyan Used | First registration, First owner | Toyota | Black | Harrier | 2010.0 | 110643.0 | SUV | Petrol | Front Wheel | Automatic | 5.0 | Yes | 1850000 |
19 | Volkswagen Touareg 2006 3.0 V6 TDi Automatic B... | Cars | Lavington | Nairobi | Kenyan Used | Original parts | Volkswagen | Black | Touareg | 2006.0 | 180000.0 | SUV | Diesel | All Wheel | Automatic | 5.0 | Yes | 1850000 |
53 | Toyota Land Cruiser Prado 2015 2.7 VVT-i Brown | Cars | Mvita | Mombasa | Foreign Used | No faults | Toyota | Brown | Land Cruiser Prado | 2015.0 | 45260.0 | SUV | Petrol | All Wheel | Automatic | 5.0 | No | 6500000 |
# drop only when any 5 columns in cols list are missing
cols=['model','mileage','body_type',
'fuel','drive_train','trans',
'seat','registered']
df.dropna(axis='index', how='any', subset=cols,thresh=5).head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | Toyota Sienta 2014 1.5 AWD Gray | Cars | Ganjoni | Mombasa | Foreign Used | No faults | Toyota | Gray | Sienta | 2014.0 | 51000.0 | Minivan | Petrol | All Wheel | Automatic | 7.0 | Yes | 1200000 |
9 | Mitsubishi Outlander 2015 White | Cars | Lavington | Nairobi | Foreign Used | Unpainted, Original parts, No faults | Mitsubishi | White | Outlander | 2015.0 | 40382.0 | SUV | Petrol | Front Wheel | Automatic | NaN | Yes | 2900000 |
10 | Toyota Ractis 2012 1.3 FWD Silver | Cars | Lavington | Nairobi | Kenyan Used | Toyota | Toyota | Silver | Ractis | 2012.0 | 167704.0 | Hatchback | Petrol | Front Wheel | Automatic | 5.0 | NaN | 800000 |
11 | Lexus NX 2016 200t FWD Beige | Cars | Kilimani | Nairobi | Foreign Used | No faults, Unpainted, Original parts | Lexus | Beige | NX | 2016.0 | 35000.0 | SUV | Petrol | Front Wheel | Automatic | 5.0 | No | 4750000 |
15 | Mazda CX-5 2014 Sport FWD Blue | Cars | Ridgeways | Nairobi | Foreign Used | No faults | Mazda | Blue | CX-5 | 2014.0 | NaN | SUV | Petrol | Front Wheel | Automatic | 5.0 | NaN | 2060000 |
# drop only when all columns in cols list are missing
cols=['model','mileage','body_type',
'fuel','drive_train','trans',
'seat','registered']
df.dropna(axis='index', how='all', subset=cols).head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | body_type | fuel | drive_train | trans | seat | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | NaN | NaN | NaN | Automatic | NaN | NaN | 6500000 |
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 970000 |
2 | Clean NV300 Caravan 2014 Model Dielsel 16 Seater | Buses & Microbuses | Kilimani | Nairobi | Foreign Used | Nissan | Nissan | Not Specified | Caravan (Urvan) | 2014.0 | 180000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2550000 |
3 | Toyota Crown 2014 Pearl | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Pearl | Crown | 2014.0 | 75000.0 | NaN | NaN | NaN | Automatic | NaN | No | 2100000 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | NaN | NaN | NaN | Automatic | NaN | Yes | 880000 |
inplace=True
¶# drop only when all columns in cols list are missing
cols=['model','mileage','body_type',
'fuel','drive_train','trans',
'seat','registered']
df.dropna(axis='index', how='all', subset=cols,inplace=True)
df.isnull().sum()
title 0 category 0 region 0 parent_region 0 condition 0 attrs 0 brand 0 color 0 model 1 yom 0 mileage 75 body_type 242 fuel 202 drive_train 243 trans 35 seat 253 registered 109 price 0 dtype: int64
Keep only the columns with at least 60% non-NA values.
threshold = int(0.6* len(df))
threshold
170
cols=['model','body_type',
'fuel','drive_train','trans',
'seat','registered']
df.dropna(axis='columns',thresh=threshold,inplace=True)
df.isnull().sum()
title 0 category 0 region 0 parent_region 0 condition 0 attrs 0 brand 0 color 0 model 1 yom 0 mileage 75 trans 35 registered 109 price 0 dtype: int64
df.groupby('condition')['registered'].value_counts()
condition registered Brand New No 6 Yes 6 Foreign Used Yes 80 No 49 Kenyan Used Yes 34 Name: registered, dtype: int64
df[df['condition'] == "Kenyan Used"]['registered'].value_counts()
Yes 65 Name: registered, dtype: int64
For all vehicles that indicate they are Kenyan Used, fill in all missing values for column registered with 'Yes'
reg_df = df[df['condition'] == "Kenyan Used"]['registered'].apply(lambda x: 'Yes' if x is np.nan else x)
df.loc[ df['condition'] == "Kenyan Used", 'registered'] = reg_df
df.groupby('condition')['registered'].value_counts()
condition registered Brand New No 6 Yes 6 Foreign Used Yes 80 No 49 Kenyan Used Yes 65 Name: registered, dtype: int64
df.isnull().sum()
title 0 category 0 region 0 parent_region 0 condition 0 attrs 0 brand 0 color 0 model 1 yom 0 mileage 75 trans 35 registered 78 price 0 dtype: int64
We first drop all rows that are missing the car_mileage
import matplotlib.pyplot as plt
cars_df = df[df['category']=='Cars'].copy()
cars_df.head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | trans | registered | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | Automatic | NaN | 6500000 |
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | Automatic | Yes | 970000 |
3 | Toyota Crown 2014 Pearl | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Pearl | Crown | 2014.0 | 75000.0 | Automatic | No | 2100000 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | Automatic | Yes | 880000 |
5 | Mitsubishi Delica 2013 White | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults, Unpainted | Mitsubishi | White | Delica | 2013.0 | 88000.0 | Automatic | Yes | 630000 |
plt.scatter("yom","price",data=cars_df)
plt.title("Year Against Price")
plt.xlabel("Year of Make")
plt.ylabel("Price in Kshs Millions")
plt.grid()
Let's create a new column age_years
which is the number of years that have pased for a vehicle of category Car since it's year of make.
To come up with this column, we use the column yom to arrive at the number of years from the current year.
We will simply deduct the current year with the entry for each row.
age_years = current_year - yom
# ensure that the yom column is numeric
cars_df.dtypes
title object category object region object parent_region object condition object attrs object brand object color object model object yom float64 mileage float64 trans object registered object price int64 dtype: object
# get the current year
from datetime import date
cur_year = date.today().year
cur_year
2022
cur_year - cars_df['yom']
0 6.0 1 8.0 3 8.0 4 8.0 5 9.0 ... 293 8.0 294 8.0 295 14.0 298 13.0 299 7.0 Name: yom, Length: 221, dtype: float64
# to create a new column simply put the name of the new column withing square brackets
cars_df['age_years'] = cur_year - cars_df['yom']
cars_df.head()
title | category | region | parent_region | condition | attrs | brand | color | model | yom | mileage | trans | registered | price | age_years | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toyota Land Cruiser Prado 2016 Black | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults | Toyota | Black | Land Cruiser Prado | 2016.0 | 87000.0 | Automatic | NaN | 6500000 | 6.0 |
1 | Mazda Demio 2014 Brown | Cars | Langata | Nairobi | Foreign Used | First owner, No faults | Mazda | Brown | Demio | 2014.0 | 92000.0 | Automatic | Yes | 970000 | 8.0 |
3 | Toyota Crown 2014 Pearl | Cars | Kilimani | Nairobi | Foreign Used | No faults | Toyota | Pearl | Crown | 2014.0 | 75000.0 | Automatic | No | 2100000 | 8.0 |
4 | Honda Fit 2014 Black | Cars | Mvita | Mombasa | Foreign Used | No faults | Honda | Black | Fit | 2014.0 | 58000.0 | Automatic | Yes | 880000 | 8.0 |
5 | Mitsubishi Delica 2013 White | Cars | Mvita | Mombasa | Foreign Used | First registration, No faults, Unpainted | Mitsubishi | White | Delica | 2013.0 | 88000.0 | Automatic | Yes | 630000 | 9.0 |
cars_df['age_years'].astype(int)
0 6 1 8 3 8 4 8 5 9 .. 293 8 294 8 295 14 298 13 299 7 Name: age_years, Length: 221, dtype: int32
# casting the dtype from float to int
cars_df['age_years'] = cars_df['age_years'].astype(int)
cars_df['age_years'].mean()
9.027149321266968
age_years
vs price
¶cars_df[['age_years','price']].corr()
age_years | price | |
---|---|---|
age_years | 1.000000 | -0.286765 |
price | -0.286765 | 1.000000 |
# there is a weak negative linear relationship between age_years and price
from matplotlib import pyplot as plt
plt.scatter('age_years', 'price', data=cars_df)
<matplotlib.collections.PathCollection at 0x1a4655b5b80>
cars_df['price_log'] = np.log(cars_df['price'])
cars_df.corr()
yom | mileage | price | age_years | price_log | |
---|---|---|---|---|---|
yom | 1.000000 | -0.290489 | 0.286765 | -1.000000 | 0.430495 |
mileage | -0.290489 | 1.000000 | -0.188173 | 0.290489 | -0.198183 |
price | 0.286765 | -0.188173 | 1.000000 | -0.286765 | 0.879782 |
age_years | -1.000000 | 0.290489 | -0.286765 | 1.000000 | -0.430495 |
price_log | 0.430495 | -0.198183 | 0.879782 | -0.430495 | 1.000000 |
from matplotlib import pyplot as plt
plt.scatter('yom', 'price_log', data=cars_df)
<matplotlib.collections.PathCollection at 0x1a465623400>