
Selecting and Filtering in Pandas

Updated: Jun 15, 2022



Introduction


Your dataset has too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

To show you the techniques, we'll start by picking a few variables using our intuition. Later tutorials will show you statistical techniques to automatically prioritize variables.

Before we can choose variables (columns), it helps to see a list of all the columns in the dataset. That is done with the columns attribute of the DataFrame (the last line of the code below).

In [1]:

import pandas as pd

melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
print(melbourne_data.columns)
Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
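If you want to try this without the Melbourne file, here is a minimal sketch on a toy DataFrame. The name toy_data and the suburb values are made up for illustration; only the columns attribute and its tolist() method are real pandas features.

```python
import pandas as pd

# Toy stand-in for the housing data; names and values are illustrative only.
toy_data = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond"],
    "Rooms": [2, 3],
    "Price": [1480000.0, 1035000.0],
})

# .columns is a pandas Index; tolist() converts it to a plain Python list,
# which is often easier to scan or loop over.
column_names = toy_data.columns.tolist()
print(column_names)
```

A plain list is handy when you want to copy a few names into your own selection later.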

There are many ways to select a subset of your data. We'll start with two main approaches: selecting a single column and selecting multiple columns.


Selecting a Single Column


You can pull out any variable (or column) with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data. Here's an example:


In [2]:

melbourne_price_data = melbourne_data.Price
# head() returns the first five rows by default.
print(melbourne_price_data.head())
0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64
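Dot-notation is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. Bracket notation works for any name. The sketch below uses a toy DataFrame; the column "Land size" (with a space) is a hypothetical name chosen to show the difference, not a column in the Melbourne data.

```python
import pandas as pd

# Toy data; "Land size" is a hypothetical column name containing a space.
toy_data = pd.DataFrame({
    "Price": [1480000.0, 1035000.0, 1465000.0],
    "Land size": [202.0, 156.0, 134.0],
})

# Both forms return the same Series for a simple column name.
price_by_dot = toy_data.Price
price_by_bracket = toy_data["Price"]

# A name with a space can only be reached with brackets.
land_size = toy_data["Land size"]

print(type(price_by_dot).__name__)
print(price_by_dot.equals(price_by_bracket))
```

If in doubt, brackets are the safer habit, since they behave the same for every column name.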

Selecting Multiple Columns

You can select multiple columns from a DataFrame by providing a list of column names inside brackets. Remember, each item in that list should be a string (with quotes).

In [3]:

columns_of_interest = ['Landsize', 'BuildingArea']
two_columns_of_data = melbourne_data[columns_of_interest]
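One detail worth checking for yourself: passing a list of names returns a DataFrame, even when the list holds a single name, while passing a bare string returns a Series. The following is a small sketch on made-up data; toy_data and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data with the same column names as the example above; values are made up.
toy_data = pd.DataFrame({
    "Landsize": [202.0, 156.0, 134.0],
    "BuildingArea": [None, 79.0, 150.0],
    "Price": [1480000.0, 1035000.0, 1465000.0],
})

as_series = toy_data["Landsize"]        # string -> Series
as_frame = toy_data[["Landsize"]]       # one-item list -> DataFrame
two_columns = toy_data[["Landsize", "BuildingArea"]]

print(type(as_series).__name__, type(as_frame).__name__)
print(two_columns.shape)
```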

We can verify that we got the columns we need with the describe method, which summarizes each numeric column.

In [4]:

two_columns_of_data.describe()
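To see what describe reports without loading the full dataset, here is a sketch on a toy DataFrame with one missing value. The data is made up. The point is that the summary has one column per selected column, and that count only includes non-missing entries.

```python
import pandas as pd

# Toy data; BuildingArea has one missing value on purpose.
toy_columns = pd.DataFrame({
    "Landsize": [202.0, 156.0, 134.0],
    "BuildingArea": [None, 79.0, 150.0],
})

summary = toy_columns.describe()

# The summary's columns match the columns we selected.
print(summary.columns.tolist())
# count skips missing values, so the two columns can differ here.
print(summary.loc["count"])
```

A count lower than the number of rows is a quick hint that a column has missing data.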






©2019 by  NGiannakoulis 
