Introduction
If you have been using pandas for data analysis and are still not aware of the pandas query method then you are not alone. The query method is intuitive, clean, and less verbose than the traditional methods of filtering rows from DataFrame. But, for some reason, the pandas query method is still not being used widely despite its simplicity.
So, in this article, we would like to introduce you to the pandas query method, and syntax with examples so that you can start using it immediately.
You can access the complete code for this article from GitHub here.
Pandas query method
Pandas’ query method (df.query()
) is used to query the DataFrame columns and filter only the required rows from the DataFrame.
The most commonly used methods to filter the rows from the DataFrame are boolean indexing and positional indexing methods. Refer to the examples below of what I mean by boolean indexing and positional indexing. We will use the tips dataset from seaborn for the examples.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
If we wanted to select only rows with two conditions — non-smoker (smoker=’No’) and Female, then with boolean indexing and positional indexing, we would have written code as below.
# boolean indexing
df[ (df['smoker']=='No') & (df['sex'] =='Female') ]
# positional indexing using .loc[] and .iloc[]
df.loc[(df['smoker'] == 'No') & (df['sex'] == 'Female')]
But, the above code can also be rewritten using the query method as below. As you can see despite its being intuitive, clean, and less verbose, the query method is often overlooked by pandas users. In the next section, let’s understand the query method in detail.
df.query('smoker == "No" & sex=="Female"')
Syntax
The syntax for the query method is as shown below —
DataFrame.query(expr, inplace=False, **kwargs)
- expr — It is the expression/conditions to filter the rows. You need to follow certain rules when writing the expression. We will go through them in the examples section.
- inplace — When set to True the original DataFrame will be modified otherwise returns the modified DataFrame.
- **kwargs — These are additional keyword-only arguments. The default values work fine most of the time. So you can skip this for now. But, if you are interested in knowing all the kwargs refer to the eval method’s arguments because internally query method calls the eval method for evaluating the expr and returning the resulting DataFrame.
Examples
In the previous section, we have already downloaded the tips dataset. Let’s continue using it for the examples below.
[1]. Select rows based on a single condition
The expression should be enclosed within either single quotes or double quotes. In the below example, we are selecting all non-smokers records.
df_1 = df.query('smoker == "No"')
df_1.head()
[2]. Select rows based on multiple conditions
You can also use multiple conditions with query method. When using multiple conditions, you can use both and and the symbol & interchangeably. Similarly, or and the symbol | can be used interchageably. In the below example, we are filtering all non-smoker females.
df_2 = df.query('smoker == "No" and sex=="Female"')
df_2.head()
[3]. Access the variables defined outside the query
You can refer to the variables defined in the environment by appending @ to the variable. In the below example, we have defined the variable named MAX_TIP and tried to filter the rows where tip > 8 (i.e. MAX_TIP)
MAX_TIP = 8
df_3 = df.query('tip > @MAX_TIP')
df_3.head()
[4]. Dealing with column names containing spaces, and punctuations. etc.
If the DataFrame columns are not valid Python identifiers then df.query() will raise SyntaxError. This can be handled in 2 ways — (i) modify the column names to valid identifiers or (ii) surround the variables with backticks.
df = df.rename(columns={'total_bill': 'total bill'})
df.head()
We have intentionally modified the column name from total_bill to total bill. If we run the query method on the total bill column we run into an error because the column name contains space.
# This gives syntax error
df_4 = df.query('total bill > 50')
# After surrounded by backticks runs without any errors
df_4 = df.query('`total bill` > 50')
df_4.head()
[5] Select rows using the DataFrame index
You can also select/filter rows from DataFrame based on DataFrame index values. In the below example, we are filtering only the rows with index values less than 5.
df.query('index < 5')
Note that if the name of the DataFrame index overlaps with a column name, the column name is given precedence. For example, in the below example, we have used modified the index name to day so that it overlaps with one column name. Next, when we use the index name in the query method, the first name is given to the column name.
df.index.name = 'day'
df_5 = df.query('day == "Sun"')
df_5.head()
[6] Select rows from multi-index DataFrame
If you are dealing with a multi-index DataFrame, you can easily use the names of the indices very easily inside the query method. In the below example, we are setting day
and time
columns as indices and then selecting the rows based on the multi-indices.
df.index.name = ''
df_copy = df.set_index(['day', 'time'])
df_copy.head()
df_6 = df_copy.query('day == "Fri" & time=="Lunch"')
df_6.head()
[7] Select rows based on multiple values of columns using in and not in operator
The membership operators in and not in can also be used with query method. In the below example, we filter the rows having the values Mon, Tue, and Wed in the column.
df_7 = df.query('day in ["Sat", "Sun"]')
df_7.head()
You can also use not in operator in a similar way.
df.query('day not in ["Sat", "Sun"]')
[8] Select rows with date accessor
If the DataFrame contains a DateTime column, we can also use date accessor methods such as day, month, year, etc. Refer to the example below.
df = pd.DataFrame(data=pd.date_range(start='1/1/2022', freq='M', periods=5), columns=['Col_1'])
df['Col_2'] = range(5)
df.head()
df.query('Col_1.dt.month > 2')
df.query('Col_1.dt.day > 30')
Summary
- The entire expression has to be written inside quotes (either single quotes or double quotes)
- When using multiple conditions you can use
and
or&
andor
or|
interchangeably. - The variables from the global or local scope can be accessed by appending @ to the variable inside the query method.
- The query method throws a syntax error if column names are not valid identifiers — so either convert the column names to valid identifiers or surround the column name with backticks.
- The DataFrame index values can also be used to filter the rows. In the case of multi-index DataFrame objects, the indices names can be used inside the query to filter rows.
- The query method supports membership operators —
in
andnot in
. - If DataFrame contains a DateTime object, then date accessor methods such as day, month, year, etc can be used with the query method.