Python Simplified

PythonSimplifiedcomLogo

How to query pandas DataFrame with the query() method?

How to query pandas DataFrame

Introduction

If you have been using pandas for data analysis and are still not aware of the pandas query method then you are not alone. The query method is intuitive, clean, and less verbose than the traditional methods of filtering rows from DataFrame. But, for some reason, the pandas query method is still not being used widely despite its simplicity. 

So, in this article, we would like to introduce you to the pandas query method, and syntax with examples so that you can start using it immediately. 

You can access the complete code for this article from GitHub here.

Pandas query method

Pandas’ query method (df.query()) is used to query the DataFrame columns and filter only the required rows from the DataFrame. 

The most commonly used methods to filter the rows from the DataFrame are boolean indexing and positional indexing methods. Refer to the examples below of what I mean by boolean indexing and positional indexing. We will use the tips dataset from seaborn for the examples. 

				
					import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
				
			
how to query dataframe - tips dataset

If we wanted to select only rows with two conditions — non-smoker (smoker=’No’) and Female, then with boolean indexing and positional indexing, we would have written code as below.

				
					# boolean indexing 
df[ (df['smoker']=='No') & (df['sex'] =='Female') ]

# positional indexing using .loc[] and .iloc[]
df.loc[(df['smoker'] == 'No') & (df['sex'] == 'Female')]
				
			

But, the above code can also be rewritten using the query method as below. As you can see despite its being intuitive, clean, and less verbose, the query method is often overlooked by pandas users. In the next section, let’s understand the query method in detail.

				
					df.query('smoker == "No" & sex=="Female"')
				
			

Syntax

The syntax for the query method is as shown below — 

				
					DataFrame.query(expr, inplace=False, **kwargs)
				
			
  • expr — It is the expression/conditions to filter the rows. You need to follow certain rules when writing the expression. We will go through them in the examples section.
  • inplace — When set to True the original DataFrame will be modified otherwise returns the modified DataFrame. 
  • **kwargs — These are additional keyword-only arguments. The default values work fine most of the time. So you can skip this for now. But, if you are interested in knowing all the kwargs refer to the eval method’s arguments because internally query method calls the eval method for evaluating the expr and returning the resulting DataFrame.

Examples

In the previous section, we have already downloaded the tips dataset. Let’s continue using it for the examples below. 

[1]. Select rows based on a single condition

The expression should be enclosed within either single quotes or double quotes. In the below example, we are selecting all non-smokers records. 

				
					df_1 = df.query('smoker == "No"')
df_1.head()
				
			
pandas query method - filter all non smokers

[2]. Select rows based on multiple conditions

You can also use multiple conditions with query method. When using multiple conditions, you can use both and  and the symbol & interchangeably. Similarly, or and the symbol | can be used interchageably.  In the below example, we are filtering all non-smoker females.

				
					df_2 = df.query('smoker == "No" and sex=="Female"')
df_2.head()
				
			
pandas query method - all non smoker females

[3]. Access the variables defined outside the query 

You can refer to the variables defined in the environment by appending @ to the variable. In the below example, we have defined the variable named MAX_TIP and tried to filter the rows where tip > 8 (i.e. MAX_TIP)

				
					MAX_TIP = 8
df_3 = df.query('tip > @MAX_TIP')
df_3.head()
				
			
pandas query method access variable from environment

[4]. Dealing with column names containing spaces, and punctuations. etc.

If the DataFrame columns are not valid Python identifiers then df.query() will raise SyntaxError. This can be handled in 2 ways — (i) modify the column names to valid identifiers or (ii) surround the variables with backticks. 

				
					df = df.rename(columns={'total_bill': 'total bill'})
df.head()
				
			

We have intentionally modified the column name from total_bill to total bill. If we run the query method on the total bill column we run into an error because the column name contains space.  

				
					# This gives syntax error
df_4 = df.query('total bill > 50')

# After surrounded by backticks runs without any errors
df_4 = df.query('`total bill` > 50')
df_4.head()
				
			
pandas query method - columns with special characters

[5] Select rows using the DataFrame index

You can also select/filter rows from DataFrame based on DataFrame index values. In the below example, we are filtering only the rows with index values less than 5. 

				
					df.query('index < 5')
				
			
pandas query method - filter by dataframe index

Note that if the name of the DataFrame index overlaps with a column name, the column name is given precedence. For example, in the below example, we have used modified the index name to day so that it overlaps with one column name. Next, when we use the index name in the query method, the first name is given to the column name. 

				
					df.index.name = 'day'
df_5 = df.query('day == "Sun"')
df_5.head()
				
			
pandas query method - filter by index names

[6] Select rows from multi-index DataFrame

If you are dealing with a multi-index DataFrame, you can easily use the names of the indices very easily inside the query method. In the below example, we are setting day and time columns as indices and then selecting the rows based on the multi-indices. 

				
					df.index.name = ''
df_copy = df.set_index(['day', 'time'])
df_copy.head()
				
			
pandas query method set multi
				
					df_6 = df_copy.query('day == "Fri" & time=="Lunch"')
df_6.head()
				
			
pandas query method - filter by index names from multi index dataframe

[7] Select rows based on multiple values of columns using in and not in operator

The membership operators in and not in can also be used with query method. In the below example, we filter the rows having the values Mon, Tue, and Wed in the column.

				
					df_7 = df.query('day in ["Sat", "Sun"]')
df_7.head()
				
			
pandas query method filter by membership operators

You can also use not in operator in a similar way.

				
					df.query('day not in ["Sat", "Sun"]')
				
			

[8] Select rows with date accessor 

If the DataFrame contains a DateTime column, we can also use date accessor methods such as day, month, year, etc. Refer to the example below.

				
					df = pd.DataFrame(data=pd.date_range(start='1/1/2022', freq='M', periods=5), columns=['Col_1'])
df['Col_2'] = range(5)
df.head()
				
			
pandas query method dummy table for date accessor
				
					df.query('Col_1.dt.month > 2')
df.query('Col_1.dt.day > 30')
				
			
pandas query method - filter by date accessor

Summary

  • The entire expression has to be written inside quotes (either single quotes or double quotes)
  • When using multiple conditions you can use and or & and or or | interchangeably.
  • The variables from the global or local scope can be accessed by appending @ to the variable inside the query method.
  • The query method throws a syntax error if column names are not valid identifiers — so either convert the column names to valid identifiers or surround the column name with backticks.
  • The DataFrame index values can also be used to filter the rows. In the case of multi-index DataFrame objects, the indices names can be used inside the query to filter rows. 
  • The query method supports membership operators — in and not in.
  • If DataFrame contains a DateTime object, then date accessor methods such as day, month, year, etc can be used with the query method.

References

Share on facebook
Share on twitter
Share on linkedin
Share on whatsapp
Share on email
Chetan Ambi

Chetan Ambi

A Software Engineer & Team Lead with over 10+ years of IT experience, a Technical Blogger with a passion for cutting edge technology. Currently working in the field of Python, Machine Learning & Data Science. Chetan Ambi holds a Bachelor of Engineering Degree in Computer Science.
Scroll to Top