Dropping Duplicate Rows across Multiple Columns in Python Pandas
The pandas drop_duplicates method removes duplicated rows from a DataFrame, making it an invaluable tool for data cleansing. By default it compares entire rows, but you can also specify which columns to check for uniqueness.
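As a quick illustration of the default behaviour, here is a minimal sketch (the small DataFrame is made up for demonstration): with no arguments, drop_duplicates compares whole rows and keeps the first occurrence of each duplicate.

import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 1, 2]})

# The second ("x", 1) row is a full duplicate of the first, so it is dropped;
# the first occurrence is kept by default.
print(df.drop_duplicates())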
For instance, consider the following DataFrame:
     A  B  C
0  foo  0  A
1  foo  1  A
2  foo  1  B
3  bar  1  A
Suppose you want to remove rows that have identical values in columns 'A' and 'C'. In this case, rows 0 and 1, which both contain 'foo' in column 'A' and 'A' in column 'C', would be eliminated.
Previously, this task required manual filtering or more involved operations. With the enhanced drop_duplicates method, however, it is straightforward: the keep parameter ('first', 'last', or False) controls which duplicates, if any, are retained.
To drop rows that match on specific columns, pass those columns to the subset parameter. Setting keep to False tells pandas to drop every row in a duplicate group rather than keeping one occurrence:
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar"],
    "B": [0, 1, 1, 1],
    "C": ["A", "A", "B", "A"],
})

# Drop every row whose ('A', 'C') combination appears more than once
df.drop_duplicates(subset=['A', 'C'], keep=False)
Output:
     A  B  C
2  foo  1  B
3  bar  1  A
As you can see, rows 0 and 1 are successfully removed, leaving only the rows that are unique based on the values in columns 'A' and 'C.'
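If you would rather keep one row from each duplicate group instead of discarding them all, leave keep at its default of 'first' (or use 'last' to retain the final occurrence). A short sketch using the same DataFrame:

# Keep the first occurrence of each ('A', 'C') combination;
# here only row 1 is dropped, leaving rows 0, 2 and 3.
df.drop_duplicates(subset=['A', 'C'], keep='first')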