Anti-Join Pandas

Multi tool use


Anti-Join Pandas
I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are unique in table A and B however in some cases a Key will occur in both table A and B).
I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.
I am familiar with R and this is the code I would use to do this in R.
library("dplyr")
## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")
## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)
How would I achieve this in python?
By key i mean a column contained in both tables with values to merge on.
– Ayelavan
Jul 22 '16 at 4:19
I've added an answer. Please let me know what you think via feedback, and if it solves your exercise. Thank you!
– Jossie Calderon
Jul 22 '16 at 7:31
What is the expected input and desired output?
– tommy.carstensen
May 26 '17 at 11:26
4 Answers
4
Consider the following dataframes
TableA = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('abcd'), name='Key'),
['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('aecf'), name='Key'),
['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)
rows =
for i, row in TableB.iterrows():
if row.Key not in TableA.Key.values:
rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T
4 rows with 2 overlap
Method 1 is much quicker
10,000 rows 5,000 overlap
loops are bad
Thanks. Just had a look at the documentation for combine_first and it seems to be kind of what I am looking for. However, how do I specify the column to combine on (key)? For each row in Table B I want it to check the value in the "Key" column and if that value is contained in Table A's "Key" column I want to ignore that row in B. Where there is a key value in table B that is not in Table A I want to append that row from Table B.
– Ayelavan
Jul 22 '16 at 4:38
@Ayelavan after thinking about it, I cannot recommend
combine_first
. The problem being that if a key exist in A but some of the columns are null, those nulls might be filled in by B's values with same key. the answer to the question you posed in comments is that combine_first
expects the key to be in the index.– piRSquared
Jul 22 '16 at 7:28
combine_first
combine_first
Thats exactly what I was looking for. Thank you very much
– Ayelavan
Jul 22 '16 at 9:55
I had the same problem. This answer using how='outer'
and indicator=True
of merge inspired me to come up with this solution:
how='outer'
indicator=True
import pandas as pd
import numpy as np
TableA = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('abcd'), name='Key'),
['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('aecf'), name='Key'),
['A', 'B', 'C']).reset_index()
print('TableA', TableA, sep='n')
print('TableB', TableB, sep='n')
TableB_only = pd.merge(
TableA, TableB,
how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
'_merge == "right_only"')
print('TableB_only', TableB_only, sep='n')
Table_concatenated = pd.concat((TableA, TableB_only), join='inner')
print('Table_concatenated', Table_concatenated, sep='n')
Which prints this output:
TableA
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
TableB
Key A B C
0 a 0.754538 0.692902 0.537704
1 e 0.499092 0.864145 0.004559
2 c 0.082087 0.682573 0.421654
3 f 0.768914 0.281617 0.924693
TableB_only
Key A_foo B_foo C_foo A B C _merge
4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only
5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only
Table_concatenated
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
4 e 0.499092 0.864145 0.004559
5 f 0.768914 0.281617 0.924693
You'll have both tables TableA
and TableB
such that both DataFrame
objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.
TableA
TableB
DataFrame
Then, we want to merge the rows in TableA
with the rows in TableB
that don't match any in TableA
for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA
with the other sB
if sB
's values don't match sA
's. The following code solves this exercise:
TableA
TableB
TableA
sA
sB
sB
sA
import pandas as pd
TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])
removeTheseIndexes =
keyColumnA = TableA.iloc[:,1] # your 'Key' column here
keyColumnB = TableB.iloc[:,1] # same
for i in range(0, len(keyColumnA)):
firstValue = keyColumnA[i]
for j in range(0, len(keyColumnB)):
copycat = keyColumnB[j]
if firstValue == copycat:
removeTheseIndexes.append(j)
TableB.drop(removeTheseIndexes, inplace = True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)
Note this affects TableB
's data as well. You can use inplace=False
and re-assign it to a newTable
, then TableA.append(newTable)
alternatively.
TableB
inplace=False
newTable
TableA.append(newTable)
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
# Table B
0 1 2
0 1 3 4
1 5 7 8
2 9 10 0
# Set 'Key' column = 1
# Run the script after the loop
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
3 5 7 8
4 9 10 0
# Table B
0 1 2
1 5 7 8
2 9 10 0
Works a treat. Thank you.
– Ayelavan
Jul 22 '16 at 11:19
Easiest answer imaginable:
import numpy as np
tableB = pd.concat([tableB, pd.Series(1)], axis=1)
mergedTable = tableA.merge(tableC, how="left" on="key")
answer = mergedTable[mergedTable.iloc[:,-1] == np.nan]
Should be the fastest proposed as well.
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
By key do you mean row index, column index, or cell?
– Jossie Calderon
Jul 22 '16 at 1:47