Anti-Join Pandas

Multi tool use
Multi tool use
The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP


Anti-Join Pandas



I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are unique in table A and B however in some cases a Key will occur in both table A and B).



I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.



I am familiar with R and this is the code I would use to do this in R.


library("dplyr")

## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")

## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)



How would I achieve this in python?





By key do you mean row index, column index, or cell?
– Jossie Calderon
Jul 22 '16 at 1:47





By key i mean a column contained in both tables with values to merge on.
– Ayelavan
Jul 22 '16 at 4:19





I've added an answer. Please let me know what you think via feedback, and if it solves your exercise. Thank you!
– Jossie Calderon
Jul 22 '16 at 7:31





What is the expected input and desired output?
– tommy.carstensen
May 26 '17 at 11:26




4 Answers
4



Consider the following dataframes


TableA = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('abcd'), name='Key'),
['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('aecf'), name='Key'),
['A', 'B', 'C']).reset_index()


TableA



enter image description here


TableB



enter image description here



This is one way to do what you want


# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)



enter image description here


rows =
for i, row in TableB.iterrows():
if row.Key not in TableA.Key.values:
rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T



4 rows with 2 overlap



Method 1 is much quicker



enter image description here



10,000 rows 5,000 overlap



loops are bad



enter image description here





Thanks. Just had a look at the documentation for combine_first and it seems to be kind of what I am looking for. However, how do I specify the column to combine on (key)? For each row in Table B I want it to check the value in the "Key" column and if that value is contained in Table A's "Key" column I want to ignore that row in B. Where there is a key value in table B that is not in Table A I want to append that row from Table B.
– Ayelavan
Jul 22 '16 at 4:38





@Ayelavan after thinking about it, I cannot recommend combine_first. The problem being that if a key exist in A but some of the columns are null, those nulls might be filled in by B's values with same key. the answer to the question you posed in comments is that combine_first expects the key to be in the index.
– piRSquared
Jul 22 '16 at 7:28


combine_first


combine_first





Thats exactly what I was looking for. Thank you very much
– Ayelavan
Jul 22 '16 at 9:55



I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:


how='outer'


indicator=True


import pandas as pd
import numpy as np

TableA = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('abcd'), name='Key'),
['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('aecf'), name='Key'),
['A', 'B', 'C']).reset_index()

print('TableA', TableA, sep='n')
print('TableB', TableB, sep='n')

TableB_only = pd.merge(
TableA, TableB,
how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
'_merge == "right_only"')

print('TableB_only', TableB_only, sep='n')

Table_concatenated = pd.concat((TableA, TableB_only), join='inner')

print('Table_concatenated', Table_concatenated, sep='n')



Which prints this output:


TableA
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
TableB
Key A B C
0 a 0.754538 0.692902 0.537704
1 e 0.499092 0.864145 0.004559
2 c 0.082087 0.682573 0.421654
3 f 0.768914 0.281617 0.924693
TableB_only
Key A_foo B_foo C_foo A B C _merge
4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only
5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only
Table_concatenated
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
4 e 0.499092 0.864145 0.004559
5 f 0.768914 0.281617 0.924693



You'll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.


TableA


TableB


DataFrame



Then, we want to merge the rows in TableA with the rows in TableB that don't match any in TableA for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other sB if sB's values don't match sA's. The following code solves this exercise:


TableA


TableB


TableA


sA


sB


sB


sA


import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes =
keyColumnA = TableA.iloc[:,1] # your 'Key' column here
keyColumnB = TableB.iloc[:,1] # same

for i in range(0, len(keyColumnA)):
firstValue = keyColumnA[i]
for j in range(0, len(keyColumnB)):
copycat = keyColumnB[j]
if firstValue == copycat:
removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace = True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)



Note this affects TableB's data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.


TableB


inplace=False


newTable


TableA.append(newTable)


# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10

# Table B
0 1 2
0 1 3 4
1 5 7 8
2 9 10 0

# Set 'Key' column = 1
# Run the script after the loop

# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
3 5 7 8
4 9 10 0

# Table B
0 1 2
1 5 7 8
2 9 10 0





Works a treat. Thank you.
– Ayelavan
Jul 22 '16 at 11:19



Easiest answer imaginable:


import numpy as np

tableB = pd.concat([tableB, pd.Series(1)], axis=1)
mergedTable = tableA.merge(tableC, how="left" on="key")

answer = mergedTable[mergedTable.iloc[:,-1] == np.nan]



Should be the fastest proposed as well.






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Tggjj3tzgpKqw3Lci4LMxp,U4JWc jgkXyuzn5vF,FADs,OxxUla
3p KR8v6uUxoBz,rI 8m88,xkc6n14kwfJHzqqwP2iW N1A ECYWpBNEJoL1iMPHeBE8

Popular posts from this blog

Makefile test if variable is not empty

Will Oldham

Visual Studio Code: How to configure includePath for better IntelliSense results