Anti-Join Pandas

Anti-Join Pandas

I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are unique in table A and B however in some cases a Key will occur in both table A and B).

I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.

I am familiar with R and this is the code I would use to do this in R.

library("dplyr") ## Filtering join to remove values already in "TableA" from "TableB" FilteredTableB <- anti_join(TableB,TableA, by = "Key") ## Append "FilteredTableB" to "TableA" CombinedTable <- bind_rows(TableA,FilteredTableB)

How would I achieve this in python?

By key do you mean row index, column index, or cell?
– Jossie Calderon
Jul 22 '16 at 1:47

By key i mean a column contained in both tables with values to merge on.
– Ayelavan
Jul 22 '16 at 4:19

I've added an answer. Please let me know what you think via feedback, and if it solves your exercise. Thank you!
– Jossie Calderon
Jul 22 '16 at 7:31

What is the expected input and desired output?
– tommy.carstensen
May 26 '17 at 11:26

4 Answers
4

Consider the following dataframes

TableA = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('abcd'), name='Key'), ['A', 'B', 'C']).reset_index() TableB = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('aecf'), name='Key'), ['A', 'B', 'C']).reset_index()

TableA

enter image description here

TableB

enter image description here

This is one way to do what you want

# Identify what values are in TableB and not in TableA key_diff = set(TableB.Key).difference(TableA.Key) where_diff = TableB.Key.isin(key_diff) # Slice TableB accordingly and append to TableA TableA.append(TableB[where_diff], ignore_index=True)

enter image description here

rows = for i, row in TableB.iterrows(): if row.Key not in TableA.Key.values: rows.append(row) pd.concat([TableA.T] + rows, axis=1).T

4 rows with 2 overlap

Method 1 is much quicker

enter image description here

10,000 rows 5,000 overlap

loops are bad

enter image description here

Thanks. Just had a look at the documentation for combine_first and it seems to be kind of what I am looking for. However, how do I specify the column to combine on (key)? For each row in Table B I want it to check the value in the "Key" column and if that value is contained in Table A's "Key" column I want to ignore that row in B. Where there is a key value in table B that is not in Table A I want to append that row from Table B.
– Ayelavan
Jul 22 '16 at 4:38

@Ayelavan after thinking about it, I cannot recommend combine_first. The problem being that if a key exist in A but some of the columns are null, those nulls might be filled in by B's values with same key. the answer to the question you posed in comments is that combine_first expects the key to be in the index.
– piRSquared
Jul 22 '16 at 7:28

combine_first

Thats exactly what I was looking for. Thank you very much
– Ayelavan
Jul 22 '16 at 9:55

I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

how='outer'

indicator=True

import pandas as pd import numpy as np TableA = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('abcd'), name='Key'), ['A', 'B', 'C']).reset_index() TableB = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('aecf'), name='Key'), ['A', 'B', 'C']).reset_index() print('TableA', TableA, sep='n') print('TableB', TableB, sep='n') TableB_only = pd.merge( TableA, TableB, how='outer', on='Key', indicator=True, suffixes=('_foo','')).query( '_merge == "right_only"') print('TableB_only', TableB_only, sep='n') Table_concatenated = pd.concat((TableA, TableB_only), join='inner') print('Table_concatenated', Table_concatenated, sep='n')

Which prints this output:

TableA Key A B C 0 a 0.035548 0.344711 0.860918 1 b 0.640194 0.212250 0.277359 2 c 0.592234 0.113492 0.037444 3 d 0.112271 0.205245 0.227157 TableB Key A B C 0 a 0.754538 0.692902 0.537704 1 e 0.499092 0.864145 0.004559 2 c 0.082087 0.682573 0.421654 3 f 0.768914 0.281617 0.924693 TableB_only Key A_foo B_foo C_foo A B C _merge 4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only 5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only Table_concatenated Key A B C 0 a 0.035548 0.344711 0.860918 1 b 0.640194 0.212250 0.277359 2 c 0.592234 0.113492 0.037444 3 d 0.112271 0.205245 0.227157 4 e 0.499092 0.864145 0.004559 5 f 0.768914 0.281617 0.924693

You'll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.

TableA

TableB

DataFrame

Then, we want to merge the rows in TableA with the rows in TableB that don't match any in TableA for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other sB if sB's values don't match sA's. The following code solves this exercise:

TableA

TableB

TableA

sA

sB

sA

import pandas as pd TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]]) TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]]) removeTheseIndexes = keyColumnA = TableA.iloc[:,1] # your 'Key' column here keyColumnB = TableB.iloc[:,1] # same for i in range(0, len(keyColumnA)): firstValue = keyColumnA[i] for j in range(0, len(keyColumnB)): copycat = keyColumnB[j] if firstValue == copycat: removeTheseIndexes.append(j) TableB.drop(removeTheseIndexes, inplace = True) TableA = TableA.append(TableB) TableA = TableA.reset_index(drop=True)

Note this affects TableB's data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.

TableB

inplace=False

newTable

TableA.append(newTable)

# Table A 0 1 2 0 2 3 4 1 5 6 7 2 8 9 10 # Table B 0 1 2 0 1 3 4 1 5 7 8 2 9 10 0 # Set 'Key' column = 1 # Run the script after the loop # Table A 0 1 2 0 2 3 4 1 5 6 7 2 8 9 10 3 5 7 8 4 9 10 0 # Table B 0 1 2 1 5 7 8 2 9 10 0

Works a treat. Thank you.
– Ayelavan
Jul 22 '16 at 11:19

Easiest answer imaginable:

import numpy as np tableB = pd.concat([tableB, pd.Series(1)], axis=1) mergedTable = tableA.merge(tableC, how="left" on="key") answer = mergedTable[mergedTable.iloc[:,-1] == np.nan]

Should be the fastest proposed as well.

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Tggjj3tzgpKqw3Lci4LMxp,U4JWc jgkXyuzn5vF,FADs,OxxUla

搜尋此網誌

Ciugk

Anti-Join Pandas

Anti-Join Pandas

4 Answers
4

Popular posts from this blog

Makefile test if variable is not empty

Will Oldham

Visual Studio Code: How to configure includePath for better IntelliSense results

Anti-Join Pandas

Anti-Join Pandas

4 Answers 4

Related Preview

Popular posts from this blog

Makefile test if variable is not empty

Will Oldham

Visual Studio Code: How to configure includePath for better IntelliSense results

4 Answers
4