How to identify duplicate rows between tables
I’m using R2020b, and I want to set up a master table for appending new data to – and as part of this I want to identify any duplicate rows in the new, incoming table to filter them out before appending. Ideally, the master table will live in a related directory in a .mat file, and the new data will be read in directly from a set-name, set-location .csv using e.g.
fullname = fullfile('relativepath','newdata.csv');
% grab column headers from input sheet
opts = detectImportOptions(fullname);
% set all variable types to categorical
opts.VariableTypes(:) = {'categorical'};
% read in new data
T = readtable(fullname,opts);
% make any modifications to new data headers to match old data
T = renamevars(T,"NewLabel","OldLabel");
% clean new table headers to match originally-wizard-imported headers (I’d ask why these exhibit different behaviour, but that’s a separate tragedy, and this current fix works – I think)
% strip spaces, parentheses and underscores; the parentheses are escaped so they are matched literally
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, ' ', '');
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, '\(', '');
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, '\)', '');
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, '_', '');
I found the solution suggested here: https://au.mathworks.com/matlabcentral/answers/514921-finding-identical-rows-in-2-tables, but having done a quick test via:
foo = T(str2double(string(T.Year))<1943,:); % not my actual query, but structurally the same; this gave me ~40% of my original data
bar = T(str2double(string(T.Year))>1941,:); % similar, gave me ~70% of the original data
baz = ismember(foo,bar); % similar, gives the overlap for 1 particular year (should be about 14% of my original data)
blah = T(str2double(string(T.Year))==1942,:); % to directly extract the number of rows I am looking for
sum(baz) % What I expect here is the number of rows in the overlap
ans =
0
I found that ismember was not finding any duplicates (which were there by construction).
Note: due to categorical data I actually used T(str2double(string(T.Year))…)
Replacing the call with
baz = ismember(foo,bar,'rows');
sum(baz)
ans =
0
gives the same result: no duplicates are found. Using double quotes ("rows") does not change the behaviour.
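For reference, a minimal sketch of the behaviour I expected, using made-up toy tables rather than my real data. The names A, B, Year and Site are hypothetical and only serve the illustration. For table inputs, ismember is documented to compare whole rows, so I would expect only the second row of A to be flagged here:

```matlab
% Hypothetical toy tables (not the real data); every variable is categorical
A = table(categorical({'1941';'1942';'1942'}), categorical({'x';'y';'z'}), ...
    'VariableNames', {'Year','Site'});
B = table(categorical({'1942';'1943'}), categorical({'y';'w'}), ...
    'VariableNames', {'Year','Site'});

% For table inputs, ismember returns one logical value per row of A
tf = ismember(A, B);

% Rows of A that also occur in B, and how many there are
overlap  = A(tf, :);
nOverlap = sum(tf);
```

If a toy case like this behaves as expected while the real tables do not, the difference presumably lies in the data rather than in the call itself.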
On the other hand, using the function to assess single variables gives the expected behaviour (to some degree):
testest = ismember(foo.var1,bar.var1)
sum(testest)
The sum is now non-zero, and (because single variables are repeated more often than their combinations) gives more like 30% of the original data, which seems reasonable (the number of unique entries in the original set in that variable was about 40% of the total).
I guess I could create a logical index based on the product of multiple calls of this kind, but that seems rather… inefficient… and sensitive to the exact construction of the table/variables used in the filter. I’d rather have a generic solution for full table rows that will be robust if the overall table changes over the long term (or if/when I functionalise the code and use it for other work). Whilst most of the time, a couple of key variables can be used to identify unique rows, occasionally more information is required to distinguish pathological cases. I will probably use this approach if a more elegant solution doesn’t appear, though, and put some thought into which groups of variables are 100% correlated (and therefore useless for this distinction) to cut down the Boolean product.
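A rough sketch of that per-variable product, again on hypothetical toy tables (A, B, Year and Site are made-up names). It also shows a caveat of the approach: each call only tests whether a value occurs somewhere in the other table's column, not whether the values occur together in the same row, so the product can flag rows that are not true duplicates.

```matlab
% Hypothetical toy tables; the row (1942, z) is in A but NOT in B
A = table(categorical({'1941';'1942';'1942'}), categorical({'x';'y';'z'}), ...
    'VariableNames', {'Year','Site'});
B = table(categorical({'1942';'1943'}), categorical({'y';'z'}), ...
    'VariableNames', {'Year','Site'});

% Logical product (AND) of one ismember call per variable
vars = A.Properties.VariableNames;
mask = true(height(A), 1);
for k = 1:numel(vars)
    mask = mask & ismember(A.(vars{k}), B.(vars{k}));
end

% Compare against whole-row matching
rowMatch = ismember(A, B);
```

In this toy case the third row of A passes the per-variable test, because 1942 and z each appear in B, even though that combination does not appear in any single row of B. The product is therefore a necessary condition for a duplicate row, not a sufficient one.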
I could also throw good coding practice to the winds and just write two nested loops (one for rows, one for variables) and exhaustively test every combination, but I suspect that would be even less efficient (although I wonder whether the scaling order would be the same given the nature of the comparisons required).
If it is pertinent, I imported all (>25) data columns from a .csv file as categorical variables. The original data before that were a mix of number and general columns from an Excel sheet; I could have used any or all of {double,string,categorical,datetime} to store the various variables, but there are some data which are best stored as categorical to avoid character trimming and consequent data cleaning / returning to original state steps.
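One check that might be relevant to the all-categorical import, sketched under an assumption that is not confirmed anywhere in this post: that empty cells in the .csv become <undefined> categorical entries, and that rows containing such entries are never treated as equal by ismember or unique. The toy table C and its variable names are hypothetical.

```matlab
% Hypothetical toy table: the empty entry becomes <undefined> on conversion
C = table(categorical({'1942';'1942'}), categorical({'y';''}), ...
    'VariableNames', {'Year','Site'});

% ismissing returns a logical array the same size as the table
hasMissing = any(ismissing(C), 2);   % true for rows with at least one <undefined>
nAffected  = nnz(hasMissing);        % number of rows that could be affected
```

Running the same two lines on foo and bar would show how many rows contain at least one <undefined> entry. With more than 25 columns that could plausibly be most of them, but that is a guess until checked.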
Digging further, I also found this: https://au.mathworks.com/matlabcentral/answers/1775400-how-do-i-find-all-indexes-of-duplicate-names-in-a-table-column-then-compare-the-row-values-for-each which appears to imply that ismember should have the functionality I need here.
Similarly, methods using unique (see e.g. https://au.mathworks.com/matlabcentral/answers/1999193-find-duplicated-rows-in-matlab-without-for-loop or https://au.mathworks.com/matlabcentral/answers/1571588-table-find-duplicate-rows-double-char-datetime or https://au.mathworks.com/matlabcentral/answers/305987-identify-duplicate-rows-in-a-matrix) give:
size(unique([foo;bar],'rows'),1) == size(foo,1)+size(bar,1)
ans =
logical
1
instead of the expected 0: if the overlapping rows were recognised as duplicates, unique would return fewer rows than foo and bar combined. (Same for "rows" again.)
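The same test on made-up toy tables, as a sketch of what I expected to see (A, B, Year and Site are hypothetical names). The two tables share one row, so stacking them and taking the unique rows should give fewer rows than the stacked table has:

```matlab
% Hypothetical toy tables sharing exactly one row, (1942, y)
A = table(categorical({'1941';'1942';'1942'}), categorical({'x';'y';'z'}), ...
    'VariableNames', {'Year','Site'});
B = table(categorical({'1942';'1943'}), categorical({'y';'w'}), ...
    'VariableNames', {'Year','Site'});

stacked = [A; B];            % 5 rows in total
U       = unique(stacked);   % for tables, unique operates on whole rows

% true would mean "no duplicate rows"; with one shared row this should be false
noDuplicates = height(U) == height(stacked);
```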
I’ve also looked into outerjoin/join/innerjoin, but those don’t seem to remove duplicates like I need.