Vectorized Levenshtein distances between arrays of text labels?
I have to compare "N" ID labels (several thousand) with each other in order to determine which are mistypings of one another. The labels have up to 20 characters. As a first approach, I am considering calculating the N(N-1)/2 Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. This is being done in Python, but none of the Levenshtein distance implementations are vectorized: the NxN array of distances is filled in on an element-by-element basis.
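A minimal sketch of what this element-by-element approach looks like, assuming a plain-Python dynamic-programming Levenshtein function. The names `levenshtein` and `pairwise_distances` are illustrative only and are not taken from any particular library.

```python
def levenshtein(a, b):
    """Classic two-row dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def pairwise_distances(labels):
    """Fill a symmetric N x N matrix one element at a time (hypothetical helper)."""
    n = len(labels)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # N(N-1)/2 scalar calls
            dist[i][j] = dist[j][i] = levenshtein(labels[i], labels[j])
    return dist
```

With several thousand labels, the double loop makes millions of separate calls, which is the cost that a vectorized version would be meant to avoid.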
I thought that there might be a vectorized Matlab version of Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorizes the calculation of the N(N-1)/2 distances. I’m surprised that a vectorized implementation doesn’t exist. Am I missing something obvious?
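A minimal sketch of what "vectorized" could mean here, assuming NumPy: the dynamic-programming recurrence is still stepped through cell by cell (at most 20 x 20 cells for these labels), but each step is applied to all N(N-1)/2 pairs at once. The function names `encode` and `batched_levenshtein` are hypothetical, and the sketch is not tuned for speed or memory.

```python
import numpy as np


def encode(labels):
    """Pad labels into an (N, width) array of code points; also return lengths."""
    lengths = np.array([len(s) for s in labels], dtype=np.int64)
    width = max(int(lengths.max()), 1) if len(labels) else 1
    codes = np.zeros((len(labels), width), dtype=np.int32)
    for k, s in enumerate(labels):
        codes[k, :len(s)] = [ord(c) for c in s]
    return codes, lengths


def batched_levenshtein(labels):
    """Symmetric N x N distance matrix, with the DP vectorized over all pairs."""
    codes, lengths = encode(labels)
    n, width = codes.shape
    ia, ib = np.triu_indices(n, k=1)       # every pair with ia < ib
    a, b = codes[ia], codes[ib]            # padded strings, shape (P, width)
    la, lb = lengths[ia], lengths[ib]
    # DP row 0 for every pair: D[0][j] = j
    prev = np.tile(np.arange(width + 1), (ia.size, 1))
    result = np.where(la == 0, lb, 0)      # pairs whose first string is empty
    for i in range(1, width + 1):
        cur = np.empty_like(prev)
        cur[:, 0] = i
        for j in range(1, width + 1):
            cost = (a[:, i - 1] != b[:, j - 1]).astype(prev.dtype)
            cur[:, j] = np.minimum(np.minimum(prev[:, j] + 1, cur[:, j - 1] + 1),
                                   prev[:, j - 1] + cost)
        # D[i][j] depends only on a[:i] and b[:j], so padding cannot affect
        # the value read at (len(a), len(b)); record it once row len(a) is done.
        done = la == i
        result[done] = cur[done, lb[done]]
        prev = cur
    dist = np.zeros((n, n), dtype=np.int64)
    dist[ia, ib] = result
    dist[ib, ia] = result
    return dist
```

This sketch holds two DP rows for every pair in memory at once, so for several thousand labels the pair list would probably need to be processed in chunks.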
Annex: Matlab implementations of Levenshtein distance
https://people.math.sc.edu/Burkardt/m_src/levenshtein/levenshtein.html
https://www.mathworks.com/matlabcentral/fileexchange/17585-calculation-of-distance-between-strings
https://blogs.mathworks.com/cleve/2017/08/14/levenshtein-edit-distance-between-strings