Memory optimization and code speed up
Hi! I have a function to compute what is called the Fuzzy Rand Index. It is a way to compare a fuzzy (soft) or probabilistic data partition (a clustering result) with a reference hard partition. The math behind it is basically an algebra with t-norms and t-conorms (s-norms). Among the possible choices, we can use min() as the t-norm and max() as the s-norm, or the algebraic product and the probabilistic OR (probor) as the t-norm and s-norm. In the present case I'm using min() and max().
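Just to make the algebra concrete, these two t-norm/s-norm choices can be written as anonymous functions (this is only an illustration, not part of my actual code, and the names are mine):
% minimum / maximum pair (the one used below)
tnorm = @(a,b) min(a,b);
snorm = @(a,b) max(a,b);
% algebraic product / probabilistic OR (probor) pair, as an alternative
tnorm_prod   = @(a,b) a.*b;
snorm_probor = @(a,b) a + b - a.*b;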
The input data are 2 partition matrices (N x d), in which N is the number of data points and d the number of clusters. So each element h_ji of a partition matrix is precisely the probability that data point j belongs to cluster i. In general, N >> d (e.g. N = 20000 to 50000 points, while d is typically 2 to 5 clusters).
The math behind the index is such that I have to compute a sort of "pairwise min" between each data point and all the others within the same column (min(h_ji, h_ki)), so I'll have N*(N-1)/2 values for each of the d columns. To these d arrays of size N*(N-1)/2, I must apply the max s-norm element-wise to end up with a quantity V that is an array of size N*(N-1)/2.
I must also compute the same "pairwise min" t-norm between every data point and every other one, but mixing the columns, as in min(h_ji, h_kp) with i ≠ p. So I'll end up with factorial(d)/factorial(d-2) = d*(d-1) arrays of size N*(N-1)/2, and to these I must also apply the max s-norm element-wise to end up with a quantity X that is an array of size N*(N-1)/2.
What I call "pairwise min" works much like pdist does: each point is compared with the following ones, but not with the previous ones.
Later, the V and X from one partition matrix are compared with the corresponding V0 and X0 from the reference partition matrix using the min t-norm.
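To pin down the definitions of V and X, here is a naive, memory-hungry reference that I would only run for small N (the variable names pairs, Uj, Uk are mine; this is meant to define the quantities, not to be fast):
U = rand(500,3);           % small example partition matrix
[N,d] = size(U);
pairs = nchoosek(1:N,2);   % all pairs j < k, N*(N-1)/2 of them
Uj = U(pairs(:,1),:);      % memberships of point j, one row per pair
Uk = U(pairs(:,2),:);      % memberships of point k, one row per pair
V = max(min(Uj,Uk),[],2);  % same-column pairwise min, then max s-norm over the d columns
X = zeros(N*(N-1)/2,1);
for p = 1:d
    for q = 1:d
        if p ~= q          % only the d*(d-1) mixed-column combinations
            X = max(X, min(Uj(:,p),Uk(:,q)));
        end
    end
end
% V and X are later compared element-wise with the reference's V0 and X0
% via min(V,V0) and min(X,X0); the final index formula is not shown here.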
I have implemented this in 3 different ways.
The slowest one, and the one with the highest memory consumption, is the most compact form:
% you can use as an example U = rand(20000,4);
U = rand(30000,4); % just for test
[N,d] = size(U);   % U is a partition matrix
comb = ones(d,2).*(1:d)'; % same-column pairs: [1 1; 2 2; ...; d d]
V = max(cell2mat(arrayfun(@(k) cell2mat(arrayfun(@(j) pdist2(U((j+1):N,comb(k,1)),U(j,comb(k,2)),@min),(1:(N-1)).','UniformOutput',false)),1:size(comb,1),'UniformOutput',false)).');
comb = nchoosek(1:d,2);     % all mixed-column pairs...
comb = [comb;fliplr(comb)]; % ...in both orders, d*(d-1) of them
X = max(cell2mat(arrayfun(@(k) cell2mat(arrayfun(@(j) pdist2(U((j+1):N,comb(k,1)),U(j,comb(k,2)),@min),(1:(N-1)).','UniformOutput',false)),1:size(comb,1),'UniformOutput',false)).');
As an intermediate form, I've tried to use parallel code like:
% you can use as an example U = rand(20000,4);
U = rand(30000,4); % just for test
[N,d] = size(U);   % U is a partition matrix
comb = nchoosek(1:d,2); % mixed-column pairs (unordered; both orders are handled inside the loop)
ncomb = size(comb,1);
X = cell(N-1,1);
V = cell(N-1,1);
parfor j = 1:N-1
    % mixed-column contribution for the pairs (j,k), k > j
    tmp = zeros(N-j,1);
    for k = 1:ncomb
        tmp = max(tmp,max(min(U(j,comb(k,1)), U((j+1):end,comb(k,2))) , min(U(j,comb(k,2)), U((j+1):end,comb(k,1)))));
    end
    X{j} = tmp;
    % same-column contribution for the pairs (j,k), k > j
    tmp = zeros(N-j,1);
    for k = 1:d
        tmp = max(tmp,min(U(j,k), U((j+1):end,k)));
    end
    V{j} = tmp;
end
X2 = cell2mat(X);
V2 = cell2mat(V);
Finally, the version that gave me the best results in terms of speed and memory usage is the following, where I used a trick with circshift and the single type, since double precision is not important to me:
% you can use as an example A = rand(20000,4);
A = rand(30000,4); % just for test
[N,d] = size(A);   % A is a partition matrix
NC = fix(N/2);     % number of row shifts needed to cover all pairs
numelements = N*(N-1)/2;
V = zeros(numelements,1,'single');
X = zeros(numelements,1,'single');
temp = zeros(N,1,'single');
for nc1 = 1:NC-1
    Ac = circshift(A,-nc1,1);                       % pair each row with the row nc1 positions below it
    V((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2); % same-column t-norm, then max s-norm over columns
    X((nc1-1)*N + 1 : nc1*N) = func1(A,Ac,temp,d);  % mixed-column contribution
    temp(:) = 0;
end
if N/2 - NC == 0
    % N is even: the last shift would double-count pairs, so use only the first half of the rows
    V((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
    X((NC-1)*N + 1 : end) = func2(A,N,d,temp(1:N/2));
else
    % N is odd: the last shift still gives N distinct pairs
    Ac = circshift(A,-NC,1);
    V((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
    X((NC-1)*N + 1 : end) = func1(A,Ac,temp,d);
end

function temp = func1(A,Ac,temp,d)
% mixed-column contribution: shift the columns of Ac so column i of A meets every other column of Ac
for nc2 = 1:d-1
    temp = max(temp,max(min(A,circshift(Ac,-nc2,2)),[],2));
end
end

function temp = func2(A,N,d,temp)
% same as func1, but only the first half of the rows against the second half (even-N case)
for nc2 = 1:d-1
    temp = max(temp,max(min(A(1:N/2,:),circshift(A(N/2+1:end,:),-nc2,2)),[],2));
end
end
The problem is that this calculation must be done many times to compare a lot of different clustering results with the reference one, and for datasets with N > 20000 it takes a lot of time. So do you think there is room for more optimization, to make this faster and more memory-efficient than the 3rd version with the circshifts?
Also, as this is done multiple times (2300 times for each dataset), this function is called inside a parfor, so each worker executes it. But min and max are multi-threaded functions as well. My approach was to set up the parpool with 6 workers (12 threads) and leave 2 cores (4 threads) out of the parpool, hoping that those 4 threads would be used for the multi-threading of min() and max(). Is that right? I thought that using all cores for the parpool workers would disable the multi-threading of min() and max() and make things slower.
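For reference, the calling pattern is roughly the following (a sketch only: computeVX stands for the circshift version above wrapped as a function, and partitions, Uref and fuzzyRandFrom are placeholder names, not my real code):
parpool('local',6);            % 6 workers; 2 physical cores (4 threads) left out of the pool
[V0,X0] = computeVX(Uref);     % V0, X0 from the reference partition, computed once
FRI = zeros(2300,1);
parfor r = 1:2300
    [V,X] = computeVX(partitions{r});              % one clustering result per iteration
    FRI(r) = fuzzyRandFrom(min(V,V0), min(X,X0));  % min t-norm comparison; placeholder for the final index formula
end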