How to speed up our code to be implemented on GPU
Hello, I have previously created my MEX file of my code to speed up its implementation speed on GPU. Fortunately, it got faster by 5 times, and hopefully, I want to know if there is way to implement it with higher speed. Here is my code:
function BPmimo2C(Efield) %#codegen
coder.gpu.kernelfun;
image = complex(zeros(17,54,54));
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.’, [1,numT*numR]);
c = physconst(‘LightSpeed’);
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
% z = 0.4;
for dep = 1:numel(z)
%% making the matrix of <transmitter-grid point> distance
XYPos = [TX.’ TY.’ ones(size(TX,2),1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dtXYUV = pdist2( XYPos, UVPos);
dtXYUV2 = zeros(numR,numel(x1f(:)));
expTerm1 = bsxfun(@times,dtXYUV(:)’ , K’);
expT1 = reshape(expTerm1,[numel(K),numel(TX),numel(x1f)]);
expT2 = zeros(numel(K),numR,numel(x1f),numel(TX));
for i = 1:numel(TX)
expT2(:,:,:,i) = repmat(expT1(:,i,:),[1 numR 1]);
dtXYUV2(:,:,i) = repmat(dtXYUV(i,:),[numR,1]);
end
expT = permute(reshape(permute(expT2,[1 3 2 4]),[numel(K),numel(x1f),numR*numel(TX)]),[1 3 2]);
%% making the matrix of <reciever-grid point> distance
XYPos = [real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1)), ones(numR,1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dXYUV = pdist2( XYPos, UVPos);
expTerm1 = bsxfun(@times,dXYUV(:)’ , K’);
expR = repmat(reshape(expTerm1,[numel(K),numR,numel(x1f)]),[1 numel(TX) 1]);
%% making the exponentail term
EXP = exp(1i*(expT + expR));
EXP2 = reshape(EXP,[numel(K)*numel(TX)*numR,numel(x1f)]);
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]);
image2 = reshape(((viq.’).*Efield2(:,6)).’*EXP2,[sqrt(numel(x1f)),sqrt(numel(x1f))]);
%% gahter to change matrix from GPU-array to normal array
image(dep,:,:) = image2;
end
image = abs(image);
uf = repmat(reshape(uf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
vf = repmat(reshape(vf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
hf = uf;
for j = 1:numel(z)
hf(j,:,:) = z(j);
end
figure(1);
er = squeeze((image(13,:,:)));
h = surf(squeeze(uf(1,:,:)),squeeze(vf(1,:,:)),er);
colormap(jet);
set(h,’LineStyle’,’none’);
view(2);
end
In addition to speed, sometimes it encounters with "out of memory" error, which is due to huge size of some arrays. I can implement it using multiple nested "for"loops, however, I understood it’d be faster on CPU if I use MATLAB’s matrix multipication capability; Therefore, I preferred matrix-based code rather than multiple nested "for" loops.
Any advice, whether it would be general or specific, would be appreciated.
Thank youHello, I have previously created my MEX file of my code to speed up its implementation speed on GPU. Fortunately, it got faster by 5 times, and hopefully, I want to know if there is way to implement it with higher speed. Here is my code:
function BPmimo2C(Efield) %#codegen
coder.gpu.kernelfun;
image = complex(zeros(17,54,54));
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.’, [1,numT*numR]);
c = physconst(‘LightSpeed’);
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
% z = 0.4;
for dep = 1:numel(z)
%% making the matrix of <transmitter-grid point> distance
XYPos = [TX.’ TY.’ ones(size(TX,2),1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dtXYUV = pdist2( XYPos, UVPos);
dtXYUV2 = zeros(numR,numel(x1f(:)));
expTerm1 = bsxfun(@times,dtXYUV(:)’ , K’);
expT1 = reshape(expTerm1,[numel(K),numel(TX),numel(x1f)]);
expT2 = zeros(numel(K),numR,numel(x1f),numel(TX));
for i = 1:numel(TX)
expT2(:,:,:,i) = repmat(expT1(:,i,:),[1 numR 1]);
dtXYUV2(:,:,i) = repmat(dtXYUV(i,:),[numR,1]);
end
expT = permute(reshape(permute(expT2,[1 3 2 4]),[numel(K),numel(x1f),numR*numel(TX)]),[1 3 2]);
%% making the matrix of <reciever-grid point> distance
XYPos = [real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1)), ones(numR,1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dXYUV = pdist2( XYPos, UVPos);
expTerm1 = bsxfun(@times,dXYUV(:)’ , K’);
expR = repmat(reshape(expTerm1,[numel(K),numR,numel(x1f)]),[1 numel(TX) 1]);
%% making the exponentail term
EXP = exp(1i*(expT + expR));
EXP2 = reshape(EXP,[numel(K)*numel(TX)*numR,numel(x1f)]);
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]);
image2 = reshape(((viq.’).*Efield2(:,6)).’*EXP2,[sqrt(numel(x1f)),sqrt(numel(x1f))]);
%% gahter to change matrix from GPU-array to normal array
image(dep,:,:) = image2;
end
image = abs(image);
uf = repmat(reshape(uf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
vf = repmat(reshape(vf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
hf = uf;
for j = 1:numel(z)
hf(j,:,:) = z(j);
end
figure(1);
er = squeeze((image(13,:,:)));
h = surf(squeeze(uf(1,:,:)),squeeze(vf(1,:,:)),er);
colormap(jet);
set(h,’LineStyle’,’none’);
view(2);
end
In addition to speed, sometimes it encounters with "out of memory" error, which is due to huge size of some arrays. I can implement it using multiple nested "for"loops, however, I understood it’d be faster on CPU if I use MATLAB’s matrix multipication capability; Therefore, I preferred matrix-based code rather than multiple nested "for" loops.
Any advice, whether it would be general or specific, would be appreciated.
Thank you Hello, I have previously created my MEX file of my code to speed up its implementation speed on GPU. Fortunately, it got faster by 5 times, and hopefully, I want to know if there is way to implement it with higher speed. Here is my code:
function BPmimo2C(Efield) %#codegen
coder.gpu.kernelfun;
image = complex(zeros(17,54,54));
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.’, [1,numT*numR]);
c = physconst(‘LightSpeed’);
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
% z = 0.4;
for dep = 1:numel(z)
%% making the matrix of <transmitter-grid point> distance
XYPos = [TX.’ TY.’ ones(size(TX,2),1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dtXYUV = pdist2( XYPos, UVPos);
dtXYUV2 = zeros(numR,numel(x1f(:)));
expTerm1 = bsxfun(@times,dtXYUV(:)’ , K’);
expT1 = reshape(expTerm1,[numel(K),numel(TX),numel(x1f)]);
expT2 = zeros(numel(K),numR,numel(x1f),numel(TX));
for i = 1:numel(TX)
expT2(:,:,:,i) = repmat(expT1(:,i,:),[1 numR 1]);
dtXYUV2(:,:,i) = repmat(dtXYUV(i,:),[numR,1]);
end
expT = permute(reshape(permute(expT2,[1 3 2 4]),[numel(K),numel(x1f),numR*numel(TX)]),[1 3 2]);
%% making the matrix of <reciever-grid point> distance
XYPos = [real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1)), ones(numR,1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dXYUV = pdist2( XYPos, UVPos);
expTerm1 = bsxfun(@times,dXYUV(:)’ , K’);
expR = repmat(reshape(expTerm1,[numel(K),numR,numel(x1f)]),[1 numel(TX) 1]);
%% making the exponentail term
EXP = exp(1i*(expT + expR));
EXP2 = reshape(EXP,[numel(K)*numel(TX)*numR,numel(x1f)]);
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]);
image2 = reshape(((viq.’).*Efield2(:,6)).’*EXP2,[sqrt(numel(x1f)),sqrt(numel(x1f))]);
%% gahter to change matrix from GPU-array to normal array
image(dep,:,:) = image2;
end
image = abs(image);
uf = repmat(reshape(uf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
vf = repmat(reshape(vf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
hf = uf;
for j = 1:numel(z)
hf(j,:,:) = z(j);
end
figure(1);
er = squeeze((image(13,:,:)));
h = surf(squeeze(uf(1,:,:)),squeeze(vf(1,:,:)),er);
colormap(jet);
set(h,’LineStyle’,’none’);
view(2);
end
In addition to speed, sometimes it encounters with "out of memory" error, which is due to huge size of some arrays. I can implement it using multiple nested "for"loops, however, I understood it’d be faster on CPU if I use MATLAB’s matrix multipication capability; Therefore, I preferred matrix-based code rather than multiple nested "for" loops.
Any advice, whether it would be general or specific, would be appreciated.
Thank you gpu coder, matlab, gpu, speed, mex MATLAB Answers — New Questions