Making a custom way to train CNNs, and I am noticing that avgpool is SIGNIFICANTLY faster than maxpool in forward and backward passes…
I’m designing a custom training procedure for a CNN that differs from backpropagation in that I use manual update rules for layers or sets of layers. I’m studying my gradient for two types of layers, “conv + actfun + maxpool” and “conv + actfun + avgpool”, which are identical except for the pooling type at the end.
I compared the two layer types with identical data dimensions to see the time differences between maxpool and avgpool, in both the forward and backward passes of the pooling layers. All other steps in calculating the gradient were exactly the same between the two layers and showed the same time costs. But looking specifically at the pooling operations’ forward and backward passes, I get significantly different times (average of 5000 runs of the gradient; each measurement is in milliseconds):
gradient step             | AvgPool | MaxPool | Difference
--------------------------|---------|---------|-----------
pooling (forward pass)    |  0.4165 | 38.6316 |   +38.2151
unpooling (backward pass) |  9.9468 | 46.1667 |   +36.2199
For reference, all my data arrays are dlarrays on the GPU (gpuArrays inside dlarrays), all single precision, and the pooling operations convert 32x32 feature maps (2 channels, batch size 16384) to 16x16 feature maps (same number of channels and batch size), so just a 2x2, stride-2 pooling operation.
You can see here that the maxpool forward pass (using the "maxpool" function) is about 92 times slower than the avgpool forward pass (using "avgpool"), and the maxpool backward pass (using "maxunpool") is about 4.6 times slower than the avgpool backward pass (using a custom "avgunpool" function that Anthropic’s Claude had to create for me, since MATLAB has no "avgunpool").
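For reference, here is a minimal sketch of what such an avgunpool could look like (this is not the exact function I benchmarked; it assumes a non-overlapping pool and unformatted arrays, so a formatted dlarray may need stripdims first):

% Sketch of an "avgunpool" (average-pool backward pass) for a
% non-overlapping pool. The name and signature are hypothetical.
function dX = avgunpool_sketch(dY, pool_size)
    % dY: gradient w.r.t. the pooled output, size [H/ph, W/pw, C, N].
    ph = pool_size(1); pw = pool_size(2);
    % An average pool outputs mean(window), so its gradient spreads
    % uniformly over the window: replicate each element, divide by area.
    dX = repelem(dY, ph, pw, 1, 1) / (ph * pw);
end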
These results are extremely suspect to me. For the forward pass, comparing MATLAB’s built-in "maxpool" to the built-in "avgpool" gives a 92x difference, yet searching online, people generally claim max pooling is supposed to be faster than average pooling during training, which contradicts the results here.
For simplicity, see the code example below, which runs only "maxpool" and "avgpool" (no other functions in the timed region) and compares their times:
function analyze_pooling_timing()
    % GPU setup
    g = gpuDevice();
    fprintf('GPU: %s\n', g.Name);

    % Parameters matching your test
    H_in = 32; W_in = 32; C_in = 3; C_out = 2;
    N = 16384;  % batch size. Try N = 32 (small) or N = 16384 (big)
    kH = 3; kW = 3;
    pool_params.pool_size = [2, 2];
    pool_params.pool_stride = [2, 2];
    pool_params.pool_padding = 0;
    conv_params.stride = [1, 1];
    conv_params.padding = 'same';
    conv_params.dilation = [1, 1];

    % Initialize data
    Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
    Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
    Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');

    % Number of iterations for averaging
    num_iter = 100;
    fprintf('Running %d iterations for each timing measurement...\n\n', num_iter);

    %% Set up everything in the forward pass before the pooling:
    % Forward convolution
    Sj = dlconv(Fjmin1, Wj, Bj, ...
        'Stride', conv_params.stride, ...
        'Padding', conv_params.padding, ...
        'DilationFactor', conv_params.dilation);
    % Activation function (and derivative)
    Oj = max(Sj, 0); Fprimej = sign(Oj);

    %% Time AVERAGE POOLING
    fprintf('=== AVERAGE POOLING (conv_af_ap) ===\n');
    times_ap = struct();
    wait(g); % ensure the conv/activation kernels have finished before timing
    for iter = 1:num_iter
        % Average pooling
        tic;
        Oj_pooled = avgpool(Oj, pool_params.pool_size, ...
            'Stride', pool_params.pool_stride, ...
            'Padding', pool_params.pool_padding);
        wait(g);
        times_ap.pooling(iter) = toc;
    end

    %% Time MAX POOLING
    fprintf('\n=== MAX POOLING (conv_af_mp) ===\n');
    times_mp = struct();
    wait(g);
    for iter = 1:num_iter
        % Max pooling with indices
        tic;
        [Oj_pooled, max_indices] = maxpool(Oj, pool_params.pool_size, ...
            'Stride', pool_params.pool_stride, ...
            'Padding', pool_params.pool_padding);
        wait(g);
        times_mp.pooling(iter) = toc;
    end

    %% Compute statistics and display results
    fprintf('\n=== TIMING RESULTS (milliseconds) ===\n');
    fprintf('%-25s %12s %12s %12s\n', 'Step', 'AvgPool', 'MaxPool', 'Difference');
    fprintf('%s\n', repmat('-', 1, 65));
    steps_common = {'pooling'};
    total_ap = 0;
    total_mp = 0;
    for i = 1:length(steps_common)
        step = steps_common{i};
        if isfield(times_ap, step) && isfield(times_mp, step)
            mean_ap = mean(times_ap.(step)) * 1000; % convert seconds to milliseconds
            mean_mp = mean(times_mp.(step)) * 1000; % convert seconds to milliseconds
            total_ap = total_ap + mean_ap;
            total_mp = total_mp + mean_mp;
            diff = mean_mp - mean_ap;
            fprintf('%-25s %12.4f %12.4f %+12.4f\n', step, mean_ap, mean_mp, diff);
        end
    end
    fprintf('%s\n', repmat('-', 1, 65));
    %fprintf('%-25s %12.4f %12.4f %+12.4f\n', 'TOTAL', total_ap, total_mp, total_mp - total_ap);
    fprintf('%-25s %12s %12s %12.2fx\n', 'max/avg', '', '', total_mp/total_ap);
end
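As a cross-check on the tic/wait(g)/toc pattern above, gputimeit synchronizes the GPU and averages multiple runs internally, and its second argument sets how many outputs to request from the timed function, which is exactly the variable being tested here. A minimal sketch, assuming Oj and pool_params from the function above (I have not verified gputimeit against formatted dlarray inputs):

% Cross-check with gputimeit; the second argument is the number of
% outputs to request from the timed function.
f = @() maxpool(Oj, pool_params.pool_size, ...
    'Stride', pool_params.pool_stride, 'Padding', pool_params.pool_padding);
t1 = gputimeit(f, 1);  % pooled values only
t2 = gputimeit(f, 2);  % pooled values + indices
fprintf('maxpool with 1 output: %.4f ms, with 2 outputs: %.4f ms\n', 1e3*t1, 1e3*t2);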
Now you can see one main difference between my maxpool and avgpool calls: for avgpool I request only 1 output (the pooled values), but with maxpool I request 2 outputs (the pooled values and the index locations of the max values).
This matters because if I replace the maxpool call with one requesting only the pooled values (1 output), maxpool is faster, as expected:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...

=== AVERAGE POOLING (conv_af_ap) ===

=== MAX POOLING (conv_af_mp) ===

=== TIMING RESULTS (milliseconds) ===
Step                           AvgPool      MaxPool   Difference
-----------------------------------------------------------------
pooling                         0.4358       0.2330      -0.2028
-----------------------------------------------------------------
max/avg                                                     0.53x
>>
However, if I call maxpool as above with 2 outputs (pooled values and indices), or with 3 outputs (pooled values, indices, and inputSize), maxpool is extremely slow:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...

=== AVERAGE POOLING (conv_af_ap) ===

=== MAX POOLING (conv_af_mp) ===

=== TIMING RESULTS (milliseconds) ===
Step                           AvgPool      MaxPool   Difference
-----------------------------------------------------------------
pooling                         0.4153      38.9818     +38.5665
-----------------------------------------------------------------
max/avg                                                    93.86x
>>
So it appears that this is the reason maxpool is so much slower: asking the maxpool function to also output the indices leads to a massive slowdown.
Unfortunately, the indices are needed later to differentiate (backward pass) through the maxpool layer… so I need the indices…
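One possible workaround, just a sketch I have not benchmarked: skip the index output entirely and recover a winner mask by comparing the input against the upsampled (fast, single-output) maxpool result. This assumes a non-overlapping 2x2, stride-2 pool, unique maxima per window (ties would route gradient to multiple entries), and an upstream gradient dY; a formatted dlarray may need stripdims before repelem:

% Fast single-output maxpool:
Yp = maxpool(Oj, pool_params.pool_size, ...
    'Stride', pool_params.pool_stride, 'Padding', pool_params.pool_padding);
% Winner mask: true where an input element equals its window's max.
mask = (Oj == repelem(Yp, 2, 2, 1, 1));
% Backward pass: upsample the upstream gradient dY and keep only winners.
dX = mask .* repelem(dY, 2, 2, 1, 1);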
Workarounds aside, I’d assume that whenever someone wants to train a CNN in MATLAB using a maxpool layer, they have to call maxpool with indices, and thus would see a similar slowdown…
Can anyone here comment on this? My application requires training maxpool layers, so I need both the forward and backward passes through them, and thus I need these indices. But it appears that MATLAB’s version of maxpool may not be the best implementation of a maxpool operation… maybe some inefficient process in how it obtains the indices?