What is the origin of the discrepancy in binning caused by the built-in Freedman–Diaconis method in the histcounts function?
I am attempting to use the Freedman–Diaconis (FD) rule to determine appropriate bins for a right-skewed dataset. In MATLAB's histcounts function, the 'BinMethod' parameter offers a built-in FD rule through the value 'fd'. The MATLAB reference page for histcounts gives the bin width that 'fd' is supposed to apply, which matches the standard FD rule, as follows:
2*iqr(X(:))*numel(X)^(-1/3)
In my testing, I have found that the actual bins produced by the ‘fd’ method vary significantly from what one expects to observe when manually applying the FD rule. Even accounting for the fact that "histcounts adjusts the number of bins slightly so that the bin edges fall on ‘nice’ numbers, rather than using these exact formulas" as stated on the MATLAB reference page, the differences in binning are substantial. While I have not been able to discern an exact pattern regarding the differences between MATLAB and manual FD binning, I have observed that the MATLAB FD typically reduces the number of total bins by a factor of 2-10 across various datasets.
To illustrate this issue, I’ve attached a MATLAB script that re-creates this discrepancy. With this seed, the manual FD rule is generating 239 bins and the MATLAB FD rule is generating 106 bins. I’ve also attached the output figures. Similar discrepancies occur regardless of seed or population size.
clear
clc
% Data generation: exponentiated normal sample, which is right-skewed
rng(1);
data = randn(10000, 1);
skewed_data = exp(data);
% Manual FD
bin_length = 2*iqr(skewed_data(:))*numel(skewed_data)^(-1/3); % Formula from the MATLAB reference page
edges = min(skewed_data):bin_length:max(skewed_data);
b = histcounts(skewed_data, edges);
figure
bar(edges(1:end-1), b, 'histc');
xlabel('Value');
ylabel('Frequency');
title('Manual FD');
disp('Number of bins for manual FD:')
disp(length(b));
% MATLAB FD
[b, edges] = histcounts(skewed_data, 'BinMethod', 'fd');
figure
bar(edges(1:end-1), b, 'histc');
xlabel('Value');
ylabel('Frequency');
title('MATLAB FD');
disp('Number of bins for MATLAB FD:')
disp(length(b));
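Since the bin count depends on both the bin width and the range covered by the edges, it may be clearer to compare the two bin widths directly. Below is a minimal sketch of that check, assuming the same data generation as the script above; the variable names w_manual, w_matlab, and edges_fd are my own and not anything from the histcounts documentation.

```matlab
% Sketch: compare the FD bin width from the documented formula with the
% width histcounts actually uses for 'fd' (same data as the script above).
rng(1);
skewed_data = exp(randn(10000, 1));

% Width from the formula on the histcounts reference page
w_manual = 2*iqr(skewed_data(:))*numel(skewed_data)^(-1/3);

% Width implied by the edges that histcounts returns for 'fd'
% (the returned edges are uniformly spaced, so the mean spacing is the width)
[~, edges_fd] = histcounts(skewed_data, 'BinMethod', 'fd');
w_matlab = mean(diff(edges_fd));

fprintf('Manual FD bin width: %.4f\n', w_manual);
fprintf('MATLAB FD bin width: %.4f\n', w_matlab);
fprintf('Ratio (MATLAB/manual): %.2f\n', w_matlab / w_manual);
```

If the ratio stays close to 1, rounding to 'nice' edges would be enough to explain the difference; a ratio well above 1 would suggest that histcounts is choosing a different width before any rounding.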
Regarding my interest in this discrepancy: I applied both MATLAB FD and manual FD binning to a dataset before conducting non-linear optimization, and I found that the resulting models were best when the MATLAB FD binning was used. I am preparing to publish, and I want to be able to explain exactly how the binning that gives the best performance is produced. As a result, I would like to know how the MATLAB FD method, and histcounts more broadly, bins my data.
Of note: I’ve found similar differences in binning between the MATLAB results and the actual formulas for Sturges and Scott as well. While I imagine that the origin of this difference may be similar, I am primarily concerned with Freedman–Diaconis currently.