Question about how the function pca() calculates the covariance matrix internally
I was puzzled by the output of pca() with and without mean centering. I am using MATLAB R2024a.
pca.m uses the internal function c = ncnancov(x,Rows,centered), which appears to compute the covariance matrix of x. However:
1) It uses the formula for the population covariance, i.e. it calculates x'*x/n rather than x'*x/(n-1). What is the rationale behind that?
2) It does not mean-center x. This is surprising because, without mean centering, x'*x/n (or x'*x/(n-1), for that matter) is NOT the covariance matrix of x.
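To spell out point 2, here is a short worked identity. The notation is an assumption for illustration and is not taken from pca.m: X is the n-by-p data matrix, \bar{x} is the p-by-1 vector of column means, \mathbf{1} is an n-by-1 vector of ones, and X_c is the mean-centered data. The uncentered product differs from the population covariance by a rank-one term built from the means, so the two agree only when the column means are zero.

```latex
\begin{aligned}
X &= X_c + \mathbf{1}\,\bar{x}^{\top}, \qquad X_c^{\top}\mathbf{1} = 0 \\
\frac{1}{n} X^{\top} X
  &= \frac{1}{n} X_c^{\top} X_c
   + \frac{1}{n}\left( X_c^{\top}\mathbf{1}\,\bar{x}^{\top}
   + \bar{x}\,\mathbf{1}^{\top} X_c \right)
   + \frac{1}{n}\,\bar{x}\,\mathbf{1}^{\top}\mathbf{1}\,\bar{x}^{\top} \\
  &= \underbrace{\frac{1}{n} X_c^{\top} X_c}_{\text{population covariance}}
   + \bar{x}\,\bar{x}^{\top}
\end{aligned}
```

The cross terms vanish because the columns of X_c sum to zero, and \mathbf{1}^{\top}\mathbf{1} = n cancels the 1/n in the last term.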
The second point causes the call [coeff,score,latent] = pca(D,'Algorithm','eig','Centered','off') to produce different coeff and latent from the call [coeff,score,latent] = pca(D,'Algorithm','eig'). The scores will obviously differ, but coeff and latent should not be affected by mean centering, as can be shown by comparing the output of:
load('Data_Table8p1.mat');
Dm = D-mean(D);
[coeff,eigValues] = eig(cov(D));
[eigValues, idx] = sort(diag(eigValues), 'descend'); % sort eigenvalues in descending order
coeff = coeff(:, idx); % reorder eigenvectors to match
score = D/coeff'; % get scores of the uncentered data
with:
[coeff_m,eigValues_m] = eig(cov(Dm));
[eigValues_m, idx] = sort(diag(eigValues_m), 'descend'); % sort
coeff_m = coeff_m(:, idx);
score_m = Dm/coeff_m'; % get scores of mean centered data
Probably I am missing something, but the internal function ncnancov() as used in pca is unclear to me. Any explanation is much appreciated!