This note describes how to build and use MAFs (Min/Max Autocorrelation Factors). The objective is to transform a set of variables \(Z(x)=\{Z_{1}(x), ..., Z_{p}(x)\}\) into another set \(\chi(x)=\{\chi_1(x),...,\chi_p(x)\}\) by a linear transformation such that the new variables are orthogonal both locally (i.e. at \(h=0\), first PCA) and at a given distance \(h\) (second PCA). This is achieved by a sequence of two PCAs, both of which standardize the input variables internally.
Caution is required when manipulating and multiplying matrices, which are not necessarily symmetric, even when square.
A vector of size \(n\) is considered as a one-column matrix with dimension \(n \times 1\). A dataframe \(Z\) with \(p\) variables measured at \(n\) samples is an \(n \times p\) matrix, and its variance-covariance matrix \(C_Z\) is \(p \times p\).
A PCA is nothing but the diagonalization of the variance-covariance matrix. Let \(P\) be the \(p \times p\) matrix of the eigenvectors (stored as the columns of the matrix) and \(D_\Lambda\) the \(p \times p\) diagonal matrix of the eigenvalues; we then have:
\[C_Z =P.D_\Lambda.P^{-1}\]
In the particular case of the diagonalization of a symmetric matrix (\(C_Z\), being a variance-covariance matrix, is symmetric), this leads to:
\[C_Z =P.D_\Lambda.P^{t}\] with \(P^t=P^{-1}\) and \(P^t.P=I\). In particular, \(trace(P^t.P)=p\).
The \(n \times p\) matrix of the principal components \(Y\) is obtained by right-multiplying \(Z\) by \(P\):
\[Y = Z.P\]
Because \(C_Z\) is a particular matrix (namely a variance-covariance matrix), the eigenvalues have a specific meaning: they are the variances of the principal components:
\[C_Y = D_\Lambda\]
If the eigenvalues are sorted in decreasing or increasing order, the columns of all related matrices must be reordered consistently.
The key message (illustrated by the sketch below) is that:
the eigenvectors are stored as columns of the \(p \times p\) matrix \(P\)
the principal components (or factors) are obtained by a right multiplication: \(Y=Z.P\)
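As a minimal base-R sketch of these identities (toy data, unrelated to the RGeostats workflow that follows):
# Toy illustration: eigen() returns the eigenvectors in columns, and the
# factors are obtained by a right multiplication
set.seed(1)
Z.demo <- matrix(rnorm(300), ncol = 3) %*% matrix(runif(9), 3, 3)
C.demo <- var(Z.demo)                      # p x p variance-covariance matrix
E.demo <- eigen(C.demo)
P.demo <- E.demo$vectors                   # eigenvectors in columns
Y.demo <- Z.demo %*% P.demo                # principal components (n x p)
round(var(Y.demo), 10)                     # diagonal matrix of the eigenvalues
max(abs(C.demo - P.demo %*% diag(E.demo$values) %*% t(P.demo)))  # ~0: C_Z = P.D.P^t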
The graphical abstract of MAF highlights the steps of the computation: it describes the standardization steps and the fact that the second PCA consists in diagonalizing the variance-covariance matrix of the spatial increments of the factors, the resulting rotation being applied to the factors themselves (and not to the increments).
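For orientation, here is a compact base-R sketch of that sequence, assuming a 1-D ordering of the samples so that successive rows are at distance \(h\) (the actual grid computation with RGeostats is detailed later in this note):
# Toy data: the mechanics of the two PCAs, not a spatially structured example
set.seed(42)
Z.toy  <- matrix(rnorm(600), ncol = 3) %*% matrix(runif(9), 3, 3)
Zs.toy <- scale(Z.toy)                               # standardize the variables
E0.toy <- eigen(var(Zs.toy))                         # first PCA (at h = 0)
Ys.toy <- Zs.toy %*% E0.toy$vectors %*% diag(1/sqrt(E0.toy$values))  # standardized factors
dY.toy <- diff(Ys.toy)                               # increments at distance h
Ph.toy <- eigen(var(dY.toy))$vectors                 # second PCA on var(increments) ...
chi.toy <- Ys.toy %*% Ph.toy                         # ... applied to the factors themselves
round(var(chi.toy), 10)                              # identity: MAFs are uncorrelated at h = 0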
We start by simulating 3 cross-correlated Gaussian Random Functions (GRFs) with a fixed model (a spherical structure plus a nugget effect), using the turning bands method (with 500 turning bands).
# Create the linear model of coregionalisation
A <- matrix(c(1,0.8,0.2,0.8,1,0.7,0.2,0.7,1),ncol=3,nrow=3,byrow=T)
a.sill <- t(A) %*% A
a.model <- model.create(vartype=3,nvar=3,sill=a.sill,range=10)
a.model <- model.create(vartype=1,nvar=3,model=a.model,sill=diag(1:3)/2)
# Simulate the model
a.grid <- db.create(flag.grid=T,x0=c(0,0),nx=c(100,100),dx=c(1,1))
a.grid <- simtub(dbout=a.grid,model=a.model,nbtuba = 500,seed=0)
We can check the consistency between the model and the variogram of the simulated data.
plot(vario.grid(a.grid),title="Model of Raw Variables")
plot(a.model, add=T)
dev.off()
## null device
## 1
The empirical variance matrix is consistent with the model: the theoretical variances are the sills of the spherical structure plus the nugget variances (e.g. \(1.68 + 0.5 = 2.18\) for the first variable), up to statistical fluctuations.
round(db.stat(a.grid, fun = "var", names = 4:6, flag.mono = F),2)
## [,1] [,2] [,3]
## [1,] 2.31 1.79 0.88
## [2,] 1.79 3.04 1.36
## [3,] 0.88 1.36 2.75
a.sill
## [,1] [,2] [,3]
## [1,] 1.68 1.74 0.96
## [2,] 1.74 2.13 1.56
## [3,] 0.96 1.56 1.53
In this section, we check the consistency between the contents of the PCA object (or the operations performed with it in RGeostats) and calculations performed explicitly.
RGeostats provides two functions to create the objects storing the results of PCA and MAF calculations:
pca.calc() to compute factors according to a PCA
maf.calc() to compute MAFs according to a sequence of 2 PCAs
They both produce the following outputs:
Means and standard deviations for each input variable
A vector of \(p\) eigenvalues, whose sum equals the sum of the variances of the input variables (in particular, it equals \(p\) when the input variables are standardized)
A \(p \times p\) matrix to transform Variables to Factors (z2f)
A \(p \times p\) matrix to back-transform Factors to Variables (f2z)
RGeostats also provides two functions, pca.z2f() and pca.f2z(), which can then be used to build the factors according to the projection matrix, and to back-transform the factors or some of their transformations (e.g. their kriging).
RGeostats also offers the possibility to add new variables that are linear combinations of the input variables, via db.add.CL(). This function allows building factors and/or MAFs, but its calculations need to be understood precisely.
The function pca.calc() produces the factors of the input variables. We will demonstrate that these variables are systematically standardized internally.
a.pca <- pca.calc(a.grid)
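As stated above, since the inputs are standardized internally, the eigenvalues sum to \(p\):
sum(a.pca$eigen)
## [1] 3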
We can check that the stored values match explicit calculations, the eigenvectors being identical up to a possible change of sign per vector.
# Extracting the data
Z <- as.matrix(a.grid[,"Simu*"])
Compare calculations to contents of the PCA object:
# Explicit calculation
means = colMeans(Z)
means
## Simu.V1.S1 Simu.V2.S1 Simu.V3.S1
## 0.1556993 0.2000552 0.1540979
# Contents of the PCA object
a.pca$mean
## [1] 0.1556993 0.2000552 0.1540979
Note that the variance calculation in RGeostats divides by \(N\), whereas statisticians usually divide by \(N-1\). This explains the small difference in the standard deviations:
# Explicit calculation
sigmas = apply(Z, 2, sd)  # column standard deviations (divisor N-1)
sigmas
## Simu.V1.S1 Simu.V2.S1 Simu.V3.S1
## 1.521117 1.742325 1.658231
# Contents of the PCA object
a.pca$sigma
## [1] 1.521041 1.742238 1.658148
N = dim(Z)[1]
sqrt(a.pca$sigma * a.pca$sigma * N / (N-1))
## [1] 1.521117 1.742325 1.658231
# Explicit calculation
Z_s = scale(Z)
Eig = eigen(var(Z_s))
Eig
## eigen() decomposition
## $values
## [1] 2.0099211 0.6809393 0.3091396
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.5941157 -0.4934473 0.6352450
## [2,] -0.6300575 -0.2054725 -0.7488715
## [3,] -0.5000540 0.8451573 0.1888258
# Contents of the PCA object
a.pca$eigen
## [1] 2.0099211 0.6809393 0.3091396
a.pca$pcaz2f
## [,1] [,2] [,3]
## [1,] 0.5941157 -0.4934473 0.6352450
## [2,] 0.6300575 -0.2054725 -0.7488715
## [3,] 0.5000540 0.8451573 0.1888258
We continue the comparison with the factors themselves.
The function pca.z2f() creates a new Db to which the calculated factors have been added; they are extracted (as Y_1) for comparison with the results of explicit calculations.
a.grid_1 <- pca.z2f(a.grid,a.pca)
Y_1 = as.matrix(a.grid_1[,"Factor*"])
One can compute the factors explicitly (in Y_2).
P0 <- Eig$vectors
lambda0 <- Eig$values
# Building factors
Y_2 <- Z_s %*% P0
We compare the resulting factors. Note that the eigenvectors are determined up to their sign; therefore we must correct for possible sign differences before comparing.
a.sign = P0 / a.pca$pcaz2f
a.sign = apply(a.sign,2,mean)
a.sign
## [1] -1 1 1
Y_1 = sweep(Y_1,2,a.sign,FUN="*")
apply(Y_1 - Y_2,2,range)
## Factor.1 Factor.2 Factor.3
## [1,] -0.0002647027 -0.0001552818 -1.108928e-04
## [2,] 0.0002383036 0.0001615063 9.630706e-05
Outputs are sufficiently close to assert that:
pca.calc() standardizes the input data
the pcaz2f matrix produced by pca.calc() is correct, with the eigenvectors in columns (as in P0)
Moving back from the factors to the initial variables can be done both ways, either using pca.f2z() or through explicit computations.
For the explicit calculation, we right-multiply the factors by \(P_0^t\), which produces the standardized variables; these must then be transformed back to raw units (de-scaled using the means and standard deviations calculated beforehand). Applying pca.f2z() directly gives back the raw data.
a.grid_2 <- pca.f2z(a.grid_1,a.pca)
res.0 <- a.grid_2[,"Simu*"]
res.1 <- a.grid_2[,"Variable*"]
res.2 <- Y_2 %*% t(P0)
res.2 = sweep(res.2,2,sigmas,FUN="*")
res.2 = sweep(res.2,2,means,FUN="+")
plot(res.2[,1],res.1[,1],col=1,pch=20,cex=0.1,
xlab ="input data (Z)",
ylab="Back-transformation of the factors",xlim=c(-6,6),asp=1,
main="Comparison between Initial Variables and back-trasnformed Factors")
points(res.1[,1],res.0[,1],col=2,pch=20,cex=0.1)
abline(0,1)
Factors can also be obtained in another way, using the function db.add.CL(), which is meant to add new variables that are linear combinations of existing variables. However, it must be used carefully, as it must be applied:
to the standardized version of the input data,
and using the transposed matrix of the eigenvectors.
As a matter of fact, the rationale of db.add.CL() is to generate \(q\) new variables \(Y\) from a set of \(p\) variables \(Z\) (\(q\) ranging from 1 to possibly more than \(p\)). For consistency with the present note, the notations used in db.add.CL() are modified here. The input transformation matrix \(A\) is \(q \times p\), so that the new variables are obtained by a right multiplication by \(A^t\): \(Y=Z.A^t\) (see the small sketch after the list below). This formulation is mathematically consistent in terms of matrix dimensions:
\(Z\) is \(n \times p\)
\(A\) is \(q \times p\) so that \(A^t\) is \(p \times q\)
and \(Y\) is thus \((n \times p) \times (p \times q) = n \times q\)
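As a minimal illustration of these dimensions (toy matrices with hypothetical names, not a call to db.add.CL()):
# n = 5 samples, p = 3 variables, q = 2 linear combinations
Z.demo2 <- matrix(rnorm(15), nrow = 5)        # n x p
A.demo2 <- matrix(rnorm(6), nrow = 2)         # q x p: one combination per ROW
Y.demo2 <- Z.demo2 %*% t(A.demo2)             # (n x p).(p x q) = n x q
dim(Y.demo2)
## [1] 5 2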
So, comparing the two ways of computing the factors:
for db.add.CL(), the linear combinations are provided in ROWS (the underlying implementation works row by row)
for pca.z2f(), the linear combinations are provided in COLUMNS (an explicit matrix multiplication point of view)
In the particular case of PCA, this becomes tricky: to produce factors from the variables, one must feed db.add.CL() with \(P^t\), which is in fact the matrix meant to back-transform the factors.
# Standardization of the input variables
a.grid.std = db.normalize(a.grid,names="Simu*",oper="scal")
# Computing the factors with db.add.CL
a.grid_3 <- db.add.CL(a.grid.std,
Y.names = "Norm*", Z.names = paste("Y",1:3,sep="."),
mat.CL=t(a.pca$pcaz2f))
summary(a.grid_3$Y.1 - a.grid_1$Factor.1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
summary(a.grid_3$Y.2 - a.grid_1$Factor.2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
summary(a.grid_3$Y.3 - a.grid_1$Factor.3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
The above script checks that db.add.CL(), applied to the standardized variables with the transposed z2f matrix, reproduces exactly the factors computed by pca.z2f().
Computing MAFs follows the same logic, except that the function to use is maf.calc() instead of pca.calc(). It produces, in particular, the matrix \(P\) needed to transform the (standardized) input data into MAFs (see the graphical abstract).
The distance arguments h0 and dh define the reference distance (and its tolerance interval) used to compute the increments. This interval should not include 0. For gridded data, the reference distance is typically one grid cell.
a.maf <- maf.calc(a.grid,h0=1,dh=0.01,dirvect = 0,tolang=0)
As for PCA, an explicit computation is possible. However, the code available in Petitgas et al. (2020) is not suited to medium or large data sets. It has been re-written below so as to run on large data sets, in order to check the computations in the specific case of a grid with east-west increments.
# Standardizing factors
D_inv_sqrt_lambda0 <- diag(1/sqrt(lambda0))
Y_2_s <- Y_2 %*% D_inv_sqrt_lambda0
# Spatial increments
tmp <- db.add(a.grid,as.data.frame(Y_2_s))
tmp0 <- db.reduce(db.sel(tmp,x1<99),names="V*",flag.keep.grid = T)
tmp1 <- db.reduce(db.sel(tmp,x1>0),names="V*",flag.keep.grid = T)
tmp0 <- as.matrix(db.extract(tmp0,"V*"))
tmp1 <- as.matrix(db.extract(tmp1,"V*"))
dY_sh <- tmp1-tmp0
# PCA of the variance-covariance matrix of the increments
Eig <- eigen(var(dY_sh))
Ph <- Eig$vectors
lambdah <- Eig$values
# Ordering MAFs by increasing eigenvalues
tmp <- order(lambdah)
Ph <- Ph[,tmp]
lambdah <- lambdah[tmp]
We can now compare the results:
# Using maf calculations
a.maf$eigen
## [1] 0.5501545 1.3507196 1.9953027
# Using explicit calculations
lambdah
## [1] 0.550136 1.350721 1.995304
# Using maf calculations
a.maf$pcaz2f
## [,1] [,2] [,3]
## [1,] 0.6453879 0.8892113 0.7945604
## [2,] 0.3706924 -0.4445208 -1.3185564
## [3,] 0.1321705 -0.8470453 0.7441576
# Using explicit calculations
P <- P0 %*% D_inv_sqrt_lambda0 %*% Ph
P
## [,1] [,2] [,3]
## [1,] -0.6453895 0.8892114 0.7945590
## [2,] -0.3706896 -0.4445207 -1.3185573
## [3,] -0.1321724 -0.8470453 0.7441573
a.grid.xsi_1 <- pca.z2f(a.grid,a.maf,radix="MAF")
xsi_1 = db.extract(a.grid.xsi_1,"MAF*")
xsi_2 <- Z_s %*% P
a.sign = P / a.maf$pcaz2f
a.sign = round(apply(a.sign,2,mean),0)
a.sign
## [1] -1 1 1
xsi_2 = sweep(xsi_2,2,a.sign,FUN="*")
a.grid.xsi_2 = db.add(a.grid, as.data.frame(xsi_2))
We compare the results, paying attention to the sign of the factors.
apply(xsi_1 - xsi_2,2,range)
## MAF.1 MAF.2 MAF.3
## [1,] -0.0001727681 -0.0001865276 -0.0001813812
## [2,] 0.0001968363 0.0001861247 0.0001843047
This check indicates that the sequence "maf.calc() %>% pca.z2f()" provides the MAFs of the centered data, i.e. if the input data are not centered, the RGeostats functions impose their centering. Consequently, to get the same outputs from RGeostats and from an explicit computation of the MAFs, one must first center the data.
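A minimal check of this point, using the objects computed above: if the scaling is applied without centering, each MAF is shifted by a constant, equal to the corresponding component of \((m/\sigma)^t.P\).
# Scaled but NOT centered input: the MAFs are shifted by a constant per column
chi.nc <- sweep(Z, 2, sigmas, FUN = "/") %*% P
offset <- drop((means / sigmas) %*% P)                  # constant shift per MAF
range(chi.nc - sweep(Z_s %*% P, 2, offset, FUN = "+"))  # ~0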
One of the goals of MAF is to obtain variables that are orthogonal at distance 0 and at distance \(h\). This can be checked by looking at the variograms of the set of MAFs.
plot(vario.grid(a.grid.xsi_1),title="Model for MAF factors")
plot(vario.grid(a.grid.xsi_2),add=T)
Results are consistent with expectations:
the simple variograms become less and less structured from the first to the last MAF
the cross-variograms are 0 at the distance used for the computation of the MAFs (note that this holds for the specific distance and orientation chosen for the computation)
the variograms are exactly the same using "maf.calc() %>% pca.z2f() %>% vario.grid()" or "explicit computation %>% vario.grid()", which validates the outputs
plot(a.grid.xsi_1,name="MAF.1")
plot(a.grid.xsi_1,name="MAF.2")
plot(a.grid.xsi_1,name="MAF.3")
Caution! Due to the symmetry of the variance-covariance matrices:
\(P_0^t = P_0^{-1}\)
\(P_h^t = P_h^{-1}\)
But the combined matrix \(P = P_0.D_{\Lambda_0}^{-1/2}.P_h\) is not orthogonal, so \(P^t \neq P^{-1}\). Moreover, the transformation matrix provided by maf.calc() applies to the standardized version of the input data:
\[\chi=Z_s.P\] and
\[Z_s = \chi.P^{-1}\]
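This can be checked numerically on the matrices computed above (here \(P\) is the explicit MAF transformation built earlier):
# Unlike P0 and Ph, the combined matrix P is not orthogonal:
range(t(P) %*% P - diag(3))   # clearly different from 0
range(solve(P) - t(P))        # the inverse differs from the transpose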
Let us denote \(1\) an \(n \times 1\) matrix of ones, \(m\) the \(p \times 1\) matrix of the means of the input variables, and \(D_s\) the \(p \times p\) diagonal matrix of their standard deviations. We then have:
\[Z = 1.m^t + \chi.P^{-1}.D_s = 1.m^t + \chi.C\] where \(C = P^{-1}.D_s\) is a \(p \times p\) matrix whose columns contain the coefficients of the linear combinations we wanted to exhibit.
We have:
maf.calc$pcaz2f = \(P\)
maf.calc$pcaf2z = \(P^{-1}\)
So, in terms of RGeostats outputs, the coefficients of the linear combinations expressing the input variables as functions of the MAFs are obtained by multiplying each column of the matrix maf.calc$pcaf2z by the standard deviation of the corresponding input variable.
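As a closing sketch, one can rebuild the raw variables from the RGeostats MAFs, assuming (as stated above) that the pcaf2z slot of the MAF object stores \(P^{-1}\) and that its mean and sigma slots store the means and standard deviations used internally:
# C = P^{-1}.D_s : scale each column of pcaf2z by the corresponding standard deviation
C.mat <- sweep(a.maf$pcaf2z, 2, a.maf$sigma, FUN = "*")
# Z = 1.m^t + chi.C ; the residuals are expected to be numerically small
Z.rec <- outer(rep(1, nrow(Z)), a.maf$mean) + as.matrix(xsi_1) %*% C.mat
range(Z.rec - Z)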