Min Max Autocorrelation Factor

Nicolas Bez

17 April 2020

This note describes how to build and use MAFs. Basically, the objective is to transform a set of variables \(Z(x)=\{Z_{1}(x), ..., Z_{p}(x)\}\) into another set of variables \(\chi(x)=\{\chi_1(x),...,\chi_p(x)\}\) by a linear transformation, where the new variables are orthogonal both locally (i.e. at \(h=0\), first PCA) and at a given distance \(h\) (second PCA). This is based on a sequence of two PCAs, both of which internally standardize their input variables.

Reminder on matrices, diagonalization, etc.

Caution is required when manipulating and multiplying matrices, which are not systematically symmetrical, even if they are square.

A vector of size \(n\) is considered as a one-column matrix with dimension \(n \times 1\). A dataframe \(Z\) with \(p\) variables measured at \(n\) samples is a \(n \times p\) matrix, and its variance-covariance matrix \(C_Z\) is \(p \times p\).

A PCA is nothing but the diagonalization of the variance-covariance matrix. Let \(P\) be the \(p \times p\) matrix of eigenvectors (the eigenvectors constituting the columns of the matrix) and \(D_\Lambda\) the \(p \times p\) diagonal matrix of eigenvalues; we then have:

\[C_Z =P.D_\Lambda.P^{-1}\]

In the particular case of the diagonalization of a symmetrical matrix (\(C_Z\) being a variance-covariance matrix, it is symmetrical), this leads to:

\[C_Z =P.D_\Lambda.P^{t}\] with \(P^t=P^{-1}\) and \(P^t.P=I\). In particular, \(trace(P^t.P)=p\).

The \(n \times p\) matrix of the principal components \(Y\) is obtained by right-multiplying \(Z\) by \(P\):

\[Y = Z.P\]

Because \(C_Z\) is a particular matrix (i.e. it is a variance-covariance matrix), the eigenvalues have a particular meaning: they correspond to the variances of the principal components:

\[C_Y = D_\Lambda\]

If the eigenvalues are sorted in decreasing or increasing order, the same reordering must be applied coherently to all the matrices (in particular to the columns of \(P\)).
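These identities can be checked numerically; the following is a minimal sketch on an arbitrary toy dataset, using only base R:

# Numerical check of the diagonalization identities on a toy dataset
set.seed(1)
Z <- matrix(rnorm(300), ncol = 3)    # n = 100 samples, p = 3 variables
C_Z <- var(Z)                        # symmetric p x p variance-covariance matrix
Eig <- eigen(C_Z)
P <- Eig$vectors                     # eigenvectors in columns
D <- diag(Eig$values)                # diagonal matrix of eigenvalues
max(abs(C_Z - P %*% D %*% t(P)))     # ~ 0 : C_Z = P.D.P^t
max(abs(t(P) %*% P - diag(3)))       # ~ 0 : P^t.P = I, hence trace(P^t.P) = p
Y <- scale(Z, scale = FALSE) %*% P   # principal components of the centered data
max(abs(var(Y) - D))                 # ~ 0 : C_Y = D_Lambda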

The key message is that the principal components are pairwise uncorrelated, and that their variances are the eigenvalues of \(C_Z\).

Graphical abstract of MAF

The graphical abstract of a MAF underlines the steps of MAF computations, describes the standardization steps and the fact that the second PCA consists in the diagonalization of the variance-covariance matrix of the spatial increments of the factors which is applied of the factors themselves (and not on the increments).
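The sketch below implements this pipeline in base R, assuming the \(n\) rows of the \(n \times p\) matrix Z are ordered so that the increments at the reference distance are obtained by shifting rows by lag positions (e.g. east-west neighbours on a grid stored row by row, ignoring edge effects at row ends); maf_sketch is a hypothetical helper written for this note, not an RGeostats function.

# Minimal sketch of the MAF pipeline (hypothetical helper, see assumptions above)
maf_sketch <- function(Z, lag = 1) {
  Zs  <- scale(Z)                               # standardize the input variables
  e0  <- eigen(var(Zs))                         # first PCA, at h = 0
  S0  <- e0$vectors %*% diag(1 / sqrt(e0$values))
  Ys  <- Zs %*% S0                              # standardized factors (unit variance)
  n   <- nrow(Ys)
  dY  <- Ys[(1 + lag):n, ] - Ys[1:(n - lag), ]  # spatial increments at the reference lag
  eh  <- eigen(var(dY))                         # second PCA, on the increments' covariance
  ord <- order(eh$values)                       # MAFs ordered by increasing eigenvalue
  Ys %*% eh$vectors[, ord]                      # rotation applied to the factors, not the increments
}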

Simulated data

We start by simulating 3 cross-correlated Gaussian Random Functions (GRFs) with a fixed coregionalization model (spherical structure plus nugget), using the turning bands method (with 500 turning bands).

# Creation of the coregionalization model
A       <- matrix(c(1,0.8,0.2,0.8,1,0.7,0.2,0.7,1),ncol=3,nrow=3,byrow=T)
a.sill  <- t(A) %*% A
a.model <- model.create(vartype=3,nvar=3,sill=a.sill,range=10)
a.model <- model.create(vartype=1,nvar=3,model=a.model,sill=diag(1:3)/2)

# Simulation of the model
a.grid <- db.create(flag.grid=T,x0=c(0,0),nx=c(100,100),dx=c(1,1))
a.grid <- simtub(dbout=a.grid,model=a.model,nbtuba = 500,seed=0)

We can check the consistency between the model and the variogram of the simulated data.

plot(vario.grid(a.grid),title="Model of Raw Variables")
plot(a.model, add=T)


The empirical variance matrix is consistent with the model (the theoretical variance being the sum of a.sill and the nugget component diag(1:3)/2), up to ergodic fluctuations.

round(db.stat(a.grid, fun = "var", names = 4:6, flag.mono = F),2)
##      [,1] [,2] [,3]
## [1,] 2.31 1.79 0.88
## [2,] 1.79 3.04 1.36
## [3,] 0.88 1.36 2.75
a.sill
##      [,1] [,2] [,3]
## [1,] 1.68 1.74 0.96
## [2,] 1.74 2.13 1.56
## [3,] 0.96 1.56 1.53

RGeostats for PCA & MAF, direct and back transformations

In this section, we check the consistency between the contents of the PCA object (and the operations performed with it in RGeostats) and the same calculations performed explicitly.

RGeostats provides two functions to define the objects storing the results of PCA and MAF calculations: pca.calc() and maf.calc().

They both produce the following outputs: the means (mean) and standard deviations (sigma) of the input variables, the eigenvalues (eigen), and the transformation matrices from variables to factors (pcaz2f) and from factors to variables (pcaf2z).

RGeostats also provides two functions, pca.z2f() and pca.f2z(), which can then be used to build the factors according to the projection matrix, and to back-transform the factors or some of their transformations, e.g. their kriging.

RGeostats also makes it possible to add new variables which are linear combinations of the input variables, via db.add.CL(). This function allows building factors and/or MAFs, but its calculations need to be understood precisely.

PCA

The function pca.calc() produces the factors of the input variables. We will demonstrate that the input variables are systematically standardized internally.

Using pca.calc()

a.pca <- pca.calc(a.grid)

we can check below that the stored quantities match the explicit calculations, the eigenvectors being identical up to a possible change of sign per vector.

Performing the calculations explicitly

# Extracting the data
Z <- as.matrix(a.grid[,"Simu*"])

Compare calculations to contents of the PCA object:

# Explicit calculation
means = colMeans(Z)
means
## Simu.V1.S1 Simu.V2.S1 Simu.V3.S1 
##  0.1556993  0.2000552  0.1540979
# Contents of the PCA object
a.pca$mean
## [1] 0.1556993 0.2000552 0.1540979

Note that the variance calculation in RGeostats divides by N, whereas the usual statistical convention (used by R's sd()) divides by N-1. This explains the small difference below.

# Explicit calculation
sigmas = apply(Z, 2, sd)  # standard deviations (division by N-1)
sigmas
## Simu.V1.S1 Simu.V2.S1 Simu.V3.S1 
##   1.521117   1.742325   1.658231
# Contents of the PCA object
a.pca$sigma
## [1] 1.521041 1.742238 1.658148
N = dim(Z)[1]
sqrt(a.pca$sigma * a.pca$sigma * N / (N-1)) 
## [1] 1.521117 1.742325 1.658231
# Explicit calculation
Z_s = scale(Z)
Eig = eigen(var(Z_s))
Eig
## eigen() decomposition
## $values
## [1] 2.0099211 0.6809393 0.3091396
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.5941157 -0.4934473  0.6352450
## [2,] -0.6300575 -0.2054725 -0.7488715
## [3,] -0.5000540  0.8451573  0.1888258
# Contents of the PCA object
a.pca$eigen
## [1] 2.0099211 0.6809393 0.3091396
a.pca$pcaz2f
##           [,1]       [,2]       [,3]
## [1,] 0.5941157 -0.4934473  0.6352450
## [2,] 0.6300575 -0.2054725 -0.7488715
## [3,] 0.5000540  0.8451573  0.1888258

Transformation from Variables to Factors

We continue the comparison.

Using pca.z2f()

This function creates a new Db where the calculated factors have been added. They are extracted for comparison with the results of the explicit calculations (stored in Y_1).

a.grid_1 <- pca.z2f(a.grid,a.pca)
Y_1 = as.matrix(a.grid_1[,"Factor*"])

Explicit computation

One can compute the factors explicitly (in Y_2).

P0      <- Eig$vectors
lambda0 <- Eig$values
# Building factors
Y_2 <- Z_s %*% P0

We compare the resulting factors. Note that eigenvectors are determined up to their sign; therefore we must correct for possible sign differences before comparing.

a.sign = P0 / a.pca$pcaz2f
a.sign = apply(a.sign,2,mean)
a.sign
## [1] -1  1  1
Y_1 = sweep(Y_1,2,a.sign,FUN="*")
apply(Y_1 - Y_2,2,range)
##           Factor.1      Factor.2      Factor.3
## [1,] -0.0002647027 -0.0001552818 -1.108928e-04
## [2,]  0.0002383036  0.0001615063  9.630706e-05

Outputs are sufficiently close to assert that pca.z2f() computes the factors as \(Y = Z_s.P\), i.e. by right-multiplying the internally standardized data by the matrix of eigenvectors.

Transforming from Factors to Variables

Moving back from the factors to the initial variables can be done both ways, either using pca.f2z() or by explicit computation.

For the explicit calculation, we right-multiply the factors by \(P_0^t\), which produces the standardized variables; these must then be transformed back to raw units (de-scaled using the means and standard deviations calculated beforehand). On the other hand, applying pca.f2z() directly gives back the raw data.

a.grid_2 <- pca.f2z(a.grid_1,a.pca)
res.0 <- a.grid_2[,"Simu*"]
res.1 <- a.grid_2[,"Variable*"]
res.2 <- Y_2 %*% t(P0)
res.2 = sweep(res.2,2,sigmas,FUN="*")
res.2 = sweep(res.2,2,means,FUN="+")
plot(res.2[,1],res.1[,1],col=1,pch=20,cex=0.1,
     xlab ="input data (Z)", 
     ylab="Back-transformation of the factors",xlim=c(-6,6),asp=1,
     main="Comparison between Initial Variables and back-trasnformed Factors")
points(res.1[,1],res.0[,1],col=2,pch=20,cex=0.1)
abline(0,1)

Another way to build the factors (or back transform) using db.add.CL()

Factors can be obtained in another way, using the function db.add.CL(), which is meant to add new variables that are Linear Combinations of existing variables. However, it must be used carefully: it must be applied to the standardized variables, and fed with the transposed transformation matrix, as detailed below.

As a matter of fact, the rationale of db.add.CL() is to generate \(q\) new variables \(Y\) from a set of \(p\) variables \(Z\) (\(q\) ranging from 1 to possibly more than \(p\)). For consistency with the present note, the notations of db.add.CL() are adapted here. The input transformation matrix \(A\) is \(q \times p\), so that the new variables are obtained by a right multiplication by \(A^t\): \(Y=Z.A^t\). This formulation is mathematically consistent in terms of matrix dimensions: \[(n \times q) = (n \times p).(p \times q)\]
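As a quick illustration of this dimension logic (Z0 and A0 below are arbitrary toy matrices, chosen only for illustration):

# Dimension check for Y = Z.A^t, with n = 5, p = 3, q = 2
n <- 5; p <- 3; q <- 2
Z0 <- matrix(rnorm(n * p), n, p)  # n x p input variables
A0 <- matrix(rnorm(q * p), q, p)  # q x p matrix of linear combinations
Y0 <- Z0 %*% t(A0)                # (n x p).(p x q) = n x q
dim(Y0)                           # 5 2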

So, comparing the two ways to compute factors: in the particular case of PCA, this becomes tricky, as to produce the factors from the variables one must feed db.add.CL() with \(P^t\), the matrix which is indeed meant to back-transform the factors (since \(Y=Z_s.(P^t)^t=Z_s.P\)).

# Standardization of the input variables
a.grid.std = db.normalize(a.grid,names="Simu*",oper="scal")
# Computing the factors with db.add.CL
a.grid_3 <- db.add.CL(a.grid.std,
                          Y.names = "Norm*", Z.names = paste("Y",1:3,sep="."),
                          mat.CL=t(a.pca$pcaz2f))
summary(a.grid_3$Y.1 - a.grid_1$Factor.1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
summary(a.grid_3$Y.2 - a.grid_1$Factor.2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
summary(a.grid_3$Y.3 - a.grid_1$Factor.3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0

The above script allows checking that:

pca.z2f(a.grid , a.pca) = db.add.CL(a.grid.std , t(a.pca$pcaz2f))

Computing MAF

Computing MAFs follows the same logic, except that the function to use is maf.calc() instead of pca.calc(). Among other outputs, this function produces \(P\), the matrix needed to transform the (standardized) input data into MAFs (see the graphical abstract).

The distance arguments h0 and dh define the reference distance, and its associated tolerance interval, used to compute the increments. This interval should not include 0. For gridded data, the reference distance is typically equal to one grid cell size.

a.maf <- maf.calc(a.grid,h0=1,dh=0.01,dirvect = 0,tolang=0)

Explicit computation

As for PCA, an explicit computation is possible. However, the code available in Petitgas et al. (2020) is not suited to medium or large data sets. It has been rewritten here to run on large data sets, checking the computations in the specific case of a grid with east-west increments.

# Standardizing factors
D_inv_sqrt_lambda0 <- diag(1/sqrt(lambda0))
Y_2_s <- Y_2 %*% D_inv_sqrt_lambda0
# Spatial increments 
tmp  <- db.add(a.grid,as.data.frame(Y_2_s))
tmp0 <- db.reduce(db.sel(tmp,x1<99),names="V*",flag.keep.grid = T)
tmp1 <- db.reduce(db.sel(tmp,x1>0),names="V*",flag.keep.grid = T)
tmp0 <- as.matrix(db.extract(tmp0,"V*"))
tmp1 <- as.matrix(db.extract(tmp1,"V*"))
dY_sh <- tmp1-tmp0
# PCA of the variance-covariance matrix of the increments
Eig <- eigen(var(dY_sh))
Ph  <- Eig$vectors
lambdah <- Eig$values
# Ordering MAFs by increasing eigenvalues
tmp <- order(lambdah)
Ph  <- Ph[,tmp]
lambdah <- lambdah[tmp]

We can now compare the results:

# Using maf calculations
a.maf$eigen
## [1] 0.5501545 1.3507196 1.9953027
# Using explicit calculations
lambdah
## [1] 0.550136 1.350721 1.995304
# Using maf calculations
a.maf$pcaz2f
##           [,1]       [,2]       [,3]
## [1,] 0.6453879  0.8892113  0.7945604
## [2,] 0.3706924 -0.4445208 -1.3185564
## [3,] 0.1321705 -0.8470453  0.7441576
# Using explicit calculations
P <- P0 %*% D_inv_sqrt_lambda0 %*% Ph
P
##            [,1]       [,2]       [,3]
## [1,] -0.6453895  0.8892114  0.7945590
## [2,] -0.3706896 -0.4445207 -1.3185573
## [3,] -0.1321724 -0.8470453  0.7441573

Performing the transformation

Using pca.z2f()

a.grid.xsi_1 <- pca.z2f(a.grid,a.maf,radix="MAF")
xsi_1 = db.extract(a.grid.xsi_1,"MAF*")

Using explicit calculations

xsi_2 <- Z_s %*% P
a.sign = P / a.maf$pcaz2f
a.sign = round(apply(a.sign,2,mean),0)
a.sign
## [1] -1  1  1
xsi_2 = sweep(xsi_2,2,a.sign,FUN="*")
a.grid.xsi_2 = db.add(a.grid, as.data.frame(xsi_2))

We compare the results, paying attention to the sign of the factor.

apply(xsi_1 - xsi_2,2,range)
##              MAF.1         MAF.2         MAF.3
## [1,] -0.0001727681 -0.0001865276 -0.0001813812
## [2,]  0.0001968363  0.0001861247  0.0001843047

This check indicates that the sequence maf.calc() %>% pca.z2f() provides the MAFs of centered data, i.e. if the input data are not centered, the RGeostats functions impose their centering. As a matter of fact, to get the same outputs between RGeostats and an explicit computation of the MAFs, one must first center the data.

MAF properties & verifications

One of the goals of MAFs is to get variables that are orthogonal at distance 0 and at distance \(h\). This can be checked by looking at the variograms of the set of MAFs.

plot(vario.grid(a.grid.xsi_1),title="Model for MAF factors")
plot(vario.grid(a.grid.xsi_2),add=T)

Results are consistent with expectations: the MAFs are uncorrelated at distance 0, and their cross-variograms vanish at the reference distance \(h_0=1\).

plot(a.grid.xsi_1,name="MAF.1")

plot(a.grid.xsi_1,name="MAF.2")

plot(a.grid.xsi_1,name="MAF.3")

Explicit decomposition of variables on MAFs

Caution! Due to the symmetry of the variance-covariance matrices, the eigenvector matrices \(P_0\) and \(P_h\) of the two PCAs are orthogonal (their inverses are their transposes). But the overall MAF transformation matrix \(P = P_0.D_{\Lambda_0}^{-1/2}.P_h\) is not orthogonal, so that its inverse is not its transpose.

The transformation matrix provided by maf.calc() applies to the standardized version of the input data:

\[\chi=Z_s.P\] and

\[Z_s = \chi.P^{-1}\]

Let us denote by \(1\) an \(n \times 1\) matrix of ones, by \(m\) the \(p \times 1\) matrix of the means of the input variables, and by \(D_s\) the \(p \times p\) diagonal matrix of their standard deviations. We then have:

\[Z = 1.m^t + \chi.P^{-1}.D_s = 1.m^t + \chi.C\] where \(C\) is a \(p \times p\) matrix whose columns are the linear combinations we wanted to exhibit.

We have \(C = P^{-1}.D_s\). So, in terms of RGeostats outputs, the coefficients of the linear combination of the variables on the MAFs are obtained by multiplying each column of the matrix a.maf$pcaf2z by the standard deviation of the corresponding input variable.
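This can be checked numerically with a short sketch, assuming the slots pcaf2z, sigma and mean of the MAF object behave like their PCA counterparts shown earlier, and reusing Z and xsi_1 from the previous sections:

# Sketch of the reconstruction Z = 1.m^t + chi.C from the RGeostats outputs
C <- sweep(a.maf$pcaf2z, 2, a.maf$sigma, FUN = "*")   # scale each column by the std of its variable
Z_rec <- sweep(as.matrix(xsi_1) %*% C, 2, a.maf$mean, FUN = "+")
apply(Z_rec - Z, 2, range)                            # expected ~ 0, up to numerical rounding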