---
title: "Min Max Autocorrelation Factor"
author: "Nicolas Bez"
date: "17 April 2020"
output:
  rmarkdown::html_vignette:
    fig_width: 5
    fig_height: 4
vignette: >
  %\VignetteIndexEntry{Min Max Autocorrelation Factor}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r Loading_library,include=FALSE}
rm(list = ls())
library(RGeostats)
library(resample)
knitr::opts_chunk$set(echo = TRUE, fig.align="center")
constant.define("asp",1)
set.seed(32432)
```

This note describes how to build and use MAFs. The objective is to transform a set of variables $Z(x)=\{Z_{1}(x), ..., Z_{p}(x)\}$ into another set of variables $\chi(x)=\{\chi_1(x),...,\chi_p(x)\}$ by a linear transformation, where the new variables are orthogonal both locally (i.e. at $h=0$, first PCA) and at a given distance $h$ (second PCA). This is based on a sequence of two PCAs, both of which internally standardize their input variables.

# Reminder on matrices, diagonalisation, etc.

Caution is required when manipulating and multiplying matrices, which are not systematically symmetrical, even when they are square. A vector of size $n$ is considered as a one-column matrix with dimension $n \times 1$. A dataframe $Z$ with $p$ variables measured at $n$ samples is an $n \times p$ matrix, and its variance-covariance matrix $C_Z$ is $p \times p$. A PCA is nothing but the diagonalization of the variance-covariance matrix. Let $P$ be the $p \times p$ matrix of the eigenvectors (the eigenvectors constitute the columns of the matrix), and $D_\Lambda$ be the $p \times p$ diagonal matrix of the eigenvalues. We have:

$$C_Z =P.D_\Lambda.P^{-1}$$

In the **particular case** of the diagonalization of a **symmetrical matrix** ($C_Z$, being a variance-covariance matrix, is symmetrical), this leads to:

$$C_Z =P.D_\Lambda.P^{t}$$

with $P^t=P^{-1}$ and $P^t.P=I$. In particular, $trace(P^t.P)=p$.

The $n \times p$ matrix of the **principal components** $Y$ is obtained by right multiplying $Z$ by $P$:

$$Y = Z.P$$

Because $C_Z$ is a particular matrix (i.e. it is a variance-covariance matrix), the eigenvalues have a particular meaning: they correspond to the variances of the principal components:

$$C_Y = D_\Lambda$$

If the eigenvalues are sorted in decreasing or increasing order, this must be done in a coherent manner for all the matrices.

**The key message is that:**

* the **eigenvectors** are stored as columns of the $p \times p$ matrix $P$
* the **principal components** or **factors** are obtained by a **right multiplication**: $Y=Z.P$

# Graphical abstract of MAF

The graphical abstract of MAF underlines the steps of the MAF computations. It describes the standardization steps, and the fact that the second PCA consists in diagonalizing the variance-covariance matrix of the spatial increments of the factors, the resulting rotation being applied to the factors themselves (and not to the increments).

![](MAF.jpg)
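These identities can be illustrated numerically. The following sketch (toy data, base R only, not part of the running example used below) diagonalizes the covariance matrix of a small synthetic dataset and checks that $P^t.P=I$ and $C_Y = D_\Lambda$:

```{r Checking matrix identities}
# Toy data: n samples of p correlated variables
n <- 1000
p <- 3
Z_toy <- matrix(rnorm(n * p), n, p) %*% matrix(c(1,0.5,0, 0,1,0.5, 0,0,1), p, p)
E_toy <- eigen(var(Z_toy))
P_toy <- E_toy$vectors
# P is orthogonal: t(P) %*% P = I (hence trace(t(P) %*% P) = p)
round(crossprod(P_toy), 10)
# The variances of the principal components are the eigenvalues: C_Y = D_Lambda
Y_toy <- Z_toy %*% P_toy
round(var(Y_toy) - diag(E_toy$values), 10)
```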
# Simulated data

We start by simulating 3 cross-correlated Gaussian Random Functions (GRFs) with fixed spherical variograms, using the turning bands method (with 500 turning bands).

```{r Creating the Model and the Variables on Grid}
A <- matrix(c(1,0.8,0.2,0.8,1,0.7,0.2,0.7,1),ncol=3,nrow=3,byrow=T)
a.sill <- t(A) %*% A
a.model <- model.create(vartype=3,nvar=3,sill=a.sill,range=10)
a.model <- model.create(vartype=1,nvar=3,model=a.model,sill=diag(1:3)/2)
# Simulation of the model
nbtuba <- 500
a.grid <- db.create(flag.grid=T,x0=c(0,0),nx=c(100,100),dx=c(1,1))
a.grid <- simtub(dbout=a.grid,model=a.model,nbtuba=nbtuba,seed=4143)
```
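Note that defining the sill matrix as $A^t.A$ guarantees that it is a valid (positive semi-definite) matrix of sills. This can be checked quickly (a small additional sketch):

```{r Checking the sill matrix}
# All eigenvalues of t(A) %*% A are non-negative by construction
eigen(a.sill)$values
```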
We can check the consistency between the model and the variogram of the simulated data.

```{r Comparing empirical variograms and the Model}
plot(vario.grid(a.grid),title="Model of Raw Variables")
plot(a.model, add=T)
dev.off()
```

The variance matrix is consistent with the model.

```{r Model reproduction}
round(db.stat(a.grid, fun = "var", names = 4:6, flag.mono = F),2)
a.sill
```

# RGeostats for PCA & MAF, direct and back transformations

In this section, we check the consistency between the contents of the PCA object (and the operations performed with it) in RGeostats and the same calculations performed explicitly.

**RGeostats** provides two functions to define the objects storing the results of PCA and MAF calculations:

* pca.calc() to compute factors according to a PCA
* maf.calc() to compute MAFs according to a sequence of 2 PCAs

They both produce the following outputs:

* Means and standard deviations for each input variable
* A vector of $p$ eigenvalues. Their sum equals the sum of the variances of the input variables (in particular it is equal to $p$ when the input variables are standardized)
* A $p \times p$ matrix to transform Variables to Factors (**z2f**)
* A $p \times p$ matrix to back-transform Factors to Variables (**f2z**)

**RGeostats** also provides two functions, pca.z2f() and pca.f2z(), that can then be used to build the factors according to the projection matrix, and to back-transform the factors or some of their transformations (e.g. their kriging).

**RGeostats** also handles the possibility to add new variables which are linear combinations of the input variables, with db.add.CL(). This function allows building factors and/or MAFs, but its calculations need to be understood precisely.

![](MAF2.jpg){width=50% height=50%}

# PCA

The function **pca.calc** produces factors of the input variables. We will demonstrate that these variables are systematically standardized internally.

## Using pca.calc()

```{r Computing PCA with RGeostats}
a.pca <- pca.calc(a.grid)
```

## Performing the calculations explicitly

We can check that the results are the same, up to a possible change of sign per eigenvector.

```{r Extracting Data}
Z <- as.matrix(a.grid[,"Simu*"])
```

Compare calculations to contents of the PCA object:

- Means

```{r Mean comparison}
# Explicit calculation
means = colMeans(Z)
means
# Contents of the PCA object
a.pca$mean
```

- Standard deviations

Note that the variance calculation in RGeostats divides by N, whereas statisticians usually divide by N-1. This is the reason for the difference.

```{r Standard deviation comparison}
# Explicit calculation
sigmas = colStdevs(Z)
sigmas
# Contents of the PCA object
a.pca$sigma
N = dim(Z)[1]
sqrt(a.pca$sigma * a.pca$sigma * N / (N-1))
```

- Eigenvalues (sorted in decreasing order)

```{r Eigen values comparison}
# Explicit calculation
Z_s = scale(Z)
Eig = eigen(var(Z_s))
Eig
# Contents of the PCA object
a.pca$eigen
a.pca$pcaz2f
```

# Transformation from Variables to Factors

We continue the comparison.

## Using pca.z2f()

This function creates a new **Db** where the calculated factors have been added. They are extracted for comparison with the results of explicit calculations (*Y_1*).

```{r Transforming Variables into Factors using RGeostats}
a.grid_1 <- pca.z2f(a.grid,a.pca)
Y_1 = as.matrix(a.grid_1[,"Factor*"])
```

## Explicit computation

One can compute the factors explicitly (in *Y_2*).

```{r Transforming Variables into Factors using Explicit calculations}
P0 <- Eig$vectors
lambda0 <- Eig$values
# Building factors
Y_2 <- Z_s %*% P0
```

We compare the resulting factors. Note that the eigenvectors are determined up to their sign. Therefore we must correct for a possible difference of sign before comparing.

```{r Comparing Factors}
a.sign = P0 / a.pca$pcaz2f
a.sign = apply(a.sign,2,mean)
a.sign
Y_1 = sweep(Y_1,2,a.sign,FUN="*")
apply(Y_1 - Y_2,2,range)
```

Outputs are sufficiently close to assert that:

* **pca.calc()** standardizes the input data
* the **pcaz2f** matrix produced by pca.calc() is correct, with the eigenvectors stored in columns (as in *P0*)
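Since the eigenvectors stored in **pcaz2f** are orthonormal, one can also verify directly that this matrix is orthogonal (a small additional check; as.matrix() is used defensively in case the slot is not stored as a plain matrix):

```{r Checking orthogonality of pcaz2f}
# t(P) %*% P should be the identity matrix
round(crossprod(as.matrix(a.pca$pcaz2f)), 10)
```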
# Transforming from Factors to Variables

Moving back from the factors to the initial variables can be done both ways, either using pca.f2z() or making the computations explicit. For the explicit calculations, we right-multiply the factors by $P_0^t$, which produces standardized variables that must then be transformed back to raw units (de-scaling using the means and standard deviations calculated beforehand). On the other hand, applying pca.f2z() directly gives back the raw data.

```{r Graphic comparison}
a.grid_2 <- pca.f2z(a.grid_1,a.pca)
res.0 <- a.grid_2[,"Simu*"]
res.1 <- a.grid_2[,"Variable*"]
res.2 <- Y_2 %*% t(P0)
res.2 = sweep(res.2,2,sigmas,FUN="*")
res.2 = sweep(res.2,2,means,FUN="+")
plot(res.2[,1],res.1[,1],col=1,pch=20,cex=0.1,
     xlab="input data (Z)", ylab="Back-transformation of the factors",
     xlim=c(-6,6),asp=1,
     main="Comparison between Initial Variables and back-transformed Factors")
points(res.1[,1],res.0[,1],col=2,pch=20,cex=0.1)
abline(0,1)
```

## Another way to build the factors (or back-transform) using db.add.CL()

Factors can be obtained in another way, using the function db.add.CL() meant to add new variables that are linear combinations of existing variables. However, this function must be used carefully, as it must be applied:

* to the **standardized** version of the input data
* and using the **transposed** matrix of the eigenvectors.

As a matter of fact, the rationale of db.add.CL() is to generate $q$ new variables $Y$ from a set of $p$ variables $Z$ ($q$ ranging from 1 to possibly more than $p$). For consistency with the present note, the notations used in the function db.add.CL() are modified here. The input transformation matrix $A$ is $q \times p$, so that the new variables are obtained by a right multiplication by $A^t$: $Y=Z.A^t$.

This formulation is mathematically consistent in terms of matrix dimensions (see the toy sketch below):

* $Z$ is $n \times p$
* $A$ is $q \times p$ so that $A^t$ is $p \times q$
* and $Y$ is thus $(n \times p) \times (p \times q) = n \times q$
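As a toy illustration of these dimensions (a minimal sketch with arbitrary matrices, unrelated to the running example):

```{r Toy dimensions for db.add.CL}
n0 <- 5; p0 <- 3; q0 <- 2
Z0 <- matrix(rnorm(n0 * p0), n0, p0)
A0 <- matrix(rnorm(q0 * p0), q0, p0)  # one linear combination per ROW
Y0 <- Z0 %*% t(A0)
dim(Y0)  # n0 x q0
```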
So, comparing the two ways to compute factors:

* for **db.add.CL()**, the linear combinations are provided in ROWS, with an underlying programming point of view doing the computation row by row
* for **pca.z2f()**, the linear combinations are provided in COLUMNS, with an underlying explicit matrix multiplication point of view

In the particular case of PCA, this becomes tricky: to produce factors from the variables, one must feed db.add.CL() with $P^t$, which is in fact the matrix meant to back-transform the factors.

```{r Using db.add.CL}
# Standardization of the input variables
a.grid.std = db.normalize(a.grid,names="Simu*",oper="scal")
# Computing the factors with db.add.CL
a.grid_3 <- db.add.CL(a.grid.std, Y.names = "Norm*",
                      Z.names = paste("Y",1:3,sep="."),
                      mat.CL = t(a.pca$pcaz2f))
summary(a.grid_3$Y.1 - a.grid_1$Factor.1)
summary(a.grid_3$Y.2 - a.grid_1$Factor.2)
summary(a.grid_3$Y.3 - a.grid_1$Factor.3)
```

The above script allows checking that:

**pca.z2f(a.grid , a.pca) = db.add.CL(a.grid.std , t(a.pca$pcaz2f))**
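The same check can be summarized in a single number, the largest absolute deviation across the three factors (using only the column names created above):

```{r Compact check of the identity}
max(abs(cbind(a.grid_3$Y.1 - a.grid_1$Factor.1,
              a.grid_3$Y.2 - a.grid_1$Factor.2,
              a.grid_3$Y.3 - a.grid_1$Factor.3)))
```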
# Computing MAF

Computing MAFs follows the same logic, except that the function to use is maf.calc() instead of pca.calc(). Among other outputs, this function produces $P$, the matrix needed to transform the (standardized) input data into MAFs (see the graphical abstract). The distance arguments $h0$ and $dh$ define the interval around the reference distance used to compute the increments. This interval should not include 0. For gridded data, the reference distance is typically equal to one grid cell size.

```{r Computing MAF using RGeostats}
a.maf <- maf.calc(a.grid,h0=1,dh=0.01,dirvect=0,tolang=0)
```

## Explicit computation

As for PCA, an explicit computation is possible. However, the code available in Petitgas et al. (2020) is not suited for medium or large data sets. It has been re-written here so that it can run on large data sets, in order to check the computations in the specific case of a grid with east-west increments.

```{r Computing MAF using Explicit calculations}
# Standardizing factors
D_inv_sqrt_lambda0 <- diag(1/sqrt(lambda0))
Y_2_s <- Y_2 %*% D_inv_sqrt_lambda0
# Spatial increments
tmp <- db.add(a.grid,as.data.frame(Y_2_s))
tmp0 <- db.reduce(db.sel(tmp,x1<99),names="V*",flag.keep.grid=T)
tmp1 <- db.reduce(db.sel(tmp,x1>0),names="V*",flag.keep.grid=T)
tmp0 <- as.matrix(db.extract(tmp0,"V*"))
tmp1 <- as.matrix(db.extract(tmp1,"V*"))
dY_sh <- tmp1-tmp0
# PCA of the variance-covariance matrix of the increments
Eig <- eigen(var(dY_sh))
Ph <- Eig$vectors
lambdah <- Eig$values
# Ordering MAFs by increasing eigenvalues
tmp <- order(lambdah)
Ph <- Ph[,tmp]
lambdah <- lambdah[tmp]
```

We can now compare the results:

- Eigenvalues

```{r Comparing MAF Eigen values}
# Using maf calculations
a.maf$eigen
# Using explicit calculations
lambdah
```

- Eigenvectors

```{r Comparing MAF Eigen vectors}
# Using maf calculations
a.maf$pcaz2f
# Using explicit calculations
P <- P0 %*% D_inv_sqrt_lambda0 %*% Ph
P
```

# Performing the transformation

## Using pca.z2f()

```{r Performing transformation using RGeostats}
a.grid.xsi_1 <- pca.z2f(a.grid,a.maf,radix="MAF")
xsi_1 = db.extract(a.grid.xsi_1,"MAF*")
```

## Using explicit calculations

```{r Performing transformation using Explicit calculations}
xsi_2 <- Z_s %*% P
a.sign = P / a.maf$pcaz2f
a.sign = round(apply(a.sign,2,mean),0)
a.sign
xsi_2 = sweep(xsi_2,2,a.sign,FUN="*")
a.grid.xsi_2 = db.add(a.grid, as.data.frame(xsi_2))
```

We compare the results, paying attention to the sign of the factors.

```{r Comparing MAF factors}
apply(xsi_1 - xsi_2,2,range)
```

This check indicates that the sequence "maf.calc() %>% pca.z2f()" provides the MAFs of centered data, i.e. if the input data are not centered, the RGeostats functions impose their centering. As a matter of fact, to get the same outputs between RGeostats and an explicit computation of the MAFs, one must first center the data.

# MAF properties & verifications

One of the goals of MAFs is to get variables that are orthogonal at distances 0 and $h$. This can be checked by looking at the variograms of the set of MAFs.

```{r Maf models}
plot(vario.grid(a.grid.xsi_1),title="Model for MAF factors")
plot(vario.grid(a.grid.xsi_2),add=T)
```

Results are consistent with expectations:

* simple variograms become less and less structured from the first to the last MAF
* cross-variograms are 0 at the distance used for the computation of the MAFs. Note that this holds only for the specific distance and orientation chosen for the computation
* variograms are exactly the same using "maf.calc() %>% pca.z2f() %>% vario.grid()" or "explicit computation %>% vario.grid()", validating the outputs.
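Orthogonality at $h=0$ can also be checked directly on the variance-covariance matrix of the MAFs, which should be close to the identity matrix (up to the N versus N-1 variance convention noted earlier):

```{r Checking MAF orthogonality at distance zero}
# Covariance matrix of the MAFs at h = 0: expected to be ~ identity
round(var(as.matrix(xsi_1)), 3)
```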
```{r MAF representation}
plot(a.grid.xsi_1,name="MAF.1")
plot(a.grid.xsi_1,name="MAF.2")
plot(a.grid.xsi_1,name="MAF.3")
```

# Explicit decomposition of variables on MAFs

Caution! Due to the symmetry of the variance-covariance matrices:

* $P_0^t = P_0^{-1}$
* $P_h^t = P_h^{-1}$

But:

* $P^t \neq P^{-1}$

The transformation matrix provided by maf.calc() applies to the standardized version of the input data:

$$\chi=Z_s.P$$

and

$$Z_s = \chi.P^{-1}$$

Let us denote by $1$ an $n \times 1$ matrix of ones, by $m$ a $p \times 1$ matrix of the means of the input variables, and by $D_s$ the $p \times p$ diagonal matrix of their standard deviations. We then have:

$$Z = 1.m^t + \chi.P^{-1}.D_s = 1.m^t + \chi.C$$

where $C = P^{-1}.D_s$ is a $p \times p$ matrix whose columns contain the coefficients of the linear combinations we wanted to exhibit. In terms of RGeostats outputs, we have:

* maf.calc$pcaz2f = $P$
* maf.calc$pcaf2z = $P^{-1}$

So the coefficients of the linear combinations of the variables on the MAFs are obtained by multiplying each column of the matrix maf.calc$pcaf2z by the standard deviation of the corresponding input variable.
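This decomposition can be verified numerically. The sketch below assumes that the MAF object stores the means and standard deviations of the input variables in a.maf$mean and a.maf$sigma, like the PCA object used earlier:

```{r Verifying the decomposition on MAFs}
# Coefficient matrix C = P^{-1}.D_s: each column of pcaf2z times the corresponding sigma
C <- sweep(as.matrix(a.maf$pcaf2z), 2, a.maf$sigma, FUN="*")
# Reconstruction Z = 1.m^t + chi.C (should match the raw data up to numerical noise)
Z_rec <- sweep(as.matrix(xsi_1) %*% C, 2, a.maf$mean, FUN="+")
range(Z - Z_rec)
```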