--- title: "Min Max Autocorrelation Factor" author: "Nicolas Bez" date: "17 avril 2020" output: rmarkdown::html_vignette: fig_width: 5 fig_height: 4 vignette: > %\VignetteIndexEntry{Min Max Autocorrelation Factor} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r Loading_library,include=FALSE} rm(list = ls()) library(RGeostats) library(resample) knitr::opts_chunk$set(echo = TRUE, fig.align="center") constant.define("asp",1) set.seed(32432) ``` This note describes how to build and use MAF. Basically, the objective is to transform a set of variables $Z(x)={Z_{1}(x), ..., Z_{p}(x)}$ into another set of variables $\chi(x)={\chi_1(x),...,\chi_p(x)}$ by a linear transformation where the new variables are orthogonal both locally (i.e. at $h=0$ for first PCA), and at a given distance $h$ (second PCA). This is based on a sequence of two PCAs both of which standardize internally the input variables. # Reminder on matrices, diagonalisation, etc Caution is required when manipulating and multiplying matrices, which are not systematically symmetrical, even if they are square. A vector of size $n$ is considered as a one-column matrix with dimension $n \times 1$. A dataframe $Z$ with $p$ variables measured at $n$ samples is a $n \times p$ matrix, and its variance-covariance matrix $C_Z$ is $p \times p$. A PCA is nothing but the diagonalization of the variance-covariance matrix. Let $P$ be the $p \times p$ matrix of the eigen vectors (where the eigen vectors constitute the columns of the matrix), and $D_\Lambda$ be a $p \times p$ diagonal matrix of the eigen values, we have that: $$C_Z =P.D_\Lambda.P^{-1}$$ In the **particular case** of the diagonalization of a **symmetrical matrix** ($C_Z$ being a variance-covariance matrix, it is symmetrical), this leads to: $$C_Z =P.D_\Lambda.P^{t}$$ with $P^t=P^{-1}$ and $P^t.P=I$. In particular, $trace(P)=p$. The $n \times p$ matrix of the **principal components** $Y$ is obtained by right multiplying $Z$ by the $P$: $$Y = Z.P$$ Because $C_Z$ is a particular matrix (i.e. it a variance-covariance matrix) the eigen values have a particular meaning: they correspond to the variances of the principal components: $$C_Y = D_\Lambda$$ If the eigen values are ordered by decreasing or increasing order, this must be done in a coherent manner for all the matrices. **The key message is that:** * the **eigen vectors** are stored as columns of the $p \times p$ matrix $P$ * the **principal components** or **factors** are obtained by a **right multiplication** : $Y=Z.P$ # Graphical abstract of MAF The graphical abstract of a MAF underlines the steps of MAF computations, describes the standardization steps and the fact that the second PCA consists in the diagonalization of the variance-covariance matrix of the spatial increments of the factors which is applied of the factors themselves (and not on the increments). ![](MAF.jpg) # Simulated data We start by simulating 3 cross-correlated Gaussian Random Functions (GRFs) with some fixed spherical variograms using turning band method (with 500 turning bands). ```{r Creating the Model and the Variables on Grid} A <- matrix(c(1,0.8,0.2,0.8,1,0.7,0.2,0.7,1),ncol=3,nrow=3,byrow=T) a.sill <- t(A) %*% A a.model <- model.create(vartype=3,nvar=3,sill=a.sill,range=10) a.model <- model.create(vartype=1,nvar=3,model=a.model,sill=diag(1:3)/2) # Simulation du modele nbtuba = 500 a.grid <- db.create(flag.grid=T,x0=c(0,0),nx=c(100,100),dx=c(1,1)) a.grid <- simtub(dbout=a.grid,model=a.model,nbtuba = nbtuba,seed=4143) ``` We can check the consistency between the model and the variogram of the simulated data. ```{r Comparing empirical variograms and the Model} plot(vario.grid(a.grid),title="Model of Raw Variables") plot(a.model, add=T) dev.off() ``` The variance matrix is consistent with the model. ```{r Model reproduction} round(db.stat(a.grid, fun = "var", names = 4:6, flag.mono = F),2) a.sill ``` # RGeostats for PCA & MAF, direct and back transformations In this paragraph, we will check the consistency between the contents of the PCA object or the operation performed using this object with RGeostats and the calculations that can be performed explicitly **RGeostats** provides two functions to define the objects storing the results of PCA and MAF calculations: * pca.calc() to compute factors according to a PCA * maf.calc() to compute MAF according to a sequence of 2 PCAs They both produce the following outputs : * Means and Standard deviations for each input variables * Vectors of $p$ eigen values. Their sum equals the sum of the variances of the input variables (in particular it is equal to $p$ in case the input variable are standardized) * A $p \times p$ matrix to transform Variables to Factors (**z2f**) * A $p \times p$ matrix to back-transform Factors to Variables (**f2z**) **RGeostats** also provides two functions pca.z2f() and pca.f2z() can then be used to build the factors according to the matrix projection, and to back transform the factors or some of their transformations e.g. their kriging. **RGeostats** also handles the possibility to add new variables which are linear combinations of the input variables db.add.CL(). This function allows building factors and/or MAFs but its calculations need to be understood precisely. ![](MAF2.jpg){width=50% height=50%} # PCA The function **pca.calc** produces factors of the input variables. We will demonstrate that these variables are systematically standardized internally. ## Using pca.calc() ```{r Computing PCA with RGeostats} a.pca <- pca.calc(a.grid) ``` we can check that the vectors are the same, up to a possible change of sign per vector. ## Performing the calculations explicitly ```{r Extracting Data} Z <- as.matrix(a.grid[,"Simu*"]) ``` Compare calculations to contents of the PCA object: - Means ```{r Mean comparison} # Explicit calculation means = colMeans(Z) means # Contents of the PCA object a.pca$mean ``` - Standard deviations Note the variance calculation in RGeostats uses the division by N, whereas statisticians like to divide by N-1. This is the reason of the difference ```{r Standard deviation comparison} # Explicit calculation sigmas = colStdevs(Z) sigmas # Contents of the PCA object a.pca$sigma N = dim(Z)[1] sqrt(a.pca$sigma * a.pca$sigma * N / (N-1)) ``` - Eigen values (sorted in decreasing order) ```{r Eigen values comparison} # Explicit calculation Z_s = scale(Z) Eig = eigen(var(Z_s)) Eig # Contents of the PCA object a.pca$eigen a.pca$pcaz2f ``` # Transformation from Variables to Factors We continue the comparison ## Using pca.z2f() This function creates a new **Db** where the calculated factors have been added. They are extracted for comparison with the results of explicit calculations (*Y_1*) ```{r Transforming Variables into Factors using RGeostats} a.grid_1 <- pca.z2f(a.grid,a.pca) Y_1 = as.matrix(a.grid_1[,"Factor*"]) ``` ## Explicit computation One can compute the factors explicitly (in *Y_2*). ```{r Transforming Variables into Factors using Explicit calculations} P0 <- Eig$vectors lambda0 <- Eig$values # Building factors Y_2 <- Z_s %*% P0 ``` We compare the resulting factors. Note that the eigen factors are determined up to their sign. Therefore we must correct for the difference of sign before comparing. ```{r Comparing Factors} a.sign = P0 / a.pca$pcaz2f a.sign = apply(a.sign,2,mean) a.sign Y_1 = sweep(Y_1,2,a.sign,FUN="*") apply(Y_1 - Y_2,2,range) ``` Outputs are the sufficiently close to assert that: * **pca.calc()** standardizes the input data * the **pcaz2f** matrix produced by pca.calc() is correct with the eigen vectors in columns (as in *P0*). # Transforming from Factors to Variables Moving back from the factors to initial variables is done is both ways, either using pca.f2z() or making explicit computations. For explicit calculations, we perform the product $t(P0)$ on the factors which produces standardized variables, that must be transformed to the raw units afterwards (de-scaling using the means and standard deviation calculated beforehand). On the other hand, applying pca.f2z() directly gives back the raw data. ```{r Graphic comparison} a.grid_2 <- pca.f2z(a.grid_1,a.pca) res.0 <- a.grid_2[,"Simu*"] res.1 <- a.grid_2[,"Variable*"] res.2 <- Y_2 %*% t(P0) res.2 = sweep(res.2,2,sigmas,FUN="*") res.2 = sweep(res.2,2,means,FUN="+") plot(res.2[,1],res.1[,1],col=1,pch=20,cex=0.1, xlab ="input data (Z)", ylab="Back-transformation of the factors",xlim=c(-6,6),asp=1, main="Comparison between Initial Variables and back-transformed Factors") points(res.1[,1],res.0[,1],col=2,pch=20,cex=0.1) abline(0,1) ``` ## Another way to build the factors (or back transform) using db.add.CL() Factors can be obtained by another way. One can use the function db.add.CL() meant to add new variables that are Linear Combinations of existing variables. However, this must be used carefully as this must be applied: * to the **standardized** version of the input data * and using the **transposed** matrix of the eigen vectors. As a matter of fact, the rational of db.add.CL() is to generates $q$ new variables $Y$ from a set of $p$ variables $Z$ ($q$ being anything from 1 to larger than $p$). For consistency with the present note, the notations used in the function db.add.CL are modified. The input matrix of transformation $A$ is $q \times p$ so that the new variables are obtained by a right multiplication of $A^t$: $Y=Z.A^t$. This formulation is mathematically consistent in terms of matrix dimensions: * $Z$ is $n \times p$ * $A$ is $q \times p$ so that $A^t$ is $p \times q$ * and $Y$ is thus $(n \times p) \times (p \times q) = n \times q$ So, comparing the two ways to compute factors: * for **db.add.CL**, the linear combinations are provided in ROWS with an underlying programming point of view doing computation row by row * for **pca.z2f**, the linear combinations are provided in COLUMNS with an underlying explicit matrix multiplication point of view In the particular case of PCA, this becomes tricky as to produce factors from the variables one must feed db.add.CL() with $P^t$ which is meant indeed to back-transform the factors. ```{r Using db.add.CL} #standardisation of the input variables a.grid.std = db.normalize(a.grid,names="Simu*",oper="scal") # Computing the factors with db.add.CL a.grid_3 <- db.add.CL(a.grid.std, Y.names = "Norm*", Z.names = paste("Y",1:3,sep="."), mat.CL=t(a.pca$pcaz2f)) summary(a.grid_3$Y.1 - a.grid_1$Factor.1) summary(a.grid_3$Y.2 - a.grid_1$Factor.2) summary(a.grid_3$Y.3 - a.grid_1$Factor.3) ``` The above script allows to check that