[SOLVED] Cross validation


Postby Jeffrey Yarus » Tue Nov 19, 2019 1:03 am

I am having trouble understanding how to perform cross validation (xvalid) in RGeostats. I would like to get some guidance on setting up the procedure and then doing the following:

1. Plot a histogram of the estimated and standard error values
2. Cross-plot the error variance against the estimated values
3. Map the error variance (standard deviation)

I have looked at the vignettes and I am getting a sense of what to do... I'm stumbling at the following steps:

Plotting the locations of a particular variable per the vignette:

YOUR CODE:

#Modifying the variable role

It is now time to introduce the role that each field must play: this is defined using its locator. For example, we can check that the fields 2 and 3 are respectively the first and second coordinates: the locators are x1 and x2. Similarly if we consider Zn as the variable of interest, we set its locator to z1. The variable Pb is kept unclassified with its locator equal to NA. Setting this locator is performed using the following command:


```{r }
data.db = db.locate(data.db,"Zn","z",1)

```
MY CODE:

```{r }
data.db_wt2 <- db.locate(data.db_wt,"P_PHI","z",1)

```
What I don't understand is how to designate my x and y locator variables. In my data set they are fields 4 and 5, not 1 and 2 as they are in yours.
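If I understand the vignette correctly, locators are assigned by field name rather than by field position, so something like the following should work (this is a guess on my part, using the field names from my Db listing below):

```{r}
# Guess: designate fields 4 and 5 as the first and second coordinates, by name
data.db_wt2 <- db.locate(data.db_wt2, "C_X_ft", "x", 1)  # becomes locator x1
data.db_wt2 <- db.locate(data.db_wt2, "C_Y_ft", "x", 2)  # becomes locator x2
```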

THEN, WHEN I PLOT:

```{r}
plot(data.db_wt2,pch = 21,bg = "red",col = "black",
title = "PHI Samples location")
```
I GET AN ERROR (looks like it cannot find X or Y coordinates???)

Error: Coordinates (1) incompatible with Space dimension (0)
5. base::stop(..., call. = FALSE)
4. stop0("Error: Coordinates (", pos.x, ") incompatible with Space dimension (", ndim, ")")
3. db.plot(x, ...)
2. plot(data.db_wt2, pch = 21, bg = "red", col = "black", title = "Zn Samples location")
1. plot(data.db_wt2, pch = 21, bg = "red", col = "black", title = "Zn Samples location")


MY DATA BASE CHARACTERISTICS ARE:
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension = 0
Number of fields = 11
Maximum Number of attributes = 11
Total number of samples = 262

Variables
---------
Field = 1 - Name = rank - Locator = rank
Field = 2 - Name = F_Top - Locator = NA
Field = 3 - Name = N_Well - Locator = NA
Field = 4 - Name = C_X_ft - Locator = NA
Field = 5 - Name = C_Y_ft - Locator = NA
Field = 6 - Name = L_3_FACIES - Locator = NA
Field = 7 - Name = P_Delta_t - Locator = NA
Field = 8 - Name = P_KH_md - Locator = NA
Field = 9 - Name = P_PHI - Locator = NA
Field = 10 - Name = P_Thickness - Locator = NA
Field = 11 - Name = P_Top_ft - Locator = NA


MOVING ON....

CREATING A UNIQUE NEIGHBORHOOD:


I can create the unique neighborhood and the moving neighborhood as below:

```{r }
WT_2D_unique.neigh <- neigh.create(ndim = 2, type = 0)

```

```{r}
WT_2D_unique.neigh

```

is.na() applied to non-(list or vector) of type 'S4'
Neighborhood characteristics
============================
Unique neighborhood option
Space dimension = 2

CREATING A MOVING NEIGHBORHOOD

```{r}
WT_2D_moving.neigh =
neigh.create(ndim = 0,nmaxi = 40000,radius = 5000,
flag.sector = TRUE,nsect = 8 ,nsmax = 2)

```
But then get stuck here:

```{r}
data.db_wt3 <- xvalid(data.db_wt2,data.model.omni,
WT_2D_moving.neigh)
```
Error message:
is.na() applied to non-(list or vector) of type 'S4'
The function is interrupted

I think this may be because I created my variograms in S3, in a data frame, not S4 in a db... so I will try that next...

Any other suggestions would be helpful!!
Jeffrey Yarus
 
Posts: 48
Joined: Wed Jun 26, 2019 9:39 pm

Re: Cross validation

Postby Fabien Ors » Wed Nov 20, 2019 9:59 am

Dear Jeffrey,
Didier has cooked a Rmd file for answering your question.
Please find it here: http://rgeostats.free.fr/doc/Examples/xvalid.Rmd
Hope this helps.
Best regards.
Fabien
Fabien Ors
Administrateur du site
 
Posts: 226
Joined: Thu Sep 20, 2012 1:07 pm

Re: Cross validation

Postby Jeffrey Yarus » Fri Nov 22, 2019 8:12 pm

Ok... but I get an error when I run your code:

```{r}
data = db.locerase(data,"z")
data = db.locate(data,"data","z")
draw.xvalid(mode=1,data)
draw.xvalid(mode=2,data)
draw.xvalid(mode=3,data)
draw.xvalid(mode=4,data)
```

NAs introduced by coercion
Error in draw.xvalid(mode = 1, data) :
  could not find function "draw.xvalid"

Also, I managed to get my code running, although I do not use the draw.xvalid syntax you use. However, I am unable to 'knit' to PDF. Here is my code; it runs, but does not knit:

---
title: "Script8"
author: "Jeffrey Yarus"
date: "11/18/2019"
output:
  pdf_document:
    highlight: tango
    latex_engine: xelatex
    number_sections: yes
    toc: yes
    toc_depth: 6
  word_document:
    toc: yes
    toc_depth: '6'
  html_document: default
---

```{r, include=FALSE}
options(tinytex.verbose = TRUE)
```

#Libraries
A good idea is to move all the libraries here. Once they are loaded, you do not have to load them again.
```{r library1}
library(ggplot2)
library(PerformanceAnalytics)
library(GGally)
library(corrplot)
library(knitr)
library(RGeostats)
library(tinytex)
```


#This is script 8. It uses some special functions but requires using a database (db) instead of a data frame. Your job is to add script 8 to the code we have been working on.
I have described each step in this code in detail. In the end, you will learn different ways of creating some of the same things you have already done, and some new things as well. Most important, you will learn how to perform cross validation!

This first chunk converts your data table to a data base (db).

```{r Create db}
library(RGeostats)
data.df_wt <- read.csv("../data/WT_2D_0612.csv")
data.db_wt <- db.create(data.df_wt)
print(data.db_wt)
```


```{r Single info}
print(data.db_wt,flag.stats = TRUE, names = 6)

```

#The following chunk sets up the ability to plot the location of the PHI variable as a bubble map. To do this, the key variables have to be identified in the db. The variable of interest, PHI, is assigned a locator value of "z1". The coordinates are specified as "x1" and "x2". In each case, the first letter (either z or x) is the locator id. The numbers that follow are the order of the particular property. For example, "z" is the only property I want to build a map of (in this case, PHI), so z1 refers to PHI. "x" is a coordinate, and there are 2 of them, so we have "x1" and "x2". "x1" refers to the x coordinate, and "x2" refers to the y coordinate.


```{r Set Z}
data.db_wt2 <- db.locate(data.db_wt,"P_PHI","z",1)

```

```{r Set X}
data.db_wt3 <- db.locate(data.db_wt2,"C_X_ft","x",1)
```

```{r }
data.db_wt4 <- db.locate(data.db_wt3,"C_Y_ft","x",2)
```


```{r Check db}
data.db_wt4
```

```{r}
plot(data.db_wt4,pch = 21,bg = "red",col = "black",xlab = "X (UTM)", ylab =
"Y (UTM)", xlim = c(-20000, 40000),
title = "PHI Locations")
```

#Performing a selection

This is one way to mitigate outliers. You can see an outlier in the above map. We want to create a selection which suppresses the largest values of the variable P_PHI (which lies between .03 and .18): say, keep only values less than .20.

```{r Selection}
data.db_wt5 <- db.sel(data.db_wt4,P_PHI < 0.20)
```

#Now we can plot the stats for the data set prior to the selection...


```{r Print PRE_Sel}
print(data.db_wt4,flag.stats = T,names = "P_PHI")
```

#... and then the stats after the selection.... Note the standard deviation and the variance....


```{r Print Post_Sel}
print(data.db_wt5,flag.stats = T,names = "P_PHI")
```
Now, plot the map without the outlier....

```{r Plot Final_Property}
plot(data.db_wt5,pch = 21,bg = "red",col = "black",xlab = "X (UTM)", ylab =
"Y (UTM)", xlim = c(-20000, 40000),
title = "PHI Locations")
```


```{r Transform}
data.db_wt6 <- db.add(data.db_wt5,P_PHI.transformed = P_PHI - 0.081)
```

```{r Print Transform}
print(data.db_wt6,flag.print = TRUE)
```

Statistical inference:

We now wish to calculate the experimental variogram on the target variable P_PHI using the previously defined selection, with different options, before we define the characteristics of the structural model.

Calculating the experimental variogram

The first task is to calculate 10 lags of the omni-directional variogram established for a lag distance of 5000 ft:


```{r}
plot(data.db_wt6,pch = 21,bg = "red",col = "black",xlab = "X (UTM)", ylab =
"Y (UTM)", xlim = c(-20000, 40000),title = "PHI Transform Locations")
```

```{r Calc Vario1}
data.vario1 <- vario.calc(data.db_wt5,lag = 5000,nlag = 10)
#This command produces a variogram file (stored in the object data.vario which
#belongs to the vario class). Its contents can be printed by typing the name of
#the variogram file:
```

#Visualizing the experimental variogram

We can visualize the contents of the experimental variogram (using the generic plot function) where each lag is represented by a symbol proportional to the number of pairs. The number of pairs is also displayed. This is obtained by typing:


```{r Plot Vario1}
plot(data.vario1,npairdw = TRUE,npairpt = 1,title =
"Experimental Variogram for PHI")

```

```{r Vario2}
data.4dir.vario <-
vario.calc(data.db_wt5,lag = 5000,nlag = 10,dir = c(0,45,90,135))
```

```{r Plot vario2}
plot(data.4dir.vario,title="Directional variograms")
```



```{r}
plot(data.4dir.vario,idir0=2,
title = "Directional variogram #2 for P_PHI")

```
#The experimental variogram calculated in 4 directions does show a significant anisotropy. However, for this exercise, it is sufficient to fit the omni-directional variogram.

#Fitting the model

As no anisotropy has been retained, the omni-directional variogram is used and fitted with a model. The model is fitted using an automatic procedure, and ultimately stored in the specific model file called data.model (belonging to the model class):

```{r Fit Vario}
data.model_omni <-
model.auto(data.vario1,
struct = c("Spherical","Exponential"),
title = "Modeling omni-directional variogram for P_PHI")

```

```{r Neighborhood 1}
WT_2D_unique.neigh <- neigh.create(ndim = 2, type = 0)

```


```{r Neighborhood 2}
WT_2D_unique.neigh

```

```{r}
WT_2D_moving.neigh =
neigh.create(ndim = 2,nmaxi = 20,radius = 2000,
flag.sector = TRUE,nsect = 8 ,nsmax = 2)

```

#create the variogram in the new DB! then try this.....

```{r Cross Valid}
data.db_wt7 <- xvalid(data.db_wt5,data.model_omni,
WT_2D_moving.neigh)

```

```{r Print db7}
data.db_wt7
```
#Plot a histogram of the estimation error (difference between the actual values and the estimated values).

```{r Histogram 1}
hist(db.extract(data.db_wt7,"Xvalid.P_PHI.*esterr"), nclass = 30,
main = "Histogram of Error",
xlab = "Cross-validation error", col = "blue")

```
# Now plot the normalized cross validation error for variable P_PHI

```{r Histogram 2}
hist(db.extract(data.db_wt7,"Xvalid.P_PHI.stderr"), nclass = 30,
main = "Histogram of Standard error",
xlab = "Cross-validation error", col = "red")

```
Jeffrey Yarus
 
Posts: 48
Joined: Wed Jun 26, 2019 9:39 pm

Re: Cross validation

Postby Jeffrey Yarus » Fri Nov 22, 2019 10:26 pm

In the code Didier sent me, the function 'draw.xvalid' is not recognized.

```{r}
data = db.locerase(data,"z")
data = db.locate(data,"data","z")
draw.xvalid(mode=1,data)
draw.xvalid(mode=2,data)
draw.xvalid(mode=3,data)
draw.xvalid(mode=4,data)
```
NAs introduced by coercion
Error in draw.xvalid(mode = 1, data) :
  could not find function "draw.xvalid"
Jeffrey Yarus
 
Posts: 48
Joined: Wed Jun 26, 2019 9:39 pm

Re: Cross validation

Postby Didier Renard » Sat Nov 23, 2019 5:19 pm

Hi Jef

This type of message tells you that the function draw.xvalid() does not exist ... in your version.

As you can check in the CHANGELOG file (available on the site), this function was added in version 11.2.10. It serves to easily represent the traditional outputs of the cross-validation function.
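A quick way to check whether the installed version is recent enough (a sketch; it uses base R's packageVersion, which compares version strings numerically):

```{r}
# draw.xvalid() is only available from RGeostats 11.2.10 onward
packageVersion("RGeostats") >= "11.2.10"
```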

The solution: download a more recent version (from the same site).... and the problem will be solved.

Didier
Didier Renard
 
Posts: 337
Joined: Thu Sep 20, 2012 4:22 pm

Re: Cross validation

Postby Jeffrey Yarus » Mon Nov 25, 2019 11:54 pm

I want to verify that I am proceeding in the right direction. I've translated your script to fit my data. It mostly runs, but I do get an error at the end, during the draw.xvalid(mode=4) call.

Also, I'm not 100% sure I understand some of the details. One of the confusions I have is your use of 'data'. It seems you use it in two ways: one as the name of your db that you overwrite at various steps, and the other to designate the variable selected for analysis. It gets a bit tricky, as you have 'cooked' your demo for me using a db and data that you created on the fly from a random variable set. I actually read in some real data, and I tend to increment my db name at various steps, just for tracking and to be careful not to make mistakes (I am not a very good programmer!). In any case, I knitted my results and attached them to this email. If you can take a few minutes to review them, I'd appreciate it, as I am unsure if I did everything correctly. There is one issue I could not resolve at the end... not sure what to do there... and I am not sure I understand the chunk below (see my knitted doc). It seems like I cannot preserve the first half if I run the second half....

I appreciate your help. I am attaching everything to the email....

Amicalement,

Jeffrey
Jeffrey Yarus
 
Posts: 48
Joined: Wed Jun 26, 2019 9:39 pm

Re: Cross validation

Postby Didier Renard » Tue Nov 26, 2019 10:53 am

Hi Jef again,
Reading your LONG Rmd file, I think that your main problem is understanding the philosophy of the functions developed in RGeostats, as well as that of the locators.
1) The functions that we develop and which concern the Dbs (say foo) can always be considered as filters, i.e.:
a2 = foo(a1)
a2 [output] is a copy of a1 [input], to which some new variables have been added by function foo.
You can use the same name for input and output, so there is no need to change the name at each step of your Rmd (as you did for data.db_wt1, data.db_wt2, ...).

2) The locator is a qualifier assigned to a variable (or a group of variables) which defines the ROLE of these variable(s) for the next steps. This is an important choice: for example, setting several Z-locators (the locator "z" designates the variable of interest) will route the subsequent operations towards multivariate procedures, whereas having only one Z-locator defined will launch monovariate procedures.
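As a small illustration of this point (a sketch; the field names are taken from Jeffrey's Db listing):

```{r}
# One Z-locator: subsequent operations run in monovariate mode
db <- db.locate(db, "P_PHI", "z", 1)
# Adding a second Z-locator routes subsequent operations to multivariate mode
db <- db.locate(db, "P_KH_md", "z", 2)
```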

3) Some (should I say MOST) procedures define NEW locators in the output Db. For example, the kriging procedure (kriging) has the following filter prototype:
b = foo(a,b,...)
where "a" stands for the input Db (only considered as input) and "b" stands for the target Db, which is used as input and produced as output.
In the "b" used as input, there may be no Z-locatorized variable. In the "b" considered as output, the newly produced variables (i.e. kriging estimate and kriging stdev) are Z-locatorized in turn.
The reason for this internal locator manipulation is that you can hook the kriging step to a plotting step, which by default will plot the estimation result (the first Z-locatorized variable available in "b").

4) The problem is slightly more difficult for the XVALID function. As a matter of fact, this function, seen as a filter, is written:
a = foo(a,...)
as the output Db file has the same structure as the input Db. As a consequence, one or several target variables must have been designated in the input "a". The XVALID function, while preparing the output "a", first cancels the role of these target variables and sets the Z-locator to the resulting variables, i.e. the error and standardized error (for flag.xvalid==1) or the estimate and stdev (for flag.xvalid==2).

Then if you wish to run several xvalid steps on the same file, you must do the following operations:
a) define the target variable (use db.locate function)
b) run XVALID which will set the Z_locator to the resulting variables
c) cancel all Z_locator definition (use db.locerase function)
d) define the target variable again (same as a)
e) run XVALID again...
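In R, the steps above can be sketched as follows (assuming a Db 'data', a fitted 'model' and a neighborhood 'neigh' already exist; the target variable name is illustrative):

```{r}
# a) define the target variable
data <- db.locate(data, "P_PHI", "z")
# b) run XVALID: it sets the Z-locator to the resulting variables
data <- xvalid(data, model, neigh, flag.xvalid = 1)
# c) cancel all Z-locator definitions
data <- db.locerase(data, "z")
# d) define the target variable again
data <- db.locate(data, "P_PHI", "z")
# e) run XVALID again, with the other option
data <- xvalid(data, model, neigh, flag.xvalid = 2)
```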

Hoping this clarifies your problem
Didier Renard
 
Posts: 337
Joined: Thu Sep 20, 2012 4:22 pm

Re: Cross validation

Postby Jeffrey Yarus » Tue Nov 26, 2019 3:44 pm

Hi Didier...

Thanks for reading my LONG Rmd file! Yes, I know it was long, but I wanted to be sure you saw exactly what I was doing. Interestingly enough, here at CASE the expertise in R is excellent, but most people seem less familiar with S4 functions and the use of db's instead of df's, so it is difficult for me to find good coaching when using RGeostats.

I get the idea of being able to overwrite the db's with the same name, but there does seem to be some value (at least to me) in being able to see what takes place at every evolution. In any case, I don't think that incrementing the db name (foo, foo1, foo2, etc.) is a problem. I will read through your comments in detail today to be sure I am clear.

On the topic of xvalid, yes, that is a bit more confusing to me. I still don't understand why the final draw.xvalid(mode=4, data.db_wt5) failed. If you could look at that for me and let me know what I need to do to fix it, that would be great. The parts of the code I really don't yet fully understand are:

```{r}
data.db_wt5 <- db.locerase(data.db_wt4,"z")   # Here, it looks like I am erasing the locator information for z...
data.db_wt5

data.db_wt5 <- db.locate(data.db_wt5,"9","z") # Here it looks like I am ensuring that field '9' (P_PHI) is stored in 'z'
data.db_wt5

data.db_wt5 <- xvalid(data.db_wt5,model,neigh,flag.xvalid = 2) # Here it looks like I am choosing option 2 (Z* and normalized estimation error)
# HERE I SHOULD BE CREATING THE VARIABLES Z* AND THE STANDARDIZED ERROR
data.db_wt5
data.db_wt5 <- db.locerase(data.db_wt5,"z")   # Here it looks like I am erasing the locator variable (now P_PHI)
data.db_wt5
# I AM ASSUMING HERE I CONTINUE WITH THE SAME VARIABLE, ALTHOUGH I SHOULD NOW BE
# CREATING THE VARIABLES FOR Z*-Z AND THE NORMALIZED ERROR
data.db_wt5 <- db.locate(data.db_wt5,"P_PHI","z") # Here it looks like I am putting P_PHI BACK as the z locator. THIS IS CONFUSING
data.db_wt5
data.db_wt5 <- xvalid(data.db_wt5,model,neigh,flag.xvalid = 1) # Here it looks like I am now using option 1 to calculate Z*-Z and the estimation error...
data.db_wt5
```

```{r}
data.db_wt5
```

#=========================================================================================================================
#THE FIRST TWO LINES ARE THE SAME AS IN THE PREVIOUS CODE. WHAT IS DIFFERENT IS THE 'dum', OR DUMP. I AM NOT QUITE SURE WHAT I SHOULD BE LOOKING FOR HERE...

```{r Cross-validation of the first sample, eval=FALSE}
debug.reference(1)
data.db_wt5 <- db.locerase(data.db_wt5,"z")        # Here, the first 2 lines are the same: erase z (P_PHI) as the locator variable, then...
data.db_wt5 <- db.locate(data.db_wt5,"9","z")      # assign P_PHI as the 'z' locator variable ...AGAIN, CONFUSING
dum <- xvalid(data.db_wt5,model,neigh,flag.xvalid=1) # Here, we are performing a 'dump'
```
#==============================

WHAT IS CONFUSING TO ME IS THAT I DON'T SEE ANY ASSIGNMENT OF A Z1, Z2, ETC... NOR DO I SEE WHERE Z*, Z*-Z, OR THE ESTIMATION ERROR ARE BEING CALLED, UNLESS THAT IS BUILT INTO THE CODE DURING THE draw.xvalid COMMANDS....
Jeffrey Yarus
 
Posts: 48
Joined: Wed Jun 26, 2019 9:39 pm

Re: Cross validation

Postby Jeffrey Yarus » Tue Nov 26, 2019 10:11 pm

Hi Folks:

I'm still wrestling with the draw.xvalid (mode = 3, data.dbXX) and draw.xvalid (mode = 4, data.dbxx). The first two modes work fine! The next two fail. The error message is that "...Error: Identifying attribute called '*.estim' You must designate at least one attribute." When I check the data.dbxx, I can see that my attribute has a locator of z1:
Variables
---------
Field = 1 - Name = rank - Locator = rank
Field = 2 - Name = F_Top - Locator = NA
Field = 3 - Name = N_Well - Locator = NA
Field = 4 - Name = C_X_ft - Locator = x1
Field = 5 - Name = C_Y_ft - Locator = x2
Field = 6 - Name = L_3_FACIES - Locator = NA
Field = 7 - Name = P_Delta_t - Locator = NA
Field = 8 - Name = P_KH_md - Locator = NA
Field = 9 - Name = P_PHI - Locator = z1
Field = 10 - Name = P_Thickness - Locator = NA
Field = 11 - Name = P_Top_ft - Locator = NA
Field = 12 - Name = Xvalid.P_PHI.esterr - Locator = NA
Field = 13 - Name = Xvalid.P_PHI.stderr - Locator = NA

This basic organization matched Didier's who shows a locator value of z1 for DATA:

Variables
---------
Field = 1 - Name = rank - Locator = rank
Field = 2 - Name = x1 - Locator = x1
Field = 3 - Name = x2 - Locator = x2
Field = 4 - Name = data - Locator = z1
Field = 5 - Name = Xvalid.data.estim - Locator = NA
Field = 6 - Name = Xvalid.data.stdev - Locator = NA
Field = 7 - Name = Xvalid.data.esterr - Locator = NA
Field = 8 - Name = Xvalid.data.stderr - Locator = NA


Here is the traceback from my script:
The function is interrupted
Traceback: db.ident(db, name1, flag.one = TRUE)
Traceback: correlation(db, "*.estim", "*.stderr", title = "(Z-Z*)/S* vs. Z*", xlab = "", ylab = "", flag.aspoint = TRUE, size0 = 0.05, bg = "blue", ...)
Traceback: draw.xvalid(mode = 3, data.db_wt6)

Any clue as to what is going on?

Jeffrey
Jeffrey Yarus
 
Posts: 48
Joined: Wed Jun 26, 2019 9:39 pm

Re: Cross validation

Postby Didier Renard » Sat Nov 30, 2019 10:09 pm

Once more, the reason is obvious (even if the error message probably is not).
When you run the XVALID function with flag.xvalid=1, the function creates 2 output variables named XXX.esterr and XXX.stderr, while with flag.xvalid=2 it creates two variables named XXX.estim and XXX.stdev.

The graphic function draw.xvalid(), which should be run on the results of xvalid(), knows this convention. It offers 4 working modes (depending on the argument 'mode'), as mentioned in the documentation:
1 Base Map showing the absolute value of the Standardized Error in proportional symbol. Samples for which this value is larger than the threshold 'thresh' are represented in solid red points.
2 Histogram of the Standardized Error. Bins corresponding to values beyond the threshold are represented in red.
3 Scatter plot of the Standardized Error vs. the Estimation.
4 Scatter plot of the true data vs. the Estimation

The mode=1 option requires the presence of a variable containing "stderr" in its name (and therefore should be used after a call to xvalid with the option flag.xvalid=1).
The mode=2: Same
The mode=3 requires the presence of a variable containing "estim" and another one containing "stderr". This is tricky: it should be used after calling xvalid twice (with results in the same file), once with flag.xvalid=1 (which creates "stderr") and once with flag.xvalid=2 (which creates "estim").
The mode=4 requires the presence of a variable containing "estim" (and should therefore be called after a call to xvalid with the option flag.xvalid=2).
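Put together, a sketch of the sequence that makes all four modes usable on the same file (variable and object names are illustrative; 'model' and 'neigh' are assumed to exist already):

```{r}
data <- db.locate(data, "P_PHI", "z")
data <- xvalid(data, model, neigh, flag.xvalid = 1)  # creates *.esterr, *.stderr
data <- db.locerase(data, "z")
data <- db.locate(data, "P_PHI", "z")
data <- xvalid(data, model, neigh, flag.xvalid = 2)  # creates *.estim, *.stdev
for (m in 1:4) draw.xvalid(mode = m, data)           # all required variables now exist
```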

Hope this explains everything.
Didier Renard
 
Posts: 337
Joined: Thu Sep 20, 2012 4:22 pm

