Large data sets


Large data sets

Postby Jeffrey Yarus » Thu Mar 31, 2022 6:54 pm

Is there any restriction in the RGeostats code that prevents using very large data sets? We are thinking "big data." For example, we have a data set we want to run that has 300,000 wells and roughly 10 to 15 variables per well. Obviously, we are not using laptop computers. We have been running R and RGeostats on our High Performance Distributed Computing system, and we can assign the number of cores, the amount of memory, as well as the number of GPUs (and even TPUs). Given this, are there any internal limits in the RGeostats code that we need to know about? If for some reason there are, will there be a way to increase them?

Amicalement,
Jeffrey

Re: Large data sets

Postby Didier Renard » Fri Apr 01, 2022 6:51 pm

There are three limitations:
- the one based on the memory of your computer, but manipulating 300,000 * 15 values is not considered huge (you did not mention how many samples per well if the data are 3D)...
- the one based on the Geoslib library (sitting below RGeostats). The data base is constituted as an internal array stored in memory (to improve manipulation speed). Each piece of information, which is compulsorily numerical, is stored on 64 bits. Yes, I agree that this can be considered wasteful when dealing with class-type information (where the value does not exceed the number of lithologies, which is usually rather small), or even more so for a selection (i.e. 0 or 1) where a single bit would be enough; a rough footprint estimate is sketched below.
This improvement will be taken into account in gstlearn (the next version, an extension of Geoslib, written in C++ only).
- the RGeostats API. This is written in R, and R (which works in memory only) has some internal limitations. There is even an additional package that can be loaded to raise this limit.

Moreover, let me say that RGeostats, by the way it is written, copies the contents of the data base (stored in R) into Geoslib. Therefore, at a given point, the big data set will be present twice!
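
As a back-of-the-envelope illustration of the two points above (64 bits per stored value, and the data present twice once the R contents are copied into Geoslib), here is a minimal sketch in base R using the figures quoted in this thread; footprint_gib is just a throwaway helper, not an RGeostats function.

Code:
# Rough in-memory footprint, assuming 8 bytes (64 bits) per stored value and
# two coexisting copies (the R data and its Geoslib copy), as described above.
footprint_gib <- function(nwells, nsamples_per_well, nvars, copies = 2) {
  nvalues <- nwells * nsamples_per_well * nvars
  nvalues * 8 * copies / 1024^3                 # bytes -> GiB
}

footprint_gib(300000, 1, 15)   # 2D case (one sample per well): ~0.07 GiB, negligible
# A 0/1 selection stored as a 64-bit value costs 64 times the single bit that
# would suffice, which is the overhead mentioned above for selections.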

Concerning the use of CPUs or GPUs, this facility has not been used in Geoslib. It requires some (a large) adaptation of the code, which will be undertaken in gstlearn... certainly not in Geoslib (which is not written in C++ in terms of objects and is therefore difficult to parallelise).
The use of GPUs, although it was imagined once, has not been retained even for the next releases. It could simply be envisaged for some very specific cases of calculations. Moreover, using GPUs creates many more difficulties for portability. Bear in mind that we wish the new version (say gstlearn) to be "converted" into R or Python interfaces (using an automated conversion methodology provided by SWIG). I am not sure that the usage of GPUs would not create additional obstacles for the multi-platform aspect.

Hope this information will help. Fabien may add some extra remarks to my thoughts (if necessary).

Re: Large data sets

Postby Jeffrey Yarus » Fri Apr 08, 2022 9:24 pm

Hi Didier and Fabien: I would like to get Fabien's comments on this.
In my initial question, I failed to mention the number of samples per well. The number is 20,000 samples. So, 300,000 wells, 20,000 samples per well, and 15 variables per well. That works out to 9e10, or 90,000,000,000 values in total...
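
For scale, that figure and the corresponding memory footprint can be checked directly in R (simple arithmetic only, using the 64 bits per value and the factor-of-two copy Didier described above):

Code:
nvalues <- 300000 * 20000 * 15   # 9e10 values, as computed above
nvalues * 8 / 1e9                # ~720 GB for a single in-memory copy at 8 bytes per value
nvalues * 8 * 2 / 1e9            # ~1,440 GB once the R and Geoslib copies coexist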

My concern is that I just ran a data set with a total of 4,403 samples (it's 2D: 259 samples, 15 variables...). The grid was 300,000 cells (500 x 600). It took 45 minutes to run 25 conditional simulations. This data set is quite small, yet it takes a long time to run. So, the exercise is to determine whether the slowdown is due to R, to RGeostats, or to our HPC compute environment. If it is the latter, then it's a matter of making sure we request enough cores and memory. I don't think that is the problem, as we are running much larger data sets (non-geostatistical projects) with no real issues. If possible, can we arrange a Zoom meeting to discuss the details?
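
One way to separate an R/RGeostats cost from an environment problem is to time and profile a single simulation before scaling up. Below is a minimal sketch using only base R tools (system.time, Rprof, summaryRprof); run.one.simulation() is a hypothetical placeholder for the simulation call already in your script, not an RGeostats function.

Code:
# Hypothetical wrapper around the existing RGeostats simulation call from your
# script (the one that takes ~45 min for 25 conditional simulations).
run.one.simulation <- function() {
  # ... your existing RGeostats call goes here ...
}

Rprof("simu_profile.out")                      # start R's sampling profiler
elapsed <- system.time(run.one.simulation())   # wall-clock time of one simulation
Rprof(NULL)                                    # stop profiling
print(elapsed)
print(head(summaryRprof("simu_profile.out")$by.self, 10))   # hottest functions
# If one simulation takes about 45/25 = 1.8 minutes and the profile is dominated
# by the simulation/neighborhood routines, the cost is algorithmic rather than
# an HPC environment issue.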

Jeffrey
Jeffrey

Re: Large data sets

Postby Jeffrey Yarus » Wed Apr 20, 2022 11:09 pm

I know you folks are very busy, but I would like to better understand any limits to running very large data sets with RGeostats. As I mentioned, the data set we are looking at has 400,000 wells with 10 variables in each well. This is only a 2D data set (this time), but we find that running only 3,000 wells with 10 variables causes the code to freeze when trying to compute the variogram, let alone the kriging. Some of my students are running off to use gstat! I really don't want them doing that, but apparently they can at least get variograms out of it.

I can get our computer guy who set up the HPC system on a call with us, as he seems to think it's not the compute environment. We can increase both CPU and memory (and GPU and TPU, but as per Didier's note, I don't think RGeostats uses those at this point... waiting for gstlearn!). There are a lot of things to factor in, including grid size, moving neighborhood design, as well as CPU and memory, and we could use your experience to hopefully solve the problem.

As I mentioned, the map we want to make includes some 400,000 sample locations across the USA with 10 variables each. It's 2D for now, but ultimately we will have vertical samples (600 vertical samples per well). First, let's see if we can tackle the 2D problem....
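
To check whether the freeze comes from the O(n^2) pair computation in the experimental variogram rather than from the compute environment, one option is to time the variogram on increasing random subsamples of wells. The sketch below assumes the usual RGeostats entry points (db.create, db.locate, vario.calc); argument names such as lag/nlag may differ in your version, and mydata, X and Y are placeholders for your own data frame and coordinate columns.

Code:
library(RGeostats)

# The number of point pairs grows roughly as n^2 / 2, which is why 3,000 wells
# can already be slow and 400,000 would be roughly 18,000 times more work:
n <- c(3e3, 4e5)
print(n * (n - 1) / 2)             # ~4.5e6 pairs vs ~8e10 pairs

# Timing on random subsamples ('mydata', 'X', 'Y' are placeholders):
for (nw in c(500, 1000, 2000, 3000)) {
  sub <- mydata[sample(nrow(mydata), nw), ]
  db.sub <- db.create(sub)
  db.sub <- db.locate(db.sub, c("X", "Y"), "x")   # declare the coordinates
  t <- system.time(v <- vario.calc(db.sub, lag = 10, nlag = 20))
  cat(nw, "wells:", round(t["elapsed"], 1), "s\n")
}
# If the elapsed time roughly quadruples each time the number of wells doubles,
# the bottleneck is the pairwise variogram computation itself.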

