Programming with Big Data in R


Programming with Big Data in R (pbdR)[1][2][3] is a series of R packages and an environment for statistical computing with big data using high-performance statistical computation.[4] pbdR uses the same programming language as R,[5] with S3/S4 classes and methods, which is widely used among statisticians and data miners for developing statistical software. The significant difference between pbdR and R[5] code is that pbdR mainly focuses on distributed-memory systems, where data are distributed across several processors and analyzed in batch mode, with communication between processors handled by MPI, which is easily used on large high-performance computing (HPC) systems. The R system[5] mainly focuses on single multi-core machines for data analysis in an interactive mode, such as through a GUI.[6]

pbdR

  • Paradigm: SPMD
  • Designed by: Wei-Chen Chen, George Ostrouchov, Pragneshkumar Patel, and Drew Schmidt
  • Developer: pbdR Core Team
  • First appeared: September 2012
  • Preview release: through GitHub at RBigData
  • Typing discipline: Dynamic
  • OS: Cross-platform
  • License: General Public License and Mozilla Public License
  • Website: r-pbd.org
  • Influenced by: R, C, Fortran, and MPI

The two main implementations of MPI in R are Rmpi[7] and the pbdMPI package of pbdR.

The idea of SPMD parallelism is to let every processor do the same work, but on a different part of a large data set. For example, a modern GPU is a large collection of slower co-processors that simply apply the same computation to different parts of (relatively smaller) data, which nonetheless turns out to be an efficient way to obtain the final solution.[12][13] pbdR is therefore not only suitable for small clusters, but is also more stable for analyzing big data and more scalable on supercomputers.[14]
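The SPMD idea can be illustrated outside R with a short Python sketch (illustrative only, not pbdR code): the same function runs for every simulated "rank", each on its own slice of the data, and the partial results are then combined, much as an MPI allreduce would do.

```python
# SPMD sketch: every "rank" runs the same function on its own slice
# of the data; combining the partial results mimics an MPI allreduce.

def worker(rank, nranks, data):
    # Each rank takes a contiguous slice of the data
    # (same program, different part of the data).
    chunk = len(data) // nranks
    lo, hi = rank * chunk, (rank + 1) * chunk
    return sum(data[lo:hi])          # local computation

data = list(range(1, 101))           # 1..100
nranks = 4
partials = [worker(r, nranks, data) for r in range(nranks)]
total = sum(partials)                # the "allreduce" step
print(total)                         # 5050, same as sum(data)
```

In a real pbdR program the ranks are separate processes launched by mpiexec and the combining step is performed by MPI, but the division of labor is the same.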

Package design

Programming with pbdR requires the use of various packages developed by the pbdR core team.

The packages developed are the following:

General    I/O        Computation  Application
pbdDEMO    pbdNCDF4   pbdDMAT      pmclust
pbdMPI                pbdBASE
                      pbdSLAP
  • pbdMPI --- an efficient interface to MPI, either OpenMPI[15] or MPICH2,[16] with a focus on the Single Program/Multiple Data (SPMD) parallel programming style[8][17][9]
  • pbdSLAP --- bundles scalable dense linear algebra libraries in double precision for R, based on ScaLAPACK version 2.0.2,[18] which includes several scalable linear algebra packages (namely BLACS, PBLAS, and ScaLAPACK)[19][20]
  • pbdNCDF4 --- interface to parallel Unidata NetCDF4 format data files[21]
  • pbdBASE --- low-level ScaLAPACK codes and wrappers
  • pbdDMAT --- distributed matrix classes and computational methods, with a focus on linear algebra and statistics[22][23][24]
  • pbdDEMO --- a set of package demonstrations and examples, with a unifying vignette[3]
  • pmclust --- parallel model-based clustering using pbdR

Among these packages, the pbdDEMO package is a collection of 20+ package demos that offer example uses of the various pbdR packages, together with a vignette that explains the demos in detail and provides some mathematical and statistical insight.

Examples

Example 1

Hello World! Save the following code in a file called ``demo.r``

### Initial MPI
library(pbdMPI, quiet = TRUE)
init()

comm.cat("Hello World!\n")

### Finish
finalize()

and use the command

mpiexec -np 2 Rscript demo.r

to execute the code, where Rscript is the command-line front end that ships with R.

Example 2

The following example, modified from pbdMPI, illustrates the basic syntax of pbdR. Since pbdR is designed for SPMD, all pbdR scripts are stored in files and executed from the command line via mpiexec, mpirun, etc. Save the following code in a file called ``demo.r``

### Initial MPI
library(pbdMPI, quiet = TRUE)
init()
.comm.size <- comm.size()
.comm.rank <- comm.rank()

### Set a vector x on all processors with different values
N <- 5
x <- (1:N) + N * .comm.rank

### All reduce x using summation operation
y <- allreduce(as.integer(x), op = "sum")
comm.print(y)
y <- allreduce(as.double(x), op = "sum")
comm.print(y)

### Finish
finalize()

and use the command

mpiexec -np 4 Rscript demo.r

to execute the code, where Rscript is the command-line front end that ships with R.
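The result of Example 2 can be checked by hand: on rank r (r = 0, ..., 3) the vector x is (1:5) + 5r, so summing element i across the four ranks gives 4i + 5(0+1+2+3) = 4i + 30. A small Python sketch (a serial simulation, not pbdR code) confirms the allreduce result:

```python
# Serial simulation of Example 2's allreduce with 4 "ranks" and N = 5.
N, nranks = 5, 4

# x on rank r is (1:N) + N * r, as in the R code above.
ranks = [[i + N * r for i in range(1, N + 1)] for r in range(nranks)]

# allreduce(op = "sum"): element-wise sum across all ranks.
y = [sum(col) for col in zip(*ranks)]
print(y)  # [34, 38, 42, 46, 50]
```

Each of the four MPI processes would hold this same vector y after the allreduce, and comm.print displays it once, from rank 0.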

Example 3

The following example, modified from pbdDEMO, illustrates the basic ddmatrix computation in pbdR, performing a singular value decomposition of a given matrix. Save the following code in a file called ``demo.r``

# Initialize process grid
library(pbdDMAT, quiet = TRUE)
init.grid()
if (comm.size() != 2)
  comm.stop("Exactly 2 processors are required for this demo.")

# Setup for the remainder
comm.set.seed(diff=TRUE)
M <- N <- 16
BL <- 2 # blocking --- passing single value BL assumes BLxBL blocking
dA <- ddmatrix("rnorm", nrow=M, ncol=N, mean=100, sd=10)

# LA SVD
svd1 <- La.svd(dA)
comm.print(svd1$d)

# Finish
finalize()

and use the command

mpiexec -np 2 Rscript demo.r

to execute the code, where Rscript is the command-line front end that ships with R.
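The BL-by-BL blocking in Example 3 refers to ScaLAPACK's two-dimensional block-cyclic layout, under which entry (i, j) of the matrix is owned by process grid position ((i div BL) mod Pr, (j div BL) mod Pc). The Python sketch below (illustrative only, and assuming a 2x1 process grid for the two processors; pbdR chooses the actual grid shape itself) computes the owner of a few entries with BL = 2:

```python
# Sketch of ScaLAPACK's 2-D block-cyclic layout used by ddmatrix.
# Assumption: a 2x1 process grid (Pr = 2, Pc = 1) with 2x2 blocking,
# matching Example 3's two processors.

def owner(i, j, BL=2, Pr=2, Pc=1):
    # Entry (i, j) (0-based) lies in block (i // BL, j // BL);
    # blocks are dealt out cyclically over the Pr x Pc grid.
    return ((i // BL) % Pr, (j // BL) % Pc)

# Rows 0-1 of the 16x16 matrix go to grid row 0, rows 2-3 to grid
# row 1, rows 4-5 back to grid row 0, and so on.
print(owner(0, 0))   # (0, 0)
print(owner(2, 0))   # (1, 0)
print(owner(4, 5))   # (0, 0)
```

This cyclic dealing of small blocks is what keeps the work balanced across processors in the distributed linear algebra routines such as La.svd.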

Further reading

Milestones

2013

  • Version 1.0-2:  Add pmclust.
  • Version 1.0-1:  Add pbdNCDF4.
  • Version 1.0-0:  Add pbdDEMO.

2012

  • Version 0.1-2:  Add pbdBASE and pbdDMAT.
  • Version 0.1-1:  Add pbdSLAP.
  • Version 0.1-0:  Migrate from Rmpi[7] to pbdMPI.

References

  1. Ostrouchov, G., Chen, W.-C., Schmidt, D., Patel, P. (2012). "Programming with Big Data in R".
  2. "XSEDE".
  3. Schmidt, D., Chen, W.-C., Patel, P., Ostrouchov, G. (2013). "Speaking Serial R with a Parallel Accent" (PDF).
  4. Chen, W.-C. and Ostrouchov, G. (2011). "HPSC -- High Performance Statistical Computing for Data Intensive Research".
  5. R Core Team (2012). R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0.
  6. Martinez, W. L. (2011). "Graphical user interfaces". WIREs Comp Stat, 3: 119–133. doi:10.1002/wics.150.
  7. Yu, H. (2002). "Rmpi: Parallel Statistical Computing in R". R News.
  8. Darema, F. (2001). "The SPMD Model: Past, Present and Future".
  9. Ostrouchov, G. (1987). "Parallel Computing on a Hypercube: An Overview of the Architecture and Some Applications". Proc. 19th Symp. on the Interface of Computer Science and Statistics: 27–32.
  10. Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). Johns Hopkins. ISBN 978-0-8018-5414-9.
  11. Lämmel, R. "Google's MapReduce Programming Model -- Revisited". Microsoft.
  12. Darren Murph. "Stanford University tailors Folding@home to GPUs". Retrieved 2007-10-04.
  13. Mike Houston. "Folding@Home - GPGPU". Retrieved 2007-10-04.
  14. Schmidt, D., Ostrouchov, G., Chen, W.-C., and Patel, P. (2012). "Tight Coupling of R and Distributed Linear Algebra for High-Level Programming with Big Data". High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion: 811–815.
  15. Jeff Squyres. "Open MPI: 10^15 Flops Can't Be Wrong" (PDF). Open MPI Project. Retrieved 2011-09-27.
  16. MPICH License.
  17. Ortega, J.M., Voight, G.G., and Romine, C.H. (1989). "Bibliography on Parallel and Vector Numerical Algorithms".
  18. Blackford, L.S.; et al. (1997). ScaLAPACK Users' Guide.
  19. Petitet, Antoine (1995). "PBLAS". Netlib. Retrieved 13 July 2012.
  20. Jaeyoung Choi; Dongarra, J.J.; Walker, D.W. (1994). "PB-BLAS: a set of Parallel Block Basic Linear Algebra Subprograms" (PDF). Scalable High-Performance Computing Conference: 534–541. doi:10.1109/SHPCC.1994.296688. ISBN 0-8186-5680-8.
  21. NetCDF Group (2008). "Network Common Data Form".
  22. J. Dongarra and D. Walker. "The Design of Linear Algebra Libraries for High Performance Computers".
  23. J. Demmel, M. Heath, and H. van der Vorst. "Parallel Numerical Linear Algebra".
  24. "2d block-cyclic data layout".
  25. Raim, A.M. (2013). "Introduction to distributed computing with pbdR at the UMBC High Performance Computing Facility" (PDF). Technical Report HPCF-2013-2.
  26. Dirk Eddelbuettel. "High-Performance and Parallel Computing with R".
  27. "R at 12,000 Cores".
  28. "100 most read R posts in 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages".
  29. GSOC-R (2013). "Profiling Tools for Parallel Computing with R".