{{multiple issues|
{{notability|date=June 2013}}
{{
}}
{{Infobox programming language
| name =
| logo =
| paradigm = [[SPMD]] and [[MPMD]]
|
| designer =
| developer = pbdR Core Team
| latest_test_version = Through [[GitHub]] at [
| typing = [[dynamic typing|Dynamic]]
| influenced_by = [[R (programming language)|R]], [[C (programming language)|C]], [[
| operating_system = [[Cross-platform]]
| license = [[General Public License]] and [[Mozilla Public License]]
| website =
}}
'''Programming with Big Data in R''' (pbdR)<ref>{{cite web|author=Ostrouchov, G., Chen, W.-C., Schmidt, D., Patel, P.|title=Programming with Big Data in R|year=2012|url=http://r-pbd.org
The two main implementations in [[R (programming language)|R]] that use [[Message Passing Interface|MPI]] are [http://cran.r-project.org/package=Rmpi Rmpi]<ref name=rmpi/> and [http://cran.r-project.org/package=pbdMPI pbdMPI], the latter being part of pbdR.
* pbdR, built on [http://cran.r-project.org/package=pbdMPI pbdMPI], uses [[SPMD|SPMD parallelism]], in which every processor is treated as a worker and owns a part of the data. [[SPMD|SPMD parallelism]],<ref name=spmd/><ref name=spmd_ostrouchov/> introduced in the mid-1980s, is particularly efficient in homogeneous computing environments for large data, for example when performing [[Singular value decomposition|singular value decomposition]].<ref>{{Cite book | last1=Golub | first1=Gene H. | author1-link=Gene H. Golub | last2=Van Loan | first2=Charles F. | author2-link=Charles F. Van Loan | title=Matrix Computations | publisher=Johns Hopkins | edition=3rd | isbn=978-0-8018-5414-9 | year=1996 }}</ref>
* [http://cran.r-project.org/package=Rmpi Rmpi]<ref name=rmpi/> uses [[Master/slave (technology)|manager/workers parallelism]], in which one main processor (the manager) controls all other processors (the workers). [[Master/slave (technology)|Manager/workers parallelism]],<ref>[http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf "Google's MapReduce Programming Model -- Revisited"], paper by Ralf Lämmel; from [[Microsoft]]</ref> introduced around the early 2000s, is particularly efficient for large tasks on small [[Computer cluster|clusters]], for example the [[Bootstrapping (statistics)|bootstrap method]] and [[Monte Carlo method|Monte Carlo simulation]] in applied statistics, since the [[Independent and identically distributed random variables|i.i.d.]] assumption is commonly used in [[Statistics|statistical analysis]]. In particular, [http://math.acadiau.ca/ACMMaC/Rmpi/structure.html task pull] parallelism performs better for Rmpi in heterogeneous computing environments.
The idea of [[SPMD|SPMD parallelism]] is to let every processor do the same work, but on a different part of a large data set. For example, a modern [[Graphics processing unit|GPU]] is a large collection of slower co-processors that simply apply the same computation to different parts of relatively small data, yet this turns out to be an efficient way to obtain the final solution.<ref>{{cite web | url = http://www.engadget.com/2006/09/29/stanford-university-tailors-folding-home-to-gpus/ | title = Stanford University tailors Folding@home to GPUs | author = Darren Murph | accessdate = 2007-10-04 }}</ref><ref>{{cite web | url = http://graphics.stanford.edu/~mhouston/ | title = Folding@Home - GPGPU | author = Mike Houston | accessdate = 2007-10-04 }}</ref>
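The manager/workers pattern in the second item can be sketched with a bootstrap, where each replicate is an independent task. This is only an illustration under stated assumptions: it uses R's bundled <code>parallel</code> package instead of Rmpi, so no MPI calls appear, and the data <code>x</code>, the number of replicates and the cluster size are hypothetical.
<syntaxhighlight lang="r">
library(parallel)

# Hypothetical data whose mean we want a bootstrap standard error for
set.seed(1)
x <- rnorm(200, mean = 5)

# The manager (this R session) starts two worker processes
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)  # separate random-number stream per worker

# One task = one bootstrap replicate. clusterApplyLB() sends the next
# task to whichever worker finishes first (load balancing).
one_replicate <- function(i, data) mean(sample(data, replace = TRUE))
boot_means <- unlist(clusterApplyLB(cl, seq_len(200), one_replicate, data = x))

stopCluster(cl)

# Bootstrap estimate of the standard error of the mean
se_hat <- sd(boot_means)
</syntaxhighlight>
Because tasks are handed out dynamically, which worker computes which replicate can differ between runs, so the individual values in <code>boot_means</code> are not guaranteed to be reproducible.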
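The SPMD idea can be illustrated without MPI by a serial stand-in in base R. In this sketch the loop over <code>rank</code> plays the role of the processes that would run the same script at once, and the final <code>sum()</code> stands in for a sum reduction across ranks (in pbdMPI, a collective call such as <code>allreduce()</code>). The data, the number of simulated processes and the helper <code>block_of()</code> are hypothetical names chosen for illustration.
<syntaxhighlight lang="r">
# Serial stand-in for an SPMD run: no MPI is used here.
x    <- as.numeric(1:1000)  # the full data set (hypothetical)
size <- 4                   # number of simulated processes

# Indices of the contiguous block owned by a rank (ranks start at 0).
# Assumes n >= size so that every block is non-empty.
block_of <- function(rank, n, size) {
  bounds <- floor(seq(0, n, length.out = size + 1))
  (bounds[rank + 1] + 1):bounds[rank + 2]
}

# Every rank runs the same code, but only on its own block of the data
local_sums <- vapply(
  0:(size - 1),
  function(rank) sum(x[block_of(rank, length(x), size)]),
  numeric(1)
)

# Combining the partial results corresponds to a sum reduction
total <- sum(local_sums)
</syntaxhighlight>
No process needs to hold the whole data set to compute its partial result, which is why the approach suits large data in homogeneous environments.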
== Package design ==
{| class="wikitable"
|-
! General !! I/O !! Computation !! Application !! Profiling !! Client/Server
|-
| pbdDEMO || pbdNCDF4 || pbdDMAT || pmclust || pbdPROF || pbdZMQ
|-
| pbdMPI || pbdADIOS || pbdBASE || pbdML || pbdPAPI || remoter
|-
| ||
|-
| || || kazaam || || || pbdRPC
|}
[[File:Pbd overview.png|thumb|The image shows how the various pbdR packages relate to one another.]]
Among these packages, pbdMPI provides wrapper functions for the [[Message Passing Interface|MPI]] library, and it also produces a [[Library (computing)|shared library]] and a configuration file for
*
* pbdSLAP --- bundles scalable dense linear algebra libraries in double precision for R, based on [[ScaLAPACK]] version 2.0.2 which includes several scalable linear algebra packages (namely [[BLACS]], [[PBLAS]], and [[ScaLAPACK]]).
* pbdNCDF4 --- interface to Parallel Unidata [[NetCDF]]4 format data files
*
* pbdDMAT --- distributed matrix classes and computational methods, with a focus on linear algebra and statistics
* pbdDEMO --- set of package demonstrations and examples, and this unifying vignette
* pmclust --- parallel [[model-based clustering]] using pbdR
* pbdPROF --- profiling package for MPI codes and visualization of parsed stats
* pbdZMQ --- interface to [[ZeroMQ|ØMQ]]
* remoter --- R client with remote R servers
* pbdCS --- pbdR client with remote pbdR servers
* [http://cran.r-project.org/web/packages/pbdBASE/vignettes/pbdBASE-guide.pdf pbdBASE] --- low-level [[ScaLAPACK]] codes and wrappers
* pbdRPC --- remote procedure call
* kazaam --- very tall and skinny distributed matrices
* pbdML --- machine learning toolbox
== Examples ==
=== Example 1 ===
Hello World! Save the following code in a file called "demo.r"
<syntaxhighlight lang="r">
### Initialize MPI
library(pbdMPI, quietly = TRUE)
init()

### Print the greeting once, from rank 0
comm.cat("Hello World!\n")

### Finish
finalize()
</syntaxhighlight>
and use the command
<syntaxhighlight lang="bash">
mpiexec -np 2 Rscript demo.r
</syntaxhighlight>
to execute the code, where [[R (programming language)|Rscript]] is a command-line executable program.
=== Example 2 ===
The following example, modified from pbdMPI, illustrates the basic [[programming language syntax|syntax]] of pbdR.
Since pbdR is designed around [[SPMD]], all the R scripts are stored in files and executed from the command line via
<syntaxhighlight lang="r">
### Initialize MPI
library(pbdMPI, quietly = TRUE)
### Finish
finalize()
</syntaxhighlight>
and use the command
<syntaxhighlight lang="bash">
mpiexec -np 4 Rscript demo.r
</syntaxhighlight>
to execute the code, where [[R (programming language)|Rscript]] is a command-line executable program.
=== Example 3 ===
The following example, modified from pbdDEMO, illustrates the basic ddmatrix computation of pbdR, which performs [[
Save the following code in a file called "demo.r"
<syntaxhighlight lang="r">
# Initialize process grid
library(pbdDMAT, quietly = TRUE)
# Finish
finalize()
</syntaxhighlight>
and use the command
<syntaxhighlight lang="bash">
mpiexec -np 2 Rscript demo.r
</syntaxhighlight>
to execute the code, where [[R (programming language)|Rscript]] is a command-line executable program.
== Further reading ==
*
* {{cite tech report|author=Bachmann, M.G., Dyas, A.D., Kilmer, S.C. and Sass, J.|year=2013|title=Block Cyclic Distribution of Data in pbdR and its Effects on Computational Efficiency|institution=UMBC High Performance Computing Facility, University of Maryland, Baltimore County|number=HPCF-2013-11|url=http://userpages.umbc.edu/~gobbert/papers/REU2013Team1.pdf|accessdate=2014-02-01|archiveurl=https://web.archive.org/web/20140204051351/http://userpages.umbc.edu/~gobbert/papers/REU2013Team1.pdf|archivedate=2014-02-04|url-status=dead}}
* [http://cran.r-project.org/ CRAN] Task View: [http://cran.r-project.org/web/views/HighPerformanceComputing.html High-Performance and Parallel Computing with R].<ref>{{cite web|title=High-Performance and Parallel Computing with R|author=Dirk Eddelbuettel|url=http://cran.r-project.org/web/views/HighPerformanceComputing.html}}</ref>
* {{cite tech report|author=Bailey, W.J., Chambless, C.A., Cho, B.M. and Smith, J.D.|year=2013|title=Identifying Nonlinear Correlations in High Dimensional Data with Application to Protein Molecular Dynamics Simulations|institution=UMBC High Performance Computing Facility, University of Maryland, Baltimore County|number=HPCF-2013-12|url=http://userpages.umbc.edu/~gobbert/papers/REU2013Team2.pdf|accessdate=2014-02-01|archiveurl=https://web.archive.org/web/20140204055902/http://userpages.umbc.edu/~gobbert/papers/REU2013Team2.pdf|archivedate=2014-02-04|url-status=dead}}
* [http://www.r-bloggers.com/r-at-12000-cores/ R at 12,000 Cores].<ref>{{cite news|title=R at 12,000 Cores|url=http://www.r-bloggers.com/r-at-12000-cores/}}</ref> This article was read 22,584 times in 2012 after it was posted on October 16, 2012, and it ranked number 3 in the [http://www.r-bloggers.com/100-most-read-r-posts-for-2012-stats-from-r-bloggers-big-data-visualization-data-manipulation-and-other-languages/ Top 100 R posts of 2012].<ref>{{cite news|url=http://www.r-bloggers.com/100-most-read-r-posts-for-2012-stats-from-r-bloggers-big-data-visualization-data-manipulation-and-other-languages/|title=100 most read R posts in 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages}}</ref>
* [http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2013:mpiprofiler MPI Profiler for pbdR], mentored by the [http://rwiki.sciviews.org/doku.php Organization of R Project for Statistical Computing] for Google Summer of Code 2013.<ref>{{cite web|url=http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2013:mpiprofiler|title=Profiling Tools for Parallel Computing with R|author=GSOC-R 2013}}</ref>
== External links ==
* {{cite web|url=http://rpubs.com/wush978/pbdMPI-linux-pilot|title=在雲端運算環境使用R和MPI|trans-title=Using R and MPI in a cloud computing environment|language=zh|author=Wush Wu|year=2014}}
* {{Official website|r-pbd.org}} of the pbdR project
* {{cite web|url=https://www.youtube.com/watch?v=m1vtPESsFqM|title=快速在AWS建立R和pbdMPI的使用環境|trans-title=Quickly setting up an R and pbdMPI environment on AWS|language=zh|author=Wush Wu|year=2013|website=[[YouTube]] }}
== References ==
{{Reflist|30em}}
{{DEFAULTSORT:PbdR}}
[[Category:Parallel computing]]
[[Category:Cross-platform free software]]
[[Category:
[[Category:Data-centric programming languages]]
[[Category:Statistical software]]
[[Category:Free statistical software]]
[[Category:
[[Category:
[[Category:Numerical analysis software for Windows]]