Programming with Big Data in R

}}
'''Programming with Big Data in R''' (pbdR)<ref>{{cite web|author=Ostrouchov, G., Chen, W.-C., Schmidt, D., Patel, P.|title=Programming with Big Data in R|year=2012|url=http://r-pbd.org/}}</ref><ref>{{cite web|title=XSEDE|url=https://portal.xsede.org/knowledge-base/-/kb/document/bcrw}}</ref><ref name=pbdDEMO/> is a series of [[R (programming language)|R]] packages and an environment for [[statistical computing]] with [[Big Data]] by means of high-performance statistical computation.<ref>{{cite web|author=Chen, W.-C. and Ostrouchov, G.|url=http://thirteen-01.stat.iastate.edu/snoweye/hpsc/|year=2011|title=HPSC -- High Performance Statistical Computing for Data Intensive Research}}</ref> The pbdR uses the same programming language as [[R (programming language)|R]]<ref name=R>{{cite book|author=R Core Team|title=R: A Language and Environment for Statistical Computing|year=2012|isbn=3-900051-07-0|url=http://www.r-project.org/}}</ref> with [[S (programming language)|S3/S4]] classes and methods, which is used among [[statistician]]s and [[Data mining|data miners]] for developing [[statistical software]]. The significant difference between pbdR and [[R (programming language)|R]]<ref name=R/> code is that pbdR mainly focuses on [[distributed memory]] systems, where data are distributed across several processors and communication between processors is based on [[Message Passing Interface|MPI]], which is easily utilized in large [[High-performance computing|high-performance computing (HPC)]] systems, whereas the [[R (programming language)|R]] system<ref name=R/> mainly focuses on interactive data analysis on single [[Multi-core processor|multi-core]] machines. The two main implementations of [[Message Passing Interface|MPI]] in [[R (programming language)|R]] are [http://cran.r-project.org/package=Rmpi Rmpi]<ref name=rmpi/> and [http://cran.r-project.org/package=pbdMPI pbdMPI] of pbdR.
* The pbdR built on [http://cran.r-project.org/package=pbdMPI pbdMPI] uses [[SPMD|SPMD parallelism]], where all processors are considered as workers and own parts of the data (as sketched in the example after this list). The [[SPMD|SPMD parallelism]]<ref name=spmd/><ref name=spmd_ostrouchov/> introduced in the mid-1980s is particularly efficient in homogeneous computing environments for large data, for example, performing [[Singular value decomposition|singular value decomposition]] on a large matrix, or performing [[Mixture model|clustering analysis]] on high-dimensional large data. On the other hand, there is no restriction against using [[Master/slave (technology)|manager/workers parallelism]] in an [[SPMD|SPMD parallelism]] environment.
* The [http://cran.r-project.org/package=Rmpi Rmpi]<ref name=rmpi/> uses [[Master/slave (technology)|manager/workers parallelism]], where one main processor (manager) serves as the controller of all other processors (workers). The [[Master/slave (technology)|manager/workers parallelism]]<ref>[http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf "Google's MapReduce Programming Model -- Revisited"], paper by Ralf Lämmel, from [[Microsoft]]</ref> introduced in the mid-2000s is particularly efficient for large tasks in small [[Computer cluster|clusters]], for example, the [[Bootstrapping (statistics)|bootstrap method]] and [[Monte Carlo method|Monte Carlo simulation]] in applied statistics, since the [[Independent and identically distributed random variables|i.i.d.]] assumption is commonly used in most [[Statistics|statistical analysis]]. In particular, [http://math.acadiau.ca/ACMMaC/Rmpi/structure.html task pull] parallelism has better performance for Rmpi in heterogeneous computing environments.
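The contrast between the two models can be sketched with a minimal pbdMPI program; the script below is an illustrative assumption rather than an example from the cited sources, and the file name <code>demo_spmd.r</code> is hypothetical. Every rank executes the same code on its own local data, and a collective <code>allreduce</code> combines the local results into a global one:

<syntaxhighlight lang="r">
# Minimal SPMD sketch with pbdMPI (illustrative; launch in batch, e.g.
# "mpiexec -np 4 Rscript demo_spmd.r", so several ranks run this same file).
library(pbdMPI)
init()                                  # start MPI for this rank

my.rank  <- comm.rank()                 # this rank's id: 0, 1, 2, ...
n.ranks  <- comm.size()                 # total number of ranks

local.x  <- rnorm(5, mean = my.rank)    # each rank owns its part of the data
global.s <- allreduce(sum(local.x), op = "sum")   # collective sum over ranks

comm.print(global.s)                    # printed once (by rank 0 by default)
finalize()                              # shut down MPI
</syntaxhighlight>

There is no manager in this sketch: all ranks are workers and cooperate through collective MPI operations. An Rmpi manager/workers program, by contrast, usually starts from a single interactive R session that spawns worker processes and hands tasks to them, which is where the task pull scheme mentioned above applies.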
It is clear that pbdR is not only suitable for small [[Computer cluster|clusters]], but is also more stable for analyzing larger data and more scalable for [[Supercomputer|supercomputers]].<ref>{{cite journal|author=Schmidt, D., Ostrouchov, G., Chen, W.-C., and Patel, P.|title=Tight Coupling of R and Distributed Linear Algebra for High-Level Programming with Big Data|year=2012|pages=811–815|journal=High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion|url=http://dl.acm.org/citation.cfm?id=2477156}}</ref>