Programming with Big Data in R
* The pbdR built on [http://cran.r-project.org/package=pbdMPI pbdMPI] uses [[SPMD|SPMD parallelism]], where every processor is considered a worker and owns a part of the data. SPMD parallelism, introduced in the mid-1980s, is particularly efficient in homogeneous computing environments for large data, for example, performing [[Singular value decomposition|singular value decomposition]] on a large matrix, or performing [[Mixture model|clustering analysis]] on high-dimensional large data. On the other hand, there is no restriction against using [[Master/slave (technology)|manager/workers parallelism]] in an [[SPMD|SPMD parallelism]] environment.
* The [http://cran.r-project.org/package=Rmpi Rmpi] package<ref name=rmpi/> uses [[Master/slave (technology)|manager/workers parallelism]], where one main processor (the manager) serves as the controller of all other processors (the workers). Manager/workers parallelism,<ref>[http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf "Google's MapReduce Programming Model -- Revisited"], paper by Ralf Lämmel, [[Microsoft]]</ref> introduced around the early 2000s, is particularly efficient for large tasks on small [[Computer cluster|clusters]], for example, the [[Bootstrapping (statistics)|bootstrap method]] and [[Monte Carlo method|Monte Carlo simulation]] in applied statistics, since the [[Independent and identically distributed random variables|i.i.d.]] assumption is commonly used in most [[Statistics|statistical analyses]]. In particular, [http://math.acadiau.ca/ACMMaC/Rmpi/structure.html task pull] parallelism gives Rmpi better performance in heterogeneous computing environments.
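As a minimal sketch of the SPMD style (illustrative only, not taken from the pbdR documentation), every rank runs the same script on its own portion of the data and combines local results with a collective operation; the data sizes and values here are made up for the example:

<syntaxhighlight lang="r">
## SPMD sketch with pbdMPI: every rank executes this same script.
## Run with, e.g., "mpiexec -np 4 Rscript spmd_mean.r".
library(pbdMPI)
init()

## Each rank owns its own chunk of the data; here it is simulated,
## but in practice each rank would read its own slice from storage.
n.local <- 1000
x.local <- rnorm(n.local)

## Reduce the local sums across all ranks to obtain the global mean.
sum.global <- allreduce(sum(x.local), op = "sum")
n.global   <- allreduce(n.local, op = "sum")
comm.print(sum.global / n.global)

finalize()
</syntaxhighlight>

Note that every rank issues the same allreduce() call, which is what distinguishes this style from the manager/workers style of Rmpi, where only the manager coordinates the computation.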
The idea of [[SPMD|SPMD parallelism]] is to let every processor do the same work, but on different parts of the large data. For example, a modern [[Graphics processing unit|GPU]] is a large collection of slower co-processors that simply apply the same computation on different parts of relatively smaller data, yet [[SPMD|SPMD parallelism]] ends up being an efficient way to obtain the final solution, i.e. the time to solution is shorter.<ref>{{cite web | url = http://www.engadget.com/2006/09/29/stanford-university-tailors-folding-home-to-gpus/ | title = Stanford University tailors Folding@home to GPUs | author = Darren Murph | accessdate = 2007-10-04 }}</ref><ref>{{cite web | url = http://graphics.stanford.edu/~mhouston/ | title = Folding@Home - GPGPU | author = Mike Houston | accessdate = 2007-10-04 }}</ref> In short, pbdR
* is '''not''' like the Rmpi, snow, snowfall, do-like, '''nor''' parallel packages in R,
* does '''not''' focus on interactive computing '''nor''' on the manager/workers model,
* but is able to use '''both''' SPMD and task parallelism (see the task-pull sketch below).
 
It is clear that pbdR is not only suitable for small [[Computer cluster|clusters]], but is also more stable for analyzing [[Big data|big data]] and more scalable for [[Supercomputer|supercomputers]].<ref>{{cite journal|author=Schmidt, D., Ostrouchov, G., Chen, W.-C., and Patel, P.|title=Tight Coupling of R and Distributed Linear Algebra for High-Level Programming with Big Data|year=2012|pages=811–815|journal=High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion|url=http://dl.acm.org/citation.cfm?id=2477156}}</ref>
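The task parallelism mentioned above can be sketched with the task.pull() function shipped with pbdMPI, in which rank 0 acts as the manager handing out job IDs and the remaining ranks act as workers; the job function and job count below are made-up placeholders:

<syntaxhighlight lang="r">
## Task-pull sketch with pbdMPI: rank 0 hands job ids to worker ranks
## as they become free, then collects the results.
## Run with, e.g., "mpiexec -np 4 Rscript task_pull.r".
library(pbdMPI)
init()

## Placeholder job: in practice this could fit one bootstrap replicate
## or run one Monte Carlo simulation.
FUN <- function(jid) {
  set.seed(jid)
  mean(rnorm(1e6))
}

ret <- task.pull(1:10, FUN)   # manager/workers scheduling over 10 jobs
comm.print(unlist(ret))       # results gathered on rank 0

finalize()
</syntaxhighlight>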
 
== Package design ==