'''Programming with Big Data in R''' (pbdR)<ref>{{cite web|author=Ostrouchov, G., Chen, W.-C., Schmidt, D., Patel, P.|title=Programming with Big Data in R|year=2012|url=http://r-pbd.org/}}</ref> is a series of [[free software]] [[R (programming language)|R]] packages and a software environment for [[statistical computing]] with [[Big Data]] using high-performance statistical computation.<ref>{{cite web|author=Chen, W.-C. and Ostrouchov, G.|url=http://thirteen-01.stat.iastate.edu/snoweye/hpsc/|year=2011|title=HPSC -- High Performance Statistical Computing for Data Intensive Research}}</ref> pbdR uses the same programming language as [[R (programming language)|R]],<ref>{{cite book|author=R Core Team|title=R: A Language and Environment for Statistical Computing|year=2012|isbn=3-900051-07-0|url=http://www.r-project.org/}}</ref> which is used among [[statistician]]s and [[Data mining|data miners]] for developing [[statistical software]]. pbdR mainly focuses on [[distributed memory]] systems, where data are distributed across several nodes and communication between nodes is based on [[Message Passing Interface|MPI]], which is widely used in large [[High-performance computing|high-performance computing (HPC)]] systems.
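A minimal "hello world" sketch of the MPI-based programming model described above, using the pbdMPI package. This assumes an MPI library is installed; the script is launched with <code>mpiexec</code> rather than run interactively, and the file name is illustrative.

```r
# hello_pbdmpi.r -- run with: mpiexec -np 4 Rscript hello_pbdmpi.r
library(pbdMPI)

init()                    # initialize the MPI communicator

# Every rank executes this same script (SPMD); rank and size
# identify each running copy.
my.rank <- comm.rank()    # 0-based rank of this processor
n.ranks <- comm.size()    # total number of processors

comm.cat("Hello from rank", my.rank, "of", n.ranks, "\n",
         all.rank = TRUE)

finalize()                # shut down MPI cleanly
```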
Two main implementations in [[R (programming language)|R]] using [[Message Passing Interface|MPI]] are [[Rmpi]]<ref name=rmpi/> and [[pbdMPI]] of pbdR.
* [[Rmpi]]<ref name=rmpi/> uses [[Master/slave (technology)|manager/workers parallelism]], where one main processor (the manager) serves as the controller of all other processors (the workers). This parallelism is particularly efficient for large tasks on small [[Computer cluster|clusters]], for example the [[Bootstrapping (statistics)|bootstrap method]] and [[Monte Carlo method|Monte Carlo simulation]] in applied statistics, since the [[Independent and identically distributed random variables|i.i.d.]] assumption is common in most [[Statistics|statistical analyses]].
* pbdR, built on [[pbdMPI]], uses [[SPMD|SPMD parallelism]], where every processor is considered a worker and owns a part of the data. This parallelism is particularly suitable for large data, for example performing [[Singular value decomposition|singular value decomposition]] on a large matrix, or performing [[Mixture model|clustering analysis]] on high-dimensional large data. On the other hand, there is no restriction against using [[Master/slave (technology)|manager/workers parallelism]] within an [[SPMD|SPMD]] environment.
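In the SPMD style described above, every rank runs the same script on its own portion of the data and combines partial results with collective operations. The following is a hedged sketch using pbdMPI's <code>allreduce</code> to compute a global mean; the data-splitting (each rank generating its own random chunk) is purely illustrative.

```r
# global_mean.r -- run with: mpiexec -np 4 Rscript global_mean.r
library(pbdMPI)
init()

# Each rank generates (or, in practice, reads) only its own
# chunk of the data set.
set.seed(comm.rank())
x.local <- rnorm(1000)

# Combine the partial sums and counts across all ranks
# with an MPI allreduce.
global.sum <- allreduce(sum(x.local), op = "sum")
global.n   <- allreduce(length(x.local), op = "sum")

comm.print(global.sum / global.n)  # every rank holds the global mean

finalize()
```

Note that no rank ever holds the full data set; only scalar summaries cross the network, which is what makes the approach scale to data larger than any single node's memory.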
Clearly, pbdR is not only suitable for small [[Computer cluster|clusters]], but is also stabler for analyzing large data and more scalable for [[Supercomputer|supercomputers]].