Computer cluster: Difference between revisions

{{Short description|Set of computers configured in a distributed computing system}}
 
{{Distinguish|data cluster|grid computing}}
{{Redirect|Cluster computing|the journal|Cluster Computing (journal)}}
[[File:MEGWARE.CLIC.jpg|thumb|Technicians working on a large [[Linux]] cluster at the [[Chemnitz University of Technology]], Germany]]
[[File:Sun Microsystems Solaris computer cluster.jpg|thumb|Sun Microsystems [[Solaris Cluster]], with [[Close Coupled Cooling#In-Row Air Conditioners|In-Row cooling]]]]
[[File:Taiwania series.jpg|thumb|The [[Taiwania_(supercomputer)|Taiwania]] series uses a cluster architecture; its large capacity helped scientists in [[Taiwan]] and many others during the [[COVID-19]] pandemic.]]
 
A '''computer cluster''' is a set of [[computer]]s that work together so that they can be viewed as a single system. Unlike [[Grid computing|grid computer]]s, computer clusters have each [[Node (networking)|node]] set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is [[cloud computing]].
 
The components of a cluster are usually connected to each other through fast [[local area network]]s, with each [[Node (networking)|node]] (computer used as a server) running its own instance of an [[operating system]]. In most circumstances, all of the nodes use the same hardware<ref>{{cite web |url=https://stackoverflow.com/questions/9723040/what-is-the-difference-between-cloud-grid-and-cluster |title=Cluster vs grid computing |website=[[Stack Overflow]]}}</ref>{{better source needed|date=June 2017}} and the same operating system, although in some setups (e.g. using [[Open Source Cluster Application Resources]] (OSCAR)), different operating systems can be used on each computer, or different hardware.<ref name=pcauthority>{{cite web|url=http://www.pcauthority.com.au/Feature/306972,weekend-project-build-your-own-supercomputer.aspx|title=Weekend Project: Build your own supercomputer|date=29 June 2012|first=Darien|last=Graham-Smith|website=PC & Tech Authority|access-date=2 June 2017}}</ref>
 
Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.<ref>{{cite web|url=http://www.cc.gatech.edu/~bader/papers/ijhpca.html|title=Cluster Computing: Applications|last1=Bader|first1=David|author-link=David Bader (computer scientist)|date=May 2001|publisher=[[Georgia Institute of Technology College of Computing|Georgia Tech College of Computing]]|first2=Robert|last2=Pennington|access-date=2017-02-28|archive-url=https://web.archive.org/web/20071221011621/http://www.cc.gatech.edu/~bader/papers/ijhpca.html|archive-date=2007-12-21|url-status=dead}}</ref>
 
==Basic concepts==
[[File:Beowulf.jpg|thumb|150px|A simple, home-built [[Beowulf cluster]].]]
The desire to get more computing power and better reliability by orchestrating a number of low-cost [[commercial off-the-shelf]] computers has given rise to a variety of architectures and configurations.
 
A computer cluster may be a simple two-node system which just connects two personal computers, or may be a very fast [[supercomputer]]. A basic approach to building a cluster is that of a [[Beowulf (computing)|Beowulf]] cluster which may be built with a few personal computers to produce a cost-effective alternative to traditional [[high-performance computing]]. An early project that showed the viability of the concept was the 133-node [[Stone Soupercomputer]].<ref name="sciam">{{Cite news |title= The Do-It-Yourself Supercomputer |work= [[Scientific American]] |author= William W. Hargrove, Forrest M. Hoffman and [[Thomas Sterling (computing)|Thomas Sterling]] |volume= 265 |number= 2 |pages= 72–79 |date= August 16, 2001 |url= http://www.sciam.com/article.cfm?id=the-do-it-yourself-superc |access-date= October 18, 2011 }}</ref> The developers used [[Linux]], the [[Parallel Virtual Machine]] toolkit and the [[Message Passing Interface]] library to achieve high performance at a relatively low cost.<ref name="extreme">{{Cite news |title= Cluster Computing: Linux Taken to the Extreme |first1= William W. |last1= Hargrove |first2= Forrest M. |last2= Hoffman |work= Linux Magazine |year= 1999 |url= http://climate.ornl.gov/~forrest/linux-magazine-1999/ |access-date= October 18, 2011 |archive-url= https://web.archive.org/web/20111018122713/http://climate.ornl.gov/~forrest/linux-magazine-1999/ |archive-date= October 18, 2011 |url-status= dead }}</ref>
 
Although a cluster may consist of just a few personal computers connected by a simple network, the cluster architecture may also be used to achieve very high levels of performance. The [[TOP500]] organization's semiannual list of the 500 fastest supercomputers often includes many clusters; for example, the world's fastest machine in 2011 was the [[K computer]], which has a [[distributed memory]] cluster architecture.<ref>{{cite conference|first=Mitsuo|last=Yokokawa |display-authors=etal |title=The K computer: Japanese next-generation supercomputer development project|conference=International Symposium on Low Power Electronics and Design (ISLPED)|date=1–3 August 2011|pages=371–372|doi=10.1109/ISLPED.2011.5993668}}</ref>
 
==History==
{{Main|History of computer clusters}}
{{See also|History of supercomputing}}
[[File:SPEC-1 VAX 05.jpg|thumb|150px|A [[VAX]] 11/780, c. 1977, as used in early [[VAXcluster]] development.]]
 
Greg Pfister has stated that clusters were not invented by any specific vendor but by customers who could not fit all their work on one computer, or needed a backup.<ref>{{cite book | last = Pfister | first = Gregory | title = In Search of Clusters | edition = 2nd | publisher = Prentice Hall PTR | ___location = Upper Saddle River, NJ | year = 1998 | page = [https://archive.org/details/insearchofcluste00pfis/page/36 36] | isbn = 978-0-13-899709-0 | url = https://archive.org/details/insearchofcluste00pfis/page/36 }}</ref> Pfister estimates the date as some time in the 1960s. The formal engineering basis of cluster computing as a means of doing parallel work of any sort was arguably invented by [[Gene Amdahl]] of [[IBM]], who in 1967 published what has come to be regarded as the seminal paper on parallel processing: [[Amdahl's Law]].
 
The first production system designed as a cluster was the Burroughs [[B5700]] in the mid-1960s. This allowed up to four computers, each with either one or two processors, to be tightly coupled to a common disk storage subsystem in order to distribute the workload. Unlike standard multiprocessor systems, each computer could be restarted without disrupting overall operation.
[[File:TNSII.jpg|thumb|Tandem NonStop II circa 1980.]]
The first commercial loosely coupled clustering product was [[Datapoint|Datapoint Corporation's]] "Attached Resource Computer" (ARC) system, developed in 1977, and using [[ARCnet]] as the cluster interface. Clustering per se did not really take off until [[Digital Equipment Corporation]] released their [[VAXcluster]] product in 1984 for the [[OpenVMS|VMS]] operating system. The ARC and VAXcluster products not only supported [[parallel computing]], but also shared [[file system]]s and [[peripheral]] devices. The idea was to provide the advantages of parallel processing, while maintaining data reliability and uniqueness. Two other noteworthy early commercial clusters were the [[Tandem Computers|''Tandem NonStop'']] (a 1976 high-availability commercial product)<ref>{{Cite book |last=Katzman |first=James A. |title=Computer Structure: Principles and Examples |publisher=McGraw-Hill Book Company |year=1982 |isbn= |editor-last=Siewiorek |editor-first=Donald P. |___location=U.S.A. |pages=470–485 |chapter=Chapter 29, The Tandem 16: A Fault-Tolerant Computing System}}</ref><ref>{{Cite web |title=History of TANDEM COMPUTERS, INC. – FundingUniverse |url=http://www.fundinguniverse.com/company-histories/tandem-computers-inc-history/ |access-date=2023-03-01 |website=www.fundinguniverse.com}}</ref> and the ''IBM S/390 Parallel Sysplex'' (circa 1994, primarily for business use).
 
 
==Attributes of clusters==
[[File:Load Balancing Cluster (NAT).svg|thumb|A load balancing cluster with two servers and N user stations.]]
Computer clusters may be configured for different purposes, ranging from general-purpose business needs such as web-service support to computation-intensive scientific calculations. In either case, the cluster may use a [[high-availability cluster|high-availability]] approach. The attributes described below are not mutually exclusive; for example, a cluster built for computation may also use a high-availability approach.
 
"[[Load balancing (computing)|Load-balancing]]" clusters are configurations in which cluster-nodes share computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized.<ref name=Sloan>{{cite book|title=High Performance Linux Clusters|url=https://archive.org/details/highperformancel0000sloa|url-access=registration|first=Joseph D.|last=Sloan|year=2004|publisher="O'Reilly Media, Inc." |isbn=978-0-596-00570-2}}</ref> However, approaches to load-balancing may significantly differ among applications, e.g. a high-performance cluster used for scientific computations would balance load with different algorithms from a web-server cluster which may just use a simple [[round-robin scheduling|round-robin method]] by assigning each new request to a different node.<ref name=Sloan />
 
Computer clusters are used for computation-intensive purposes, rather than handling [[Input/output|IO-oriented]] operations such as web service or databases.<ref name=VECPAR >{{cite book|title=High Performance Computing for Computational Science - VECPAR 2004|first1=Michel|last1=Daydé|first2=Jack|last2=Dongarra|year=2005|isbn=978-3-540-25424-9|pages=120–121|publisher=Springer }}</ref> For instance, a computer cluster might support [[Computer simulation|computational simulations]] of vehicle crashes or weather. Very tightly coupled computer clusters are designed for work that may approach "[[supercomputing]]".
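
The round-robin method mentioned above for web-server clusters can be illustrated with a minimal sketch. It assumes a fixed list of nodes and ignores node load and health; the class name <code>RoundRobinBalancer</code> and the node names are hypothetical and are not taken from any particular load-balancing product.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Assign each new request to the next node in a fixed rotation."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = list(nodes)
        # cycle() repeats the node list endlessly, in order.
        self._rotation = cycle(self._nodes)

    def assign(self, request):
        """Return a (node, request) pair naming the node that serves the request."""
        return next(self._rotation), request


# Hypothetical node names, used only for illustration.
balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
assignments = [balancer.assign(f"request-{i}")[0] for i in range(5)]
print(assignments)
```

Because the rotation ignores how busy each node is, this approach suits workloads made of many short, similar requests; a scientific cluster with long, uneven jobs would typically need a scheduler that accounts for load.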
 
"[[High-availability cluster]]s" (also known as [[failover]] clusters, or HA clusters) improve the availability of the cluster approach. They operate by having redundant [[Node (networking)|nodes]], which are then used to provide service when system components fail. HA cluster implementations attempt to use redundancy of cluster components to eliminate [[single point of failure|single points of failure]]. There are commercial implementations of High-Availability clusters for many operating systems. The [[Linux-HA]] project is one commonly used [[free software]] HA package for the [[Linux]] operating system.
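
The failover behaviour of a high-availability cluster, in which a redundant node takes over when the active one fails, can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: nodes are tried in a fixed priority order, and failures are reported by calling <code>mark_failed</code> rather than detected through heartbeats. The class and node names are hypothetical and do not reflect the interface of Linux-HA or any other package.

```python
class FailoverCluster:
    """Track node health and route service to the first healthy node."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        # Dict insertion order doubles as failover priority.
        self._healthy = {node: True for node in nodes}

    def mark_failed(self, node):
        """Record that a node has stopped providing service."""
        self._healthy[node] = False

    def mark_recovered(self, node):
        """Record that a node is able to provide service again."""
        self._healthy[node] = True

    def active_node(self):
        """Return the highest-priority healthy node."""
        for node, healthy in self._healthy.items():
            if healthy:
                return node
        raise RuntimeError("no healthy nodes remain")


# Hypothetical two-node cluster: one primary, one redundant standby.
cluster = FailoverCluster(["primary", "standby"])
before = cluster.active_node()
cluster.mark_failed("primary")
after = cluster.active_node()
print(before, "->", after)
```

With only one node, a single failure leaves no healthy node and the service stops; adding the standby removes that single point of failure.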
==Benefits==
<!-- This used to be a list. Work has been done since, but it's still incomplete. -->
Clusters are primarily designed with performance in mind, but installations are based on many other factors. Fault tolerance (''the ability of a system to continue operating despite a malfunctioning node'') enables [[horizontal scaling|scalability]], and in high-performance situations, allows for a low frequency of maintenance routines, resource consolidation (e.g., [[RAID]]), and centralized management. Advantages include enabling data recovery in the event of a disaster and providing parallel data processing and high processing capacity.<ref>{{cite web|url=http://www-03.ibm.com/systems/clusters/benefits.html|title=IBM Cluster System : Benefits|publisher=[[IBM]]|access-date=8 September 2014|archive-url=https://web.archive.org/web/20160429022854/http://www-03.ibm.com/systems/clusters/benefits.html|archive-date=29 April 2016|url-status=dead}}</ref><ref>{{cite web|url=https://technet.microsoft.com/en-us/library/cc778629(v=ws.10).aspx|title=Evaluating the Benefits of Clustering|date=28 March 2003|publisher=[[Microsoft]]|access-date=8 September 2014|archive-url=https://web.archive.org/web/20160422092651/https://technet.microsoft.com/en-us/library/cc778629%28v%3Dws.10%29.aspx|archive-date=22 April 2016|url-status=dead}}</ref>
 
Clusters provide scalability through their ability to add nodes horizontally: more computers may be added to the cluster to improve its performance, redundancy and fault tolerance. This can be a less expensive way to obtain a higher-performing cluster than scaling up a single node in the cluster. This property allows larger computational loads to be executed by a larger number of lower-performing computers.
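
How much performance additional nodes can contribute depends on how much of the workload can run in parallel. The sketch below applies Amdahl's law (mentioned in the History section) in its usual form, speedup = 1 / ((1 − p) + p / N), where p is the parallelizable fraction of the work and N the number of nodes. The value p = 0.95 is an assumed figure chosen only for illustration, and the model ignores communication overhead between nodes.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Ideal speedup over one node when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected by extra nodes; the parallel part is split N ways.
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Assumed workload: 95% of the work can be spread across nodes.
for n in (1, 2, 8, 64):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Under this assumption the speedup can never exceed 1 / (1 − 0.95) = 20, however many nodes are added, which is why horizontal scaling pays off most for workloads with a small serial portion.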
 
==Design and configuration==
[[File:beowulf.png|thumb|240px|left|A typical Beowulf configuration.]]
One of the issues in designing a cluster is how tightly coupled the individual nodes may be. For instance, a single computer job may require frequent communication among nodes: this implies that the cluster shares a dedicated network, is densely located, and probably has homogeneous nodes. The other extreme is where a computer job uses one or a few nodes and needs little or no inter-node communication, approaching [[grid computing]].
 
In a [[Beowulf cluster]], the application programs never see the computational nodes (also called slave computers) but only interact with the "Master" which is a specific computer handling the scheduling and management of the slaves.<ref name=VECPAR /> In a typical implementation the Master has two network interfaces, one that communicates with the private Beowulf network for the slaves, the other for the general purpose network of the organization.<ref name=VECPAR /> The slave computers typically have their own version of the same operating system, and local memory and disk space. However, the private slave network may also have a large and shared file server that stores global persistent data, accessed by the slaves as needed.<ref name=VECPAR />
 
A special purpose 144-node [[DEGIMA (computer cluster)|DEGIMA cluster]] is tuned to running astrophysical N-body simulations using the Multiple-Walk parallel tree code, rather than general purpose scientific computations.<ref name=Hamada>{{cite journal|first=Tsuyoshi|last=Hamada |display-authors=etal |year=2009|title=A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation|journal=Computer Science - Research and Development|volume=24|issue=1–2 |pages=21–31 |doi=10.1007/s00450-009-0089-1|s2cid=31071570 }}</ref>
 
Due to the increasing computing power of each generation of [[game console]]s, a novel use has emerged where they are repurposed into [[high-performance computing]] (HPC) clusters. Some examples of game console clusters are [[PlayStation 3 cluster|Sony PlayStation clusters]] and [[Microsoft]] [[Xbox (console)|Xbox]] clusters. Another example of a consumer game product is the [[Nvidia Tesla Personal Supercomputer]] workstation, which uses multiple graphics accelerator processor chips. Besides game consoles, high-end graphics cards can also be used. Using graphics cards (or rather their GPUs) to do calculations for grid computing is vastly more economical than using CPUs, despite being less precise. However, when using double-precision values, they become as precise to work with as CPUs and are still much less costly (purchase cost).<ref name=pcauthority />
===Data sharing===
[[File:Nec-cluster.jpg|thumb|A [[NEC]] [[Nehalem (microarchitecture)|Nehalem cluster]]]]
As the computer clusters were appearing during the 1980s, so were [[supercomputer]]s. One of the elements that distinguished the three classes at that time was that the early supercomputers relied on [[Shared memory architecture|shared memory]]. Clusters do not typically use physically shared memory, while many supercomputer architectures have also abandoned it.
 
However, the use of a [[clustered file system]] is essential in modern computer clusters.{{Citation needed|date=August 2013}} Examples include the [[IBM General Parallel File System]], Microsoft's [[Cluster Shared Volumes]] or the [[Oracle Cluster File System]].
==Cluster management==
[[File:Cubieboard HADOOP cluster.JPG|thumb|Low-cost and low energy tiny-cluster of [[Cubieboard]]s, using [[Apache Hadoop]] on [[Lubuntu]]]]
[[File:Circumference C25 (41227579055).png|thumb|A pre-release sample of the Ground Electronics/AB Open Circumference C25 cluster [[Computers|computer]] system, fitted with 8x [[Raspberry Pi]] 3 Model B+ and 1x UDOO x86 boards.]]
One of the challenges in the use of a computer cluster is the cost of administering it, which can at times be as high as the cost of administering N independent machines if the cluster has N nodes.<ref name=patter641 >{{cite book|title=Computer Organization and Design|first1=David A.|last1=Patterson|first2=John L.|last2=Hennessy|year=2011|isbn=978-0-12-374750-1|pages=641–642|publisher=Elsevier }}</ref> In some cases this provides an advantage to [[shared memory architecture]]s with lower administration costs.<ref name=patter641 /> This has also made [[virtual machine]]s popular, due to the ease of administration.<ref name=patter641 />
 
 
==Implementations==
The Linux world supports various cluster software. For application clustering, there are [[distcc]] and [[MPICH]]. [[Linux Virtual Server]] and [[Linux-HA]] are director-based clusters that allow incoming requests for services to be distributed across multiple cluster nodes. [[MOSIX]], [[LinuxPMI]], [[Kerrighed]] and [[OpenSSI]] are full-blown clusters integrated into the [[kernel (computer science)|kernel]] that provide for automatic process migration among homogeneous nodes. [[OpenSSI]], [[openMosix]] and [[Kerrighed]] are [[single-system image]] implementations.
 
[[Microsoft Windows]] Compute Cluster Server 2003, based on the [[Windows Server]] platform, provides pieces for high-performance computing such as the job scheduler, MSMPI library and management tools.
:* [[Solaris Cluster]]
:* [[Veritas Cluster Server]]
:* [[Beowulf cluster]]
 
''Computer farms''
{{Commons category|Clusters (computing)}}
* [https://web.archive.org/web/20190219183441/https://www.ieeetcsc.org/ IEEE Technical Committee on Scalable Computing (TCSC)]
* [https://archive.today/20130103192843/http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom/com.ibm.cluster.rsct.doc%2Frsctbooks/rsctbooks.html Reliable Scalable Cluster Technology, IBM]{{Dead link|date=July 2020 |bot=InternetArchiveBot |fix-attempted=yes }}
* [https://www.ibm.com/developerworks/wikis/display/tivoli/Tivoli+System+Automation Tivoli System Automation Wiki]
* [https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf Large-scale cluster management at Google with Borg], April 2015, by Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune and John Wilkes