Supercomputer operating system
 
==Modern approaches==
[[File:IBM Blue Gene P supercomputer.jpg|240px|thumb|The [[Blue Gene]]/P supercomputer at [[Argonne National Laboratory|Argonne National Lab]]]]
The IBM [[Blue Gene]] supercomputer uses the [[CNK operating system]] on the compute nodes, but uses a modified [[Linux]]-based kernel called [[INK (operating system)|INK]] (for I/O Node Kernel) on the I/O nodes.<ref name=EuroPar2004>''Euro-Par 2004 Parallel Processing: 10th International Euro-Par Conference'' 2004, by Marco Danelutto, Marco Vanneschi and Domenico Laforenza ISBN 3-540-22924-8 page 835</ref><ref name=EuroPar2006>''Euro-Par 2006 Parallel Processing: 12th International Euro-Par Conference'', 2006, by Wolfgang E. Nagel, Wolfgang V. Walter and Wolfgang Lehner ISBN 3-540-37783-2 page</ref> CNK is a [[Lightweight Kernel Operating System|lightweight kernel]] that runs on each node and supports a single application running for a single user on that node. For the sake of efficient operation, the design of CNK was kept simple and minimal: physical memory is statically mapped, and CNK neither needs nor provides scheduling or context switching.<ref name=EuroPar2004 /> CNK does not even implement [[Input/output|file I/O]] on the compute node, but delegates that to dedicated I/O nodes.<ref name=EuroPar2006 /> However, since multiple compute nodes on the Blue Gene share a single I/O node, the I/O node operating system does require multi-tasking, hence the selection of the Linux-based operating system.<ref name=EuroPar2004/><ref name=EuroPar2006/>
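The division of labour described above can be illustrated with a short sketch: compute nodes perform no file I/O themselves, but ship each request to a shared I/O node. All class and method names here are illustrative inventions, not the actual CNK or INK interfaces, and the dictionary stands in for a real filesystem.

```python
# Sketch of Blue Gene-style I/O function shipping: a compute node kernel
# forwards every file operation to its statically assigned I/O node.
# Hypothetical names throughout; not the real CNK/INK protocol.

class IONode:
    """One I/O node serving many compute nodes -- hence it must multi-task."""
    def __init__(self):
        self.files = {}          # stand-in for a real filesystem

    def handle(self, request):
        op, path, payload = request
        if op == "write":
            self.files[path] = self.files.get(path, b"") + payload
            return len(payload)
        if op == "read":
            return self.files.get(path, b"")
        raise ValueError(f"unknown op {op!r}")

class ComputeNode:
    """Runs a single-user, single-application kernel; ships I/O away."""
    def __init__(self, io_node):
        self.io_node = io_node   # fixed assignment, as on Blue Gene

    def write(self, path, data):
        # No local file I/O: delegate to the shared I/O node.
        return self.io_node.handle(("write", path, data))

    def read(self, path):
        return self.io_node.handle(("read", path, b""))

# Several compute nodes share one I/O node.
ion = IONode()
nodes = [ComputeNode(ion) for _ in range(4)]
for rank, node in enumerate(nodes):
    node.write("/out/result", f"rank {rank}\n".encode())
print(ion.files["/out/result"].decode())
```

Because the four compute-node writes all funnel into one I/O node, that node must interleave requests from many clients, which is the multi-tasking requirement that motivated a Linux-based kernel there.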
 
While in traditional multi-user computer systems and early supercomputers [[job scheduling]] was in effect a [[task scheduling|scheduling]] problem for processing and peripheral resources, in a massively parallel system the job management system needs to manage the allocation of both computational and communication resources.<ref name=Yariv /> Tuning task scheduling, and the operating system itself, for different supercomputer configurations is essential. A typical parallel job scheduler has a [[Master/slave (technology)|master scheduler]] which instructs a number of slave schedulers to launch, monitor and control [[Parallel computing|parallel jobs]], and periodically receives reports from them about the status of job progress.<ref name=Yariv />
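The master/slave arrangement just described can be sketched in a few lines. This is a toy model under stated assumptions, not any real scheduler's API: the master holds a job queue and dispatches to slave schedulers, and each `step` models one periodic report-and-dispatch cycle.

```python
# Minimal sketch of a master/slave parallel job scheduler: the master
# assigns queued jobs to slave schedulers, which launch them and report
# progress back each cycle. All names are illustrative only.
from collections import deque

class SlaveScheduler:
    def __init__(self, name):
        self.name = name
        self.current = None      # job id, or None if idle

    def launch(self, job_id):
        self.current = job_id

    def report(self):
        # A real slave would monitor its nodes; in this toy model every
        # launched job finishes by the next reporting cycle.
        done, self.current = self.current, None
        return done              # job id that completed, or None

class MasterScheduler:
    def __init__(self, slaves):
        self.slaves = slaves
        self.queue = deque()
        self.finished = []

    def submit(self, job_id):
        self.queue.append(job_id)

    def step(self):
        # Collect periodic status reports, then dispatch to idle slaves.
        for slave in self.slaves:
            done = slave.report()
            if done is not None:
                self.finished.append(done)
        for slave in self.slaves:
            if slave.current is None and self.queue:
                slave.launch(self.queue.popleft())

master = MasterScheduler([SlaveScheduler(f"s{i}") for i in range(2)])
for j in range(5):
    master.submit(j)
while len(master.finished) < 5:
    master.step()
print(sorted(master.finished))   # → [0, 1, 2, 3, 4]
```

The master never runs jobs itself; it only allocates them and consumes reports, which is what lets the design scale to many slave schedulers.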
 
Some, but not all, supercomputer schedulers attempt to maintain locality of job execution. The [[PBS Pro|PBS Pro scheduler]] used on the [[Cray XT3]] and [[Cray XT4]] systems does not attempt to optimize locality on its three-dimensional [[torus interconnect]], but simply uses the first available processor.<ref name=Eitan/> On the other hand, IBM's scheduler on the Blue Gene supercomputers aims to exploit locality and minimize network contention by assigning tasks from the same application to one or more midplanes of an 8×8×8 node group.<ref name=Eitan>''Job Scheduling Strategies for Parallel Processing:'' by Eitan Frachtenberg and Uwe Schwiegelshohn 2010 ISBN 3-642-04632-0 pages 138–144</ref> The [[Simple Linux Utility for Resource Management|SLURM scheduler]] uses a best-fit algorithm, and performs [[Hilbert curve scheduling]] in order to optimize locality of task assignments.<ref name=Eitan/> A number of modern supercomputers such as the [[Tianhe-2]] use the SLURM job scheduler, which arbitrates contention for resources across the system. SLURM is [[open source]] and Linux-based, scales well, and can manage thousands of nodes in a computer cluster with a sustained throughput of over 100,000 jobs per hour.<ref>[http://slurm.schedmd.com/ SLURM at SchedMD]</ref><ref>Jette, M. and M. Grondona, ''SLURM: Simple Linux Utility for Resource Management'' in the Proceedings of ClusterWorld Conference, San Jose, California, June 2003 [http://www.schedmd.com/slurmdocs/slurm_design.pdf]</ref>
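The idea behind Hilbert curve scheduling can be sketched in two dimensions (SLURM applies the same principle to three-dimensional torus coordinates; the `allocate` helper is an illustrative simplification, not SLURM's actual allocator): each node's grid position is mapped to its distance along a Hilbert curve, and a job is handed a run of consecutive curve positions, which are physically adjacent.

```python
# Sketch of Hilbert curve scheduling on an n*n grid of nodes: linearize
# the grid along a Hilbert curve, then allocate jobs contiguous runs of
# curve positions, so a job's nodes stay physically close together.

def xy2d(n, x, y):
    """Distance of grid point (x, y) along the Hilbert curve of an
    n*n grid, where n is a power of two (standard conversion)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def allocate(n, job_size):
    """Pick job_size nodes contiguous along the Hilbert curve.
    A real scheduler would also track which nodes are free."""
    order = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda p: xy2d(n, *p))
    return order[:job_size]

# A 4-node job on a 4x4 grid lands on one compact 2x2 block of nodes.
print(allocate(4, 4))                    # → [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Consecutive positions on a Hilbert curve are always grid neighbours, so sorting nodes by curve distance turns the multi-dimensional locality problem into a simple one-dimensional best-fit search over runs of free nodes.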