Distributed operating system

This is an old revision of this page, as edited by JLSjr (talk | contribs) at 01:22, 10 May 2010. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.



A Distributed operating system is the logical aggregation of operating system software over a collection of independent, networked, communicating, and spatially disseminated computational nodes.[1] Individual system nodes each hold a discrete software subset of the global aggregate operating system. Each node-level software subset is a composition of two distinct provisioners of services.[2]

The first is a ubiquitous minimal kernel, or microkernel, situated directly above each node’s hardware. The microkernel provides only the necessary mechanisms for a node's functionality. Second, is a higher-level collection of system management components, providing all necessary policies for a node's individual and collaborative activities. This collection of management components exists immediately above the microkernel, and below any user applications or APIs that might reside at higher levels.[3]

These two entities, the microkernel and the management components collection, work together, They support the global system’s goal of seamlessly integrating all network-connected resources and processing functionality into an efficient, available, and unified system.[4] This seamless integration of individual nodes into a global system is referred to as transparency, or Single system image; describing the illusion provided to users of the global system’s appearance as a singular and local computational entity.

Description

A system within a system

Distributed microkernel OS model

Operating system organization:
Monolithic vs. Distributed OS

A Distributed operating system is an operating system. This statement may be trivial, but it is not always overt and obvious because the distributed operating system is such an integral part of the distributed system. This idea is synonymous to the consideration of a square. A square might not immediately be recognized as a rectangle. Although possessing all requisite attributes defining a rectangle, a square’s additional attributes and specific configuration provide a disguise. At its core, the distributed operating system provides only the essential services and minimal functionality required of an operating system, but its additional attributes and particular configuration make it different. The Distributed operating system fulfills its role as operating system; and does so in a manner indistinguishable from a centralized, monolithic operating system. That is, although distributed in nature, it supports a Single system image through the implementation of Transparency; or more simply said, the system’s appearance as a singular, local entity.

Operating system basics

An operating system, at a basic level, is expected to isolate and manage the physical complexities of lower-level hardware resources. In turn, these complexities are organized into simplified logical abstractions and presented to higher-level entities as interfaces into the underlying resources. These marshalling and presentation activities take place in a secure and protected environment, often referred to as the “system-level,” and describe a minimal scope of practical operating system functionality. In graphical depictions however, most monolithic operating systems would be illustrated as a discrete container sandwiched between the local hardware resources below and application programs above. The operating system container would be filled with a robust compliment of services and functions to support as many potential needs as possible or practical. This full-featured collection of services would reside and execute at the system-level and support higher, “user-level” applications and services.

Distributed operating system essentials

A distributed operating system, illustrated in a similar fashion, would be a container suggesting minimal operating system functionality and scope. This container would completely cover all disseminated hardware resources, defining the system-level. The container would extend across the system, supporting a layer of modular software components existing in the user-level. These software components supplement the distributed operating system with a configurable set of added services, usually integrated within the monolithic operating system (and the system-level). This division of minimal system-level function from additional user-level modular services provides a “separation of mechanism and policy.” Mechanism and policy can be simply interpreted as "how something is done" versus "why something is done," respectively. Achieving this separation allows for an exceptionally loosely coupled, flexible, and scalable distributed operating system.

Overview

The kernel

The kernel is a minimal, but complete set of node-level utilities necessary for access to a node’s underlying hardware and resources. These mechanisms provide the complete set of “building-blocks” essential for node operation; mainly low-level allocation, management, and disposition of a node’s resources, processes, communication, and I/O management support functions.[5] These functions are made possible by exposing a concise, yet comprehensive array of primitive mechanisms and services. The kernel is arguably the primary consideration in a distributed operating system; however, within the kernel, the subject of foremost importance is that of a well-structured and highly-efficient communications sub-system.[3]

In a distributed operating system, the kernel is often defined by a relative to absolute minimal architecture. A Kernel of this design is referred to as a Microkernel.[6] [7] The microkernel usually contains only the mechanisms and services which, if otherwise removed, would render a node or the global system functionally inoperable. The minimal nature of the microkernel strongly enhances a distributed operating system’s modular potential.[8] It is generally the case that the microkernel is implemented directly above its node’s hardware and resources; it is also common for a kernel to be idntically replicated over all nodes in a system.[9] The combination of a microkernel’s minimal design and ubiquitous node coverage enhances the global system's extensibility, and the ability to dynamically introduce new nodes or services.[10]

 
System management components overview

System management components

A node’s system management components are a collection of software server processes that basically define the policies of a system node. These components are the composite of a node’s system software not directly required within the kernel. These software services support all of the needs of the node; namely communication, process and resource management, reliability, performance, and security to mention just a few. In this capacity, system management components compare directly to the centralized operating software of a single-entity system.[3]

However, these system management components have additional challenges with respect to supporting a node's responsibilities to the global system. In addition, the system management components accept the defensive responsibilities of reliability, availability, and persistence inherent to the distributed operationg system. Quite often, any effort to realize a high-level of success in a particular area, incites conflict with similar efforts in other areas. Therefore, a consistent approach, balanced perspective, and a deep understanding of the overall system and its goals can help mitigate some complexity, and assist in quickly identifying potential points of diminishing returns. This is an example of why the separation of policy and mechanism is so critical.[10]

Working together as an operating system

The architecture and design of a distributed operating system is specifically aligned with realizing both individual node and global system goals. Any architecture or design must be approached in a manner consistent with separating policy and mechanism. In doing so, a distributed operating system attempts to provide a highly efficient and reliable distributed computing framework allowing for an absolute minimal user awareness of the underlying command and control efforts.[8]

The multi-level collaboration between a kernel and the system management components, and in turn between the distinct nodes in a distributed operating system is the functional challenge of the distributed operating system. This is the point in the system that must maintain a perfect harmony of purpose, and simultaneously maintain a complete disconnect of intent from implementation. This challenge is the distributed operating system's opportunity, to produce the foundation and framework for a reliable, efficient, available, robust, extensible, and scalable system. However, this opportunity comes at a very high cost in complexity.

The price of complexity

In a distributed operating system, the exceptional degree of inherent complexity could easily render the entire system an anathema to any user. As such, the logical price of realizing a distributed operation system must be calculated in terms of overcoming vast amounts of complexity in many areas, and on many levels. This calculation includes the depth, breadth, and range of design investment and architectural planning required in achieving even the most modest implementation.[11]

These design and development considerations are critical and unforgiving. For instance, a deep understanding of a distributed operating system’s overall architectural and design detail is required at an exceptionally early point.[1] There are an exhaustive array of design considerations inherent to the development of a distributed operating system. Each of these design considerations can potentially effect many of the others to a significant degree. This leads to a massive effort in balanced approach, in terms of the individual design considerations, and many of their permutations. As an aid in this effort, most rely strongly on the immense amount of documented experience and research in distributed computing which exists, and continues even today.

Perspectives: past, present, and future

Many look to the early 1970s for meaningful beginnings in distributed operating system research. These complete in definition and capability of being considered and implemented wholly. Research and experimentation efforts did began in earnest in the 1970s and continued through 1990s, with focused interest peaking in the late 1980's. A number of distributed operating systems were introduced during this period; however, very few of these implementations achieved modest commercial success.

The subject of distributed operating systems however, has a much richer historical perspective. This is especially evident when considering distributed operating system design issues severally, and with respect to some of the primordial strides taken towards their realization. There are several instances of fundamental and pioneering implementations of primitive distributed operating system component concepts dating back to the early 1950s.[12] [13] [14] Some of these very early individual steps were not focused directly on ditributed computing, and at the time, many may not have realized there important impact. These pioneering efforts laid important groundwork, and inspired continued research in areas related to distributed computing.

Begining in the mid 1970's, many important research efforts produced extremely important advances in distributed computing. These breakthroughs provided a solid, stable foundation for the continued efforts through the 1990's, mentioned earlier. Considering the modern distributed operating system and its future, one must look no further than the current incredible challenges of many-core and multi-processor science. The accelerating proliferation of multi-processor and multi-core processor systems research has led to a resurgence of the distributed operating system concept. Many of these research efforts are investigating interesting, exciting, and plausible paradigms impacting the future of distributed computing.

Distributed computing models

The nature of distribution

Generalized organization of nodes in a centralized model.

Generalized organization of nodes in a networked model.

Generalized organization of nodes in a distributed model.

Node Organization in different computing models

The unique nature of the Distributed operating system is both subtle and complex. A distributed operating system’s hardware infrastructure elements are not centralized, that is the elements do not have a tight proximity to one another at a single ___location. A given distributed operating system’s structure elements could reside in various rooms within a building, or in various buildings around the world. This geographically spatial dissemination defines its decentralization; however, the distributed operating system is distributed, not simply decentralized.

This distinction is the source of the subtlety and complexity. While decentralized systems and distributed operating systems are both spatially diverse, it is the specific manner of and relative degree in linkage between the elements, or nodes in the systems that differentiate the two. In the case of these two types of operating system, these linkages are the lines of communication between the nodes of the system.

Three basic distributions

To better illustrate this point, let us more closely reflect upon these three system architectures; centralized, decentralized, and distributed. In this examination, we will consider three tightly-related aspects of their structure: organization, connection, and control. Organization will describe physical arrangement characteristics, connection will involve associations among constituent structural entities, and control will correlate the manner, necessity, and rationale of the earlier considerations.

Organization

Firstly, we consider the subject of organization. A centralized system is organized most simply, basically one real level of structure and all constituent element’s highly influenced by and ultimately dependent upon this organization. The Decentralized system is a more federated structure, multiple levels where subsets of a system’s entities unite, these entity collections in turn uniting at higher levels, in the direction of and culminating at the central element. The distributed operating system has no discernable or necessary levels; it is purely an autonomous collection of discrete elements.

Connection

Association linkages between elements will be the second consideration. In each case, physical association is inextricably linked (or not), to conceptual organization. The centralized system has its constituent members directly united to a central entity. One could conceptualize holding a bunch of balloons -- each on a string, -- with the hand being the central figure. A decentralized system incorporates a single-step direct, or multi-step indirect path between any given constituent element and the central entity. This can be understood by thinking of a corporate organizational chart, the first level connecting directly, and lower levels connecting indirectly through successively higher levels (no lateral “dotted” lines). Finally, the distributed system has no inherent pattern; direct and indirect connections are possible between any two given elements of the system. Think of the 1970’s phenomena of “string art,” a spirograph drawing, a spider’s web, or the Interstate Highway System between U.S. cities.

Control

Notice, that the centralized and decentralized systems have distinctly directed flows of connection towards the central entity, while the distributed operating system is in no way influenced specifically by virtue of its organization. This is the pivotal notion of the third consideration. What correlations exist between a system’s organization, and its associations? In all three cases, it is an extremely delicate balance between the administration of processes, and the scope and extensibility of those processes; in essence is about the sphere of control. Simply put, in the directed systems there is more control, easing administration of processes, but constraining their possible scope. On the other hand, the distributed operating system is much more difficult to control, but is effectively limited in extensible scope only by the capabilities of that control. The associations of the distributed operating system conform to the needs of its processes, and not inherently in any way to its organizational configuration. There are key collections of extended distributed operating system processes discussed later in this article.

Conclusions

Lastly, as to the nature of the distributed operating system, has at times been considered not necessarily an operating system at all; but simply "as" the distributed system. This view is commonly justified by pointing to the deep and inextricable integration of the distributed operating system into the distributed system. The absolute and singular focus of sustaining and maintenance of the system is also used as a rationale. However, it is important to remember the separation of mechanism and policy. The distributed operating system and its responsibility in providing mechanisms are not affected to any degree by the method of its integration in a system, just as no amount of focus on providing these mechanisms change their responsibility toward policy; or more importantly, the expectation of results at the system level. As mentioned earlier, a square is a rectangle; and no level of effort exerted by the square in maintaining four equivalent dimensions changes anything.

Major Design Considerations

Transparency

Transparency, simply put, is the quality of a distributed operating system to be seen and understood as a single-system image. Transparency is the greatest overriding consideration in the high-level conceptual design of a distributed operating system. While a simple concept, the consideration of transparency directly effects decision making in every aspect of design of a distributed operating system. Depending on the degree to which transparency is implemented into a system, certain requirements and/or restrictions may be imposed upon the many design considerations, and the relationships between them.

Inter-process communication

Inter-Process Communication (IPC) is the implementation of general communication, process interaction, and data flow between threads and/or processes both within a system node, and between all nodes in a distributed operating system. The distributed nature of a system's nodes and the multi-level considerations of intra-node and inter-node requirements provide the base-line for high-level IPC design considerations. However, IPC in a distributed operating system is a low-level implementation. IPC is the low-level critical complement to the high-level concept of transparency. Many of the requirements and restrictions imposed on a system as a result of transparency will be accomplished directly or indirectly through IPC. In this sense, IPC is the greatest underlying concept in the low-level design considerations of a distributed operating system.

Process management

Process management provides policies and mechanisms for effective and efficient sharing of a system's distributed processing resources between that system's distributed processes. These policies and mechanisms support operations involving the allocation and de-allocation of processes and ports, as well as provisions to run, suspend, migrate, halt, or resume execution of processes, to mention a few. While these distributed operating system resources and the operations on them can be either local or remote with respect to each other, the distributed operating system must still maintain complete state of and synchronization over all processes in the system; and do so in a manner completely consistent from the user's unified system perspective.

As an example, load balancing is a common process management function. One consideration of load balancing is which process should be moved. The kernel may have several mechanisms, one of which might be priority-based choice. This mechanism in the kernel defines what can be done; in this case, choose a process based on some priority. The system management components would have policies implementing the decision making for this context. One of these policies would define what priority means, and how it is to be used to choose a process in this instance.

Resource management

Systems resources such as memory, files, devices, etc. are distributed throughout a system, and at any given moment, any of these nodes may have light to idle workloads. Load sharing and load balancing require many policy-oriented decisions, ranging from finding idle CPUs, when to move, and which to move. Many algorithms exist to aid in these decisions; however, this calls for a second-level of decision making policy in choosing the algorithm best suited for the scenario, and the conditions surrounding the scenario.

Reliability

One of the basic tenants of distributed operating systems is a high-level of reliability. This quality attribute of a distributed operating system has become a staple expectation. Reliability is most often considered from the perspectives of availability and security of a system's hardware, services, and data. Issues arising from availability failures or security violations are considered faults. Faults are physical or logical defects that can cause errors in the system. For a system to be reliable, it must somehow overcome the adverse effects of faults. There are three general methods for dealing with faults: fault avoidance, fault tolerance, and fault detection and recovery. Fault avoidance are proactive measures taken to minimize the occurrence of faults. These proactive measures can be in the form of transactions, replicated resources and processes, and primary back-ups of complete servers. Fault tolerance is the ability of a system to continue some meanful level of operation in the face of a fault. In the event a fault does occur, the system should detect the fault and have the capability to respond quickly and effectively to recover full functionality. In any event, Any actions taken should make every effort to preserving the single system image.

Performance

Performance is arguably the quintessential computing concern, and in the distributed operating system, it is no different. Many benchmark metrics exist for performance; throughput, job completions per unit time, system utilization, etc. Each of these benchmarks are more meaningful in describing some scenarios, and less in others. With respect to a distributed operating system, this consideration most often distills to a balance between process parallelism and IPC. Managing the task granularity of parallelism in a sensible relation to the messages required for support is extremely effective. Also, identifying when it is more beneficial to migrate a process to its data, rather than copy the data, is effective as well. Many process and resource management algorithms, and algorithms in this space work to maximize performance.

Synchronization

Cooperating concurrent processes have an inherent need for synchronization. Three basic situations that define the scope of this need; one or more processes must synchronize at a given point for one or more other processes to continue, one or more processes must wait for an asynchronous condition in order to continue, or a process must establish mutual exclusive access to a shared resource. There is a multitude of algorithms available for these scenarios, and their many variations. Unfortunately, whenever synchronization is required the opportunity for process deadlock usually exists. The ancillary situation of deadlock is covered below.

Flexibility

Flexibility in a distributed operating system is made possible through the modular characteristics of the microkernel. With the microkernel presenting a minimal -- but complete -- set of primitives and basic functionally cohesive services, The higher-level management components can be composed in a similar functionally cohesive manner. This capability leads to exceptional flexibility in the management components collection; but more importantly, it allows the opportunity to dynamically swap, upgrade, or install additional of components above the kernel.

Transparency responsibilities

Transparency is a property of a system or application, that allows a user to accomplish an objective with little, if any knowledge of the particular internal details related to the objective. A system or application may expose as much, or as little transparancy in a given area of functionality as deemed necessary. That is to say, the degree to which transparency is implemented can vary between subsets of functionality in a system or application. There are many specific areas of a system that can benefit from transparency; access, ___location, performance, naming, and migration to name a few.

For example, a distributed operating system may present access to a hard drive as "C:" and access to a DVD as "G:". The user does not require any knowledge of device drivers or methods of direct memory access techniques possibly used behind-the-scenes; both devices work the same way, from the user's perspective. This example demonstrates a high-level of transparency; and displays how low-level details are made somewhat "invisible" to the user through transparency. On the other hand, if a user desires to access another system or server, a host name or IP address may be required, along with a remote-machine user login and password. This would indicate a low-degree of transparency, as there is detailed knowledge required of the user in order to accomplish this task.

Generally, transparency and user-required knowledge form an inverse relation. As transparency is designed and implemented into various areas of a system, great care must be taken not to adversely effect other areas of transparency and other basic design concerns. Transparency, as a design concept, is one of the grand challenges in design of a distributed operating system; as it is a factor in the necessity for a complete upfront understanding.

Location transparency

Location transparency comprises two distinct aspects, Naming and User mobility. Naming transparency requires that nothing in the physical or logical references to an entity should expose any indication of the entities ___location. User mobility requires consistent referencing of an entity regardless of its ___location within the system. These two related concepts, naming transparency and user mobility, work together to remove the need for a user's knowledge regarding specific entities' details within a system.

Access transparency

Local and remote resources should remain indistinguishable through user interface system calls. The Distributed operating system maintains a user's perception of these entities in a clean, clear, and consistent manner.

System entities or processes maintain consistent access/entry mechanism, regardless of being local or remote

Migration transparency

Resources and processes can be migrated, without user-knowledge, by the system to another node in an attempt to maximize efficiency, reliability, and security. Requires policy decision-making abilities, Naming stability, and in the event of a process migration, all IPC messages must be received or held pending the migration.

Replication transparency

Systems entities can be copied to strategic points in the system to increase efficiencies through better proximity, and also provide for improved reliability through the distributed replication as a back-up; prompted by dynamic stratagem.

Concurrency transparency

System should possess and exhibit properties to allow multiple simultaneous uses of system resources between users ho are kept unaware of the concurrent usage. Required properties are synchronization mechanisms to keep events ordered and consistent, mutual-exclusivity management for resources, sufficient capabilities to detect and recover from both starvation and deadlock.

Parallel transparency

System should have stable performance characteristics, regardless if some nodes increase rapidly in workload, through properties of migration, replication, and concurrency. This requires an intelligent policy decision stratagem to facilitate the timely and accurate allocation, migration, and disposition of resources.

Failure transparency

The system should shield users from the knowledge of and the affects resulting from failures. In the event of a partial failure, the system is responsible for rapid and accurate detection and orchestration of a remedy with little, if any imposition on users. These methods can range from static proactive posturing to dynamic and more flexible response mechanisms.

Perform Transparency

System should create and maintain a reasonable, stable, and predictable performance expectation for the user, that is both resilient from and helpful in situations where parts of the system may experience significant delay or even failure. While reasonable and predictable are important, there should be no inherent expectation or expressed indication of fairness or equality.

Name transparency

All system entities should maintain a complete decoupling between entity naming from any spatial or temporal ___location, as well as any other system entity.

Size/Scale transparency

A user's experience or perception of their system should remain stable and consistent in the face of system extension, scaling, or waning due to failure.

Revision transparency

System users should be completely oblivious to system-software version changes and changes in internal implementation of system infrastructure. While a user may become aware of, or discover the availability of a new function or service, the implementation or alteration of the systems internal structure should in no way be the prompt for this discovery.

Control transparency

All system constants, properties, configuration settings, etc. should be completely consistent in appearance, connotation, and denotation to all users and software applications aware of them.

Data transparency

No system data-entity should expose itself as peculiar when required to interact remotely.


Historical perspectives

Pioneering inspirations

With a cursory glance around the internet, or a modest perusal of pertinent writings, one could very easily gain the notion that computer operating systems were a new phenomenon in the mid-twentieth century. In fact, important research in operating systems was being conducted at this time.[15][16][17][18][19][20] While early exploration into operating systems took place in the years leading to 1950; shortly afterward, highly advanced research began on new systems to conquer new problems. In the first decade of the second-half of the 20th century, many new questions were asked, many new problems were identified, many solutions were developed and working for years, in controlled production environments.

Aboriginal distributed computing

The DYSEAC[12] (1954)

One of the first solutions to these new questions was the DYSEAC, a self-described general-purpose synchronous computer; but at this point in history, exhibited signs of being much more than general-purpose. In one of the earliest publications of the ACM, in April of 1954, a researcher at the National Bureau of Standards – now the National Institute of Standards and Technology (NIST) – presented a detailed implementation design specification of the DYSEAC. Without carefully reading the entire specification, one could be misled by summary language in the introduction, as to the nature of this machine. The initial section of the introduction advises that major emphasis will be focused upon the requirements of the intended applications, and these applications would require flexible communication. However, suggesting the external devices could be typewriters, magnetic medium, and CRTs, and with the term “input-output operation” used more than once, could quickly limit any paradigm of this system to a complex centralized “ensemble.” Seemingly, saving the best for last, the author eventually describes the true nature of the system.

Finally, the external devices could even include other full-scale computers employing the same digital language as the DYSEAC. For example, the SEAC or other computers similar to it could be harnessed to the DYSEAC and by use of coordinated programs could be made to work together in mutual cooperation on a common task… Consequently[,] the computer can be used to coordinate the diverse activities of all the external devices into an effective ensemble operation.

— ALAN L. LEINER, System Specifications for the DYSEAC

While this more detailed description elevates the perception of the system, the best that can be distilled from this is some semblance of decentralized control. The avid reader, persevering in the investigation would get to a point at which the real nature of the system is divulged.

Each member of such an interconnected group of separate computers is free at any time to initiate and dispatch special control orders to any of its partners in the system. As a consequence, the supervisory control over the common task may initially be loosely distributed throughout the system and then temporarily concentrated in one computer, or even passed rapidly from one machine to the other as the need arises. …it should be noted that the various interruption facilities which have been described are based on mutual cooperation between the computer and the external devices subsidiary to it, and do not reflect merely a simple master-slave relationship.

— ALAN L. LEINER, System Specifications for the DYSEAC

This is one of the earliest examples of a computer with distributed control. Dept. of the Army reports[21] show it was certified reliable and passed all acceptance tests in April of 1954. It was completed and delivered on time, in May of 1954. In addition, was it mentioned that this was a portable computer? It was housed in tractor-trailer, and had 2 attendant vehicles and 6 tons of refrigeration capacity.

Multi-programming abstraction

The Lincoln TX-2[13] (1957)

Described as an input-output system of experimental nature, the Lincoln TX-2 placed a premium on flexibility in its association of simultaneously operational input-output devices. The design of the TX-2 was modular, supporting a high degree of modification and expansion, as well as flexibility in operating and programming of its devices. The system employed The Multiple-Sequence Program Technique.

This technique allowed for multiple program counters to each associate with one of 32 possible sequences of program code. These explicitly prioritized sequences could be interleaved and executed concurrently, affecting not only the computation in process, but also the control flow of sequences and switching of devices as well. Much discussion ensues related to the complexity and sophistication in the sequence capabilities of devices.

Similar to the previous system, the TX-2 discussion has a distinct decentralized theme until it is revealed that efficiencies in system operation are gained when separate programmed devices are operated simultaneously. It is also stated that the full power of the central unit can be utilized by any device; and it may be used for as long as the device's situation requires. In this, we see the TX-2 as another example of a system exhibiting distributed control, its central unit not having dedicated control.

Memory access abstraction

Intercommunicating Cells, Basis for a Distributed Logic Computer[14] (1962)

One early memory access paradigm was Intercommunicating Cells, where a cell is composed of a collection of memory elements. A memory element was basically a electronic flip-flop or relay, capable of two possible values. Within a cell there are two types of elements, symbol and cell elements. Each cell structure stores data in a string of symbols, consisting of a name and a set of associated parameters. Consequently, a system's information is linked through various associations of cells.

Intercommunicating Cells fundamentally break from tradition in that it has no counters or any concept of addressing memory. The theory contends that addressing is a wasteful and non-valuable level of indirection. Information is accessed in two ways, direct and cross-retrieval. Direct retrieval looks to a name and returns a parameter set. Cross-retrieval projects through parameter sets and returns a set of names containing the given subset of parameters. This would be similar to a modified hash table data structure that would allow for multiple values (parameters) for each key (name).

Cellular memory would have many advantages:
  A major portion of a system's logic is distributed within the associations of information stored in the cells,
  This flow of information association is somewhat guided by the act of storing and retrieving,
  The time required for storage and retrieval is mostly constant and completely unrelated to the size and fill-factor of the memory
  Cells are logically indistinguishable, making them both flexible to use and relatively simple to extend in size

This early research into alternative memory describes a configuration ideal for the distributed operating system. The constant-time projection through memory for storing and retrieval would be inherently atomic and exclusive. The cellular memory's intrinsic distributed characteristics would be an invaluable benefit; however, the impact on the user, hardware/device, or Application programming interfaces is uncertain. It is distinctly obvious that these early researchers had a distributed system concept in mind, as they state:

We wanted to present here the basic ideas of a distributed logic system with... the macroscopic concept of logical design, away from scanning, from searching, from addressing, and from counting, is equally important. We must, at all cost, free ourselves from the burdens of detailed local problems which only befit a machine low on the evolutionary scale of machines.

— Chung-Yeol (C. Y.) Lee, Intercommunicating Cells, Basis for a Distributed Logic Computer

Component abstraction

HYDRA:The Kernel of a Multiprocessor Operating System[22] (1974)
The design philosophy of HYDRA ... suggest that, at the heart of the system, one should build a collection of facilities of "universal applicability" and "absolute reliability" -- a set of mechanisms from which an arbitrary set of operating system facilities and policies can be conveniently, flexibly, efficiently, and reliably constructed.
Defining a kernel with all the attributes given above is difficult, and perhaps impractical... It is, nevertheless, the approach taken in the HYDRA system. Although we make no claim either that the set of facilities provided by the HYDRA kernel ... we do believe the set provides primitives which are both necessary and adequate for the construction of a large and interesting class of operating environments. It is our view that the set of functions provided by HYDRA will enable the user of C.mmp to create his own operating environment without being confined to predetermined command and file systems, execution scenarios, resource allocation policies, etc.

Initial composition

The National Software Works: A Distributed Processing System[23] (1975)

The National Software Works (NSW) is a significant new step in the development of distributed processing systems and computer networks. NSW is an ambitious project to link a set of geographically distributed and diverse hosts with an operating system which appears as a single entity to a prospective user.

Complete instantiation

The Rosco Distributed Operating System[24] (1979)

Roscoe is an operating system implemented at the University of Wisconsin that allows a network of microcomputers to cooperate to provide a general-purpose computing facility. The goal of the Roscoe network is to provide a general-purpose computation resource in which individual resources such as files and processors are shared among processes and control is distributed in a non-hierarchical fashion. All processors are identical. Similarly, all processors run the same operating system kernel. However, they may differ in the peripheral units connected to them. No memory is shared between processors. All communication involves messages explicitly passed between physically connected processors. No assumptions are made about the topology of interconnection.

The decision not to use logical or physical sharing of memory for communication is influenced both by the constraints of currently available hardware and by our perception of cost bottlenecks likely to arise as the number of processors increases.

Foundational work

Coherent memory abstraction

 Algorithms for scalable synchronization on shared-memory multiprocessors[25]
 A N algorithm for mutual exclusion in decentralized systems[26]

File System abstraction

 Measurements of a distributed file system[27]
 Memory coherence in shared virtual memory systems[28]

Transaction abstraction

 Transactions
 Sagas[29]

 Transactional Memory
 Composable memory transactions[30]
 Transactional memory: architectural support for lock-free data structures[31]
 Software transactional memory for dynamic-sized data structures[32]
 Software transactional memory[33]

Persistence abstraction

 OceanStore: an architecture for global-scale persistent storage[34]

Coordinator abstraction

 Weighted voting for replicated data[35]
 Consensus in the presence of partial synchrony[36]

Reliability abstraction

 Sanity checks
 The Byzantine Generals Problem[37]
 Fail-stop processors: an approach to designing fault-tolerant computing systems[38]

 Recoverability
 Distributed snapshots: determining global states of distributed systems[39]
 Optimistic recovery in distributed systems[40]

Current research

Replicated model extended to a component object model

 Architectural Design of E1 Distributed Operating System[41]
 The Cronus distributed operating system[42]
 Fine-grained mobility in the emerald system[43]
 Design and development of MINIX distributed operating system[44]

Future directions

Complexity/Trust exposure through accepted responsibility

Application performance and flexibility on exokernel systems.[45]
Scale and performance in the Denali isolation kernel.[46]

Multi/Many-core focused systems

The multikernel: a new OS architecture for scalable multicore systems.[47]
Corey: an Operating System for Many Cores.[48]

Distributed processing over extremes in heterogeneity

Helios: heterogeneous multiprocessing with satellite kernels.[49]

Effective and stable in multiple levels of complexity

Tesselation

See Also

  • Coming Soon...

References

  1. ^ a b Tanenbaum, Andrew S. 1993 Distributed operating systems anno 1992. What have we learned so far? Distributed Systems Engineering, 1, 1 (1993), 3-10
  2. ^ Nutt, G. J. 1992 Centralized and Distributed Operating Systems. Prentice Hall Press.
  3. ^ a b c Distributed Operating Systems: The Logical Design, 1st edition Goscinski, A. 1991 Distributed Operating Systems: the Logical Design. 1st. Addison-Wesley Longman Publishing Co., Inc.
  4. ^ Fortier, P. J. 1986 Design of Distributed Operating Systems: Concepts and Technology. Intertext Publications, Inc., McGraw-Hill, Inc.
  5. ^ P. Brinch Hansen, Ed. 2000 Classic Operating Systems: from Batch Processing to Distributed Systems. Springer-Verlag New York, Inc.
  6. ^ Using LOTOS for specifying the CHORUS distributed operating system kernel Pecheur, C. 1992. Using LOTOS for specifying the CHORUS distributed operating system kernel. Comput. Commun. 15, 2 (Mar. 1992), 93-102.
  7. ^ COOL: kernel support for object-oriented environments Habert, S. and Mosseri, L. 1990. COOL: kernel support for object-oriented environments. In Proceedings of the European Conference on Object-Oriented Programming on Object-Oriented Programming Systems, Languages, and Applications (Ottawa, Canada). OOPSLA/ECOOP '90. ACM, New York, NY, 269-275.
  8. ^ a b Distributed Operating Systems: Concepts and Design Sinha, P. K. 1996 Distributed Operating Systems: Concepts and Design. 1st. Wiley-IEEE Press.
  9. ^ Distributed Operating Systems Galli, D. L. 1999 Distributed Operating Systems: Concepts and Practice. 1st. Prentice Hall PTR.
  10. ^ a b Distributed Operating Systems and Algorithms Chow, R. and Chow, Y. 1997 Distributed Operating Systems and Algorithms. Addison-Wesley Longman Publishing Co., Inc.
  11. ^ Surajbali, B., Coulson, G., Greenwood, P., and Grace, P. 2007. Augmenting reflective middleware with an aspect orientation support layer. In Proceedings of the 6th international Workshop on Adaptive and Reflective Middleware: Held At the ACM/IFIP/USENIX international Middleware Conference (Newport Beach, CA, November 26 - 30, 2007). ARM '07. ACM, New York, NY, 1-6.
  12. ^ a b Leiner, A. L. 1954. System Specifications for the DYSEAC. J. ACM 1, 2 (Apr. 1954), 57-81.
  13. ^ a b Forgie, J. W. 1957. The Lincoln TX-2 input-output system. In Papers Presented At the February 26-28, 1957, Western Joint Computer Conference: Techniques For Reliability (Los Angeles, California, February 26 - 28, 1957). IRE-AIEE-ACM '57 (Western). ACM, New York, NY, 156-160.
  14. ^ a b Lee, C. Y. 1962. Intercommunicating cells, basis for a distributed logic computer. In Proceedings of the December 4-6, 1962, Fall Joint Computer Conference (Philadelphia, Pennsylvania, December 04 - 06, 1962). AFIPS '62 (Fall).
  15. ^ Dreyfuss, P. 1958. System design of the Gamma 60. In Proceedings of the May 6-8, 1958, Western Joint Computer Conference: Contrasts in Computers (Los Angeles, California, May 06 - 08, 1958). IRE-ACM-AIEE '58 (Western). ACM, New York, NY, 130-133.
  16. ^ Leiner, A. L., Notz, W. A., Smith, J. L., and Weinberger, A. 1958. Organizing a network of computers to meet deadlines. In Papers and Discussions Presented At the December 9-13, 1957, Eastern Joint Computer Conference: Computers with Deadlines To Meet (Washington, D.C., December 09 - 13, 1957). IRE-ACM-AIEE '57
  17. ^ Leiner, A. L., Smith, J. L., Notz, W. A., and Weinberger, A. 1958. PILOT, the NBS multicomputer system. In Papers and Discussions Presented At the December 3-5, 1958, Eastern Joint Computer Conference: Modern Computers: Objectives, Designs, Applications (Philadelphia, Pennsylvania, December 03 - 05, 1958). AIEE-ACM-IRE '58 (Eastern). ACM, New York, NY, 71-75.
  18. ^ Bauer, W. F. 1958. Computer design from the programmer's viewpoint. In Papers and Discussions Presented At the December 3-5, 1958, Eastern Joint Computer Conference: Modern Computers: Objectives, Designs, Applications (Philadelphia, Pennsylvania, December 03 - 05, 1958). AIEE-ACM-IRE '58 (Eastern). ACM, New York, NY, 46-51.
  19. ^ Leiner, A. L., Notz, W. A., Smith, J. L., and Weinberger, A. 1959. PILOT—A New Multiple Computer System. J. ACM 6, 3 (Jul. 1959), 313-335.
  20. ^ Estrin, G. 1960. Organization of computer systems: the fixed plus variable structure computer. In Papers Presented At the May 3-5, 1960, Western Joint IRE-AIEE-ACM Computer Conference (San Francisco, California, May 03 - 05, 1960). IRE-AIEE-ACM '60 (Western). ACM, New York, NY, 33-40.
  21. ^ Martin H. Weik, "A Third Survey of Domestic Electronic Digital Computing Systems," Ballistic Research Laboratories Report No. 1115, pg. 234-5, Aberdeen Proving Ground, Maryland, March 1961
  22. ^ Wulf, W., Cohen, E., Corwin, W., Jones, A., Levin, R., Pierson, C., and Pollack, F. 1974. HYDRA: the kernel of a multiprocessor operating system. Commun. ACM 17, 6 (Jun. 1974), 337-345.
  23. ^ Millstein, R. E. 1977. The National Software Works: A distributed processing system. In Proceedings of the 1977 Annual Conference ACM '77. ACM, New York, NY, 44-52.
  24. ^ Solomon, M. H. and Finkel, R. A. 1979. The Roscoe distributed operating system. In Proceedings of the Seventh ACM Symposium on Operating Systems Principles (Pacific Grove, California, United States, December 10 - 12, 1979). SOSP '79.
  25. ^ Mellor-Crummey, J. M. and Scott, M. L. 1991. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9, 1 (Feb. 1991), 21-65.
  26. ^ Maekawa, M. 1985. A N algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst. 3, 2 (May. 1985), 145-159.
  27. ^ Baker, M. G., Hartman, J. H., Kupfer, M. D., Shirriff, K. W., and Ousterhout, J. K. 1991. Measurements of a distributed file system. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (Pacific Grove, California, United States, October 13 - 16, 1991). SOSP '91. ACM, New York, NY, 198-212.
  28. ^ Li, K. and Hudak, P. 1989. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov. 1989), 321-359.
  29. ^ Garcia-Molina, H. and Salem, K. 1987. Sagas. In Proceedings of the 1987 ACM SIGMOD international Conference on Management of Data (San Francisco, California, United States, May 27 - 29, 1987). U. Dayal, Ed. SIGMOD '87. ACM, New York, NY, 249-259.
  30. ^ Harris, T., Marlow, S., Peyton-Jones, S., and Herlihy, M. 2005. Composable memory transactions. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Chicago, IL, USA, June 15 - 17, 2005). PPoPP '05. ACM, New York, NY, 48-60.
  31. ^ Herlihy, M. and Moss, J. E. 1993. Transactional memory: architectural support for lock-free data structures. In Proceedings of the 20th Annual international Symposium on Computer Architecture (San Diego, California, United States, May 16 - 19, 1993). ISCA '93. ACM, New York, NY, 289-300.
  32. ^ Herlihy, M., Luchangco, V., Moir, M., and Scherer, W. N. 2003. Software transactional memory for dynamic-sized data structures. In Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing (Boston, Massachusetts, July 13 - 16, 2003). PODC '03. ACM, New York, NY, 92-101.
  33. ^ Shavit, N. and Touitou, D. 1995. Software transactional memory. In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing (Ottowa, Ontario, Canada, August 20 - 23, 1995). PODC '95. ACM, New York, NY, 204-213.
  34. ^ Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Wells, C., and Zhao, B. 2000. OceanStore: an architecture for global-scale persistent storage. In Proceedings of the Ninth international Conference on Architectural Support For Programming Languages and Operating Systems (Cambridge, Massachusetts, United States). ASPLOS-IX. ACM, New York, NY, 190-201.
  35. ^ Gifford, D. K. 1979. Weighted voting for replicated data. In Proceedings of the Seventh ACM Symposium on Operating Systems Principles (Pacific Grove, California, United States, December 10 - 12, 1979). SOSP '79. ACM, New York, NY, 150-162
  36. ^ Dwork, C., Lynch, N., and Stockmeyer, L. 1988. Consensus in the presence of partial synchrony. J. ACM 35, 2 (Apr. 1988), 288-323.
  37. ^ Lamport, L., Shostak, R., and Pease, M. 1982. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst. 4, 3 (Jul. 1982), 382-401.
  38. ^ Schlichting, R. D. and Schneider, F. B. 1983. Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Trans. Comput. Syst. 1, 3 (Aug. 1983), 222-238.
  39. ^ Chandy, K. M. and Lamport, L. 1985. Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63-75.
  40. ^ Strom, R. and Yemini, S. 1985. Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3, 3
  41. ^ L.B. Ryzhyk, A.Y. Burtsev. Architectural design of E1 distributed operating system. System Research and Information Technologies international scientific and technical journal, October 2004, Kiev, Ukraine.
  42. ^ Vinter, S. T. and Schantz, R. E. 1986. The Cronus distributed operating system. In Proceedings of the 2nd Workshop on Making Distributed Systems Work (Amsterdam, Netherlands, September 08 - 10, 1986). EW 2. ACM, New York, NY, 1-3.
  43. ^ Jul, E., Levy, H., Hutchinson, N., and Black, A. 1987. Fine-grained mobility in the emerald system. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (Austin, Texas, United States, November 08 - 11, 1987). SOSP '87. ACM, New York, NY, 105-106.
  44. ^ Ramesh, K. S. 1988. Design and development of MINIX distributed operating system. In Proceedings of the 1988 ACM Sixteenth Annual Conference on Computer Science (Atlanta, Georgia, United States). CSC '88. ACM, New York, NY, 685.
  45. ^ M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Héctor M. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. In the Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP '97), Saint-Malô, France, October 1997.
  46. ^ Whitaker, A., Shaw, M., and Gribble, S. D. 2002. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation
  47. ^ Baumann, A., Barham, P., Dagand, P., Harris, T., Isaacs, R., Peter, S., Roscoe, T., Schüpbach, A., and Singhania, A. 2009. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (Big Sky, Montana, USA, October 11 - 14, 2009). SOSP '09.
  48. ^ S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Proceedings of the 2008 Symposium on Operating Systems Design and Implementation (OSDI), December 2008.
  49. ^ Nightingale, E. B., Hodson, O., McIlroy, R., Hawblitzel, C., and Hunt, G. 2009. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (Big Sky, Montana, USA, October 11 - 14, 2009). SOSP '09.

Further reading

  • Coming Soon...
  • Coming Soon...