Distributed operating system: Difference between revisions

Content deleted Content added
JLSjr (talk | contribs)
No edit summary
JLSjr (talk | contribs)
No edit summary
Line 230:
Lastly, as to the nature of the distributed system, it has been stated that a distributed operating system is not necessarily an operating system at all, but simply "is" the distributed system. This view is commonly justified by pointing to the operating system's deep and inextricable integration into the distributed system. Its absolute and singular focus on sustaining and maintaining the system is also offered as a rationale. However, it is important to remember the separation of mechanism and policy. The distributed operating system and its mechanisms are not affected by any degree of integration, and no amount of focus on providing these mechanisms changes the responsibility for policy, or the expectation of results, at the distributed system level. As mentioned earlier, a [[Square_(geometry)#Other_facts|square]] is a [[rectangle]]; and no level of effort exerted by the square in maintaining four equivalent dimensions changes anything.
 
==Major Design Considerations==
== Architectural features ==
 
=== Transparency ===
Transparency, simply put, is the quality of a distributed system to be seen and understood as a single-system image; it is by far the greatest overriding consideration in the high-level conceptual design of a distributed operating system. While a simple concept, this one issue touches and affects decision making in almost every aspect of design by introducing requirements or restrictions on those aspects, and often on their relationships with others.
Inter-Process Communication (IPC) is the critical complement to transparency among the low-level implementation considerations. General communications, process interactions, and data flows all depend on IPC sub-systems. Each situation requires fast, efficient, and reliable exchange capabilities, which in turn require both efficient primitives and stable protocols. And while this often leads to various scenario-specific solutions, the calling interface must be consistent.
 
===Process Management===
Transparency is the attribute of a distributed operating system allowing it to appear as a unified, centralized, and local operating system. Many factors lend complexity to the concept of transparency in a distributed operating system (a system). Elements of a system are distributed spatially; a system’s software, its processes, and data are also distributed among these elements. Occasionally, elements need to communicate with other distant elements in the system. When a process asks a question of another process, it should not stand idly waiting for the answer; it should continue working productively. However, it should also remain alert for the answer; and receive it and process it immediately, to maintain the illusion of local elements. This added level of complexity is asynchronous communication. Communication time can become indefinite, when an element's connectivity is compromised, or an element itself fails. Connectivity and failure issues affect communication, but system processing is affected as well.
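
The asynchronous exchange described above can be sketched in a few lines. This is only an illustration under stated assumptions: <code>remote_query</code> and <code>ask_and_keep_working</code> are hypothetical names, the remote element is simulated by a timed delay, and a deadline stands in for the case where communication time becomes indefinite.

```python
import asyncio
from typing import Optional, Tuple


async def remote_query(question: str, delay: float) -> str:
    # Hypothetical stand-in for a request to a distant element;
    # the sleep models network and processing latency.
    await asyncio.sleep(delay)
    return f"answer to {question!r}"


async def ask_and_keep_working(delay: float, timeout: float) -> Tuple[int, Optional[str]]:
    loop = asyncio.get_running_loop()
    # Issue the question without blocking on the reply.
    pending = asyncio.create_task(remote_query("status?", delay))
    deadline = loop.time() + timeout
    work_done = 0
    # Keep working, but check for the answer between units of work.
    while not pending.done() and loop.time() < deadline:
        work_done += 1              # one unit of productive local work
        await asyncio.sleep(0.01)   # yield so the reply can arrive
    if not pending.done():
        # The remote element may have failed or become unreachable.
        pending.cancel()
        return work_done, None
    return work_done, pending.result()


if __name__ == "__main__":
    print(asyncio.run(ask_and_keep_working(delay=0.05, timeout=1.0)))
```

The caller receives both the amount of local work completed and the answer, or <code>None</code> when the deadline passes first, which is where a real system would begin its failure handling.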
 
Process management is a global system concept, which provides mechanisms for effective and efficient use and sharing of processing resources throughout the system. These resources, and operations on them, can be either local or remote; in either event, however, they must remain completely consistent from the user perspective. As an example, load balancing is an important process management function. Some of the questions involved are which process to move, and when and where to move it. These are policy decisions relegated to Resource Management; but the migration of the process (e.g. moveProcess(fromA, toB)) is a mechanism implemented by Process Management. The migration process, either local to another core or remote to another computer, again must remain consistent in presentation to the user. Other functions of this sub-system include the allocation and de-allocation of processes and ports, as well as provisions to run, suspend, and resume execution of a process. Again, these are mechanisms, related only to "what" is done, not which one, how, or where.
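
The split between mechanism and policy can be made concrete with a small sketch. Everything here is an assumption made for illustration: the node table, the function <code>move_process</code> (standing in for the moveProcess mechanism mentioned above), and the simple "busiest to idlest" policy are hypothetical, not part of any particular system.

```python
from typing import Callable, Dict, List, Optional, Tuple

Nodes = Dict[str, List[str]]          # node name -> processes running there
Decision = Tuple[str, str, str]       # (process, source node, destination node)


def move_process(nodes: Nodes, proc: str, from_a: str, to_b: str) -> None:
    """Mechanism: relocate one process. It never decides which, when, or where."""
    nodes[from_a].remove(proc)
    nodes[to_b].append(proc)


def busiest_to_idlest(nodes: Nodes) -> Optional[Decision]:
    """Policy (hypothetical): move one process from the busiest node to the idlest."""
    busiest = max(nodes, key=lambda n: len(nodes[n]))
    idlest = min(nodes, key=lambda n: len(nodes[n]))
    if len(nodes[busiest]) - len(nodes[idlest]) < 2:
        return None                   # already balanced; nothing to gain
    return nodes[busiest][-1], busiest, idlest


def balance(nodes: Nodes, policy: Callable[[Nodes], Optional[Decision]]) -> int:
    """Apply the policy's decisions through the mechanism until it has none left."""
    moves = 0
    while (decision := policy(nodes)) is not None:
        move_process(nodes, *decision)
        moves += 1
    return moves


if __name__ == "__main__":
    cluster = {"A": ["p1", "p2", "p3", "p4"], "B": [], "C": ["p5"]}
    balance(cluster, busiest_to_idlest)
    print(cluster)
```

Swapping in a different policy function changes which processes move without touching <code>move_process</code>, which is the point of keeping the two apart.
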
To remain transparent, a system's elements may copy (replicate) portions of themselves onto collections of host elements. In times of need, a failed element's information can be retrieved from these host elements to continue processing, and eventually to reconstitute the faulty element. This too is added complexity, and it does not end here. This replication of information throughout the system requires coordination, and therefore a coordinator. The coordinator oversees many aspects of a system's operation, unless that coordinator fails. In this event, some other element must be chosen and constituted as coordinator. This process adds complexity to the system. The complexity in the system can quickly add up, and these examples by no means sum to a total. Transparency envelops a system in an abstraction of extremely complex construction, but provides a user with a complete, consistent, and simplified local interface to hardware, devices, and resources. The various facets of a system contributing to this complexity are discussed individually, below.
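
As a rough illustration of choosing a replacement coordinator, the sketch below uses the rule that the live element with the highest identifier wins, which is the outcome the bully election algorithm converges on. It is a deliberate simplification: the class <code>Cluster</code> and its methods are hypothetical, and the message exchange a real election requires is omitted.

```python
from typing import Iterable, Set


def elect_coordinator(alive_ids: Set[int]) -> int:
    """Pick the live element with the highest identifier."""
    if not alive_ids:
        raise ValueError("no live elements to elect from")
    return max(alive_ids)


class Cluster:
    """Hypothetical view of which elements are alive and who coordinates."""

    def __init__(self, ids: Iterable[int]) -> None:
        self.alive = set(ids)
        self.coordinator = elect_coordinator(self.alive)

    def fail(self, node_id: int) -> None:
        # Remove the failed element; re-elect only if it was the coordinator.
        self.alive.discard(node_id)
        if node_id == self.coordinator:
            self.coordinator = elect_coordinator(self.alive)


if __name__ == "__main__":
    cluster = Cluster([1, 2, 3])
    cluster.fail(3)                 # the coordinator fails
    print(cluster.coordinator)      # another element has been chosen
```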
 
===Resource Modularity Management===
System resources such as memory, files, devices, etc. are distributed throughout a system, and at any given moment, any of the nodes hosting them may have light to idle workloads. Load sharing and load balancing require many policy-oriented decisions, ranging from finding idle CPUs to deciding when to move a process and which one to move. Many algorithms exist to aid in these decisions; however, this calls for a second level of decision-making policy in choosing the algorithm best suited for the scenario, and the conditions surrounding the scenario.
 
=== Reliability ===
A distributed operating system is inherently modular by definition. However, a system's '''modularity''' speaks more to its composition and configuration, the rationale behind these, and ultimately their effectiveness. A system element could be composed of multiple layers of components. Each of these components might vary in granularity of subcomponent. These layers and component compositions would each have a coherent and rational configuration towards some purpose in the system. The purpose could be for a more simplified abstraction, raw communication efficiency, accommodating heterogeneous elements, processing parallelism and concurrency, or possibly to support an object-oriented programming paradigm. In any event, the scattered distribution of system elements is not random, but is most often the result of detailed design and careful planning.
One of the basic tenets of distributed systems is a high level of reliability. This quality attribute of a distributed system has become a staple expectation. Reliability is most often considered from the perspectives of availability and security of a system's hardware, services, and data. Issues arising from availability failures or security violations are considered faults. Faults are physical or logical defects that can cause errors in the system. For a system to be reliable, it must somehow overcome the adverse effects of faults. There are three general methods for dealing with faults: fault avoidance, fault tolerance, and fault detection and recovery. Fault avoidance consists of proactive measures taken to minimize the occurrence of faults, and fault tolerance is the ability of a system to continue some level of operation in the face of a fault. In the event a fault does occur, the system should detect the fault and have the capability to respond quickly and effectively to recover full functionality.
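
Fault detection is often built on periodic "heartbeat" messages: an element that has been silent for too long is suspected of having failed. The sketch below illustrates that idea only; the class <code>HeartbeatDetector</code>, its method names, and the timeout value are assumptions, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, List


class HeartbeatDetector:
    """Hypothetical timeout-based failure detector."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the most recent sign of life from this node.
        self.last_seen[node] = now

    def suspected(self, now: float) -> List[str]:
        # A node is suspected once it has been silent longer than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    detector = HeartbeatDetector(timeout=3.0)
    detector.heartbeat("a", now=0.0)
    detector.heartbeat("b", now=0.0)
    detector.heartbeat("a", now=4.0)     # "b" stays silent
    print(detector.suspected(now=5.0))
```

A suspicion is not proof of failure, since a slow network produces the same silence; recovery logic built on such a detector has to tolerate false alarms.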
 
===Performance===
=== Persistence of Entity state ===
Performance is arguably the quintessential computing concern, and in the distributed system, it is no different. Many benchmark metrics exist for performance: throughput, job completions per unit time, system utilization, etc. Each of these benchmarks is more meaningful in describing some scenarios, and less in others. With respect to a distributed system, this consideration most often distills to a balance between process parallelism and IPC. Managing the task granularity of parallelism in a sensible relation to the messages required for support is extremely effective. Also, identifying when it is more beneficial to migrate a process to its data, rather than copy the data, is effective as well. Many process and resource management algorithms work in this space to maximize performance.
{{pad|2em}}existence not time-bound; continues regardless of breaks in system function
<br />{{pad|2em}}resides in nonvolatile storage; synchronized with current, stable, active copy
<br />{{pad|2em}}Subject to consistent and timely updates
<br />{{pad|2em}}Able to survive hardware failure
 
=== Efficiency Synchronization===
Cooperating concurrent processes have an inherent need for synchronization. Three basic situations define the scope of this need: one or more processes must synchronize at a given point for one or more other processes to continue, one or more processes must wait for an asynchronous condition in order to continue, or a process must establish mutually exclusive access to a shared resource. There is a multitude of algorithms available for these scenarios, and their many variations. Unfortunately, whenever synchronization is required, the opportunity for process deadlock usually exists. The ancillary situation of deadlock is covered below.
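
The three situations above map onto three familiar primitives. The following sketch uses threads on a single machine purely for illustration (a distributed system would need message-based equivalents), and the function <code>run_demo</code> and its parameters are hypothetical.

```python
import threading


def run_demo(workers: int = 4, increments: int = 1000) -> int:
    barrier = threading.Barrier(workers)  # situation 1: meet at a given point
    ready = threading.Event()             # situation 2: wait for an asynchronous condition
    lock = threading.Lock()               # situation 3: mutually exclusive access
    counter = {"value": 0}                # the shared resource

    def worker() -> None:
        ready.wait()                      # blocked until the condition is signalled
        barrier.wait()                    # nobody proceeds until all workers arrive
        for _ in range(increments):
            with lock:                    # only one worker updates at a time
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    ready.set()                           # the asynchronous condition occurs
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_demo())
```

Without the lock, concurrent updates to the counter could be lost; with it, the result is always <code>workers * increments</code>.
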
{{pad|2em}}Many issues can adversely affect system performance:
<br />{{pad|2em}}latency in interactions among distributed entities
<br />{{pad|4em}}local response facade requires remote entities' state be cached locally
<br />{{pad|4em}}and consistently synchronized to maintain the paradigm
<br />{{pad|2em}}Workload variations, delays, interruptions, faults, and/or crashes of entities
<br />{{pad|4em}}Distributed processing community assists when needed
 
=== Replication Flexibility===
Flexibility in a distributed system is made possible through the modular characteristics of the microkernel. With the microkernel presenting a minimal -- but complete -- set of primitives and basic, functionally cohesive services, the higher-level management components can be composed in a similar functionally cohesive manner. This capability leads to exceptional flexibility in the collection of management components; but more importantly, it allows the opportunity to dynamically swap, upgrade, or install additional components above the kernel.
{{pad|2em}}Duplication of state among selected distributed entities, and the synchronization of that state
<br />{{pad|2em}}Remote communication required to effect synchronization
 
==Transparency Responsibilities==
=== Reliability ===
{{pad|2em}}Inherent redundancy across the distributed entities provides fault-tolerance
<br />{{pad|2em}}Consistent synchronized redundancy across N nodes, tolerates up to N-1 node faults
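
The N-1 claim can be illustrated with a toy model. The sketch assumes crash faults only (a failed replica simply stops answering) and that every write reaches all live replicas before it returns; the class <code>ReplicatedValue</code> is hypothetical, and resynchronizing a recovered replica is left out.

```python
from typing import Any, List


class ReplicatedValue:
    """Hypothetical value kept in sync on n replicas."""

    def __init__(self, n: int) -> None:
        self.replicas: List[Any] = [None] * n
        self.up: List[bool] = [True] * n

    def write(self, value: Any) -> None:
        # Synchronous update: every live replica receives the new value.
        for i in range(len(self.replicas)):
            if self.up[i]:
                self.replicas[i] = value

    def crash(self, i: int) -> None:
        self.up[i] = False

    def read(self) -> Any:
        # Any single surviving replica is enough to answer.
        for i, alive in enumerate(self.up):
            if alive:
                return self.replicas[i]
        raise RuntimeError("all replicas have failed")


if __name__ == "__main__":
    value = ReplicatedValue(3)
    value.write("state-1")
    value.crash(0)
    value.crash(1)                # N-1 = 2 faults
    print(value.read())           # still readable from the last replica
```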
 
===Location Flexibility Transparency===
System should create and maintain the user's perception and understanding of the entirety of the system, its devices, and resources as local entities. At no point in any user's system experience should there exist any expectation of any user to be
{{pad|2em}}OS has latitude in degree of exposure to externals
<br />{{pad|2em}}Externals have latitude in degree of exposure they accept
<br />{{pad|4em}}Coordination of process activity
<br />{{pad|4em}}Where run; Near user?, resources?, avail. CPU?, etc...
 
===Access Scalability Transparency===
System entities or processes maintain consistent access/entry mechanism, regardless of being local or remote
{{pad|2em}}node expansion
<br />{{pad|2em}}process migration
 
===Migration Transparency===
== History ==
Resources and processes can be migrated by the system to another node, without user knowledge, in an attempt to maximize efficiency, reliability, and security. This requires policy decision-making abilities and naming stability; and in the event of a process migration, all IPC messages must be received or held pending the migration.
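
The requirement that messages be "held pending the migration" can be sketched as a mailbox that buffers deliveries while its process is in transit. The class <code>MigratableProcess</code> and its method names are hypothetical and show only the ordering guarantee, not the transfer of process state.

```python
from typing import List


class MigratableProcess:
    """Hypothetical process whose incoming messages survive a migration."""

    def __init__(self, node: str) -> None:
        self.node = node
        self.migrating = False
        self.inbox: List[str] = []   # messages the process can consume
        self.held: List[str] = []    # messages held while it is in transit

    def deliver(self, message: str) -> None:
        # Senders use the same call whether or not a migration is under way.
        if self.migrating:
            self.held.append(message)
        else:
            self.inbox.append(message)

    def begin_migration(self) -> None:
        self.migrating = True

    def finish_migration(self, new_node: str) -> None:
        # Release held messages in arrival order at the new location.
        self.node = new_node
        self.inbox.extend(self.held)
        self.held.clear()
        self.migrating = False


if __name__ == "__main__":
    proc = MigratableProcess("A")
    proc.deliver("m1")
    proc.begin_migration()
    proc.deliver("m2")
    proc.finish_migration("B")
    print(proc.node, proc.inbox)
```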
 
===Replication Transparency===
System entities can be copied to strategic points in the system to increase efficiency through better proximity, and also to improve reliability by serving as distributed back-ups; replication is prompted by a dynamic stratagem.
 
===Concurrency Transparency===
System should possess and exhibit properties that allow multiple simultaneous uses of system resources by users who are kept unaware of the concurrent usage. Required properties are synchronization mechanisms to keep events ordered and consistent, mutual-exclusion management for resources, and sufficient capabilities to detect and recover from both starvation and deadlock.
 
===Parallel Transparency===
System should have stable performance characteristics, regardless of whether some nodes increase rapidly in workload, through properties of migration, replication, and concurrency. This requires an intelligent policy decision stratagem to facilitate the timely and accurate allocation, migration, and disposition of resources.
===Failure Transparency===
The system should shield users from the knowledge of, and the effects resulting from, failures. In the event of a partial failure, the system is responsible for rapid and accurate detection and orchestration of a remedy with little, if any, imposition on users. These methods can range from static proactive posturing to dynamic and more flexible response mechanisms.
 
===Perform Transparency===
System should create and maintain a reasonable, stable, and predictable performance expectation for the user, one that is both resilient to and helpful in situations where parts of the system may experience significant delay or even failure. While reasonable and predictable are important, there should be no inherent expectation or expressed indication of fairness or equality.
 
===Name Transparency===
All system entities should maintain a complete decoupling of entity naming from any spatial or temporal location, as well as from any other system entity.
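
One common way to obtain this decoupling is a level of indirection: users hold only a name, and a lookup service maps it to wherever the entity currently lives. The sketch below is a minimal illustration; the class <code>NameService</code>, its methods, and the entity and node names are all hypothetical.

```python
from typing import Dict


class NameService:
    """Hypothetical registry mapping stable names to current locations."""

    def __init__(self) -> None:
        self._where: Dict[str, str] = {}

    def bind(self, name: str, node: str) -> None:
        # Record (or update) where the named entity currently resides.
        self._where[name] = node

    def resolve(self, name: str) -> str:
        # Callers learn the location only at the moment of use.
        return self._where[name]


if __name__ == "__main__":
    names = NameService()
    names.bind("printer", "node-7")
    names.bind("printer", "node-2")   # the entity moved; its name did not
    print(names.resolve("printer"))
```

Because the name carries no location, an entity can be relocated by updating a single binding, and nothing that refers to it by name has to change.
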
===Size/Scale Transparency===
A user's experience or perception of their system should remain stable and consistent in the face of system extension, scaling, or waning due to failure.
===Revision Transparency===
System users should be completely oblivious to system-software version changes and changes in internal implementation of system infrastructure. While a user may become aware of, or discover the availability of, a new function or service, the implementation or alteration of the system's internal structure should in no way be the prompt for this discovery.
 
===Control Transparency===
All system constants, properties, configuration settings, etc. should be completely consistent in appearance, connotation, and denotation to all users and software applications aware of them.
===Data Transparency===
No system data-entity should expose itself as peculiar when required to interact remotely.
 
 
==Historical Perspectives==
 
=== Pioneering inspirations ===