Lastly, as to the nature of the distributed system, it has been argued that a distributed operating system is not necessarily an operating system at all, but simply "is" the distributed system. This view is commonly justified by pointing to the operating system's deep and inextricable integration into the distributed system. Its absolute and singular focus on sustaining and maintaining the system is also used as a rationale. However, it is important to remember the separation of mechanism and policy. The distributed operating system and its mechanisms are not affected by any degree of integration, and no amount of focus on providing these mechanisms changes the responsibility for policy, or the expectation of results, at the distributed system level. As mentioned earlier, a [[Square_(geometry)#Other_facts|square]] is a [[rectangle]]; no level of effort exerted by the square in maintaining four equal sides changes that.
==Major Design Considerations==
===Transparency===
Transparency, simply put, is the quality of a distributed system to be seen and understood as a single-system image; it is by far the greatest overriding consideration in the high-level conceptual design of a distributed operating system. While a simple concept, this one issue affects decision making in almost every aspect of design by introducing requirements or restrictions on those aspects, and often on their relationships with one another.
Inter-Process Communication (IPC) is the critical complement to transparency: where transparency is a high-level conceptual concern, IPC is a low-level implementation consideration. General communications, process interactions, and data flows all depend on IPC sub-systems. Each situation requires fast, efficient, and reliable exchange capabilities, which in turn require both efficient primitives and a stable protocol. While this often leads to various scenario-specific solutions, the calling interface must remain consistent.
===Process Management===
Process management is a global system concept that provides mechanisms for the effective and efficient use and sharing of processing resources throughout the system. These resources, and operations on them, can be either local or remote; in either case, they must remain completely consistent from the user's perspective. As an example, load balancing is an important process management function. Some of the questions involved are which process to move, and when and where to move it. These are policy decisions relegated to Resource Management; the migration of the process itself (e.g. moveProcess(fromA, toB)) is a mechanism implemented by Process Management. The migration, whether local to another core or remote to another computer, must again remain consistent in its presentation to the user. Other functions of this sub-system include the allocation and de-allocation of processes and ports, as well as provisions to run, suspend, and resume execution of a process. Again, these are mechanisms, concerned only with "what" is done, not with which one, how, or where.
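The split between mechanism and policy described above can be illustrated with a minimal Python sketch. The names <code>Node</code>, <code>move_process</code>, and <code>pick_migration</code> are hypothetical and chosen only for illustration, and the "busiest to idlest" rule is an assumed placeholder policy rather than one prescribed by any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node that simply holds a set of process ids."""
    name: str
    processes: set = field(default_factory=set)


def move_process(pid, from_node, to_node):
    """Mechanism: relocate one process. It never decides which, when, or where."""
    if pid not in from_node.processes:
        raise KeyError(f"process {pid} is not on {from_node.name}")
    from_node.processes.remove(pid)
    to_node.processes.add(pid)


def pick_migration(nodes):
    """Policy (assumed for illustration): move one process from the busiest node to the idlest."""
    busiest = max(nodes, key=lambda n: len(n.processes))
    idlest = min(nodes, key=lambda n: len(n.processes))
    if len(busiest.processes) - len(idlest.processes) < 2:
        return None  # balanced enough; the policy asks for no move
    pid = min(busiest.processes)  # arbitrary but deterministic choice
    return pid, busiest, idlest


a = Node("A", {1, 2, 3})
b = Node("B")
decision = pick_migration([a, b])
if decision is not None:
    move_process(*decision)  # the mechanism carries out whatever the policy chose
```

Because <code>move_process</code> takes its arguments from the caller, the policy function can be replaced without touching the mechanism.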
===Resource Management===
System resources such as memory, files, and devices are distributed throughout the system's nodes, and at any given moment any of these nodes may have a light or idle workload. Load sharing and load balancing require many policy-oriented decisions, ranging from finding idle CPUs to deciding when to move a process and which process to move. Many algorithms exist to aid in these decisions; however, this calls for a second level of decision-making policy to choose the algorithm best suited to the scenario and the conditions surrounding it.
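The two levels of policy mentioned above can be sketched as follows. This is only an illustration under assumed conditions: the two placement algorithms, the <code>choose_policy</code> selector, and the <code>spread_threshold</code> value are all hypothetical.

```python
import itertools


def least_loaded(loads):
    """First-level policy: pick the node with the smallest current load."""
    return min(loads, key=loads.get)


def make_round_robin(names):
    """First-level policy: cycle through the nodes in name order, ignoring load."""
    cycle = itertools.cycle(sorted(names))
    return lambda loads: next(cycle)


def choose_policy(loads, round_robin, spread_threshold=0.25):
    """Second-level policy: select the placement algorithm from current conditions."""
    spread = max(loads.values()) - min(loads.values())
    # Uneven load: measuring pays off. Even load: cheap rotation is enough.
    return least_loaded if spread > spread_threshold else round_robin


loads = {"n1": 0.90, "n2": 0.10, "n3": 0.50}  # assumed utilization per node
rr = make_round_robin(loads)
policy = choose_policy(loads, rr)
target = policy(loads)  # node chosen to receive the next process
```

With the uneven loads shown, the selector returns <code>least_loaded</code>; with nearly equal loads it would fall back to the round-robin policy.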
===Reliability===
One of the basic tenets of distributed systems is a high level of reliability. This quality attribute of a distributed system has become a staple expectation. Reliability is most often considered from the perspectives of availability and security of a system's hardware, services, and data. Issues arising from availability failures or security violations are considered faults. Faults are physical or logical defects that can cause errors in the system. For a system to be reliable, it must somehow overcome the adverse effects of faults. There are three general methods for dealing with faults: fault avoidance, fault tolerance, and fault detection and recovery. Fault avoidance comprises the proactive measures taken to minimize the occurrence of faults, and fault tolerance is the ability of a system to continue some level of operation in the face of a fault. In the event a fault does occur, the system should detect it and be able to respond quickly and effectively to recover full functionality.
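As a rough illustration of fault tolerance combined with fault detection, the following Python sketch tries a list of replicas in order and records each fault it detects. The <code>NodeDown</code> exception, the <code>call_with_failover</code> helper, and the two stand-in replicas are assumptions made for this example only.

```python
class NodeDown(Exception):
    """Raised by a hypothetical replica that cannot serve a request."""


def call_with_failover(replicas, request):
    """Try each replica in turn; return the first result plus the faults detected."""
    faults = []
    for name, handler in replicas:
        try:
            return handler(request), faults
        except NodeDown as exc:
            faults.append((name, str(exc)))  # detection: record the fault, keep going
    raise RuntimeError(f"all replicas failed: {faults}")


def broken(request):
    """Stand-in for a failed node."""
    raise NodeDown("no response")


def healthy(request):
    """Stand-in for a working node."""
    return request.upper()


result, faults = call_with_failover(
    [("primary", broken), ("backup", healthy)], "read x"
)
```

The caller still receives a result even though the primary failed, and the recorded fault list gives a recovery component something to act on.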
===Performance===
Performance is arguably the quintessential computing concern, and in the distributed system it is no different. Many benchmark metrics exist for performance: throughput, job completions per unit time, system utilization, etc. Each of these benchmarks is more meaningful in describing some scenarios and less meaningful in others. With respect to a distributed system, this consideration most often distills to a balance between process parallelism and IPC. Keeping the task granularity of parallelism in a sensible relation to the number of messages required to support it is extremely effective. Identifying when it is more beneficial to migrate a process to its data, rather than copy the data, is effective as well. Many process and resource management algorithms work to maximize performance.
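The choice between migrating a process to its data and copying the data to the process can be approximated by comparing transfer times. The sketch below assumes a deliberately simplified cost model (bytes divided by bandwidth, plus a fixed migration overhead); the function name and its parameters are hypothetical.

```python
def cheaper_to_migrate(process_state_bytes, data_bytes,
                       bandwidth_bytes_per_s, migration_overhead_s=0.0):
    """Return True when moving the process is estimated to cost less than copying the data.

    Simplified model: cost is transfer time only, plus a fixed overhead for
    suspending and resuming the migrated process.
    """
    migrate_cost = process_state_bytes / bandwidth_bytes_per_s + migration_overhead_s
    copy_cost = data_bytes / bandwidth_bytes_per_s
    return migrate_cost < copy_cost


# 10 MB of process state versus a 1 GB data set over a 100 MB/s link
large_data = cheaper_to_migrate(10_000_000, 1_000_000_000, 100_000_000, 0.5)

# The same process against only 1 MB of data
small_data = cheaper_to_migrate(10_000_000, 1_000_000, 100_000_000, 0.5)
```

A real system would also have to weigh factors this model ignores, such as the load on the destination node and the number of later accesses to the data.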
===Synchronization===
Cooperating concurrent processes have an inherent need for synchronization. Three basic situations define the scope of this need: one or more processes must synchronize at a given point for one or more other processes to continue; one or more processes must wait for an asynchronous condition in order to continue; or a process must establish mutually exclusive access to a shared resource. A multitude of algorithms, with many variations, is available for these scenarios. Unfortunately, whenever synchronization is required, the opportunity for process deadlock usually exists. The ancillary situation of deadlock is covered below.
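The three situations above can be illustrated on a single machine with Python's standard <code>threading</code> module, using threads as stand-ins for processes. This is only a local analogy; a distributed system would need message-based equivalents of these primitives.

```python
import threading

WORKERS = 3
barrier = threading.Barrier(WORKERS)  # 1. rendezvous at a given point
ready = threading.Event()             # 2. wait for an asynchronous condition
lock = threading.Lock()               # 3. mutually exclusive access
state = {"counter": 0}                # the shared resource


def worker():
    barrier.wait()   # nobody proceeds until all workers have arrived
    ready.wait()     # block until some other party signals the condition
    for _ in range(1000):
        with lock:   # only one worker updates the counter at a time
            state["counter"] += 1


threads = [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in threads:
    t.start()
ready.set()          # the asynchronous condition becomes true
for t in threads:
    t.join()
```

Each worker increments the shared counter 1000 times under the lock, so the final count equals the number of workers times 1000.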
===Flexibility===
Flexibility in a distributed system is made possible through the modular characteristics of the microkernel. With the microkernel presenting a minimal, but complete, set of primitives and basic functionally cohesive services, the higher-level management components can be composed in a similar functionally cohesive manner. This capability leads to exceptional flexibility in the collection of management components; more importantly, it allows components above the kernel to be dynamically swapped, upgraded, or added.
==Transparency Responsibilities==
===Location Transparency===
System should create and maintain the user's perception and understanding of the entirety of the system, its devices, and resources as local entities. At no point in any user's system experience should there exist any expectation of any user to be
===Access Transparency===
System entities or processes maintain a consistent access/entry mechanism, regardless of whether they are local or remote.
===Migration Transparency===
Resources and processes can be migrated by the system to another node, without the user's knowledge, in an attempt to maximize efficiency, reliability, and security. This requires policy decision-making abilities and naming stability; in the event of a process migration, all IPC messages must be either received or held pending the migration.
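The requirement that IPC messages be received or held during a migration can be sketched as a mailbox that buffers messages while a move is in progress. The <code>MigratableProcess</code> class and its method names are hypothetical and exist only to illustrate the idea.

```python
from collections import deque


class MigratableProcess:
    """Hypothetical process whose incoming messages are held during migration."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.migrating = False
        self.pending = deque()  # messages held while the process is in transit
        self.inbox = []         # messages actually received by the process

    def deliver(self, message):
        """Senders call this the same way whether or not a migration is under way."""
        if self.migrating:
            self.pending.append(message)
        else:
            self.inbox.append(message)

    def begin_migration(self):
        self.migrating = True

    def finish_migration(self, new_node):
        self.node = new_node
        self.migrating = False
        while self.pending:     # release held messages in arrival order
            self.inbox.append(self.pending.popleft())


proc = MigratableProcess("worker", node="A")
proc.deliver("m1")
proc.begin_migration()
proc.deliver("m2")              # held, not lost
proc.deliver("m3")
proc.finish_migration("B")
```

After the migration completes, the inbox holds all three messages in the order they were sent, and the sender never needed to know that the process had moved.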
===Replication Transparency===
System entities can be copied to strategic points in the system, both to increase efficiency through better proximity and to improve reliability by serving as distributed back-ups; the copying is prompted by a dynamic stratagem.
===Concurrency Transparency===
System should possess and exhibit properties that allow multiple simultaneous uses of system resources by users who are kept unaware of the concurrent usage. The required properties are synchronization mechanisms to keep events ordered and consistent, mutual-exclusion management for resources, and sufficient capabilities to detect and recover from both starvation and deadlock.
===Parallel Transparency===
System should have stable performance characteristics, regardless of whether some nodes increase rapidly in workload, through the properties of migration, replication, and concurrency. This requires an intelligent policy decision stratagem to facilitate the timely and accurate allocation, migration, and disposition of resources.
===Failure Transparency===
The system should shield users from the knowledge of failures and from the effects resulting from them. In the event of a partial failure, the system is responsible for rapid and accurate detection and for orchestrating a remedy with little, if any, imposition on users. These methods can range from static proactive posturing to dynamic and more flexible response mechanisms.
===Performance Transparency===
System should create and maintain a reasonable, stable, and predictable performance expectation for the user, one that is both resilient to and helpful in situations where parts of the system may experience significant delay or even failure. While reasonableness and predictability are important, there should be no inherent expectation or expressed indication of fairness or equality.
===Name Transparency===
All system entities should maintain a complete decoupling of entity naming from any spatial or temporal location, as well as from any other system entity.
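One common way to decouple names from locations is to route every lookup through a name service. The following sketch assumes a hypothetical in-memory <code>NameService</code>; the entity name and node labels are invented for illustration.

```python
class NameService:
    """Hypothetical registry mapping stable entity names to current locations."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, location):
        """Record (or update) where the named entity currently lives."""
        self._bindings[name] = location

    def resolve(self, name):
        """Clients use only the name; the location is looked up at access time."""
        return self._bindings[name]


names = NameService()
names.bind("printer/main", "node-3")
before = names.resolve("printer/main")

names.bind("printer/main", "node-7")  # the entity moved; its name did not change
after = names.resolve("printer/main")
```

Clients that hold only the name <code>"printer/main"</code> keep working after the rebinding, which is the property that name transparency asks for.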
===Size/Scale Transparency===
A user's experience or perception of their system should remain stable and consistent in the face of system extension, scaling, or waning due to failure.
===Revision Transparency===
System users should be completely oblivious to system-software version changes and to changes in the internal implementation of system infrastructure. While a user may become aware of, or discover, the availability of a new function or service, the implementation or alteration of the system's internal structure should in no way be the prompt for this discovery.
===Control Transparency===
All system constants, properties, configuration settings, etc. should be completely consistent in appearance, connotation, and denotation to all users and software applications aware of them.
===Data Transparency===
No system data-entity should expose itself as peculiar when required to interact remotely.
==Historical Perspectives==
=== Pioneering inspirations ===