Cosm - Chapter 2 - Overview of Distributed Computing

Overview of
Distributed Computing

[ One of the goals of this chapter is to have it grow into the definative source of distributed computing information. Such a source of information is does not currently exist on the net. Think mini-textbook. ]

Overview

Computers started being connected to one another through networks so that data such as files and email could be exchanged. Over time, more and more of the computer's capabilities are being shared over the network creating the realm of distributed computing.

Distributed computing is simply applying the two old sayings to the realm of computer resources. The first, "Many hands make light work" refers to the idea that you can take a task and break it down so that many different people or computers can work on it at the same time. Then it only takes a small amount of work by each computer (or human) to complete the bigger task.

The second saying, "The whole is greater than the sum of its parts" applies to the fact that when a number of computers work in conjunction the result is something that can not be achieved by the computers working alone. One of the benefits that a distributed system has is that it can continue to operate even if some of its parts are missing, this resulting in much better reliability of the system than a single computer could ever provide. Working together a group of humans can lift a stone that none of them could lift alone, working together a group of computers can perform calculations none of them could have completed alone during a lifetime.

A traditional operating system on a standalone computer controls the hardware of that computer, and provides a nice abstracted interface to applications that run on that computer. A network operating system works with a standalone operating system to provide communication facilities to the applications that run on that computer. A network operating system usually is not defined separately but combined into the overall lump of things called the operating system. An application running on a computer with network connectivity has to know what other computers are out there and how to communicate with them.

A distributed operating system takes the abstraction to a higher level, and allows hides from the application where things are. The application can use things on any of many computers just as if it were one big computer. A distributed operating system will also provide for some sort of security across these multiple computers, as well as control the network communication paths between them. A distributed operating system can be created by merging these functions into the traditional operating system, or as another abstraction layer on top of the traditional operating system and network operating system.

Any operating system, including distributed operating systems, provide a number of services. First, they control what application gets to use the CPU and handle switching control between multiple applications. They also manage use of RAM and disk storage. Controlling who has access to which resources of the computer (or computers) is another issue that the operating system handles. In the case of distributed systems, all of these items need to be coordinated for multiple machines. As systems grow larger handling them can be complicated by the fact that not one person controls all of the machines so the security policies on one machine may not be the same as on another.

Some problems can be broken down into very tiny pieces of work that can be done in parallel. Other problems are such that you need the results of step one to do step two and the results of step two to do step three and so on. These problems can not be broken down into as small of work units. Those things that can be broken down into very small chunks of work are called fine-grained and those that require larger chunks are called coarse-grain. When distributing the work to be done on many CPUs there is a balancing act to be followed. You don't want the chunk of work to be done be so small that it takes too long to send the work to another CPU because then it is quicker to just have a single CPU do the work, You also don't want the chunk of work to be done to be too big of a chunk because then you can't spread it out over enough machines to make the thing run quickly.

A computer system with multiple processors in a single machine can handle very fine-grained problems well, while a system built from computers distributed over the Internet can handle only coarse-grained problems. Computer systems centered around a small fast local network can handle problems somewhere in the middle.

Cosm is an example of a distributed system built on top of traditional operating systems. It scales from handling systems on a local network to systems built on the Internet. Cosm provides the programmer with a framework for writing distributed programs (what constitutes the cores in Cosm).

Another example of a distributed computing system is Beowolf. Unfortunately, Beowolf is a well-known name, but it is not as well known when and where Beowolf makes sense to use. Beowolf is designed to work with very fine-grained problems, and only with a very small and very fast computer network. It does not scale to thousands of machines working together or to slower communication mechanisms such as going over the Internet.

v0.01 by Daniel Oelke

Overview ofDistributed Computing

Overview

Overview of
Distributed Computing