## 1995 BIO Round Two Question One Parallel Addition (45 mins)

Welcome to the world of parallel computing. In this question we will take a simple problem (known as Prefix Addition) and hopefully produce a series of solutions for a parallel machine. By the end we should have an algorithm that uses relatively few processors and is quick. The question is quite long, so do not worry if you do not get to the end.

The processors for the parallel machine are reasonably simple. They can accept a pair of inputs, do a simple operation, and then return a single answer. A given processor can be thought of as waiting for all its inputs to arrive, and only then producing an answer. You can picture a simple processor (which multiplies two numbers, X = A * B) as:

```
A --> * <-- B
      |
      |
     Out
```
The answer can be put into an array, or passed to another processor directly. The answer can also be sent to several locations at once. Inputs can come from an array, or from another processor. For example, we can draw the machine that multiplies four numbers together:
```
A     B     X     Y
|     |     |     |
\- * -/     \- * -/
   |           |
   \---- * ----/
         |
        Out
```
This has only taken two time steps, since X*Y and A*B have been calculated simultaneously. Doing this on a normal machine would take three time steps.

You should assume, for our model, that a processor can only be used once. So despite the fact that both the processors used in step 1 are free during step 2, a third processor is needed. (Think of it as a hard-wired electronic circuit.)
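This model of one-use processors and simultaneous steps can be sketched in Python (a checking aid only, not part of the original question; the "processors" here are just ordinary multiplications grouped by time step):

```python
# Sketch: the two-step machine above that multiplies four numbers.
# Each processor is used exactly once; results produced in the same
# step are considered simultaneous.

def multiply_four(a, b, x, y):
    # Step 1: two processors run in parallel.
    p1 = a * b           # first processor
    p2 = x * y           # second processor
    # Step 2: a third processor combines the two results.
    p3 = p1 * p2         # third processor
    return p3            # 3 processors, 2 time steps in total

print(multiply_four(2, 3, 4, 5))  # 120
```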

Given an array A (A[1]..A[n] inclusive) of n elements we want to produce an array B such that:
```
B[1] = A[1]
B[2] = A[1] + A[2]
B[3] = A[1] + A[2] + A[3]
...
B[n] = A[1] + ... + A[n]
```
For example, if we have
```
A = [1  2  3  4  5]
```
then we require
```
B = [1  3  6 10 15].
```

There is the obvious sequential algorithm:

```
total := 0;
counter := 1;
while counter <= n do
begin
  total := total + A[counter];
  B[counter] := total;
  counter := counter + 1;
end;
```
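For reference, a direct transliteration of this sequential algorithm into Python (purely as a checking aid, not part of the original question), applied to the example above:

```python
def prefix_addition(A):
    """Sequential prefix addition: B[i] is the sum A[1] + ... + A[i]."""
    B = []
    total = 0
    for value in A:          # one addition per element read
        total = total + value
        B.append(total)      # running total is exactly the prefix sum
    return B

print(prefix_addition([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```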

For all the following questions clear diagrams, such as those above, will be sufficient to illustrate any methods. Any lines coming out of processors should be labelled with the values you expect to have on them. Furthermore, when you are asked how many processors or time steps are required you do not have to be perfectly spot on (e.g. saying 'n' when the answer is 'n-1' is okay).

In case you have forgotten (and in case you find it useful):

```
1/2 + 1/4 + 1/8 + 1/16 + ... = 1
```
If
```
1 + 2 + 4 + ... + 2^k = m   then   k = log2((m+1)/2)
```
where log2 is the logarithm to base 2.
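As a quick sanity check of this identity (again in Python, not part of the question): with k = 3 the sum is 1 + 2 + 4 + 8 = 15, and log2((15+1)/2) = log2(8) = 3.

```python
from math import log2

# Sum of powers of two up to 2^3.
m = sum(2 ** i for i in range(4))   # 1 + 2 + 4 + 8

print(m, log2((m + 1) / 2))  # 15 3.0
```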

### Question 1

Illustrate the sequential algorithm on our parallel machine. How many time steps does it take? How many processors?

### Question 2

We can solve the problem faster at the expense of using more processors. We have already seen how using 3 processors lets us calculate (X*Y)*(A*B) in only 2 steps.

a) Suppose we only want to calculate B[n]. How many time steps can we do it in, and how many processors do we require? You may find it easier to consider just the case where n is a power of 2.

b) One way of solving the problem is to calculate each element of B independently using the method from part a). Assuming the processors calculating one element cannot exchange data with the processors calculating another, how many time steps do we require to calculate all of B, and how many processors?

### Question 3

We can consider a divide and conquer approach to the problem - i.e. splitting the problem into two halves, solving them, and then producing the answer.

We do need a little trick to produce the answer, since if we split

```
[1 2 3 4]
```
into [1 2] and [3 4], solving each half produces [1 3] and [3 7] respectively. The required answer, though, is [1 3 6 10] and not [1 3 3 7].

a) What is the trick? (If you illustrate this, try to consider the solving of each half as taking place in a "black box" - i.e. a device whose inside you do not need to explain, that takes n/2 elements and produces the correct n/2 prefix additions.)

In the divide and conquer approach these black boxes are themselves solved by the same method. So to solve A[1]..A[16] we would first solve A[1]..A[8] and A[9]..A[16] and then apply the trick; to solve A[1]..A[8] we would first solve A[1]..A[4] and A[5]..A[8] and apply the trick, and so on. A single element can, of course, be solved instantly (the prefix addition of [i] is [i]).

b) Using this divide and conquer approach how many time steps do we need, and how many processors?

### Question 4

The previous method is good, but we can (at the expense of a bit more time) reduce the number of processors. Our final method works as follows:

1. Pair off the elements and sum them.
2. Feed these n/2 sums into a "black box" which produces the expected n/2 prefix additions.
3. Do some "trick" to get the required results from the information so far.

The black box is, of course, a miniature version of this algorithm again (and not one from any previous question.)

a) What is the "trick" used for step 3 of this method?

b) How many time steps, and how many processors, do we need?