## 1995 BIO Round One Solution to Q1 Prefix Addition

### Question Statement

This was probably the hardest question set in the 1995 BIO, and is taken from the Oxford University Computation course. Richard did not make himself popular with this one!

### Notation

For this solution, I shall use diagrams of the same format as in the question statement.

Exact answers for time and magnitude aren't going to be a major issue, so in some places I will use the Order notation O[n], which means "of the order of n" ie. give or take a constant multiplier or constant offset not far from one.

Other important notes from the question:

• Each process step takes a fixed time, T
• Processors cannot be re-used.

### Question 1

Illustrate the sequential algorithm on our parallel machine. How many time steps does it take? How many processors?

This calls for a diagram

```Initial                    Inputs
Total = 0        - A[1]
so no adder     /
needed here -> o   - A[2]
/ \ /
B[1] -   +   - A[3]
/ \ /
B[2] -   +
/ ...
B[3] -       - A[n]
\ /
Outputs                +
/
B[n] -
```
which shows that we need n-1 processors and that the summation takes n-1 time steps. O[n] is sufficient for both answers.

### Question 2a

We solve the problem faster at the expense of more processors. We have already seen how using 3 processors lets us calculate (x*y)*(a*b) in only 2 steps.
a) Suppose we only want to calculate B[n], how many time steps can we do it in, and how many processors do we require? You may find it easier to just consider the case when n is a power of 2.

The hint for this was given in the question: using a binary structure

```A[1]  A[2]  A[3]  A[4]
|     |     |     |
\- + -/     \- + -/
|           |
\---- + ----/
|
B[4]
```
for 2-input processors.

For n a power of 2, we are effectively given the formula for the number of processors: n-1 (O[n]) once again. This is as we might expect, as we're performing exactly the same amount of computation. The question's comment "at the expense of a few more processors", if n is not a power of 2, applies because we now don't have the values for the other B[i].

Time taken is much more efficient: T log2(n), where the logarithm to the base 2 is rounded up to the next highest integer if n is not a power of 2. O[log n] is a sufficient answer.

Thus far we have assumed 2-input processors, because these are what is used in practice. The same methods can easily be extended to 3-input devices and so on.

### Question 2b

b) One way of solving the problem is to calculate each element of B like this. Using the method in part a) for each element of B, and assuming the processors calculating each element cannot exchange data with the processors calculating another element, how many time steps do we require to calculate all of B, and how many processors?

We're not making any savings on processors here, so we have to consider n trees as above, calculating each value B[i] independently. This obviously will only take the same length of time for the last elements to be available (T log2(n) rounded up) but more hardware.

To be precise, we need

```   0 + 1 + 2 + ... + n-1
```
processors, n(n-1)/2 in fact. This is O[n^2].

### Question 3a

We can consider a divide and conquer approach to the problem - i.e. splitting the problem into two halves, solving them, and then producing the answer.

We do need a little trick to produce the answer, since if we split

```  [1 2 3 4]
```

into [1 2] and [3 4], solving each half produces [1 3] and [3 7] respectively. The required answer though is [1 3 6 10] and not [1 3 3 7].

a) What is the trick? (If you illustrate this, try to consider the solving of each half as taking place in a "black box" - i.e. a device whose inside you do not need to explain, that takes n/2 elements and produces the correct n/2 prefix additions.)

OK, so we assume we have a black box that performs prefix addition itself, and want to use this to split the problem up so we can perform parallel processing.

Now noting that the last output from such a box is the sum of all the elements in it, we need to correct subsequent black boxes' outputs by this amount.

There are two ways of showing this: one assumes that each black box has a "carry" input; the other performs the correction process explicitly. The only difference is that the correction adders can be found inside the carry-type boxes, and are thus wasted at the first element. We shall show the simpler solution, as the "carry" concept boils down to this anyway.

(A written answer describing the addition "trick" is all that is required here, but what follows will help us to answer part (b).)

The recursive definition of a "black box" prefix adder:

```      A[1]  A[2]   ...   A[n/2]     A[1+n/2]  ...     A[n]
|     |            |          |         |       |
-----------------------------------------------------------
|     |     |     ...    |          |  ...    | ...   |     | <- A black
|   ------------------------      -----------------------   |    box
|  |                        |    |                       |  |    prefix
|  |   Black box prefix +   |    |   Black box prefix +  |  |    adder
|  |      n/2 elements      |    |      n/2 elements     |  |
|  |                        |    |                       |  |    n
|   ------------------------      -----------------------   |    elements
|     |     |            |\ "carry" |         |       |     |
|     |     |            | -------------------------  |     |
|     |     |            |        \ |       \ |     \ |     |
|     |     |    ...     |         \|  ...   \| ...  \|     |
|     |     |            |          +         +       +     |
|     |     |            |          |         |       |     |
-----------------------------------------------------------
|     |            |          |         |       |
B[1]  B[2]         B[n/2]     B[1+n/2]  ...     B[n]
```

and a one-element prefix adder is simply empty:

```   A[1]
|
---
| | |
---
|
B[1]
```

### Question 3b

In the divide and conquer approach these black boxes are also solved by using the same method. So if we solve A[1]..A[16] by first solving A[1]..A[8] and A[9]..A[16] and then applying the trick, we would solve A[1]..A[8] by solving A[1]..A[4], A[5]..A[8] and applying the trick etc... We can, of course, solve a single element instantly (the prefix addition of [i] is [i].)
b) Using this divide and conquer approach how many time steps do we need, and how many processors?

The following figure shows the distribution of adders for 8 elements, ignoring the one-element prefix adders:

```12345678
+ + + +
++  ++
++++
12345678
```

Continuing to assume that n is a power of 2 (which will give us a correct O[ ] answer in any case), we can see that there will be a total of n log2(n)/2 (O[n log n]) processors taking time T log2(n) (O[T log n]).

### Question 4a

The previous method is good, but we can (at the expense of a bit more time) reduce the number of processors. Our final method works as follows:

1. Pair off the elements and sum them.
2. Feed these n/2 sums into a "black box" which produces the expected n/2 prefix additions.
3. Do some "trick" to get the required results from the information so far.

The black box is, of course, a miniature version of this algorithm again (and not one from any previous question.)

a) What is the "trick" used for part 3 of this method?

Now things are getting very tricky - only our best entrants got this far, and only a couple got the right answer!

We're dealing with something like:

```      A[1]  A[2]  A[3]  A[4]   ...   A[n-1] A[n]
|     |     |     |             |     |
-------------------------------------------------
|     |     |     |     |             |     |     |
|      \   /       \   /      ...      \   /      |
|       \ /         \ /                 \ /       |
|        +           +                   +        |  Black Box
|        | A'[1]     | A'[2] ... A'[n/2] |        |  prefix adder,
|    -----------------------------------------    |
|   |                                         |   |  n elements.
|   |   Black Box prefix adder, n/2 elements  |   |
|   |                                         |   |
|   |                                         |   |
|    -----------------------------------------    |
|        | B'[1]     | B'[2] ... B'[n/2] |        |
|                                                 |
|                                                 |
|           "some trick takes place here"         |
|                                                 |
|                                                 |
|     |     |     |     |             |     |     |
-------------------------------------------------
|     |     |     |             |     |
B[1]  B[2]  B[3]  B[4]   ...   B[n-1] B[n]
```

This puts us on the right track, and we can see some sort of structure emerging:

```B'[1] = A[1] + A[2]
B'[2] = A[1] + A[2] + A[3] + A[4]
...
```
so
```B[1] = A[1]
B[2] = B'[1]
B[3] = B'[1] + A[3]
...
B[n-1] = B'[n/2 - 1] + A[n-1]
B[n] = B'[n]
```
which can be implemented as follows:
```      A[1]  A[2]  A[3]  A[4]   ...   A[n-1] A[n]
|     |     |     |             |     |
-------------------------------------------------
|     |     |     |     |             |     |     |
|      \   /       \   /      ...      \   /      |
|       \ /         \ /                 \ /       |
|        +           +                   +        |
|        | A'[1]     | A'[2] ... A'[n/2] |        |
|    -----------------------------------------    |
|   |                                         |   |
|   |   Black Box prefix adder, n/2 elements  |   |
|   |                                         |   |
|   |                                         |   |
|    -----------------------------------------    |
|        | B'[1]     | B'[2] ... B'[n/2] |        |
|        |           |                   |        |
|   A[1]  \     A[3]  \      ...  A[n-1]  \       |
|     |    \      |    \              |    \      |
|     |     |     |     |  B[n/2 - 1] |     |     |
|     |     |\    |     |        \    |     |     |
|     |     | ----+     |         ----+     |     |
|     |     |     |     |             |     |     |
-------------------------------------------------
|     |     |     |             |     |
B[1]  B[2]  B[3]  B[4]   ...   B[n-1] B[n]
```

That is to say, the trick is that, for even values of i, the outputs B[i] = B' [i/2], the output of the intermediate prefix adder, and for odd values it is necessary to add A[i] to the previous value B[i-1] (except for B[1] = A[1]).

### Question 4b

b) How many time steps, and how many processors do we need?

This follows from the above.

Each black box takes two stages more than the black box inside it, so we have O[log n] time. To be more exact, we must take an example, for n=8:

```box   1 2 3 4 5 6 7 8

1      +   +   +   +    pairing/addition
2        +       +
3            +

3                       (no correction needed here)
2              +        correction
1         +  +   +
```

Thus, for n a power of 2, a black box of n inputs has n-1 adders, making a total of

```(n - 1) + (n/2 - 1) + (n/4 - 1) + ... + 1
= (1 + 2 + 4 + ... + n) - log2(n) - 1
= 2 * n - log2(n) - 2
```

that is, 2(n-1)-log2(n) or O[n] processors. The total time taken is therefore T (2 log2(n)-1).

### Further work

The solution of Question 4 represents a good compromise between time and complexity.

A faster solution can be produced using the ideas of Question 2 to give an answer in time T log2(n), with duplicate processors (that is, ones which carry out the same calculation as another) eliminated. This turns out to produce the same answer as Question 3.

Exercise 1: show this.

Another interesting (if infuriating) question is
Exercise 2: how are these results affected if n is not a power of 2?

Antony Rix