Prefix Addition

Exact answers for time and magnitude are not a major issue here, so in some places I will use the Order notation O[n], which means "of the order of n", i.e. give or take a constant multiplier or a constant offset not far from one.

Other important notes from the question:

- Each process step takes a fixed time, *T*.
- Processors cannot be re-used.

This calls for a diagram:

    Initial
    inputs    (Total = 0)                       Outputs

    A[1] --o------------------------------- B[1]   (so no adder needed here)
            \
    A[2] --- + --o--------------------------- B[2]
                  \
    A[3] --------- + --o--------------------- B[3]
                        \
     ...                 ...
                           \
    A[n] ------------------ + --------------- B[n]

which shows that we need *n*-1 adders (O[*n*] processors) and, since each adder must wait for the result of the one before it, *n*-1 time steps, i.e. time *T*(*n*-1) (O[*nT*]).
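The serial chain above can be sketched briefly in Python (illustrative only; the function name is mine). The counts it returns match the diagram: *n*-1 additions, each dependent on the previous one.

```python
def serial_prefix_sums(a):
    """Chain of adders: B[1] = A[1], B[i] = B[i-1] + A[i].

    Uses n-1 additions, and every addition depends on the result
    of the previous one, so the chain takes n-1 sequential steps.
    """
    b = []
    total = 0
    steps = 0
    for x in a:
        if b:                # subsequent elements need an adder
            total = total + x
            steps += 1       # each adder costs one time step T
        else:
            total = x        # A[1] passes straight through: no adder
        b.append(total)
    return b, steps

# e.g. serial_prefix_sums([1, 2, 3, 4]) -> ([1, 3, 6, 10], 3)
```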

*We solve the problem faster at the expense of more processors. We have already seen how using 3
processors lets us calculate (x*y)*(a*b) in only 2 steps.
a) Suppose we only want to calculate B[n], how many time steps can we do it in, and how many
processors do we require? You may find it easier to just consider the case when n is a power of 2.
*

The hint for this was given in the question: using a binary structure

    A[1]  A[2]  A[3]  A[4]
      |     |     |     |
      \- + -/     \- + -/
         |           |
         \---- + ----/
               |
             B[4]

for 2-input processors.

For *n* a power of 2, we effectively get the formula for the number of processors for free: *n*-1 (O[*n*]) once again. This is as we might expect, since we are performing exactly the same amount of computation. The question's comment "at the expense of a few more processors" applies (particularly if *n* is not a power of 2) because we now **don't** have the values for the other *B*[*i*] — the intermediate results no longer come for free.

The time taken is much improved: *T* log2(*n*), where the logarithm to base 2 is rounded up to the next integer if *n* is not a power of 2. O[log *n*] is a sufficient answer.
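The tree reduction can be simulated in a few lines of Python (a sketch, with *n* a power of 2; the function name is mine), confirming ceil(log2 *n*) steps and *n*-1 adders:

```python
def tree_sum(a):
    """Binary-tree reduction for B[n] alone (len(a) a power of 2).

    Returns (total, time_steps, adders_used): log2(n) parallel
    steps, using n-1 two-input adders in total.
    """
    level = list(a)
    steps = 0
    adders = 0
    while len(level) > 1:
        # all adders on one level run in parallel: one time step T
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        steps += 1
        adders += len(level)
    return level[0], steps, adders

# e.g. tree_sum([1, 2, 3, 4]) -> (10, 2, 3)
```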

Thus far we have assumed 2-input processors, because these are what is used in practice. The same methods can easily be extended to 3-input devices and so on.

*b) One way of solving the problem is to calculate each element of B like this. Using the method in
part a) for each element of B, and assuming the processors calculating each element cannot exchange
data with the processors calculating another element, how many time steps do we require to calculate
all of B, and how many processors?
*

To be precise, we need

    0 + 1 + 2 + ... + (n-1) = n(n-1)/2

processors, that is O[*n*²], while the time remains *T* log2(*n*) (O[*T* log *n*]), since all the trees operate in parallel and the deepest is the one for *B*[*n*].
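A short Python sketch of this count (the function name is mine; *B*[*i*] alone needs *i*-1 adders by the method of part a):

```python
import math

def independent_trees_cost(n):
    """Cost of computing every B[i] with its own tree of adders,
    with no sharing of results between elements.

    B[i] alone needs i-1 adders, so in total we use
    0 + 1 + ... + (n-1) = n*(n-1)/2 processors; the time is set
    by the deepest tree, the one for B[n].
    """
    processors = sum(i - 1 for i in range(1, n + 1))  # = n*(n-1)//2
    time_steps = math.ceil(math.log2(n))              # tree for B[n]
    return processors, time_steps
```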

*
We can consider a divide and conquer approach to the problem - i.e.
splitting the problem into two halves, solving them, and then producing
the answer.
*

*We do need a little trick to produce the answer, since if we split*

[1 2 3 4]

*into [1 2] and [3 4], solving each half produces [1 3] and [3 7]
respectively. The required answer though is
[1 3 6 10] and not [1 3 3 7].
*

*a) What is the trick? (If you illustrate this, try to consider the
solving of each half as taking place in a "black box" - i.e. a device
whose inside you do not need to explain, that takes n/2 elements and
produces the correct n/2 prefix additions.)
*

OK, so we assume we have a black box that performs prefix addition itself, and want to use this to split the problem up so we can perform parallel processing.

Now noting that the last output from such a box is the sum of all the elements in it, we need to correct subsequent black boxes' outputs by this amount.

There are two ways of showing this: one assumes that each black box has a "carry" input; the other performs the correction process explicitly. The only difference is that the correction adders can be found inside the carry-type boxes, and are thus wasted at the first element. We shall show the simpler solution, as the "carry" concept boils down to this anyway.

(A written answer describing the addition "trick" is all that is required here, but what follows will help us to answer part (b).)

The recursive definition of a "black box" prefix adder:

    A[1]  A[2]  ...  A[n/2]      A[1+n/2]  ...  A[n]
     |     |           |             |            |
     --------------------------   --------------------------
    | Black box prefix adder,  | | Black box prefix adder,  |
    | n/2 elements             | | n/2 elements             |
     --------------------------   --------------------------
     |     |           |\            |            |
     |     |           | \ "carry"   |            |
     |     |           |  \--------->+    ...  -->+
     |     |           |             |            |
    B[1]  B[2]  ...  B[n/2]      B[1+n/2]  ...  B[n]

(the whole of the above is itself a black box prefix adder for *n* elements)

and a one-element prefix adder is simply empty:

    A[1]
     |
    ---
    | |
    ---
     |
    B[1]
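The recursive definition translates directly into code. A minimal Python sketch (the function name is mine), assuming the length is a power of 2:

```python
def black_box_prefix(a):
    """Recursive 'black box' prefix adder.

    Split the input in half, recurse on each half, then add the
    'carry' (the last output of the left half) to every output of
    the right half. A single element passes straight through.
    """
    n = len(a)
    if n == 1:
        return list(a)
    left = black_box_prefix(a[: n // 2])
    right = black_box_prefix(a[n // 2 :])
    carry = left[-1]          # sum of all elements in the left half
    return left + [carry + r for r in right]

# e.g. black_box_prefix([1, 2, 3, 4]) -> [1, 3, 6, 10]
```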

*
In the divide and conquer approach these black boxes are also solved by
using the same method. So if we solve A[1]..A[16] by first solving
A[1]..A[8] and A[9]..A[16] and then applying the trick, we would solve
A[1]..A[8] by solving A[1]..A[4], A[5]..A[8] and applying the trick
etc... We can, of course, solve a single element instantly (the prefix
addition of [i] is [i].)
b) Using this divide and conquer approach how many time steps do we
need, and how many processors?
*

The following figure shows the distribution of adders for 8 elements, ignoring the one-element prefix adders:

    12345678
     + + + +
      ++  ++
        ++++
    12345678

Continuing to assume that *n* is a power of 2 (which will give
us a correct O[ ] answer in any case), we can see that there will be
a total of *n* log2(*n*)/2 (O[*n* log *n*])
processors taking time
*T* log2(*n*) (O[*T* log *n*]).
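The adder count can be checked with a short recurrence (a sketch; the function name is mine): each box of *n* elements uses two half-size boxes plus *n*/2 correction adders.

```python
def adders_in_divide_and_conquer(n):
    """Count the 2-input adders in the divide-and-conquer scheme
    (n a power of 2): P(1) = 0 and P(n) = 2*P(n/2) + n/2, which
    solves to (n/2) * log2(n).
    """
    if n == 1:
        return 0
    return 2 * adders_in_divide_and_conquer(n // 2) + n // 2

# e.g. adders_in_divide_and_conquer(8) -> 12, matching 8*3/2
```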

*
The previous method is good, but we can (at the expense of a bit more
time) reduce the number of processors. Our final method works as
follows:
*

*1. Pair off the elements and sum them.*
*2. Feed these n/2 sums into a "black box" which produces the expected n/2 prefix additions.*
*3. Do some "trick" to get the required results from the information so far.*

*The black box is, of course, a miniature version of this algorithm again
(and not one from any previous question.)*

*a) What is the "trick" used for part 3 of this method?
*

Now things are getting very tricky - only our best entrants got this far, and only a couple got the right answer!

We're dealing with something like:

    A[1] A[2]  A[3] A[4]   ...  A[n-1] A[n]
     |    |     |    |            |     |
      \  /       \  /              \   /
       +          +        ...      +
       | A'[1]    | A'[2]           | A'[n/2]
      -----------------------------------------
     |  Black box prefix adder, n/2 elements   |
      -----------------------------------------
       | B'[1]    | B'[2]   ...     | B'[n/2]
       |          |                 |
           "some trick takes place here"
     |    |     |    |            |     |
    B[1] B[2]  B[3] B[4]   ...  B[n-1] B[n]

(the whole of the above forms a black box prefix adder for *n* elements)

This puts us on the right track, and we can see some sort of structure emerging:

    B'[1] = A[1] + A[2]
    B'[2] = A[1] + A[2] + A[3] + A[4]
    ...

so

    B[1]   = A[1]
    B[2]   = B'[1]
    B[3]   = B'[1] + A[3]
    ...
    B[n-1] = B'[n/2 - 1] + A[n-1]
    B[n]   = B'[n/2]

which can be implemented as follows:

    A[1]  A[2]  A[3]  A[4]   ...  A[n-1]  A[n]
     |     |     |     |            |      |
     |      \   /       \   /  ...   \    /
     |        +           +            +
     |        | A'[1]     | A'[2]      | A'[n/2]
     |       ------------------------------------
     |      | Black box prefix adder,            |
     |      | n/2 elements                       |
     |       ------------------------------------
     |        | B'[1]     | B'[2]  ...  | B'[n/2]
     |        |\          |\            |
     |  A[3]  | \   A[5]  | \           |
     |     \  |  \     \  |  \   ...    |
     |      \-|-- +     \-|-- +         |
     |        |   |       |   |         |
    B[1]    B[2] B[3]   B[4] B[5] ...  B[n]

That is to say, the trick is that for even values of *i* the outputs are *B*[*i*] = *B'*[*i*/2], the corresponding output of the intermediate prefix adder, while for odd values of *i* it is necessary to add *A*[*i*] to the previous output *B*[*i*-1] (except for *B*[1] = *A*[1], which needs no adder).
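The whole method — pairing, recursive black box, then the trick — can be sketched in Python (assuming *n* is a power of 2; the function name is mine):

```python
def efficient_prefix(a):
    """Pair-and-recurse prefix adder (the final method).

    1. Sum adjacent pairs.
    2. Prefix-add the n/2 pair sums recursively (the black box).
    3. The trick: B[2i] = B'[i]; for odd positions,
       B[2i+1] = B'[i] + A[2i+1]; and B[1] = A[1].
    """
    n = len(a)
    if n == 1:
        return list(a)
    pair_sums = [a[i] + a[i + 1] for i in range(0, n, 2)]  # step 1
    bp = efficient_prefix(pair_sums)                        # step 2
    b = []
    for i in range(n):                                      # step 3
        if i % 2 == 1:                 # 1-based even positions
            b.append(bp[i // 2])       # B[2], B[4], ... copied down
        elif i == 0:
            b.append(a[0])             # B[1] = A[1], no adder
        else:                          # 1-based odd positions >= 3
            b.append(bp[i // 2 - 1] + a[i])
    return b

# e.g. efficient_prefix([1, 2, 3, 4]) -> [1, 3, 6, 10]
```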

*
b) How many time steps, and how many processors do we need?
*

This follows from the above.

Each black box takes two stages more than the black box inside it (one for the pairing and one for the correction), so we have O[log *n*] time. To be more exact, let us take an example, for *n*=8:

    box   1 2 3 4 5 6 7 8
     1      +   +   +   +    pairing/addition
     2          +       +
     3                  +
     3                       (no correction needed here)
     2              +        correction
     1        +   +   +

Thus, for *n* a power of 2, a black box of *n* inputs contributes *n*-1 adders of its own (*n*/2 for the pairing, *n*/2 - 1 for the correction), in addition to those inside its inner box, making a total of

    (n - 1) + (n/2 - 1) + (n/4 - 1) + ... + 1
        = (1 + 2 + 4 + ... + n) - log2(n) - 1
        = 2n - log2(n) - 2

that is, 2(*n*-1)-log2(*n*) or O[*n*] processors.
The total time taken is therefore *T* (2 log2(*n*)-1).
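This count, too, can be verified by its recurrence (a sketch; the function name is mine):

```python
def final_method_adders(n):
    """Adders in the final method (n a power of 2): n/2 for the
    pairing, plus the inner black box of n/2 elements, plus
    n/2 - 1 for the correction. That is P(1) = 0 and
    P(n) = P(n/2) + n - 1, which solves to 2(n-1) - log2(n).
    """
    if n == 1:
        return 0
    return (n // 2) + final_method_adders(n // 2) + (n // 2 - 1)

# e.g. final_method_adders(8) -> 11, matching 2*(8-1) - 3
```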

The solution of Question 4 represents a good compromise between time and complexity.

A faster solution can be produced
using the ideas of Question 2 to give an answer in time *T* log2(*n*),
with
duplicate processors (that is, ones which carry out the same
calculation as another) eliminated. This
turns out to produce the same answer as Question 3.

**Exercise 1**: show this.

Another interesting (if infuriating) question is

**Exercise 2**: how are these results affected if

Antony Rix

(see contact details from home page)