Using only the original program, one can naturally approximate derivatives by **divided differences**: run the program twice, on inputs **X** and **X+h**, and compute **(P(X+h) - P(X)) / h**. The centered divided differences **(P(X+h) - P(X-h)) / 2h** usually give a better approximation, but cost an additional execution of the program. In both cases the result is only an approximation, and choosing a good step **h** is notoriously difficult. Automatic Differentiation proceeds differently: the program is augmented so that, along with each original variable, it also computes its derivative; the differentiated program must therefore execute additional operations on the variables and their differentials. The derivatives are now computed analytically, using the well-known formulas for differentiating the elementary operations. Approximation errors have just vanished.
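To see the step-size problem concretely, here is a small sketch (the toy program `P` and the chosen point are our own illustration, in Python):

```python
import math

def P(x):
    # a stand-in "program" with one input and one output
    return math.sin(x) * x

x, h = 1.0, 1e-6
one_sided = (P(x + h) - P(x)) / h          # (P(X+h) - P(X)) / h
centered  = (P(x + h) - P(x - h)) / (2*h)  # (P(X+h) - P(X-h)) / 2h
exact     = math.cos(x) * x + math.sin(x)  # analytic derivative, for comparison
```

The centered formula is markedly closer to the analytic value, but both remain approximations whose quality depends on `h`.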

Again, there are two ways to implement A.D.:

**Overloading** consists in telling the compiler that each real number is replaced by a pair of real numbers, the second holding the differential. Each elementary operation on real numbers is overloaded, i.e. internally replaced by a new one, working on pairs of reals, that computes the value **and** its differential. The advantage is that the original program is virtually unchanged, since everything is done at compile time. The drawback is that the resulting program runs slowly, because it constantly builds and destroys pairs of real numbers. Moreover, it is very hard to implement the "**reverse mode**" with overloading.

**Source transformation** consists in adding into the program the new variables, arrays, and data structures that will hold the derivatives, and in adding the new instructions that compute these derivatives. The advantage is that the resulting program can be compiled into efficient code, and the "**reverse mode**" is possible. The drawback is that this is an enormous transformation, which cannot be done by hand on large programs. Tools are needed to perform this transformation correctly and rapidly. Our team studies this sort of tool. Our Tapenade engine is one such Automatic Differentiation tool, based on source transformation.
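To illustrate the overloading approach, here is a minimal dual-number sketch in Python; the class name `Dual`, the seeding convention, and the tiny function being differentiated are our own assumptions, not the design of any particular tool:

```python
import math

class Dual:
    """A value paired with its differential; arithmetic is overloaded so
    that every operation also propagates the derivative."""
    def __init__(self, val, dot=0.0):
        self.val = val   # the original value
        self.dot = dot   # its differential

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: d(u*v) = du*v + u*dv
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    # chain rule for the intrinsic: d(sin u) = cos(u) * du
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# f(x) = x * sin(x), whose derivative is sin(x) + x*cos(x)
x = Dual(2.0, 1.0)   # seed the differential dx = 1
y = x * sin(x)       # y.dot now holds f'(2.0)
```

The original expression `x * sin(x)` is written unchanged; only the type of `x` differs, which is exactly the appeal (and the run-time overhead) of overloading.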

- One may really want the **derivative** of the program. More exactly, since the program's inputs and outputs are generally **vectors**, the derivative is a matrix, known as the **Jacobian** matrix **J**. Each element **(i,j)** of the Jacobian matrix is the partial derivative of output **Y**_{i} with respect to input **X**_{j}. For that, the derivative object that must be computed along with each value **v** is the **vector** of all partial derivatives of **v** with respect to each input. When values are themselves arrays, the derivative object becomes a matrix. This leads to carrying partial Jacobian matrices along with the main computation stream. This is usually expensive in time and memory space. This drawback can be partially coped with, using sparse representations for the derivative objects.
- One may simply want **directional derivatives**, also known as **tangents**. This means that, instead of the explicit **Jacobian** matrix **J**, one just needs its effect on a given direction vector **dX** in the input space. This effect is actually **dY = J * dX**, but one can evaluate it without explicitly computing **J**. For that, the derivative object that must be computed along with each value **v** is its "**differential**" **dv**, i.e. (the first-order approximation of) the quantity by which **v** varies when the input **X** varies by **dX**. As expected, this is far less expensive than building the whole Jacobian. Note that directional derivatives can also provide us with the **Jacobian** matrix, if directional derivatives are computed for each canonical direction in the input space. In other words, this requires one computation of the directional derivative per input variable, and even fewer than that if one knows that the Jacobian matrix is sufficiently sparse.
- One may want **gradients**. When there is only one output, the Jacobian matrix has only one line, which is exactly the gradient of the result with respect to the inputs. When there are many outputs, one defines a scalar composite result, i.e. some linear combination **Yb** of these outputs, so as to compute the gradient of this result. To put it differently, one wants a combination **Xb** of the lines of the Jacobian: **Xb**^{*} = **Yb**^{*} * **J**, where the "*" superscript indicates transposition, i.e. **Xb**^{*} and **Yb**^{*} are row vectors. This can be computed without explicitly computing **J**. For that, the derivative object is called an "**adjoint**"; it is computed in the reverse of the original program's order, as will be shown later. Like with the directional derivatives, one can compute the whole Jacobian matrix by repeatedly computing gradients, one for each canonical direction in the output space, or fewer than that when the Jacobian is sparse. It is easy to check that this approach is advisable when there are far more inputs than outputs.
- One may want **higher-order derivatives** (Hessian). This can be done by computing a derivative object which is a partial Hessian tensor. Since this might be very expensive, another approach is to compute **directional** higher-order derivatives, and use them afterwards to recompute the coefficients of the Hessian (cf [GriewankUtkeWalther97]).
- One may want **directional higher-order derivatives**. Given a direction vector *dX* in the input space, one computes along the original program a derivative object that contains the first and higher derivatives of each intermediate value in the direction *dX*. Higher-order derivatives are useful to get a more accurate approximation, or to optimize more efficiently.
- One may want **Taylor series** expansions (cf [Griewank00]).
- One may want to compute **intervals**. This is not strictly speaking a matter of derivatives, but it shares many similarities. The derivative object computed along with each value **v** is the interval into which this value may range.
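As a concrete sketch of recovering the full Jacobian from directional derivatives, the following Python fragment evaluates one hand-written tangent per canonical input direction; the two-input, two-output function `f` and all names are our own toy choices:

```python
import math

def f(x1, x2):
    # the "original program": two inputs, two outputs
    y1 = x1 * x2
    y2 = math.sin(x1) + x2
    return (y1, y2)

def f_tangent(x1, x2, dx1, dx2):
    # tangent (directional-derivative) code, written by hand here,
    # mirroring statement by statement what an AD tool would generate
    dy1 = dx1 * x2 + x1 * dx2
    dy2 = math.cos(x1) * dx1 + dx2
    return (dy1, dy2)

def jacobian(x1, x2):
    # one tangent evaluation per canonical direction of the input space;
    # each run yields one column J * e_j of the Jacobian
    cols = [f_tangent(x1, x2, *e) for e in ((1.0, 0.0), (0.0, 1.0))]
    # cols[j] is column j; transpose the columns into rows
    return [[cols[j][i] for j in range(2)] for i in range(2)]

y1, y2 = f(2.0, 3.0)   # the original run
J = jacobian(2.0, 3.0) # two tangent runs rebuild the 2x2 Jacobian
```

With n inputs this costs n tangent evaluations, which is exactly the trade-off described above.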

Programs contain **control**: conditionals, loops, procedure calls, etc. For example, the program

```
if (x <= 1.0) then
  printf("x too small\n");
else {
  y = 1.0;
  while (y <= 10.0) {
    y = y*x;
    x = x+0.5;
  }
}
```

will execute, according to the control, as one of the following sequences of instructions:

```
printf("x too small\n");

y = 1.0;

y = 1.0; y = y*x; x = x+0.5;

y = 1.0; y = y*x; x = x+0.5; y = y*x; x = x+0.5;

y = 1.0; y = y*x; x = x+0.5; y = y*x; x = x+0.5; y = y*x; x = x+0.5;
```

Each of these sequences is "differentiable". The new program generated by Automatic Differentiation contains of course some control, such that each of its run-time sequences of instructions computes the derivative of the corresponding sequence of instructions in the initial program. Thus, this differentiated program will probably look like this (no matter whether it is obtained by overloading or by source transformation):

```
if (x <= 1.0) then
  printf("x too small\n");
else {
  y = 1.0;
  dy = 0.0;
  while (y <= 10.0) {
    dy = dy*x + y*dx;
    y = y*x;
    x = x+0.5;
  }
}
```

But it sometimes happens, as in this example, that the control itself depends on differentiated variables. In that case, a small change of the initial values may result in a change of the control. Here, a small change of **x** around **1.0** selects the other branch of the conditional. At such points the program is, strictly speaking, not differentiable, and the returned derivatives are those of the sequence of instructions actually executed.
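A runnable transcription of this example (our own Python sketch, not tool output) shows the tangent code following the same control path as the original program, and matching a divided-difference check:

```python
def original(x):
    if x <= 1.0:
        return None          # "x too small": nothing differentiable here
    y = 1.0
    while y <= 10.0:
        y = y * x
        x = x + 0.5
    return y

def tangent(x, dx):
    # same control structure; each derivative statement is inserted
    # just before the statement it differentiates
    if x <= 1.0:
        return None, None
    y, dy = 1.0, 0.0
    while y <= 10.0:
        dy = dy * x + y * dx
        y = y * x
        x = x + 0.5          # adding the constant 0.5 leaves dx unchanged
    return y, dy

# divided-difference check along the same control path (3 loop iterations)
h = 1e-7
y, dy = tangent(2.0, 1.0)
fd = (original(2.0 + h) - original(2.0)) / h
```

For `x = 2.0` both runs take the same path (three loop iterations), so the tangent and the divided difference agree; near `x = 1.0` or near a change in the iteration count they would not.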

Now that programs are identified with sequences of instructions, these sequences are identified with **composed functions**. The sequence of instructions

I_{1}; I_{2}; ... ; I_{p}

is identified to the function

f = f_{p} ∘ f_{p-1} ∘ ... ∘ f_{1}

where each f_{k} is the mathematical function implemented by instruction I_{k}.

Of course each of these functions is extended to operate on the domain of all the program variables; variables not overwritten by the instruction are just transmitted unchanged to the function's result. We can then use the chain rule to formally write the derivative of the program for a given vector input **X**:

f'(X) = f'_{p}(X_{p-1}) * f'_{p-1}(X_{p-2}) * ... * f'_{1}(X_{0})

where we write **X**_{0} = **X** and **X**_{k} = f_{k}(**X**_{k-1}) for the successive states of the program's memory. Therefore, computing the derivatives is just computing and multiplying these elementary Jacobian matrices f'_{k}(**X**_{k-1}). In practice a variable deserves a derivative, i.e. is **active**, only if both:

- it depends on some independent (input variable), in a differentiable way, and
- some dependent (output variable) depends on it, in a differentiable way.
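Here is a tiny worked instance of this chain rule, on a toy two-instruction program of our own; each instruction is extended to the whole state `(x, a)`, and the product of the elementary Jacobians gives the Jacobian of the whole sequence:

```python
# toy program on state (x, a):
#   I1:  a = x*x
#   I2:  x = x + a     (overwrites x)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x0, a0 = 3.0, 0.0            # state X_0
# f1: (x, a) -> (x, x*x); its Jacobian evaluated at X_0
J1 = [[1.0,    0.0],
      [2 * x0, 0.0]]
x1, a1 = x0, x0 * x0         # state X_1 = f1(X_0)
# f2: (x, a) -> (x + a, a); its Jacobian evaluated at X_1
J2 = [[1.0, 1.0],
      [0.0, 1.0]]

# chain rule: f'(X_0) = f'_2(X_1) * f'_1(X_0)
J = matmul(J2, J1)
```

The final state is `(x0 + x0*x0, x0*x0)`, so the expected Jacobian at `x0 = 3` is `[[7, 0], [6, 0]]`; the zero column also shows that `a` is not active as an input, since it is overwritten before being used.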

In the tangent mode, one evaluates the directional derivative

dY = f'(X) * dX = f'_{p}(X_{p-1}) * f'_{p-1}(X_{p-2}) * ... * f'_{1}(X_{0}) * dX

which must be computed from right to left, because "matrix times vector" products are so much cheaper than "matrix times matrix" products. Things are easy here, because this right-to-left order is also the execution order of the original program: each elementary derivative can be applied as soon as the values **X**_{k-1} it needs are available.

In the tangent mode, the differentiated program is thus just a copy of the given program, in which additional derivative instructions are inserted just before each instruction that involves active variables.

Here is an example. Consider the original (piece of) program:

```
...
v2 = 2*sin(v1) + 5
if (v2>0.0) then
  v4 = v2 + p1*v3/v2
  p1 = p1 + 0.5
endif
call sub1(v2,v4,p1)
...
```

Suppose that only **v1** and **v3** are active inputs, with differentials **dv1** and **dv3**. The tangent differentiated program is:

```
...
dv2 = 2*cos(v1)*dv1
v2 = 2*sin(v1) + 5
if (v2>0.0) then
  dv4 = dv2*(1-p1*v3/(v2*v2)) + dv3*p1/v2
  v4 = v2 + p1*v3/v2
  p1 = p1 + 0.5
endif
call sub1_d(v2,dv2,v4,dv4,p1)
...
```

Notice that the control structure is preserved (the conditional is copied unchanged), and that each derivative statement is inserted just before the original statement it differentiates.
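The same example can be transcribed into Python to check the tangent derivatives against divided differences (our own sketch; the call to `sub1` is omitted since its body is not shown, and these inputs always take the `v2 > 0` branch):

```python
import math

def piece(v1, v3, p1):
    # the original (piece of) program, without the call to sub1
    v2 = 2*math.sin(v1) + 5
    if v2 > 0.0:
        v4 = v2 + p1*v3/v2
        p1 = p1 + 0.5
    return v2, v4

def piece_d(v1, dv1, v3, dv3, p1):
    # tangent code, mirroring the differentiated pseudo-code above
    dv2 = 2*math.cos(v1)*dv1
    v2 = 2*math.sin(v1) + 5
    if v2 > 0.0:
        dv4 = dv2*(1 - p1*v3/(v2*v2)) + dv3*p1/v2
        v4 = v2 + p1*v3/v2
        p1 = p1 + 0.5
    return (v2, dv2), (v4, dv4)

# directional derivative along dv1 = 1, dv3 = 0 ...
(v2, dv2), (v4, dv4) = piece_d(1.0, 1.0, 2.0, 0.0, 3.0)
# ... compared with a divided difference in the same direction
h = 1e-7
fd = (piece(1.0 + h, 2.0, 3.0)[1] - piece(1.0, 2.0, 3.0)[1]) / h
```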

Intrinsic functions, such as **sin** here, are differentiated analytically, by inserting a call to their well-known derivative function (here **cos**).

The structure of subroutine calls is also preserved, but differs slightly: since a routine call is in fact a shorthand for another piece of program, and since this piece of program will be augmented with differentiated instructions too, the call to the original routine is replaced by a call to its differentiated version (here **sub1_d**), which takes the derivative variables as additional arguments.

Consider now the special case of a call to an **external** procedure, i.e. a procedure whose source is not available. The AD tool cannot differentiate it, and must conservatively assume that every output of the procedure depends on every input, in a differentiable way.

If the user knows that this is not the case, there must be a way to tell the tool. For example, in Tapenade one can give the "dependences" of each output with respect to each input. The Tapenade FAQ has a section that explains how to give this information on external procedures. Under the conservative assumption, a derivative must be propagated for every argument, here including **p1**:

```
...
dp1 = 0.0
call sub1_d(v2,dv2,v4,dv4,p1,dp1)
...
```

This dependence matrix can capture precise behavior. For example, it can express here that although the entry **p1** is active before the call, no output of **sub1** depends on it in a differentiable way, so that no derivative **dp1** needs to be passed.

If the user then provides a hand-written differentiated routine **sub1_d**, the differentiated call reduces to:

```
call sub1_d(v2,dv2,v4,dv4,p1)
```


This is all we will say here about the general principles of AD in direct mode. To go further, you may read the Tapenade reference article and the Tapenade 2.1 user's guide, that will describe conventions about variable names and declarations, differentiation of individual instructions and control structures, and in general all you must know to understand and use differentiated programs produced by Tapenade.

Strictly speaking, the gradient of a function is the transpose of its Jacobian matrix when there is a single scalar output. In the general case, given a weight vector **Yb** on the outputs, the adjoint **Xb** introduced above is obtained by transposing the chain rule:

Xb = f'^{*}(X) * Yb = f'_{1}^{*}(X_{0}) * f'_{2}^{*}(X_{1}) * ... * f'_{p}^{*}(X_{p-1}) * Yb

Again, this is most efficiently computed from right to left, because matrix-times-vector products are so much cheaper than matrix-times-matrix products. This is the principle of the reverse mode.

This turns out to make a very efficient program, at least theoretically (cf Section 3.4 of [Griewank00]). The computation time required for the gradient is only a small multiple of the run time of the original program, and is independent of the number of inputs in **X**. This makes the reverse mode the method of choice for computing gradients.

However, we observe that the adjoint computation uses the intermediate values **X**_{n} in the **reverse** of the order in which the original program computes them. Since the original program overwrites its variables, these values must somehow be recovered. There are two main strategies for this:

- **Recompute All (RA):** the **X**_{n} is recomputed when needed, by restarting the program on input **X**_{0} and running it until instruction `I`_{n}. The TAF tool uses this strategy. The following figure illustrates the **RA** strategy: one can see that before the adjoint of each instruction, the necessary part of the original program is recomputed. Notice that this requires saving the environment before the first run (big black dot), and restoring it (big white dots) before each duplicate run. The brute-force **RA** strategy has a quadratic time cost with respect to the total number of run-time instructions *p*.
- **Store All (SA):** the **X**_{n} are restored from a stack when needed. This stack is filled during a preliminary run of the program, the **"forward sweep"**, which additionally stores variables on the stack just before they are overwritten. The tools Adifor and Tapenade use this strategy. The following figure illustrates the **SA** strategy: one can see that before each instruction of the forward sweep, some values are saved (small black dots), and they are restored (small white dots) just before their corresponding derivative instruction during the so-called **backward sweep**. The brute-force **SA** strategy has a linear memory cost with respect to *p*.
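The Store All mechanism can be sketched in a few lines of Python (a toy model of our own, with a single overwritten scalar; real tools generate this as compiled code):

```python
import math

# toy program:  v = sin(v); v = v*v   on a single variable v
stack = []

def forward_sweep(v):
    # save each value on the stack just before it is overwritten
    stack.append(v); v = math.sin(v)
    stack.append(v); v = v * v
    return v

def backward_sweep(vb):
    # restore the saved values in reverse order, and apply the
    # transposed elementary Jacobians (plain scalars here)
    v = stack.pop()           # value held just before  v = v*v
    vb = 2 * v * vb
    v = stack.pop()           # value held just before  v = sin(v)
    vb = math.cos(v) * vb
    return vb

x = 0.5
y = forward_sweep(x)          # y = sin(x)^2, stack now holds [x, sin(x)]
xb = backward_sweep(1.0)      # adjoint of the output seeded to 1
```

Here `xb` is d(sin(x)^2)/dx = 2 sin(x) cos(x), obtained without ever forming a Jacobian, at the price of the stack traffic.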

On large programs, both strategies are combined with a trade-off known as **checkpointing**: a part **P** of the program is treated specially, so that only a **snapshot** of the program state around **P** is memorized, reducing the dominant cost of the underlying strategy at the price of some recomputation or storage.

- In **Recompute All (RA)** mode, the result of checkpointing part **P** is shown on the top of the next figure. The **snapshot** (big black dot) memorizes the state of the program just after **P**. The bottom part of the figure shows the execution of the program when there are many nested checkpoints. There are fewer duplicate executions, but it costs some extra snapshots.
- In **Store All (SA)** mode, the result of checkpointing part **P** is shown on the top of the next figure. The **snapshot** (big black dot) memorizes the state of the program just before **P**. There is an additional cost: the duplicate execution of **P**. The bottom part of the figure shows the execution of the program when there are many nested checkpoints. There are fewer values memorized, but it costs some snapshots and some duplicate executions.

The structure of a reverse differentiated program is therefore more complex than in the tangent mode. There is first a **forward sweep**, a copy of the original program augmented with instructions that save values just before they are overwritten, followed by a **backward sweep** that restores these values and computes the adjoint derivatives, in the reverse of the original order. Here is the same example as in the tangent mode.

Consider again the original (piece of) program:

```
...
v2 = 2*sin(v1) + 5
if (v2>0.0) then
  v4 = v2 + p1*v3/v2
  p1 = p1 + 0.5
endif
call sub1(v2,v4,p1)
...
```

Suppose again that only **v1** and **v3** are active inputs, so that the adjoints **v1b** and **v3b** are requested.

The forward sweep saves each value on the stack just before it is overwritten, and also records the control (the boolean pushed at the end of the conditional), so that the backward sweep can retrace the same path in reverse. The isolated pushes of **v2**, **v4**, and **p1** just before the call save the arguments that **sub1** may overwrite:

```
...
push(v2)
v2 = 2*sin(v1) + 5
if (v2>0.0) then
  push(v4)
  v4 = v2 + p1*v3/v2
  push(p1)
  p1 = p1 + 0.5
  push(true)
else
  push(false)
endif
push(v2)
push(v4)
push(p1)
call sub1(v2,v4,p1)
...
```

The backward sweep then pops these values in reverse order and computes the adjoints:

```
...
pop(p1)
pop(v4)
pop(v2)
call sub1_b(v2,v2b,v4,v4b,p1)
pop(control)
if (control) then
  pop(p1)
  pop(v4)
  v3b = v3b + v4b*p1/v2
  v2b = v2b + v4b*(1-p1*v3/(v2*v2))
  v4b = 0.0
endif
pop(v2)
v1b = v1b + v2b*2*cos(v1)
v2b = 0.0
...
```

Notice that the adjoint of one original instruction is now very often a sequence of instructions. Notice also that subroutine **sub1** is differentiated in reverse mode, into **sub1_b**.
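Transcribed into Python (our own sketch; the call to `sub1` is omitted since its body is not shown, and pushes of values with no prior definition in this fragment are dropped), the two sweeps for this example give adjoints that match the analytic derivatives:

```python
import math

def forward_and_backward(v1, v3, p1, v4b):
    # forward sweep: original computation, saving overwritten values
    stack = []
    v2 = 2*math.sin(v1) + 5
    if v2 > 0.0:
        v4 = v2 + p1*v3/v2
        stack.append(p1)         # push(p1) before it is overwritten
        p1 = p1 + 0.5
        stack.append(True)       # push(true): record the control
    else:
        stack.append(False)      # push(false)
    # backward sweep: adjoints, in the reverse of the original order
    v1b = v2b = v3b = 0.0
    control = stack.pop()
    if control:
        p1 = stack.pop()         # pop(p1): the value used to compute v4
        v3b = v3b + v4b*p1/v2
        v2b = v2b + v4b*(1 - p1*v3/(v2*v2))
        v4b = 0.0
    v1b = v1b + v2b*2*math.cos(v1)
    v2b = 0.0
    return v1b, v3b

# adjoint of v4 seeded to 1: one backward run yields BOTH input adjoints
v1b, v3b = forward_and_backward(1.0, 2.0, 3.0, 1.0)
```

One forward plus one backward sweep delivers the derivatives of the output with respect to all inputs at once, which is the point of the reverse mode.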

The additional parameters passed to the call to **sub1_b** are the adjoint variables (here **v2b** and **v4b**) of its active arguments.

When the procedure is a function, its reverse differentiated procedure takes an extra parameter, which is the adjoint of the function's result. On the other hand, the function's result itself can very well be forgotten: remember that the calls to the reverse differentiated procedures belong to the backward sweep, where the original results are no longer needed.

This is all we will say here about the general principles of AD in reverse mode. To go further, you may read the Tapenade reference article and the Tapenade 2.1 user's guide, that will describe conventions about variable names and declarations, differentiation of individual instructions, control structures, procedure calls, and in general all you must know to understand and use differentiated programs produced by Tapenade.


Last modified: Wed Apr 17 15:39:30 CEST 2013 by the Tropics team