Differential Dynamic Programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne[2] and subsequently analysed in Jacobson and Mayne's eponymous book[3]. The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method[4][5].
Finite-Horizon Discrete-Time problems
The dynamics

$\mathbf{x}_{i+1} = \mathbf{f}(\mathbf{x}_i, \mathbf{u}_i) \qquad (1)$

describe the evolution of the state $\mathbf{x}$ given the control $\mathbf{u}$ from time $i$ to time $i+1$. The total cost $J_0$ is the sum of running costs $\ell$ and final cost $\ell_f$, incurred when starting from state $\mathbf{x}$ and applying the control sequence $\mathbf{U} \equiv \{\mathbf{u}_0, \mathbf{u}_1, \dots, \mathbf{u}_{N-1}\}$ until the horizon $N$ is reached:

$J_0(\mathbf{x}, \mathbf{U}) = \sum_{i=0}^{N-1} \ell(\mathbf{x}_i, \mathbf{u}_i) + \ell_f(\mathbf{x}_N),$

where $\mathbf{x}_0 \equiv \mathbf{x}$, and the $\mathbf{x}_i$ for $i > 0$ are given by Eq. 1. The solution of the optimal control problem is the minimizing control sequence

$\mathbf{U}^*(\mathbf{x}) \equiv \operatorname{argmin}_{\mathbf{U}} J_0(\mathbf{x}, \mathbf{U}).$

Trajectory optimization means finding $\mathbf{U}^*$ for a particular $\mathbf{x}_0$, rather than for all possible initial states.
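The problem setup can be sketched numerically. The following is a minimal illustration using a hypothetical discrete-time double integrator with quadratic costs; the dynamics, cost weights, horizon, and step size are illustrative assumptions, not taken from the article:

```python
import numpy as np

N = 50            # horizon length (assumed)
dt = 0.1          # discretization step (assumed)

def f(x, u):
    """Discrete dynamics x_{i+1} = f(x_i, u_i): a position/velocity integrator."""
    return np.array([x[0] + dt * x[1], x[1] + dt * u[0]])

def running_cost(x, u):
    """Running cost l(x, u): quadratic penalties on state and control."""
    return 0.5 * (x @ x) + 0.05 * (u @ u)

def final_cost(x):
    """Final cost l_f(x_N)."""
    return 10.0 * (x @ x)

def total_cost(x0, U):
    """J_0(x, U): roll the dynamics forward and accumulate the costs."""
    x, J = x0, 0.0
    for u in U:
        J += running_cost(x, u)
        x = f(x, u)
    return J + final_cost(x)

x0 = np.array([1.0, 0.0])     # a particular initial state
U = np.zeros((N, 1))          # nominal (zero) control sequence
print(total_cost(x0, U))      # → 35.0
```

Trajectory optimization, in this setting, means searching over `U` to minimize `total_cost(x0, U)` for this one `x0`.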
Dynamic Programming
Let $\mathbf{U}_i \equiv \{\mathbf{u}_i, \mathbf{u}_{i+1}, \dots, \mathbf{u}_{N-1}\}$ be the partial control sequence and define the cost-to-go $J_i$ as the partial sum of costs from $i$ to $N$:

$J_i(\mathbf{x}, \mathbf{U}_i) = \sum_{j=i}^{N-1} \ell(\mathbf{x}_j, \mathbf{u}_j) + \ell_f(\mathbf{x}_N).$

The optimal cost-to-go or Value Function at time $i$ is the cost-to-go given the minimizing control sequence:

$V(\mathbf{x}, i) \equiv \min_{\mathbf{U}_i} J_i(\mathbf{x}, \mathbf{U}_i).$
Setting $V(\mathbf{x}, N) \equiv \ell_f(\mathbf{x}_N)$, the Dynamic Programming Principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

$V(\mathbf{x}, i) = \min_{\mathbf{u}} \left[\ell(\mathbf{x}, \mathbf{u}) + V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1)\right]. \qquad (2)$

This is the Bellman Equation.
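The backward recursion of the Bellman Equation can be made concrete on a coarsely discretized scalar problem. The grids, dynamics, and costs below are hypothetical, chosen only to show the recursion's structure:

```python
import numpy as np

xs = np.linspace(-2, 2, 41)       # discretized state grid (assumed)
us = np.linspace(-1, 1, 21)       # discretized control grid (assumed)
N = 20                            # horizon (assumed)

f = lambda x, u: np.clip(x + 0.1 * u, -2, 2)   # dynamics, clipped to the grid
l = lambda x, u: 0.1 * (x**2 + u**2)            # running cost l(x, u)
lf = lambda x: x**2                              # final cost l_f(x)

nearest = lambda x: np.abs(xs - x).argmin()     # project a state onto the grid

V = [None] * (N + 1)
V[N] = lf(xs)                                   # V(x, N) = l_f(x)
for i in range(N - 1, -1, -1):                  # proceed backwards in time
    # V(x, i) = min_u [ l(x, u) + V(f(x, u), i + 1) ]
    V[i] = np.array([min(l(x, u) + V[i + 1][nearest(f(x, u))] for u in us)
                     for x in xs])
```

Note the key property the article describes: each step minimizes over a *single* control `u`, rather than over the whole sequence. DDP avoids this kind of global tabulation by keeping only local quadratic models along a trajectory.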
Differential Dynamic Programming
DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If

$\ell(\mathbf{x}, \mathbf{u}) + V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1)$

is the argument of the $\min$ operator in Eq. 2, let $Q$ be the variation of this quantity around the $i$-th $(\mathbf{x}, \mathbf{u})$ pair:

$Q(\delta\mathbf{x}, \delta\mathbf{u}) \equiv \ell(\mathbf{x}+\delta\mathbf{x}, \mathbf{u}+\delta\mathbf{u}) + V(\mathbf{f}(\mathbf{x}+\delta\mathbf{x}, \mathbf{u}+\delta\mathbf{u}), i+1) - \ell(\mathbf{x}, \mathbf{u}) - V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1)$

and expand to second order:

$Q(\delta\mathbf{x}, \delta\mathbf{u}) \approx \frac{1}{2}\begin{bmatrix}1\\ \delta\mathbf{x}\\ \delta\mathbf{u}\end{bmatrix}^{\mathsf T}\begin{bmatrix}0 & Q_\mathbf{x}^{\mathsf T} & Q_\mathbf{u}^{\mathsf T}\\ Q_\mathbf{x} & Q_{\mathbf{xx}} & Q_{\mathbf{xu}}\\ Q_\mathbf{u} & Q_{\mathbf{ux}} & Q_{\mathbf{uu}}\end{bmatrix}\begin{bmatrix}1\\ \delta\mathbf{x}\\ \delta\mathbf{u}\end{bmatrix} \qquad (3)$
The notation used here is a variant of the notation of Morimoto[6]. Dropping the index $i$ for readability, with primes denoting the next time-step ($V' \equiv V(i+1)$), the expansion coefficients are

$Q_\mathbf{x} = \ell_\mathbf{x} + \mathbf{f}_\mathbf{x}^{\mathsf T} V'_\mathbf{x}$
$Q_\mathbf{u} = \ell_\mathbf{u} + \mathbf{f}_\mathbf{u}^{\mathsf T} V'_\mathbf{x}$
$Q_{\mathbf{xx}} = \ell_{\mathbf{xx}} + \mathbf{f}_\mathbf{x}^{\mathsf T} V'_{\mathbf{xx}}\, \mathbf{f}_\mathbf{x} + V'_\mathbf{x} \cdot \mathbf{f}_{\mathbf{xx}}$
$Q_{\mathbf{ux}} = \ell_{\mathbf{ux}} + \mathbf{f}_\mathbf{u}^{\mathsf T} V'_{\mathbf{xx}}\, \mathbf{f}_\mathbf{x} + V'_\mathbf{x} \cdot \mathbf{f}_{\mathbf{ux}}$
$Q_{\mathbf{uu}} = \ell_{\mathbf{uu}} + \mathbf{f}_\mathbf{u}^{\mathsf T} V'_{\mathbf{xx}}\, \mathbf{f}_\mathbf{u} + V'_\mathbf{x} \cdot \mathbf{f}_{\mathbf{uu}}$

The last terms in the last three equations denote contraction of a vector with a tensor. Minimizing the quadratic approximation (3) with respect to $\delta\mathbf{u}$ we have

$\delta\mathbf{u}^* = \operatorname{argmin}_{\delta\mathbf{u}} Q(\delta\mathbf{x}, \delta\mathbf{u}) = -Q_{\mathbf{uu}}^{-1}(Q_\mathbf{u} + Q_{\mathbf{ux}}\,\delta\mathbf{x}), \qquad (4)$

giving us an open-loop term $\mathbf{k} = -Q_{\mathbf{uu}}^{-1} Q_\mathbf{u}$ and a feedback gain term $\mathbf{K} = -Q_{\mathbf{uu}}^{-1} Q_{\mathbf{ux}}$. Plugging the result back into (3), we now have a quadratic model of the Value at time $i$:

$\Delta V(i) = -\tfrac{1}{2}\, Q_\mathbf{u}^{\mathsf T} Q_{\mathbf{uu}}^{-1} Q_\mathbf{u}$
$V_\mathbf{x}(i) = Q_\mathbf{x} - Q_{\mathbf{ux}}^{\mathsf T} Q_{\mathbf{uu}}^{-1} Q_\mathbf{u}$
$V_{\mathbf{xx}}(i) = Q_{\mathbf{xx}} - Q_{\mathbf{ux}}^{\mathsf T} Q_{\mathbf{uu}}^{-1} Q_{\mathbf{ux}}$
Recursively computing the local quadratic models of $V(i)$ and the control modifications $\{\mathbf{k}_i, \mathbf{K}_i\}$, from $i = N-1$ down to $i = 0$, constitutes the backward pass. As above, the Value is initialized with $V(\mathbf{x}, N) \equiv \ell_f(\mathbf{x}_N)$. Once the backward pass is completed, a forward pass computes a new trajectory:

$\hat{\mathbf{x}}_0 = \mathbf{x}_0$
$\hat{\mathbf{u}}_i = \mathbf{u}_i + \mathbf{k}_i + \mathbf{K}_i(\hat{\mathbf{x}}_i - \mathbf{x}_i)$
$\hat{\mathbf{x}}_{i+1} = \mathbf{f}(\hat{\mathbf{x}}_i, \hat{\mathbf{u}}_i)$
The backward passes and forward passes are iterated until convergence.
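The two passes can be sketched for a hypothetical linear-quadratic problem. With linear dynamics the second-order dynamics tensors vanish, so this sketch also coincides with the iLQR simplification of DDP; all matrices, weights, and names below are illustrative assumptions:

```python
import numpy as np

N, n, m = 30, 2, 1
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # dynamics f(x, u) = A x + B u (assumed)
B = np.array([[0.0], [0.1]])
Qc, Rc = np.eye(n), 0.1 * np.eye(m)      # running cost 0.5 x'Qc x + 0.5 u'Rc u
Qf = 10.0 * np.eye(n)                    # final cost 0.5 x'Qf x

def rollout(x0, U, K=None, k=None, X_nom=None):
    """Forward pass: apply the controls, plus feedback terms if given."""
    X, Unew = [x0], []
    for i in range(N):
        u = U[i]
        if K is not None:                # u_hat = u + k + K (x_hat - x_nom)
            u = U[i] + k[i] + K[i] @ (X[i] - X_nom[i])
        Unew.append(u)
        X.append(A @ X[i] + B @ u)
    return X, Unew

def cost(X, U):
    J = sum(0.5 * X[i] @ Qc @ X[i] + 0.5 * U[i] @ Rc @ U[i] for i in range(N))
    return J + 0.5 * X[N] @ Qf @ X[N]

def backward_pass(X, U):
    """Compute open-loop terms k(i) and feedback gains K(i) from the Q-model."""
    Vx, Vxx = Qf @ X[N], Qf              # initialize with the final cost
    ks, Ks = [None] * N, [None] * N
    for i in range(N - 1, -1, -1):
        Qx  = Qc @ X[i] + A.T @ Vx
        Qu  = Rc @ U[i] + B.T @ Vx
        Qxx = Qc + A.T @ Vxx @ A
        Qux = B.T @ Vxx @ A
        Quu = Rc + B.T @ Vxx @ B
        ks[i] = -np.linalg.solve(Quu, Qu)    # open-loop term k
        Ks[i] = -np.linalg.solve(Quu, Qux)   # feedback gain K
        # quadratic model of the Value at time i
        Vx  = Qx + Ks[i].T @ Quu @ ks[i] + Ks[i].T @ Qu + Qux.T @ ks[i]
        Vxx = Qxx + Ks[i].T @ Quu @ Ks[i] + Ks[i].T @ Qux + Qux.T @ Ks[i]
    return ks, Ks

x0 = np.array([1.0, 0.0])
U = [np.zeros(m) for _ in range(N)]
X, _ = rollout(x0, U)                    # initial nominal trajectory
for _ in range(3):                       # iterate until convergence
    ks, Ks = backward_pass(X, U)
    X, U = rollout(x0, U, Ks, ks, X)
```

Because this toy problem is exactly linear-quadratic, the quadratic models are exact and a single iteration already reaches the optimum; on nonlinear dynamics the same loop is repeated until the cost stops decreasing.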
Regularization and Line-Search
Differential Dynamic Programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence[7][8]. Regularization in the DDP context means ensuring that the matrix $Q_{\mathbf{uu}}$ in Eq. 4 is positive definite. Line-search in DDP amounts to scaling the open-loop control modification $\mathbf{k}$ by some $0 < \alpha \le 1$.
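Both devices can be sketched as follows. The helper name `regularized_gains`, the diagonal-shift scheme for $\mu$, and the backtracking schedule for $\alpha$ are hypothetical choices for illustration, not prescribed by the sources:

```python
import numpy as np

def regularized_gains(Quu, Qu, Qux, mu=1e-6):
    """Shift Quu by mu*I until it is positive definite, then solve for k, K."""
    I = np.eye(Quu.shape[0])
    while True:
        try:
            np.linalg.cholesky(Quu + mu * I)   # succeeds iff positive definite
            break
        except np.linalg.LinAlgError:
            mu *= 10.0                         # regularization: inflate mu
    Quu_reg = Quu + mu * I
    k = -np.linalg.solve(Quu_reg, Qu)          # open-loop term
    K = -np.linalg.solve(Quu_reg, Qux)         # feedback gain
    return k, K

# Line-search: scale only the open-loop modification k by alpha in (0, 1],
# keeping the feedback term intact, and accept the first alpha whose forward
# pass reduces the total cost (backtracking sketch):
#
#   for alpha in (1.0, 0.5, 0.25, 0.125):
#       u_new = u + alpha * k + K @ (x_new - x_nom)
#       ... accept if the new trajectory's cost decreased ...
```

The Cholesky attempt doubles as the positive-definiteness test: if it fails, $Q_{\mathbf{uu}} + \mu I$ is not yet positive definite and $\mu$ is increased.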
References

- ^ Liao, L. Z.; Shoemaker, C. A. (1992). "Advantages of differential dynamic programming over Newton's method for discrete-time optimal control problems". Cornell University, Ithaca, NY.
- ^ Mayne, D. Q. (1966). "A second-order gradient method of optimizing non-linear discrete time systems". Int J Control. 3: 85–95. doi:10.1080/00207176608921369.
- ^ Jacobson, David H.; Mayne, David Q. (1970). Differential Dynamic Programming. New York: American Elsevier Pub. Co. ISBN 0444000704.
- ^ de O. Pantoja, J. F. A. (1988). "Differential dynamic programming and Newton's method". International Journal of Control. 47 (5): 1539–1553. doi:10.1080/00207178808906114. ISSN 0020-7179.
- ^ Liao, L. Z.; Shoemaker, C. A. (1992). "Advantages of differential dynamic programming over Newton's method for discrete-time optimal control problems". Cornell University, Ithaca, NY.
- ^ Morimoto, J.; Atkeson, C. G. (2003). "Minimax differential dynamic programming: Application to a biped walking robot". Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003). Vol. 2. pp. 1927–1932.
- ^ Liao, L. Z.; Shoemaker, C. A. (1991). "Convergence in unconstrained discrete-time differential dynamic programming". IEEE Transactions on Automatic Control. 36 (6): 692. doi:10.1109/9.86943.
- ^ Tassa, Y. (2011). Theory and implementation of bio-mimetic motor controllers (PDF) (Thesis). Hebrew University.