$\def\vect{\boldsymbol}$ $$ f( \vect x + \vect h ) \approx f( \vect x ) + \nabla f( \vect x )^T \vect h + \frac12 \vect h^T\nabla^2f( \vect x )\vect h =: g( \vect h ) $$

$$ \nabla g( \vect h ) = \nabla f( \vect x ) + \nabla^2 f( \vect x )\vect h $$

$$ \nabla g( \vect h ) = \vect 0 \Leftrightarrow \nabla f( \vect x ) + \nabla^2 f( \vect x )\vect h = \vect 0 $$

$$ \nabla g( \vect h ) = \vect 0 \Leftrightarrow \nabla^2 f( \vect x )\vect h = -\nabla f( \vect x ) $$

Proposição:

$\def\f{f : \mathbb R^n \to \mathbb R}$ $\def\x{\vect x \in \mathbb R^n}$ Se $\f$ é diferenciável em $\x$ e $\vect d^T\nabla f( \vect x ) < 0$ então, para qualquer $\sigma \in ( 0, 1 )$ dado, existe $\Lambda_\sigma > 0$ tal que para todo $\lambda \in ( 0, \Lambda_\sigma )$ temos: $$ f( \vect x + \lambda \vect d ) < f( \vect x ) + \lambda\sigma\vect d^T\nabla f( \vect x ). $$

$$ f( \vect x + \vect h ) = f( \vect x ) + \nabla f( \vect x )^T \vect h + \|\vect h\|E( \vect x, \vect h ) $$ com $\vect h = \lambda\vect d$ resulta, para $\lambda > 0$, em $$ f( \vect x + \lambda\vect d ) = f( \vect x ) + \lambda \vect d^T \nabla f( \vect x ) + \lambda\|\vect d\|E( \vect x, \lambda\vect d ). $$

Como $$ \lim_{\lambda \to 0} E( \vect x, \lambda\vect d ) = 0, $$ sabemos que existe $\Lambda > 0$ tal que $\lambda \in ( 0, \Lambda )$ implica em $$ E( \vect x, \lambda\vect d ) < -( 1 - \sigma )\frac{\vect d^T \nabla f( \vect x )}{\| \vect d \|}. $$ Ou seja, para $\lambda \in ( 0, \Lambda )$ temos $$ \begin{split} f( \vect x + \lambda\vect d ) & {}= f( \vect x ) + \lambda \vect d^T \nabla f( \vect x ) + \lambda\|\vect d\|E( \vect x, \lambda\vect d )\\ & {}= f( \vect x ) + \lambda \| \vect d \|\left( \frac{\vect d^T \nabla f( \vect x )}{\| \vect d \|} + E( \vect x, \lambda\vect d ) \right)\\ &{} < f( \vect x ) + \lambda \| \vect d \|\left( \frac{\vect d^T \nabla f( \vect x )}{\| \vect d \|} - ( 1 - \sigma )\frac{\vect d^T \nabla f( \vect x )}{\| \vect d \|} \right)\\ &{} = f( \vect x ) + \lambda\sigma\vect d^T \nabla f( \vect x ). \end{split} $$

In [1]:
import numpy as np
In [2]:
def newton_armijo_step( f,
                        df,
                        ddf,
                        x,
                        l,
                        beta = 0.1,
                        sigma = 0.1,
                        eps = 1e-10,
                        return_niter = False,
                        maxiter = 100000,
                        kappa = 1e-2,
                        gamma = 1e-2
                       ):
    
    grad = df( x )
    f_x = f( x )
    grad_sq = np.sum( grad ** 2.0 )
    niter = 0
    nfeval = 1
    while ( grad_sq ** 0.5 > eps ) and ( niter < maxiter ):
        
        l_k = l
        theta = np.pi / 2 * ( 1.0 - 1.0 / ( niter + 1 ) )
        M = np.array( [ [ np.cos( theta ), -np.sin( theta ) ],
                        [ np.sin( theta ), np.cos( theta ) ]
                      ] )
        try:
            d_k = -np.linalg.solve( ddf( x ), df( x ) )
        except np.linalg.LinAlgError:
            d_k = -grad
        nrm_d_k = np.linalg.norm( d_k )
        if d_k @ grad >= -gamma * ( grad_sq ** 0.5 ) * nrm_d_k:
            d_k = -grad
        if nrm_d_k <= kappa * ( grad_sq ** 0.5 ):
            d_k = -grad
        
        desc = sigma * np.sum( d_k * grad )
        
        x_next = x + l_k * d_k
        f_x_next = f( x_next )
        nfeval = nfeval + 1
        while f_x_next >= f_x + l_k * desc:
            l_k = l_k * beta
            x_next = x + l_k * d_k
            f_x_next = f( x_next )
            nfeval = nfeval + 1
        
        niter = niter + 1

        x = x_next
        grad = df( x )
        grad_sq = np.sum( grad ** 2.0 )
        f_x = f_x_next
    
    if return_niter :
        return ( x, niter, nfeval )
    else:
        return x
In [3]:
def rosenbrock( x, a = 1, b = 100 ):
    
    return ( a - x[ 0 ] ) ** 2.0 + b * ( x[ 1 ] - x[ 0 ] ** 2.0 ) ** 2.0

def rosenbrock_grad( x, a = 1, b = 100 ):
    
    retval = np.empty( 2 )
    
    retval[ 0 ] = 2.0 * ( x[ 0 ] - a ) - 4.0 * x[ 0 ] * b * ( x[ 1 ] - x[ 0 ] ** 2.0 )
    retval[ 1 ] = 2.0 * b * ( x[ 1 ] - x[ 0 ] ** 2.0 )
    
    return retval

def rosenbrock_hess( x, a = 1, b = 100 ):
    
    retval = np.empty( ( 2, 2 ) )
    
    retval[ 0, 0 ] = 2.0 * x[ 0 ] - 4.0 * b * ( x[ 1 ] - x[ 0 ] ** 2.0 ) + 8.0 * b * x[ 0 ]
    retval[ 0, 1 ] = -4.0 * x[ 0 ] * b
    retval[ 1, 0 ] = -4.0 * b * x[ 0 ]
    retval[ 1, 1 ] = 2.0 * b
    
    return retval
In [4]:
x, niter, nfeval = newton_armijo_step( 
                                       rosenbrock,
                                       rosenbrock_grad,
                                       rosenbrock_hess,
                                       np.array( [ 0.0, 0.0 ] ),
                                       1.0,
                                       return_niter = True,
                                       sigma = 0.1,
                                       beta = 1e-1,
                                       kappa = 1e-3,
                                       gamma = 1e-1
                                      )

print( niter + nfeval )
print( niter )
print( nfeval / niter )
print( x )
406
202
1.00990099009901
[1. 1.]

$\def\vect{\boldsymbol}$ $$ |\langle \vect x, \vect y \rangle| \le \| \vect x \|\| \vect y \| $$ $$ \cos\hat{\vect x\vect y} := \frac{\langle \vect x, \vect y \rangle}{\| \vect x \|\| \vect y \|} $$

$\def\vect{\boldsymbol}$ $$ \frac{\langle \vect x, \vect y \rangle}{\| \vect x \|\| \vect y \|} < -\gamma, \text{ com } \gamma \in ( 0, 1 ] $$

$$ \langle \vect x, \vect y \rangle < -\gamma\| \vect x \|\| \vect y \|, \text{ com } \gamma \in ( 0, 1 ] $$

Assim, nossa condição para impedir que a direção de busca fique quase ortogonal ao gradiente fica: $$ \langle \vect d_k, \nabla f( \vect x_k ) \rangle < -\gamma\| \vect d_k \|\| \nabla f( \vect x_k ) \|, \text{ com } \gamma \in ( 0, 1 ] $$

$$ \| \vect d_k \| > \kappa \| \nabla f( \vect x_k ) \|, \text{ com }\kappa \in ( 0, 1 ]. $$

Teorema:

O Algoritmo de Descida com Busca de Armijo ou para com $\| \nabla f( \vect x_k ) \| = 0$ ou ele gera uma sequência infinita $\{ \vect x_k \} \subset \mathbb R^n$ tal que todo ponto de acumulação dessa sequência é um ponto estacionário de $f$.