Minimization of quadratics

Suppose we have a minimization problem of the form $$\def\vect{\boldsymbol}\large \min\quad \frac12\vect x^TQ\vect x - \vect c^T\vect x $$

where $\vect x \in \mathbb R^n$ are the variables, the given matrix $Q \in \mathbb R^{n \times n}$ is symmetric and positive definite, and $\vect c \in \mathbb R^n$.

Exercise:

Show that $$\large \vect x^T Q \vect x = \vect x^T\frac12( Q^T + Q )\vect x. $$
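
A quick numerical illustration of this identity (added here as a sanity check, using a random, deliberately non-symmetric matrix):

import numpy as np

# Added check: x^T Q x depends only on the symmetric part ( Q^T + Q ) / 2 of Q.
Q = np.random.normal( size = ( 4, 4 ) )    # not symmetric in general
x = np.random.normal( size = ( 4, ) )

print( x @ Q @ x, x @ ( 0.5 * ( Q.T + Q ) ) @ x )    # the two values agree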

The gradient of the quadratic

First, note that if $f( \vect x ) = \frac 12 \vect x^TQ\vect x - \vect c^T\vect x$, then $$\large \nabla f( \vect x ) = Q\vect x - \vect c. $$ That is, the solution of our problem satisfies $$\large Q\vect x = \vect c. $$
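
As a small added check (assuming, for the example only, a random symmetric positive definite $Q$ built as $A^TA + I$), the solution of $Q\vect x = \vect c$ indeed zeroes the gradient:

import numpy as np

# Added check: the minimizer of f( x ) = x^T Q x / 2 - c^T x solves Q x = c,
# i.e. the gradient Q x - c vanishes there.
A = np.random.normal( size = ( 5, 5 ) )
Q = A.T @ A + np.eye( 5 )            # symmetric positive definite by construction
c = np.random.normal( size = ( 5, ) )

x_star = np.linalg.solve( Q, c )
print( np.linalg.norm( Q @ x_star - c ) )    # ~ 1e-15: the gradient vanishes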

Exercise:

Prove the claim above.

$Q$-conjugate directions

We say that two vectors $\vect x \in \mathbb R^n$ and $\vect y \in \mathbb R^n$ are $Q$-conjugate, with $Q \in \mathbb R^{n \times n}$ symmetric positive definite, when $$\large \vect x^TQ\vect y = 0. $$

Note that the notion of conjugacy above corresponds to orthogonality with respect to the inner product $$\large \langle \vect x, \vect y \rangle_Q := \vect x^TQ\vect y. $$
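
For instance (an added illustration, with $Q$ built as $A^TA + I$ for the example), eigenvectors of a symmetric positive definite $Q$ are orthogonal in the usual sense and therefore also $Q$-conjugate:

import numpy as np

# Added illustration: distinct eigenvectors u, v of a symmetric Q satisfy
# u^T Q v = lambda_v * u^T v = 0, so they are Q-conjugate.
A = np.random.normal( size = ( 4, 4 ) )
Q = A.T @ A + np.eye( 4 )

w, V = np.linalg.eigh( Q )           # columns of V: orthonormal eigenvectors
u, v = V[ :, 0 ], V[ :, 1 ]
print( u @ Q @ v )                   # ~ 0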

The Gram-Schmidt process

The Gram-Schmidt process is an algorithm that produces an orthonormal set of vectors $\{ \vect e_1, \vect e_2, \dots, \vect e_n \} \subset \mathbb R^n$ from a linearly independent set of vectors $\{ \vect v_1, \vect v_2, \dots, \vect v_n \} \subset \mathbb R^n$, with the property that for every $k \in \{ 1, 2, \dots, n \}$ the set $\{ \vect e_1, \vect e_2, \dots, \vect e_k \}$ spans the same vector space as $\{ \vect v_1, \vect v_2, \dots, \vect v_k \}$.

  1. For $i \in \{ 1, 2, \dots, n \}$, do:

    1.1. $$\large \tilde{\vect e}_i = \vect v_i - \sum_{k = 1}^{i - 1}\langle \vect v_i, \vect e_k \rangle\vect e_k $$

    1.2. $$\large \vect e_i = \frac{\tilde{\vect e}_i}{\|\tilde{\vect e}_i\|} $$

In the algorithm above we may use any inner product, as long as the norm is given by $$\large \| \vect x \| = \sqrt{\langle \vect x, \vect x \rangle}. $$
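
A minimal NumPy sketch of the process above, parametrized by an arbitrary inner product (the helper name gram_schmidt and the parameter inner are illustrative choices, not part of the notes):

import numpy as np

# Sketch of Gram-Schmidt with a pluggable inner product.
def gram_schmidt( V, inner = np.dot ):

    E = []
    for v in V:

        e = v - sum( inner( v, ek ) * ek for ek in E )    # step 1.1
        E.append( e / np.sqrt( inner( e, e ) ) )          # step 1.2

    return E

# Usage: orthonormalize a random (almost surely linearly independent) set.
V = list( np.random.normal( size = ( 3, 3 ) ) )
E = gram_schmidt( V )
print( np.round( [ [ x @ y for y in E ] for x in E ], 12 ) )    # identity matrix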

Observe that $$\large \langle \tilde{\vect e}_i, \vect e_j \rangle = \biggl\langle \vect v_i - \sum_{k = 1}^{i - 1}\langle \vect v_i, \vect e_k \rangle\vect e_k, \vect e_j \biggr\rangle $$

First, we restrict ourselves to the case $j < i$ and, by induction, we also assume that the vectors $\{ \vect e_1, \vect e_2, \dots, \vect e_{i - 1} \}$ are orthonormal to one another.

Thus:

$$\large \begin{split} \langle \tilde{\vect e}_i, \vect e_j \rangle & {}= \langle \vect v_i, \vect e_j \rangle - \biggl\langle \sum_{k = 1}^{i - 1}\langle \vect v_i, \vect e_k \rangle\vect e_k, \vect e_j \biggr\rangle\\ & {}= \langle \vect v_i, \vect e_j \rangle - \sum_{k = 1}^{i - 1}\bigl\langle \langle \vect v_i, \vect e_k \rangle\vect e_k, \vect e_j \bigr\rangle\\ & {}= \langle \vect v_i, \vect e_j \rangle - \sum_{k = 1}^{i - 1}\langle \vect v_i, \vect e_k \rangle\langle \vect e_k, \vect e_j \rangle\\ & {}= \langle \vect v_i, \vect e_j \rangle - \langle \vect v_i, \vect e_j \rangle\langle \vect e_j, \vect e_j \rangle = 0. \end{split} $$

The process without normalization

  1. For $i \in \{ 1, 2, \dots, n \}$, do:

    1.1. $$\large \vect e_i = \vect v_i - \sum_{k = 1}^{i - 1}\frac{\langle \vect v_i, \vect e_k \rangle}{\langle \vect e_k, \vect e_k \rangle}\vect e_k. $$

$$\large \begin{split} \langle \vect e_i, \vect e_j \rangle & {}= \langle \vect v_i, \vect e_j \rangle - \biggl\langle \sum_{k = 1}^{i - 1}\frac{\langle \vect v_i, \vect e_k \rangle}{\langle \vect e_k, \vect e_k \rangle}\vect e_k, \vect e_j \biggr\rangle\\ & {}= \langle \vect v_i, \vect e_j \rangle - \sum_{k = 1}^{i - 1}\frac{\langle \vect v_i, \vect e_k \rangle}{\langle \vect e_k, \vect e_k \rangle}\langle \vect e_k, \vect e_j \rangle\\ & {}= \langle \vect v_i, \vect e_j \rangle - \frac{\langle \vect v_i, \vect e_j \rangle}{\langle \vect e_j, \vect e_j \rangle}\langle \vect e_j, \vect e_j \rangle = 0. \end{split} $$

For example, in the case where $\langle \vect x, \vect y \rangle = \vect x^T Q \vect y$, the non-normalized algorithm becomes:

  1. For $i \in \{ 1, 2, \dots, n \}$, do:

    1.1. $$\large \vect e_i = \vect v_i - \sum_{k = 1}^{i - 1}\frac{\vect v_i^TQ \vect e_k}{\vect e_k^TQ \vect e_k}\vect e_k. $$
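
A short NumPy sketch of this non-normalized process with the $Q$-inner product (the name q_conjugate_directions is an illustrative choice; $Q$ is again assumed symmetric positive definite):

import numpy as np

# Sketch: Q-conjugation by Gram-Schmidt without normalization.
def q_conjugate_directions( V, Q ):

    E = []
    for v in V:

        e = v - sum( ( v @ Q @ ek ) / ( ek @ Q @ ek ) * ek for ek in E )    # step 1.1
        E.append( e )

    return E

# Usage: conjugate the canonical basis; the Gram matrix below comes out diagonal.
A = np.random.normal( size = ( 4, 4 ) )
Q = A.T @ A + np.eye( 4 )
D = q_conjugate_directions( list( np.eye( 4 ) ), Q )
print( np.round( [ [ di @ Q @ dj for dj in D ] for di in D ], 10 ) )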

The conjugate direction method

Let $\{ \vect d_0, \vect d_1, \dots, \vect d_{n - 1} \} \subset \mathbb R^n$ be a set of $Q$-conjugate directions. The $Q$-conjugate direction method for solving $$\large \min\quad \frac12\vect x^TQ\vect x - \vect c^T\vect x =: f( \vect x ) $$ is given by

  1. For $k \in \{ 0, 1, \dots, n - 1 \}$:

    1.1. $$\large \vect x^{( k + 1 )} = \vect x^{( k )} + \lambda_k \vect d_k, $$ where $$\large\DeclareMathOperator*{\argmin}{arg\;min} \lambda_k = \argmin_{\lambda \in \mathbb R}\quad f\bigl( \vect x^{( k )} + \lambda \vect d_k \bigr). $$

Let $g( \lambda ) = f\bigl( \vect x^{( k )} + \lambda \vect d_k \bigr)$. Then we know that $g'( \lambda_k ) = 0$. But $$\large \begin{split} g'( \lambda ) &{}= \nabla f\bigl( \vect x^{( k )} + \lambda \vect d_k \bigr)^T\vect d_k\\ &{}= \bigl( \vect x^{( k )} + \lambda \vect d_k \bigr)^TQ\vect d_k - \vect c^T\vect d_k\\ &{}= \lambda \vect d_k^TQ\vect d_k + \bigl( \vect x^{( k )} \bigr)^TQ\vect d_k - \vect c^T\vect d_k\\ &{}= \lambda \vect d_k^TQ\vect d_k + \nabla f\bigl( \vect x^{( k )} \bigr)^T\vect d_k \end{split} $$

That is, for $g'( \lambda_k ) = 0$ to hold, we must have $$\large \lambda_k = \frac{-\nabla f\bigl( \vect x^{( k )} \bigr)^T\vect d_k}{\vect d_k^TQ\vect d_k}. $$

The successive iterates of the conjugate direction method minimize $f$ over a linear manifold that grows with each iteration. That is, $$\large \vect x^{( k )} = \argmin_{\vect x \in M^{( k )}} \quad f( \vect x ), $$ where $$\large M^{( k )} := \bigl\{ \vect x \mid \vect x = \vect x^{( 0 )} + \vect v,\ \vect v \in \operatorname{span} \{ \vect d_0, \vect d_1, \dots, \vect d_{k - 1} \} \bigr\}. $$
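
A minimal sketch of the conjugate direction method (the helper name conjugate_directions_method is ours; for illustration the $Q$-conjugate directions are taken as the eigenvectors of $Q$, which are $Q$-conjugate since they are orthogonal):

import numpy as np

# Sketch of the conjugate direction method with exact line search.
def conjugate_directions_method( Q, c, directions, x0 ):

    x = x0.copy()
    for d in directions:

        g = Q @ x - c                        # gradient at the current iterate
        lam = -( g @ d ) / ( d @ Q @ d )     # exact minimizer along d
        x = x + lam * d

    return x

# Usage: after n = 5 steps the method reaches the solution of Q x = c.
A = np.random.normal( size = ( 5, 5 ) )
Q = A.T @ A + np.eye( 5 )
c = np.random.normal( size = ( 5, ) )
D = list( np.linalg.eigh( Q )[ 1 ].T )       # a Q-conjugate set of directions
x = conjugate_directions_method( Q, c, D, np.zeros( 5 ) )
print( np.linalg.norm( Q @ x - c ) )         # ~ 0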

The conjugate gradient method

The conjugate gradient method is the method obtained by applying the conjugate direction method to the set obtained after applying the Gram-Schmidt process (with the $Q$-inner product, without normalization) to the set $$\large \{ -\nabla f( \vect x^{( 0 )} ), -\nabla f( \vect x^{( 1 )} ), \dots, -\nabla f( \vect x^{( n - 1 )} ) \} =: \{ -\vect g^{( 0 )}, -\vect g^{( 1 )}, \dots, -\vect g^{( n - 1 )} \}. $$

  1. For $k \in \{ 0, 1, \dots, n - 1 \}$:

    1.1. $$\large \vect d_k = -\vect g^{( k )} + \sum_{i = 0}^{k - 1}\frac{(\vect g^{( k )})^TQ \vect d_i}{\vect d_i^TQ \vect d_i}\vect d_i = -\vect g^{( k )} + \beta_k \vect d_{k - 1}, $$ where $$\large \beta_k = \frac{(\vect g^{( k )})^T\vect g^{( k )}}{(\vect g^{( k - 1 )})^T\vect g^{( k - 1 )}} $$

    1.2. $$\large \lambda_k = \frac{-(\vect g^{( k )})^T\vect d_k}{\vect d_k^TQ\vect d_k} $$

    1.3. $$\large \vect x^{( k + 1 )} = \vect x^{( k )} + \lambda_k\vect d_k. $$

In [1]:
import numpy as np
In [2]:
def CG_trouxa( Q, c, x = None, niter = None ):
    # "Naive" conjugate gradients: each direction is built by a full
    # Gram-Schmidt Q-conjugation against all previous directions.
    
    if x is None:
        
        x = np.random.normal( size = ( Q.shape[ 0 ], ) )
        
    if niter is None:
        
        niter = Q.shape[ 0 ]
        
    d_list = []     # previous directions d_i
    Qd_list = []    # cached products Q d_i
    
    for k in range( niter ):
        
        g = Q @ x - c    # gradient at the current iterate
        
        # Q-conjugate the steepest-descent direction against all previous ones.
        d = -g
        for i in range( k ):
            
            Qdi = Qd_list[ i ]
            d += ( g @ Qdi ) / ( d_list[ i ] @ Qdi ) * d_list[ i ]
            
        d_list.append( d )
        Qd = Q @ d
        Qd_list.append( Qd )
        
        # Exact line search along d.
        step = ( -g @ d ) / ( d @ Qd )
        
        x = x + step * d
        
    return x
In [3]:
n = 50
A = np.random.normal( size = ( n, n ) )
c = np.random.normal( size = ( n, ) )
Q = A.T @ A
In [4]:
x_trouxa = CG_trouxa( Q, c )
print( np.linalg.norm( Q @ x_trouxa - c ) )
3.7067247383716804e-12
In [5]:
def CG( Q, c, x = None, niter = None ):
    # CG with the usual two-term recurrence: only the previous direction and
    # gradient are kept.  The print below monitors d_0^T Q d, i.e. how well
    # Q-conjugacy with the very first direction survives in floating point.
    
    if x is None:
        
        x = np.random.normal( size = ( Q.shape[ 0 ], ) )
        
    if niter is None:
        
        niter = Q.shape[ 0 ]
    
    
    d_prev = np.zeros( x.shape )
    g_prev = np.ones( x.shape )

    d_0 = None
    
    for k in range( niter ):
        
        g = Q @ x - c
        beta = ( g @ g ) / ( g_prev @ g_prev )    # harmless at k = 0, since d_prev = 0
        d = -g + beta * d_prev
        if d_0 is None:
            d_0 = d
        
        print( d_0 @ ( Q @ d ) )    # ~ 0 for k > 0 if conjugacy held exactly

        Qd = Q @ d
        step = ( -g @ d ) / ( d @ Qd )    # exact line search
        x = x + step * d
        
        d_prev = d
        g_prev = g
        
    return x

n = 20
A = np.random.normal( size = ( n, n ) )
c = np.random.normal( size = ( n, ) )
Q = A.T @ A
x = CG( Q, c )
print( np.linalg.norm( Q @ x - c ) )
567162.2504917798
-9.322320693172514e-11
-3.637978807091713e-11
6.194511570356553e-11
-6.0992988437647e-11
8.640199666842818e-12
3.410605131648481e-13
-1.0411227435724868e-11
1.432454155292362e-11
-1.7033130461641122e-10
1.3433751888669576e-09
-6.309859301723009e-09
9.749402352099423e-08
-8.989070456877357e-07
1.335938510038659e-05
-4.683933764360049e-05
0.0015908511347175747
-0.06074612915813926
2.195106281879012
-188.9272471477689
0.9104141898839527
In [6]:
x = CG( Q, c )
478443.29309289495
7.275957614183426e-12
-1.2732925824820995e-11
9.549694368615746e-12
2.8194335754960775e-11
-5.2864379540551454e-11
7.344169716816396e-11
-1.3665157894138247e-10
1.7763568394002505e-10
-7.88901388659724e-10
3.320906216686126e-09
-2.1183382159506436e-08
3.1304568892664975e-07
-1.6221073337874259e-06
2.4297673917317297e-05
-0.0002435418600725825
0.004589883819598128
-0.16510360271419122
17.325702083834678
-5864.427478552905
In [7]:
print( np.linalg.norm( Q @ x - c ) )
2.4765440108937042
In [8]:
print( CG_trouxa( Q, c ) - CG( Q, c ) )
995609.1631659911
-9.822542779147625e-11
3.639399892563233e-11
-6.320988177321851e-11
7.628386811120436e-11
-1.0163603292312473e-10
2.043378799498896e-10
-1.9995433087061087e-10
3.1297986424760893e-10
-1.957744188985089e-09
5.947063286271259e-09
-4.339666759278771e-08
4.341323318612922e-07
-3.1603045727024437e-06
2.3951646042519315e-05
-0.00012484882419450116
0.01061044485847118
-0.3506525057813302
14.234962855525595
-725.3345275087838
[ 528.1740378  -405.61961245   84.56890246 -403.02331744  210.87056981
  123.34824287  703.06584282  -82.37026562  505.05069992  637.3229651
   48.33013785 1168.86152486 -406.62098613  448.81807809 -352.9767942
 -429.029827   -108.25322688 -186.39202275   16.64132094 -315.64859679]
In [9]:
import ray_tracing_cuda_float
import matplotlib.pyplot as pp
In [10]:
radon, radon_t = ray_tracing_cuda_float.make_radon_transp( ( 2048, 2048 ) )
In [11]:
img = np.zeros( ( 2048, 2048 ) )
img[ 624 : 1518, 912 : 1950 ] += 1.0
img[ 480 : 900, 506 : 1600 ] += np.pi

pp.imshow( img )
pp.show()

Consider the problem $$\large \min_{\vect x\in \mathbb R^n} \quad \| A\vect x - \vect b \|_2^2 = \vect x^TA^TA\vect x - 2\vect b^T(A\vect x) + \vect b^T\vect b $$ which is equivalent to $$\large \min_{\vect x\in \mathbb R^n} \quad \vect x^TA^TA\vect x - 2\vect b^T(A\vect x) = \frac 12\vect x^TQ\vect x - \vect c^T\vect x $$ with $$\large Q := 2A^TA\quad\text{and}\quad \vect c = 2A^T\vect b. $$
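
A small dense-matrix check of this equivalence (added here; it reuses CG_trouxa from above and compares against np.linalg.lstsq; the names A_small, b_small are illustrative and chosen so as not to clash with the notebook's variables):

import numpy as np

# Added check: CG on Q = 2 A^T A, c = 2 A^T b recovers the least-squares solution.
A_small = np.random.normal( size = ( 30, 10 ) )
b_small = np.random.normal( size = ( 30, ) )

Q_small = 2.0 * A_small.T @ A_small
c_small = 2.0 * A_small.T @ b_small

x_cg = CG_trouxa( Q_small, c_small, x = np.zeros( 10 ) )
x_ls = np.linalg.lstsq( A_small, b_small, rcond = None )[ 0 ]
print( np.linalg.norm( x_cg - x_ls ) )    # small, up to rounding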

In our case, the vector $\vect x \in \mathbb R^{4194304}$ can be viewed as an image of $2048 \times 2048$ pixels, and the product $A\vect x$ corresponds to the "tomographic" transform of that image. This matrix-vector product is implemented by the function radon:

In [12]:
data = radon( img )
pp.imshow( data )
pp.show()
In [13]:
pp.imshow( radon_t( data ) )
pp.show()
In [14]:
def Q_vec( x ):
    
    return( radon_t( radon( x ) ) )
In [15]:
def CG_matvec( matvec, c, x, niter ):
    # Matrix-free CG: Q is only accessed through the function matvec( x ) = Q x,
    # so it also works for operators such as radon_t( radon( x ) ).
    
    d_prev = np.zeros( x.shape )
    g_prev = np.ones( x.shape )
    
    for k in range( niter ):
        
        g = matvec( x ) - c

        beta = np.sum( g ** 2.0 ) / np.sum( g_prev ** 2.0 )
        d = -g + beta * d_prev    # on the first pass d_prev = 0, so d = -g
        
        Qd = matvec( d )
        step = -np.sum( g * d ) / np.sum( d * Qd )   # exact line search
        x = x + step * d
        
        d_prev = d
        g_prev = g
        
    return x
In [16]:
c = radon_t( data )
In [17]:
x = np.zeros( img.shape )
x_CG = CG_matvec( Q_vec, c, x, 20 )
pp.figure( figsize = ( 10, 10 ) )
pp.imshow( x_CG )    
pp.show()
In [18]:
pp.imsave( 'x_CG.png', x_CG )