The GMRES (Generalized Minimal RESidual) algorithm is presented and discussed in detail in [2, 3]. Since the implementation is written in the C programming language in a modular, function-oriented manner, the main file is very short and reads almost like pseudo-code itself! We therefore reproduce the main file here with some small details (local variables, allocations, comments, etc.) omitted for brevity. In the following pages we describe the algorithm based on this implementation.
In line 5 each process reads its own share of the input matrix, and in line 7 it reads the corresponding section of the right-hand-side array. The localization of indices (remapping global indices to local ones) is done in line 11. To increase the performance of the code, we extract the diagonal entries of the matrix only once, in lines 19-25, and store them in the array diag. Thus, wherever the preconditioning needs the diagonal entries, instead of looping over the matrix A we simply use the values stored in this array.
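As an illustration, below is a minimal sketch of what lines 19-25 might look like, assuming the local rows are stored in a CSR-like layout; the names row_ptr, col_idx, val and glob_row_start are hypothetical and only stand in for the code's own data structures.

/* Sketch only: cache the diagonal entries once, assuming a CSR layout.
 * glob_row_start is the (assumed) global index of the first local row. */
void extract_diagonal(int nrowsI, long glob_row_start,
                      const int *row_ptr, const int *col_idx,
                      const double *val, double *diag)
{
    for (int i = 0; i < nrowsI; i++) {
        diag[i] = 0.0;                                 /* stays zero if no stored diagonal */
        for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; jj++) {
            if (col_idx[jj] == glob_row_start + i) {   /* global column == global row */
                diag[i] = val[jj];
                break;
            }
        }
    }
}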
In line 29 the solution vector is initialized to 1. We will see in the following sections that initializing with zero gives a different convergence behavior. In lines 31-51 the initial vector of the Krylov subspace is created by normalizing the residual of the initial guess; the norm of this residual is also stored in g[0] (the right-hand-side vector of the small least-squares problem).
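The following sketch shows the idea behind lines 31-51 under the assumption of a row-wise distribution: the local part of the initial residual is formed, its global norm is obtained with a collective reduction, g[0] is set to that norm, and the normalized residual becomes the first Krylov vector. The function and argument names here are illustrative, not the actual ones.

#include <math.h>
#include <mpi.h>

/* Sketch: r0 = b - A*x0, g[0] = ||r0||, v[0] = r0 / ||r0||.
 * Ax0 is assumed to hold the local part of A*x0 (already computed). */
void init_krylov(int nrowsI, const double *b, const double *Ax0,
                 double *v0, double *g, MPI_Comm comm)
{
    double local = 0.0, beta;
    for (int i = 0; i < nrowsI; i++) {
        v0[i] = b[i] - Ax0[i];              /* local part of the initial residual */
        local += v0[i] * v0[i];
    }
    MPI_Allreduce(&local, &beta, 1, MPI_DOUBLE, MPI_SUM, comm);
    beta = sqrt(beta);                      /* global 2-norm of the residual */
    g[0] = beta;                            /* RHS of the small least-squares problem */
    for (int i = 0; i < nrowsI; i++)
        v0[i] /= beta;                      /* first Krylov basis vector */
}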
The main GMRES loop starts in line 53; the columns of the Krylov matrix are created one by one during this loop. Please note that the final implementation of GMRES, presented in Appendix I, includes an outer loop for restarting. Lines 55-61 contain a section for dynamic allocation using the realloc function, which is not discussed here. In lines 67-87, depending on the user flag for preconditioning, the appropriate type of matrix-vector multiplication is selected and applied. To make the resulting vector orthogonal to the previous vectors, a modified Gram-Schmidt procedure is implemented in lines 87-93, where the components of the new vector along the previous vectors are eliminated one by one, leaving a final component that is normal to all previous vectors. The new search vector is then obtained by normalizing this result in lines 95-99. Next, the previous Givens rotations are applied to the new column of the Hessenberg matrix in lines 99-105. Once this is done, we compute a new Givens rotation and apply it to that column; this is done in lines 107-111. In lines 111-115 the rotation is also applied to the RHS vector g, which gives us the residual of the least-squares problem in g[k+1]. This is interesting because, without solving Ax=b or performing a matrix-vector multiplication, we directly know the residual from the least-squares problem that we solve using the Givens rotations.
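A sketch of this new-rotation step (lines 107-115) is given below. It follows the same sign convention as the code excerpt shown later in this report, but the function itself is only illustrative.

#include <math.h>

/* Sketch: the new rotation (c[k], s[k]) eliminates h[k+1][k] and is also
 * applied to g, so that |g[k+1]| is the current least-squares residual. */
void new_givens(double **h, double *c, double *s, double *g, int k)
{
    double denom = sqrt(h[k][k] * h[k][k] + h[k + 1][k] * h[k + 1][k]);
    c[k] = h[k][k]     / denom;
    s[k] = h[k + 1][k] / denom;

    h[k][k]     = c[k] * h[k][k] + s[k] * h[k + 1][k];   /* new diagonal entry */
    h[k + 1][k] = 0.0;                                   /* eliminated by the rotation */

    g[k + 1] = -s[k] * g[k];   /* residual of the least-squares problem */
    g[k]     =  c[k] * g[k];
}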
In line 119 we check whether the residual is below a prescribed tolerance; if so, we terminate the GMRES loop, compute the final optimal update, and report the solution. In this case the optimal coefficients are obtained by backward solution of the upper-triangular system in lines 123-129. Then the update vector zk is computed and, in the case of preconditioning, it is divided by the diagonal entries. Finally, the solution is computed in line 143 and reported to the user in line 147.
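The following sketch illustrates what lines 123-143 amount to: a backward solve of the small upper-triangular system followed by forming the update from the stored Krylov vectors, with the division by diag[] in the preconditioned case. The variable names are assumptions based on the description above, not the exact code.

/* Sketch: solve H y = g by back-substitution, then x += V y (divided by
 * the diagonal entries when preconditioning is on). */
void gmres_update(double **h, double *g, double **v, double *x,
                  const double *diag, int k, int nrowsI, int precond)
{
    double y[k + 1];                 /* small system, a C99 VLA is fine for a sketch */

    for (int i = k; i >= 0; i--) {   /* backward solve of the upper-triangular system */
        y[i] = g[i];
        for (int j = i + 1; j <= k; j++)
            y[i] -= h[i][j] * y[j];
        y[i] /= h[i][i];
    }

    for (int ss = 0; ss < nrowsI; ss++) {
        double zk = 0.0;
        for (int j = 0; j <= k; j++)
            zk += v[j][ss] * y[j];   /* zk = V y, the optimal update */
        if (precond)
            zk /= diag[ss];          /* undo the diagonal (Jacobi) scaling */
        x[ss] += zk;
    }
}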
In this section, we verify the implementation.
The first argument of the code is the location and name of the input matrix, the second argument is the RHS vector, the third argument is the number of GMRES iterations, and the last argument is 0 for plain GMRES and 1 for the preconditioned GMRES algorithm. Running the parallel GMRES code for k=6 and p=1, we obtain
$mpirun -np 1 ./gmres ../matrices/dense6x6.mtx (void) 6 0
$ cat out.0
contents of x :
0.18861997210055,
-0.15060264446066,
1.02527163328527,
-0.43822000212772,
0.98705425807485,
-1.21946494911443,
Note that since we don't read the RHS file here (in the source code I simply comment that line out!), I put an arbitrary string (void) in its place to preserve the number and arrangement of the arguments. For k=6 and p=2, the parallel code gives us
$mpirun -np 2 ./gmres ../matrices/dense6x6.mtx (void) 6 0
$ cat out.0 out.1
contents of x : ---> process 0
0.18861997210055,
-0.15060264446066,
1.02527163328527,
contents of x : ---> process 1
-0.43822000212772,
0.98705425807485,
-1.21946494911443,
For k=6 and p=3, the parallel code gives us
$mpirun -np 3 ./gmres ../matrices/dense6x6.mtx (void) 6 0
$ cat out.0 out.1 out.2
contents of x : ---> process 0
0.18861997210055,
-0.15060264446066,
contents of x : ---> process 1
1.02527163328527,
-0.43822000212772,
contents of x : ---> process 2
0.98705425807485,
-1.21946494911443,
For the extreme case of six processes p=6, the parallel code gives us
$mpirun -np 6 ./gmres ../matrices/dense6x6.mtx (void) 6 0
$ cat out.0 out.1 out.2 out.3 out.4 out.5
contents of x : ---> process 0
0.18861997210055,
contents of x : ---> process 1
-0.15060264446066,
contents of x : ---> process 2
1.02527163328527,
contents of x : ---> process 3
-0.43822000212772,
contents of x : ---> process 4
0.98705425807485,
contents of x : ---> process 5
-1.21946494911443,
Using the GMRES program written in GNU Octave (Appendix II), we will obtain
ans =
0.188619972100554
-0.150602644460663
1.025271633285273
-0.438220002127719
0.987054258074850
-1.219464949114430
which is in exact agreement with the parallel C-code. The residual history (including the initial residual before the main GMRES for-loop) is plotted in fig.(0.2.1). As we see, it converges to machine zero after 6 iterations, as expected.
In this section we consider the residual history of the matrix fidapm05.mtx, which is a 42x42 sparse matrix. The structure of the matrix is shown in fig.(2).
First we set k=42, the case in which we have all search vectors; therefore we expect the residual to reach machine zero. This is shown in fig.(3).
Figure 3: Residuals for Matrix B for various processes.
To validate the parallel implementation, here we increase the number of processes; as shown, the residual curves are identical. Another examination is to test the effect of preconditioning. For this purpose we first read the RHS vector for the 42x42 matrix, given in the input file fidamp05_rhs1.mtx. The residual curves for GMRES with and without preconditioning are shown in fig.(4).
Figure 4: The effect of preconditioning on the convergence of 42x42 matrix.
As we see, there isn't much difference between the residual curves (compared to the matrix D discussed in the following section). The diagonal preconditioning does improve the convergence in the intermediate stages, but overall there is no big difference. This is mainly because the matrix B42x42 is highly off-diagonal, so when we approximate it with only its diagonal entries we introduce a considerable error into the preconditioner. Therefore we expect that diagonal preconditioning will not work as well here as it does for close-to-diagonal matrices. I expect that if we used the tridiagonal part of matrix B42x42 instead of only the main diagonal, the convergence would improve.
This is a huge 90448x90448 matrix with all non-zero elements close to the main diagonal, so we expect diagonal preconditioning to work well for this matrix because the main diagonal is a good approximation of the matrix D itself. Here we fixed the number of GMRES iterations to 40 and, without restarting, ran the code with and without the preconditioning option for one, two and sixteen processes. The results are presented in fig.(5).
As expected, the diagonal preconditioning greatly improves the convergence of the original GMRES algorithm. As mentioned before, the reason is that the main diagonal is a good approximation of matrix D, since its non-zero entries are concentrated near the main diagonal. We also notice that varying the number of processes has absolutely no effect on the convergence curves; again the curves are identical (to the eye).
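For reference, the sketch below shows one common way diagonal (Jacobi) preconditioning enters the matrix-vector product step: the input vector is scaled by the inverse of the stored diagonal before the usual parallel product. This is consistent with the division of zk by the diagonal entries described earlier, but the code's exact formulation may differ; matvec() is a placeholder for the code's own routine.

/* Sketch: right diagonal preconditioning, u = A * inv(D) * v.
 * matvec() stands in for the actual parallel matrix-vector product. */
void precond_matvec(void (*matvec)(const double *, double *),
                    const double *diag, const double *vin,
                    double *tmp, double *uout, int nrowsI)
{
    for (int ss = 0; ss < nrowsI; ss++)
        tmp[ss] = vin[ss] / diag[ss];   /* apply inv(D) to the input vector */
    matvec(tmp, uout);                  /* then the usual matvec with A */
}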
To investigate the effect of restarting on the performance of PGMRES, we run the code for several test cases denoted PGMRES(k,m), where k is the number of iterations and m is the number of restarts. For the smallest case, k=10, we found that m=8 restarts are needed for the code to converge to 1.e-10; for larger k values the number of required restarts decreases greatly. Since the number of restarts m should be the same for all cases, we keep m=8 for all of them. The residual curves for PGMRES(10,8), PGMRES(20,8), PGMRES(40,8) and PGMRES(80,8) are shown in figs.(6), (7), (8) and (9) respectively. The wall time for different numbers of processes/iterations is measured using the MPI_Wtime(void) function; a small usage sketch follows this paragraph. The results are presented in table (1). As we see, there is no difference in the residual curves for different numbers of processes. For the rest of the discussion please see the last page.
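The sketch below shows one plausible way the timing could be wired up with MPI_Wtime(); solve() is a placeholder for the actual (P)GMRES call, and the exact placement of the timers in the code may differ.

#include <mpi.h>
#include <stdio.h>

/* Sketch: measure the wall time of one solve and print it from rank 0. */
void timed_solve(void (*solve)(void), MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    double t0 = MPI_Wtime();   /* wall-clock time before the solve */
    solve();
    double t1 = MPI_Wtime();   /* wall-clock time after the solve */
    if (rank == 0)
        printf("wall time = %f s\n", t1 - t0);
}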
Table 1: Timing for PGMRES and Matrix D. Values are in seconds.
p | k=10     | k=20     | k=40     | k=80
1 | 2.693337 | 3.155758 | 4.296086 | 6.696444
2 | 1.668714 | 1.719922 | 2.803091 | 3.771479
4 | 0.874024 | 1.041473 | 1.482399 | 1.839851
According to table (1), we observe the key point that the computation time decreases for a small number of iterations k. This is mainly because in the GMRES algorithm there are two interior loops whose cost depends on the value of k.
//main k loop
...
//modified Gram-Schmidt: orthogonalize u against the previous Krylov vectors
for (j = 0; j <= k; j++)
{
    dotproduct(v[j], u, &h[j][k], nrowsI, commun); //collective dot product
    for (ss = 0; ss < nrowsI; ss++)
        u[ss] -= h[j][k] * v[j][ss];
}
... updates ...
//apply the previous Givens rotations to the new Hessenberg column
for (j = 0; j < k; j++)
{
    delta = h[j][k];
    h[j][k]   =  c[j] * delta + s[j] * h[j+1][k];
    h[j+1][k] = -s[j] * delta + c[j] * h[j+1][k];
}
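For completeness, a plausible sketch of the dotproduct() helper called in the first loop is shown below; the actual implementation may differ, but the essential point is the collective MPI_Allreduce over all processes.

#include <mpi.h>

/* Sketch: each process sums its local slice and a collective reduction
 * produces the global dot product on every process. */
void dotproduct(const double *a, const double *b, double *result,
                int nrowsI, MPI_Comm commun)
{
    double local = 0.0;
    for (int ss = 0; ss < nrowsI; ss++)
        local += a[ss] * b[ss];
    MPI_Allreduce(&local, result, 1, MPI_DOUBLE, MPI_SUM, commun);
}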
The first loop is of course very expensive, because a collective dot product must be performed in each cycle, so for a very large number of iterations k the number of dot products increases dramatically. Therefore, instead of increasing k, we prefer to iterate up to some point in the iteration space and then restart the process from the new solution.
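The restart logic mentioned above (the outer loop of Appendix I) can be sketched as follows; gmres_cycle() is a hypothetical placeholder for one k-step inner sweep that updates the solution in place and returns the latest residual.

/* Sketch of restarted GMRES: up to m restarts of k inner iterations each,
 * stopping early if the residual drops below tol. */
double gmres_cycle(double *x, int k);   /* hypothetical: one k-step GMRES sweep */

void pgmres_restarted(double *x, int k, int m, double tol)
{
    for (int r = 0; r < m; r++) {
        double res = gmres_cycle(x, k);  /* x is improved in place */
        if (res < tol)
            break;                       /* converged: no further restart needed */
    }
}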
The numerical experiments on the etowah machine confirm that the restarting approach decreases the computation time. For one process and k=80, the computation time is 6.7 seconds. If we decrease k to 10 we get 2.7 seconds, which is about 40 percent of the k=80 case. We also note that the parallelization leads to a speed-up which is limited by Amdahl's law. For k=10, we get a speed-up of about 1.6 when we increase the number of processes from 1 to 2, and about 3.1 when we go from 1 to 4 processes. However, for large k the speed-up improves: going from 1 to 2 processes in the k=80 case gives a speed-up of 1.8, while going from 1 to 4 processes in the same case gives 3.6, which is better than the 3.1 obtained for k=10.
Another important point is the choice of the initial (guessed) solution and the way it affects the residual curves. For the previous cases the initial solution was the vector of ones. Here we change this to a zero initial guess to see what happens to the residual curves when restarting. The residual curves for the restarted case are plotted below. As we see, the behavior, which was a monotonically descending straight curve before, changes to curves with bumps.
As shown in fig.(10), the more GMRES iterations are performed per cycle, the better the solution is estimated before each restart. Therefore another important parameter is the choice of the initial guess. In the previous section, where we used the vector of ones, the residual curves were very similar because after each restart the solution was already a very good estimate of the final solution, so there is some bump but the curves are almost flat. Here, however, after each restart the solution still needs to be improved, and hence with each new search vector the accuracy of the solution improves dramatically, so we see a visible bump.
[1]
[2]
[3] Y. Saad and M.H. Schultz, ``GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems'', SIAM J. Sci. Stat. Comput., 7:856-869, 1986.
All codes/reports are released under the GNU GPL: http://www.gnu.org/licenses/gpl.html
Download my source code from here! Please put your email in the comments and I'll send you a copy of the source code.
UPDATE FEB 1, 2016: The complete code listing is available at https://github.com/arrgasm/paraGMRES