Wednesday, May 2, 2012

Massively Parallel GMRES C-code for Giant Sparse System of Equations


Algorithm and Implementation

The GMRES (Generalized Minimal RESidual) algorithm is presented and discussed in detail in [2, 3]. Since the algorithm is implemented in the C programming language in a modular, function-oriented manner, the main file is very short and reads almost like pseudo-code itself. We therefore reproduce the main file here, with some small details (local variables, allocations, comments, etc.) omitted for brevity, and in the following pages we describe the algorithm based on the implementation.

In line 5 each process reads its own share of the input matrix, and in line 7 it reads the corresponding section of the right-hand-side array. The localization step (remapping the global indices of each process's strip of rows to local, compressed indices) is done in line 11. Then, to improve the performance of the code, we extract the diagonal entries of the matrix once, in lines 19-25, and store them in the array diag. Whenever the preconditioner needs the diagonal entries, instead of looping over the matrix A we simply use the values stored in this array.
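The exact sparse storage used by the code is not shown in this excerpt, but the idea of caching the diagonal can be sketched as follows; the arrays row[], col[], val[], the counter nnz_local and the strip offset row0 are assumed names for a local coordinate (COO) layout, not the actual identifiers of the code:

    /* Sketch only: extract the local diagonal entries once, so that the
       preconditioner never has to loop over the whole matrix again.     */
    diag = (double *) calloc(nrowsI, sizeof(double));
    for (e = 0; e < nnz_local; e++)
    {
         if (row[e] == col[e])                /* a main-diagonal entry */
              diag[row[e] - row0] = val[e];   /* store it locally      */
    }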
In line 29 the solution is initialized to 1. We will see in the following sections that initializing with zero gives a different convergence behavior. In lines 31-51 the first vector of the Krylov subspace is created by normalizing the residual of the initial guess, and this residual norm is stored in g[0] (the right-hand-side vector of the least-squares problem).
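The construction of the first Krylov vector in lines 31-51 amounts to something like the sketch below; matvec() is an assumed name for the code's distributed matrix-vector kernel, while dotproduct() follows the signature that appears in the code excerpt later in this post (sqrt needs math.h):

    /* Sketch only: build v[0] from the residual of the initial guess x. */
    matvec(A, x, Ax, nrowsI, commun);            /* Ax = A * x0              */
    for (ss = 0; ss < nrowsI; ss++)
         r[ss] = b[ss] - Ax[ss];                 /* local part of r0         */
    dotproduct(r, r, &beta, nrowsI, commun);     /* global sum of r0 . r0    */
    beta = sqrt(beta);                           /* beta = ||r0||            */
    g[0] = beta;                                 /* RHS of the least-squares */
    for (ss = 0; ss < nrowsI; ss++)
         v[0][ss] = r[ss] / beta;                /* first Krylov vector      */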
The main GMRES loop starts in line 53; the columns of the Krylov matrix are created one by one during this loop. Please note that the final implementation of GMRES, presented in appendix I, includes an outer loop for restarting. Lines 55-61 contain a section for dynamic allocation using the realloc function, which is not discussed here. In lines 67-87, depending on the user flag for preconditioning, the appropriate type of matrix-vector multiplication is selected and applied. To make the resulting vector orthogonal to the previous vectors, a modified Gram-Schmidt procedure is implemented in lines 87-93, where the components of the new vector along the previous vectors are eliminated one by one, leaving a final vector that is orthogonal to all previous ones. Based on the normalization of this vector, a new search vector is created in lines 95-99. The previous Givens rotations are then applied to the new column of the Hessenberg matrix in lines 99-105. Once this is done, we compute a new Givens rotation and apply it to the new column of the Hessenberg matrix in lines 107-111. In lines 111-115 the Givens rotation is also applied to the RHS vector g, where we obtain the residual of the least-squares problem and store it in g[k+1]. This is interesting because, without solving Ax=b or doing a matrix-vector multiplication, we directly know the residual from the least-squares problem that we solved using Givens rotations.
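For reference, the computation and application of the new Givens rotation (lines 107-115) amounts to something like the sketch below; it uses the same c[], s[], h[][] and g[] arrays that appear in the code excerpt in the Performance section, but it is not a verbatim copy of the code:

    /* Sketch only: compute the Givens rotation that zeroes h[k+1][k],
       then apply it to the Hessenberg column and to the RHS vector g. */
    rho = sqrt(h[k][k] * h[k][k] + h[k+1][k] * h[k+1][k]);
    c[k] = h[k][k]   / rho;
    s[k] = h[k+1][k] / rho;
    h[k][k]   = c[k] * h[k][k] + s[k] * h[k+1][k];   /* = rho                  */
    h[k+1][k] = 0.;
    g[k+1] = -s[k] * g[k];                           /* least-squares residual */
    g[k]   =  c[k] * g[k];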
Then, in line 119, if the residual is smaller than a prescribed tolerance we terminate the GMRES loop, compute the final optimal update and report the solution. In this case the optimal coefficients $\alpha$ are obtained by backward substitution of the upper-triangular system in lines 123-129. The update vector zk is then computed and, in the case of preconditioning, divided by the diagonal entries. Finally the solution is computed in line 143 and reported to the user in line 147.
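As a hedged sketch, the backward substitution of lines 123-129 and the final update have roughly the following form; kend, alpha, z and the flag precond are assumed names, and the division by diag reflects the diagonal preconditioning described above:

    /* Sketch only: solve the upper-triangular system H * alpha = g by
       backward substitution, then form the update z = V * alpha.       */
    for (i = kend - 1; i >= 0; i--)
    {
         alpha[i] = g[i];
         for (j = i + 1; j < kend; j++)
              alpha[i] -= h[i][j] * alpha[j];
         alpha[i] /= h[i][i];
    }
    for (ss = 0; ss < nrowsI; ss++)
    {
         z[ss] = 0.;
         for (j = 0; j < kend; j++)
              z[ss] += alpha[j] * v[j][ss];     /* local rows of V * alpha   */
         if (precond)
              z[ss] /= diag[ss];                /* undo the diagonal scaling */
         x[ss] += z[ss];                        /* final (local) solution    */
    }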


Verification and Validation

In this section, we verify the implementation.

Matrix A : dense6x6.mtx

The first argument of the code is the location and name of the input matrix, the second argument is the RHS vector file, the third argument is the number of GMRES iterations, and the last argument is 0 for plain GMRES and 1 for the preconditioned GMRES algorithm. Running the parallel GMRES code for k=6 and p=1 we obtain
$mpirun -np 1 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0

 contents of x : 
0.18861997210055, 
-0.15060264446066, 
1.02527163328527, 
-0.43822000212772, 
0.98705425807485, 
-1.21946494911443,
Note that since we don't read the RHS vector in this case (it is simply commented out in the source code), I pass an arbitrary string, (void), to preserve the number and arrangement of the arguments. For k=6 and p=2, the parallel code gives us
$mpirun -np 2 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0 out.1

contents of x :  ---> process 0
0.18861997210055, 
-0.15060264446066, 
1.02527163328527, 

 contents of x : ---> process 1
-0.43822000212772, 
0.98705425807485, 
-1.21946494911443,
For k=6 and p=3, the parallel code gives us
$mpirun -np 3 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0 out.1 out.2

 contents of x :  ---> process 0
0.18861997210055, 
-0.15060264446066, 
 contents of x :  ---> process 1
1.02527163328527, 
-0.43822000212772, 
 contents of x :  ---> process 2
0.98705425807485, 
-1.21946494911443,
For the extreme case of six processes p=6, the parallel code gives us
$mpirun -np 6 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0 out.1 out.2 out.3 out.4 out.5 

 contents of x : ---> process 0
0.18861997210055, 

 contents of x : ---> process 1
-0.15060264446066, 

 contents of x : ---> process 2
1.02527163328527, 

 contents of x : ---> process 3
-0.43822000212772, 

 contents of x : ---> process 4
0.98705425807485, 

 contents of x : ---> process 5
-1.21946494911443,
Using the GMRES program written in GNU Octave (Appendix II), we obtain

ans =

   0.188619972100554
  -0.150602644460663
   1.025271633285273
  -0.438220002127719
   0.987054258074850
  -1.219464949114430
which is in exact agreement with the parallel C-code. The residual history (including the initial residual computed before the main GMRES loop) is plotted in fig.(0.2.1). As we see, it converges to machine zero after 6 iterations, as expected.

Matrix B : fidapm05.mtx

In this section we consider the residual history of the matrix fidapm05.mtx, which is a 42x42 sparse matrix. The structure of the matrix is shown in fig.(2).

First we set k=42, the case in which we have all the search vectors, so we expect the residual to reach machine zero. This is shown in fig.(3).

Figure 3: Residuals for Matrix B for various numbers of processes.
To validate the parallel implementation, we increase the number of processes to $p=3$ and $p=42$. As shown, the residual curves are identical. Another test is the effect of preconditioning. For this purpose we first read the RHS vector for the 42x42 system, given in the input file fidapm05_rhs1.mtx. The residual curves for GMRES with and without preconditioning are shown in fig.(4).

Figure 4: The effect of preconditioning on the convergence of the 42x42 matrix.
As we see, there isn't much difference between the residual curves (compared to the matrix D that will be discussed in the following section). Diagonal preconditioning does improve the convergence in the intermediate stages, but overall there is no big difference. This is mainly because the matrix B42x42 is strongly off-diagonal, so when we approximate it with only its diagonal entries, we lose a considerable amount of information in the preconditioner. Therefore we expect that diagonal preconditioning shouldn't work as well in this case as it does for close-to-diagonal matrices. I expect that if we used the tridiagonal part of matrix B42x42 instead of only the main diagonal, the convergence would improve.
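For context, when the preconditioning flag is set, the preconditioned matrix-vector product amounts (in spirit) to the right-preconditioned product w = A D^{-1} u, where D = diag(A); a minimal sketch using the cached diag array and an assumed matvec kernel is:

    /* Sketch only: diagonal (Jacobi) right preconditioning,
       i.e. w = A * (D^{-1} u) with D = diag(A).             */
    for (ss = 0; ss < nrowsI; ss++)
         utmp[ss] = u[ss] / diag[ss];         /* scale by the inverse diagonal */
    matvec(A, utmp, w, nrowsI, commun);       /* distributed sparse product    */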


Matrix D : s3dkq4m2.mtx

This is a huge 90448x90448 matrix with all non-zero elements close to the main diagonal, so we expect diagonal preconditioning to work well: the main diagonal is a good approximation of the matrix D itself. Here we fix the number of GMRES iterations to 40 and, without restarting, run the code with and without the preconditioning option for one, two and sixteen processes. The results are presented in fig.(5).
As expected, the diagonal preconditioning greatly improves the convergence of the original GMRES algorithm. As mentioned before, the reason is that the main diagonal is a good approximation of the matrix, since its entries are concentrated close to the main diagonal. We also notice that varying the number of processes has no effect on the convergence curves; again the curves are identical (to the eye).


Performance

To investigate the effect of restarting on the performance of PGMRES, we run the code for several test cases denoted by PGMRES(k,m), where k is the number of iterations and m is the number of restarts. For $k = 10$, we found that with $m=8$ restarts the code converges to 1.e-10. For larger values of k the number of required restarts drops considerably; however, to keep the comparison consistent, we use $m=8$ for all cases. The residual curves for PGMRES(10,8), PGMRES(20,8), PGMRES(40,8) and PGMRES(80,8) are shown in figs.(6-9) respectively. The wall time for different numbers of processes/iterations is measured using the MPI_Wtime( void ) function. The results are presented in table(1). As we see, there is no difference in the residual curves for different numbers of processes. For the rest of the discussion please see the last page.
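The restart driver itself is simple. A minimal sketch is given below; gmres_cycle() is a hypothetical wrapper around the k-iteration loop described earlier, assumed to return the final least-squares residual, and is not the actual function name used in the code:

    /* Sketch only: PGMRES(k,m) = at most m restarts of k iterations each.
       Each cycle starts from the current solution x and improves it.      */
    for (restart = 0; restart < m; restart++)
    {
         res = gmres_cycle(A, b, x, k, precond, nrowsI, commun);
         if (res < tol)          /* converged: no further restarts needed */
              break;
    }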





Table 1: Timing for PGMRES and Matrix D. Values are in seconds.

p    k=10       k=20       k=40       k=80
1    2.693337   3.155758   4.296086   6.696444
2    1.668714   1.719922   2.803091   3.771479
4    0.874024   1.041473   1.482399   1.839851


According to the timing table (1), we observe the key point that for a small number of iterations ``k'' the computation time decreases. This is mainly because the GMRES algorithm contains two interior loops whose cost depends on the value of k.

    // main k loop
    ...

    /* modified Gram-Schmidt: remove from u its components along all
       previously generated Krylov vectors v[0..k]                   */
    for (j = 0; j <= k; j++)
    {
         dotproduct(v[j], u, &h[j][k], nrowsI, commun); /* collective dot product */
         for (ss = 0; ss < nrowsI; ss++)
              u[ss] -= h[j][k] * v[j][ss];
    }

    ... updates

    /* apply the previously computed Givens rotations (c[j], s[j])
       to the new column k of the Hessenberg matrix                 */
    for (j = 0; j < k; j++)
    {
         delta = h[j][k];
         h[j][k]   =  c[j] * delta + s[j] * h[j+1][k];
         h[j+1][k] = -s[j] * delta + c[j] * h[j+1][k];
    }
The first loop is of course very expensive, because a collective dot product must be performed in each cycle; for a very large number of iterations ``k'', the number of dot products therefore increases dramatically. Instead of increasing ``k'', we prefer to stop at some point in the iteration space and restart the process from the new solution.
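The dotproduct() kernel used above is essentially a local dot product followed by a collective reduction. The sketch below shows one way it could look, assuming an MPI_Allreduce-based implementation; it is not a copy of the actual routine:

    /* Sketch only: distributed dot product of two locally stored vectors.
       Each process sums its own nrowsI entries, then all processes reduce.
       Requires mpi.h.                                                      */
    void dotproduct(double *a, double *b, double *result, int nrowsI, MPI_Comm commun)
    {
         double local = 0.;
         int ss;
         for (ss = 0; ss < nrowsI; ss++)
              local += a[ss] * b[ss];
         MPI_Allreduce(&local, result, 1, MPI_DOUBLE, MPI_SUM, commun);
    }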
The results of numerical experiments on the etowah machine confirm that the restarting approach decreases the computation time. For one process and $k = 80$, the computation time is 6.7 seconds. If we decrease ``k'' to 10 we get 2.7 seconds, which is about 40 percent of the $k = 80$ case. We also note that the parallelization leads to a speed-up which is limited by Amdahl's law. For $k = 10$ and $m=8$, we get a speed-up of ($1.61<2$) when we increase the number of processes from 1 to 2, and ($3.1<4$) when we increase from $p=1$ to $p=4$. However, for large ``k'' the speed-up improves: going from $p=1$ to $p=2$ for $k = 80$ gives a speed-up of 1.8, while going from $p=1$ to $p=4$ in the same case gives 3.6, which is better than the 3.1 obtained for $k = 10$.
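For clarity, these speed-up values are simply ratios of the wall times in table (1), $S(p) = T_1 / T_p$; for example, for $k=10$, $S(2) = 2.693/1.669 \approx 1.61$ and $S(4) = 2.693/0.874 \approx 3.1$, while for $k=80$, $S(2) = 6.696/3.771 \approx 1.8$ and $S(4) = 6.696/1.840 \approx 3.6$.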



Another important point is the choice of the initial (guessed) solution and the way it affects the residual curves. For the previous cases the initial solution was $x_0 = 1.0$. Here we change this to $x_0 = 0$ to see what happens to the residual curves when restarting. The residual curves for the restarting case are plotted below. As we see, the behavior, which was a monotonically descending, almost straight curve before, changes to curves with bumps.
As shown in fig.(10), the larger the number of GMRES iterations, the better the solution is estimated before each restart. Therefore another important parameter is the choice of the initial guess $x_0$. In the previous section, where we used $x_0 = 1.0$, the residual curves were very similar because after each restart the solution was already a very good estimate of the final solution, so the curves show only small bumps and are almost flat. Here, by contrast, the solution still needs to be improved after each restart, so each new search vector dramatically improves the accuracy of the solution and we see a visible bump.


References



[1] 

[2] 

[3] Y. Saad and M.H. Schultz, ``GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems'', SIAM J. Sci. Stat. Comput., 7:856-869, 1986.


All codes/reports are released under the GNU GPL. http://www.gnu.org/licenses/gpl.html
Download my source code from here!
Please put your email in the comments and I'll send you a copy of the source code.

UPDATE FEB1 2016 : The complete code listing is available on https://github.com/arrgasm/paraGMRES






