Sunday, May 6, 2012

Recommended Tutorial for Learning Emacs Basics

If you have not decided on your text editor, I strongly recommend Emacs. The learning curve is steep, meaning you have to spend a lot of time on basic things that seem silly at first, like copy and paste, which is much harder than Ctrl+C/Ctrl+V in other editors. But you will get used to it by the time you finish one project, and then you will never leave it! Vim is also good, but I personally prefer Emacs. These are typically the editors of choice for advanced programmers working on large code bases.

The Emacs code has been under continuous development for more than two decades, and it has proven able to handle a wide range of programming tasks. It has syntax highlighting for almost all programming languages, to name a few: C/C++, Python, Pascal, Java, Fortran, Octave/Matlab, Perl, VHDL, Verilog, ...
To discover Emacs, I recommend first going through this awesome introductory tutorial; after that you are ready to jump to the advanced techniques.

Then print the following cheat sheet (http://www.cs.princeton.edu/~hains/os/emacs.html) and post it on the wall right in front of you.
Try to use the keyboard and shortcuts as much as you can. Don't limit yourself to beginner-style programming with the mouse and graphical interface. The power of Emacs is that you can do tasks with the keyboard much faster and more easily than with a GUI. Have a nice programming experience!
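For example, these are just a few of the default bindings (copy and paste are done with the kill ring) that you will end up using constantly:

    M-w        copy region (kill-ring-save)
    C-w        cut region (kill-region)
    C-y        paste (yank)
    C-k        kill to end of line
    C-x C-s    save file
    C-x C-f    open (find) file
    C-x C-c    exit Emacs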

Friday, May 4, 2012

Three-Dimensional Mixed-Element Unstructured Finite-Volume Solver for Euler/Navier-Stokes Equations

Abstract

(Please refer to the manual.pdf file in the attached archive for the complete documentation of the code.)


In this post, we implement and validate a three-dimensional algorithm for solving the inviscid Euler equations on general single-block geometries over a fairly wide range of flow regimes. To this end, we first use a set of predefined routines to read the grid topology. The input file is assumed to be in UGRID format. The grid topology is defined by the number of nodes, the number of each element type in the grid, the coordinates of all nodes, and the definitions of the various elements, including triangles and quadrilaterals on the boundaries and tetrahedra, pyramids, prisms and hexahedra in the interior. Before writing the grid maps, we implement element-based maps which locally specify the face-to-node, node-to-node and edge-to-node relations for a particular element. Using these local maps, we then implement the various grid maps needed to access the grid point/edge of interest in the flow-solver section. These maps include the elements surrounding a point, the points surrounding a point, the edges surrounding a point, and the elements surrounding an element.
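As an illustration of what such a map looks like, here is a minimal sketch of building the "elements surrounding a point" map in a compressed (CSR-like) form for tetrahedra. This is only a sketch under assumed data layouts (conn[e][0..3] holding the four node indices of element e); the names esup1/esup2 are illustrative and are not the actual routines of the code.

    /* Sketch: build "elements surrounding a point" in CSR-like form.
       nnode = number of nodes, nelem = number of tetrahedra,
       conn[e][0..3] = the four node indices of element e.
       After the two passes, the elements around node i are
       esup1[esup2[i] .. esup2[i+1]-1].  (Illustrative only.) */
    #include <stdlib.h>

    void build_esup(int nnode, int nelem, int (*conn)[4],
                    int **esup1_out, int **esup2_out)
    {
        int *esup2 = calloc(nnode + 1, sizeof(int));
        int e, j, node;

        /* pass 1: count how many elements touch each node */
        for (e = 0; e < nelem; e++)
            for (j = 0; j < 4; j++)
                esup2[conn[e][j] + 1]++;

        /* prefix sum to get the start offset of each node's list */
        for (node = 0; node < nnode; node++)
            esup2[node + 1] += esup2[node];

        /* pass 2: fill the element list */
        int *esup1 = malloc(esup2[nnode] * sizeof(int));
        int *fill  = calloc(nnode, sizeof(int));
        for (e = 0; e < nelem; e++)
            for (j = 0; j < 4; j++) {
                node = conn[e][j];
                esup1[esup2[node] + fill[node]++] = e;
            }

        free(fill);
        *esup1_out = esup1;
        *esup2_out = esup2;
    }

The other maps (points surrounding a point, edges surrounding a point, elements surrounding an element) follow the same two-pass counting/filling pattern.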
In the second part of the work, we transform the cell-centered grid into the equivalent node-centered representation, which is more suitable for the finite-volume approach. In doing so, we define our geometrical data structure based on edges in the interior of the domain and ghost edges on the boundaries. For each boundary triangle we define three ghost edges, while for each quadrilateral we have four. The volume of each node-centered cell, together with the conservative variables, is stored in node-based arrays. All maps and the transformed node-centered grid are validated using a set of predefined meshes, including a generic 13-node grid, an all-hexahedral and an all-tetrahedral mesh, coarse and fine ramp meshes for supersonic cases, and a dense viscous mesh for the NACA0012 airfoil.
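A minimal sketch of the kind of edge record this data structure implies is shown below; the field names are illustrative and not the ones used in the code. Interior edges connect two nodes and carry the area-weighted normal of the dual face between their control volumes, while ghost edges attach a boundary node to its share of the boundary-face normal.

    /* Illustrative edge record for a node-centered (dual) scheme.
       Interior edge: n1, n2 are the two end nodes and (nx,ny,nz) is the
       area-weighted normal of the dual face separating their volumes.
       Ghost edge: one per node of each boundary face, storing that
       node and its share of the outward boundary-face normal. */
    typedef struct {
        int    n1, n2;      /* end nodes (n2 could be negative for a ghost edge) */
        double nx, ny, nz;  /* area-weighted face normal */
    } Edge;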
In the third part of the work, we implement the flow solver. The conservation laws are discretized using the finite-volume approach (integration in space plus the divergence theorem), and a generic equation for the temporal update is obtained. Flux vectors are evaluated using Roe's approach: Roe averages are computed at the edge midpoints, providing a Jacobian-based first-order approximation of the flow field in space. For temporal discretization, both the first-order explicit Euler scheme and the classical fourth-order Runge-Kutta scheme are implemented. A fifth-order Runge-Kutta scheme of Jameson and Baker is also implemented in the code.
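The Roe averages mentioned above are density-weighted averages of the left and right states at an edge midpoint. Here is a minimal sketch of that step only (an illustrative function, not the code's actual routine):

    /* Sketch: Roe-averaged velocity and total enthalpy at an edge,
       from left/right density, velocity and total enthalpy. */
    #include <math.h>

    void roe_average(double rhoL, double uL[3], double HL,
                     double rhoR, double uR[3], double HR,
                     double uhat[3], double *Hhat)
    {
        double wL = sqrt(rhoL), wR = sqrt(rhoR);
        double w  = 1.0 / (wL + wR);
        int i;
        for (i = 0; i < 3; i++)
            uhat[i] = (wL * uL[i] + wR * uR[i]) * w;   /* Roe velocity   */
        *Hhat = (wL * HL + wR * HR) * w;               /* Roe enthalpy   */
        /* The Roe sound speed then follows from
           ahat^2 = (gamma - 1) * (Hhat - 0.5*|uhat|^2). */
    }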
To validate the current algorithm, a supersonic ramp flow is solved on both coarse and refined meshes, and a convergence study is performed on each. The shock angle and various flow parameters upstream and downstream of the shock are compared with one-dimensional oblique-shock theory.
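For reference, one-dimensional oblique-shock theory relates the deflection (ramp) angle to the shock angle through the standard $\theta$-$\beta$-$M$ relation, $\tan\theta = 2\cot\beta\,\dfrac{M_1^2\sin^2\beta - 1}{M_1^2(\gamma + \cos 2\beta) + 2}$, where $\theta$ is the ramp angle, $\beta$ the shock angle and $M_1$ the upstream Mach number.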
The method is extended to second order in space by using a least-squares method for gradient reconstruction. In addition, viscous terms are added by averaging the gradients over an edge. The final solver is validated on several benchmark problems, including nonlinear acoustics of standing waves inside a pipe, vortex propagation, and the laminar boundary layer over a flat plate. Comparison of the results with experimental data and with analytical boundary-layer solutions demonstrates the robustness of the current solver.
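A minimal sketch of an unweighted least-squares gradient at a node is shown below, assuming the points-surrounding-point map from above; the function and array names are illustrative, not the actual routines of the code.

    /* Sketch: unweighted least-squares gradient of a scalar q at node i.
       nbr[] lists the nnbr neighbouring nodes, x[][3] are the node
       coordinates and q[] the nodal values.  The 3x3 normal equations
       sum(d d^T) grad = sum(d dq) are solved by Cramer's rule. */
    static double det3(const double A[3][3])
    {
        return A[0][0]*(A[1][1]*A[2][2] - A[1][2]*A[2][1])
             - A[0][1]*(A[1][0]*A[2][2] - A[1][2]*A[2][0])
             + A[0][2]*(A[1][0]*A[2][1] - A[1][1]*A[2][0]);
    }

    void lsq_gradient(int i, const int *nbr, int nnbr,
                      double (*x)[3], const double *q, double grad[3])
    {
        double A[3][3] = {{0}}, b[3] = {0};
        for (int k = 0; k < nnbr; k++) {
            int j = nbr[k];
            double d[3] = { x[j][0]-x[i][0], x[j][1]-x[i][1], x[j][2]-x[i][2] };
            double dq = q[j] - q[i];
            for (int r = 0; r < 3; r++) {
                b[r] += d[r] * dq;
                for (int c = 0; c < 3; c++)
                    A[r][c] += d[r] * d[c];
            }
        }
        double det = det3(A);               /* assumes enough non-coplanar neighbours */
        for (int r = 0; r < 3; r++) {       /* Cramer's rule: replace column r with b */
            double Ar[3][3];
            for (int rr = 0; rr < 3; rr++)
                for (int cc = 0; cc < 3; cc++)
                    Ar[rr][cc] = (cc == r) ? b[rr] : A[rr][cc];
            grad[r] = det3(Ar) / det;
        }
    }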

Sample Results 


Here are some pictures. For complete documentation of the theory and instructions for running the code, please download the attached archive.

Coarse grid ramp


A coarse-grid supersonic ramp test case with the second-order spatial Roe scheme.
A coarse-grid supersonic ramp test case with the first-order spatial Roe scheme.
Residual history for the first-order spatial discretization.


Fine grid ramp


Fine-grid ramp with first-order spatial accuracy. The shock is resolved sharply!

Viscous Boundary Layer


Top) The three-dimensional grid for capturing the boundary layer on a flat plate; the grid is refined where the flat plate starts. Bottom) Results from the current solver. For more information, please see the PDF manual in the attached archive.
Comparison between the exact Blasius solution, the Fluent code, and the current code.


Download the Source Code and Documentation from Here:
PDF Manual:    http://ifile.it/amr59xi
Source Code in C: please put your email in the comments and I'll send you one copy of the source code.
Update (Feb 2, 2016): The code is uploaded here: https://github.com/arrgasm/ns3D




Wednesday, May 2, 2012

Completely Validated Unstructured Two-Dimensional Euler Solver Written in C

Here is my implementation of van Leer fluxes (and their Jacobians) in the following C code. It has the following functionality:
1- Has a special Python script for plotting contours on unstructured grids! I wrote this script to get rid of third-party packages like Tecplot and FieldView, so I can plot any contour directly from my C code without calling an external program.
2- Solves the two-dimensional Euler equations with first-order and second-order spatial accuracy.
3- Implements van Leer fluxes and their ANALYTICAL Jacobians (a small sketch follows this list).
4- Has built-in explicit and implicit time-marching schemes.
5- Reads arbitrary unstructured grids in ".mesh" format.
6- Has a library for verification and validation using the method of manufactured solutions.
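Since item 3 is the heart of the scheme, here is a minimal one-dimensional sketch of van Leer flux-vector splitting. The code itself implements the two-dimensional version together with its analytical Jacobians; this routine and its names are only illustrative.

    /* Sketch: 1-D van Leer split flux F+ (the F- part is symmetric).
       Inputs: density rho, velocity u, pressure p, ratio of specific
       heats gam.  For M >= 1 the full flux is returned, for M <= -1
       the plus-flux is zero, and for |M| < 1 the subsonic van Leer
       splitting is used. */
    #include <math.h>

    void vanleer_flux_plus(double rho, double u, double p,
                           double gam, double F[3])
    {
        double a = sqrt(gam * p / rho);              /* speed of sound   */
        double M = u / a;                            /* Mach number      */
        double E = p / (gam - 1.0) + 0.5 * rho * u * u;

        if (M >= 1.0) {                              /* fully supersonic */
            F[0] = rho * u;
            F[1] = rho * u * u + p;
            F[2] = (E + p) * u;
        } else if (M <= -1.0) {                      /* no forward flux  */
            F[0] = F[1] = F[2] = 0.0;
        } else {                                     /* subsonic split   */
            double fm = 0.25 * rho * a * (M + 1.0) * (M + 1.0);
            double t  = (gam - 1.0) * u + 2.0 * a;
            F[0] = fm;
            F[1] = fm * t / gam;
            F[2] = fm * t * t / (2.0 * (gam * gam - 1.0));
        }
    }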
Below are some results for the NACA0012 airfoil:
   Fig. 1- Mach = 0.8, angle of attack 1.25 degrees (second-order)
Fig. 2- Mach = 0.8, angle of attack 1.25 degrees (first-order)
Fig. 3- Mach = 1.2, angle of attack 0.0 degrees (first-order)
Fig. 4- Results from the method of manufactured solutions; the computed contours are compared with the exact solution and match it.

To download my code, please put your email in the comments and I'll send you a copy of the source code.
UPDATE (Feb 1, 2016): The code is published at https://github.com/arrgasm/UnstEuler2D

Massively Parallel GMRES C Code for Giant Sparse Systems of Equations


Algorithm and Implementation

The GMRES (Generalized Minimal RESidual) algorithm is presented and discussed in detail in [2, 3]. Since the implementation is written in the C programming language in a modular, function-oriented manner, the main file is very short and reads almost like pseudo-code itself! So we reproduce the main file here, with some small details (local variables, allocations, comments, etc.) omitted for brevity. In the next pages we describe the algorithm based on this implementation.

In line 5, each process reads its own share of the input matrix, and in line 7 it reads the corresponding section of the right-hand-side array. The localization step (remapping global indices to local ones) is done in line 11. Then, to improve the performance of the code, we extract the diagonal entries of the matrix once, in lines 19-25, and store them in the array diag; whenever the preconditioning step needs the diagonal entries, it uses the values stored in this array instead of looping over the matrix A again.
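For example, assuming a CSR-like (row pointer / column index / value) storage of the locally owned rows, picking the local diagonal once could look like the sketch below. The storage layout and the names are assumptions for illustration, not the actual variables of the code.

    /* Sketch: extract the diagonal of the locally owned rows once, so
       the diagonal (Jacobi) preconditioner never loops over A again.
       row_ptr/col_ind/val is a CSR storage of the local rows and
       first_row is the global index of local row 0. */
    #include <stdlib.h>

    double *extract_diag(int nrows_local, int first_row,
                         const int *row_ptr, const int *col_ind,
                         const double *val)
    {
        double *diag = calloc(nrows_local, sizeof(double));
        for (int i = 0; i < nrows_local; i++)
            for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; jj++)
                if (col_ind[jj] == first_row + i)   /* global diagonal entry */
                    diag[i] = val[jj];
        return diag;
    }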
In line 29, the solution is initialized with ones. We will see in the following sections that initializing with zeros gives a different convergence behavior. In lines 31-51, the initial vector of the Krylov subspace is created by normalizing the residual of the initial guess; this residual is also stored in g[0] (the right-hand-side vector of the least-squares problem).
The main GMRES loop starts at line 53; the columns of the Krylov matrix are created one by one during this loop. Note that the final implementation of GMRES, presented in Appendix I, includes an outer loop for restarting. Lines 55-61 contain a section for dynamic allocation using realloc, which we skip here. In lines 67-87, depending on the user flag for preconditioning, the appropriate type of matrix-vector multiplication is selected and applied. To make the resulting vector orthogonal to the previous vectors, a modified Gram-Schmidt procedure is implemented in lines 87-93, where the components of the new vector along the previous vectors are successively eliminated, leaving a final component that is normal to all previous vectors. Based on the normalization of this vector, a new search vector is created in lines 95-99. The previously computed Givens rotations are then applied to the new column of the Hessenberg matrix in lines 99-105. Once this is done, we compute a new Givens rotation and apply it to that column; this is done in lines 107-111. In lines 111-115, the rotation is also applied to the RHS vector g, which gives us the residual of the least-squares problem directly in g[k+1]. This is interesting because, without solving Ax=b or doing a matrix-vector multiplication, we know the residual directly from the least-squares problem solved with the Givens rotations.
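The new Givens rotation of lines 107-115 can be sketched roughly as follows (an illustrative routine, not the literal code from the main file):

    /* Sketch: compute the Givens rotation that zeroes h[k+1][k], apply
       it to the new Hessenberg column and to the RHS vector g.  After
       this step, fabs(g[k+1]) is the current least-squares residual. */
    #include <math.h>

    void new_givens(double **h, double *c, double *s, double *g, int k)
    {
        double denom = sqrt(h[k][k] * h[k][k] + h[k+1][k] * h[k+1][k]);
        c[k] = h[k][k]   / denom;
        s[k] = h[k+1][k] / denom;

        h[k][k]   = c[k] * h[k][k] + s[k] * h[k+1][k];
        h[k+1][k] = 0.0;

        g[k+1] = -s[k] * g[k];   /* new residual of the least-squares problem */
        g[k]   =  c[k] * g[k];
    }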
Then, in line 119, if the residual is smaller than a prescribed tolerance, we terminate the GMRES loop, compute the final optimal update and report the solution. In this case the optimal coefficients $\alpha$ are obtained by backward solution of the upper-triangular system in lines 123-129. The update vector zk is then computed and, in the preconditioned case, divided by the diagonal entries. Finally, the solution is computed in line 143 and reported to the user in line 147.
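A minimal sketch of that backward solve, assuming the rotated (upper-triangular) Hessenberg matrix h and the RHS g from above, is:

    /* Sketch: solve the k x k upper-triangular system H * alpha = g by
       back substitution to obtain the optimal update coefficients. */
    void back_solve(double **h, const double *g, double *alpha, int k)
    {
        for (int i = k - 1; i >= 0; i--) {
            alpha[i] = g[i];
            for (int j = i + 1; j < k; j++)
                alpha[i] -= h[i][j] * alpha[j];
            alpha[i] /= h[i][i];
        }
    }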


Verification and Validation

In this section, we verify the implementation.

Matrix A : dense6x6.mtx

The first argument of the code is the location and name of the input matrix, the second argument is the RHS vector, the third argument is the number of GMRES iterations, and the last argument is 0 for plain GMRES and 1 for the preconditioned GMRES algorithm. Running the parallel GMRES code for k=6 and p=1, we obtain
$mpirun -np 1 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0

 contents of x : 
0.18861997210055, 
-0.15060264446066, 
1.02527163328527, 
-0.43822000212772, 
0.98705425807485, 
-1.21946494911443,
Note that since we don't read the RHS vector here (that call is simply commented out in the source code), I put an arbitrary string, (void), to preserve the number and arrangement of the arguments. For k=6 and p=2, the parallel code gives us
$mpirun -np 2 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0 out.1

contents of x :  ---> process 0
0.18861997210055, 
-0.15060264446066, 
1.02527163328527, 

 contents of x : ---> process 1
-0.43822000212772, 
0.98705425807485, 
-1.21946494911443,
For k=6 and p=3, the parallel code gives us
$mpirun -np 3 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0 out.1 out.2

 contents of x :  ---> process 0
0.18861997210055, 
-0.15060264446066, 
 contents of x :  ---> process 1
1.02527163328527, 
-0.43822000212772, 
 contents of x :  ---> process 2
0.98705425807485, 
-1.21946494911443,
For the extreme case of six processes p=6, the parallel code gives us
$mpirun -np 6 ./gmres ../matrices/dense6x6.mtx (void) 6 0

$ cat out.0 out.1 out.2 out.3 out.4 out.5 

 contents of x : ---> process 0
0.18861997210055, 

 contents of x : ---> process 1
-0.15060264446066, 

 contents of x : ---> process 2
1.02527163328527, 

 contents of x : ---> process 3
-0.43822000212772, 

 contents of x : ---> process 4
0.98705425807485, 

 contents of x : ---> process 5
-1.21946494911443,
Using the GMRES program written in GNU Octave (Appendix II), we will obtain

ans =

   0.188619972100554
  -0.150602644460663
   1.025271633285273
  -0.438220002127719
   0.987054258074850
  -1.219464949114430
which is in exact agreement with the parallel C code. The residual history (including the initial residual before the main GMRES loop) is plotted in fig.(0.2.1). As we see, it converges to machine zero after 6 iterations, as expected.

Matrix B : fidapm05.mtx

In this section we consider the residual history of the matrix fidapm05.mtx, which is a 42x42 sparse matrix. The structure of the matrix is shown in fig.(2).

First we set k=42, the case in which we have all the search vectors; therefore we expect the residual to reach machine zero. This is shown in fig.(3).

Figure 3: Residuals for Matrix B for various numbers of processes.
To validate the parallel implementation, we increase the number of processes to $p=3$ and $p=42$; as shown, the residual curves are identical. Another test is the effect of preconditioning. For this purpose we first read the RHS vector for the 42x42 system, given in the input file fidamp05_rhs1.mtx. The residual curves for GMRES with and without preconditioning are shown in fig.(4).

Figure 4: The effect of preconditioning on the convergence of the 42x42 matrix.
As we see, there isn't much difference between the residual curves (compared to matrix D, which will be discussed in the following section). The diagonal preconditioning does improve the convergence in the intermediate stages, but overall there is no big difference. This is mainly because the matrix B (42x42) has strong off-diagonal content, so when we approximate it by its diagonal entries alone, we introduce a considerable error into the preconditioner. We therefore expect diagonal preconditioning not to work as well in this case as it does for close-to-diagonal matrices. I expect that using the tridiagonal part of B instead of only the main diagonal would improve the convergence.


Matrix D : s3dkq4m2.mtx

This is a huge 90448x90448 matrix with all non-zero elements close to the main diagonal, so we expect diagonal preconditioning to work well: the main diagonal is a good approximation of the matrix D itself. Here we fix the number of GMRES iterations to 40 and, without restarting, run the code with and without the preconditioning option for one, two and sixteen processes. The results are presented in fig.(5).
As expected, the diagonal preconditioning greatly improves the convergence of the original GMRES algorithm. As mentioned before, the reason is that the main diagonal is a good approximation of the matrix, since its non-zeros are clustered around the main diagonal. We also notice that varying the number of processes has absolutely no effect on the convergence curves; again the curves are identical (to the eye).


Performance

To investigate the effect of restarting on the performance of PGMRES, we run the code for a number of test cases denoted PGMRES(k,m), where k is the number of iterations and m is the number of restarts. For $k = 10$, we found that with $m=8$ restarts the code converges to 1.e-10. For larger k, the number of required restarts decreases considerably; however, since the number of restarts m should be the same for all cases, we keep $m=8$ for all of them. The residual curves for PGMRES(10,8), PGMRES(20,8), PGMRES(40,8) and PGMRES(80,8) are shown in figs.(6-9), respectively. The wall time for different numbers of processes/iterations is measured using the MPI_Wtime() function. The results are presented in table (1). As we see, there is no difference in the residual curves for different numbers of processes. For the rest of the discussion please see the last page.





Table 1: Timing for PGMRES and Matrix D. Values are in seconds.

    p     k=10        k=20        k=40        k=80
    1     2.693337    3.155758    4.296086    6.696444
    2     1.668714    1.719922    2.803091    3.771479
    4     0.874024    1.041473    1.482399    1.839851


According to the timing in table (1), we observe the key point that for a small number of iterations k the computation time decreases. This is mainly because the GMRES algorithm contains two interior loops whose cost depends on the value of k:

    //main k loop
    ...

    // modified Gram-Schmidt: remove the components of u along the
    // previous Krylov vectors v[0..k]
    for ( j = 0 ; j <= k; j++)
    {
        dotproduct(v[j], u, &h[j][k], nrowsI, commun); // collective dot product
        for( ss = 0; ss < nrowsI; ss++)
            u[ss] -= h[j][k] * v[j][ss];
    }

    ... updates

    // apply the previously computed Givens rotations (c[j], s[j]) to the
    // new column of the Hessenberg matrix
    for ( j = 0 ; j < k; j++)
    {
        delta = h[j][k];
        h[j][k] = c[j] * delta + s[j] * h[j+1][k];
        h[j+1][k] = -s[j] * delta + c[j] * h[j+1][k];
    }
The first loop is of course very expensive, because a collective dot product must be performed in each cycle, so for a very large number of iterations k the number of dot products increases dramatically. Therefore, instead of increasing k, we prefer to iterate to some point in the iteration space and then restart the process from the new solution.
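The collective dot product itself is essentially a local partial sum followed by a global reduction. The sketch below shows what such a routine typically does, assuming the same argument layout as the dotproduct call above and that commun is an MPI_Comm; it is not the code's actual implementation.

    /* Sketch: distributed dot product -- local partial sum followed by
       a collective sum-reduction over all processes. */
    #include <mpi.h>

    void dotproduct(const double *a, const double *b, double *result,
                    int nrows_local, MPI_Comm comm)
    {
        double local = 0.0;
        for (int i = 0; i < nrows_local; i++)
            local += a[i] * b[i];
        MPI_Allreduce(&local, result, 1, MPI_DOUBLE, MPI_SUM, comm);
    }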
The numerical experiments on the etowah machine confirm that the restarting approach decreases the computation time. For one process and $k = 80$, the computation time is 6.7 seconds; decreasing k to 10 gives 2.7 seconds, which is 40 percent of the $k = 80$ case. We also note that the parallelization leads to a speed-up which is limited by Amdahl's law. For $k = 10$ and $m=8$, we get a speed-up of $1.61<2$ when we increase the number of processes from 1 to 2, and $3.1<4$ when we go from $p=1$ to $p=4$. However, for large k the speed-up improves: from $p=1$ to $p=2$ with $k = 80$ we get a speed-up of 1.8, while from $p=1$ to $p=4$ in the same case we get 3.6, which is better than the 3.1 of the $k = 10$ case.
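For reference, Amdahl's law bounds the speed-up on $p$ processes by $S(p) \le 1/\big((1-f) + f/p\big)$, where $f$ is the parallelizable fraction of the work; a larger k spends a larger share of the time in the (parallel) matrix-vector products and dot products, which is consistent with the better speed-up observed for $k = 80$.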



Another important point is the choice of the initial (guessed) solution and the way it affects the residual curves. For the previous cases the initial solution was $x_0 = 1.0$. Here we change it to $x_0 = 0$ to see what happens to the residual curves when restarting. The residual curves for the restarted case are plotted below. As we see, the behavior, which was previously a monotonically descending, almost straight curve, changes to curves with bumps.
As shown in fig.(10), the more GMRES iterations are taken, the better the solution is estimated before each restart. Therefore another important parameter is the choice of the initial guess $x_0$. In the previous section, where we used $x_0 = 1.0$, the residual curves were very similar because after each restart the solution was already a very good estimate of the final solution, so there is a small bump but the curves stay almost flat. Here, by contrast, the solution still needs to improve after each restart, so with each new search vector the accuracy of the solution improves dramatically and a visible bump appears.


References



[1] 

[2] 

[3] Y. Saad and M.H. Schultz, ``GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems'', SIAM J. Sci. Stat. Comput., 7:856-869, 1986.


All codes/reports are released under the GNU GPL: http://www.gnu.org/licenses/gpl.html
Download my source code from here!
please put your email in the comments and I'll send you one copy of the source code.

UPDATE (Feb 1, 2016): The complete code listing is available at https://github.com/arrgasm/paraGMRES