Define command for "bw" as well.

[GMRES_For_Journal.git] / GMRES_Journal.tex
diff --git a/GMRES_Journal.tex b/GMRES_Journal.tex
index 0ca22dfe72a05ddc50aa6b492e39d7e2479ded3c..696ccf14bac3a1e474b107ee6551c67343848259 100644
--- a/GMRES_Journal.tex
+++ b/GMRES_Journal.tex
@@ -1,5 +1,4 @@
  \documentclass[11pt]{article}
-%\documentclass{acmconf}
  \usepackage{multicol}
  
  \usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
@@ -19,7 +18,6 @@
  \usepackage{url}
  \usepackage{mdwlist}
  \usepackage{multirow}
-%\usepackage{color}
  
  \date{}
  
@@ -43,6 +41,7 @@ IUT Belfort-Montb\'eliard\\
  \{\texttt{lilia.ziane\_khoja},~\texttt{raphael.couturier},~\texttt{arnaud.giersch},~\texttt{jacques.bahi}\}\texttt{@univ-fcomte.fr}
  }
  
+\newcommand{\BW}{\mathit{bw}}
  \newcommand{\Iter}{\mathit{iter}}
  \newcommand{\Max}{\mathit{max}}
  \newcommand{\Offset}{\mathit{offset}}
@@ -331,10 +330,10 @@ requires the vector elements of its neighboring nodes corresponding to the colum
  which its local sub-matrix has nonzero values. Consequently, each computing node manages a global
  vector composed of a local vector of size $\frac{n}{p}$ and a shared vector of size $S$:
  \begin{equation}
-  S = bw - \frac{n}{p},
+  S = \BW - \frac{n}{p},
  \label{eq:11}
  \end{equation}
-where $\frac{n}{p}$ is the size of the local vector and $bw$ is the bandwidth of the local sparse
+where $\frac{n}{p}$ is the size of the local vector and $\BW$ is the bandwidth of the local sparse
  sub-matrix which represents the number of columns between the minimum and the maximum column indices
  (see Figure~\ref{fig:01}). In order to improve memory accesses, we use the texture memory to
  cache elements of the global vector.
@@ -855,7 +854,7 @@ torso3                  & 183 863 292      & 25 682 514       & 613 250
  
  
  
-Hereafter, we show the influence of the communications on a GPU cluster compared to a CPU cluster. In Tables~\ref{tab:10},~\ref{tab:11} and~\ref{tab:12}, we compute the ratios between the computation time over the communication time of three versions of the parallel GMRES algorithm to solve sparse linear systems associated to matrices of Table~\ref{tab:06}. These tables show that the hypergraph partitioning and the compressed format of the vectors increase the ratios either on the GPU cluster or on the CPU cluster. That means that the two optimization techniques allow the minimization of the total communication volume between the computing nodes. However, we can notice that the ratios obtained on the GPU cluster are lower than those obtained on the CPU cluster. Indeed, GPUs compute faster than CPUs but with GPUs there are more communications due to CPU/GPU communications, so communications are more time-consuming while the computation time remains unchanged.
+Hereafter, we show the influence of the communications on a GPU cluster compared to a CPU cluster. In Tables~\ref{tab:10},~\ref{tab:11} and~\ref{tab:12}, we compute the ratios between the computation time over the communication time of three versions of the parallel GMRES algorithm to solve sparse linear systems associated to matrices of Table~\ref{tab:06}. These tables show that the hypergraph partitioning and the compressed format of the vectors increase the ratios either on the GPU cluster or on the CPU cluster. That means that the two optimization techniques allow the minimization of the total communication volume between the computing nodes. However, we can notice that the ratios obtained on the GPU cluster are lower than those obtained on the CPU cluster. Indeed, GPUs compute faster than CPUs but with GPUs there are more communications due to CPU/GPU communications, so communications are more time-consuming while the computation time remains unchanged. Furthermore, we can notice that the GPU computation times on Tables~\ref{tab:11} and~\ref{tab:12} are about 10\% lower than those on Table~\ref{tab:10}. Indeed, the compression of the vectors and the reordering of matrix columns allow to perform coalesced accesses to the GPU memory and thus accelerate the sparse matrix-vector multiplication.  
  
  \begin{table}
  \begin{center}
@@ -874,7 +873,7 @@ crashbasis        & 48.532 s       & 3195.183 s   & {\bf 0.015}   & 623.686 s
  FEM\_3D\_thermal2 & 37.211 s       & 1584.650 s   & {\bf 0.023}   & 370.297 s      & 3810.255 s   & {\bf 0.097}\\
  language          & 22.912 s       & 2242.897 s   & {\bf 0.010}   & 286.682 s      & 5348.733 s   & {\bf 0.054}\\
  poli\_large       & 13.618 s       & 1722.304 s   & {\bf 0.008}   & 190.302 s      & 4059.642 s   & {\bf 0.047}\\
-torso3            & 74.194 s       & 4454.936 s   & {\bf 0.017}   & 190.302 s      & 10800.787 s  & {\bf 0.083}\\ \hline       
+torso3            & 74.194 s       & 4454.936 s   & {\bf 0.017}   & 897.440 s      & 10800.787 s  & {\bf 0.083}\\ \hline       
  \end{tabular}
  \caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using row-by-row partitioning on 12 GPUs and 24 CPUs.}
  \label{tab:10}
@@ -944,7 +943,7 @@ torso3            & 57.469 s       & 16.828 s     & {\bf 3.415}   & 926.588 s
  
  
   
-Another very important issue, which might be ignored by too many people, is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for that. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU data transfers slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of a parallel algorithms.
+Another very important issue, which might be ignored by too many people, is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for that. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU data transfers slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of parallel algorithms.
  
  %%--------------------%%
  %%      SECTION 7     %%