+For this part, use the `gemm_mpi.cpp
+<https://gitlab.com/PRACE-4IP/CodeVault/raw/master/hpc_kernel_samples/dense_linear_algebra/gemm/mpi/src/gemm_mpi.cpp>`_
+example, which is provided by the `PRACE Codevault repository
+<http://www.prace-ri.eu/prace-codevault/>`_.
+
+The computing part of this example is the matrix multiplication routine
+
+.. literalinclude:: /tuto_smpi/gemm_mpi.cpp
+ :language: cpp
+ :lines: 9-24
+
+Compile it with ``smpicxx`` and run it in the simulator:
+
+.. code-block:: console
+
+ $ smpicxx -O3 gemm_mpi.cpp -o gemm
+ $ time smpirun -np 16 -platform cluster_crossbar.xml -hostfile cluster_hostfile --cfg=smpi/display-timing:yes --cfg=smpi/host-speed:1000000000 ./gemm
+
+This should end quite quickly, as the size of each matrix is only 1000x1000.
+But what happens if we want to simulate larger runs?
+Replace the size by 2000, 3000, and try again.
+
+The simulation time increases a lot, even though no additional MPI calls are performed: only the computation grows.
+
+The ``--cfg=smpi/display-timing`` option gives more details about execution
+and advises using sampling if the time spent in computing loops seems too high.
+
+The ``--cfg=smpi/host-speed:1000000000`` option sets the speed of the processor used for
+running the simulation. Here we state that it is as fast as the simulated processors (1 Gf), so that
+1 second of computation measured on the simulating machine is injected as 1 second in the simulation.
+If the simulating machine were, say, twice as fast, you would set this value to 2000000000 so that
+each measured second is injected as 2 simulated seconds on the 1 Gf hosts.
+
+.. code-block:: console
+
+ [5.568556] [smpi_kernel/INFO] Simulated time: 5.56856 seconds.
+
+ The simulation took 24.9403 seconds (after parsing and platform setup)
+ 24.0764 seconds were actual computation of the application
+ [5.568556] [smpi_kernel/INFO] More than 75% of the time was spent inside the application code.
+ You may want to use sampling functions or trace replay to reduce this.
+
+So in our case (size 3000), the simulation ran for 25 seconds, and the simulated time was 5.57s at the end.
+Computation by itself took 24 seconds, and can quickly grow with larger sizes
+(as computation is really performed, there will be variability between similar runs).
+
+SMPI provides sampling macros that speed up the simulation by timing only a few iterations
+of large computation loops, and skipping the rest once a given number of iterations has been
+sampled, or once the timing estimate is stable enough.
+
+The two macros differ only slightly:
+
+- ``SMPI_SAMPLE_GLOBAL``: the specified number of samples is produced collectively by all processes
+- ``SMPI_SAMPLE_LOCAL``: each process executes the specified number of sample iterations itself
+
+So if the amount of work varies between processes (imbalance),
+it is safer to use the local variant (a sketch of its use is given below).
+
+To use one of them, replace the outer for loop of the multiply routine:
+
+.. code-block:: c
+
+ for (int i = istart; i <= iend; ++i)
+
+by:
+
+.. code-block:: c
+
+ SMPI_SAMPLE_GLOBAL(int i = istart, i <= iend, ++i, 10, 0.005)
+
+The first three parameters are those of the loop, while the last two control the sampling.
+They mean that at most 10 iterations will be timed, and that the sampling phase can end
+earlier if the expected stability is reached with fewer samples.
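+
+If the workload is unbalanced between processes, the local variant is used in the same way.
+As a sketch (assuming the same five-parameter signature as the global macro, with the sampling
+parameters kept from the example above), the loop header becomes:
+
+.. code-block:: c
+
+ /* each process samples at most 10 of its own iterations, then extrapolates */
+ SMPI_SAMPLE_LOCAL(int i = istart, i <= iend, ++i, 10, 0.005)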
+
+Now run the code again with various sizes and parameters and check the time taken for the
+simulation, as well as the resulting simulated time.
+
+.. code-block:: console
+
+ [5.575691] [smpi_kernel/INFO] Simulated time: 5.57569 seconds.
+ The simulation took 1.23698 seconds (after parsing and platform setup)
+ 0.0319454 seconds were actual computation of the application
+
+In this case, the simulation only took 1.2 seconds, while the simulated time
+remained almost identical.
+
+The computation results will obviously be altered since most computations are skipped.
+These macros thus cannot be used when results are critical for the application behavior
+(for instance, convergence estimation will be wrong in some codes).
+
+
+Lab 4: Memory folding on large allocations
+------------------------------------------
+
+Another issue that can be encountered when simulating with SMPI is running out of memory.
+Indeed, all MPI processes are executed on a single node, which can lead to crashes.
+We will use the DT benchmark of the NAS suite to illustrate how to avoid such issues.
+
+With 85 processes and class C, the simulated DT benchmark will try to allocate 35GB of memory,
+which may not be available on the node you are using.
+
+To avoid this we can simply replace the largest calls to malloc and free by calls
+to ``SMPI_SHARED_MALLOC`` and ``SMPI_SHARED_FREE``.
+This means that all processes will share one single instance of this buffer.
+As for sampling, results will be altered, and this should not be used for control structures.
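+
+As a sketch (the buffer name and size below are illustrative, not taken from the DT sources;
+depending on the SimGrid version you may also need to include ``smpi/smpi.h`` for the macros),
+the replacement looks like this:
+
+.. code-block:: c
+
+ /* before: every simulated process gets its own private copy of the buffer */
+ double *buf = (double *) malloc(nelems * sizeof(double));
+ /* ... computation using buf ... */
+ free(buf);
+
+ /* after: all simulated processes share a single folded instance of the buffer */
+ double *buf = (double *) SMPI_SHARED_MALLOC(nelems * sizeof(double));
+ /* ... computation using buf ... */
+ SMPI_SHARED_FREE(buf);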
+
+In the DT example, there are three different calls to malloc in the file, and one of them allocates a structure whose content is actually needed.
+Find it, and replace the two other calls with ``SMPI_SHARED_MALLOC`` (there is only one free to replace for both of them).
+
+Once done, you can run:
+
+.. code-block:: console
+
+ $ make dt NPROCS=85 CLASS=C
+ (compilation logs)
+ $ smpirun -np 85 -platform ../cluster_backbone.xml bin/dt.C.x BH
+ (execution logs)
+
+The simulation should now finish without swapping or crashing (ignore the warning about the return value).
+
+If control structures are also problematic, you can use the ``SMPI_PARTIAL_SHARED_MALLOC(size, offsets, offsetscount)``
+macro, which shares only specific parts of the buffer between processes
+and keeps private memory for the important parts.
+Such a buffer can be freed afterward with ``SMPI_SHARED_FREE``.
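+
+A minimal sketch of its use, assuming the signature documented by SimGrid (total size, array of
+begin/end byte offsets, number of shared blocks); the offsets below are purely illustrative:
+
+.. code-block:: c
+
+ /* bytes [0, 1000) and [2000, 5000) are shared between processes,
+    the remaining ranges stay private to each process */
+ size_t shared_offsets[] = {0, 1000, 2000, 5000};
+ char *buf = (char *) SMPI_PARTIAL_SHARED_MALLOC(5000, shared_offsets, 2);
+ /* ... */
+ SMPI_SHARED_FREE(buf);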
+
+If allocations are performed with malloc or calloc, SMPI (from version 3.25) provides the option
+``--cfg=smpi/auto-shared-malloc-thresh:n``, which replaces all allocations larger than n bytes with
+shared allocations. The value has to be selected carefully, so as not to fold smaller control arrays
+that contain data necessary for the run to complete.
+Try to run the (unmodified) DT example again with values ranging from 10 to 100,000, to see that
+values that are too small can cause crashes.
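+
+For instance (the threshold below is only an illustration, to be adapted to the code at hand):
+
+.. code-block:: console
+
+ $ smpirun --cfg=smpi/auto-shared-malloc-thresh:100000 -np 85 -platform ../cluster_backbone.xml bin/dt.C.x BH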
+
+A useful option to identify the largest allocations in the code is ``--cfg=smpi/display-allocs:yes`` (available from version 3.27).
+At the end of a (successful) run, it displays the largest allocations and their locations, helping pinpoint the
+buffers to share or the threshold to set for automatic sharing.
+For DT, the process would be to run a smaller class of the problem:
+
+.. code-block:: console
+
+ $ make dt NPROCS=21 CLASS=A
+ $ smpirun --cfg=smpi/display-allocs:yes -np 21 -platform ../cluster_backbone.xml bin/dt.A.x BH
+
+This should output:
+
+.. code-block:: console
+
+ [smpi_utils/INFO] Memory Usage: Simulated application allocated 198533192 bytes during its lifetime through malloc/calloc calls.
+ Largest allocation at once from a single process was 3553184 bytes, at dt.c:388. It was called 3 times during the whole simulation.
+ If this is too much, consider sharing allocations for computation buffers.
+ This can be done automatically by setting --cfg=smpi/auto-shared-malloc-thresh to the minimum size wanted size (this can alter execution if data content is necessary)