X-Git-Url: http://bilbo.iut-bm.univ-fcomte.fr/pub/gitweb/simgrid.git/blobdiff_plain/a6e488781dadd92ea13e74150074c4207fefe478..d68e1c39ec0832cb2391aedd17a868c597dd399e:/docs/source/Release_Notes.rst diff --git a/docs/source/Release_Notes.rst b/docs/source/Release_Notes.rst index 479d7a48ac..553a2e5264 100644 --- a/docs/source/Release_Notes.rst +++ b/docs/source/Release_Notes.rst @@ -568,32 +568,184 @@ Hopefully in the next release. Finally, this release mostly entails maintenance work **on the model front**: a bug was fixed when using ptasks on multicore hosts, and the legacy stochastic generator of external load has been reintroduced. -Version 3.33 (not released yet) -------------------------------- +Version 3.33 (never released) +----------------------------- -**On the maintainance front,** we removed the ancient MSG interface which end-of-life was scheduled for 2020, the Java -bindings that was MSG-only and support for native builds on Windows (WSL is now required). Keeping SimGrid alive while -adding new features require to remove old, unused stuff. The very rare users impacted by these removals are urged to +This version was overdue for more than 6 months, so it was skipped so as not to hinder our process of deprecating old code. + +Version 3.34 (June 26. 2023) +---------------------------- + +**On the maintenance front,** we removed the ancient MSG interface whose end-of-life was scheduled for 2020, the Java bindings +that were MSG-only, support for native builds on Windows (WSL is now required) and support for 32-bit platforms. Keeping SimGrid +alive while adding new features requires removing old, unused stuff. The very rare users impacted by these removals are urged to move to the new API and systems. +We also conducted many internal refactorings to remove any occurrence of "surf" and "simix". SimGrid v3.12 used a layered design +where simix was providing synchronizations to actors, on top of surf which was computing the models. 
These features are now +provided in modules, not layers. Surf became the kernel::{lmm, resource, routing, timer, xml} modules while simix became +the kernel::{activity, actor, context} modules. + **On the model front,** we realized an idea that has been on the back of our minds for quite some time. The question was: could we use something in the line of the ptask model, that mixes computations and network transfers in a single fluid activity, to simulate a *fluid I/O stream activity* that would consume both disk and network resources? This -remained an open question for years, mainly because the implementation of the ptask doesn't rely on the LMM solver as +remained an open question for years, mainly because the implementation of the ptask does not rely on the LMM solver as the other models do. The *fair bottleneck* solver is convenient, but with less solid theoretical bases and the development of its replacement (the *bmf solver*) is still ongoing. However, this combination of I/Os and communications seemed easier as these activities share the same unit (bytes). -After a few tentatives, we opted for a simple, slightly unperfect, yet convenient way to implement such I/O streams -at the kernel level. It doesn't require a new model, just that the default HostModels implements a new function which -creates a classical NetworkAction, but add some I/O-related constraints to it. A couple little hacks here and there, -and done! A single activity mixing I/Os and communications can be created whose progress is limited by the resource -(Disk or Link) of least bandwidth value. - -**On the interface front**, the new ``Io::streamto()`` function has been inspired by the existing ``Comm::sendto()`` -function (which also derives from the ptask model). The user can specify a ``src_disk`` on a ``src_host`` and a -``dst_disk`` on a ``dst_host`` to stream data of a given ``size``. 
Note that disks are optional, allowing users to -simulate some kind of "disk-to-memory" or "memory-to-disk" I/O streams. +After a few attempts, we opted for a simple, slightly imperfect, yet convenient way to implement such I/O streams at the +kernel level. It doesn't require a new model, just that the default HostModels implement a new function which creates a +classical NetworkAction but adds some I/O-related constraints to it. A couple of little hacks here and there, and done! It is now +possible to create a single activity mixing I/Os and communications, whose progress is limited by the resource (Disk or Link) with +the lowest bandwidth. As a result, a new :cpp:func:`Io::streamto()` function has been added to send data between arbitrary disks or +hosts. The user can specify a ``src_disk`` on a ``src_host`` and a ``dst_disk`` on a ``dst_host`` to stream data of a +given ``size``. Note that disks are optional, allowing users to simulate some kind of "disk-to-memory" or "memory-to-disk" I/O +streams. This function is heavily inspired by the existing :cpp:func:`Comm::sendto`, which can be used to send data between arbitrary hosts. + +We also modified the Wi-Fi model so that the total capacity of a link depends on the number of flows on that link, according to +the results of some ns-3 experiments. This model can be more accurate for congested Wi-Fi links, but its calibration is more +demanding, as shown in the `example +`_ and in the `research +paper `_. + +We also worked on the usability of our models, by actually writing the long overdue documentation of our TCP models and by renaming +some options for clarity (old names are still accepted as aliases). A new function ``s4u::Engine::flatify_platform()`` dumps an +XML representation that is inefficient (all zones are flattened) but easier to read (routes are explicitly defined). You should +not use the output as a regular input file, but it will prove useful to double-check your platform. 
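As a back-of-the-envelope illustration of the I/O stream model described earlier in this section, the following Python sketch estimates the duration of a stream as its size divided by the lowest bandwidth among the resources it crosses. This is not SimGrid code: ``stream_duration`` and its parameters are hypothetical names, and the sketch assumes a single stream with no latency and no sharing with other activities, unlike a real simulation.

```python
def stream_duration(size_bytes, src_disk_bw=None, link_bws=(), dst_disk_bw=None):
    """Estimate how long a single, unshared I/O stream takes (hypothetical helper).

    Bandwidths are in bytes per second. A disk bandwidth of None means that the
    corresponding disk is omitted (memory-to-disk or disk-to-memory stream).
    """
    # Gather every resource that actually constrains the stream.
    bandwidths = [bw for bw in (src_disk_bw, *link_bws, dst_disk_bw) if bw is not None]
    if not bandwidths:
        raise ValueError("the stream must cross at least one disk or link")
    # Progress is bounded by the slowest resource on the path.
    return size_bytes / min(bandwidths)


# 1 GB read from a 200 MB/s disk, sent over a 100 MB/s link, kept in memory.
print(stream_duration(1e9, src_disk_bw=200e6, link_bws=[100e6]))
```

In this example the link is the bottleneck, so adding a faster destination disk would not change the estimate, while a slower one would.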
+ +**On the interface front**, some functions were deprecated and will be removed in 4 versions, while some old deprecated functions +were removed in this version, as usual. + +Expressing your application as a DAG or a workflow is even more integrated than before. We added a new tutorial on simulating +DAGs and a DAG loader for workflows using the `wfcommons formalism `_. Starting an activity is now +properly delayed until after all its dependencies are fulfilled. We also added a notion of :ref:`Task `, a sort +of activity that can be fired several times. It's very useful to represent complex workflows. We added an ``on_this`` variant of +:ref:`every signal `, to react to the signals emitted by one object instance only. This is sometimes easier than +reacting to every signal of a class, and then filtering on the object you want. Activity signals (veto, suspend, resume, +completion) are now specialized by activity class. That is, callbacks registered in ``Exec::on_suspend_cb`` will not be fired for +Comms or Ios. + +Three new useful plugins were added: the :ref:`battery plugin` can be used to create batteries that get discharged +by the energy consumption of a given host, the :ref:`solar panel plugin ` can be used to create +solar panels whose energy production depends on the solar irradiance and the :ref:`chiller plugin ` can be used to +create chillers and compensate for the heat generated by hosts. These plugins could probably be better integrated +in the framework, but our goal is to include in SimGrid the building blocks upon which everybody would agree, while the model +elements that are more debatable are provided as plugins, in the hope that the users will carefully assess the plugins and adapt +them to their specific needs before usage. Here for example, there are several models of batteries, and the one provided (which +does not take aging into account) would not be suited to every study. 
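Returning to the signals discussed above, the following toy Python model sketches the difference between a class-wide signal and its ``on_this`` variant, and shows why a callback registered on one activity class is not fired for another. It is only an analogy: ``Signal``, ``Activity``, ``on_completion`` and ``on_this_completion`` are hypothetical names chosen for this sketch, not the SimGrid API.

```python
class Signal:
    """Tiny callback list standing in for a simulator signal (toy model)."""

    def __init__(self):
        self._callbacks = []

    def connect(self, callback):
        self._callbacks.append(callback)

    def emit(self, *args):
        for callback in self._callbacks:
            callback(*args)


class Activity:
    def __init__(self, name):
        self.name = name
        # Per-instance signal: fired for this object only.
        self.on_this_completion = Signal()

    def complete(self):
        # Class-wide signal, specialized per activity class.
        type(self).on_completion.emit(self)
        self.on_this_completion.emit(self)


class Exec(Activity):
    on_completion = Signal()


class Comm(Activity):
    on_completion = Signal()


seen_by_class, seen_by_instance = [], []
first, second, comm = Exec("first"), Exec("second"), Comm("comm")
Exec.on_completion.connect(lambda a: seen_by_class.append(a.name))
second.on_this_completion.connect(lambda a: seen_by_instance.append(a.name))

for activity in (first, second, comm):
    activity.complete()

print(seen_by_class)     # ['first', 'second']: the Comm is not reported
print(seen_by_instance)  # ['second']: only the chosen instance is reported
```

With the per-instance variant, no filtering on the emitting object is needed inside the callback.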
+ +It is now easy to mix S4U actors and SMPI applications, or even to start more than one MPI application in a given simulation +with the :ref:`SMPI_app_instance_start() ` function. + +**On the model checking front**, this release brings a huge load of good improvements. First, we finished the long refactoring +so that the model-checker only reads the memory of the application for state equality (used for liveness checking) and for +:ref:`stateful checking `. In all other cases, the network protocol is used to retrieve the information and the +application is simply forked to explore new execution branches. The code is now easier to read and to understand. Even better, +the verification of safety properties is now enabled by default on every platform since it does not depend on advanced OS +mechanisms anymore. You can even run the verified application in valgrind in that case. On the other hand, liveness checking +still needs to be enabled at compile time if you need it. To be honest, this part of the framework is not very well maintained nowadays. +We should introduce more testing of the liveness verification at some point to fix this situation. + +Back to safety verification, we fixed a bug in the DPOR reduction which caused some failures to be missed by the +exploration, but the fix somewhat hinders the reduction quality (as we don't miss branches anymore). Some scenarios which could be +exhaustively explored earlier (with our buggy algorithm) are now too large for our (correct) exploration algorithm. But that's +not a problem because we implemented several mechanisms to improve the performance of the verification. First, we implemented +source sets in DPOR, to blacklist transitions that are redundant with previously explored ones. Then, we implemented several new +DPOR variants. SDPOR and ODPOR are very efficient algorithms described in the paper "Source Sets: A Foundation for Optimal +Dynamic Partial Order Reduction" by Abdulla et al. in 2017. 
We also have an experimental implementation of UDPOR, described in +the paper "Unfolding-based Partial Order Reduction" by Rodriguez et al. in 2015, but it's not completely functional yet. We hope +to finish it for the next release. And finally, we implemented a guiding mechanism trying to converge faster toward the bugs in +the reduced state space. We have some naive heuristics, and we hope to provide better ones in the next release. + +We also extended the sthread module, which allows intercepting simple code that uses pthread mutexes and semaphores in order to simulate and +verify it. You do not even need to recompile your code, as it uses LD_PRELOAD to intercept the target functions. This module +is still rather young, but it could probably be useful already, e.g. to verify the code written by students in a class on UNIX +IPC and synchronization. Check `the examples `_. In addition, +sthread can now also check concurrent accesses to a given collection, an approach loosely inspired by `this paper +`_. +This feature is not very usable yet, as you have to manually annotate your code, but we hope to improve it in the future. + +Version 3.35 (November 23. 2023) +-------------------------------- + +**On the performance front**, we did some profiling and optimisation for this release. We saved some memory in simulations +mixing MPI applications and S4U actors, and we greatly improved the performance of simulations exchanging many messages. We even +introduced a new abstraction called MessageQueue and the associated Mess simulated object to represent control messages in a very +efficient way. When using MessageQueue and Mess instead of Mailboxes and Comms, information is automagically transported over +thin air between producer and consumer in no simulated time. The implementation is much simpler, yielding much better +performance. Indeed, this abstraction solves a scalability issue observed in the WRENCH framework, which is heavily based on +control messages. 
+ + +**On the interface front**, we introduced a new abstraction called ActivitySets. It makes it easy to interact with a bag of +mixed activities, waiting for the next occurring one, or for the completion of the whole set. This feature already existed in +SimGrid, but was implemented in a crude way with vectors of activities and static functions. It is also much easier than earlier +to mix several kinds of activities in activity sets. + +We introduced a new plugin called JBOD (just a bunch of disks), which proves useful to represent a kind of host gathering many +disks. We also revamped the battery, photovoltaic and chiller plugins introduced in the previous release to make it even easier to +study green computing scenarios. On a similar topic, we eased the expression of vertical scaling with the :ref:`Task +`, the repeatable activities introduced in the previous release that can be used to represent microservices +applications. + +We not only added new abstractions and plugins, but also polished the existing interfaces. For example, the declaration of +multi-zoned platforms was greatly simplified by adding methods with fewer parameters to cover the common cases, leaving the +complete methods for the more advanced use cases (see the ChangeLog for details). Another difficulty in the earlier interface +was related to :ref:`Mailbox::get_async()`, which used to require the user to store the payload somewhere on her side. Instead, +once the communication is over, it is now possible to retrieve the payload from the communication object with :ref:`Comm::get_payload()`. + +Finally, on the SMPI front, we introduced :ref:`SMPI_app_instance_join()` to wait for the completion of a started MPI instance. +This makes it easier to mix MPI codes with the S4U applications and plugins that control them. We are currently considering +implementing some MPI4 calls, but nothing has happened so far. 
+ +**On the model-checking front**, the first big news is that we completely removed the liveness checker from the code base. This +is unfortunate, but the truth is that this code was very fragile and not really tested. + +For context, liveness checking is the part of model checking that can determine whether the studied system always terminates +(absence of infinite loops, called non-progression loops in this context), or whether the system can reach a desirable state in +a finite time (for example, when you press the button, you want the elevator to come eventually). This relies on the ability to +detect loops in the execution, most often through the detection that this system state was already explored earlier. SimGrid +relied on tricks and heuristics to detect such state equality by leveraging debug information meant for gdb or valgrind. It +kind of worked, but was very fragile because neither this information nor the compilation process is meant for state equality +evaluation. Not zeroing the memory induces many crufty bits, for example in the padding bytes of the data structures or on the +stack. This can be solved by only comparing the relevant bits (as instructed by the debug information), but this process was +rather slow. Detecting equality in the heap was even more hackish, as we usually don't have any debug information about the +memory blocks retrieved from malloc(). This prevents any introspection into these blocks, which is problematic because the order +of malloc calls will create states that are syntactically different (the blocks are not in the same location in memory) but +semantically equivalent (the data meaning rarely depends on the block location itself). The heuristics we used here were so +crude that I don't want to detail them here. + +If liveness checking were to be re-implemented nowadays, I would go for a compiler-aided approach. 
I would maybe use a compiler +plugin to save the type of malloced blocks as in `Compiler-aided type tracking for correctness checking of MPI applications +`_ and zero the stack on +call returns to ease introspection. We could even rely on a complete introspection mechanism such as `MetaCPP +`_ or similar. We had a great piece of code to checkpoint millions of states in a +memory-efficient way. It was able to detect the common memory pages between states, and only save the modified ones. I guess +that this would need to be exhumed from the git history when reimplementing liveness checking in SimGrid. But I doubt that this will +happen anytime soon, as we concentrate our scarce manpower on other parts. In particular, safety checking is about searching for +failed assertions by inspecting each state. A counterexample to a safety property is simply a state where the assertion failed +/ the property is violated. A counterexample to a liveness property is an infinite execution path that violates the property. +In some sense, safety checking is much easier than liveness checking, but it's already very powerful to exhaustively test an +application. + +In this release, we made significant progress on the safety side of model checking. First, we ironed out many bugs in +the ODPOR exploration algorithm (and the dependency and reversible race theorems on which it relies). ODPOR should now be usable, +and it's much faster than our previous DPOR reduction. We also extended sthread a bit, by adding many tests from the McMini tool +and by implementing pthread barriers and condition variables. It seems to work rather well with C code using pthreads and with C++ +code using the standard library. You can even use sthread on a process running in valgrind or gdb. + +Unfortunately, condition variables cannot be verified so far by the model checker, because our implementation of condition variables is +still synchronous. 
Our reduction algorithms are optimized because we know that all transitions of our computation model are +persistent: once a transition gets enabled, it remains so until it's actually fired. This makes it possible to build upon previous +computations, while everything must be recomputed in each state when previously enabled transitions can get disabled by an +external event. However, this requires that every activity be written in an asynchronous way. For example, you first declare your +intent to lock a mutex in an asynchronous manner, and then wait for the ongoing_acquisition object. Once a given acquisition +is granted, it will always remain so even if another actor creates another acquisition on the same mutex. This is the equivalent +of asynchronous communications that are common in MPI, but for mutex locks, barriers and semaphores. What is missing +to verify pthread_cond in sthread is an asynchronous version of the condition variables, which I didn't manage to write. We have +almost-working code lying around, but it fails for timed waits on condition variables. This will probably be part of the next +release. .. |br| raw:: html