Re: Discussion: Switching Esprseso to shared memory parallelization

Dear Rudolf and maintainers,

I can understand why you are considering this step. I would like to share my thoughts and give you some feedback.

1) You asked about scenarios that need MPI parallelism. In an IPVS+ITV (U. Stuttgart) collaboration, we perform large simulations that subject particles to a background flow field. Within that flow field, we want to include multiple scales of turbulence.

These simulations need MPI parallelism because they have millions of particles and millions of bonds. This data simply does not fit into the RAM of a single node. On current HPC machines (e.g. Hawk, SuperMUC-NG), you have 2 GB of RAM per core. This effectively limits what you can do with one node. And sometimes the consequence is not merely less parallelism and a longer wait, but not being able to run the simulation at all.

Even without the RAM problems, our simulations would take far too long to run on a single node. I suspect this is also true for other multi-scale simulations.

2) You talk about "HPC nodes" having about 20-64 cores. This is certainly true. I just want to remark that with a shared-memory parallelization there will be no more HPC nodes for ESPResSo users. When applying for compute time at an HPC center, you have to describe the parallelization and the scalability of your code in detail. If you run on one node only, they will most likely turn you down and you are left with your local workstation.

3) While I see your point that the current MPI parallelization might not be the easiest to understand and roll out, I want to make it clear that devising a well-performing shared-memory parallelization is not a trivial matter either. "Sprinkling in" a couple of "#pragma omp parallel for" will certainly not be enough. As with the distributed-memory parallelization, you will have to devise a spatial domain decomposition and come up with a workload distribution between the threads. You will have to know which threads import data from others and devise locking mechanisms to guard these accesses. Reasoning about this code and debugging it might turn out to be as hard as for the MPI-based code. If you want to go down this path, I strongly suggest not reinventing the wheel and taking a look at, e.g., the AutoPas [1] project.
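
To illustrate the bookkeeping this implies, here is a minimal sketch in C++. It assumes a 1D periodic box split into equally sized cells and a static block distribution of cells to threads. The names cell_of, owner_of and imports_from are hypothetical and are not taken from ESPResSo or AutoPas.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <set>

// Map a position in [0, box_l) to one of n_cells equally sized 1D cells.
std::size_t cell_of(double x, double box_l, std::size_t n_cells) {
    assert(x >= 0.0 && x < box_l && n_cells > 0);
    const auto c =
        static_cast<std::size_t>(x / box_l * static_cast<double>(n_cells));
    return std::min(c, n_cells - 1); // guard against rounding at the upper edge
}

// Static block distribution: contiguous ranges of cells per thread.
std::size_t owner_of(std::size_t cell, std::size_t n_cells,
                     std::size_t n_threads) {
    assert(cell < n_cells && n_threads > 0);
    return cell * n_threads / n_cells;
}

// Threads whose data `thread` has to read: the owners of the cells that are
// (periodically) adjacent to its own cells.
std::set<std::size_t> imports_from(std::size_t thread, std::size_t n_cells,
                                   std::size_t n_threads) {
    std::set<std::size_t> result;
    for (std::size_t c = 0; c < n_cells; ++c) {
        if (owner_of(c, n_cells, n_threads) != thread)
            continue;
        const std::size_t left = (c + n_cells - 1) % n_cells;
        const std::size_t right = (c + 1) % n_cells;
        for (std::size_t nb : {left, right}) {
            const std::size_t o = owner_of(nb, n_cells, n_threads);
            if (o != thread)
                result.insert(o);
        }
    }
    return result;
}
```

With 8 cells and 4 threads, thread 0 owns cells 0 and 1 and has to read data owned by threads 1 and 3. A real code additionally needs 3D cells, load balancing, and synchronization around these reads.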

One particular problem that I encountered in the past and want to mention briefly here is bonds: they are stored on only one of the two (or more) involved particles. This is one of the reasons why ESPResSo currently needs to communicate the forces back after calculating them, and you will certainly need measures that deal with this circumstance in a shared-memory parallel code. Such details will increase the complexity of a shared-memory parallel code, and it might also end up being hard for newcomers to understand or hard to extend with new features.
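
A minimal sketch of the bond issue, assuming OpenMP and a toy 1D harmonic bond: the Particle struct and add_bond_forces are hypothetical and do not reflect ESPResSo's actual data structures. Because a bond is stored on only one particle, the thread that handles that particle must also write the force of the bond partner, which may belong to another thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal particle for illustration.
struct Particle {
    double x = 0.0;                 // 1D position
    double f = 0.0;                 // accumulated force
    std::vector<std::size_t> bonds; // partner indices; each bond is stored on
                                    // only one of its two particles
};

// Adds harmonic bond forces with spring constant k. The loop over particles
// is parallel, but each bond also updates the partner's force, so two threads
// can write to the same entry. The atomic updates guard these writes; without
// them the loop would contain a data race.
void add_bond_forces(std::vector<Particle> &parts, double k) {
    const long n = static_cast<long>(parts.size());
#pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        for (std::size_t j : parts[i].bonds) {
            assert(j < parts.size());
            const double fij = -k * (parts[i].x - parts[j].x);
#pragma omp atomic
            parts[i].f += fij;
#pragma omp atomic
            parts[j].f -= fij;
        }
    }
}
```

Atomic updates are only one option. Per-thread force buffers with a reduction afterwards, or storing each bond on both partners, are alternatives with their own memory and complexity costs.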

4) Dependencies, MPI, (avoidable) marshalling and unmarshalling of data, public Python code platforms, etc.: Amusingly, in the past, ESPResSo included things like a "fake MPI" implementation to enable compilation without MPI, as well as local copy operations on ghost data that didn't require (un-)marshalling (I don't know whether they are still in the current version). My point here is that requirements change over time. So when deciding to remove a feature (like MPI parallelism) completely, you had better be sure about it.

In the future, maybe some other public Python platform will come along and further restrict what you can do. To what extent are you willing to adapt your code to meet such restrictions? What I want to say here is that you are weighing HPC aspects against ease of use (possibly on online platforms), right? Might this balance shift again in the near future?

Let me stress again that I can understand why you are considering this step. And, of course, I appreciate the work that you maintainers do for ESPResSo every day and have done in the past. That being said, it is your task to maintain the project in the future and, therefore, ultimately your decision. I wrote these thoughts and comments to offer a slightly different perspective and maybe start a discussion in the community.

Greetings,

Steffen

[1] https://github.com/AutoPas/AutoPas, https://www.researchgate.net/publication/348368663_AutoPas_in_ls1_mardyn_Massively_Parallel_Particle_Simulations_with_Node-Level_Auto-Tuning

From: Espressomd-users <espressomd-users-bounces+steffen.hirschmann=ipvs.uni-stuttgart.de@nongnu.org> on behalf of Rudolf Weeber <weeber@icp.uni-stuttgart.de>
Sent: Monday, 5 July 2021 17:55
To: espressomd-users@nongnu.org
Subject: Discussion: Switching Esprseso to shared memory parallelization

Dear Espresso users,

We are currently discussing switching Espresso's parallelization from
MPI-based to shared memory based. This should result in better parallel
performance and much simpler code. However, it would mean that a single
instance of Espresso would only run on a single machine. For current HPC
systems, that would be something between 20 and 64 cores, typically.

To help with the decision, we would like to know whether anyone runs
simulations using Espresso with more than 64 cores, and what kind of
simulations those are.
Please see the technical details below and let us know what you think.

Regards, Rudolf

# Technical details

## Parallelization paradigms

With MPI-based parallelization, information between processes is passed by
explicitly sending messages. This means that the data (such as particles at
the boundary of a processor) has to be packed, sent, and unpacked.
In shared memory parallelization, all processes have access to the same data.
It is only necessary to ensure that no two processes write the same data at
the same time. So, delays for packing, unpacking and sending the data can
mostly be avoided.
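
The pack and unpack steps described above can be sketched as follows in C++. This is an illustration only: the Particle record and the functions pack, unpack and read_shared are hypothetical, and the actual send is left out. In an MPI code the byte buffer would be handed to calls such as MPI_Send and MPI_Recv between the two steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical minimal particle record (not the Espresso data structure).
struct Particle {
    int id;
    double pos[3];
};
static_assert(std::is_trivially_copyable<Particle>::value,
              "memcpy requires a trivially copyable type");

// Sender side: copy the particles into a contiguous byte buffer.
std::vector<char> pack(const std::vector<Particle> &particles) {
    std::vector<char> buffer(particles.size() * sizeof(Particle));
    if (!particles.empty())
        std::memcpy(buffer.data(), particles.data(), buffer.size());
    return buffer;
}

// Receiver side: rebuild particle records from the received bytes.
std::vector<Particle> unpack(const std::vector<char> &buffer) {
    assert(buffer.size() % sizeof(Particle) == 0);
    std::vector<Particle> particles(buffer.size() / sizeof(Particle));
    if (!particles.empty())
        std::memcpy(particles.data(), buffer.data(), buffer.size());
    return particles;
}

// Shared-memory counterpart: a reader accesses the data in place. What
// remains is ensuring that nobody writes the same entry concurrently.
const Particle &read_shared(const std::vector<Particle> &particles,
                            std::size_t i) {
    assert(i < particles.size());
    return particles[i];
}
```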

## Reasons for not using MPI

* Adding new features to Espresso will be easier, because a lot of non-trivial
communication code does not have to be written.
* The mix of controller-agent and synchronous parallelization used by Espresso
is difficult to understand for new developers, which makes it difficult to get
started with Espresso coding. This parallelization scheme is a result of
Espresso being controlled by a (Python) scripting interface.
* The MPI and Boost::MPI dependencies complicate Espresso's installation and
make it virtually impossible to run Espresso on public Python platforms such
as Azure Notebooks or Google Colab, or to build it natively on Windows.
* The core team had to spend considerable time handling bugs in the MPI and
Boost::MPI dependencies that affected Espresso.
* Writing and validating MPI-parallel code is difficult. We had a few
instances of data not being correctly synchronized across MPI processes which
went unnoticed. In one instance, we were, after a lot of effort, not able to
solve the issue and had to disable a feature for MPI-parallel simulations.

## Advantages of supporting MPI

Simulations can run on more than a single node, i.e., more than the 20-64
cores which are present in typical HPC-nodes.

## Performance estimates

Assuming that one million time steps per day is acceptable, this corresponds
to slightly less than 10k particles per core in a charged soft sphere system
(LJ+P3M) at 10% volume fraction. So, approximately 300k particles would be
possible on an HPC node.
For a soft sphere + LB on a GPU, several million particles should be possible.
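
The arithmetic behind these estimates can be reproduced with a small C++ sketch. The function names are hypothetical, and 10,000 particles per core is used as an upper bound for the "slightly less than 10k" quoted above.

```cpp
#include <cassert>

// Wall-clock budget per time step, in seconds, for a given throughput.
double seconds_per_step(double steps_per_day) {
    assert(steps_per_day > 0.0);
    return 24.0 * 3600.0 / steps_per_day;
}

// Upper bound on particles per node from a per-core particle budget.
long particles_per_node(long cores, long particles_per_core) {
    assert(cores > 0 && particles_per_core > 0);
    return cores * particles_per_core;
}
```

One million time steps per day leaves 86.4 ms per step. With at most 10,000 particles per core, a 20-core node gives 200,000 particles and a 64-core node 640,000; the figure of approximately 300k corresponds to a node with roughly 30 cores.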

--
Dr. Rudolf Weeber
Institute for Computational Physics
Universität Stuttgart
Allmandring 3
70569 Stuttgart
Germany
Phone: +49(0)711/685-67717
Email: weeber@icp.uni-stuttgart.de
http://www.icp.uni-stuttgart.de/~icp/Rudolf_Weeber

From:	Hirschmann, Steffen
Subject:	Re: Discussion: Switching Esprseso to shared memory parallelization
Date:	Tue, 6 Jul 2021 10:41:01 +0000