From: Hirschmann, Steffen
Subject: Re: Discussion: Switching Esprseso to shared memory parallelization
Date: Tue, 6 Jul 2021 10:41:01 +0000
Dear Rudolf and maintainers,
I can understand why you are considering this step. I want to add my thoughts here and give you some feedback.
1) You asked about scenarios that need MPI parallelism. In an IPVS+ITV (U. Stuttgart) collaboration, we perform large simulations that subject particles to a background flow field. Within that flow field, we want to include multiple scales of turbulence. These simulations need MPI parallelism because they have millions of particles and millions of bonds. This data simply does not fit into the RAM of a single node. On current HPC machines (e.g. Hawk, SuperMUC-NG), you have 2 GB RAM per core. This effectively limits what you can do with one node. And sometimes that does not just mean having less parallelism and therefore waiting longer, but not being able to run the simulation at all.
Even without the RAM problems, our simulations would take far too long to run on a single node. I suspect this is also true for other multi-scale simulations.
2) You talk about "HPC nodes" having about 20-64 cores. This is certainly true. I just want to remark that with a shared-memory parallelization there will be no more HPC nodes for ESPResSo users. When applying for runtime at an HPC center, you have to describe the parallelization and the scalability of your code in detail. If you run on one node only, they will most likely turn you down and you are left with your local workstation.
3) While I see your point that the current MPI parallelization might not be the easiest to understand and roll out, I want to make it clear that devising a well-performing shared-memory parallelization is not a trivial matter either. "Sprinkling in" a couple of "#pragma omp parallel for" will certainly not be enough. As with the distributed-memory parallelization, you will have to devise a spatial domain decomposition and come up with a workload distribution between the threads. You will have to know which threads import data from others and devise locking mechanisms to guard these accesses. Reasoning about this code and debugging it might turn out to be as hard as for the MPI-based code. If you want to go down this path, I strongly suggest not reinventing the wheel and taking a look at, e.g., the AutoPas [1] project.
One particular problem that I encountered in the past and that I want to briefly mention here is bonds: they are only stored on one of the two (or more) involved particles. This is one of the reasons why ESPResSo currently needs to communicate the forces back after calculating them, and you will certainly need measures that deal with this circumstance in a shared-memory parallel code. Such details will increase the complexity of a shared-memory parallel code, and it might end up being neither easy for newcomers to understand nor easy to extend with new features.
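To make the point about bonds concrete, here is a minimal sketch. It is not ESPResSo code: the Particle and Bond structs, the 1D harmonic force and the function name add_bond_forces are all hypothetical and only serve as an illustration. Each bond is visited once, from the particle that stores it, but the force has to be written to both partners. The second write may hit a particle that another thread is working on, so a bare "#pragma omp parallel for" would be a data race, and some guard (here: atomic updates) is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal layout for illustration; not ESPResSo's data structures.
struct Bond {
    std::size_t partner;  // index of the other particle in the bond
    double k;             // harmonic spring constant
};

struct Particle {
    double x = 0.0;           // 1D position, for brevity
    double f = 0.0;           // accumulated force
    std::vector<Bond> bonds;  // bonds stored on THIS particle only
};

// Adds harmonic bond forces. Positions are only read; forces are written
// to both bond partners, so every force update is done atomically.
void add_bond_forces(std::vector<Particle>& particles) {
    Particle* p = particles.data();
    const long n = static_cast<long>(particles.size());

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        for (const Bond& b : p[i].bonds) {
            assert(b.partner < particles.size());
            const std::size_t j = b.partner;
            const double fij = -b.k * (p[i].x - p[j].x);

            // Particle i may also be the partner of a bond stored elsewhere,
            // so even the "own" update has to be guarded.
            #pragma omp atomic
            p[i].f += fij;

            // Particle j is usually processed by a different thread.
            #pragma omp atomic
            p[j].f -= fij;
        }
    }
}
```

Without OpenMP enabled, the pragmas are ignored and the loop runs serially. Atomic updates are only the simplest guard; whether they perform well compared to, e.g., per-thread force buffers or a coloring of the domain decomposition is exactly the kind of design question I mean.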
4) Dependencies, MPI, (avoidable) marshalling and unmarshalling of data, public Python code platforms, etc.: Amusingly, in the past, ESPResSo included things like a "fake MPI" implementation to enable compilation without MPI, as well as local copy operations on ghost data that didn't require (un-)marshalling (I don't know if they are still in the current version or not). My point here is that requirements change over time. So when making a decision about removing a feature (like MPI parallelism) completely, you had better be sure about it.
In the future, maybe some other public Python platform will come along and further restrict what you can do. Up to which point are you willing to adapt your code to meet such restrictions? What I want to say here is that you are weighing HPC aspects against ease of use (possibly on online platforms), right? Might this balance change again in the near future?
Let me stress again that I can understand why you are considering this step. And, of course, I appreciate the work that you maintainers do for ESPResSo every day and have done in the past. That being said, it is your task to maintain the project in the future and, therefore, ultimately your decision. I wrote these thoughts and comments to offer a slightly different perspective and maybe start a discussion in the community.
Greetings, Steffen
[1] https://github.com/AutoPas/AutoPas,
    https://www.researchgate.net/publication/348368663_AutoPas_in_ls1_mardyn_Massively_Parallel_Particle_Simulations_with_Node-Level_Auto-Tuning

From: Espressomd-users <espressomd-users-bounces+steffen.hirschmann=ipvs.uni-stuttgart.de@nongnu.org> on behalf of Rudolf Weeber <weeber@icp.uni-stuttgart.de>
Sent: Monday, 5 July 2021 17:55
To: espressomd-users@nongnu.org
Subject: Discussion: Switching Esprseso to shared memory parallelization

Dear Espresso users,

We are currently discussing switching Espresso's parallelization from MPI-based to shared memory based. This should result in better parallel performance and much simpler code. However, it would mean that a single instance of Espresso would only run on a single machine. For current HPC systems, that would be something between 20 and 64 cores, typically.

To help with the decision, we would like to know if anyone runs simulations using Espresso with more than 64 cores, and what kind of simulations those are. Please see technical details below and let us know what you think.

Regards,
Rudolf

# Technical details

## Parallelization paradigms

With MPI-based parallelization, information between processes is passed by explicitly sending messages. This means that the data (such as particles at the boundary of a processor) has to be packed, sent, and unpacked.

In shared memory parallelization, all processes have access to the same data. It is only necessary to ensure that no two processes write the same data at the same time. So, delays for packing, unpacking and sending the data can mostly be avoided.

## Reasons for not using MPI

* Adding new features to Espresso will be easier, because a lot of non-trivial communication code does not have to be written.
* The mix of controller-agent and synchronous parallelization used by Espresso is difficult to understand for new developers, which makes it difficult to get started with Espresso coding. This parallelization scheme is a result of Espresso being controlled by a (Python) scripting interface.
* The MPI and Boost::MPI dependencies complicate Espresso's installation and make it virtually impossible to run Espresso on public Python platforms such as Azure Notebooks or Google Colab as well as building on Windows natively.
* The core team had to spend considerable time handling bugs in the MPI and Boost::MPI dependencies that affected Espresso.
* Writing and validating MPI-parallel code is difficult. We had a few instances of data not being correctly synchronized across MPI processes which went unnoticed. In one instance, we were, after a lot of effort, not able to solve the issue and had to disable a feature for MPI-parallel simulations.

## Advantages of supporting MPI

Simulations can run on more than a single node, i.e., more than the 20-64 cores which are present in typical HPC nodes.

## Performance estimates

Assuming that one million time steps per day is acceptable, this corresponds to slightly less than 10k particles per core in a charged soft sphere system (LJ+P3M) at 10% volume fraction. So, approximately 300k particles would be possible on an HPC node. For a soft sphere + LB on a GPU, several million particles should be possible.

--
Dr. Rudolf Weeber
Institute for Computational Physics
Universität Stuttgart
Allmandring 3
70569 Stuttgart
Germany
Phone: +49(0)711/685-67717
Email: weeber@icp.uni-stuttgart.de
http://www.icp.uni-stuttgart.de/~icp/Rudolf_Weeber