Implement common parallel functionality for trixi-framework#66
vchuravy wants to merge 7 commits into
Conversation
Codecov Report ❌ Patch coverage is

```
@@ Coverage Diff @@
##             main      #66      +/-   ##
==========================================
- Coverage   94.56%   92.52%   -2.05%
==========================================
  Files           5        7       +2
  Lines         313      361      +48
==========================================
+ Hits          296      334      +38
- Misses         17       27      +10
```
I don't think it makes sense to include so that the semidiscretization defines the backend. And then you can make the

To keep dependencies small, there could also be an argument to make KA an extension.

Thanks a lot for this push! I haven't looked at the implementation yet, but I very much like the idea of having a sensible default approach for multithreading and offloading to accelerators for all our packages, such that improvements/extensions directly benefit all downstream packages.

Regarding dependencies and where to put them: I'd like to make using GPUs as easy as using standard Trixi{P,SW,A}.jl. Having to load additional packages is somewhat counter to that goal. On the other hand, we could define a package "TrixiBoost.jl" (or "TrixiWarp.jl" or "TrixiNitro.jl") that captures all the necessary additional dependencies and the common code. These are just some thoughts I had; I do not have a definite answer.

On a more procedural note: would it make sense to keep this PR as a reference for the parallel implementations for now, and create a new PR to Trixi.jl that converts the GPU implementation in trixi-framework/Trixi.jl#2590 (hopefully merged soon) to this form of acceleration? My thinking is that this would help us better understand whether the same form works well for both TrixiParticles.jl and Trixi.jl, and if so, we can then go ahead and tackle the detailed question of whether this should live here or in another, potentially optional, upstream package.
If Polyester and KA are not added as dependencies and only in package extensions, then you only have to write
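(For context: Julia's package-extension mechanism loads such glue code automatically once the weak dependency is imported by the user. A rough sketch of the corresponding `Project.toml` entries; the extension names are hypothetical and the UUIDs are placeholders, not values from this PR:)

```toml
# Hypothetical Project.toml fragment for TrixiBase.jl (UUIDs are placeholders).
[weakdeps]
Polyester = "00000000-0000-0000-0000-000000000000"
KernelAbstractions = "00000000-0000-0000-0000-000000000000"

[extensions]
# Each extension module is loaded automatically once its weak dependency
# is imported in the user's session.
TrixiBasePolyesterExt = "Polyester"
TrixiBaseKernelAbstractionsExt = "KernelAbstractions"
```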
Why would it not? This is just a convenient way to write kernels. Even if you want to thread over the nodes and not the elements, you can simply do something like this:

```julia
iterator = CartesianIndices((1:n_elements, 1:n_nodes_per_element))

@par backend for i in iterator
    element, node = Tuple(i)
    ...
end
```
I agree. I have some projects where I depend on TrixiBase.jl and which do not support GPUs. I would like to avoid (implicitly) depending on KA.jl for these projects. Therefore, I would prefer an extension, too.

Thanks for the clarification. I misunderstood the extension part and thought it would require doing

IMHO we also need to come up with a better name than the original, historically chosen

KernelAbstractions is intentionally a fairly light dependency, but I understand the sentiment. The only problem is the definition of

We don't actually use this in TrixiP and PointNeighbors other than to restrict functions to not take any other types where a parallelization backend is required, but
Could this problem be circumvented by defining a small struct like

```julia
struct KernelAbstractionsBackend{B <: KernelAbstractions.Backend} <: AbstractThreadingBackend
    backend::B
end
```

in the extension, such that it is included when dispatching on
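For illustration, a self-contained sketch of that wrapper idea; the function name and structure are assumptions based on this discussion, not code from the PR:

```julia
# Main package side: only the abstract supertype is defined here, so the
# main package carries no KernelAbstractions dependency.
abstract type AbstractThreadingBackend end

# Extension side (would live in a package extension that imports KA):
using KernelAbstractions

struct KernelAbstractionsBackend{B <: KernelAbstractions.Backend} <: AbstractThreadingBackend
    backend::B
end

# Functions restricted to `AbstractThreadingBackend` now also accept the
# wrapped KA backend; this method simply unwraps it before dispatching on.
# (`parallel_foreach` here stands in for whatever entry point the package uses.)
function parallel_foreach(f, iterator, wrapper::KernelAbstractionsBackend)
    parallel_foreach(f, iterator, wrapper.backend)
end
```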
I don't think

Does that mean that in principle, we can proceed with this PR?
efaulhaber left a comment:

> Does that mean that in principle, we can proceed with this PR?

Yes
```julia
const ParallelizationBackend = Union{AbstractThreadingBackend, KernelAbstractions.Backend}
```
```julia
@inline function TrixiBase.parallel_foreach(f::F, iterator, ::PolyesterBackend) where {F}
    Polyester.@batch for i in iterator
        @inline f(i)
    end
end
```
Would it make sense to define something like

```julia
@inline function TrixiBase.parallel_foreach(f, iterator, ::PolyesterBackend)
    error("Polyester.jl must be imported in order to use `PolyesterBackend`")
end
```

in the main package outside of this extension?
```julia
# On GPUs, execute `f` inside a GPU kernel with KernelAbstractions.jl
@inline function parallel_foreach(f::F, iterator,
                                  backend::KernelAbstractions.Backend) where {F}
    # On the GPU, we can only loop over `1:N`. Therefore, we loop over
    # `1:length(iterator)` and index with `iterator[eachindex(iterator)[i]]`.
    # Note that this only works with vector-like iterators that support
    # arbitrary indexing.
    indices = eachindex(IndexLinear(), iterator)
    ndrange = length(indices)

    # Skip empty loops
    ndrange == 0 && return

    # Call the generic kernel that is defined below, which only calls a
    # function with the global GPU index.
    foreach_ka(backend)(f, iterator, indices, ndrange = ndrange)
end

@kernel unsafe_indices=true function foreach_ka(f, iterator, indices)
    # Calculate global index
    N = @groupsize()[1]
    iblock = @index(Group, Linear)
    ithread = @index(Local, Linear)
    i = ithread + (iblock - Int32(1)) * N

    if i <= length(indices)
        @inbounds @inline f(iterator[indices[i]])
    end
end
```
As discussed, this should become an extension.
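As a usage sketch (hypothetical, assuming the definitions above are loaded and KernelAbstractions.jl is available), the same `parallel_foreach` call works on the CPU backend or, with the corresponding package loaded, on a GPU backend:

```julia
using KernelAbstractions

backend = KernelAbstractions.CPU()  # or e.g. a GPU backend from CUDA.jl etc.

out = zeros(Int, 16)
# The do-block closure receives one element of the iterator per work item.
parallel_foreach(eachindex(out), backend) do i
    out[i] = i^2
end
# Kernels may launch asynchronously, so synchronize before reading `out`.
KernelAbstractions.synchronize(backend)
```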
```julia
@par TrixiBase.SerialBackend() for i in 1:10
    @test i in 1:10
end
```

Currently, this is only testing that the macro doesn't throw errors and that it doesn't go out of bounds. How about:

```suggestion
result = zeros(Int, 10)
@par TrixiBase.SerialBackend() for i in 1:10
    result[i] = i
end
@test result == 1:10
```
Heavily based on @efaulhaber's infrastructure in PointNeighbors.jl.

During Trudi, @efaulhaber, @benegee, and I were discussing how to reduce the code duplication currently in trixi-framework/Trixi.jl#2590.

Currently we have a lot of fairly simple kernels like:

With the old code looking more like

@efaulhaber has been using a macro of this style (and a clear evolution of Trixi's own `@threaded`) to great effect, so it would make sense to have one shared definition.
Considerations:

- The `default_backend` definition makes that non-sensical.
- AcceleratedKernels.jl has a very similar definition in `foreachindex`, but also has the parallel `mapreduce` we would need in Trixi: https://github.com/JuliaGPU/AcceleratedKernels.jl/blob/e641c281f9373825f9852fb84457c608b6074586/src/foreachindex.jl#L122
- I could see this definition living in KernelAbstractions (but it would then kinda conflict with AcceleratedKernels, and the Polyester dependency is a no-go).
Putting this up as a draft so that we can mull it over.
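For readers coming to this later: a rough, hypothetical sketch of how such an `@par` macro can lower a loop into a `parallel_foreach` call, so that the backend argument alone selects the parallelization strategy (illustrative only, not the implementation in this PR):

```julia
struct SerialBackend end

# Serial fallback: just apply `f` to every element of the iterator.
@inline parallel_foreach(f, iterator, ::SerialBackend) = foreach(f, iterator)

# The macro pulls apart `for i in iterator ... end` and rewrites the body as
# a closure, so dispatch on the backend type picks the execution strategy.
macro par(backend, loop)
    @assert loop.head == :for
    var = loop.args[1].args[1]       # loop variable, e.g. `i`
    iterator = loop.args[1].args[2]  # e.g. `1:10`
    body = loop.args[2]
    return esc(quote
        parallel_foreach($iterator, $backend) do $var
            $body
        end
    end)
end

# Usage:
acc = Int[]
@par SerialBackend() for i in 1:3
    push!(acc, i^2)
end
# acc == [1, 4, 9]
```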