Implement common parallel functionality for trixi-framework#66
vchuravy wants to merge 7 commits into
Conversation
Codecov Report ❌ Patch coverage is

```
@@ Coverage Diff @@
##             main      #66      +/-   ##
==========================================
- Coverage   94.56%   92.52%   -2.05%
==========================================
  Files           5        7       +2
  Lines         313      361      +48
==========================================
+ Hits          296      334      +38
- Misses         17       27      +10
```
I don't think it makes sense to include so that the semidiscretization defines the backend. And then you can make the

To keep dependencies small, there could also be an argument to make KA an extension.

Thanks a lot for this push! I haven't looked at the implementation yet, but I very much like the idea of having a sensible default approach for multithreading and offloading to accelerators for all our packages, such that improvements/extensions directly benefit all downstream packages.

Regarding dependencies and where to put them: I'd like to make using GPUs as easy as using standard Trixi{P,SW,A}.jl. Having to load additional packages is somewhat counter to that goal. On the other hand, we could define a package "TrixiBoost.jl" (or "TrixiWarp.jl" or "TrixiNitro.jl") that captures all the necessary additional dependencies and the common code. These are just some thoughts I had; I do not have a definite answer.

On a more procedural note: would it make sense to keep this PR as a reference for the parallel implementations for now, and create a new PR to Trixi.jl that converts the GPU implementation in trixi-framework/Trixi.jl#2590 (hopefully merged soon) to this form of acceleration? My thinking is that this would help us better understand whether the same form works well for both TrixiParticles.jl and Trixi.jl, and if so, we can then go ahead and tackle the detailed question of whether this should live here or in another, potentially optional, upstream package.
If Polyester and KA are not added as dependencies and only in package extensions, then you only have to write
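(For context: Julia's package-extension mechanism loads such glue code automatically once the weak dependency is imported by the user. A rough sketch of the corresponding `Project.toml` entries; the extension names are hypothetical and the UUIDs are placeholders, not values from this PR:)

```toml
# Hypothetical Project.toml fragment for TrixiBase.jl (UUIDs are placeholders).
[weakdeps]
Polyester = "00000000-0000-0000-0000-000000000000"
KernelAbstractions = "00000000-0000-0000-0000-000000000000"

[extensions]
# Each extension module is loaded automatically once its weak dependency
# is imported in the user's session.
TrixiBasePolyesterExt = "Polyester"
TrixiBaseKernelAbstractionsExt = "KernelAbstractions"
```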
Why would it not? This is just a convenient way to write kernels. Even if you want to thread over the nodes and not the elements, you can simply do something like this:

```julia
iterator = CartesianIndices((1:n_elements, 1:n_nodes_per_element))

@par backend for i in iterator
    element, node = Tuple(i)
    ...
end
```
I agree. I have some projects where I depend on TrixiBase.jl and which do not support GPUs. I would like to avoid (implicitly) depending on KA.jl for these projects. Therefore, I would prefer an extension, too.

Thanks for the clarification. I misunderstood the extension part and thought it would require doing

IMHO we also need to come up with a better name than the original, historically chosen

KernelAbstractions is intentionally a fairly light dependency, but I understand the sentiment. The only problem is the definition of

We don't actually use this in TrixiP and PointNeighbors other than to restrict functions to not take any other types where a parallelization backend is required, but
Could this problem be circumvented by defining a small struct like

```julia
struct KernelAbstractionsBackend{B <: KernelAbstractions.Backend} <: AbstractThreadingBackend
    backend::B
end
```

in the extension, such that it is included when dispatching on
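For illustration, a self-contained sketch of that wrapper idea; the function name and structure are assumptions based on this discussion, not code from the PR:

```julia
# Main package side: only the abstract supertype is defined here, so the
# main package carries no KernelAbstractions dependency.
abstract type AbstractThreadingBackend end

# Extension side (would live in a package extension that imports KA):
using KernelAbstractions

struct KernelAbstractionsBackend{B <: KernelAbstractions.Backend} <: AbstractThreadingBackend
    backend::B
end

# Functions restricted to `AbstractThreadingBackend` now also accept the
# wrapped KA backend; this method simply unwraps it before dispatching on.
# (`parallel_foreach` here stands in for whatever entry point the package uses.)
function parallel_foreach(f, iterator, wrapper::KernelAbstractionsBackend)
    parallel_foreach(f, iterator, wrapper.backend)
end
```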
I don't think

Does that mean that in principle, we can proceed with this PR?
efaulhaber left a comment:

> Does that mean that in principle, we can proceed with this PR?

Yes
```julia
const ParallelizationBackend = Union{AbstractThreadingBackend, KernelAbstractions.Backend}
```
```julia
@inline function TrixiBase.parallel_foreach(f::F, iterator, ::PolyesterBackend) where {F}
    Polyester.@batch for i in iterator
        @inline f(i)
    end
end
```
Would it make sense to define something like

```julia
@inline function TrixiBase.parallel_foreach(f, iterator, ::PolyesterBackend)
    error("Polyester.jl must be imported in order to use `PolyesterBackend`")
end
```

in the main package outside of this extension?
```julia
# On GPUs, execute `f` inside a GPU kernel with KernelAbstractions.jl
@inline function parallel_foreach(f::F, iterator,
                                  backend::KernelAbstractions.Backend) where {F}
    # On the GPU, we can only loop over `1:N`. Therefore, we loop over
    # `1:length(iterator)` and index with `iterator[eachindex(iterator)[i]]`.
    # Note that this only works with vector-like iterators that support
    # arbitrary indexing.
    indices = eachindex(IndexLinear(), iterator)
    ndrange = length(indices)

    # Skip empty loops
    ndrange == 0 && return

    # Call the generic kernel that is defined below, which only calls a
    # function with the global GPU index.
    foreach_ka(backend)(f, iterator, indices, ndrange = ndrange)
end

@kernel unsafe_indices=true function foreach_ka(f, iterator, indices)
    # Calculate global index
    N = @groupsize()[1]
    iblock = @index(Group, Linear)
    ithread = @index(Local, Linear)
    i = ithread + (iblock - Int32(1)) * N

    if i <= length(indices)
        @inbounds @inline f(iterator[indices[i]])
    end
end
```
As discussed, this should become an extension.
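As a usage sketch (hypothetical, assuming the definitions above are loaded and KernelAbstractions.jl is available), the same `parallel_foreach` call works on the CPU backend or, with the corresponding package loaded, on a GPU backend:

```julia
using KernelAbstractions

backend = KernelAbstractions.CPU()  # or e.g. a GPU backend from CUDA.jl etc.

out = zeros(Int, 16)
# The do-block closure receives one element of the iterator per work item.
parallel_foreach(eachindex(out), backend) do i
    out[i] = i^2
end
# Kernels may launch asynchronously, so synchronize before reading `out`.
KernelAbstractions.synchronize(backend)
```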
```julia
@par TrixiBase.SerialBackend() for i in 1:10
    @test i in 1:10
end
```

Currently, this is only testing that the macro doesn't throw errors and that it doesn't go out of bounds. How about:

```suggestion
result = zeros(Int, 10)
@par TrixiBase.SerialBackend() for i in 1:10
    result[i] = i
end
@test result == 1:10
```
Heavily based on @efaulhaber's infrastructure in PointNeighbors.jl.

During Trudi, @efaulhaber, @benegee, and I were discussing how to reduce the code duplication currently in trixi-framework/Trixi.jl#2590.

Currently we have a lot of fairly simple kernels like:

With the old code looking more like

@efaulhaber has been using a macro of this style (and a clear evolution of Trixi's own `@threaded`) to great effect, so it would make sense to have one shared definition.
Considerations:

- The `default_backend` definition makes that non-sensical.
- AcceleratedKernels.jl has a very similar definition in `foreachindex`, but also has the parallel `mapreduce` we would need in Trixi: https://github.com/JuliaGPU/AcceleratedKernels.jl/blob/e641c281f9373825f9852fb84457c608b6074586/src/foreachindex.jl#L122
- I could see this definition living in KernelAbstractions (but it would then kinda conflict with AcceleratedKernels, and the Polyester dependency is a no-go).
Putting this up as a draft so that we can mull it over.
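For readers coming to this later: a rough, hypothetical sketch of how such an `@par` macro can lower a loop into a `parallel_foreach` call, so that the backend argument alone selects the parallelization strategy (illustrative only, not the implementation in this PR):

```julia
struct SerialBackend end

# Serial fallback: just apply `f` to every element of the iterator.
@inline parallel_foreach(f, iterator, ::SerialBackend) = foreach(f, iterator)

# The macro pulls apart `for i in iterator ... end` and rewrites the body as
# a closure, so dispatch on the backend type picks the execution strategy.
macro par(backend, loop)
    @assert loop.head == :for
    var = loop.args[1].args[1]       # loop variable, e.g. `i`
    iterator = loop.args[1].args[2]  # e.g. `1:10`
    body = loop.args[2]
    return esc(quote
        parallel_foreach($iterator, $backend) do $var
            $body
        end
    end)
end

# Usage:
acc = Int[]
@par SerialBackend() for i in 1:3
    push!(acc, i^2)
end
# acc == [1, 4, 9]
```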