Skip to content

(#Towards 3420) Move OMPParallelTrans to create the data sharing attributes at transformation time#3421

Draft
LonelyCat124 wants to merge 8 commits into
masterfrom
3420_clause_gen_transformation
Draft

(#Towards 3420) Move OMPParallelTrans to create the data sharing attributes at transformation time#3421
LonelyCat124 wants to merge 8 commits into
masterfrom
3420_clause_gen_transformation

Conversation

@LonelyCat124

Copy link
Copy Markdown
Collaborator

I have a draft implementation of moving OMPParallelTrans to do clause generation (using logic that is still located in the directive, so can be called independently if the transformation isn't used for whatever reason). Reduction logic isn't all available at transformation time, so I've not touched that yet (maybe when Kern <=> Call is done we can revisit that?)

@arporter @sergisiso If either of you could have a look and let me know i fyou think I've overlooked anything with this basic change before I look at some of the other potential improvements we can do as well. I'm going to set ITs going as well to see if this breaks any of the transformation scripts/runtime behaviour of real codes too.

@codecov

codecov Bot commented Apr 30, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.96%. Comparing base (ca8610a) to head (5e1ceb2).
⚠️ Report is 209 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3421   +/-   ##
=======================================
  Coverage   99.96%   99.96%           
=======================================
  Files         391      391           
  Lines       54609    54627   +18     
=======================================
+ Hits        54590    54608   +18     
  Misses         19       19           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sergisiso

Copy link
Copy Markdown
Collaborator

@LonelyCat124 I have a couple of questions

  1. the NEMO integration tests passed by diffing with the previous version it seems that is missing the private attribute for loop variable indices:
>     !$omp teams loop collapse(2) default(shared) private(ji,jj)
870c870
<       !$omp teams loop collapse(2) default(shared)
---
>       !$omp teams loop collapse(2) default(shared) private(ji,jj)
888c888
<     !$omp teams loop collapse(2) default(shared)
---
>     !$omp teams loop collapse(2) default(shared) private(idx_4,idx_5)
904c904
<     !$omp teams loop collapse(2) default(shared)
---
>     !$omp teams loop collapse(2) default(shared) private(idx_6,idx_7)
922c922
<     !$omp teams loop collapse(2) default(shared)
---
>     !$omp teams loop collapse(2) default(shared) private(ji,jj)
1024c1024
<       !$omp teams loop collapse(2) default(shared)

Is this valid with a default(shared)? can we recover them?

  1. The infer_sharing_attributes is still done in the lowering, could we avoid it? It is the larger contributor to the backend time:
image

@LonelyCat124

Copy link
Copy Markdown
Collaborator Author

@sergisiso Ready for another look now.

I think 1) is fixed - I'd missed some cases that I now catch.

  1. I think we can't yet - the reason is we need the need_sync variables in lowering for adding reduction clauses, which we can't do a transformation time due to LFRicKern (or Kern in general) and reproducible reductions I think.

@sergisiso

Copy link
Copy Markdown
Collaborator

the reason is we need the need_sync variables in lowering for adding reduction clauses, which we can't do a transformation time due to LFRicKern (or Kern in general) and reproducible reductions I think.

Is this because Kern does not report its used symbols in the reference_accesses that the infer_sharing_attributes uses to do its analysis? That may not be a problem (in the future) as providing them as virtual_children as we do in GOcean Kerns would fix it and is something that we plan to do anyway to make the DUC work with them to replace the old halo_exchange analysis.

But, is there a more fundamental problem that ANY symbol/reference that is created at lowering (if we don't repeat the infer_sharing_attributes) will become shared without synchronisation (the default)? I am not sure how I feel about this yet.

@sergisiso

Copy link
Copy Markdown
Collaborator

I think the list of aspects that we need to evaluate are:

  1. Check if it makes the decisions regarding which loops are parallel and which variables are private easier if they are done together instead of in very different places.
  2. Check if not repeating the same analysis in two different places (during transformation and after lowering) also prevents issues caused if we get different results (but we lose the ability to be smart about changes done latter, e.g in posterior transformations or new variables introduced during lowering ).
  3. Check if doing it the associated reference_access analysis once is noticeably faster as it is potentially a expensive analysis (but with the current hoisting is not a dominating factor).

This won't be possible only we really can skip the lowering call we need for the reductions, so I would block this PR until we do the Kern reference_access children.

@LonelyCat124

Copy link
Copy Markdown
Collaborator Author

the reason is we need the need_sync variables in lowering for adding reduction clauses, which we can't do a transformation time due to LFRicKern (or Kern in general) and reproducible reductions I think.

Is this because Kern does not report its used symbols in the reference_accesses that the infer_sharing_attributes uses to do its analysis? That may not be a problem (in the future) as providing them as virtual_children as we do in GOcean Kerns would fix it and is something that we plan to do anyway to make the DUC work with them to replace the old halo_exchange analysis.

But, is there a more fundamental problem that ANY symbol/reference that is created at lowering (if we don't repeat the infer_sharing_attributes) will become shared without synchronisation (the default)? I am not sure how I feel about this yet.

I'm not sure, I just went with the comment that says:

        # Now finish the LFRic reductions (which need lowering first)
        if reduction_kernels and not reprod_red_call_list:
            self._add_reduction_clauses()

Looking at this again now, I do have a slight concern with reproducible reductions:

        # Reproducible reduction will be done serially by accumulating the
        # partial results in an array indexed by the thread index
        reprod_red_call_list = self.reductions(reprod=True)
        if reprod_red_call_list:
            table = self.ancestor(Routine).symbol_table
            omp_lib = table.find_or_create(
                "omp_lib", symbol_type=ContainerSymbol)
            omp_get_thrd = table.find_or_create_tag(
                "omp_get_thread_num", symbol_type=RoutineSymbol,
                interface=ImportInterface(omp_lib))
            table.find_or_create_tag("omp_num_threads",
                                     root_name="nthreads",
                                     symbol_type=DataSymbol,
                                     datatype=INTEGER_TYPE)
            thread_idx = table.find_or_create_tag(
                "omp_thread_index", root_name="th_idx",
                symbol_type=DataSymbol, datatype=INTEGER_TYPE)
            assignment = Assignment.create(
                lhs=Reference(thread_idx),
                rhs=BinaryOperation.create(
                        BinaryOperation.Operator.ADD,
                        Call.create(omp_get_thrd),
                        Literal("1", INTEGER_TYPE))
            )
            self.dir_body.addchild(assignment, 0)

The omp_thread_index variable is not added to the private clause (if its not already present) despite being clearly thread private (omp_get_thread_num), which works correctly for subclasses of OMPParallel (maybe? But with default perhaps not), since they're also loop directives, but feels wrong on the OMPParallelDirective itself. We don't appear to test this anywhere - have we used reproducible reductions anywhere in production, as I'm concerned about this section of code.

From looking at Kern.reduction_sum_loop in general, it seems questionably defined/implemented for general OMPParallel as it is, but I'd need to test generating some code to be able to correctly see whats going on @sergisiso

@sergisiso

Copy link
Copy Markdown
Collaborator

@LonelyCat124 We have an issue to change this implementation, see #3409 , maybe we can do that first, as we need it regardless?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants