Merge and improve NVIDIA measurement overhead reduction feature #12

Open
boyuhang66 wants to merge 18 commits into caps-tum:develop from boyuhang66:feature-integrate-overhead-reduction

Summary

This MR finalizes the NVIDIA measurement overhead reduction feature and includes several improvements and fixes identified during testing.

Changes

  • Cleaned up and simplified the implementation.
  • Improved NCU kernel matching:
    • Use mangled names for automatically extracted kernels.
    • Use function names for user-specified kernels.
    • Removed the hardcoded skip limit to correctly profile short-lived kernels.

Qichen Liu and others added 18 commits June 12, 2026 02:26
- Extract `valid_analyses` array and `is_valid_analysis` function to the main GPUscout driver script to centralize parameter validation.
- Refactor the metric collection flow and remove the long, duplicated case blocks.
- Use the `validate_csv_list_syntax` and `normalize_csv_list` helper functions.
- Remove the redundant empty token check inside the loop because pre-validation already handles it.
- Replace the long case statement and hardcoded execution lists with a dynamic loop to cut down on code duplication.
- Group variables into arrays and load them using `declare -n` to make adding new analysis modules easier in the future.
- Update timing functions to generate text labels automatically, removing the need for manual console print logs in each branch.
- Add a binary availability check to catch missing or uncompiled tools early instead of letting the script fail silently.
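
The refactoring described in the commits above could look roughly like the following bash sketch. It is not the GPUscout code: the analysis names, the `*_cfg` arrays, `describe_analysis`, and `run_selected` are hypothetical placeholders, and the bodies of `is_valid_analysis`, `validate_csv_list_syntax`, and `normalize_csv_list` are guesses at what helpers with those names might do. It requires bash 4.3 or newer for namerefs.

```bash
#!/usr/bin/env bash
# All names and values below are hypothetical placeholders,
# not GPUscout's real analysis modules.
valid_analyses=(analysis_a analysis_b)

# Per-analysis settings: (binary, label). Looked up by name via a nameref.
analysis_a_cfg=(sh "Analysis A")
analysis_b_cfg=(no_such_tool_xyz "Analysis B")

is_valid_analysis() {
  local candidate=$1 known
  for known in "${valid_analyses[@]}"; do
    [[ $known == "$candidate" ]] && return 0
  done
  return 1
}

# Strip all whitespace so "a, b" and "a,b" are treated alike.
normalize_csv_list() {
  printf '%s' "${1//[[:space:]]/}"
}

# Reject empty tokens up front (empty list, leading, trailing, or doubled commas).
validate_csv_list_syntax() {
  [[ -n $1 && $1 != ,* && $1 != *, && $1 != *,,* ]]
}

# Resolve one analysis through a nameref and check that its binary exists.
describe_analysis() {
  local -n cfg="${1}_cfg"
  if command -v "${cfg[0]}" >/dev/null 2>&1; then
    echo "${cfg[1]}: ready"
  else
    echo "${cfg[1]}: missing binary ${cfg[0]}"
    return 1
  fi
}

# Dynamic loop over the requested analyses instead of one case branch each.
run_selected() {
  local list name names
  list=$(normalize_csv_list "$1")
  validate_csv_list_syntax "$list" || { echo "malformed list"; return 2; }
  IFS=',' read -ra names <<< "$list"
  for name in "${names[@]}"; do
    is_valid_analysis "$name" || { echo "unknown analysis: $name"; return 2; }
    describe_analysis "$name" || return 1
  done
}
```

With this layout, adding an analysis would only require a new entry in `valid_analyses` and a matching `*_cfg` array. Because the list is validated before the loop, the loop itself needs no empty-token check.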
….sh. GPUscout.sh.in already handles this validation during early argument parsing, ensuring only valid names reach this stage.
…modes and fix short kernel profiling

- Fix empty profile outputs (0 rows) caused by demangled function names not matching the hardware parameter signatures in Nsight Compute.
- For `kernels_selection_mode` = `auto_from_generated_sass`: Stop stripping mangled names to their base forms. Keep raw mangled symbols from SASS and set `--kernel-name-base mangled` to ensure exact hardware matching. This also prevents overloaded functions from being lost during de-duplication.
- For `kernels_selection_mode` = `user`: Set `--kernel-name-base function` to allow matching on the plain function name.
- Remove `-s 5 --launch-count 1` from the `ncu` command to stop skipping early iterations, ensuring short-lived or test kernels are captured properly instead of producing empty rows.
- Comment out unused `extract_kernel_base_name_from_symbol` and `build_auto_ncu_kernel_patterns` blocks.
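
As an illustration of the mode-dependent matching described above, the following bash sketch builds the kernel filter arguments for `ncu`. The function `build_ncu_kernel_args`, the `ncu_args` array, and the example kernel name are hypothetical. Only `--kernel-name-base` and `--kernel-name` are real `ncu` options, and the sketch assumes one kernel name per invocation, which may differ from how the PR passes multiple kernels.

```bash
#!/usr/bin/env bash
# Sketch: choose the ncu --kernel-name-base value from the selection mode.
# Fills the global array ncu_args; does not invoke ncu itself.
build_ncu_kernel_args() {
  local mode=$1 kernel=$2 base
  case $mode in
    auto_from_generated_sass) base=mangled ;;   # raw symbols taken from SASS
    user)                     base=function ;;  # plain name given by the user
    *) echo "unknown kernels_selection_mode: $mode" >&2; return 1 ;;
  esac
  # Note: no launch skip (-s) or launch count limit is added here,
  # so early launches of short-lived kernels are not skipped.
  ncu_args=(--kernel-name-base "$base" --kernel-name "$kernel")
}

# Hypothetical usage; the mangled symbol is an example, not from the PR.
build_ncu_kernel_args auto_from_generated_sass _Z6vecAddPKfS0_Pfi
printf '%s\n' "ncu ${ncu_args[*]} ./app"
```

Keeping the mangled symbol in automatic mode also keeps overloads apart, since two overloads share a function name but not a mangled name.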
…ngs]

- Implement a manual importer fallback to bypass internal `nsys-importer` crashes caused by strict Linux kernel security settings (`kernel.perf_event_paranoid = 4`).
- Add compatibility for legacy `nsys` toolchains (e.g., v2022.4.2) during the `nsys stats` stage. Automatically fall back to the older `gpukernsum` report name if the newer `cuda_gpu_kern_sum` report is not generated.
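
The report-name fallback could be sketched in bash as follows. This is an assumption-laden illustration, not the PR's implementation: `run_kernel_summary`, `NSYS_CMD`, and `used_report` are hypothetical names, and the sketch treats a non-zero exit status or empty output as "report not generated", whereas the PR may instead check for an output file.

```bash
#!/usr/bin/env bash
# Sketch: try the newer report name first, then the legacy one.
# NSYS_CMD (hypothetical) lets the nsys executable be swapped out for testing.
run_kernel_summary() {
  local input=$1 nsys_cmd=${NSYS_CMD:-nsys} report output
  for report in cuda_gpu_kern_sum gpukernsum; do
    # Assumption: a failed or unknown report yields a non-zero exit or no output.
    if output=$("$nsys_cmd" stats --report "$report" --format csv "$input" 2>/dev/null) \
        && [[ -n $output ]]; then
      used_report=$report          # record which report name worked
      printf '%s\n' "$output"
      return 0
    fi
  done
  echo "no GPU kernel summary report could be generated" >&2
  return 1
}
```

A caller would pass the capture file as the only argument, for example `run_kernel_summary profile.nsys-rep > kernels.csv`, where the file names are placeholders.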