Improve performance of h5trav interfaces for links to objects by jhendersonHDF · Pull Request #6400 · HDFGroup/hdf5

jhendersonHDF · 2026-05-06T22:52:59Z

For objects with multiple hard links, use a hash table to map between object tokens and pathnames during traversal. This avoids a linear scan over all previously seen objects for each hard link encountered.

Use a separate hash table for the h5trav "table" interface to map between object tokens and an index into the table of visited objects. This allows quick lookups of objects when adding hard link name aliases for h5repack processing.
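As a rough illustration of the idea (not the code in this PR, which uses uthash), here is a minimal chained hash table that maps a token to an index into a table of visited objects. Every name here (`token_t`, `seen_map`, `seen_find`, `seen_add`) is a hypothetical stand-in, and the 16-byte token and the FNV-1a hash are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOKEN_SIZE 16   /* assumed token size, stand-in for H5O_token_t */
#define NBUCKETS   1024 /* fixed power-of-two bucket count, for brevity */

/* Hypothetical stand-in for an object token */
typedef struct { unsigned char data[TOKEN_SIZE]; } token_t;

/* One entry: token -> index into the table of visited objects */
typedef struct seen_entry {
    token_t            token;
    size_t             idx;
    struct seen_entry *next;
} seen_entry;

typedef struct { seen_entry *buckets[NBUCKETS]; } seen_map;

/* FNV-1a over the raw token bytes */
static uint32_t token_hash(const token_t *t)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < TOKEN_SIZE; i++) {
        h ^= t->data[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 and sets *idx if the token was seen before, else 0 */
static int seen_find(const seen_map *m, const token_t *t, size_t *idx)
{
    const seen_entry *e = m->buckets[token_hash(t) & (NBUCKETS - 1)];
    for (; e; e = e->next)
        if (memcmp(e->token.data, t->data, TOKEN_SIZE) == 0) {
            *idx = e->idx;
            return 1;
        }
    return 0;
}

/* Returns 0 on success, -1 on allocation failure; call seen_find first */
static int seen_add(seen_map *m, const token_t *t, size_t idx)
{
    seen_entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    uint32_t b = token_hash(t) & (NBUCKETS - 1);
    e->token      = *t;
    e->idx        = idx;
    e->next       = m->buckets[b];
    m->buckets[b] = e;
    return 0;
}

static void seen_free(seen_map *m)
{
    for (size_t b = 0; b < NBUCKETS; b++) {
        seen_entry *e = m->buckets[b];
        while (e) {
            seen_entry *next = e->next;
            free(e);
            e = next;
        }
        m->buckets[b] = NULL;
    }
}

/* Usage: the second hard link to an object finds the index stored for the first */
static size_t demo_second_link_index(void)
{
    seen_map m;
    token_t  obj = {{0xAB, 0xCD}};
    size_t   idx = 0;

    memset(&m, 0, sizeof m);
    if (!seen_find(&m, &obj, &idx)) /* first link: not seen yet */
        assert(seen_add(&m, &obj, 3) == 0);

    idx = 0;
    int found = seen_find(&m, &obj, &idx); /* second link: O(1) expected */
    seen_free(&m);
    assert(found);
    return idx;
}
```

Each additional hard link then costs one expected constant-time lookup instead of a scan over every object visited so far.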

See the linked issue for context

jhendersonHDF · 2026-05-06T22:54:40Z

 #include "hdf5.h"

-/* Typedefs for visiting objects */
-typedef herr_t (*h5trav_obj_func_t)(const char *path_name, const H5O_info2_t *oinfo, const char *first_seen,


These typedefs had to be moved down so that the new trav_seen_t type is declared before the typedefs that take it as a parameter.

jhendersonHDF · 2026-05-06T22:56:32Z

    trav_obj_t *objs;
+
+    /* Private data for this trav_table_t */
+    void *priv_data;


This is really a pointer to a structure with a UT_hash_handle, but exposing the uthash API at the h5trav.h level is problematic due to its header-only nature and already being used in H5private.h. This is just a quick hack to hide the implementation details.
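For readers unfamiliar with the pattern, a minimal sketch of hiding implementation state behind a `void *` member might look like the following. The names `pub_table_t`, `table_priv_t`, and the `table_*` functions are hypothetical and are not the ones used in h5trav; the private struct here only holds a counter where the real one would hold the hash table head.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- what a public header might expose (hypothetical names) ---- */
typedef struct pub_table_t {
    size_t nobjs;
    void  *priv_data; /* opaque to callers; no hash-table types leak out */
} pub_table_t;

/* ---- what only the implementation file knows about ---- */
typedef struct table_priv_t {
    size_t lookups; /* placeholder for a hash-table head pointer */
} table_priv_t;

/* Returns 0 on success, -1 on allocation failure */
static int table_init(pub_table_t *t)
{
    table_priv_t *p = calloc(1, sizeof *p);
    if (!p)
        return -1;
    t->nobjs     = 0;
    t->priv_data = p;
    return 0;
}

/* Only the implementation converts priv_data back to its real type */
static size_t table_note_lookup(pub_table_t *t)
{
    table_priv_t *p = t->priv_data;
    return ++p->lookups;
}

static void table_free(pub_table_t *t)
{
    free(t->priv_data);
    t->priv_data = NULL;
}

/* Usage: callers only ever touch the public struct */
static size_t demo_opaque(void)
{
    pub_table_t t;
    assert(table_init(&t) == 0);
    (void)table_note_lookup(&t);
    size_t n = table_note_lookup(&t);
    table_free(&t);
    assert(t.priv_data == NULL);
    return n;
}
```

The trade-off is the loss of type checking on `priv_data`, which is why the comment above calls it a quick hack.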

I wonder if it makes sense to just include H5private.h here

Because uthash.h is header-only, including it from H5private.h made it somewhat difficult to use different settings for it in different parts of the library. In h5trav.c only, I needed to change HASH_KEYCMP to use H5Otoken_cmp() to keep compatibility. Including H5private.h in h5trav.h turned this into a first-to-include-uthash.h-wins scenario, since H5private.h also includes uthash.h with its own settings. The scope of the uthash.h include should probably be reduced from H5private.h to where it's actually needed.
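The ordering problem can be simulated in a single file. This sketch assumes the library installs its default key comparator only when the macro is not already defined, which is the usual pattern for this kind of override. The `LIB_*` macros and `tool_keycmp()` are hypothetical stand-ins, not uthash's real contents, and the comparator deliberately ignores the last key byte only so that the override is observable.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical comparator the tool wants instead of memcmp:
 * it ignores the final byte of the key. */
static int tool_keycmp(const void *a, const void *b, size_t n)
{
    return n ? memcmp(a, b, n - 1) : 0;
}

/* The tool sets its override BEFORE the library header is first seen */
#define LIB_KEYCMP(a, b, n) tool_keycmp(a, b, n)

/* ---- begin simulated header-only library ("lib.h") ---- */
#ifndef LIB_H
#define LIB_H
#ifndef LIB_KEYCMP
#define LIB_KEYCMP(a, b, n) memcmp(a, b, n) /* default, only if nothing was set */
#endif
#define LIB_KEYS_EQUAL(a, b, n) (LIB_KEYCMP(a, b, n) == 0)
#endif
/* ---- end simulated header ---- */

/* Usage: keys that differ only in the last byte compare equal under the
 * override, although plain memcmp reports a difference. */
static int keys_equal_under_override(void)
{
    unsigned char k1[4] = {1, 2, 3, 0};
    unsigned char k2[4] = {1, 2, 3, 9};
    assert(memcmp(k1, k2, sizeof k1) != 0);
    return LIB_KEYS_EQUAL(k1, k2, sizeof k1);
}
```

If another header had already pulled in the simulated "lib.h" with the default in place, the include guard would make a later include a no-op and a plain `#define LIB_KEYCMP` afterwards would clash with the definition already installed. That is the sense in which the first include wins.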

jhendersonHDF · 2026-05-06T22:59:36Z

+ * where a visited object was placed to facilitate quicker
+ * lookups when adding path aliases
+ */
+typedef struct trav_table_hash_t {


This structure is mostly for h5repack, which now has its own hash table separate from the main one used by the other h5trav interfaces. I considered modifying the h5trav interfaces so that h5repack could share a single hash table, but that would have been fairly awkward and would have imposed unnecessary memory overhead on other tools that don't need the extra information stored. Instead, I used a separate hash table specifically for the h5trav "table" interface.

jhendersonHDF · 2026-05-06T23:00:59Z

+ *-------------------------------------------------------------------------
+ */
+static int
+trav_token_visited_cmp(hid_t loc_id, const H5O_token_t *token1, const H5O_token_t *token2)


This function is called by HASH_FIND. It just wraps H5Otoken_cmp() so that tokens are compared through the token API rather than with a plain memcmp of the object token bytes.
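A sketch of what such an adapter can look like, under stated assumptions: `fake_token_cmp()` is a stand-in that only mimics the shape of H5Otoken_cmp() (status return, result through an out-parameter), `token_t` and `long long` replace the real token and identifier types, and reporting a failed comparison as "not equal" is a choice made for this sketch rather than something confirmed by the PR.

```c
#include <assert.h>
#include <string.h>

#define TOKEN_SIZE 16 /* assumed token size */

/* Hypothetical stand-in for an object token */
typedef struct { unsigned char data[TOKEN_SIZE]; } token_t;

/* Stand-in with the same shape as H5Otoken_cmp(): negative on failure,
 * comparison result written to *cmp_value. loc_id is unused here. */
static int fake_token_cmp(long long loc_id, const token_t *t1, const token_t *t2,
                          int *cmp_value)
{
    (void)loc_id;
    if (!t1 || !t2 || !cmp_value)
        return -1;
    *cmp_value = memcmp(t1->data, t2->data, TOKEN_SIZE);
    return 0;
}

/* memcmp-style adapter for a hash table's key comparison hook:
 * 0 means equal, non-zero means different or comparison failed. */
static int token_visited_cmp(long long loc_id, const token_t *t1, const token_t *t2)
{
    int cmp = 0;
    if (fake_token_cmp(loc_id, t1, t2, &cmp) < 0)
        return 1; /* assumption: treat a failed comparison as "not equal" */
    return cmp;
}

/* Usage: identical tokens match, different tokens do not */
static int demo_cmp_wrapper(void)
{
    token_t a = {{1}}, b = {{1}}, c = {{2}};
    assert(token_visited_cmp(0, &a, &b) == 0);
    assert(token_visited_cmp(0, &a, &c) != 0);
    return token_visited_cmp(0, &a, &b);
}
```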

jhendersonHDF · 2026-05-06T23:01:52Z

        udata.fields        = fields;

+        /* Check for multiple links to top group */
+        if (oinfo.rc > 1)


Moved this down below the udata initialization in case the hash table gets initialized by the call to trav_token_add().

mattjala

Only a couple minor issues

fortnern · 2026-06-04T17:23:14Z

+    /* HASH_ADD modifies what's pointed to by objects_seen_ptr when it
+     * initializes the hash table after being called for the first time
+     */
+    HASH_ADD(hh, (*objects_seen_ptr), obj.token, sizeof(H5O_token_t), entry);


Are we certain that tokens never contain garbage bytes? I suspect so, but I just want to make sure, since they are stored as "MAX_TOKEN_SIZE", implying that some connectors don't use all of the bytes.

I see that H5VL_native_addr_to_token zeroes the unused bytes, just curious how strong our documentation is that third party VOLs should do the same.

That's true, I did make an assumption here and I haven't found any specific documentation mentioning that the bytes should be zeroed out or set to a specific value yet. We do provide H5O_TOKEN_UNDEF for undefined token values which in theory would be an initializer, but is not really documented as such. Without that assumption and without knowing how many valid bytes there are, I suppose you can't do much with an object token other than pass it to the token APIs. To be 100% sure here, we would need to either document that, or may need to have a callback for returning a hashed version of the token.
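To make the concern concrete, here is a small sketch of why byte-wise hashing or comparison depends on the unused bytes being set consistently. `token_t`, `make_token()`, and the 4-byte identifier are hypothetical; the 16-byte buffer is an assumed size.

```c
#include <assert.h>
#include <string.h>

#define TOKEN_BYTES 16 /* assumed fixed token buffer size */

/* Hypothetical stand-in for an object token */
typedef struct { unsigned char data[TOKEN_BYTES]; } token_t;

/* Hypothetical connector that fills only the first `used` bytes.
 * If zero_tail is non-zero the unused bytes are cleared; otherwise they
 * are filled with `junk` to simulate uninitialized memory. */
static void make_token(token_t *t, const unsigned char *id, size_t used,
                       int zero_tail, unsigned char junk)
{
    memset(t->data, zero_tail ? 0 : junk, TOKEN_BYTES);
    memcpy(t->data, id, used);
}

/* What a hash table keyed on the whole buffer effectively does */
static int tokens_bytewise_equal(const token_t *a, const token_t *b)
{
    return memcmp(a->data, b->data, TOKEN_BYTES) == 0;
}

/* Usage: the same object identifier produces matching keys only when
 * the unused bytes are zeroed. */
static void demo_padding(void)
{
    const unsigned char id[4] = {0xDE, 0xAD, 0xBE, 0xEF};
    token_t a, b;

    make_token(&a, id, sizeof id, 0, 0x11);
    make_token(&b, id, sizeof id, 0, 0x22);
    assert(!tokens_bytewise_equal(&a, &b)); /* same object, different keys */

    make_token(&a, id, sizeof id, 1, 0x11);
    make_token(&b, id, sizeof id, 1, 0x22);
    assert(tokens_bytewise_equal(&a, &b)); /* zeroed tail: keys agree */
}
```

With garbage in the tail, a second hard link to the same object would hash to a different key and be treated as a new object, which is why the zeroing assumption matters here.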

fortnern · 2026-06-04T17:28:14Z

        udata.fields        = fields;

+        /* Check for multiple links to top group */
+        if (oinfo.rc > 1) {


Is this fixing a bug?

This is the same logic as above, just had to be moved down a bit.

fortnern · 2026-06-04T17:36:11Z

+            return FAIL;
+    }
+    else {
        for (i = 0; i < table->nobjs; i++) {


Would it be worth adding a hash table for the path names to accelerate this?

Probably yes. I mostly tried to limit the scope of this to the performance problem I directly observed, but there's also likely a similar issue with soft links, as well as possibly with the path names here.

jhendersonHDF added the "Component - Tools" label (command-line tools like h5dump, includes high-level tools) May 6, 2026

github-project-automation Bot added this to HDF5 - TRIAGE & TRACK May 6, 2026

github-project-automation Bot moved this to To be triaged in HDF5 - TRIAGE & TRACK May 6, 2026

jhendersonHDF linked an issue May 6, 2026 that may be closed by this pull request

HDF5 tools performance issue for multiply-linked objects #6399

Open

jhendersonHDF commented May 6, 2026

jhendersonHDF marked this pull request as ready for review May 6, 2026 23:30

jhendersonHDF requested review from bmribler, brtnfld, derobins, fortnern, glennsong09, lrknox, mattjala, qkoziol and vchoi-hdfgroup as code owners May 6, 2026 23:30

mattjala assigned mattjala and fortnern May 8, 2026

mattjala reviewed May 8, 2026

Comment thread tools/lib/h5trav.c

mattjala reviewed May 8, 2026

Comment thread tools/lib/h5trav.c Outdated

mattjala previously approved these changes May 8, 2026

jhendersonHDF dismissed mattjala’s stale review via 869cc8f May 8, 2026 17:42

jhendersonHDF added 2 commits May 8, 2026 12:42

Add check for non-NULL visited object before returning pointer

869cc8f

Add error check

1110aeb

ajelenak added this to the HDF5 2.2.0 milestone Jun 2, 2026

fortnern reviewed Jun 4, 2026

fortnern approved these changes Jun 4, 2026

4 participants
