use HAS_GPU to determine if cuda is available #364

Open
jperez999 wants to merge 6 commits into NVIDIA-Merlin:main from jperez999:gpu-count-cuda

@jperez999

Collaborator

This PR changes how we determine whether CUDA is available on the system. We move from numba to HAS_GPU, which relies on the NVML device count. If there are no devices, CUDA is not available; otherwise it is.
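For readers following along, the check described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `nvml_device_count` and `cuda_available` are hypothetical names, and only the `pynvml` calls (`nvmlInit`, `nvmlDeviceGetCount`, `nvmlShutdown`) are real library functions.

```python
"""Sketch: derive a HAS_GPU-style flag from the NVML device count."""


def nvml_device_count() -> int:
    """Return the number of GPUs NVML reports, or 0 if NVML is unusable.

    Hypothetical helper for illustration; not the Merlin implementation.
    """
    try:
        import pynvml  # optional dependency, often absent on CPU-only hosts
    except ImportError:
        return 0
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No driver or no NVML shared library: treat as "no devices".
        return 0
    try:
        return int(pynvml.nvmlDeviceGetCount())
    except pynvml.NVMLError:
        return 0
    finally:
        pynvml.nvmlShutdown()


# Zero devices means CUDA is treated as unavailable; anything else means available.
HAS_GPU = nvml_device_count() > 0
cuda_available = HAS_GPU  # hypothetical name for the derived flag
```

Querying NVML this way avoids importing numba just to learn whether a device exists, and it falls back to "no GPU" instead of raising on machines without a driver.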

@jperez999 jperez999 requested a review from rjzamora January 12, 2024 00:31
@jperez999 jperez999 self-assigned this Jan 12, 2024
@copy-pr-bot

copy-pr-bot Bot commented Jan 12, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-364

@pentschev

Although this is tripping that block, I would suggest always using PyNVML to query GPU information. Specifically, what I mention in #363 (comment) can be dangerous with Dask if, for some reason, the cuda = None is removed in the future.

Comment thread merlin/core/compat/__init__.py Outdated
@jperez999 jperez999 added the bug Something isn't working label Jan 12, 2024

@rjzamora rjzamora left a comment

Contributor

Linting may be off, but the content looks good to me. Thanks @jperez999!

@jperez999

Collaborator Author

This is not ready. The failures during writing occur when you write a file with a client available. Will continue investigating.

@jperez999

Collaborator Author

Investigated; it seems the logic for int_slice_size was not foolproof. Because of the floor divide, you can end up in a scenario where the df has fewer records than the int_slice_size, and that can result in a zero. Then, when you take a mod by that zero, the thread raises an exception. I do wonder how we hit this now and not before.
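A minimal sketch of the failure mode described above, assuming the slice size comes from floor-dividing a row count by a larger number. The function names and the `max(1, ...)` clamp are hypothetical illustrations, not the fix applied in this PR.

```python
def slice_size(num_rows: int, num_slices: int) -> int:
    """Floor-divide rows across slices; yields 0 when num_rows < num_slices."""
    return num_rows // num_slices


def guarded_slice_size(num_rows: int, num_slices: int) -> int:
    """Same computation, clamped to at least 1 so it is safe as a modulus."""
    return max(1, num_rows // num_slices)


rows, slices = 3, 8
size = slice_size(rows, slices)  # 3 // 8 == 0

try:
    rows % size  # modulo by zero
    failed = False
except ZeroDivisionError:
    failed = True  # this is the exception the worker thread would raise

safe = guarded_slice_size(rows, slices)  # clamped to 1
remainder = rows % safe  # no exception with the clamped value
```

Clamping is only one possible guard; skipping the slicing step entirely for very small frames would avoid the zero just as well.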

@jperez999

Collaborator Author

/ok to test

@rjzamora

Contributor

> I do wonder how we hit this now and not before.

I agree that this is strange. I wonder if I was wrong about pynvml_mem_size being "the same".

@jperez999

Collaborator Author

/ok to test

@jperez999

Collaborator Author

/ok to test
