QDMA: release cdev region on device cleanup to stop major-number leak #397
Open
andrew-bolin wants to merge 1 commit into
qdma_cdev_device_cleanup() only cleared xcb->cdev_major and never called
unregister_chrdev_region(), so the 4096-minor character-device region a
board allocates in qdma_cdev_device_init() was leaked on every device
remove (e.g. PCIe remove/rescan). Only the module-wide qdma_cdev_cleanup()
at unload ever freed these regions.
Repeated remove/rescan cycles therefore exhaust the kernel's dynamic
char-device major pool. Once full, alloc_chrdev_region() returns -EBUSY
("CHRDEV dynamic allocation region is full") and probe fails:
qdma_cdev_device_init: unable to allocate cdev region -16.
qdma-pf: probe of 0000:XX:00.1 failed with error -16
Fix by reference-counting each board's major (multiple PFs on the same
bus share one major via the existing dedup path). qdma_cdev_device_cleanup()
now decrements the count and releases the region with
unregister_chrdev_region() + list_del() + kfree() when the last PF on the
board goes away. The module-wide cleanup remains as a safety net.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
qdma_cdev_device_cleanup() clears xcb->cdev_major but never calls unregister_chrdev_region(). The 4096-minor character-device region that each board allocates in qdma_cdev_device_init() is therefore leaked on every device remove (for example a PCIe remove/rescan). The region is only ever reclaimed by the module-wide qdma_cdev_cleanup() at module unload.

As a result, repeated remove/rescan cycles steadily consume the kernel's pool of dynamic character-device major numbers. Once the pool is exhausted, alloc_chrdev_region() returns -EBUSY and probe fails:

CHRDEV "qdma-pf" dynamic allocation region is full
qdma_pf:qdma_cdev_device_init: unable to allocate cdev region -16.
qdma-pf: probe of 0000:XX:00.1 failed with error -16

This was observed with V80 FPGAs (QDMA PF on XX:00.1): after a number of remove/rescan cycles, /proc/devices showed ~126 qdma-pf majors held, filling the entire 384–511 dynamic range, and all subsequent probes failed with -16.

The only stock recovery is a full rmmod/modprobe, which is undesirable in production.
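
The failure mode can be reproduced in a small userspace model. This is only a sketch: it models just the 384–511 dynamic range mentioned above, and every name in it (model_alloc_major, model_release_major, probes_before_failure) is hypothetical rather than taken from the driver or the kernel.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified model: only the 384-511 dynamic range named in the report. */
#define DYN_MAJOR_FIRST 384
#define DYN_MAJOR_LAST  511
#define DYN_MAJOR_COUNT (DYN_MAJOR_LAST - DYN_MAJOR_FIRST + 1)

static bool major_in_use[DYN_MAJOR_COUNT];

/* Stand-in for alloc_chrdev_region(): hand out the highest free major. */
static int model_alloc_major(void)
{
    for (int i = DYN_MAJOR_COUNT - 1; i >= 0; i--) {
        if (!major_in_use[i]) {
            major_in_use[i] = true;
            return DYN_MAJOR_FIRST + i;
        }
    }
    return -EBUSY; /* pool exhausted */
}

/* Stand-in for unregister_chrdev_region(). */
static void model_release_major(int major)
{
    major_in_use[major - DYN_MAJOR_FIRST] = false;
}

/*
 * Run remove/rescan cycles for a single device.
 * With release_on_remove == false the major is forgotten without being
 * freed, which is the behaviour of the old cleanup path.
 * Returns how many probes succeeded before one failed, or max_cycles
 * if no probe failed.
 */
int probes_before_failure(bool release_on_remove, int max_cycles)
{
    for (int i = 0; i < DYN_MAJOR_COUNT; i++)
        major_in_use[i] = false;

    for (int cycle = 0; cycle < max_cycles; cycle++) {
        int major = model_alloc_major();      /* probe */
        if (major < 0)
            return cycle;
        if (release_on_remove)                /* remove */
            model_release_major(major);
    }
    return max_cycles;
}
```

In this model, probes_before_failure(false, 1000) stops after 128 successful probes, one per slot in the range, while probes_before_failure(true, 1000) runs all 1000 cycles without a failure.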
Fix
Reference-count each board's major number. Multiple PFs on the same bus already share one major via the dedup path in qdma_cdev_device_init(), so the count is incremented there and initialised to 1 for a freshly allocated board. qdma_cdev_device_cleanup() now decrements the count and, when the last PF on the board goes away, releases the region with unregister_chrdev_region() + list_del() + kfree(). The module-wide qdma_cdev_cleanup() remains as a safety net.

With this change, remove/rescan no longer leaks majors and the -16 exhaustion no longer occurs.
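
The reference-counting scheme can be sketched in plain userspace C. This is not the patch itself: struct board_major, the model_* functions and the fake_* allocator are hypothetical stand-ins, a hand-rolled singly linked list replaces the kernel list helpers, and locking is omitted.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-board record; the driver's real structure may differ. */
struct board_major {
    struct board_major *next;
    int bus;       /* PFs on the same bus share one major */
    int major;
    int refcount;  /* number of PFs currently using this major */
};

static struct board_major *boards; /* boards that currently hold a region */
static int next_major = 511;       /* fake allocator state, never recycled */
static int regions_held;           /* regions allocated and not yet released */

/* Stand-in for alloc_chrdev_region(). */
static int fake_alloc_region(void)
{
    regions_held++;
    return next_major--;
}

/* Stand-in for unregister_chrdev_region(). */
static void fake_release_region(int major)
{
    (void)major;
    regions_held--;
}

/* Models device init: reuse the board's major or allocate a new one. */
int model_device_init(int bus)
{
    struct board_major *b;

    for (b = boards; b; b = b->next) {
        if (b->bus == bus) {       /* dedup path: another PF on this board */
            b->refcount++;
            return b->major;
        }
    }

    b = malloc(sizeof(*b));
    if (!b)
        return -ENOMEM;
    b->bus = bus;
    b->major = fake_alloc_region();
    b->refcount = 1;               /* first PF on a freshly allocated board */
    b->next = boards;
    boards = b;
    return b->major;
}

/* Models device cleanup: drop one reference, release on the last one. */
void model_device_cleanup(int bus)
{
    for (struct board_major **pp = &boards; *pp; pp = &(*pp)->next) {
        struct board_major *b = *pp;

        if (b->bus != bus)
            continue;
        if (--b->refcount == 0) {
            fake_release_region(b->major); /* unregister_chrdev_region() */
            *pp = b->next;                 /* list_del() */
            free(b);                       /* kfree() */
        }
        return;
    }
}

int model_regions_held(void)
{
    return regions_held;
}
```

Two model_device_init() calls on the same bus return the same major and hold one region; the region is released only by the second matching model_device_cleanup(). Repeating an init/cleanup pair any number of times leaves model_regions_held() at zero, which is the property the fix is after.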
Testing
Tested on a host with six V80s (QDMA 2024.1.0, kernel 5.15.0-179-generic), counting grep -c qdma-pf /proc/devices across PCIe remove/rescan cycles on one (idle) card:
Before the fix: the count grew with every remove/rescan cycle while the number of bound PFs stayed at 6. In production this had accumulated ~126 leaked majors, filling the 384–511 dynamic range and failing all subsequent probes with -16.
After the fix: the count stayed constant across remove/rescan cycles, with all 6 PFs re-binding each time. Normal load/unload/queue setup unaffected. Builds clean under -Werror.
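
For test harnesses where shelling out to grep is inconvenient, the same count can be taken from C. The helper below is a minimal sketch with a hypothetical name (count_matching_lines). It assumes every line fits in the buffer, which holds for the short entries in /proc/devices.

```c
#include <stdio.h>
#include <string.h>

/*
 * Count the lines in `path` that contain `name`, similar to
 * `grep -c name path`. Returns -1 if the file cannot be opened.
 * Assumes each line is shorter than the 256-byte buffer.
 */
int count_matching_lines(const char *path, const char *name)
{
    char line[256];
    int count = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, name))
            count++;
    }
    fclose(f);
    return count;
}
```

Calling count_matching_lines("/proc/devices", "qdma-pf") before and after each remove/rescan cycle gives the number being compared in the test above.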