QDMA: release cdev region on device cleanup to stop major-number leak #397

Open
andrew-bolin wants to merge 1 commit into Xilinx:master from andrew-bolin:fix-qdma-cdev-region-leak

Problem

qdma_cdev_device_cleanup() clears xcb->cdev_major but never calls
unregister_chrdev_region(). The 4096-minor character-device region that each
board allocates in qdma_cdev_device_init() is therefore leaked on every device
remove (for example a PCIe remove/rescan). The region is only ever reclaimed by
the module-wide qdma_cdev_cleanup() at module unload.

As a result, repeated remove/rescan cycles steadily consume the kernel's pool of
dynamic character-device major numbers. Once the pool is exhausted,
alloc_chrdev_region() returns -EBUSY and probe fails:

CHRDEV "qdma-pf" dynamic allocation region is full
qdma_pf:qdma_cdev_device_init: unable to allocate cdev region -16.
qdma-pf: probe of 0000:XX:00.1 failed with error -16

This was observed with V80 FPGAs (QDMA PF on XX:00.1): after a number of
remove/rescan cycles, /proc/devices showed ~126 qdma-pf majors held, filling
the entire 384–511 dynamic range, and all subsequent probes failed with -16.
The only stock recovery is a full rmmod / modprobe, which is undesirable in
production.

Fix

Reference-count each board's major number. Multiple PFs on the same bus already
share one major via the dedup path in qdma_cdev_device_init(), so the count is
incremented there and initialised to 1 for a freshly allocated board.
qdma_cdev_device_cleanup() now decrements the count and, when the last PF on
the board goes away, releases the region with unregister_chrdev_region() +
list_del() + kfree(). The module-wide qdma_cdev_cleanup() remains as a
safety net.
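
The reference-counting scheme described above can be modelled in user space. This is a minimal sketch, not the driver code: `struct board_major`, the fixed-size `boards` table, `MAX_BOARDS`, and the `model_*` function names are all hypothetical, and a plain counter stands in for the effect of alloc_chrdev_region() and unregister_chrdev_region().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names come from the driver. */
#define MAX_BOARDS 8          /* stands in for the finite dynamic major pool */

struct board_major {
    int in_use;               /* slot currently holds a live region */
    int bus;                  /* PCI bus number identifying the board */
    int refcount;             /* number of PFs sharing this board's major */
};

static struct board_major boards[MAX_BOARDS];
static int regions_held;      /* models majors visible in /proc/devices */

static struct board_major *find_board(int bus)
{
    for (size_t i = 0; i < MAX_BOARDS; i++)
        if (boards[i].in_use && boards[i].bus == bus)
            return &boards[i];
    return NULL;
}

/* Models per-device init: reuse the board's region or allocate a new one. */
int model_device_init(int bus)
{
    struct board_major *b = find_board(bus);

    if (b) {                  /* dedup path: a PF on this bus already has a major */
        b->refcount++;
        return 0;
    }
    for (size_t i = 0; i < MAX_BOARDS; i++) {
        if (!boards[i].in_use) {
            boards[i].in_use = 1;
            boards[i].bus = bus;
            boards[i].refcount = 1;   /* fresh allocation starts at 1 */
            regions_held++;           /* models alloc_chrdev_region() */
            return 0;
        }
    }
    return -16;               /* -EBUSY: no free major left */
}

/* Models per-device cleanup: release the region when the last PF leaves. */
void model_device_cleanup(int bus)
{
    struct board_major *b = find_board(bus);

    if (!b)
        return;
    assert(b->refcount > 0);
    if (--b->refcount == 0) {
        b->in_use = 0;        /* models list_del() + kfree() */
        regions_held--;       /* models unregister_chrdev_region() */
    }
}

int model_regions_held(void)
{
    return regions_held;
}
```

In this model, two PFs on the same bus hold a single region, and the region is returned only when the second one is cleaned up. Dropping the decrement-and-release branch from `model_device_cleanup()` reproduces the leak: every init after a cleanup would take a new slot until `model_device_init()` returns -16.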

With this change, remove/rescan no longer leaks majors and the -16 exhaustion
no longer occurs.

Testing

Tested on a host with six V80s (QDMA 2024.1.0, kernel
5.15.0-179-generic), counting the qdma-pf majors with
grep -c qdma-pf /proc/devices across PCIe remove/rescan cycles on one
(idle) card:

  • Before (stock module): baseline 6 majors → 7 → 8, climbing by one per
    remove/rescan cycle while the number of bound PFs stayed at 6. In production
    this had accumulated ~126 leaked majors, filling the 384–511 dynamic range
    and failing all subsequent probes with -16.
  • After (this patch): baseline 6 majors, holding steady at 6 across four
    remove/rescan cycles, with all 6 PFs re-binding each time. Normal
    load/unload/queue setup unaffected. Builds clean under -Werror.
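
The major count used in the measurements above can also be taken programmatically. The following is a rough sketch rather than part of the patch: `count_chrdev_majors()` is a hypothetical helper that takes the text of /proc/devices as a string and counts entries in the "Character devices:" section whose name matches exactly. It is slightly stricter than `grep -c qdma-pf`, which counts any line containing the substring.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: count character-device majors registered under
 * `name` in /proc/devices-style text. Lines look like "509 qdma-pf".
 */
int count_chrdev_majors(const char *text, const char *name)
{
    int count = 0;
    int in_char_section = 0;
    const char *line = text;

    assert(text != NULL && name != NULL);

    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        char buf[128];
        char dev[64];
        int major;

        /* Copy one line into a bounded, NUL-terminated buffer. */
        if (len >= sizeof(buf))
            len = sizeof(buf) - 1;
        memcpy(buf, line, len);
        buf[len] = '\0';

        if (strncmp(buf, "Character devices:", 18) == 0)
            in_char_section = 1;
        else if (strncmp(buf, "Block devices:", 14) == 0)
            in_char_section = 0;
        else if (in_char_section &&
                 sscanf(buf, "%d %63s", &major, dev) == 2 &&
                 strcmp(dev, name) == 0)
            count++;

        if (!end)
            break;
        line = end + 1;
    }
    return count;
}
```

A test loop could read /proc/devices into a buffer before and after each remove/rescan cycle and compare the two counts. With the leak present the count grows by one per cycle, and with the fix it should stay equal to the number of boards.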

qdma_cdev_device_cleanup() only cleared xcb->cdev_major and never called
unregister_chrdev_region(), so the 4096-minor character-device region a
board allocates in qdma_cdev_device_init() was leaked on every device
remove (e.g. PCIe remove/rescan). Only the module-wide qdma_cdev_cleanup()
at unload ever freed these regions.

Repeated remove/rescan cycles therefore exhaust the kernel's dynamic
char-device major pool. Once full, alloc_chrdev_region() returns -EBUSY
("CHRDEV dynamic allocation region is full") and probe fails:

  qdma_cdev_device_init: unable to allocate cdev region -16.
  qdma-pf: probe of 0000:XX:00.1 failed with error -16

Fix by reference-counting each board's major (multiple PFs on the same
bus share one major via the existing dedup path). qdma_cdev_device_cleanup()
now decrements the count and releases the region with
unregister_chrdev_region() + list_del() + kfree() when the last PF on the
board goes away. The module-wide cleanup remains as a safety net.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>