OOM with 8 A800 80G #59

@Fayeww

Hi, you did a great job!

While running the training scripts, I hit an OOM error right after step 1:

(main_task pid=6705) collected 1100 / 1024 rollouts and each prompt has 4 responses
(main_task pid=6705) rollout batch size: 1024
(main_task pid=6705) reward: 511.8 seconds
(main_task pid=6705) adv: 1.4 seconds
Error ....ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

This is odd, because I'm using the same resources and parameters as you (7B model).
The 8 GPUs correspond to 8 worker processes (PIDs). When I monitored memory with top, each PID was at 12.1% just before the crash.
I also tried the approach from the Ray docs of setting num_cpus to limit the number of concurrently running tasks. With num_cpus=4 the task stays pending forever because its resource demand cannot be met; it seems each GPU is strictly paired with 1 CPU.
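
As a rough sanity check on the numbers above, the sketch below adds up the per-process memory and tests whether 8 one-CPU workers fit into 4 CPUs. It rests on several assumptions that are not confirmed in this report: that top's %MEM is a share of total host RAM, that the 8 workers are the main consumers (shared memory may be counted more than once across PIDs), that Ray's memory monitor is running with its default 0.95 threshold (`RAY_memory_usage_threshold`), and that each GPU worker really reserves exactly 1 CPU. The function names are hypothetical helpers, not part of Ray or the training code.

```python
# Figures taken from the report above.
NUM_WORKERS = 8                 # one worker process per GPU
MEM_PERCENT_PER_WORKER = 12.1   # %MEM per PID as shown by top

# Assumption: Ray's default memory-monitor threshold; it may be overridden
# in a given cluster via the RAY_memory_usage_threshold environment variable.
RAY_MEMORY_USAGE_THRESHOLD = 0.95


def host_memory_fraction(num_workers: int, mem_percent_per_worker: float) -> float:
    """Estimate the fraction of host RAM used by all workers combined."""
    return num_workers * mem_percent_per_worker / 100.0


def bundles_schedulable(num_bundles: int, cpus_per_bundle: int, available_cpus: int) -> bool:
    """Return True if every bundle's CPU demand can be satisfied at once."""
    return num_bundles * cpus_per_bundle <= available_cpus


used = host_memory_fraction(NUM_WORKERS, MEM_PERCENT_PER_WORKER)
over_threshold = used > RAY_MEMORY_USAGE_THRESHOLD

# Assumption: 1 CPU reserved per GPU worker, as guessed in the report.
can_schedule_with_4_cpus = bundles_schedulable(NUM_WORKERS, 1, 4)

print(f"estimated host memory in use: {used:.1%}")
print(f"above Ray kill threshold: {over_threshold}")
print(f"8 workers schedulable with num_cpus=4: {can_schedule_with_4_cpus}")
```

Under these assumptions the workers alone would sit above the threshold at which Ray starts killing tasks, and lowering num_cpus below the number of GPU workers would leave some of them unschedulable, which would match both symptoms described here.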

Finally, I switched to 4 GPUs, but now I get a CUDA out-of-memory error, even though I changed train_batch_size to 64.

I'm really confused by this and would like to get the 8-GPU setup working. Could you help?
