OOM with 8 A800 80G #59

Hi, you did a great job! 

While running the training scripts, I hit an OOM error right after step 1:

(main_task pid=6705) collected 1100 / 1024 rollouts and each prompt has 4 responses
(main_task pid=6705) rollout batch size: 1024
(main_task pid=6705) reward: 511.8 seconds
(main_task pid=6705) adv: 1.4 seconds
Error ....ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

This is odd, because I'm using the same resources and parameters as you (7B model).
The 8 GPUs correspond to 8 PIDs, and when I monitored memory usage with `top`, each PID was at 12.1% just before the crash.
I also tried the approaches from the Ray docs, such as setting `num_cpus` to limit the number of concurrently running tasks. With `num_cpus=4` the task stays pending forever because its resource demands cannot be met. It seems each worker is strictly packed as 1 GPU plus 1 CPU.
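
As a rough sanity check of the numbers above, here is a small sketch in plain Python (it does not call Ray). It sums the per-PID `top` figures and checks whether 8 workers, each needing 1 GPU and 1 CPU, could be scheduled with only 4 CPUs. The 0.95 kill threshold is my understanding of Ray's default for `RAY_memory_usage_threshold` and should be treated as an assumption; the function names are hypothetical and not taken from the training code.

```python
# Back-of-the-envelope check of the reported numbers; this is not Ray code.

# Assumption: Ray's memory monitor kills tasks once node memory use exceeds
# this fraction (believed to be the default of RAY_memory_usage_threshold).
ASSUMED_KILL_THRESHOLD = 0.95


def node_memory_fraction(per_process_percent: float, num_processes: int) -> float:
    """Sum per-process %MEM values from `top` into a node-level fraction.

    This is an upper bound: `top` reports resident memory, so pages shared
    between processes are counted once per process.
    """
    return per_process_percent * num_processes / 100.0


def bundles_fit(num_workers: int, per_worker: dict, cluster: dict) -> bool:
    """Return True if all workers' resource bundles can be placed at once."""
    return all(
        num_workers * need <= cluster.get(name, 0)
        for name, need in per_worker.items()
    )


used = node_memory_fraction(12.1, 8)
print(f"estimated node memory use: {used:.1%}")
print("above assumed kill threshold:", used > ASSUMED_KILL_THRESHOLD)
print(
    "8 workers fit with 4 CPUs:",
    bundles_fit(8, {"GPU": 1, "CPU": 1}, {"GPU": 8, "CPU": 4}),
)
```

If the assumption holds, 8 PIDs at 12.1% each add up to about 96.8% of host RAM, which is enough to trigger the kill, and 8 one-CPU bundles can never fit into 4 CPUs, which would explain the task pending forever.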

Finally, I switched to 4 GPUs, but now I get a CUDA out-of-memory error instead, even after changing `train_batch_size` to 64.

I'm really confused by this, and I'd like to get the 8-GPU setup working. Could you help me with that?
 
