[Software] Nvidia gpu stops being useable after a time

#1 Post by **quddus** » 2024-02-19 02:33

Hello. I am new to Debian and am working on a new desktop with an RTX 3090 that I use for generative AI workflows with Stable Diffusion, using tools like ComfyUI. Here is my "About this System" output:

Operating System: Debian GNU/Linux 12
KDE Plasma Version: 5.27.5
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 6.1.0-17-amd64 (64-bit)
Graphics Platform: Wayland
Processors: 24 × AMD Ryzen 9 7900X 12-Core Processor
Memory: 30.5 GiB of RAM
Graphics Processor: AMD Radeon Graphics
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: B650 AORUS ELITE AX

The issue I am having is that the GPU often becomes unavailable after a period of time. Restarting the computer seems to get it working again.

When it becomes unavailable, `nvidia-smi` returns `Unable to determine the device handle for GPU0000:01:00.0: Unknown Error`

`nvidia-debugdump --dumpall`

returns
```
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
```

Possibly related: https://wiki.debian.org/NvidiaGraphicsDrivers mentions that if `lspci | grep -E "VGA|3D"` returns two lines, you have an Optimus card. For me the command returns:

```
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
10:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c2)
```

So I take that to mean it is an Optimus card, but when I look at the steps for that I don't really understand what I am supposed to do. It looks like the optimal solution is NVIDIA PRIME render offload. But that doesn't look like a configuration; it reads to me as something I prepend to commands. Does that mean I should prepend it to any command that will be using the GPU? In the case of ComfyUI, would I run `__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia python main.py` after activating the conda environment that I set up for ComfyUI? In that case I still get the same error:

```
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
```
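To tell a PyTorch-level problem apart from a driver-level one, it can help to probe `nvidia-smi` right before launching the workload. The following is only a minimal sketch: `gpu_responds` is a hypothetical helper name, and it assumes that `nvidia-smi` exits with a non-zero status when it prints an error like the one above.

```python
import subprocess


def gpu_responds(cmd=("nvidia-smi",), timeout=30):
    """Return True if the probe command exits with status 0.

    Assumption: nvidia-smi exits non-zero when it cannot reach the GPU.
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Command missing or hung: treat both as "GPU not usable".
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if gpu_responds():
        print("nvidia-smi OK: the driver can still talk to the GPU")
    else:
        print("nvidia-smi failed or was not found: "
              "the problem is likely below PyTorch")
```

If the probe fails while `python main.py` also reports no CUDA GPUs, the cause is probably in the driver or hardware rather than in the conda environment.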

How can I solve this so the GPU is ready whenever I need it?

Thanks in advance.

#2 Post by **FreewheelinFrank** » 2024-02-24 08:33

Not an Nvidia user, but I'll give you a bump. Maybe some of our Nvidia users will notice the topic.

Could it be the power supply crapping out under high load? Is the Nvidia GPU actually doing work when it becomes unavailable? The application sounds quite intense.

Is there anything in the journal when it happens?

Run `journalctl -b -1` as root if you have rebooted since, or `journalctl -b` for the current boot.
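Since a full journal is long, it may be easier to keep only the kernel lines that mention the NVIDIA driver. This is a rough sketch, not a definitive diagnostic: the helper names are hypothetical, and the pattern assumes the driver tags its kernel messages with `NVRM`, `Xid`, or `nvidia`, so it may miss relevant lines or match unrelated ones.

```python
import re
import subprocess

# Assumed markers for NVIDIA driver messages in the kernel log.
GPU_PATTERN = re.compile(r"NVRM|Xid|nvidia", re.IGNORECASE)


def gpu_log_lines(lines):
    """Keep only the lines that match the NVIDIA-related pattern."""
    return [line for line in lines if GPU_PATTERN.search(line)]


def read_kernel_journal(boot="-1"):
    """Return kernel messages for a boot (-1 is the previous boot).

    Reading the full journal may need root or membership in a
    journal-reading group, depending on the system.
    """
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", boot, "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # journalctl is not installed on this system.
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in gpu_log_lines(read_kernel_journal()):
        print(line)
```

Pass `boot="0"` to `read_kernel_journal` to look at the current boot instead of the previous one.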

Debian User Forums
