[Software] Nvidia gpu stops being useable after a time

New to Debian (Or Linux in general)? Ask your questions here!
quddus
Posts: 1
Joined: 2024-02-19 02:10

#1 Post by quddus »

Hello. I am new to Debian and working with a new desktop with an RTX 3090 that I use for generative AI workflows with Stable Diffusion, using tools like ComfyUI. Here is my "About this System" information:

Operating System: Debian GNU/Linux 12
KDE Plasma Version: 5.27.5
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 6.1.0-17-amd64 (64-bit)
Graphics Platform: Wayland
Processors: 24 × AMD Ryzen 9 7900X 12-Core Processor
Memory: 30.5 GiB of RAM
Graphics Processor: AMD Radeon Graphics
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: B650 AORUS ELITE AX

The issue I am having is that the GPU often becomes unavailable after a period of time. Restarting the computer seems to get it working again.

When it becomes unavailable, `nvidia-smi` returns `Unable to determine the device handle for GPU0000:01:00.0: Unknown Error`

and `nvidia-debugdump --dumpall` returns
```
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
```
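
To pin down when the GPU drops out, a small polling script could record the first time `nvidia-smi` stops working, which gives a timestamp to compare against the system logs. This is only a rough sketch: the function names and the 60-second interval are made up for illustration, and it assumes a failed `nvidia-smi` either exits non-zero or prints the error text quoted above.

```python
import subprocess
import time
from datetime import datetime

# Error text quoted from the nvidia-smi output in the post above.
FAILURE_TEXT = "Unable to determine the device handle"


def gpu_responds(run=subprocess.run):
    """Return True if nvidia-smi runs cleanly (hypothetical helper)."""
    try:
        result = run(["nvidia-smi"], capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # nvidia-smi missing or hanging also counts as "not responding".
        return False
    output = (result.stdout or "") + (result.stderr or "")
    # Assumption: a healthy GPU gives exit status 0 and no error text.
    return result.returncode == 0 and FAILURE_TEXT not in output


def wait_for_failure(interval_s=60, check=gpu_responds,
                     sleep=time.sleep, now=datetime.now):
    """Poll until the check fails, then return the time of that failure."""
    while check():
        sleep(interval_s)
    return now()
```

Calling `wait_for_failure()` in a terminal before starting a ComfyUI session would block until the GPU stops responding and then return a `datetime` for the first failed check.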

Possibly related: https://wiki.debian.org/NvidiaGraphicsDrivers mentions that if `lspci | grep -E "VGA|3D"` returns two lines, you have an Optimus card. For me the command returns:

```
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
10:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c2)
```

So I take that to mean it is an Optimus card, but when I look at the steps for that I don't really understand what I am supposed to do. It looks like the optimal solution is NVIDIA PRIME render offload. But that doesn't look like a configuration; it reads to me as something I prepend to commands. Does that mean I should prepend it to any command that will be using the GPU? In the case of ComfyUI, would I run `__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia python main.py` after activating the conda environment that I set up for ComfyUI? When I do that, I still get the same error:

```
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
```

How can I solve this so the GPU is ready whenever I need it?

Thanks in advance.

FreewheelinFrank
Global Moderator
Posts: 2117
Joined: 2010-06-07 16:59
Has thanked: 38 times
Been thanked: 232 times

#2 Post by FreewheelinFrank »

Not an Nvidia user, but I'll give you a bump. Maybe some of our Nvidia users will notice the topic.

Could it be the power supply crapping out under high load? Is the Nvidia GPU actually doing work when it becomes unavailable? The application sounds quite intense.

Is there anything in the journal when it happens?

```
# journalctl -b -1
```

if you have rebooted, or

```
# journalctl -b
```

for the current boot.
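
The journal for a whole boot can be long, so it may help to narrow it down to lines from the Nvidia kernel driver. Here is a rough sketch of one way to do that. The helper names are made up, and the patterns are an assumption about what to look for: the driver prefixes its kernel messages with `NVRM`, and GPU faults are usually reported as `Xid` events, but other messages could be relevant too.

```python
import re
import subprocess

# Assumed patterns: "NVRM" is the Nvidia kernel module's log prefix,
# "Xid" marks GPU fault reports, and "fallen off the bus" is a common
# message when the card stops answering on PCIe.
GPU_ERROR = re.compile(r"NVRM|\bXid\b|fallen off the bus", re.IGNORECASE)


def gpu_error_lines(journal_text):
    """Return the journal lines that look like Nvidia driver errors."""
    return [line for line in journal_text.splitlines()
            if GPU_ERROR.search(line)]


def read_kernel_journal(boot="-1"):
    """Read kernel messages for one boot (-1 = previous, 0 = current).

    Needs the same privileges as running journalctl by hand.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", boot, "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

After a reboot, `print("\n".join(gpu_error_lines(read_kernel_journal("-1"))))` would show only the matching kernel lines from the boot in which the GPU became unavailable.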
