Welcome to SparkyLinux forums

Spontaneous system reboot. Nvidia GL issue?

Started by sparrowsion, December 24, 2019, 12:11:55 PM


Symptom: the system spontaneously reboots under certain circumstances. No warnings, and nothing logged that I can see. (This is only the third time I've seen anything like it in 25 years of running Linux at home and at work! But I'm generally running Debian stable on oldish hardware, which is not the case here.)

I've spent three weeks trying to narrow down the circumstances, and finally have something repeatable. I'm satisfied that it's not a basic hardware problem: my initial suspicion was overheating, which was one of the other causes of spontaneous reboots I've seen, but I've ruled that out with stress testing. Nor does it look like a specific app: the next suspect was the Unity game engine, since I'd had reboots when running two different games built with it. The primary suspect now is something around the Nvidia/GL drivers and the GTX 1050 Ti, because FurMark consistently runs for less than 5 seconds before the system reboots (and the last GPU temperature I can log is barely above 30C), while CUDA stress testing gets the GPU way hotter than that without any problems.
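For anyone trying to reproduce this, here is a rough sketch of how a "last temperature before the reboot" reading can be captured. It is not the exact method used above, just one way to do it: it polls `nvidia-smi` and forces every sample to disk so the final line survives an abrupt reset. The function names, the log path, and the 0.2 s interval are all made up for illustration; the only external piece is the `nvidia-smi --query-gpu=temperature.gpu` query.

```python
import os
import subprocess
import time

# Query only the GPU core temperature, as a bare number per GPU.
QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"]


def read_gpu_temp():
    """Return the first GPU's temperature in degrees C (hypothetical helper)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temp(out.stdout)


def parse_temp(text):
    """Parse nvidia-smi output; with several GPUs, only the first line is used."""
    return int(text.strip().splitlines()[0])


def log_samples(path, read=read_gpu_temp, count=None, interval=0.2):
    """Append 'unix_time temp' lines, syncing each one to disk before sleeping.

    count=None loops until interrupted (or until the machine resets).
    """
    taken = 0
    with open(path, "a") as log:
        while count is None or taken < count:
            log.write(f"{time.time():.3f} {read()}\n")
            log.flush()
            os.fsync(log.fileno())  # push the sample past the page cache
            taken += 1
            time.sleep(interval)
    return taken
```

Start `log_samples("gpu-crash.log")` in one terminal, launch FurMark in another, and after the reboot the last line of the file is the final reading taken. Without the `fsync` call the most recent samples could still be sitting in memory when the system resets, which is why it is done on every write despite the overhead.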

sion@theseus:~$ inxi -Fr
  Host: theseus Kernel: 5.3.0-3-amd64 x86_64 bits: 64 Desktop: Xfce 4.14.1
  Distro: SparkyLinux 6 (Po-Tolo)
  Type: Desktop Mobo: ASUSTeK model: PRIME B365M-A v: Rev X.0x
  serial: <root required> UEFI: American Megatrends v: 1202 date: 07/24/2019
  Topology: 8-Core model: Intel Core i7-9700 bits: 64 type: MCP
  L2 cache: 12.0 MiB
  Speed: 800 MHz min/max: 800/4700 MHz Core speeds (MHz): 1: 800 2: 800
  3: 800 4: 800 5: 800 6: 800 7: 801 8: 801
  Device-1: Intel UHD Graphics 630 driver: i915 v: kernel
  Device-2: NVIDIA GP107 [GeForce GTX 1050 Ti] driver: nvidia v: 430.64
  Display: x11 server: X.Org 1.20.6 driver: modesetting,nvidia
  unloaded: fbdev,nouveau,vesa resolution: 1360x768~60Hz
  OpenGL: renderer: GeForce GTX 1050 Ti/PCIe/SSE2 v: 4.6.0 NVIDIA 430.64
  Device-1: NVIDIA GP107GL High Definition Audio driver: snd_hda_intel
  Sound Server: ALSA v: k5.3.0-3-amd64
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
  driver: r8169
  IF: eth0 state: up speed: 100 Mbps duplex: full mac: 04:d9:f5:f5:de:dc
  Local Storage: total: 465.76 GiB used: 105.58 GiB (22.7%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO Plus 500GB
  size: 465.76 GiB
  ID-2: /dev/nvme1n1 vendor: Samsung model: MZVLW128HEGR-00000
  size: 119.24 GiB
  ID-1: / size: 116.58 GiB used: 17.20 GiB (14.8%) fs: ext4
  dev: /dev/nvme1n1p2
  ID-2: swap-1 size: 31.25 GiB used: 0 KiB (0.0%) fs: swap
  dev: /dev/nvme0n1p1
  System Temperatures: cpu: 29.0 C mobo: 27.0 C gpu: nvidia temp: 27 C
  Fan Speeds (RPM): cpu: 951 fan-1: 850 fan-3: 0 fan-4: 0 fan-5: 854
  fan-7: 0 gpu: nvidia fan: 20%
  Active apt repos in: /etc/apt/sources.list
  1: deb testing main contrib non-free
  2: deb-src testing main contrib non-free
  3: deb testing-security/updates main contrib non-free
  4: deb-src testing-security/updates main contrib non-free
  5: deb testing main non-free
  Active apt repos in: /etc/apt/sources.list.d/google-chrome.list
  1: deb [arch=amd64] stable main
  No active apt repos in: /etc/apt/sources.list.d/liquorix.list
  No active apt repos in: /etc/apt/sources.list.d/sid.list
  Active apt repos in: /etc/apt/sources.list.d/sparky-testing.list
  1: deb core main
  2: deb-src core main
  3: deb testing main
  4: deb-src testing main
  Active apt repos in: /etc/apt/sources.list.d/sparky-unstable.list
  1: deb unstable main
  2: deb-src unstable main
  Active apt repos in: /etc/apt/sources.list.d/vscode.list
  1: deb [arch=amd64] stable main
  Processes: 250 Uptime: 19h 51m Memory: 15.51 GiB used: 2.18 GiB (14.1%)
  Shell: bash inxi: 3.0.37


The situation has improved somewhat with the latest Nvidia drivers: FurMark now runs for something more like a minute, certainly long enough to see that the system crash isn't being caused by overheating, memory exhaustion, or any other obvious hardware or resource problem.
