Sick PC

I’ve had a sick PC for several weeks now. It has cost me a surprising amount of time and thought.

It started with my main work machine randomly hanging. This is Windows 11 with a Ryzen 9 5900X, and it had previously run faultlessly for two years or so. At first the hangs were merely annoying. I assumed that some driver had been updated and was playing up, and I hoped that it would just fix itself with another update. It wasn’t hanging often enough to worry too much about. However, the Windows Update history showed no changes for longer than the problem had been happening, which made me doubt that theory.

The hangs became more frequent, or at least seemed that way. It got to the point where I was starting to lose work. I started googling and began to work through the usual recommendations, checking for driver updates, testing memory, etc. The box has 128GB of memory so running Memtest86 was taking over 30 hours.

The memory test was consistently showing errors on one of the DIMMs and the error stayed on the same DIMM when I moved it around between slots on the motherboard. I assumed that this was the issue and ordered some replacement memory and decided to look into warranty replacement after I’d got the machine stable again. Annoyingly it seemed impossible to buy just one 32GB DIMM as it was dual channel memory, so I needed to replace two DIMMs rather than just the failing one…

The problems then got worse when the machine was running with the 64GB of ‘good’ memory.

I limped on, switching between the machine that was sick and my previous development machine which I keep as a backup, and working mostly on the old machine. The replacement memory arrived but Memtest86 showed that it also had one failing stick; I swapped it all around and the failure followed the same stick from slot to slot. Very strange, but I RMA’d it and assumed it was just bad luck. At this point stress due to losing working time and thinking time was making me cranky.

Driver updates, no change. BIOS update, no change. I then disappeared down a rabbit hole of ‘taking things out and seeing if it was better’. This lasted what seemed like forever. Was it the 10Gb ethernet? No. The graphics card? No. Running with Driver Verifier led to me thinking it might be one of the M.2 disks failing, but it wasn’t.

In the end I was left with a machine with a single M.2 boot disk, one DIMM and nothing else and the problem was worse. It had progressed from hangs to hangs and/or blue screens and the blue screen would usually hang during dump creation.

I pulled the final M.2 disk, plugged in a new scratch SSD and installed Windows 10. For a short time things were stable. Then it hung.

Since the hangs and crashes seemed to happen when the machine was idle, it sometimes appeared that things were better, but only because the machine wasn’t actually as idle as it needed to be to trigger the problem. With the new installation of Windows I had stability for quite a while, until I realised that it was doing a little bit of updating in the background and that this was keeping things busy enough to make the machine look stable. What I should have done was disconnect the network cable.

By now I had ordered a replacement machine so that I could move forward. It meant bringing my normal machine refresh forward by 18 months or so but it’s always useful to have another box when you’re working with multiple clients at the same time and kicking off CI builds etc.

It seemed that now it was either a motherboard fault, a flaky PSU, the Ryzen or the cooler. I discovered that the whirring noise that my cooler has always made is air in the pump, but it’s an AIO thing and so not suitable for user maintenance. I wiggled the PC around to shift the air but the PC wasn’t overheating, and although I could get the air to clear itself and the cooler to run silently it didn’t affect the hanging, and the air kept coming back… My previous development machine has a custom water cooler loop, which, whilst more expensive in components was easier to fix to remove the air locks; especially for someone who was once a plumber…

Eventually I seem to have stumbled upon the cause… When I updated the BIOS I checked the settings before updating and noticed that pretty much everything was set to ‘auto’. I was a little surprised, but it’s a long time since I last looked at the settings. Anyway, after the update I made sure things were the same as before, but it seems that the problem is that they shouldn’t have all been on auto…

Changing a couple of power settings has finally given me a stable box. The settings are “Power Supply Idle Control”, which I changed from auto to “Typical Current Idle”, and “Power Loading”, which I changed from auto to “Enabled”. These both affect how the PSU reacts to the CPU using less than normal levels of power. It seems that the Ryzen can use a tiny amount of power when all the cores are idle, and this tiny amount can confuse the PSU into thinking it doesn’t need any at all. Or something like that.

Watching the power draw with Ryzen Master, it’s clear that with the new settings the CPU idles at a much higher power draw than before, and this seems to prevent the hanging. I need to investigate more.

I then thought that the CMOS battery might be dead, but it’s not. So something caused the BIOS to clear its settings and that was the root cause of the hanging. I think the memory stick has likely always been bad. I’d never tested it before, and I used to have the occasional blue screen (once in a blue moon) which may have been down to the dodgy memory; with so much memory in the box I expect the bad bits don’t affect things that often.

This was a painful, time-consuming and expensive journey. I now have backups of my BIOS settings and, hopefully, this post will help me if it happens again.

Things I learnt:

  1. Make notes. Detailed notes.

  2. Back up your BIOS settings - there’s an option in the ASUS and AORUS BIOSs that I have that lets you save and load the config to a USB drive. The ASUS ones are better, IMHO, as they are a very clear text format that you can read, edit and compare. The AORUS is a binary format which is less helpful.

  3. If things start going strange, check the BIOS hasn’t been reset for some reason; pick a single easy to spot setting that isn’t a default and make a note of it…

  4. If there are random hangs, do they happen when the machine is idle? If so, any change made to fix them has only really been tested once the machine has been idle enough for long enough.

  5. Rather than removing things one at a time, remove everything, put a new disk in, install Windows, pull the network cable and see what happens - this can exclude everything in one go…

  6. Keep the backup box reasonably up to date.
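
Since the ASUS settings backup is readable text, checking whether the BIOS has quietly reset itself can be as simple as diffing a known good export against a fresh one. Here is a minimal sketch in Python. The function names, the file names and the `Setting [Value]` line layout in the example are all made up for illustration; I haven't checked the real export layout or its encoding, so treat this as a starting point rather than a finished tool.

```python
import difflib
from pathlib import Path


def changed_settings(old_text: str, new_text: str) -> list[str]:
    """Return only the lines that differ between two text exports.

    Lines prefixed with '- ' exist only in the old export and lines
    prefixed with '+ ' exist only in the new one.
    """
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


def compare_files(old_path: str, new_path: str) -> list[str]:
    """Read two exported settings files and diff them.

    Assumption: the exports decode as UTF-8 text; undecodable bytes are
    replaced rather than raising an error.
    """
    old_text = Path(old_path).read_text(encoding="utf-8", errors="replace")
    new_text = Path(new_path).read_text(encoding="utf-8", errors="replace")
    return changed_settings(old_text, new_text)


# Hypothetical usage, with made-up file names:
#   for line in compare_files("known_good.txt", "today.txt"):
#       print(line)
```

An empty result means nothing has changed. Anything else is worth a look before you start pulling hardware out of the box.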