How I Fixed My PC’s Blue Screen of Death

Ever since I got my PC (unless otherwise specified, “my PC” refers to my latest desktop), I had this strange issue where it would randomly hang and restart. You know, that dreaded . . .
*looks around*
Blue Screen of Death (BSoD) [1]
*screams echo in the background*

I’d be playing my favourite game of the week and all of a sudden, I’d get the damn BSoD and my whole fictional world would come crashing down. To say that I was angry would be a prime contender for the understatement of the year. It was with superhuman willpower, effort and patience that I would calmly wait for it to restart, which took around 2-3 minutes, and then continue my digital adventure where I left off. Sometimes though, the crash would occur when I let my PC run overnight to download a huge game. I’d feel like I had plugged my mobile into the charger overnight, only to realize in the morning that I forgot to switch it on. Although most of the time I would lose only around 15 minutes of gameplay, that wasn’t the reason for my outburst. It had more to do with my fear of the PC breaking down. If these crashes continued, I figured they would ultimately lead to some hardware failure. That meant I would have to get it repaired; if that didn’t work, I’d have to get a replacement part from the shop, which meant half a day gone in getting it; or, worst-case scenario, I’d have to buy a new PC. Yeah, I overthink a lot (over-overthink?). So yes, I was obviously concerned and angry whenever those BSoDs paid a visit to my PC.

It’s not like I hadn’t tried to find the root of the problem. But it gets harder when the message the BSoD gives changes every time it crashes. So I had to go one layer above and search for multiple BSoD failures. And after hours and hours of probing the deep recesses of the web and even exhausting the Google search (yes, I went to the second page of search results, that’s how far I’m willing to go for my PC), I confirmed only one thing: one or more of my hardware components had failed. Even that was not guaranteed, since it could still be a Windows issue. So I had two options in front of me:

  1. Reinstall Windows
  2. Remove the faulty hardware and plug in a new one

Reinstalling Windows meant I had to get the repair guy and ask him to install it, and I could only call him on weekends because, well, I am not available any other time. But as the week crawled by to the weekend, my laziness would take over and I’d postpone it to the next week.
I could try the second option, but it is easier said than done. First of all, there was no way to know which hardware was faulty and I didn’t have any spares. And according to the Gods of the Web, it could be either the PSU, or the motherboard, or the CPU, or the RAM, or the fan, or the HDD (throw in a video card and you have all the components required for a PC). So I did what any rational person would do: ignore it. Since it worked most of the time and only crashed once or twice in a fortnight, I figured I could live with it until I found a way to nab that faulty hardware.

Until that fateful day.

I was playing on the PC as usual when it crashed once again. So I calmly waited for it to restart, but before I could even play, it crashed again. Now this was something new. No worries, I’ll just wait until it rest . . . crash. It crashed while it was recovering from the crash (crash-ception?). And after that it was all downhill. It was never able to run for more than 10 minutes and anything I did would cause it to crash. Trying to start a game? Crash. Trying to open the settings? Crash. Trying to open the Start Menu? Crash. Looking at the monitor at a slightly weird angle? Crash. After half an hour of agonizingly trying to find what was wrong, I gave up and realized that it was time to call the repair guy. The guy came over, turned it on and waited for it to crash. Well whaddaya know, the damn thing didn’t crash even once. And so, after waiting half an hour for it to crash, he left, asking me to call him the next time it crashed. But before parting, he gave me one last piece of advice: this was most likely caused by the RAM or the HDD. And that actually narrowed it down to a more manageable list of things I could check. So I kept that in the back of my mind to use the next time my PC crashed, which was two minutes after the guy left. It’s almost like the PC wanted to see me suffer.

But this time I was desperate and prepared. A deadly combination.

I mentally prepared myself to get my hands dirty. Although I didn’t have a spare HDD, I did have 2 RAM sticks installed in the PC. I could remove one of them and check whether the crashes continued or not, and then do the same with the other RAM stick. If the crashes stopped, then the stick sitting outside was the culprit; if they didn’t stop with either stick, then it must be some other hardware. So I started my foray into the land of the silicon. After ~4 hours of back-breaking work (what? turning off, swapping RAM sticks, trying out all combinations is hard work), I found that while there were still crashes with both, one of the RAM sticks had 3-4 times as many crashes as the other. Bingo! That meant one of the RAM sticks had gone rogue. But the crashes were still infrequent; I didn’t know how to reliably get it to crash so that I could confirm my diagnosis. That’s when my friend pointed me in the direction of prime95 [2]. I started it up with only my suspect RAM stick installed and waited for it to throw an error. I didn’t even have to wait 5 minutes before it complained about multiple “hardware failure detected” errors, which did not occur with the other RAM stick. So I threw the faulty one out and magically, all the crashes went away. Not magically exactly, since I went through hell to find that faulty hardware, but I still felt pretty awesome for debugging a hardware issue for once.
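For the curious, the "which stick is it?" reasoning above can be written down as a tiny Python sketch. This is only an illustration of my logic, not a real diagnostic tool: the function name, the stick labels, the crash counts and the 3x threshold are all made up for the example.

```python
def suspect_stick(crashes_by_stick, ratio=3.0):
    """Return the label of the stick that crashed far more than the rest, else None.

    crashes_by_stick maps a label for each stick (tested alone, for a
    comparable amount of time) to the number of crashes seen with it.
    ratio is an arbitrary threshold chosen for illustration only.
    """
    if len(crashes_by_stick) < 2:
        return None  # nothing to compare against

    # Sort sticks from most crashes to fewest.
    ranked = sorted(crashes_by_stick.items(), key=lambda kv: kv[1], reverse=True)
    (worst, worst_count), (_, next_count) = ranked[0], ranked[1]

    if worst_count == 0:
        return None  # no crashes with any stick

    # max(next_count, 1) avoids dividing by zero if the other stick never crashed.
    if worst_count / max(next_count, 1) >= ratio:
        return worst
    return None  # crashes are spread evenly: suspect some other hardware


# Hypothetical counts, roughly matching the 3-4x difference I saw.
print(suspect_stick({"stick_a": 7, "stick_b": 2}))
```

If both sticks crash about equally often, the function returns None, which is the "it must be some other hardware" branch of the reasoning.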

The reason the working RAM stick crashed once or twice might have something to do with my constantly swapping the good one and the bad one every 5 minutes. I’m still not sure why it crashed at all, because since then, the PC hasn’t crashed even once (~3-4 months as of the time of writing).

All in all, I spent the entire second half of my Sunday finding the issue. This is a reminder to myself that out of thousands of working RAM sticks, I was the one who got the faulty one. But that led to one of the most fulfilling afternoons I’ve had in a long time, so should I even consider myself unlucky? Or was I lucky enough to get the faulty piece and gain new knowledge in the process of finding it?


[1]. BSoDs are those blue screens you get on a Windows machine (only Windows), due to some obscure issue. Ideally you should never get them, but we don’t live in an ideal world, so naturally it then becomes a question of how to reduce them and rectify them when they do occur. BSoDs offer you a very helpful message regarding why the machine crashed just before shutting down and restarting, and if you couldn’t detect the sarcasm in that statement, then you haven’t encountered one or tried to fix it (BSoD, not sarcasm). They are notorious for appearing randomly and giving zero information on why they occurred. At least, in my case it was zero information.

[2]. prime95 is an application used to search for new Mersenne prime numbers (it is the client for the GIMPS project). Since hunting for such primes is a very CPU-intensive task, people started using it as a way to stress test a machine. So the software has a separate mode, the torture test, just for stress testing. You can test things like the RAM, the L1, L2 and L3 caches and more. Check out their website for more information. The application itself is very simple to use. I still cannot believe I did not come across this piece of software after all these years. If it looks like I’m pandering to them, then yes I am. Because it was mainly because of that software that I could finally confirm what was wrong with my PC.
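To give a rough idea of why prime hunting doubles as a hardware test, here is a toy Python sketch. It runs the Lucas-Lehmer test (the classic check for whether 2^p - 1 is prime) on a few small exponents whose answers are already known, and reports any disagreement. As far as I understand, the real torture test works in a similar spirit, comparing its results against known-good values, but this is my simplified illustration and not prime95's actual code; the function names and the tiny table of exponents are my own.

```python
def is_mersenne_prime(p):
    """Lucas-Lehmer test: True if 2**p - 1 is prime (valid for odd prime p)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


# Well-known answers for small odd prime exponents:
# 2**p - 1 is prime for p = 3, 5, 7, 13, 17, 19, 31 and composite for 11, 23, 29.
KNOWN = {3: True, 5: True, 7: True, 11: False, 13: True,
         17: True, 19: True, 23: False, 29: False, 31: True}


def self_check(rounds=1):
    """Recompute the known cases and return the exponents that came out wrong."""
    mismatches = []
    for _ in range(rounds):
        for p, expected in KNOWN.items():
            if is_mersenne_prime(p) != expected:
                mismatches.append(p)
    return mismatches


# On healthy hardware this list should be empty.
print(self_check(rounds=100))
```

The maths is deterministic, so a wrong answer can only mean something went wrong while computing or storing the numbers. Numbers this small will not actually stress anything; the real tool works with enormous numbers for hours.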