A look into bugs and crashes in Turok2, and the job of finding and debugging them

Edward850 · Jan 24, 2021

So we recently reopened the code repository for Turok 2: Seeds of Evil remaster, initially made back in 2016 and last touched back in 2019 for the Nintendo Switch port.

As some may know, the this port has never been the most stable, usually in the form of odd crashes at times, especially in the multiplayer, crashes that we haven't always been able to investigate or make sense of. However recently due to currently undisclosed (though probably obvious) reasons, we've had the opportunity to finally investigate some of these mysterious issues.

So why this thread? I often find that some people can make some rather large assumptions about what our job as programmers involves, usually people point out a crash and then wonder why it takes weeks/months/years to address it, if ever. So I'm going to try something and that's actually discussing the technical details behind the bugs and crashes that can often plague software. The idea essentially being to help explain why crashes and bugs that seem so obvious can be anything but. It won't be often given the scope of these things, but there's no better time to start than now.

So without further ado, here is the first of my irregular segment: Seymour, The Heap Is On Fire.

---

Problem: Particle instancing in the renderer would pick up strange values off the heap to define the drawn particle order.

Results: Unknown. Seemingly this could either crash (hypothetically, though never observed) or just muck up the particle list in some way. But we only observed this as an issue in address sanitizer. We never actually found this directly linked to any known crash or visual issue. Mysterious.

Cause: The viewpoint pointer for the current camera would become wild and was never cleared.

Anybody who does programming in C++ might immediately know what's wrong, given I just explained the cause and highlighted it, but if you didn't have any of that it can lead to utter confusion. What this code is doing is a static pointer (a pointer that persists in memory even after the function it lives in is exited) is created to contain the current viewpoint that exists on the stack, we use it so we can access it in the sorting function just below inside the anonymous function, but statics work different from regular variables, they are only initialised once. So every time we re-entered the function, the same viewpoint is used. This is fine up until the current viewpoint is destroyed and recreated, say when changing between levels. Then we have a pointer that's "wild", it points to a valid location in memory but its data is anything but. In this case, it's mostly non-fatal as its address is only ever read, not written to, when used to sort the particle list distance. The values could be anything.

The fix: Quite simple, we add a line to explicitly reset the value of pView to the current viewpoint every time we approach that call.

Why was this never noticed: The behaviour was non-obvious given it never corrupted any memory, just gave back bogus floats, and there was nothing technically wrong with the code so the static analysis of the compiler would never notice anything either. It was just an extremely subtle error.

---

Problem: Cinematics could end up read/writing trash information into the heap if the cinematic actions list was resized in the middle of iterating through it. Notably could occur if the cinematic was skipped.

Results: Seemingly, heap corruption or outright illegal address access (resulting in an instant crash), however we've never had a report refer to anything like this. The nature of the problem however means it could cause red herring crashes down the line with seemingly no connection to this issue at all.

Cause: C++, like some other languages, has these things called automatic iterators. Basically you take an array and instead of using an ordinal number to mark where you are in the array, you use a pointer in it and then check to see if you've reached the end each loop. That way, the pointer can also be directly used to address the current element without having to do that separately, potentially saving an extra register or value in the stack.

However, these are very dangerous to use if the array can be dynamically reallocated if it needs to be resized, like ours could.

What would happen is ev (the current event) would attempt to move to the next element, however during the previous event a new event was perhaps added or removed, reallocating the array so it now exists in a physically different location. So the current event pointer would now be wild (hey, like before) and not match the expected array address at all. It would shift to the next entry and suddenly find itself very much outside the array, having somehow missed the end check (it's likely the pointer exists before the start or is even misaligned now). We are now accessing an event in a completely garbage location in memory.

The fix: We change it back into a normal iterator, using an ordinal to track the position. In the scheme of things it wouldn't produce any measurable performance impact after all, and it ensures that we are always grabbing a valid pointer for the correct location in the array, even if its reallocated, and we are always checking where we are based on an actual length value, rather than an address in memory.

Why was this never noticed: Probably because any crash this would actually produce would appear far later and in a different location in memory. If it wrote anything, it would just be in an irrelevant but allocated location in memory so no crash would be immediate, and something else that didn't expect the memory contents to change would need to be the thing that crashes. So if anything did, we would have never known about the true cause.

---

And lastly, the big one,

Problem: Multiplayer would crash, a lot. It would also crash if you went for multiplayer back to singleplayer. It also may not crash if you went straight to multiplayer and skipped the intro cinematics completely (i.e they never loaded). Confused yet?

Results: The impact would be seemingly random, going from weird inconstant assertion failures (an assertion is a type of statement in code that confirms the input is correct) in places that should never have broken to just the whole program falling apart and closing without warning. There was generally no consistency with these issues.

Cause and fix: Who wins? An entire program with memory heap management or... this... thing:

Okay, so note this isn't the cause, this is the fix, we missed one single pointer clear and the whole program would meltdown in extremely unpredictable ways. It's that old bastard heap corruption again!

What would happen is when you changed between singleplayer and multiplayer (note, those intro cinematics are just the game running singleplayer with a cinematic running, they are technically playable), your local client number would change, so you now need a different viewpoint object. Well that's fine, but then we also forgot to clear the pointers for the previous viewpoint on the now not-local player 1. Before we call this, we previously call a stop function when ending the old game that clears up all these viewpoints, so now there's this stale viewpoint pointer that the game now sees as not actually null, and thus must be valid right? And then it tries to write data to it. Such as here, inside the teleport function:

After all, the pointer still looks valid, it isn't null. So now we have things trying to setup a view object is a completely nonsense location in memory. And not just once, but many, many times over during the current game. Oops.

Why was this never noticed: Because the crashes had nothing to do with the problem, as is the nature of heap corruption. And because of when the problem manifested, there was no connection at all, the crashes would only crop up minutes maybe hours after the corruption took place. When I did get reports from QA recently, it would be about kexHeap (our memory heap manager) handling the network packets or a weird pointer inside the sound API (i.e something not even our code), they had never had anything to do with viewpoints and the creation there of. The problems were, in effect, nonsense.
This made even confirming this was a fix complicated, as once the code changed we had to do hours of testing to see if the randomly occurring issues would no longer randomly occur. You can probably see the problem with that.

A thought: I'm not certain, but this issue may have existed since the beginning of Turok2's development. It's probable it directly came from changing Turok1's singleplayer code, where only one viewpoint ever existed and was static, into multiplayer for Turok2. Their code-bases don't have a lot similar but Turok2 initially started life pulling stuff from Turok1 to speed up alpha development.

---

So that's been a look into the kind of problems we deal with when it comes to bugs and crashes. I hope to make something like this again soon. If we have more weird bugs like these. Maybe. It's kind of a weird thing to hope your program has more bugs, that's kind of not what I want.

Hope you have found this informative, thank you.

SofNascimento · Jan 24, 2021

I must confess I read the entire thing and it went right thourgh my head. I only took one programming course in college, and although I thought I was quite good at it, I peaked at making a cash register program.

For me, there is somethign etheral about coding. Part of me can not really truly believe that those lines becomes the game I play, or any other computer related stuff.

What I do believe is that things that might appear simple can be extremely complicated and the problems might be very hard to find. And I should also say that I played the Turok 2 (and 1) rematers on PC and absolutely loved it. If fact, I wonder, if you guys working on it ever thought the people making it for the NIntendo 64 were out of their mind, because it feels it can't be played in anything but a PC.

Bishop89 · Jan 24, 2021

I wish all developers were this open and transparent with behind the scene development.

Good stuff 👌

Edward850 · Jan 24, 2021

SofNascimento said:
If fact, I wonder, if you guys working on it ever thought the people making it for the NIntendo 64 were out of their mind, because it feels it can't be played in anything but a PC.

Turok2 was definitely over-reaching on the N64. They pulled just about every trick in the book to make it perform well and it absolutely created problems updating it for new platforms. Just about every single performance issue now is directly linked to the fragmented level geometry that was designed around the short fog distance and additionally how the AI also reacted to that.

FrakEarth · Jan 24, 2021

Nothing worse than issues like the last one where the errors and info on the stack seems to point you in a direction, but the cause was actually some oversight that caused something else altogether. Debugging is a valuable skill.. I mainly live in C# and JS, so don't deal with memory pointers that often, don't often have a great need for high performant code... but I have a good appreciation of how hard this stuff can be. Games are complex!

Mr_Antimatter · Jan 24, 2021

Awesome thread! Reminds me I need to finally play this, which I purchased way back at release.

SofNascimento · Jan 24, 2021

Edward850 said:
Turok2 was definitely over-reaching on the N64. They pulled just about every trick in the book to make it perform well and it absolutely created problems updating it for new platforms. Just about every single performance issue now is directly linked to the fragmented level geometry that was designed around the short fog distance and additionally how the AI also reacted to that.

I remember after playing the game I literally went to youtube too see how it looked like on the N64... and it felt like was I looking at another game. Maybe this is unfair to say, but it felt like the true vision behind the game wasn't really there, it was only with the remaster that it came to be. And not only from a technical stand point. One of the things that impressed me in Turok 2 (particuarly comparing it with the original Turok) is how much precision aiming played a role. Aiming for the head was a major thing... and it was hard to do with a mouse, I can't imagine with a N64 controller which I still not fully understand how am I supposed to hold it.

Blackpuppy · Jan 25, 2021

Awesome thread! Thanks for the insight into development. Does all this updating mean we'll see multiplayer on Switch...?

SofNascimento said:
and it was hard to do with a mouse, I can't imagine with a N64 controller which I still not fully understand how am I supposed to hold it.

Turok 1/2 on the N64 was a bit ahead of its time in terms of controls and used a form of "dual stick" controls. The yellow C buttons were used for walking forwards and backwards and strafing and the analogue stick was used for aiming.

Vintage · Jan 25, 2021

Are we using Era as Jira board now?

jk, great job and interesting read :)

Travis Pickle · Jan 25, 2021

As someone currently working on his CS degree, this was super interesting to read. Thank you.

AIan · Jan 25, 2021

I will understand what this means. Just give me a couple more years ^^

Edward850 · Jan 25, 2021

Blackpuppy said:
Awesome thread! Thanks for the insight into development. Does all this updating mean we'll see multiplayer on Switch...?

I can neither confirm nor deny that right now, though on a totally unrelated note I did recently tweet this image:

accx · Jan 25, 2021

Fantastic thread! I understood some words.

Tedmilk · Jan 25, 2021

Interesting thread (from someone who's dabbled with c++). Thanks!

Blackpuppy · Jan 25, 2021

Edward850 said:
I can neither confirm nor deny that right now, though on a totally unrelated note I did recently tweet this image:

Well, I think I might just quietly celebrate in my corner here....

sox · Jan 25, 2021

Man I'm glad I left C++ behind at uni and do basic stuff like SQL for a living. Thanks for the insight into your work and for looking into bugs in a game that is many years old at this point.

A look into bugs and crashes in Turok2, and the job of finding and debugging them

Edward850

Software & Netcode Engineer at Nightdive Studios

SofNascimento

cursed

Bishop89

What Are Ya' Selling?

Edward850

Software & Netcode Engineer at Nightdive Studios

FrakEarth

Mr_Antimatter

SofNascimento

cursed

Blackpuppy

Vintage

Travis Pickle

AIan

Edward850

Software & Netcode Engineer at Nightdive Studios

accx

Tedmilk

Blackpuppy

sox