So we recently reopened the code repository for Turok 2: Seeds of Evil remaster, initially made back in 2016 and last touched back in 2019 for the Nintendo Switch port.
As some may know, the this port has never been the most stable, usually in the form of odd crashes at times, especially in the multiplayer, crashes that we haven't always been able to investigate or make sense of. However recently due to currently undisclosed (though probably obvious) reasons, we've had the opportunity to finally investigate some of these mysterious issues.
So why this thread? I often find that some people can make some rather large assumptions about what our job as programmers involves, usually people point out a crash and then wonder why it takes weeks/months/years to address it, if ever. So I'm going to try something and that's actually discussing the technical details behind the bugs and crashes that can often plague software. The idea essentially being to help explain why crashes and bugs that seem so obvious can be anything but. It won't be often given the scope of these things, but there's no better time to start than now.
So without further ado, here is the first of my irregular segment: Seymour, The Heap Is On Fire.
---
Problem: Particle instancing in the renderer would pick up strange values off the heap to define the drawn particle order.
Results: Unknown. Seemingly this could either crash (hypothetically, though never observed) or just muck up the particle list in some way. But we only observed this as an issue in address sanitizer. We never actually found this directly linked to any known crash or visual issue. Mysterious.
Cause: The viewpoint pointer for the current camera would become wild and was never cleared.
Anybody who does programming in C++ might immediately know what's wrong, given I just explained the cause and highlighted it, but if you didn't have any of that it can lead to utter confusion. What this code is doing is a static pointer (a pointer that persists in memory even after the function it lives in is exited) is created to contain the current viewpoint that exists on the stack, we use it so we can access it in the sorting function just below inside the anonymous function, but statics work different from regular variables, they are only initialised once. So every time we re-entered the function, the same viewpoint is used. This is fine up until the current viewpoint is destroyed and recreated, say when changing between levels. Then we have a pointer that's "wild", it points to a valid location in memory but its data is anything but. In this case, it's mostly non-fatal as its address is only ever read, not written to, when used to sort the particle list distance. The values could be anything.
The fix: Quite simple, we add a line to explicitly reset the value of pView to the current viewpoint every time we approach that call.
Why was this never noticed: The behaviour was non-obvious given it never corrupted any memory, just gave back bogus floats, and there was nothing technically wrong with the code so the static analysis of the compiler would never notice anything either. It was just an extremely subtle error.
---
Problem: Cinematics could end up read/writing trash information into the heap if the cinematic actions list was resized in the middle of iterating through it. Notably could occur if the cinematic was skipped.
Results: Seemingly, heap corruption or outright illegal address access (resulting in an instant crash), however we've never had a report refer to anything like this. The nature of the problem however means it could cause red herring crashes down the line with seemingly no connection to this issue at all.
Cause: C++, like some other languages, has these things called automatic iterators. Basically you take an array and instead of using an ordinal number to mark where you are in the array, you use a pointer in it and then check to see if you've reached the end each loop. That way, the pointer can also be directly used to address the current element without having to do that separately, potentially saving an extra register or value in the stack.
However, these are very dangerous to use if the array can be dynamically reallocated if it needs to be resized, like ours could.
What would happen is ev (the current event) would attempt to move to the next element, however during the previous event a new event was perhaps added or removed, reallocating the array so it now exists in a physically different location. So the current event pointer would now be wild (hey, like before) and not match the expected array address at all. It would shift to the next entry and suddenly find itself very much outside the array, having somehow missed the end check (it's likely the pointer exists before the start or is even misaligned now). We are now accessing an event in a completely garbage location in memory.
The fix: We change it back into a normal iterator, using an ordinal to track the position. In the scheme of things it wouldn't produce any measurable performance impact after all, and it ensures that we are always grabbing a valid pointer for the correct location in the array, even if its reallocated, and we are always checking where we are based on an actual length value, rather than an address in memory.
Why was this never noticed: Probably because any crash this would actually produce would appear far later and in a different location in memory. If it wrote anything, it would just be in an irrelevant but allocated location in memory so no crash would be immediate, and something else that didn't expect the memory contents to change would need to be the thing that crashes. So if anything did, we would have never known about the true cause.
---
And lastly, the big one,
Problem: Multiplayer would crash, a lot. It would also crash if you went for multiplayer back to singleplayer. It also may not crash if you went straight to multiplayer and skipped the intro cinematics completely (i.e they never loaded). Confused yet?
Results: The impact would be seemingly random, going from weird inconstant assertion failures (an assertion is a type of statement in code that confirms the input is correct) in places that should never have broken to just the whole program falling apart and closing without warning. There was generally no consistency with these issues.
Cause and fix: Who wins? An entire program with memory heap management or... this... thing:
Okay, so note this isn't the cause, this is the fix, we missed one single pointer clear and the whole program would meltdown in extremely unpredictable ways. It's that old bastard heap corruption again!
What would happen is when you changed between singleplayer and multiplayer (note, those intro cinematics are just the game running singleplayer with a cinematic running, they are technically playable), your local client number would change, so you now need a different viewpoint object. Well that's fine, but then we also forgot to clear the pointers for the previous viewpoint on the now not-local player 1. Before we call this, we previously call a stop function when ending the old game that clears up all these viewpoints, so now there's this stale viewpoint pointer that the game now sees as not actually null, and thus must be valid right? And then it tries to write data to it. Such as here, inside the teleport function:
After all, the pointer still looks valid, it isn't null. So now we have things trying to setup a view object is a completely nonsense location in memory. And not just once, but many, many times over during the current game. Oops.
Why was this never noticed: Because the crashes had nothing to do with the problem, as is the nature of heap corruption. And because of when the problem manifested, there was no connection at all, the crashes would only crop up minutes maybe hours after the corruption took place. When I did get reports from QA recently, it would be about kexHeap (our memory heap manager) handling the network packets or a weird pointer inside the sound API (i.e something not even our code), they had never had anything to do with viewpoints and the creation there of. The problems were, in effect, nonsense.
This made even confirming this was a fix complicated, as once the code changed we had to do hours of testing to see if the randomly occurring issues would no longer randomly occur. You can probably see the problem with that.
A thought: I'm not certain, but this issue may have existed since the beginning of Turok2's development. It's probable it directly came from changing Turok1's singleplayer code, where only one viewpoint ever existed and was static, into multiplayer for Turok2. Their code-bases don't have a lot similar but Turok2 initially started life pulling stuff from Turok1 to speed up alpha development.
---
So that's been a look into the kind of problems we deal with when it comes to bugs and crashes. I hope to make something like this again soon. If we have more weird bugs like these. Maybe. It's kind of a weird thing to hope your program has more bugs, that's kind of not what I want.
Hope you have found this informative, thank you.
As some may know, the this port has never been the most stable, usually in the form of odd crashes at times, especially in the multiplayer, crashes that we haven't always been able to investigate or make sense of. However recently due to currently undisclosed (though probably obvious) reasons, we've had the opportunity to finally investigate some of these mysterious issues.
So why this thread? I often find that some people can make some rather large assumptions about what our job as programmers involves, usually people point out a crash and then wonder why it takes weeks/months/years to address it, if ever. So I'm going to try something and that's actually discussing the technical details behind the bugs and crashes that can often plague software. The idea essentially being to help explain why crashes and bugs that seem so obvious can be anything but. It won't be often given the scope of these things, but there's no better time to start than now.
So without further ado, here is the first of my irregular segment: Seymour, The Heap Is On Fire.
---
Problem: Particle instancing in the renderer would pick up strange values off the heap to define the drawn particle order.
Results: Unknown. Seemingly this could either crash (hypothetically, though never observed) or just muck up the particle list in some way. But we only observed this as an issue in address sanitizer. We never actually found this directly linked to any known crash or visual issue. Mysterious.
Cause: The viewpoint pointer for the current camera would become wild and was never cleared.
Anybody who does programming in C++ might immediately know what's wrong, given I just explained the cause and highlighted it, but if you didn't have any of that it can lead to utter confusion. What this code is doing is a static pointer (a pointer that persists in memory even after the function it lives in is exited) is created to contain the current viewpoint that exists on the stack, we use it so we can access it in the sorting function just below inside the anonymous function, but statics work different from regular variables, they are only initialised once. So every time we re-entered the function, the same viewpoint is used. This is fine up until the current viewpoint is destroyed and recreated, say when changing between levels. Then we have a pointer that's "wild", it points to a valid location in memory but its data is anything but. In this case, it's mostly non-fatal as its address is only ever read, not written to, when used to sort the particle list distance. The values could be anything.
The fix: Quite simple, we add a line to explicitly reset the value of pView to the current viewpoint every time we approach that call.
Why was this never noticed: The behaviour was non-obvious given it never corrupted any memory, just gave back bogus floats, and there was nothing technically wrong with the code so the static analysis of the compiler would never notice anything either. It was just an extremely subtle error.
---
Problem: Cinematics could end up read/writing trash information into the heap if the cinematic actions list was resized in the middle of iterating through it. Notably could occur if the cinematic was skipped.
Results: Seemingly, heap corruption or outright illegal address access (resulting in an instant crash), however we've never had a report refer to anything like this. The nature of the problem however means it could cause red herring crashes down the line with seemingly no connection to this issue at all.
Cause: C++, like some other languages, has these things called automatic iterators. Basically you take an array and instead of using an ordinal number to mark where you are in the array, you use a pointer in it and then check to see if you've reached the end each loop. That way, the pointer can also be directly used to address the current element without having to do that separately, potentially saving an extra register or value in the stack.
However, these are very dangerous to use if the array can be dynamically reallocated if it needs to be resized, like ours could.
What would happen is ev (the current event) would attempt to move to the next element, however during the previous event a new event was perhaps added or removed, reallocating the array so it now exists in a physically different location. So the current event pointer would now be wild (hey, like before) and not match the expected array address at all. It would shift to the next entry and suddenly find itself very much outside the array, having somehow missed the end check (it's likely the pointer exists before the start or is even misaligned now). We are now accessing an event in a completely garbage location in memory.
The fix: We change it back into a normal iterator, using an ordinal to track the position. In the scheme of things it wouldn't produce any measurable performance impact after all, and it ensures that we are always grabbing a valid pointer for the correct location in the array, even if its reallocated, and we are always checking where we are based on an actual length value, rather than an address in memory.
Why was this never noticed: Probably because any crash this would actually produce would appear far later and in a different location in memory. If it wrote anything, it would just be in an irrelevant but allocated location in memory so no crash would be immediate, and something else that didn't expect the memory contents to change would need to be the thing that crashes. So if anything did, we would have never known about the true cause.
---
And lastly, the big one,
Problem: Multiplayer would crash, a lot. It would also crash if you went for multiplayer back to singleplayer. It also may not crash if you went straight to multiplayer and skipped the intro cinematics completely (i.e they never loaded). Confused yet?
Results: The impact would be seemingly random, going from weird inconstant assertion failures (an assertion is a type of statement in code that confirms the input is correct) in places that should never have broken to just the whole program falling apart and closing without warning. There was generally no consistency with these issues.
Cause and fix: Who wins? An entire program with memory heap management or... this... thing:
Okay, so note this isn't the cause, this is the fix, we missed one single pointer clear and the whole program would meltdown in extremely unpredictable ways. It's that old bastard heap corruption again!
What would happen is when you changed between singleplayer and multiplayer (note, those intro cinematics are just the game running singleplayer with a cinematic running, they are technically playable), your local client number would change, so you now need a different viewpoint object. Well that's fine, but then we also forgot to clear the pointers for the previous viewpoint on the now not-local player 1. Before we call this, we previously call a stop function when ending the old game that clears up all these viewpoints, so now there's this stale viewpoint pointer that the game now sees as not actually null, and thus must be valid right? And then it tries to write data to it. Such as here, inside the teleport function:
After all, the pointer still looks valid, it isn't null. So now we have things trying to setup a view object is a completely nonsense location in memory. And not just once, but many, many times over during the current game. Oops.
Why was this never noticed: Because the crashes had nothing to do with the problem, as is the nature of heap corruption. And because of when the problem manifested, there was no connection at all, the crashes would only crop up minutes maybe hours after the corruption took place. When I did get reports from QA recently, it would be about kexHeap (our memory heap manager) handling the network packets or a weird pointer inside the sound API (i.e something not even our code), they had never had anything to do with viewpoints and the creation there of. The problems were, in effect, nonsense.
This made even confirming this was a fix complicated, as once the code changed we had to do hours of testing to see if the randomly occurring issues would no longer randomly occur. You can probably see the problem with that.
A thought: I'm not certain, but this issue may have existed since the beginning of Turok2's development. It's probable it directly came from changing Turok1's singleplayer code, where only one viewpoint ever existed and was static, into multiplayer for Turok2. Their code-bases don't have a lot similar but Turok2 initially started life pulling stuff from Turok1 to speed up alpha development.
---
So that's been a look into the kind of problems we deal with when it comes to bugs and crashes. I hope to make something like this again soon. If we have more weird bugs like these. Maybe. It's kind of a weird thing to hope your program has more bugs, that's kind of not what I want.
Hope you have found this informative, thank you.
Last edited: