Correct, it's not based on temperature. (Though I'm sure there's still a sensor in there to shut everything down if you try to game outside at noon in the desert.)
It's not based on power draw, either.
And it's not just SmartShift either. They include that, but it isn't the main focus. (And we can presume XSX has SmartShift too.)
Here's how I understand the approach, put as briefly as I can. (Note all numbers are illustrative examples, not real measurements.)
GPU power draw, and thus temperature, rises with the number of transistors in use and how fast they're clocked. Say you have a big GPU clocked at 1.5GHz. When running a very simple indie game, all the calculations needed to render a frame finish well before the next frame has to start. The hardware sits idle part of the time, so it may be 70% utilized, and power draw is commensurately low at 80W. (Note that this doesn't mean only 70% of the hardware is used. All games use the whole chip, so in that sense everything from launch games to the last big showcase of the gen is "maxing out the hardware". Utilization here means busy cycles over time.)
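If you like seeing that as a formula, the textbook CMOS relation is roughly P ≈ activity × C × V² × f. Here's a toy Python version; every constant in it is made up by me just to show the shape of the curve, not to match my example watts or any real GPU:

```python
# Toy version of the CMOS dynamic-power relation: P ~ activity * C * V^2 * f.
# All constants are hypothetical, picked only to illustrate the shape,
# not to match the illustrative watts in this post or any real silicon.

def gpu_power_watts(utilization: float, clock_ghz: float) -> float:
    """Illustrative GPU power estimate.

    utilization: fraction of cycles the silicon spends doing work (0..1)
    clock_ghz:   core clock in GHz
    """
    voltage = 0.6 + 0.27 * clock_ghz  # hypothetical voltage/frequency curve
    capacitance_const = 90.0          # hypothetical, lumps in switched capacitance
    return capacitance_const * utilization * voltage**2 * clock_ghz

# Higher clocks need higher voltage, so power grows faster than linearly:
print(gpu_power_watts(0.70, 1.5))   # light load at 1.5GHz
print(gpu_power_watts(0.70, 1.75))  # same load at 1.75GHz draws disproportionately more
```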
For an ambitious AAA game, the GPU doesn't have much idle time, with each frame finishing just before the next one starts. Power draw and heat are much higher too. Pretty much everyone is familiar with this. But fewer realize that this scenario doesn't represent 100% utilization. That's because in complex modern games, many calculations depend on the results of earlier calculations. If a shader needs 3 inputs and only 2 are ready before the next cycle starts, that shader stalls until the missing step completes. These inefficiencies happen more often the more you try to do at once, as in big tentpole releases. The hardware may be 95% utilized, at 160W (power doesn't scale linearly with utilization). This is "heavy max".
That doesn't mean 95% is the ceiling for utilization. Higher does happen, when a highly parallel engine is fed an easier workload. Mr. Cerny used the example of a menu screen, but it could also be a simpler-than-usual view during gameplay, like briefly running down a tight corridor. Unlike the indie game, the work isn't so easy that it finishes well before the next frame. But it also isn't very complex, like a heavy fight with particles and explosions and destruction, so fewer stalls happen. Thus all the silicon may run full-tilt all the time, 100% utilized, with power draw spiking to 200W. This is "fast max".
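So side by side, with my made-up numbers, the three regimes look like this (Python dict purely as notation):

```python
# The three load regimes at a fixed 1.5GHz clock, with the illustrative
# numbers from above (not real measurements). "util" is busy cycles over
# time, NOT the fraction of the chip a game touches.
regimes = {
    "indie":     {"util": 0.70, "watts": 80},   # work finishes early, lots of idle cycles
    "heavy max": {"util": 0.95, "watts": 160},  # AAA setpiece, stalls on dependent shaders
    "fast max":  {"util": 1.00, "watts": 200},  # simple-but-parallel load, zero stalls
}
```

Note the counterintuitive part: the highest utilization and power draw come from the *easier* workload, not the heaviest one.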
When designing a system with fixed clocks, you have to predict the worst-case scenario those clocks will encounter. That'll be "fast max", so you put in a power supply and cooling system that handles 200W (not including efficiency and safety margins). But what about gameplay, the thing you're actually trying to provision best? Even the biggest setpiece moments only draw 160W out of your 200W design. Whereas if you clocked up to 1.75GHz, you'd get a lot more compute power to put onscreen. And 95% at 1.75GHz takes the same power as 100% at 1.5GHz, so your power and cooling system is still fine. ...But you can't, because then transient "fast max" loads of 100% at 1.75GHz would draw 250W and shut down your machine.
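Plug my made-up numbers in and the trap is easy to see:

```python
# The fixed-clock designer's dilemma, with the illustrative numbers from
# above. The 1.75GHz draws assume the scaling stated in this post:
# 95% util at 1.75GHz = 200W, 100% util at 1.75GHz = 250W.

PSU_LIMIT_WATTS = 200  # what the power supply and cooler are built for

designs = {
    "1.5GHz":  {"heavy max (gameplay)": 160, "fast max (transient)": 200},
    "1.75GHz": {"heavy max (gameplay)": 200, "fast max (transient)": 250},
}

for clock, draws in designs.items():
    worst = max(draws.values())
    verdict = "fine" if worst <= PSU_LIMIT_WATTS else "SHUTDOWN"
    print(f"{clock}: gameplay {draws['heavy max (gameplay)']}W, "
          f"worst case {worst}W -> {verdict}")
```

At 1.5GHz your gameplay leaves 40W of the budget sitting unused; at 1.75GHz gameplay uses it all, but a transient "fast max" trips the limit.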
Here's where Sony's variable solution can help. Go ahead and clock to 1.75GHz to really squeeze all you can from gameplay. What about the "fast max" problem? Well, the system is monitoring activity (what the silicon is actually doing), not power draw or thermals. So when it sees 100% utilization, it drops the clocks to 1.5GHz. There's still plenty of compute to render the easier work of the "fast max" segment, so the user doesn't see any change in quality onscreen. But the power draw stays at 200W instead of going higher. When the game load gets heavy again and utilization drops to 95%, the clock ramps back to 1.75GHz and all 200W are dedicated to compute.
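If you sketched that governor in code, it'd look something like this. The threshold, the two-clock simplification, and the pick_clock interface are my guesses for illustration, not Sony's actual design (the real system presumably has many frequency steps and a proper power model):

```python
# Sketch of the activity-based governor described above: watch utilization,
# and when the workload goes fully parallel ("fast max"), pull the clock
# down so draw never exceeds the power budget. All values hypothetical.

HIGH_CLOCK_GHZ = 1.75
LOW_CLOCK_GHZ = 1.5
FAST_MAX_THRESHOLD = 0.99  # near-100% utilization signals an "easy but parallel" load

def pick_clock(utilization: float) -> float:
    """Choose a clock from sampled GPU activity (utilization 0..1)."""
    if utilization >= FAST_MAX_THRESHOLD:
        # Zero-stall workload: compute to spare, so trade clock for
        # power headroom. The frame still renders with ease.
        return LOW_CLOCK_GHZ
    # Stall-heavy gameplay: spend the whole power budget on clocks.
    return HIGH_CLOCK_GHZ

# A menu screen saturates the GPU -> clocks drop, draw stays at 200W:
assert pick_clock(1.00) == LOW_CLOCK_GHZ
# A heavy fight with stalls -> clocks ramp, all 200W go to compute:
assert pick_clock(0.95) == HIGH_CLOCK_GHZ
```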
Hope this is helpful for some.