Adding operational modes that process low-precision workloads more efficiently isn't exactly adding hardware (unless you're referring to the shader cores themselves versus their X1/X counterparts, in which case, sure), but that's beside the point at this juncture. Nobody was claiming the XSX GPU doesn't have the means to accelerate DirectML workloads, just that it doesn't have hardware designed for and aimed squarely at that singular purpose à la tensor cores; the latter is what I and others who've replied to Alucardx23 were referring to with "dedicated hardware". I'm sure you can agree there's a distinction to be made between machine-learning-oriented shader cores and a pool of discrete machine-learning-oriented cores, and that's what we were trying to illustrate to Alucardx23, as he was conflating the two. That's all. The argument was never "The XSX GPU can't do what the tensor cores do."
Late edit: You may also want to read Liabe Brave's posts here and here (the latter especially). As it turns out, the ability to use an FP32 operation to process 4x INT8 or 8x INT4 isn't something Microsoft added; rather, it's inherent to the RDNA architecture. This means, incontrovertibly and unequivocally, that the XSX GPU isn't unique in that regard: said functionality is also supported by the PS5 GPU, any RDNA2-based GPUs AMD has in R&D, and even last year's RDNA1-based Radeon 5700 XT.
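As a quick illustration of that packed-math capability, here's a minimal Python sketch. The dot4_i32_i8 name just mirrors the dp4a-style dot-product instructions RDNA exposes (something along the lines of V_DOT4_I32_I8); it only emulates the arithmetic, it's not shader code:

```python
import numpy as np

# Minimal sketch (not vendor code) of what a packed dot-product instruction
# does in a single ALU slot: four INT8 multiplies plus an INT32 accumulate,
# occupying the same 32-bit lane and issue slot that one FP32 operation would.
def dot4_i32_i8(a_packed, b_packed, acc):
    """Emulate four INT8 multiply-accumulates into an INT32 accumulator."""
    return int(acc + np.dot(a_packed.astype(np.int32), b_packed.astype(np.int32)))

a = np.array([12, -7, 33,  5], dtype=np.int8)  # four INT8 values in one 32-bit register
b = np.array([-2, 19,  4, 11], dtype=np.int8)
print(dot4_i32_i8(a, b, acc=0))  # one op, four MACs; INT4 packs eight values the same way
```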
I never said it was unique to the SX; in fact, I said both consoles, as I was aware that this is an RDNA2 feature.
However, as Dictator pointed out, the consoles have a custom RDNA2, so they may not necessarily pick up all available features, and he said Sony hasn't confirmed that feature to them. Now, I think that's more down to their messaging being a bit odd than to them not having it.
However, I've been doing a bit of research on the topic; if anyone wants to validate these findings, I more than welcome the feedback:
- I remembered that back in the Xenos days, each of its ALU cores had a vector and a scalar unit. The vector unit could do the same operations as the scalar unit, but on up to a 4-wide vector at once. This was touted at the time as a big advantage because they could perform one vector multiply-add per cycle.
However, during real-world usage it was noted that most operations were scalar, and that had one effect: most of the ALU hardware was wasted because the vector unit sat underutilized when doing scalar math. So for the next generation of the architecture, AMD replaced the vector unit with 4 scalar units that could still be ganged together for vector operations.
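Just to illustrate that underutilization effect with made-up numbers (this is not Xenos data, purely a sketch of the argument):

```python
# Hypothetical back-of-the-envelope illustration, assuming a 4-wide vector
# unit that issues one instruction per cycle no matter how many of its lanes
# the instruction actually needs.
def vector_lane_utilization(instruction_widths):
    """Average fraction of the 4 vector lanes doing useful work."""
    return sum(min(w, 4) for w in instruction_widths) / (4 * len(instruction_widths))

# A shader dominated by scalar math, with the occasional vec3/vec4 operation:
mostly_scalar = [1, 1, 1, 4, 1, 1, 3, 1, 1, 1]
print(f"~{vector_lane_utilization(mostly_scalar):.0%} of the vector ALU lanes are busy")
```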
- A similar thing happens with the tensor cores. The tensor cores are set up to work on matrices: they can do a full 4x4 matrix multiplication in one cycle. For regular shader ALUs, I believe you need to distribute the matrix operation across 4 shader cores to get it done in a single cycle. That's where RPM comes in: by using INT8 operations you can pack a full 4x4 matrix into the registers and perform one 4x4 matrix multiplication per cycle. For dimensions higher than 4x4 you still need more cores, but the same is true for the tensor cores; Nvidia also distributes the load across multiple cores if the matrix is too big.
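A rough sketch of what that distribution looks like, in plain NumPy (just to show the arithmetic, not GPU code):

```python
import numpy as np

# A 4x4 INT8 matrix multiply is 16 dot products of length 4. With packed INT8
# math each dot product is one shader-lane op, so the work is spread across
# lanes/cycles; a tensor core consumes the same 4x4 problem as a single
# matrix-multiply-accumulate operation.
A = np.random.randint(-128, 128, size=(4, 4)).astype(np.int8)
B = np.random.randint(-128, 128, size=(4, 4)).astype(np.int8)

# "Shader-style": 16 independent length-4 dot products (one packed op each)
C_shader = np.empty((4, 4), dtype=np.int32)
for i in range(4):
    for j in range(4):
        C_shader[i, j] = np.dot(A[i].astype(np.int32), B[:, j].astype(np.int32))

# "Tensor-core-style": the whole 4x4 product as one matrix operation
C_tensor = A.astype(np.int32) @ B.astype(np.int32)
assert np.array_equal(C_shader, C_tensor)  # same result, very different op counts
```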
- The above is likely why DLSS has so far been implemented without the tensor cores, and performance-wise it still ran very well across the whole RTX lineup. So what are the advantages of the tensor cores, then? From what I gather, one is that they can perform a full 4x4 matrix multiplication in FP16 per cycle, and roughly double that rate with INT8, so in either FP16 or INT8 each tensor core can perform far more matrix calculations per cycle than a single shader core.
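To put the per-core gap in rough numbers (these are the commonly cited per-operation shapes; treat them as assumptions rather than spec-sheet facts):

```python
# Back-of-the-envelope multiply-accumulates per issued operation.
macs_per_op = {
    "scalar FP32 FMA (shader lane)":  1,          # one multiply-add
    "packed INT8 dot4 (shader lane)": 4,          # four INT8 MACs in one 32-bit lane
    "tensor core 4x4x4 FP16 MMA":     4 * 4 * 4,  # a full small-matrix product per op
}
for name, macs in macs_per_op.items():
    print(f"{name:34s} {macs:3d} MACs per op")
```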
- The problem is that the tensor cores suffer from a similar issue to Xenos: they excel at matrix operations but are severely underutilized for scalar or smaller matrix workloads. I could only find partial documentation on the subject, but Nvidia actually acknowledges it. The tensor cores also target non-consumer workloads, though: for example, Nvidia wants their GPUs used in clusters for training the models, not just running them (especially true for FP16 matrix math), and even there utilization of the tensor cores is nowhere near the theoretical peak. This is supported by benchmarks comparing a 2080ti to a 1080ti in machine learning performance. If you account for the fact that the 1080ti lacks some optimizations for int math and doesn't support RPM, and normalize the results just by the RPM multipliers, the 2080ti stays in most cases below a 40% improvement over the 1080ti.
- I found more references for what I said above, but this one has a nice summary:
news.ycombinator.com
Granted, 36% over a 1080ti is no slouch, but it also means that due to the poor utilization the tensor cores are delivering nowhere near the 110 TFLOPS (FP16) or 220 TOPS (INT8) peaks, and that's for the training phase, which in theory has higher utilization of the tensor cores than just running the model.
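Some very rough math on that, using the public peak figures and the ~36% number from the link above. This ignores memory bandwidth and the fact that neither card reaches its peak in practice, so it's only an order-of-magnitude sketch:

```python
# Rough utilization sketch: take the 1080ti's public FP32 peak as a baseline,
# apply the ~36% measured training speedup, and compare the implied effective
# throughput against the quoted FP16 tensor peak.
gtx_1080ti_fp32_peak = 11.3          # TFLOPS, public spec figure
rtx_2080ti_tensor_fp16_peak = 110.0  # TFLOPS, figure quoted above
measured_speedup = 1.36

implied_tflops = gtx_1080ti_fp32_peak * measured_speedup
print(f"implied effective throughput: ~{implied_tflops:.0f} TFLOPS")
print(f"fraction of the FP16 tensor peak: ~{implied_tflops / rtx_2080ti_tensor_fp16_peak:.0%}")
# -> roughly 15 TFLOPS, i.e. ~14% of the quoted tensor peak
```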
Keep in mind this isn't an "Nvidia lied" post or anything like that: the gap is simply because the theoretical operation count assumes a full matrix to multiply every cycle. If you don't have enough matrices to multiply, you're only using a fraction of the operations you could be performing per cycle.
The SX figures, on the other hand, are much lower because they count scalar operations. But the thing about scalar operations is that you can always use them: they carry a penalty when dealing with matrices, but the quoted performance can be fully achieved even by the simplest calculations.
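For reference, the Series X figures follow directly from the shader throughput and the packing factors, which is exactly why ordinary shader code can reach them (12.15 TFLOPS is Microsoft's public FP32 figure; the 2x/4x/8x factors are the standard packed-math multipliers):

```python
# Series X quoted ML figures derived from the FP32 shader throughput:
xsx_fp32_tflops = 12.15
print(f"FP16 (2x packed): {xsx_fp32_tflops * 2:.1f} TFLOPS")  # ~24.3
print(f"INT8 (4x packed): {xsx_fp32_tflops * 4:.1f} TOPS")    # ~49
print(f"INT4 (8x packed): {xsx_fp32_tflops * 8:.1f} TOPS")    # ~97
```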
With all that, I think that even though RTX can obviously achieve a much higher theoretical matrix operation rate, in real-world scenarios utilization of the matrix units is so low that you don't really see the benefit unless your code is all about multiplying matrices, which even ML training often isn't.
I don't know how that would translate to DLSS performance on the consoles, but it honestly sounds like it shouldn't be a problem for them. The fact that they have to give up graphics and compute ALU time to perform the ML workloads could be a problem, but as a counterpoint, I think you can easily win that back by scaling down the resolution a bit, and I assume the process scales somewhat with image resolution. So even an extreme case of a game running at 1080p on SX and scaling up to 4K could translate, on Lockhart, to the game running at 540p and scaling back to 1080p.
I would say the fact that Lockhart targets a resolution 4x lower than SX but only has about 3x less processing power could already account for that. They're setting Lockhart at a higher flops-per-pixel rate than SX, likely to account for work that doesn't scale with resolution, and perhaps also for the extra load of machine-learning upscaling while still reaching the same performance. This could even lead to situations where a game runs slightly below 4K on SX and gets reconstructed to a higher resolution, while on Lockhart it's still 1080p + reconstruction.
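Putting rough numbers on that flops-per-pixel point (the Series X figure is public; Lockhart's ~4 TFLOPS is only the commonly rumoured figure, so treat it as an assumption):

```python
# Compute budget per output pixel, assuming each console reconstructs up to
# its target resolution from a lower internal one.
series_x = {"tflops": 12.15, "output_pixels": 3840 * 2160}  # 4K target
lockhart = {"tflops": 4.0,   "output_pixels": 1920 * 1080}  # 1080p target (rumoured spec)

for name, spec in (("Series X", series_x), ("Lockhart", lockhart)):
    per_pixel = spec["tflops"] * 1e12 / spec["output_pixels"]
    print(f"{name}: ~{per_pixel / 1e6:.2f} MFLOPS per output pixel")
# Lockhart comes out ~30% higher per pixel, which is headroom for work (like
# ML upscaling) that doesn't shrink with resolution.
```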
Obviously we'll only know more once we see actual game code running on them, but so far I'm fairly confident that similar models (with a 4x resolution increase and that resolve sub-pixel detail) can be used effectively in real time on the consoles.