
Alucardx23

Member
Nov 8, 2017
4,712
Microsoft did not add the hardware, though. RPM for INT4 and INT8 is an inherent feature of RDNA1, much less RDNA2 that'll be used in the consoles. Both XSX and PS5 will have this feature. It's an extension of the FP16 version you may have heard of in regards to PS4 Pro.

"We knew that many inference algorithms need only 8-bit and 4-bit integer positions for weights and the math operations involving those weights comprise the bulk of the performance overhead for those algorithms," says Andrew Goossen. "So we added special hardware support for this specific scenario. The result is that Series X offers 49 TOPS for 8-bit integer operations and 97 TOPS for 4-bit integer operations. Note that the weights are integers, so those are TOPS and not TFLOPs. The net result is that Series X offers unparalleled intelligence for machine learning."

www.eurogamer.net

Inside Xbox Series X: the full specs


We still do not know if PS5 supports increased rate int8.
XSX does.

XSX is around 49 INT8 TOPS - the RTX 2080 Ti is about 220 at ca. 1500 MHz (most 2080 Tis run at 1800-1950 MHz in real life, though).
 

Ryoku

Member
Oct 28, 2017
460
Thank you. I will stop talking to you now.

"We knew that many inference algorithms need only 8-bit and 4-bit integer positions for weights and the math operations involving those weights comprise the bulk of the performance overhead for those algorithms," says Andrew Goossen. "So we added special hardware support for this specific scenario. The result is that Series X offers 49 TOPS for 8-bit integer operations and 97 TOPS for 4-bit integer operations. Note that the weights are integers, so those are TOPS and not TFLOPs. The net result is that Series X offers unparalleled intelligence for machine learning."

www.eurogamer.net

Inside Xbox Series X: the full specs


This is interesting as I hadn't seen these numbers before.
XSX 4-bit integer acceleration is 97 TOPS, and 8-bit integer acceleration is 49 TOPS.
For reference:
2080 Ti 4-bit integer acceleration is 455.4 TOPS, and 8-bit integer acceleration is 227.7 TOPS (Founders Edition).

In addition to the seemingly massive performance difference, the RTX cards' integer acceleration is done on dedicated Tensor cores, separate from the shader cores, whereas the XSX's integer operations are done within the shader cores.
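For anyone who wants to sanity-check those Turing numbers, here's a rough back-of-the-envelope script. The tensor-core count and per-core throughput are my assumptions about the 2080 Ti, not figures from this thread:

```python
# Rough sanity check of the 2080 Ti tensor throughput numbers above.
# Assumptions (not from this thread): 68 SMs x 8 tensor cores = 544 tensor cores,
# each doing 64 FP16 FMAs per clock; INT8 doubles that rate, INT4 doubles it again.

TENSOR_CORES = 68 * 8          # 544 on the 2080 Ti
FMA_PER_CORE_PER_CLOCK = 64    # FP16 fused multiply-adds per tensor core per clock
OPS_PER_FMA = 2                # one FMA counts as a multiply plus an add

def tensor_tops(clock_ghz: float, bits: int) -> float:
    """Approximate dense tensor throughput in TOPS at a given clock."""
    fp16_tflops = TENSOR_CORES * FMA_PER_CORE_PER_CLOCK * OPS_PER_FMA * clock_ghz / 1000
    scale = {16: 1, 8: 2, 4: 4}[bits]   # INT8/INT4 pack 2x/4x denser than FP16
    return fp16_tflops * scale

print(f"INT8 @ 1.635 GHz (FE boost): {tensor_tops(1.635, 8):.1f} TOPS")  # ~227.7
print(f"INT4 @ 1.635 GHz (FE boost): {tensor_tops(1.635, 4):.1f} TOPS")  # ~455.4
print(f"INT8 @ 1.5 GHz:              {tensor_tops(1.5, 8):.1f} TOPS")    # ~209, same ballpark as the ~220 quoted above
```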

I'm actually really excited at the potential for Switch 2 to utilize this technology. It would be cool to see Nintendo games finally have really good image quality on my TV.
 

Alucardx23

Member
Nov 8, 2017
4,712
This is interesting as I hadn't seen these numbers before.
XSX 4-bit integer acceleration is 97 TOPS, and 8-bit integer acceleration is 49 TOPS.
For reference:
2080 Ti 4-bit integer acceleration is 455.4 TOPS, and 8-bit integer acceleration is 227.7 TOPS.

In addition to the seemingly massive performance difference, the RTX cards' integer acceleration is done on dedicated Tensor cores, separate from the shader cores, whereas the XSX's integer operations are done within the shader cores.

I'm actually really excited at the potential for Switch 2 to utilize this technology. It would be cool to see Nintendo games finally have really good image quality on my TV.


Yes. We also know the RTX 2060 TOPS number, for a more accurate comparison to see if something useful can be done with lower TOPS performance.

The RTX 2060 is around 100 TOPS - the XSX is 49 for INT8. INT4 TOPS is double for both - so around 200 for the RTX 2060 and 97 for the XSX.
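The same back-of-the-envelope check works for the 2060; the tensor-core count and boost clock below are my assumptions, not figures from this thread:

```python
# Rough check of the "around 100 INT8 TOPS" figure for the RTX 2060, assuming
# 30 SMs x 8 tensor cores = 240, 64 FP16 FMAs per core per clock, ~1.68 GHz boost.

fp16_tflops = 240 * 64 * 2 * 1.68 / 1000    # ~51.6 TFLOPS FP16 on the tensor cores
print(f"INT8: {fp16_tflops * 2:.0f} TOPS")  # ~103
print(f"INT4: {fp16_tflops * 4:.0f} TOPS")  # ~206
```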
 

Zedark

Member
Oct 25, 2017
14,719
The Netherlands
This is interesting as I hadn't seen these numbers before.
XSX 4-bit integer acceleration is 97 TOPS, and 8-bit integer acceleration is 49 TOPS.
For reference:
2080 Ti 4-bit integer acceleration is 455.4 TOPS, and 8-bit integer acceleration is 227.7 TOPS.

In addition to the seemingly massive performance difference, the RTX cards' integer acceleration is done on dedicated Tensor cores, separate from the shader cores, whereas the XSX's integer operations are done within the shader cores.

I'm actually really excited at the potential for Switch 2 to utilize this technology. It would be cool to see Nintendo games finally have really good image quality on my TV.
The big TOPS differential makes the NVIDIA GPU more capable of DLSS-like algorithms (and of course that difference is powered by the tensor cores), but the specific point about the tensor cores being separate from the ALUs might not matter, because DLSS is implemented somewhere in the middle of the graphics pipeline, so the ALUs must wait for the DLSS algorithm to finish before they can continue. This would have been different if DLSS 2.0 slotted in at the end of the pipeline, but it doesn't.

Therefore, the TOPS difference alone should dictate the time differential for computing the DLSS upscaling.
 

Ryoku

Member
Oct 28, 2017
460
The big TOPS differential makes the NVIDIA GPU more capable of DLSS-like algorithms (and of course that difference is powered by the tensor cores), but the specific point about the tensor cores being separate from the ALUs might not matter, because DLSS is implemented somewhere in the middle of the graphics pipeline, so the ALUs must wait for the DLSS algorithm to finish before they can continue. This would have been different if DLSS 2.0 slotted in at the end of the pipeline, but it doesn't.

Therefore, the TOPS difference alone should dictate the time differential for computing the DLSS upscaling.
Oh, so you're saying that the integer calculations can't be done in parallel to the floating point operations?
 

Alucardx23

Member
Nov 8, 2017
4,712
The big TOPS differential makes the NVIDIA GPU more capable of DLSS-like algorithms (and of course that difference is powered by the tensor cores), but the specific point about the tensor cores being separate from the ALUs might not matter, because DLSS is implemented somewhere in the middle of the graphics pipeline, so the ALUs must wait for the DLSS algorithm to finish before they can continue. This would have been different if DLSS 2.0 slotted in at the end of the pipeline, but it doesn't.

Therefore, the TOPS difference alone should dictate the time differential for computing the DLSS upscaling.

That is the point of all of this. You can render a 1080P or 1440P frame much faster than a native 4K frame. This can compensate for adding the machine learning upscaling step at the end or close to the end of the render pipeline.
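To put some toy numbers on that trade-off (all figures here are made up for illustration, under the rough assumption that shading cost scales with pixel count):

```python
# Toy illustration of the trade-off described above, assuming (roughly) that
# per-frame shading cost scales with pixel count. The 16.7 ms native-4K budget
# and the 2 ms upscale cost are made-up numbers, not measurements.

PIXELS_4K = 3840 * 2160
PIXELS_1440P = 2560 * 1440
PIXELS_1080P = 1920 * 1080

native_4k_ms = 16.7                     # hypothetical cost to shade a native 4K frame
upscale_ms = 2.0                        # hypothetical cost of the ML upscale step

for name, pixels in [("1080p", PIXELS_1080P), ("1440p", PIXELS_1440P)]:
    render_ms = native_4k_ms * pixels / PIXELS_4K
    total_ms = render_ms + upscale_ms   # the upscale runs after rendering, not in parallel
    print(f"{name} + upscale: {render_ms:.1f} ms render + {upscale_ms} ms upscale "
          f"= {total_ms:.1f} ms (vs {native_4k_ms} ms native 4K)")
```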
 

Dictator

Digital Foundry
Verified
Oct 26, 2017
4,930
Berlin, 'SCHLAND
The big TOPS differential makes the NVIDIA GPU more capable of DLSS-like algorithms (and of course that difference is powered by the tensor cores), but the specific point about the tensor cores being separate from the ALUs might not matter, because DLSS is implemented somewhere in the middle of the graphics pipeline, so the ALUs must wait for the DLSS algorithm to finish before they can continue. This would have been different if DLSS 2.0 slotted in at the end of the pipeline, but it doesn't.

Therefore, the TOPS difference alone should dictate the time differential for computing the DLSS upscaling.
yeah - Tensors and ALU do not run in parallel. The difference then comes from how "fast" it can execute the ML code.
 

Zedark

Member
Oct 25, 2017
14,719
The Netherlands
Oh, so you're saying that the integer calculations can't be done in parallel to the floating point operations?
No, rather that when you render the image at, for example, 1080p, the shaders on the GPU then pass the image to the tensor cores, which upscale it. While the tensor cores are upscaling, the shaders cannot continue with different tasks, because their next steps depend on the 4K upscaled image. As a result, while the tensor cores are active, the shaders are idle. So, in theory, the approaches of dedicated tensor cores and shader-based DLSS computation do not fundamentally differ: it's just that the tensor cores are much faster at this computation.
 

BitterFig

Member
Oct 30, 2017
1,099
Its advantage is running on tensor cores instead of SMs, so those with Nvidia GPUs potentially have more general resources to throw at things other than resolution.
Thanks for bringing that up. One thing I do not understand is: what is the GPU supposed to do while the upscaling is taking place? I mean, the upscaling needs the current frame (save post-processing) to be completely rendered before kicking in. Doesn't it make sense to throw ALL computational power at the upscaling to get it done as fast as possible and move on with the rest of the rendering steps? Do we have any evidence that DLSS is running only on the tensor cores?

Also, the algorithm is a lot less black-box deep learning magic than I thought. From the presentation, it is more like a clever version of existing stuff: instead of using hand-tuned programs to combine all previous low-res frames as in checkerboard rendering and co., it computes the best program automatically by trying out a lot of them on several games and seeing what works best compared to native (i.e. ground truth, apparently 16K :o).
 

Ryoku

Member
Oct 28, 2017
460
No, rather that when you render the image at, for example, 1080p, the shaders on the GPU then pass the image to the tensor cores, which upscale it. While the tensor cores are upscaling, the shaders cannot continue with different tasks, because their next steps depend on the 4K upscaled image. As a result, while the tensor cores are active, the shaders are idle. So, in theory, the approaches of dedicated tensor cores and shader-based DLSS computation do not fundamentally differ: it's just that the tensor cores are much faster at this computation.
yeah - Tensors and ALU do not run in parallel. The difference then comes from how "fast" it can execute the ML code.
What do you think the differences would be if DLSS was done at the end of the pipeline?
What are the kind of operations that are done after the tensor cores provide the upscaled image?
Would things like post-processed effects look different?
 

ppn7

Member
May 4, 2019
740
I have a question, guys. I saw that there is now a DLSS 2.0 UE4 branch available for game developers.
Does it mean UE4 is updated to natively use DLSS 2.0? Is it possible that current games designed around UE4 could potentially be patched to get DLSS 2.0? Do they need to update the UE4 version in the game? Or will it never happen, and we will only see new games developed with that UE4 DLSS 2.0 branch?

Example: PUBG, Fortnite, Shenmue III etc.
 

Deleted member 11276

Account closed at user request
Banned
Oct 27, 2017
3,223
From my understanding, Microsoft added dedicated hardware in the XSX which supports INT 4/8 instructions, and didn't just add support for the INT 4/8 instructions in the CUs themselves. So it's dedicated hardware similar to the tensor cores (but without FP16 instructions), meaning you have the full 97 TOPS INT4 for machine learning available without taking performance from the CUs.

Now the interesting question is if that hardware can operate concurrently with the ALUs, which the tensor cores afaik cannot because of bandwidth constraints.
 

Liabe Brave

Professionally Enhanced
Member
Oct 27, 2017
1,672
That is not the case. What Microsoft is doing can be defined as accelerating machine learning code as you can try to do the same on a regular GPU with much slower performance. The animation below shows the difference with Nvidia GPUs on how adding support for INT8 and INT4 accelerates performance.
On tensor cores, which are specifically designed for these workloads. The speedup using RPM on AMD CUs is much smaller, and takes from the same pool of resources all other rendering does, unlike with Nvidia.

"We knew that many inference algorithms need only 8-bit and 4-bit integer positions for weights and the math operations involving those weights comprise the bulk of the performance overhead for those algorithms," says Andrew Goossen. "So we added special hardware support for this specific scenario.
"We added hardware support" is being used to mean "we elected to use this part of the feature set AMD had on offer". Again, RPM for INT4 and INT8 is inherent to the RDNA1 architecture, not just XSX. I'm pretty sure it's available on already-released Radeon products.

So while Dictator is correct that Sony haven't announced specifically that they have it, it's much more likely than not. It's a base feature of the microarchitecture previous to the one they're using. And the PS4 Pro added hardware support of RPM for FP16 when it was available, even though Xbox One X a year later did not. If anything, this would suggest Sony value the approach even more highly than Microsoft.
 

Poison Jam

Member
Nov 6, 2017
2,984
OK, so this means that if the post processing would be happening at 1080P there would not be any performance hit?
Nvidia's presentation showed DLSS 2.0 running after the frame is done rendering geometry and shading, but before post-processing and HUD. Meaning, not in parallel.

There's an almost fixed, resolution-dependent overhead.
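A sketch of that ordering, as I understand the presentation being described (stage names are illustrative, not any engine's actual API):

```python
# Where DLSS sits in the frame, per the presentation described above.
# Stage names are illustrative placeholders, not a real engine API.

FRAME_PIPELINE = [
    "geometry + shading (at the lower internal resolution)",
    "DLSS 2.0 upscale to output resolution",   # near-fixed cost that depends on output resolution
    "post-processing (at output resolution)",
    "HUD / UI",
    "present",
]

for step, stage in enumerate(FRAME_PIPELINE, start=1):
    print(step, stage)
```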
 

Zedark

Member
Oct 25, 2017
14,719
The Netherlands
Thanks for bringing that up. One thing I do not understand is: what is the GPU supposed to do while the upscaling is taking place? I mean, the upscaling needs the current frame (save post-processing) to be completely rendered before kicking in. Doesn't it make sense to throw ALL computational power at the upscaling to get it done as fast as possible and move on with the rest of the rendering steps? Do we have any evidence that DLSS is running only on the tensor cores?

Also, the algorithm is a lot less black-box deep learning magic than I thought. From the presentation, it is more like a clever version of existing stuff: instead of using hand-tuned programs to combine all previous low-res frames as in checkerboard rendering and co., it computes the best program automatically by trying out a lot of them on several games and seeing what works best compared to native (i.e. ground truth, apparently 16K :o).
The answer is that the GPU idles while the tensor cores are upscaling the image by applying the DLSS algorithm. The added time from this idling step is more than compensated for by the much shorter rendering time needed for 1080p/1440p images instead of native 4K ones.

DLSS 2.0 is a machine learning-based algorithm that learns typical upscaling situations that other upscaling algorithms typically trip up on (like rapidly alternating height levels at grazing angles). That's the advantage of deep learning: it can learn how to handle those situations much better than a static upscaling algorithm because it can learn the correct solution from native (or supersampled) ground truths, and thus can sidestep many issues. DLSS 2.0 has the added advantage that it pools general information, rather than game-specific patterns, so that each game that adds the technology in turn improves the learned patterns and improves the upscaling. The wonders of AI at work!
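Very loosely, the "learn from ground truth" part boils down to scoring the network's output against a reference render and adjusting the weights to shrink that score. A minimal sketch of just the scoring step, with random placeholder arrays standing in for real frames:

```python
# Minimal sketch of the "compare against ground truth" idea described above.
# The arrays are random placeholders, not real DLSS data, and mean squared error
# is just one plausible choice of error metric.
import numpy as np

def reconstruction_error(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Mean squared error between a reconstructed frame and its reference."""
    return float(np.mean((predicted - ground_truth) ** 2))

rng = np.random.default_rng(0)
predicted = rng.random((216, 384, 3))     # stand-in for the network's upscaled output
ground_truth = rng.random((216, 384, 3))  # stand-in for the supersampled reference
print(reconstruction_error(predicted, ground_truth))
# Training repeatedly adjusts the network weights to push this error down
# across many frames from many games.
```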
 
Oct 26, 2017
20,440
Couple questions.

How much of the cost of the RTX 2060 is the tensor cores? Do we have an estimate? Need to see how much cheaper this is for manufacturers than just increasing traditional power.

If you're making a game to be displayed in 1080p, could you just put 540p textures (which are 1/4 the file size) on the card (along with the pre-trained DLSS results) to potentially save a lot of disc space?
 

Alucardx23

Member
Nov 8, 2017
4,712
On tensor cores, which are specifically designed for these workloads. The speedup using RPM on AMD CUs is much smaller, and takes from the same pool of resources all other rendering does, unlike with Nvidia.

Don't know who you are talking to here. I haven't said anything contrary to this. It is a fact that Microsoft has added hardware to accelerate machine learning code; this has nothing to do with how Nvidia is handling the same problem.

"We added hardware support" is being used to mean "we elected to use this part of the feature set AMD had on offer". Again, RPM for INT4 and INT8 is inherent to the RDNA1 architecture, not just XSX. I'm pretty sure it's available on already-released Radeon products.

So while Dictator is correct that Sony haven't announced specifically that they have it, it's much more likely than not. It's a base feature of the microarchitecture previous to the one they're using. And the PS4 Pro added hardware support of RPM for FP16 when it was available, even though Xbox One X a year later did not. If anything, this would suggest Sony value the approach even more highly than Microsoft.

No, it is not. Can you share how INT4 and INT8 are inherent to RDNA? I'm curious about that.
 
Last edited:

BitterFig

Member
Oct 30, 2017
1,099
DLSS 2.0 has the added advantage that it pools general information, rather than game-specific patterns, so that each game that adds the technology in turn improves the learned patterns and improves the upscaling. The wonders of AI at work!
Unless you test it on prior games, there is a good chance that blindly updating the neural network with new game data will make it worse for older games. That is also the magic of AI :D
(I'm sure in practice the DLSS version that was tested for a game will be the only one used by said game, so there will be no problem like this)

EDIT: What I wanted to emphasize is that DLSS is not JUST an AI algorithm. It is primarily a classical upscaling technique with AI replacing a specific component that was previously tuned by hand. Maybe the older DLSS was closer to a pure AI upscaling technique.
 
Last edited:

Zedark

Member
Oct 25, 2017
14,719
The Netherlands
Couple questions.

How much of the cost of the RTX 2060 is the tensor cores? Do we have an estimate? Need to see how much cheaper this for manufactures than just increasing traditional power.

If you're making a game to be displayed in 1080p, could you just put 540p textures (which are 1/4 the file size) on the card (along with the pre-trained DLSS results) to potentially save a lot of disc space?
The tensor cores are less than 15% of the chip size on the RTX 2080 Ti, so doubling the number of CUs (for example) would be a lot more expensive, as they make up over 75% of the chip, as opposed to adding tensor cores and requiring far fewer CUs (because you render at a much lower resolution).
 

Alucardx23

Member
Nov 8, 2017
4,712
Nvidia's presentation showed DLSS 2.0 running after the frame is done rendering geometry and shading, but before post-processing and HUD. Meaning, not in parallell.

There's an almost fixed, resolution dependant overhead.

OK, so it's just another step being added to the render pipeline. It does produce a faster framerate in the end, because rendering a 540p/1080p frame is a lot faster than rendering a 1080p/4K one, which compensates for the added time of the DLSS step, but there is a performance hit compared to the framerate at just the native 540p/1080p resolution. At least that is how I understand it now.
 

Zedark

Member
Oct 25, 2017
14,719
The Netherlands
Unless you test it on prior games, there is a good chance that blindly updating the neural network with new game data will make it worse for older games. That is also the magic of AI :D
(I'm sure in practice the DLSS version that was tested for a game will be the only one used by said game, so there will be no problem like this)
Well yeah, but careful training/testing subset division ought to mitigate those issues I think. Currently, the number of games is small, so I'd say the AI would probably benefit more than it suffers from extra data.

OK, so it's just another step being added to the render pipeline. It does produce a faster framerate in the end, because rendering a 540p/1080p frame is a lot faster than rendering a 1080p/4K one, which compensates for the added time of the DLSS step, but there is a performance hit compared to the framerate at just the native 540p/1080p resolution. At least that is how I understand it now.
Yeah, benefits are purely due to rendering at a much lower resolution, which offsets the overhead of the DLSS algorithm.
 

Liabe Brave

Professionally Enhanced
Member
Oct 27, 2017
1,672
From my understanding, Microsoft added dedicated hardware in the XSX which supports INT 4/8 instructions, and didn't just add support for the INT 4/8 instructions in the CUs themselves. So it's dedicated hardware similar to the tensor cores (but without FP16 instructions), meaning you have the full 97 TOPS INT4 for machine learning available without taking performance from the CUs.
Your understanding is incorrect. To be clear, we don't have a direct quote from Microsoft themselves denying dedicated hardware. But in the vetted reveal article, Digital Foundry do say specifically there is no such hardware, and integer jobs run on standard GPU resources:

Richard Leadbetter said:
The RDNA 2 architecture used in Series X does not have tensor core equivalents, but Microsoft and AMD have come up with a novel, efficient solution based on the standard shader cores.

There's additional, less direct evidence in how Microsoft talk about performance. Compare how they talk about something we're sure there is dedicated hardware for, raytracing. In that case, they put it this way:

Andrew Goossen said:
Without hardware acceleration, this work could have been done in the shaders, but would have consumed over 13 TFLOPs alone. For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing.

He mentions "dedicated" hardware twice. And he's adding together the actual compute power, plus the equivalent general-purpose compute power that'd be required to do what the specialized RT cores do, to come up with 25TF. Contrast that with how he talks about machine learning work:

Andrew Goossen said:
So we added special hardware support for this specific scenario. The result is that Series X offers 49 TOPS for 8-bit integer operations and 97 TOPS for 4-bit integer operations. Note that the weights are integers, so those are TOPS and not TFLOPs. The net result is that Series X offers unparalleled intelligence for machine learning.

Here, he merely refers to hardware "support", with no use of "dedicated". And the numbers are exactly what you'd get from RPM on the standard compute units. (INT8 takes up a quarter of an FP32 register, so TOPS equals TF times 4; INT4 is packed twice as densely again, so TOPS equals TF times 8.) At no point is the standard 12TF of general compute added in on top; it's merely being traded for the other rates.
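Spelled out as arithmetic (the 52 CUs at 1.825 GHz used below are the widely reported Series X configuration; treat those constants as my assumption, not something stated in this thread):

```python
# RPM arithmetic from the post above: INT8 at 4x the FP32 rate, INT4 at 8x.
# Assumes the commonly cited Series X configuration of 52 CUs at 1.825 GHz,
# with 64 FP32 lanes per CU and 2 ops per fused multiply-add.

fp32_tflops = 52 * 64 * 2 * 1.825 / 1000
print(f"FP32: {fp32_tflops:.2f} TFLOPS")    # ~12.15
print(f"INT8: {fp32_tflops * 4:.1f} TOPS")  # ~48.6, i.e. the quoted 49
print(f"INT4: {fp32_tflops * 8:.1f} TOPS")  # ~97.2, i.e. the quoted 97
```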
 

BitterFig

Member
Oct 30, 2017
1,099
Well yeah, but careful training/testing subset division ought to mitigate those issues I think. Currently, the number of games is small, so I'd say the AI would probably benefit more than it suffers from extra data.
Yeah I agree.

Do you have an idea how big the networks in DLSS 2.0 are? Sadly the presentation has little info. I understand they would want it to be a secret, but 1.5 ms on a 2080 Ti seems like too much for next-gen consoles, especially 60fps games. But I'm sure if MS puts the research resources on it, they ought to find some faster network that would perform close to DLSS at a fraction of the cost.
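For context on that 1.5 ms figure (taking it at face value rather than as a measurement of mine), here's what it would eat out of a few common frame budgets:

```python
# Quick frame-budget check for the ~1.5 ms figure mentioned above.

dlss_cost_ms = 1.5
for fps in (30, 60, 120):
    budget_ms = 1000 / fps
    print(f"{fps} fps: {budget_ms:.1f} ms budget, "
          f"a {dlss_cost_ms} ms pass is ~{dlss_cost_ms / budget_ms:.0%} of the frame")
```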
 
OP
OP
ILikeFeet

ILikeFeet

DF Deet Master
Banned
Oct 25, 2017
61,987
I have a question, guys. I saw that there is now a DLSS 2.0 UE4 branch available for game developers.
Does it mean UE4 is updated to natively use DLSS 2.0? Is it possible that current games designed around UE4 could potentially be patched to get DLSS 2.0? Do they need to update the UE4 version in the game? Or will it never happen, and we will only see new games developed with that UE4 DLSS 2.0 branch?

Example: PUBG, Fortnite, Shenmue III etc.
not really. it's an Nvidia-specific branch of UE4 that has to be downloaded, but in the future, it'll be added to the base version of UE4. and any game can get patched to use DLSS, but having a UE4 option would make things easier. the hardest part is updating to a different version of UE4, which can cause some things to break
 

Deleted member 11276

Account closed at user request
Banned
Oct 27, 2017
3,223
Your understanding is incorrect. To be clear, we don't have a direct quote from Microsoft themselves denying dedicated hardware. But in the vetted reveal article, Digital Foundry do say specifically there is no such hardware, and integer jobs run on standard GPU resources:



There's additional, less direct evidence in how Microsoft talk about performance. Compare how they talk about something we're sure there is dedicated hardware for, raytracing. In that case, they put it this way:



He mentions "dedicated" hardware twice. And he's adding together the actual compute power, plus the equivalent general-purpose compute power that'd be required to do what the specialized RT cores do, to come up with 25TF. Contrast that with how he talks about machine learning work:



Here, he merely refers to hardware "support", with no use of "dedicated". And the numbers are just exactly what you'd get by RPM into the standard compute units. (INT8 takes up a quarter of an FP32 register, so TOPS equals TF times 4; INT4 is packed twice denser still, so TOPS equals TF times 8.) At no point is the standard 12TF of general compute added in on top, it's merely being replaced with other speeds.
You are right, I forgot about that part in the DF article, it's clearly stated that it's not dedicated hardware like the tensor cores. Thanks for clearing that up.
 

BeI

Member
Dec 9, 2017
5,980
So is DLSS limited to a 4x factor, or could you do 1080p to 8k? And is it also only limited to one step? Like, could you do 1080p to 4k, then take that 4k image and go to 8k?
 

Vimto

Member
Oct 29, 2017
3,714
So is DLSS limited to a 4x factor, or could you do 1080p to 8k? And is it also only limited to one step? Like, could you do 1080p to 4k, then take that 4k image and go to 8k?
It's limited to 4x: out of every 4 pixels, 3 are generated by AI. Which is already far beyond anything else.
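The pixel math behind that "3 out of every 4" claim, for a 1080p-to-4K upscale:

```python
# A 4x upscale means the output has four times the pixels of the internal render.

internal = 1920 * 1080          # 1080p internal render
output = 3840 * 2160            # 4K output
generated = output - internal   # pixels the network has to fill in
print(output / internal)        # 4.0
print(generated / output)       # 0.75 -> three of every four output pixels
```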
 
OP
OP
ILikeFeet

ILikeFeet

DF Deet Master
Banned
Oct 25, 2017
61,987
So is DLSS limited to a 4x factor, or could you do 1080p to 8k? And is it also only limited to one step? Like, could you do 1080p to 4k, then take that 4k image and go to 8k?
jumping larger and larger gaps would probably need a lot more computing power than is reasonable to throw at video games. 1080p>2160p>4320p might end up creating more artifacts
 

Liabe Brave

Professionally Enhanced
Member
Oct 27, 2017
1,672
No is not. Can you share how INT4 and INT8 are inherent to RDNA? I'm curious about that.
Here's a link to the AMD whitepaper on RDNA1. The relevant bit is at the top of page 13:

AMD said:
More importantly, the compute unit vector registers natively support packed data including two half-precision (16-bit) FP values, four 8-bit integers, or eight 4-bit integers.
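To make the packed-register idea concrete, here's a toy illustration of four 8-bit integers sharing one 32-bit word (plain Python bit twiddling, not how the GPU ISA actually exposes it):

```python
# Illustration of the packed-data idea from the whitepaper quote: four 8-bit
# integers living in one 32-bit value.

def pack_int8x4(a: int, b: int, c: int, d: int) -> int:
    """Pack four unsigned 8-bit values into one 32-bit word."""
    return (a & 0xFF) | (b & 0xFF) << 8 | (c & 0xFF) << 16 | (d & 0xFF) << 24

def unpack_int8x4(word: int) -> tuple:
    """Recover the four 8-bit values from a packed 32-bit word."""
    return tuple((word >> shift) & 0xFF for shift in (0, 8, 16, 24))

packed = pack_int8x4(1, 2, 3, 4)
print(hex(packed))            # 0x4030201
print(unpack_int8x4(packed))  # (1, 2, 3, 4)
```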

The tensor cores are less than 15% of the chip size on the RTX 2080 Ti, so doubling the number of CUs (for example) should be a lot more expensive as it makes up over 75% of the chip, as opposed to adding tensor cores and requiring much fewer CUs (because you render at much lower resolution).
The estimate I saw was that a tensor core is ~12% the size of an SM. So in theory yes, you could add 8 tensors for the silicon cost of 1 SM. But you have to keep in mind that integer workloads feed back into other systems eventually. So it might be pointless to have 2 or 3 or 8 tensors per SM if they'd be bottlenecked by other resources. Nvidia's promotion of DLSS in fact seems to me to be partially driven by the fact that few games were using tensor cores for much of anything else. So their general applicability seems, at best, to not have been found yet.

As for DLSS itself, I don't find it anywhere near effective enough that I'd start making GPUs bigger just to support it. Digital Foundry spend most of this video talking about how much of an improvement the current version is over the previous version. I wholly agree, because the previous version was actively awful. But this comparative leap in quality is easily being misinterpreted as a leap upward past other reconstruction methods. I definitely don't agree with that. The amount of artifacting and oversharpening in the best cases of DLSS 2.0 is worse than in the best other reconstruction methods. Even DLSS's strength, with transparencies, is approximately matched by some of the best other games using older methods. (Though we don't know what the comparative computational loads are.)
 

Simuly

Alt-Account
Banned
Jul 8, 2019
1,281
Dare I suggest that with next gen GPUs and consoles, a key 'battleground' will be between vendor specific rendering techniques to assist with displaying higher resolutions.

Nvidia's DLSS
Sony's Checkerboard Rendering 2.0
MS' Machine Learning assisted rendering
AMD's ???? (not known yet but we'll find out soon).

The takeaway is this - merely adding extra CUs to a GPU die is great, but there are more efficient uses for transistors/space. Dedicated Tensor Cores are one example of accelerator engines that take up a small amount of space but can enormously benefit the output of a GPU. The trend going forward is more and more of these specific accelerators being put inside GPUs. I think one of Apple's latest mobile chips (I can't recall which one) is packed full of different engines.
 
Dec 4, 2017
11,481
Brazil
I'm in serious need of a hardware upgrade. My PC can't handle 1080p with high settings + 60 fps. I tried using lower settings, but check how different they look:
High:
[image: BRZhlT1.png]

Medium:
[image: GK1BE1F.png]
 
OP
OP
ILikeFeet

ILikeFeet

DF Deet Master
Banned
Oct 25, 2017
61,987
Dare I suggest that with next gen GPUs and consoles, a key 'battleground' will be between vendor specific rendering techniques to assist with displaying higher resolutions.

Nvidia's DLSS
Sony's Checkerboard Rendering 2.0
MS' Machine Learning assisted rendering
AMD's ???? (not known yet but we'll find out soon).

The takeaway is this - merely adding extra CUs to a GPU die is great, but there are more efficient uses for transistors/space. Dedicated Tensor Cores are one example of accelerator engines that take up a small amount of space but can enormously benefit the output of a GPU. The trend going forward is more and more of these specific accelerators being put inside GPUs. I think one of Apple's latest mobile chips (I can't recall which one) is packed full of different engines.
Qualcomm also has tensor accelerators in their mobile CPUs. so this design scheme seems to be the go-to method
 

Deleted member 18161

user requested account closure
Banned
Oct 27, 2017
4,805
This is crazy shit. Native resolution basically doesn't matter anymore.

Remedy were ahead of the curve with Quantum Break and its image reconstruction. Wasn't it native 720p up to 1080p on base XB1?

Could Switch do this or does it need specific hardware? If not this has to be supported in the next model.

Great video, Alex.
 

Muhammad

Member
Mar 6, 2018
187
OK, so it's just another step being added to the render pipeline. It does produce a faster framerate in the end, because rendering a 540p/1080p frame is a lot faster than rendering a 1080p/4K one, which compensates for the added time of the DLSS step, but there is a performance hit compared to the framerate at just the native 540p/1080p resolution. At least that is how I understand it now.
Tensor cores are also a lot faster than general CUDA cores at ML code, so while CUDA cores don't run in parallel with Tensors, if the Tensors were not there, the CUDA cores would require a lot more time to finish the upscale than the Tensor cores do.

Example:
Native 4K on CUDA cores: require 16ms to finish the frame
DLSS 4K (from 1080p) on CUDA cores: require 10ms to finish the frame
DLSS 4K (from 1080p) on Tensor cores: require 7ms to finish the frame
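Turning those example frame times into frame rates (the millisecond figures are the hypothetical numbers from the post above, not measurements):

```python
# Frame rates implied by the example frame times given above.

cases = {
    "Native 4K on CUDA cores": 16,
    "DLSS 4K (from 1080p) on CUDA cores": 10,
    "DLSS 4K (from 1080p) on Tensor cores": 7,
}
for name, ms in cases.items():
    print(f"{name}: {ms} ms -> {1000 / ms:.0f} fps")
```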
 

Muhammad

Member
Mar 6, 2018
187
Here, he merely refers to hardware "support", with no use of "dedicated". And the numbers are exactly what you'd get from RPM on the standard compute units. (INT8 takes up a quarter of an FP32 register, so TOPS equals TF times 4; INT4 is packed twice as densely again, so TOPS equals TF times 8.) At no point is the standard 12TF of general compute added in on top; it's merely being traded for the other rates.
Exactly, right on point.
 

Muhammad

Member
Mar 6, 2018
187
The estimate I saw was that a tensor core is ~12% the size of an SM. So in theory yes, you could add 8 tensors for the silicon cost of 1 SM. But you have to keep in mind that integer workloads feed back into other systems eventually. So it might be pointless to have 2 or 3 or 8 tensors per SM if they'd be bottlenecked by other resources. Nvidia's promotion of DLSS in fact seems to me to be partially driven by the fact that few games were using tensor cores for much of anything else. So their general applicability seems, at best, to not have been found yet.
Tensors also handle double pumped FP16 support, or Rapid Packed Math, which is supported by several games already. So Tensor cores in RTX cards are not used for ML workloads alone.
 

Poison Jam

Member
Nov 6, 2017
2,984
OK, so it's just another step being added to the render pipeline. It does produce a faster framerate in the end, because rendering a 540p/1080p frame is a lot faster than rendering a 1080p/4K one, which compensates for the added time of the DLSS step, but there is a performance hit compared to the framerate at just the native 540p/1080p resolution. At least that is how I understand it now.
Yeah, that sounds about right. It slots into the pipeline where Temporal Anti-Aliasing would normally go, and needs the same frame info to work. So if an engine supports TAA, then it should be able to work with DLSS.
 

smocaine

Member
Oct 30, 2019
2,015
Hopefully DF or someone else will do a comparison between native, native with TAA disabled, Radeon Image Sharpening, and DLSS, with performance metrics.

Seeing a lot of people online still calling it just another sharpening filter.
 

Deleted member 11276

Account closed at user request
Banned
Oct 27, 2017
3,223
Tensors also handle double pumped FP16 support, or Rapid Packed Math, which is supported by several games already. So Tensor cores in RTX cards are not used for ML workloads alone.

Yes, if more games used FP16 calculations for their effects, Turing could also gain a big advantage, since the tensor cores deliver so much more FP16 performance.

But I don't know if these rapid packed math games use the Turing tensor cores at all; wouldn't the devs have to program with that in mind? FP16 in Far Cry 5, for example, is used for water physics, but only for Vega apparently? I really wonder how many effects in games can be rendered with half precision.
 
Last edited:

Liabe Brave

Professionally Enhanced
Member
Oct 27, 2017
1,672
Native resolution basically doesn't matter anymore.
Incorrect, as a generalization. As with other reconstruction methods, there are tradeoffs; you have to be willing to live with the blurring, sharpening, and artifacts of DLSS. Native resolution still matters for people who want the most accurate rendering, whereas many reconstruction solutions are perfectly acceptable to other folks. Though it's true that at very high resolutions like we're about to see in gen 9, you can get away with more unnoticed.

Remedy were ahead of the curve with Quantum Break and its image reconstruction.
Image reconstruction has been happening to certain parts of the rendered image for much longer than that, and for the whole image at least since Killzone: Shadow Fall several years earlier. (There may well be earlier examples I'm not aware of.)

Could Switch do this or does it need specific hardware?
The 2.0 version of DLSS that produces acceptable results requires specific hardware.

Tensors also handle double pumped FP16 support, or Rapid Packed Math, which is supported by several games already. So Tensor cores in RTX cards are not used for ML workloads alone.
I didn't mean to imply the tensor cores have been totally idle. Just that they aren't saturated by typical game loads, or else adding DLSS tasks would tank performance compared to native, as that other work was moved to SMs. This makes sense, as FP16 isn't viable for all or even a majority of game calculations.

Hopefully DF or someone else'll do a comparison between native, native with T-AA disable, Radeon Image Sharpening, and DLSS, with performance metrics.

Seeing a lot of people online still calling is just another sharpening filter.
To be fair, it includes a pretty heavy sharpening pass that's very visible. Sharpening is used in other reconstruction techniques too, but since DLSS is trying to cover a bigger gap it's more obvious. (And it seems to be more intense than some others, too.) But the fact that DLSS also produces lots of pixel breakup artifacts definitively proves it's not just sharpening.
 

Dekuman

Member
Oct 27, 2017
19,026
This is crazy shit. Native resolution basically doesn't matter anymore.

Remedy were ahead of the curve with Quantum Break and its image reconstruction. Wasn't it native 720p up to 1080p on base XB1?

Could Switch do this or does it need specific hardware? If not this has to be supported in the next model.

Great video, Alex.
Needs new hardware, but only one console maker is working with Nvidia right now, and it's Nintendo.

Here's DF's video about a possible DLSS use in a future Switch
youtu.be

In Theory: Could Next-Gen Switch Use Nvidia DLSS AI Upscaling?

 

Phellps

Member
Oct 25, 2017
10,805
This is a must have on the next Switch. Being able to render lower resolutions and output a high definition reconstructed image will allow it to punch way above its weight.
 

PLASTICA-MAN

Member
Oct 26, 2017
23,614
Can DLSS 2.0 tech be used on low-quality videos to improve quality and save bandwidth when watching or streaming?
 

ArchedThunder

Uncle Beerus
Member
Oct 25, 2017
19,060
Echoing what everyone else is saying, man this makes me excited for Switch 2, DLSS could be a huge game changer for it.
 

Deleted member 18161

user requested account closure
Banned
Oct 27, 2017
4,805
Needs new hardware, but only one console maker is working with nvidia right now, it's Nintendo.

Here's DF's video about a possible DLSS use in a future Switch
youtu.be

In Theory: Could Next-Gen Switch Use Nvidia DLSS AI Upscaling?


Interesting, thanks.

Liabe Brave

Thanks. For me, if this version was the best it would ever get, I'd be more than happy, considering you need to zoom in 600% to see the trade-offs.

The thought of the next 3D Mario, Mario Kart and Zelda running at 1080p on the Switch 2 / Pro but displaying what is essentially very close to native 4k is extremely exciting to me, a very frustrated (when it comes to hardware power) Nintendo fan.

Does AMD have something similar which could be in PS5? Sorry if already answered; I've been watching wrestling instead of reading.
 

Heraldic

Prophet of Regret
The Fallen
Oct 28, 2017
1,633
Exciting. PC gaming hitting incredible strides. Can't wait for my RTX 2070 super.
 

Dekuman

Member
Oct 27, 2017
19,026
Already happening with the Nvidia Shield TV.

AI Upscaling


Yeah, the AI upscaling Rich tested in his video is from the Shield TV, which uses the exact same revised Mariko chipset as the new Switch.
That said, it delivered mixed results (which he points out pretty early on in his video); he speculates that, perhaps because it's a streaming box and not a games machine, the AI algorithm may be tuned more for video at 30fps.

So if the theoretical Switch 2/Pro is going to use DLSS tech to deliver better image quality without requiring many more teraflops of conventional hardware, then I would suspect they would go with something like DLSS from the new Nvidia GPUs, like from the Volta family.

So here's my prediction: the raw FLOP number of the Pro or Switch 2 will be underwhelming, but it will have tensor cores specifically set aside to do things like DLSS 2.x or whatever version is more current closer to release. So games will still look amazing.