Nekotekina



Status update from kd-11 (30-04-2023)

Hi,

It's kd-11 with another update on RPCS3 emulation.

It's been an eventful 3 months since the last update. There's a lot to unpack in this update, so let's start with the general fixes and improvements.

1. Reimplemented the native overlay message queue system. This allows more consistent management of popup alerts and notifications while in-game. Animation support (fade-in, fade-out, animated icons, etc.) was also added. We can finally design a comfortable in-game notification and alert system, like the UI on a real console.

2. Fixed a long-standing font-rendering bug when displaying in-game controls such as message dialogs and popup alerts. Several other fixes landed alongside it, including one for the broken rendering of rounded rectangles.

3. Added a fix to fully discard broken shaders on input instead of crashing. This allows games that may have desync problems to keep running and self-heal instead of just asserting when we see a bad shader. Real hardware doesn't crash either, so it's the correct solution.

4. Input unification was added to the overlay system. This means that all input (e.g. loading save games, accepting confirmation dialogs, etc.) goes through one always-on background thread. This simplifies adding new elements to our native UI, as we don't need to reimplement the input handling every single time.

5. Performance optimization: memory allocation was tightened significantly. By completely removing dynamic allocations from the FIFO loop, performance improved measurably (~10%) in some draw-call-heavy titles.

6. Implemented deferred video memory allocations for resource-heavy work such as texture decoding. RPCS3 uses a memory pool concept where each component has a capped amount of VRAM it is expected to take. Several paths can be taken depending on the hardware capabilities of the host GPU, each with different memory requirements. Most GPUs don't actually need shared memory allocations to decode textures, as we use the GPU to move the data from system RAM to VRAM. However, in the background, the allocator kept providing working memory as though it was going to be used, which created unnecessary memory pressure. We can instead do a virtual allocation that takes the memory only if it is needed. Lowering memory pressure in texture-heavy scenarios significantly improves frametime smoothness in some games, as we no longer waste time dealing with out-of-memory issues on memory that was never used in the first place.

7. Some fixes and improvements were made to the Vulkan renderer to improve compliance with the official specification and, in turn, be friendlier to GPU drivers.

8. An experimental patch to cap VRAM allocations on GTX 970 cards to around 3GB was introduced. This came after we noticed that a much higher than average number of Vulkan crashes came from GTX 970 users and seemed to coincide with high memory usage scenarios. We continue to monitor the situation while we look for the real root cause.

9. Improved detection and workaround shims for Apple M1 support. Apple GPUs still have issues running RPCS3 in some games, a situation that we're investigating. Due to the use of multiple abstraction layers (MoltenVK, Rosetta), however, it is not easy to find what triggers bugs on M1 that are not present on other systems.

10. Fixed a JIT recompiler bug that caused incorrect vertex streams to be sent to the GPU. This fixed missing visuals in some titles such as CoD3.

11. Added a workaround for low-precision NVIDIA interpolation. This is still a work in progress, as I have some unsubmitted patches that fix the problem a lot more efficiently. This should be closed in the next few days.

12. Fixed some problems with attachment handling on the OpenGL renderer. On some platforms the OpenGL renderer is still a viable way to enjoy your games.

13. Multiple improvements were made to the viewport handling code, fixing missing and broken visuals in several titles.
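
The deferred allocation idea from item 6 can be sketched roughly as follows. This is only an illustration of the concept, not RPCS3's actual allocator: the names `MemoryPool`, `DeferredAllocation` and `decode_texture`, and the `needs_staging` flag, are all hypothetical.

```python
class MemoryPool:
    """A capped pool. Names and sizes are illustrative, not RPCS3's."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0

    def commit(self, size):
        # Actually take memory from the pool; fail if the cap would be exceeded.
        if self.used + size > self.capacity:
            raise MemoryError("pool exhausted")
        self.used += size
        return bytearray(size)

    def release(self, size):
        self.used -= size


class DeferredAllocation:
    """Takes nothing from the pool until the buffer is first requested."""

    def __init__(self, pool, size):
        self.pool = pool
        self.size = size
        self._buffer = None  # no memory taken yet

    def get(self):
        # Commit on first use only; later calls reuse the same buffer.
        if self._buffer is None:
            self._buffer = self.pool.commit(self.size)
        return self._buffer

    def free(self):
        if self._buffer is not None:
            self.pool.release(self.size)
            self._buffer = None


def decode_texture(pool, size, needs_staging):
    """Hypothetical decode path: only some hosts need the working buffer."""
    scratch = DeferredAllocation(pool, size)
    try:
        if needs_staging:
            scratch.get()  # memory is taken only on this path
        return pool.used   # pool pressure observed during the decode
    finally:
        scratch.free()
```

In this sketch, a host that never touches the working buffer puts no pressure on the pool at all, which is the effect the change is after.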

Now that we have the fixes out of the way, let's talk about research. Over a period of about 8 weeks starting in mid-February, I investigated the viability of GPU-accelerating some SPU instructions. This is a long-standing question, given RPCS3's heavy CPU requirements. I have always maintained that GPUs are not good candidates for SPU emulation, but I had to confirm this empirically.

At first, this seemed like just a silly experiment to do on the side, but as I kept working on it, things improved to the point of running the PS3 Mandelbrot sample, a heavy SPU workload that is open source. As I had theorized many years ago, performance was terrible. In fact, on the first run, I kept stopping the application because I thought it had hung. In reality, it just took several minutes to render a single frame. Yes, that's right, we were in minutes-per-frame territory.

The biggest challenge with GPU work is that, unlike CPU workloads, your load has to parallelize easily to work well. SPUs are separate cores with their own hardware units, so this is actually quite difficult to map onto a GPU. For the sake of thoroughness, I started tuning aggressively, and in the end I only managed to achieve parity with the SPU interpreter. This isn't particularly great performance: for reference, the LLVM recompiler is about 90-100x faster than the same kernel recompiled for GPU.

The main bottleneck seems to be memory related. The native SPU core has 128 general-purpose registers of 128 bits each, which doesn't leave much headroom for emulation. SPU kernels are also much, much larger than a normal CPU function. Register allocations are also permanent across the whole application, and we have to persist the writes on exit for the next invocation. This leads to a horrible situation where most of the time is actually wasted handling register spills and committing writes to memory rather than executing ALU instructions.

That said, even if we could somehow get over this hurdle, ALU performance is terrible because SPUs are not IEEE compliant. The emulation overhead is just severely punishing on GPUs. The experiment is therefore not a complete failure, but there are enough issues with it that I consider the approach infeasible. I'll keep tinkering with it on the side, but this experiment is shelved for now. I will publish the source at a later date for the curious to peek at.
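
To put the register pressure in numbers, here is a back-of-the-envelope sketch. The register count and width come from the figures above; the traffic model (reload everything on entry, write back on exit) is a simplifying assumption for illustration, not a measurement from the experiment, and the function names are made up.

```python
SPU_REGISTER_COUNT = 128  # general-purpose registers per SPU, as stated above
SPU_REGISTER_BITS = 128   # width of each register, as stated above


def register_file_bytes():
    """Size of one SPU's general-purpose register file in bytes."""
    return SPU_REGISTER_COUNT * SPU_REGISTER_BITS // 8


def persist_traffic_bytes(invocations, dirty_registers=SPU_REGISTER_COUNT):
    """Rough estimate of register state moved through memory.

    Assumes (hypothetically) that every invocation reloads the whole
    register file on entry and writes back `dirty_registers` on exit.
    """
    bytes_per_register = SPU_REGISTER_BITS // 8
    reload = register_file_bytes()
    write_back = dirty_registers * bytes_per_register
    return invocations * (reload + write_back)
```

`register_file_bytes()` comes to 2048 bytes (2 KiB) of register state per SPU context. Under the assumed model, that state has to go through memory in both directions on every kernel invocation before any ALU work is counted.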

Some screenshots:

1. Early progress. Outputs are incorrect as loops were not executing (hence the "good" performance).

2. Making progress. Execution results are correct but performance is atrocious.

3. With more aggressive tuning and a basic register relocation pass.


Thank you all for your continued support.

Regards,

kd-11

Comments

How to install kd-11 ?

Rizki Ramadhan

respect

JiaWen Li

Time to grab the popcorn

Dormant_Hero

Thank you all for the continued passion and research into this great emulator!

Master

Interesting update. Thanks for keeping us in the loop.

polytoad

Thank you all for your continued work :)

Andreas

