Status update from Nekotekina (2020-10-31)
Added 2020-11-01 02:30:08 +0000 UTC
Hello everyone. This is my late post for October, and it will be big because I made a lot of commits. I'm extremely sorry for being late, but I was busy fixing last-day regressions… I'm also sorry if some commit links are unrelated, because there is too much information to check.
- Improve Trophy Installer robustness. I made this commit mostly for myself, because I symlinked my userdata directory for backups. Relaxed the paranoid mount point locking and temporary directory creation mechanism, which was incompatible with a setup where the user directory is symlinked. Instead, the temporary directory is now created as close to the target as possible. https://github.com/RPCS3/rpcs3/commit/0ac3dbfec9cb8f171b2e98c9a9b8e6dcb2717a71
- Improve filesystem mountpoint detection. Properly handle . and .. path components in mountpoint detection. For example, "/./dev_hdd0" is the same as "/dev_hdd0", but RPCS3 was unable to handle it correctly, which could cause some minor glitches. Also removed the /app_home mountpoint; the mountpoint is now detected from where the game was booted. Also added the missing dev_root mountpoint for paths that are simply "/". https://github.com/RPCS3/rpcs3/commit/9b22661c19fd44dcdc34ef053b3ae01d33dc08e5
- Implement vm::reservation_op, an extremely old idea. As you may know, the PS3 SPU does atomic updates on whole cache lines, which is normally impossible on most x86-64 processors, putting TSX aside for a moment. On the SPU, the update is split into two basic instructions, called GETLLAR (load cache line and monitor for writes) and PUTLLC (conditional store if no writes were made). They typically go in a loop: if the conditional store fails, the program starts from the load again. But in order to implement PUTLLC, we had to implement a complicated mechanism which basically pauses all active threads and creates an illusion of atomicity by only letting one thread compare and update a big chunk of memory. So here is the idea: if we have to do this expensive stuff anyway, why can't we just put the whole loop inside, pausing threads before we ever load the memory? This approach is untested, though, and is currently only possible with HLE emulation of cellSpurs (for example). Several developers have reimplemented some chunks of cellSpurs, including me and Elad, who was pushing updates to cellSpurs recently, so I decided to look at this again.
- Also, if TSX is available, vm::reservation_op attempts to execute a transaction instead of locking other threads, which in theory can save a lot of CPU time if it succeeds. Otherwise it falls back to heavyweight thread pausing.
- Implement vm::reservation_peek (later renamed to peek_op), similar to vm::reservation_op, but only for reading memory, mimicking GETLLAR. GETLLAR doesn't pause threads at the moment; instead it reads memory in a loop and compares for changes. This approach is shaky and potentially unsafe, but generally faster than doing it the hard way, like PUTLLC. https://github.com/RPCS3/rpcs3/commit/89f124814089981aeedcf1d1edab987d9ba31c88
- Implement vm::reservation_light_op (later renamed to light_op), which permits small atomic operations that are natively possible on x86-64, while still respecting heavyweight operations like GETLLAR and reservation_op. https://github.com/RPCS3/rpcs3/commit/346a1d4433621db384005eff587e69dceb46dd47
- Remove PPU transaction for conditional store. The PPU side of the PS3 is much closer to x86-64, so it can be emulated without heavyweight techniques like thread pausing. TSX transactions only benefit more heavyweight operations. https://github.com/RPCS3/rpcs3/commit/6d83c9cc0ea54d5e907f13bc8b9fa9d31b03271a
- Rewrite cpu_thread::suspend_all as a higher-order function (one accepting another function); it was RAII before. I don't believe I had some kind of revelation about how to implement it effectively, or maybe I'm just stupid, but now it's much simpler and more efficient than before. This is similar to the heavyweight lock used on non-TSX processors, but it was unoptimized because transactions were able to do otherwise impossible atomic operations. But desperate times require desperate measures: as TSX-FA came into the game, transactions started to fail more, and it became obvious that without some work TSX might become useless and even be removed from the emulator. I refused to go down this path because I still believe it may be beneficial, even if not for everyone and for everything. But back to suspend_all. I did a really stupid thing at first: I allowed all threads that wanted to do their transactions to enter some sort of "critical section" which was supposed to stop all other threads. Not only did it create a huge amount of contention between different CPU cores, it was also extremely hard to do right. Now it's much simpler: only the first thread to attempt it does all the work, while the others just push their own work into the queue and wait. It has some disadvantages too, but it can be worked on. I also implemented the missing feature of removing already sleeping threads from the huge queue of threads destined to be paused. https://github.com/RPCS3/rpcs3/commit/050c3e1d6b56b2cad3cfc41bde2b641a7403beab https://github.com/RPCS3/rpcs3/commit/b74c5e04f581ebc9b5fa97da9a7e8e5ffd38055b https://github.com/RPCS3/rpcs3/commit/ec7d243ee942a56a76f7f0a1734f92aea7de7b95
- JIT cleanup for PPU LLVM. This work was done a few months ago, but now I decided to finally rebase and commit it. Basically, we had a huge area in the first 2G of the process, and this area was fixed to 512M. Then one of the developers made it even more complicated by splitting it into multiple chunks so that it would still fit in the 2G. It's a hack that's not really necessary on a 64-bit system, but it's still being used for various purposes. For example, all RPCS3 code is still located in this area. It was used to work around some bugs in LLVM which expected it, but then I wrote a better fix. It was also used to compress function pointers into 32-bit values in order to optimize the PPU interpreter, but who needs it if we can get the recompiler to work? Now LLVM can use as much virtual memory as it needs, dynamically allocating and deallocating it properly. Sadly, I removed an interpreter optimization made by Elad which compressed both the function pointer and the PPU instruction code into a single 64-bit value (32+32=64). Now the pointer is a single 64-bit value. If you remember, I had plans to compile interpreters at runtime, and I may need 64-bit pointers for this to work.
- Also, I completely (I hope) disabled unwind info registration on all platforms, because we disabled exceptions in the emulator and don't really need call stack unwinding anymore. https://github.com/RPCS3/rpcs3/commit/f2d2a6b605894cd514ac13a376f784bc1987adbd
- Implement utils::tx_start (for TSX). What's notable about this is that I was unable to do a similar thing before, although I wanted to. I used inline assembly for all non-MSVC compilers. Ideally you don't need assembly at all, but I had a really bad experience with either GCC or Clang compiling totally wrong code, so I decided to give it another chance with this new approach. vm::reservation_op also uses inline assembly, so let's hope it will work. https://github.com/RPCS3/rpcs3/commit/b57a9c31f049a612c73fe321edb1988cf5293d1b
- SPU: Implement S1/S2 (SNR) events with a small experimental TSX path for them. The problem is that these are communication channels for the SPU (SNR1 and SNR2 in this case) which accept data from other threads asynchronously. But the SPU also has interrupts, which have to be triggered in such a case if enabled. Basically, I have to modify two values at once, which is normally impossible without locking on x86-64. https://github.com/RPCS3/rpcs3/commit/a806be8bc4b40e80c922e04347d3fe83fc83b9a9
- Some cleanup, like partial removal of vm::reservation_lock which resembles a mutex lock, but I don't want to explain this now since this post is already too long. https://github.com/RPCS3/rpcs3/commit/dcff8c2637c6583d855bd8e2abc33e33b6931bfb
- SPU: improve PUTLLC for TSX-FA affected processors. https://github.com/RPCS3/rpcs3/commit/91db4b724c1639cb297730d268c7f3eade9c9606
- TSX: reimplement the spu_getllar_tx transaction. It is only used as a backup method of reading reservation data, since it's capable of reading a huge amount of memory (128 bytes) atomically.
- Fix and slightly simplify ppu_stcx_accurate_tx. It is only used in one of the accurate modes for PPU, implemented by Elad and ported from the TSX PUTLLC implementation. https://github.com/RPCS3/rpcs3/commit/facde634602f670be9c2f1f35ab5ee38d4f33565
- Reimplement the ASMJIT runtime a bit. Ironically, it uses the same hack of finding a place in the first 2G of the RPCS3 process. But unlike LLVM, it's a very, very small area of memory. It's required for some assembly routines written using ASMJIT, which are generated dynamically but only at startup. Placing them there may be beneficial for accessing other functions of RPCS3 because they end up close by. The limitation of the standard x86-64 code model (called the Small code model) is that, to encode a jump distance most effectively, the target should be within reach, basically within 2G, because the distance is 32 bits (a legacy of 32-bit x86). Also added error checking in the ASMJIT assembler to waste less time on debugging (thanks Elad). https://github.com/RPCS3/rpcs3/commit/3d980a9f6657a6384277dafcee66834eb36c2e2d
- Implement the cpu_thread::if_suspended routine. Unlike suspend_all, mentioned above, this function is tentative and does not initiate heavyweight thread pausing; it only adds its workload to the list if the list already exists. I used it for opportunistic GETLLAR execution on TSX fallback. How does it work? Well, emulator threads periodically poll (check) a specific set of bits (flags), and one of them tells them to sleep and wait. This is much, much faster than trying to use dangerous system routines such as SuspendThread on Windows, and it is also safe, since we can't sleep in the wrong place. (It can still happen when the system forcefully interrupts emulator thread execution, but that shouldn't be a big problem if you have enough CPU cores.) https://github.com/RPCS3/rpcs3/commit/adf50b7c4bb6681a5890bb03a3ee74b46844fec3
- Implement performance statistics counters for PPU/SPU reservation ops and some other functions. You may ask why RPCS3 reports the TSC frequency on startup. This is one of its uses (assuming it's correct, but if it isn't, we can see it in the logs). Performance counters use RDTSC, the most raw instruction available on x86-64 to obtain a certain "time". I wanted it to add as little overhead to RPCS3 as possible. But historically, this "time" returned from RDTSC is not in nanoseconds or cycles; you could say it's a separate timer running at a fixed frequency on your motherboard. Its implementation may vary; we only require a modern processor where this timer is steady and independent of overclocking and other factors. Performance statistics are enabled by the "Enable Performance Report" setting, recently added to the debug tab. https://github.com/RPCS3/rpcs3/commit/120849c73455f1aed35a68d1aaa9464b4da2f6d6
- Remove XABORT in TSX transactions. In my tests it can be expensive, for reasons I'm not sure of. It seems much cheaper to just commit (finish) an empty transaction (one which did nothing or just read some memory). The real use of XABORT may be to handle complex situations which use a fundamentally different, optimistic coding style, thanks to its ability to roll back all changes to memory and register state. https://github.com/RPCS3/rpcs3/commit/4384ae15b46f3444f98f8c5a2a5167b00298c02e
- Improve Waitable Atomics a lot. Originally this is a feature from the upcoming C++20 language standard, which adds an optimized wait() function to std::atomic. But not only is it unavailable to us yet, we can also do better (or die trying). First, I gave each thread an individual "semaphore" (implemented differently on Windows, Linux, and other platforms). It's a fundamental thing that puts the current thread to sleep with the ability to wake up later, after a signal or timeout. The CPU core resources are freed and may be reused by other threads, or simply save some power. Unfortunately, I limited the maximum number of threads waiting on a single address to 60 (it will be 56 if I compress some memory tables). The limit can be increased, but that would require very tricky work, and currently it simply isn't needed. https://github.com/RPCS3/rpcs3/commit/8628fc441db320d8e8770dfb5346e3ebee5ad7ad
- Optimization for Windows 7 wake-up functions. These are actually undocumented syscalls, called keyed events. The big PITA with these is that they can be blocking. Basically, if they don't wake up a thread, they go to sleep instead, and the thread that was going to wait wakes them up instead! They are effectively symmetrical. This is nonsense. Fortunately, they can be called without blocking (if zero is specified as the timeout, they never go to sleep), so if we need to wake up more than one thread, we can optimistically wake all wakeable threads first, and then go back to waking the remaining threads. https://github.com/RPCS3/rpcs3/commit/c479d431a458ed3bc6f5cbc45a136ae28d125f3d
- Another bunch of syscalls has been available since Windows 8. https://github.com/RPCS3/rpcs3/commit/97ae5ab56165bd3371060344bd19ba8b6693dc2e
- I made use of the Windows 8 syscalls in waitable atomics when they are available. They are designed much better: the notifying function does not sleep, instead it sets a flag which lets the sleeping syscall wake up instantly. Not only is this better, but it also fixed a weird issue where an UNFOCUSED RPCS3 window had significantly better performance! This is more nonsense. Perhaps the older Windows 7 syscalls were not meant for use with newer systems. If you ask why we bother with these undocumented syscalls at all, it's because they accept a timeout not in milliseconds but in smaller units (100ns), and being able to wait for half a millisecond rather than a whole one seems critical for RPCS3 performance on Windows. https://github.com/RPCS3/rpcs3/commit/7db77a55807f7a81fa37470f160a8db7e594defa
- Another improvement in waitable atomics is the ability of notification functions to cooperate with a sleeping function (such as suspend_all). Waking up many threads can be a long task, so we can signal that we are busy with it, and then the suspend function does not have to wait for us and can start its work immediately. But enough details, I guess. https://github.com/RPCS3/rpcs3/commit/6806e3d5c73e472b497140dff9fc5058976c4997
- Improve range_lock and shareable cache. Range locks were first introduced by Elad for non-TSX emulation. I made range locks dynamically allocatable by locking a bit in a single 64-bit atomic variable. Before, they were allocated in some kind of endless loop and were limited to 6, which could cause problems. Now each SPU thread which needs a range lock gets it once and releases it at the end of its life. We don't have that many SPU threads, so 64 is enough. https://github.com/RPCS3/rpcs3/commit/4966f6de73c390a7f2314c8b45f7e4db7fdd97a3
- Another improvement to waitable atomics is the ability to wait on 128-bit atomics (supported by x86-64). https://github.com/RPCS3/rpcs3/commit/c50233cc923c3bbc638a80fa7f853c772219279e
- SPU: Improve Accurate DMA option to not use vm::reservation_lock, because it only allows one thread at a time. https://github.com/RPCS3/rpcs3/commit/c491b73f3a7bc5dfd17ed8c96b44f39ae11e72e0
- Implement the basis for range lock flags, which can communicate with threads in a thread-safe but lock-free manner instead of locking the over-used global memory mutex. For instance, such locks will allow individual threads to check memory for writeability or readability without locking and without failing on a heavyweight access violation, which is handled by the OS and takes a lot of CPU time. https://github.com/RPCS3/rpcs3/commit/86785dffa4827713d5cc88e57c4bd6d1c0bea038
- SPU: load previous data on PUTLLC failure, since the program will most likely execute GETLLAR to load it again. Only implemented for TSX at the moment, where it's essentially free for a transaction to load it atomically. https://github.com/RPCS3/rpcs3/commit/425fce5070b56c489c7028251a4c228f8c1a5e7e
- SPU: fix do_dma_transfer() by making it static (not getting the default "this" argument of the SPU thread). This function can be called by another thread in some cases, and it has also become extremely complex, so I hope I fixed some past and future problems. https://github.com/RPCS3/rpcs3/commit/006c783aba482c0516ea7c231488f1fe5b1573ad
- CPU: Improve cpu_thread::suspend_all for cache efficiency (TSX) with prefetch list parameters. Workloads may be executed by another thread on another CPU core, which means they may benefit from directly prefetching data from the provided list. Also implement mov_rdata_nt for "streaming" data from such workloads. This is an x86-64 feature that gives a chance to push a cache line to memory without it being acquired by the current CPU core. https://github.com/RPCS3/rpcs3/commit/0da24f21d65f3a6a1bd8e92cc0c2d27d3432bb04
- TSX: new fallback method (time-based). Basically, it uses the timestamp counter (RDTSC, which I explained before). Added two settings: execution limits for transactions (first chance and second chance). They are specified in nanoseconds. They are not available in the GUI for now, but they are in the Core section, called "TSX Transaction First Limit" and "TSX Transaction Second Limit", and can be edited directly in config.yml after saving it. https://github.com/RPCS3/rpcs3/commit/86fc842c89565b696263353785f9f5db35b9000d
- Fix vm::page_protect and the as yet unused range flags. Fix some bugs, such as freezing in Skate 3 when TSX is disabled, or an error message being thrown in certain games such as Heavenly Sword. https://github.com/RPCS3/rpcs3/commit/ca57f25f261ab255681e94fb4c6df87dc5ed234e
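To make the mountpoint item above more concrete, here is a minimal sketch of collapsing . and .. components before looking at the first path component. The function name normalize_path is hypothetical and this is not the code from the linked commit, only an illustration of the idea under the assumption that ".." at the root stays at the root.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not RPCS3's actual function): collapse "." and ".."
// components of a guest path such as "/./dev_hdd0".
std::string normalize_path(std::string_view path)
{
    std::vector<std::string_view> parts;
    std::size_t pos = 0;

    while (pos < path.size())
    {
        std::size_t end = path.find('/', pos);
        if (end == std::string_view::npos)
            end = path.size();

        const std::string_view part = path.substr(pos, end - pos);

        if (part == "..")
        {
            // Assumption: going above the root simply stays at the root.
            if (!parts.empty())
                parts.pop_back();
        }
        else if (!part.empty() && part != ".")
        {
            parts.push_back(part);
        }

        pos = end + 1;
    }

    std::string result;
    for (const std::string_view part : parts)
    {
        result += '/';
        result.append(part.data(), part.size());
    }

    if (result.empty())
        return "/"; // a bare root path, which the post maps to dev_root

    return result;
}
```

With this, "/./dev_hdd0" and "/dev_hdd0" normalize to the same string, so a mountpoint lookup on the first component sees the same name in both cases.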
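The GETLLAR/PUTLLC loop and the reservation_op idea from the list above can be sketched roughly as follows. Everything here is a simplified model with hypothetical names (line_t, snapshot_t, the helper functions); a plain mutex stands in for the expensive "pause all threads" step, and even the load takes it for simplicity, which the post says the real GETLLAR does not do. It is not how RPCS3 implements any of this.

```cpp
#include <array>
#include <cstdint>
#include <mutex>

// Hypothetical model of one 128-byte reservation line.
using rdata_t = std::array<std::uint8_t, 128>;

struct line_t
{
    rdata_t data{};
    std::uint64_t stamp = 0; // bumped on every successful store
    std::mutex heavy;        // stand-in for pausing all other threads
};

struct snapshot_t
{
    rdata_t data;
    std::uint64_t stamp;
};

// GETLLAR-like: load the whole line and remember its state.
snapshot_t getllar(line_t& line)
{
    std::lock_guard<std::mutex> lock(line.heavy);
    return {line.data, line.stamp};
}

// PUTLLC-like: store only if nobody wrote since the snapshot was taken.
bool putllc(line_t& line, const snapshot_t& seen, const rdata_t& desired)
{
    std::lock_guard<std::mutex> lock(line.heavy);
    if (line.stamp != seen.stamp)
        return false;
    line.data = desired;
    ++line.stamp;
    return true;
}

// Guest-style loop: load, modify, conditional store, retry on failure.
void increment_llsc(line_t& line)
{
    for (;;)
    {
        const snapshot_t seen = getllar(line);
        rdata_t next = seen.data;
        ++next[0];
        if (putllc(line, seen, next))
            return;
    }
}

// reservation_op-style: pay for the heavy step once and run the whole
// update inside it, so there is nothing left to retry.
template <typename F>
void reservation_op(line_t& line, F&& func)
{
    std::lock_guard<std::mutex> lock(line.heavy);
    func(line.data);
    ++line.stamp;
}
```

The point of the second form is that the caller passes the whole update as a function, which is only possible when the emulator itself owns the loop (as with HLE code), not when the guest program executes GETLLAR and PUTLLC separately.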
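The range lock item above mentions allocating locks by locking a bit in a single 64-bit atomic variable. A minimal sketch of that kind of allocator might look like this; the class name slot_pool and its interface are assumptions made for illustration, not the emulator's actual code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical slot allocator: bit i set in m_bits means slot i is taken.
class slot_pool
{
    std::atomic<std::uint64_t> m_bits{0};

public:
    // Returns a slot index in [0, 63], or -1 if all 64 slots are in use.
    int acquire()
    {
        std::uint64_t old = m_bits.load();

        for (;;)
        {
            if (old == ~std::uint64_t{0})
                return -1;

            // Find the lowest clear bit; one exists because old is not all ones.
            int slot = 0;
            while (old & (std::uint64_t{1} << slot))
                ++slot;

            // Try to claim it; on failure `old` is reloaded and we retry.
            if (m_bits.compare_exchange_weak(old, old | (std::uint64_t{1} << slot)))
                return slot;
        }
    }

    // Gives the slot back, for example at the end of a thread's life.
    void release(int slot)
    {
        assert(slot >= 0 && slot < 64);
        m_bits.fetch_and(~(std::uint64_t{1} << slot));
    }
};
```

A thread would call acquire() once when it first needs a lock and release() when it ends, so the compare-and-swap loop is off the hot path.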
It’s been a long while since my last post and I once again apologize for the delay. I’ve been very busy with both the project and personal issues and I hope you all can understand. Aside from personal issues, I’ve been working very hard on particular functionality for the emulator, though I can’t say my progress has been without faults, I will resume my plans and gain new ideas.
I want to mention that during this month I decided to once again share my September donation with one of our staff, our web developer. DAGINATSUKO has been contributing to RPCS3 since January of 2017. She has done amazing work on the website with countless updates and usability improvements. This website is used to house and promote the progress of RPCS3, teach new users how to use the emulator, report compatibility and to learn a little about our developers. It acts as the very front-end for all the work the team has poured into the project. Without the newly designed website and her updates over the years, I’m not sure how popular or reputable RPCS3 would be.
We’re now nearing the end of October and I’ve been working on a few features for the emulator — such as upgrade to LLVM 11, optimising TSX and especially TSX-FA, fixing bugs...
For my next update, I will do my best to target the end of November and improve performance and stability, but also implement my old plans with PPU and SPU LLVM recompilers!
Thank you all for your support.
Comments
Also the incoming Zen 3 CPUs will increase the amount of AMD configs. It's investment in the future. Intel will return with competition likely in 2022.
2020-11-05 09:26:32 +0000 UTC
Keep up a great work you guys!
Nikola Sekulic
2020-11-01 17:36:33 +0000 UTC
I have plans for it too. But TSX is not discontinued, they add 2 new TSX instructions for transactions in future generation of their processors. So things may change.
Nekotekina
2020-11-01 13:51:07 +0000 UTC
I'm always amazed by the amount of work that goes into these updates! I love it and it's absolutely mind-boggling.
Jibril Ikharo
2020-11-01 13:38:44 +0000 UTC
That’s a lot of improvements, thanks for the hard work. Although as a person who doesn’t own a CPU with TSX, I do wish more effort is put into improving performance for non-TSX CPUs, which I think makes sense since TSX is discontinued in latest intel CPUs, and I can imagine Ryzen CPUs will only get more popular, these CPUs don’t have TSX.
dachao li
2020-11-01 07:07:56 +0000 UTC