XaiJu
dobiestation


Shadow of the VUlossus: Fixing unfixable PS2 games

The Vector Units (VUs) are the soul of the PS2. They are SIMD processors embedded in the Emotion Engine (EE), responsible for quickly performing vector and matrix operations. The two units, VU0 and VU1, have different responsibilities. VU0 is available as a coprocessor and has more limited memory, so it is intended for physics, AI, and other dynamic calculations. VU1 is more independent and can feed graphical data directly to the GPU, so it is intended for processing background geometry. Because the main CPU is rather weak, games must offload as much work onto the VUs as possible in order to fully utilize the PS2's capabilities. As a direct result, games exploit every VU quirk they can to squeeze out performance.

PCSX2 and DobieStation do a pretty good job of handling VU logic and all of its edge cases, although PCSX2 is far more optimized. They differ in one crucial aspect, however: timings. Dobie isn't perfect, but it makes some attempt to handle not only internal VU timings, but also synchronizing the VUs with other components. By contrast, PCSX2's handling of VU timings is easily one of its weakest points in accuracy. Let's see just how much games care about that...

Instant Transmission

To start using a VU, a game must upload a microprogram to VU memory and manually start the VU. Although there are various methods to do this, the end result is the same: the VU runs concurrently with the EE.

For reasons unknown, when a game starts a microprogram, PCSX2 gives the VU a head start of several thousand cycles! Usually this means that, from the perspective of the game, the VU finishes execution instantly. This is fine, since the vast majority of games only ever use VU1, which, as mentioned before, is generally independent. Most games don't care about VU1 timings: once they've started executing VU1, they're either going to prepare a new batch of VU data, execute some game logic, or wait for VU1 to end. The ultimate result is usually just higher internal FPS for games that are bottlenecked on the VUs.
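To make the head start concrete, here is a toy model in Python. It is only a sketch: the function name, the cycle counts, and the one-VU-cycle-per-EE-cycle assumption are all made up for illustration and are not taken from PCSX2 or Dobie.

```python
def vu_busy_when_polled(program_cycles, ee_poll_cycle, head_start=0):
    """Return True if the EE still sees the VU running when it polls.

    program_cycles: length of the microprogram in VU cycles (hypothetical).
    ee_poll_cycle:  EE cycles elapsed since the game started the VU.
    head_start:     cycles the emulator runs the VU ahead before the EE
                    continues (0 models a cycle-accurate emulator).
    Assumes one VU cycle per EE cycle to keep the arithmetic simple.
    """
    vu_cycles_done = head_start + ee_poll_cycle
    return vu_cycles_done < program_cycles


# A 2000-cycle microprogram, polled 100 EE cycles after it was started.
print(vu_busy_when_polled(2000, 100))                   # True
print(vu_busy_when_polled(2000, 100, head_start=4000))  # False
```

With a head start larger than the microprogram, the game never observes a busy VU, which is exactly the "instant" behavior described above.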

Things get dicier when VU0 is involved. Relatively few games in the library use VU0, but those that do care a lot more about timings. This is because the EE can treat VU0 as a coprocessor: that is to say, it can execute vector instructions on COP2, which shares registers with VU0. Games make the most of this arrangement by starting a VU0 microprogram and, for instance, loading new data into VU registers using COP2 while the program is still executing! If the microprogram runs too fast, it reads the registers before the new data has arrived, causing junk polygons to be output in the best case and a crash in the worst case. PCSX2 has workarounds for these common cases, which brings us to the next section...
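The race can be sketched in a few lines of Python. Everything here is hypothetical: the function, the cycle numbers, and the simplification that the VU and EE tick at the same rate. It only illustrates why a head start turns a safe COP2 write into a stale read.

```python
def value_seen_by_micro(read_cycle, ee_write_cycle, head_start,
                        old="stale", new="fresh"):
    """Which value does the microprogram read from a shared register?

    read_cycle:     VU cycle at which the microprogram reads the register.
    ee_write_cycle: EE cycle (counted from the start request) at which the
                    game writes the register through COP2.
    head_start:     VU cycles executed before the EE gets to run at all.
    """
    # EE time at which the VU reaches its read. With a large head start
    # the read has already happened before the EE executes anything.
    ee_time_of_read = max(0, read_cycle - head_start)
    return new if ee_write_cycle < ee_time_of_read else old


# The game writes 50 cycles after starting a micro that reads at cycle 500.
print(value_seen_by_micro(500, 50, head_start=0))     # fresh
print(value_seen_by_micro(500, 50, head_start=4000))  # stale
```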

Whack-a-Mole

It seems that PCSX2 originally ran VU microprograms instantly, rather than "merely" giving them a head start. Properly handling timings back in the early 2000s would only have made PCSX2 even slower, and the goal was to at least get games working at full speed. This breaks a game like Ratchet and Clank, however, whose VU0 microprograms communicate with the EE: a microprogram run without any limit spins in a loop waiting for a response that never comes, because the EE never gets a chance to send it. Thus, a limit had to be placed on how long a microprogram could run for, so that R&C wouldn't deadlock.
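Here is a minimal Python sketch of why the limit matters. The scheduler, the slice size, and the cycle budget are invented for illustration; the real microprogram and PCSX2's actual limit are not modeled.

```python
def run_until_reply(slice_cycles, budget=10_000):
    """Simulate a VU0 microprogram busy-waiting on a reply from the EE.

    slice_cycles: how many cycles VU0 may run before the EE gets a turn;
                  None means "run until the microprogram ends".
    budget:       total cycles simulated before giving up (hypothetical).
    Returns the cycle at which the microprogram saw the reply, or None if
    it was still spinning when the budget ran out.
    """
    flag = False   # set by the EE, polled by the microprogram
    elapsed = 0
    while elapsed < budget:
        remaining = budget - elapsed
        limit = remaining if slice_cycles is None else min(slice_cycles, remaining)
        for _ in range(limit):
            elapsed += 1
            if flag:
                return elapsed
        # Only now does the EE get to run and answer the microprogram.
        flag = True
    return None


print(run_until_reply(512))   # 513
print(run_until_reply(None))  # None: the EE never got a turn
```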

The picture I painted in the previous section isn't fully accurate either: before starting a VU microprogram, PCSX2 actually waits for a little while. Baldur's Gate: Dark Alliance starts a VU1 microprogram without having prepared all the necessary data. It later sends an UNPACK command to fill VU1 memory while the microprogram is still running, so if you run the program instantly, the game fails to boot. In light of this, the microprogram is held off until an UNPACK or other command is sent... but this breaks the game Boogie. Boogie has a bug where it overwrites data a microprogram is currently working on with junk, and if the program can't finish on time, the display list it sends is corrupted. Since the program doesn't even start until after the corruption has happened, it is doomed to not work on PCSX2 without a patch.

PCSX2 has a gamefix called the XGKICK hack, another workaround for bad VU timings. The XGKICK instruction on VU1 sends a display list to the GPU, and this transfer is executed concurrently with the microprogram. Of course, games take advantage of this in one of two ways: either they send an XGKICK with incomplete data and fill in the missing data later, or they overwrite data they know has already been sent with new data. PCSX2 transfers all of the data in XGKICK instantly after a set delay to handle most cases, but some games break PCSX2's assumptions, so the XGKICK hack modifies this delay. Yet some games don't work with or without the hack: the NTSC version of WRC has graphical glitches that aren't fully fixed even with it.
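The two usage patterns pull in opposite directions, which a small Python model can show. The transfer rate of one word per cycle, the delay values, and both function names are assumptions made up for this sketch; they do not describe real hardware rates or PCSX2's actual delay.

```python
def concurrent_transfer_sees(word_index, overwrite_cycle, words_per_cycle=1):
    """Transfer runs alongside the microprogram: word i leaves VU memory
    at cycle i // words_per_cycle (assumed rate, for illustration only)."""
    transfer_cycle = word_index // words_per_cycle
    return "new" if overwrite_cycle <= transfer_cycle else "old"


def delayed_instant_transfer_sees(overwrite_cycle, delay):
    """Whole display list is copied at once, `delay` cycles after XGKICK."""
    return "new" if overwrite_cycle <= delay else "old"


# Pattern 1: word 40 is filled in late, at cycle 20. The GPU must see "new".
print(concurrent_transfer_sees(40, 20))            # new
print(delayed_instant_transfer_sees(20, delay=6))  # old (glitch)

# Pattern 2: word 2 is reused at cycle 10, after it was sent. Must be "old".
print(concurrent_transfer_sees(2, 10))              # old
print(delayed_instant_transfer_sees(10, delay=32))  # new (glitch)
```

In this toy model a short delay breaks the first pattern and a long delay breaks the second, so no single delay value satisfies both.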

Then we get into the really nasty edge cases. PCSX2 is incapable of handling three games I call the "M-bit Trio", consisting of Totally Spies Totally Party, My Street, and Mike Tyson's Heavyweight Boxing. When a VU0 micro is running, most COP2 instructions will stall the EE until execution has ended, but a few instructions can be configured to stall or not stall. If these instructions are set to stall, the EE can also be unstalled when a VU0 microprogram instruction has the M-bit set. The M-bit Trio relies heavily on the M-bit for EE<->VU0 communication many times during microprogram execution, and without perfect synchronization, these games will have glitched graphics. PCSX2 has no way of coping with these games because VU0 will always have that head start. Marvel Nemesis: Rise of the Imperfects expects cycle-accurate synchronization between VU0 and VU1, something even Dobie struggles with, let alone PCSX2. 24: The Game has a VU0 microprogram that contains a busy loop. PCSX2 wastes so much time running VU0 that this game can't reach full speed on any hardware.
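The M-bit handshake can be approximated with a toy Python model. This is a sketch under loud assumptions: one instruction per cycle, a microprogram represented only as a list of M-bit flags, and a hypothetical function name. It does not model real COP2 instruction encodings or exact stall timing.

```python
def ee_resume_point(m_bits, vu_pc, interlock=True):
    """Return the microprogram index at which a stalled EE resumes.

    m_bits:    one bool per microprogram instruction, True if the M-bit is set.
    vu_pc:     instruction VU0 is about to execute when the EE issues its
               COP2 instruction.
    interlock: whether that COP2 instruction is configured to stall.
    """
    if not interlock:
        return vu_pc                  # no stall at all
    for pc in range(vu_pc, len(m_bits)):
        if m_bits[pc]:
            return pc                 # M-bit releases the EE mid-program
    return len(m_bits)                # otherwise wait for the program to end


program = [False] * 10
program[6] = True                     # hypothetical sync point

print(ee_resume_point(program, vu_pc=2))                   # 6
print(ee_resume_point(program, vu_pc=2, interlock=False))  # 2
print(ee_resume_point(program, vu_pc=10))                  # 10
```

The last call models a VU0 that has already run to completion thanks to a head start: the EE never meets the microprogram at its sync point, so the handshake the game depends on is lost.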

These are just the games that I personally know have VU timing issues. Who knows how many SPS glitches or flickering/missing graphics are caused by bad timings? We'll never know for sure.

A Profound Discovery

Dobie makes some attempt to handle VU synchronization. It's led to promising results, such as the M-bit Trio working without glitches. However, we were running into inexplicable bugs, such as Ratchet and Clank's camera being wildly broken, and many games having much lower internal FPS than normal. That is, until we discovered that we had been running the VUs at half speed this entire time!

The EE runs at approximately 300 MHz, whereas most of the other hardware components (GPU, DMA, bus) run at half speed, 150 MHz. Sony's official manuals never state how fast the VUs run, and there's a lot of conflicting information on the web, some stating 300 MHz and some stating 150 MHz. PCSX2 uses 150 MHz (for cases where the head start isn't enough to finish the micro), so Dobie also used 150 MHz. However, hardware tests from water111 indicated this was wrong: the VUs actually run at the same speed as the EE. Implementing this not only made all those glitches go away, it also simplified our synchronization logic, and it even fixed other problems, such as the ones in WRC! This goes to show how important hardware tests are for any emulator seeking accuracy.
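The effect of the clock choice is simple arithmetic, sketched below in Python. The 300 MHz and 150 MHz figures are the approximate values quoted above; the function names and the 2000-cycle microprogram are hypothetical.

```python
EE_HZ = 300_000_000  # approximate EE clock, as quoted above


def vu_cycles_for(ee_cycles, vu_hz):
    """VU cycles that elapse while the EE runs `ee_cycles` cycles."""
    return ee_cycles * vu_hz // EE_HZ


def ee_cycles_for_micro(vu_cycles, vu_hz):
    """EE cycles the game waits for a microprogram of `vu_cycles` cycles."""
    return vu_cycles * EE_HZ // vu_hz


print(vu_cycles_for(1000, 150_000_000))        # 500
print(vu_cycles_for(1000, 300_000_000))        # 1000
print(ee_cycles_for_micro(2000, 150_000_000))  # 4000
print(ee_cycles_for_micro(2000, 300_000_000))  # 2000
```

At 150 MHz every microprogram appears to take twice as long from the EE's point of view, which is consistent with the lower internal FPS described above.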

To end off this article, here are some shiny screenshots in Dobie. Mind the texture filtering and FPS...

