XaiJu
dobiestation

patreon


Sins of the PS2: Ecco the Dolphin

Sometimes, bugs just have no reason to exist.

Welcome to Sins of the PS2, a series that exposes the horrific crimes against humanity that lurk within video games. The series is dedicated to problematic games that break on emulators and only work on real hardware by coincidence. Ecco the Dolphin: Defender of the Future is one of those games, suffering from an evil timing bug. Even with decent timings, Ecco will fail to get past the initial loading screen, which means that as of this moment, every major PS2 emulator needs a hack for this game to work.

What does it mean to have timing problems? How can a game be so sensitive to timing that it fails to load? Can this even be fixed?

Timing Theory

Timing is important to emulation. In a broad sense, timing refers to how long it takes an emulated component to finish a task, especially compared to other components. A well-designed game will be resilient to changes in timings: indeed, many PS2 games can be overclocked by as much as 300% on PCSX2 with no ill effect, and this may even be a bonus as some games can run at a higher framerate than they would on a real PS2. However, not all games are created equal, and many games have at least one race condition. This means that when two tasks or components run at the same time, the game expects one to finish before the other without actually enforcing that order. Some race conditions are relatively safe and may not be an issue even with bad timings, but other races are particularly sensitive to timings. Because so many games have race conditions, it is important for an accurate emulator to have accurate timings.
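To make the idea concrete, here is a toy model of a timing-dependent race. It is only a sketch: the function name and every cycle count are made up for illustration and do not come from any real game or emulator.

```python
# Toy model of a race condition: the "game" assumes task A finishes
# before task B, but never actually waits for A.
# All names and cycle counts are hypothetical.

def game_works(cycles_for_a: int, cycles_for_b: int) -> bool:
    """Both tasks start at cycle 0. The game only behaves correctly
    if A is already done by the time B finishes."""
    return cycles_for_a <= cycles_for_b

# Timings resembling real hardware: A wins the race, so the bug stays hidden.
print(game_works(cycles_for_a=1_000, cycles_for_b=1_500))  # True

# An emulator that makes B finish too early exposes the bug.
print(game_works(cycles_for_a=1_000, cycles_for_b=400))    # False
```

The game's code is identical in both calls. Only the relative timing changed, which is exactly why these bugs show up on emulators and not on the console.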

The more complex a system is, the harder it becomes to accurately emulate its timings, but at the same time, the software running on that system becomes less concerned with precise timings. An NES game might require strict timings down to individual CPU cycles because the NES was such a limited system, but it is much more difficult for a PS2 game to exploit timings on this level due to increased levels of abstraction. Every emulator for a sixth-gen console and above has to make concessions for timings, not only because accurate emulation at full speed is impossible, but also because games do not have the strict requirements they used to. This is why it's all the more painful when a modern game is sensitive to timings: there ends up being no easy way to fix it.

PCSX2 uses average timings: using data from tests run on real-world code, it assigns each instruction type an average cycle count that represents how long instructions of that type usually take to execute. Play! treats every instruction as a single cycle. Dobie keeps track of the CPU pipeline to properly handle most instructions, but it does not handle memory accesses and branches properly, as cache timings and branch prediction are nondeterministic. All these models have their own inaccuracies, and games will end up breaking or working in different ways. This implies that changing a timing model to fix one game might break multiple other games, which is why emulators use per-game hacks for sensitive games.
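The difference between the first two models can be sketched in a few lines. This is an illustration only: the instruction trace and the per-type costs below are invented, and they are not the values any of these emulators actually use.

```python
# Two simplified timing models applied to the same instruction trace.
# The trace and the cycle costs are hypothetical.
TRACE = ["alu", "load", "alu", "mul", "branch", "store", "alu", "div"]

def one_cycle_model(trace):
    """Every instruction costs exactly one cycle."""
    return len(trace)

# Assumed average cost per instruction type (made-up numbers).
AVERAGE_COST = {"alu": 1, "load": 2, "store": 2, "branch": 2, "mul": 4, "div": 37}

def average_model(trace):
    """Each instruction costs the average for its type."""
    return sum(AVERAGE_COST[op] for op in trace)

print(one_cycle_model(TRACE))  # 8
print(average_model(TRACE))    # 50
```

The same eight instructions take 8 cycles under one model and 50 under the other. If another component is scheduled to finish somewhere between those two numbers, the two models disagree on who wins the race.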

That's enough theory for now. Let's look at Ecco's own race condition.

Unnecessary Initialization

Back when this game was not well understood, the PCSX2 developers found out that Ecco would start working when SIF0, a DMA channel used by the Input/Output Processor (IOP) to send data to the EE, was slowed down by a factor of 24. Other comments describe the effects of messing with the timings of this DMA channel.

Slowing down SIF0 by 24x is wildly unrealistic, however, especially when other games will break hard, so I sought to find the true cause of the race condition.

While Ecco the Dolphin is booting, it loads custom IOP modules from the disc. These modules allow it to mix sound and stream data from the disc without the EE needing to send requests to the IOP to do so.

Above is a snippet of decompiled output of the game's initialization code on the IOP side. One of the IOP modules loaded by the game, called IOPMAIN, is responsible for the race condition.

To summarize this code, it initializes SIF RPC (the protocol used for the EE and IOP to communicate with each other) and the CDVD drive, and it registers a new RPC server that will listen to requests sent from the EE.

The first half of this code makes no sense. SIF would have had to be initialized already for the module to be loaded, and the CDVD drive would also have had to be initialized in order for the module to be loaded off the disc. The developers may not have fully understood the SDK and figured it would be best to reinitialize everything just to be safe. While the SIF initialization code is harmless, CDVD initialization in this case is deadly.

The IOP's clock rate is one-eighth of the EE's, which means the EE can do far more work than the IOP in the same amount of time. And while this thread on the IOP is executing, the EE is proceeding with initialization. The EE will send a request to load another module off the disc, and this is where the race condition begins. Can you spot it?
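One common way for an emulator to honor that 8:1 ratio is to run the EE for a slice of cycles and then advance the IOP by one-eighth of that, carrying any remainder into the next slice. The sketch below assumes that approach; the Scheduler class and its method names are hypothetical and not taken from any emulator.

```python
# Minimal sketch of keeping two clocks in an 8:1 ratio.
# Scheduler and run_slice are made-up names for illustration.
EE_CYCLES_PER_IOP_CYCLE = 8  # ratio stated in the text

class Scheduler:
    def __init__(self):
        self.ee_cycles = 0
        self.iop_cycles = 0
        self._leftover = 0  # EE cycles not yet converted into IOP cycles

    def run_slice(self, ee_cycles):
        """Advance the EE by ee_cycles and the IOP by the matching amount.
        Returns how many IOP cycles were run for this slice."""
        self.ee_cycles += ee_cycles
        self._leftover += ee_cycles
        iop_step, self._leftover = divmod(self._leftover, EE_CYCLES_PER_IOP_CYCLE)
        self.iop_cycles += iop_step
        return iop_step

sched = Scheduler()
print(sched.run_slice(20))  # 2 (4 EE cycles carried over)
print(sched.run_slice(20))  # 3 (the carried cycles complete another IOP cycle)
```

Carrying the remainder keeps the two clocks from drifting apart when the slice length is not a multiple of eight.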

If the EE sends the request too quickly, this low-priority thread will be interrupted. The IOP will send an asynchronous command to the CDVD drive to begin reading the module off the disc. Afterwards, the IOP returns to this thread... which will call sceCdInit, resetting the CDVD drive. This means that the disc read command is interrupted and doesn't complete! The game has no way of knowing whether the module was loaded properly, so it continues on its merry way and hits a roadblock when it tries to send a command to this module. It unceremoniously hangs as a result.
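The sequence of events can be modeled as a small timeline. This is a simplified sketch of the behavior described above, not the game's code: boot, its parameters, and all cycle counts are hypothetical, and the only thing taken from the game is that calling sceCdInit drops a read that is still in flight.

```python
# Simplified timeline of the race. All times are in IOP cycles and every
# number here is made up for illustration.

def boot(request_at, cdinit_at, read_duration=50_000, handler_cost=100):
    """request_at: when the EE's module-load request reaches the IOP.
    cdinit_at:  when the low-priority init thread would call sceCdInit
                if nothing interrupted it.
    Returns "loaded" if the module read completes, "hang" otherwise."""
    if request_at >= cdinit_at:
        # The init thread already reset the drive, so the read that
        # follows runs undisturbed.
        return "loaded"

    # The request preempts the init thread and starts an asynchronous read.
    read_ends_at = request_at + read_duration
    # The init thread resumes afterwards and reaches sceCdInit a bit later.
    reset_at = cdinit_at + handler_cost
    if reset_at < read_ends_at:
        # The drive is reset while the read is in flight: the module never
        # arrives, and the game hangs once it tries to talk to it.
        return "hang"
    return "loaded"

print(boot(request_at=2_000, cdinit_at=1_000))  # loaded
print(boot(request_at=500, cdinit_at=1_000))    # hang
```

Because a disc read takes far longer than the rest of the init thread, losing the race almost always means losing the read.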

Although it is the IOP module that is bugged, the race condition actually relies on the EE being slow enough for this thread to finish executing. On real hardware, the race condition would likely resolve due to the EE's small caches and steep memory latency. Initialization code would punish the cache hard, because it touches code and data that have yet to be accessed. This is just not possible to emulate quickly, and so no PS2 emulator has a timing model that takes this into account. Slowing down SIF0 also makes the game work because the EE then takes a lot longer to receive a reply, giving the IOP more time to do work before the EE sends its next request. This is a flagrant hack that breaks other games, however, so it should not be used as part of the timing model. The best thing that can be done on PCSX2 is patching the game.
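A bit of arithmetic shows why both cache misses and the SIF0 slowdown have the same effect: each one delays the EE's next request. The sketch below uses invented numbers throughout (instruction counts, miss counts, penalties, and reply latency are all assumptions); only the 8:1 clock ratio and the 24x factor come from the text.

```python
# How long the EE takes to send its next request under three scenarios.
# All inputs except the 8:1 ratio and the 24x factor are hypothetical.
EE_CYCLES_PER_IOP_CYCLE = 8

def request_time_ee(instructions, cache_misses, miss_penalty,
                    sif0_reply_cycles, sif0_slowdown=1):
    """EE cycles that pass before the EE sends its next request."""
    return (instructions
            + cache_misses * miss_penalty
            + sif0_reply_cycles * sif0_slowdown)

def race_resolves(request_ee_cycles, iop_thread_cycles):
    """True if the IOP init thread finishes before the request arrives."""
    return request_ee_cycles >= iop_thread_cycles * EE_CYCLES_PER_IOP_CYCLE

IOP_THREAD_CYCLES = 5_000  # assumed length of the init thread

# Hardware-like: cold caches make the EE's init code slow.
hardware = request_time_ee(10_000, cache_misses=800, miss_penalty=40,
                           sif0_reply_cycles=1_500)
# Typical emulator: no cache model, so misses cost nothing.
emulator = request_time_ee(10_000, cache_misses=800, miss_penalty=0,
                           sif0_reply_cycles=1_500)
# Same emulator with SIF0 slowed down by a factor of 24.
hacked = request_time_ee(10_000, cache_misses=800, miss_penalty=0,
                         sif0_reply_cycles=1_500, sif0_slowdown=24)

print(race_resolves(hardware, IOP_THREAD_CYCLES))  # True
print(race_resolves(emulator, IOP_THREAD_CYCLES))  # False
print(race_resolves(hacked, IOP_THREAD_CYCLES))    # True
```

The hack gets the right answer for the wrong reason: it pads the EE's timeline in one place to make up for time that is missing somewhere else.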

Closing

As far as race conditions go, Ecco's bug is truly ridiculous. Games often have race conditions because they know they can do some work before a task completes; in other words, the race condition is done intentionally to optimize. Other race conditions are unintentional, but they can easily appear in large codebases due to complicated multithreaded libraries. Prince of Persia: The Two Thrones has another CDVD race condition that could have been prevented with more robust code in Sony's SDK, for instance. However, Ecco's race condition is caused by useless code that could be completely removed with no impact on the program. It is impossible to say with full certainty whether this was intentional, but given that the Dreamcast version has no problems on emulators, I find it difficult to believe this is anything but a bug. This is why Ecco the Dolphin deserves a spot on the Sins of the PS2 list.

