1. The problem

Here is a proposal for a software workaround to speculative
execution on non-coherent systems such as the I2 R10k and the O2 R10k.
The R10000 processor can (and will) execute instructions ahead,
speculatively. These instructions are cancelled if they turn out not to
be on the executed path, e.g. after a mispredicted branch. If a load or
store instruction is executed speculatively, and the accessed memory is
not in the cache, the cache line will be fetched from main memory and,
on a store, be marked dirty. These speculative loads and stores can
target almost any address, since a speculative load/store that will be
cancelled afterwards may use stale values left in registers.
The problem is:
- on a speculative load, the fetched cache line will remain in the
cache even if the speculative load is cancelled
- on a speculative store, the *dirty* cache line will remain in
the cache even if the speculative store is cancelled
On non-coherent systems we need to flush the cache lines to main
memory before doing DMA to a device, so that the device can see the
data. We also need to invalidate the lines before reading from a DMA'd
buffer, to make sure the CPU reads main memory and not the cache.
However, if a speculative load or store happens during the DMA
transfer, the cache line will be fetched from memory and, on a store,
be marked dirty. That cache line could later be evicted to make room
for another line; if it is dirty it will be written back to memory,
overwriting the data a device may have put in the DMA buffer.
Something we really don't want to happen ;)
2. Proposed solution
A speculative load or store will not happen under the following
conditions:
- the access is to uncached memory
- the speculated instruction causes an exception: this also
means a speculative load/store cannot happen in a mapped memory
region which has no TLB entry for it.
This second point means that any mapped space can be made safe by
removing the DMA'd buffer's address translations from the TLB, or by
marking them 'uncached' during the DMA transfer.
The remaining unmapped address spaces are:
- kseg1, which is safe since it is uncached
- kseg0, which can be turned uncached with the K0 bits
of the CP0 Config register
- xkphys, which will cause an address error if the KX bit is
not set, thus aborting the speculative load/store before it can do harm ;)
Since we need to turn KX off, xkseg will not be accessible
either; and since we need KSEG0 uncached, we need to remap the
kernel elsewhere if we want performance ;). We could use the xsseg
segment, available in Supervisor mode, which is mapped (hence safe) and
moreover gives access to all of memory (on the O2 this can be up to 2 GB
I think, whereas in 32-bit mode only 512 MB would be accessible). So the
proposed workaround is to permanently map the lower 16 MB of memory in
xsseg using a wired TLB entry with a page size of 16 MB. This memory
would not be usable for DMA. Everything else would, so we could for
example reserve the upper 16 MB for DMA (and give them to the DMA zoned
memory allocator). On an exception or error, the handler (in KSEG0)
would set CU0 to allow access to CP0, then switch to Supervisor mode,
jump to the equivalent xsseg location and continue execution in
Supervisor mode. The code returning to userland would need to clear the
CU0 bit, to prevent user access to CP0.
Before a DMA transfer, the DMA buffer's cache lines would be
flushed, and the buffer would then be remapped 'uncached', preventing
any speculative load or store to this memory during the
transfer. After the DMA transfer, the cache would be invalidated to make
sure main memory is read, and the DMA buffer would be remapped 'cacheable'.
A diagram is attached to illustrate the workaround. Comments,
suggestions (and even flames) are welcome before anyone starts coding
the workaround ;)