RISC-V LR/SC Translation #75

Open
wants to merge 3 commits into
base: riscv

@mastercaution mastercaution commented Mar 24, 2022

In contrast to ARM, an LR/SC sequence (the code between LR and SC) is very limited on RISC-V platforms. The sequence may span at most 16 instructions, and only a subset of the base "I" and "C" instruction sets is permitted. Since additional loads and stores are also excluded, instrumenting an instruction inside the sequence will most likely turn it into an "unconstrained LR/SC loop", which causes the trailing SC to always fail on our test device. The ISA only guarantees that "constrained LR/SC loops" succeed eventually.

The way unconstrained LR/SC loops are handled is considered a hardware implementation detail. On a SiFive U54, unconstrained LR/SC loops will never succeed, resulting in deadlocks in some cases.
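As a rough illustration of why instrumentation breaks these loops, the following C sketch checks whether the instructions between LR and SC could still belong to a constrained loop. It is not part of this PR: the function names are hypothetical, only 32-bit encodings are inspected (compressed instructions and backward branches are not modeled), and the sketch assumes that LR, SC and the retry branch account for 3 of the 16 instructions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LRSC_MAX_LOOP_INSNS 16 /* limit for the whole loop */
#define LRSC_LOOP_OVERHEAD   3 /* assumption: LR + SC + retry branch */

/* Major opcodes (bits 6:0) that must not appear between LR and SC. */
static bool opcode_forbidden(uint32_t insn) {
    switch (insn & 0x7f) {
    case 0x03: /* LOAD */
    case 0x23: /* STORE */
    case 0x07: /* LOAD-FP (not part of base "I") */
    case 0x27: /* STORE-FP (not part of base "I") */
    case 0x2f: /* AMO, which also covers a nested LR/SC */
    case 0x67: /* JALR */
    case 0x0f: /* MISC-MEM (FENCE) */
    case 0x73: /* SYSTEM */
        return true;
    default:
        return false;
    }
}

/* body: the 32-bit instructions strictly between LR and SC.
 * Returns false as soon as the body is too long or contains an
 * instruction class that is excluded from constrained loops. */
static bool lrsc_body_may_be_constrained(const uint32_t *body, size_t n) {
    if (n + LRSC_LOOP_OVERHEAD > LRSC_MAX_LOOP_INSNS)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (opcode_forbidden(body[i]))
            return false;
    }
    return true;
}
```

For example, a body consisting only of `addi t0, t0, 1` (0x00128293) passes, while adding a single instrumentation load such as `ld t1, 0(t2)` (0x0003b303) makes the check fail.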

The approach to fix this issue is to translate the LR/SC sequence into a mixture of a software-emulated and a hardware atomic sequence. The following figure hopefully gives you an idea of how it works:
[Figure: Screenshot from 2022-03-24 17-17-54]

The actual implementation stores the value of register x in the dbm_thread structure and uses only one temporary scratch register. The ordering flags aq and rl are not considered in the software-emulated part (where LR is replaced by LD), which may lead to side effects, although we did not encounter any.
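To make the idea more concrete, here is a minimal C sketch of a value-comparing emulation, assuming the translated SC re-checks the value that was remembered at LR time. It is not the code from this PR: `lrsc_shadow`, `emu_lr`, `emu_sc` and `fetch_add_demo` are hypothetical names, and the C11 `atomic_compare_exchange_strong` call merely stands in for the short hardware LR/SC sequence that a translation would emit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread shadow state. The PR keeps the saved value
 * in dbm_thread; the layout and field name here are made up. */
typedef struct {
    uint64_t lr_value; /* value observed by the emulated LR */
} lrsc_shadow;

/* Emulated LR: a plain load that remembers what it saw.
 * As in the PR description, aq/rl ordering is not modeled. */
static uint64_t emu_lr(lrsc_shadow *s, _Atomic uint64_t *addr) {
    assert(s != NULL && addr != NULL);
    s->lr_value = atomic_load_explicit(addr, memory_order_relaxed);
    return s->lr_value;
}

/* Emulated SC: store new_value only if memory still holds the value
 * remembered by emu_lr. Returns 0 on success and 1 on failure,
 * mirroring the result convention of SC. */
static int emu_sc(lrsc_shadow *s, _Atomic uint64_t *addr, uint64_t new_value) {
    assert(s != NULL && addr != NULL);
    uint64_t expected = s->lr_value;
    return atomic_compare_exchange_strong(addr, &expected, new_value) ? 0 : 1;
}

/* A typical guest loop (atomic add) expressed with the emulated pair.
 * Returns the value that was in memory before the addition. */
static uint64_t fetch_add_demo(lrsc_shadow *s, _Atomic uint64_t *counter,
                               uint64_t inc) {
    uint64_t old;
    do {
        old = emu_lr(s, counter);
    } while (emu_sc(s, counter, old + inc) != 0);
    return old;
}
```

Because this sketch compares values instead of tracking a reservation, it cannot detect a concurrent write that restores the old value (the ABA case), so it is weaker than a real LR/SC pair.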

Benchmarks

In terms of performance, the implementation seems to have no negative effect on real-world applications. In all 4 applications, LR/SC sequences were executed 40-60 times per run.

| Benchmark | ref | dbm | dbm + Atomic Translation |
| --- | --- | --- | --- |
| Primes (execution time) | 1 | 1.04 | 1.04 |
| GCC (execution time) | 1 | 1.04 | 1.04 |
| SHA1 (execution time) | 1 | 12.70 | 12.76 |
| CoreMark (score) | 1 | 13.52 | 13.52 |

Fix unhandled ELF vector types on Linux kernel 5.12+ with glibc 2.34+.
In contrast to ARM, an LR/SC sequence (code between LR and SC) is very
limited on RISC-V platforms. A maximum number of 16 instructions and
only a part of the base "I" and "C" instruction set is permitted. Since
additional loads and stores are also excluded, instrumenting an
instruction inside the sequence will most likely turn it into an
"unconstrained LR/SC loop" resulting in the trailing SC to always fail
on our test device. The ISA only guarantees that "constrained LR/SC
loops" succeed eventually.

An LR/SC loop may span 2 or more basic blocks, which makes the
translation a little complex. For now, one scratch register is used to
save the original value read by LR, and the loop is translated into a mix
of software-emulated and hardware atomic sequences. The scratch register
is hardcoded to x31 (t6), which could interfere with a function that uses
x31 and contains this translation, but it seems to work for most
programs (luckily).
Changes the translation of atomic sequences with lr/sc to use a shadow
register in memory (in `dbm_thread` struct) instead of a hard-coded
CPU register.
@mastercaution
Author

An issue with the dbm reference benchmark led to far better scores in SHA1. I corrected the values in both PRs.

@mastercaution mastercaution marked this pull request as ready for review March 25, 2022 14:12