Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JIT] X64 - Extend emitter peephole optimization of eliminating unnecessary mov instructions #79381

Merged
merged 88 commits into from
Feb 25, 2023

Conversation

TIHan
Copy link
Contributor

@TIHan TIHan commented Dec 8, 2022

Description

Resolves #10315 based on the example code given in the first post.

This will not eliminate all possible unnecessary mov instructions, but it handles more now.

The JIT's emitter already had an existing way of removing the instructions using:

bool emitter::AreUpper32BitsZero(regNumber reg)

But, it was only able to look back at one instruction.

This PR extends AreUpper32BitsZero to allow looking back up to 256 instructions(max limit of instructions for an IG).

Example diffs:

@@ -28,13 +28,9 @@ G_M17551_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000004 {rdx}, byref
        movzx    r8, byte  ptr [rcx+01H]
        movzx    r9, byte  ptr [rcx+02H]
        movzx    rcx, byte  ptr [rcx+03H]
-       mov      eax, eax
        movsx    rax, byte  ptr [rdx+rax]
-       mov      r8d, r8d
        movsx    r8, byte  ptr [rdx+r8]
-       mov      r9d, r9d
        movsx    r9, byte  ptr [rdx+r9]
-       mov      ecx, ecx
        movsx    rdx, byte  ptr [rdx+rcx]
        ; byrRegs -[rdx]
        shl      eax, 18
@@ -43,12 +39,12 @@ G_M17551_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000004 {rdx}, byref
        or       eax, edx
        or       r8d, r9d
        or       eax, r8d
-						;; size=66 bbWeight=1    PerfScore 27.25
+						;; size=56 bbWeight=1    PerfScore 26.25
 G_M17551_IG03:        ; , epilog, nogc, extend
        ret      
 						;; size=1 bbWeight=1    PerfScore 1.00
 
-; Total bytes of code 67, prolog size 0, PerfScore 34.95, instruction count 19, allocated bytes for code 67 (MethodHash=ea74bb70) for method System.Buffers.Text.Base64:Decode(ulong,byref):int
+; Total bytes of code 57, prolog size 0, PerfScore 32.95, instruction count 15, allocated bytes for code 57 (MethodHash=ea74bb70) for method System.Buffers.Text.Base64:Decode(ulong,byref):int

Diffs from the issue's example:
image

Acceptance Criteria

  • Disasm test case
  • Comments

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 8, 2022
@ghost ghost assigned TIHan Dec 8, 2022
@ghost
Copy link

ghost commented Dec 8, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Resolves #10315 based on the example code given in the first post.

This will not eliminate all possible unnecessary mov instructions, but it handles a little more than before in common cases.

The JIT's emitter already had an existing way of removing the instructions using:

bool emitter::AreUpper32BitsZero(regNumber reg)

But, it was only able to look back at one instruction.

This PR extends AreUpper32BitsZero to allow looking back at any number of instructions, but only if it's safe to do so. It does not keep a list of instructions, it only keeps a single unsigned int for tracking if a register has its upper 32-bits set to zero:

    // IMPORTANT: Only contains information from **before** the last emitted instruction.
    // A lookup where each bit position corresponds to a register.
    // A bit that is set means the register's upper 32-bits are zero.
    // This effectively keeps track of which registers have their upper 32-bits set to zero.
    // GPRs (general-purpose registers) only.
    unsigned int upper32BitsZeroRegLookup;

Example diffs:

@@ -28,13 +28,9 @@ G_M17551_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000004 {rdx}, byref
        movzx    r8, byte  ptr [rcx+01H]
        movzx    r9, byte  ptr [rcx+02H]
        movzx    rcx, byte  ptr [rcx+03H]
-       mov      eax, eax
        movsx    rax, byte  ptr [rdx+rax]
-       mov      r8d, r8d
        movsx    r8, byte  ptr [rdx+r8]
-       mov      r9d, r9d
        movsx    r9, byte  ptr [rdx+r9]
-       mov      ecx, ecx
        movsx    rdx, byte  ptr [rdx+rcx]
        ; byrRegs -[rdx]
        shl      eax, 18
@@ -43,12 +39,12 @@ G_M17551_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000004 {rdx}, byref
        or       eax, edx
        or       r8d, r9d
        or       eax, r8d
-						;; size=66 bbWeight=1    PerfScore 27.25
+						;; size=56 bbWeight=1    PerfScore 26.25
 G_M17551_IG03:        ; , epilog, nogc, extend
        ret      
 						;; size=1 bbWeight=1    PerfScore 1.00
 
-; Total bytes of code 67, prolog size 0, PerfScore 34.95, instruction count 19, allocated bytes for code 67 (MethodHash=ea74bb70) for method System.Buffers.Text.Base64:Decode(ulong,byref):int
+; Total bytes of code 57, prolog size 0, PerfScore 32.95, instruction count 15, allocated bytes for code 57 (MethodHash=ea74bb70) for method System.Buffers.Text.Base64:Decode(ulong,byref):int

Diffs from the issue's example:
image

Diff Summary

Diffs are based on 1,398,805 contexts (351,415 MinOpts, 1,047,390 FullOpts).

MISSED contexts: base: 20, diff: 20

Overall (-1,124 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.windows.x64.checked.mch 25,002,887 -124
coreclr_tests.run.windows.x64.checked.mch 362,751,515 -176
libraries.crossgen2.windows.x64.checked.mch 35,246,384 -153
libraries.pmi.windows.x64.checked.mch 52,033,158 -367
libraries_tests.pmi.windows.x64.checked.mch 114,343,307 -304
FullOpts (-1,124 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.windows.x64.checked.mch 23,285,431 -124
coreclr_tests.run.windows.x64.checked.mch 96,229,781 -176
libraries.crossgen2.windows.x64.checked.mch 35,245,195 -153
libraries.pmi.windows.x64.checked.mch 50,532,678 -367
libraries_tests.pmi.windows.x64.checked.mch 107,461,046 -304
Details

Improvements/regressions per collection

Collection Contexts with diffs Improvements Regressions Improvements (bytes) Regressions (bytes)
benchmarks.run.windows.x64.checked.mch 33 30 0 -124 +0
coreclr_tests.run.windows.x64.checked.mch 49 49 0 -176 +0
libraries.crossgen2.windows.x64.checked.mch 36 36 0 -153 +0
libraries.pmi.windows.x64.checked.mch 118 116 1 -372 +5
libraries_tests.pmi.windows.x64.checked.mch 100 99 0 -304 +0
336 330 1 -1,129 +5

Context information

Collection Diffed contexts MinOpts FullOpts Missed, base Missed, diff
benchmarks.run.windows.x64.checked.mch 66,545 16,370 50,175 0 0
coreclr_tests.run.windows.x64.checked.mch 509,739 320,923 188,816 0 0
libraries.crossgen2.windows.x64.checked.mch 216,154 15 216,139 4 4
libraries.pmi.windows.x64.checked.mch 271,097 4,958 266,139 8 8
libraries_tests.pmi.windows.x64.checked.mch 335,270 9,149 326,121 8 8
1,398,805 351,415 1,047,390 20 20

jit-analyze output

Acceptance Criteria

  • Disasm test cases
  • Regression test cases
Author: TIHan
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

{
assert(emitHasLastIns() == (emitLastInsIG != nullptr));

return emitHasLastIns() && // there is an emitLastInstr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is for a separate PR, but I think we need to prevent peephole optimizations when we're in the prolog or epilog. I.e.,

 if (emitIGisInProlog(emitCurIG) || emitIGisInEpilog(emitCurIG))
    {
        return false;
    }
#ifdef FEATURE_EH_FUNCLETS
    if (emitIGisInFuncletProlog(emitCurIG) || emitIGisInFuncletEpilog(emitCurIG))
    {
        return false;
    }
#endif

There is too much special handling in the prolog/epilog (e.g., unwinding) to allow peeps to kick in. There may be very specific cases where they are ok, but that requires some careful thinking.

@TIHan
Copy link
Contributor Author

TIHan commented Feb 23, 2023

/azp run runtime-coreclr superpmi-diffs

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@TIHan
Copy link
Contributor Author

TIHan commented Feb 23, 2023

/azp run runtime-coreclr superpmi-replay

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@TIHan
Copy link
Contributor Author

TIHan commented Feb 23, 2023

/azp run runtime-coreclr gcstress0x3-gcstress0xc

@TIHan
Copy link
Contributor Author

TIHan commented Feb 23, 2023

/azp run jitstress

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@TIHan
Copy link
Contributor Author

TIHan commented Feb 23, 2023

/azp run runtime-coreclr jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@TIHan
Copy link
Contributor Author

TIHan commented Feb 24, 2023

@dotnet/jit-contrib @BruceForstall This is ready again. I ran gcstress and jistress, they passed CI. The current failures are unrelated.

@AndyAyersMS
Copy link
Member

Is there any way to avoid the TP impact in min opts?

@BruceForstall
Copy link
Member

Diffs

@BruceForstall
Copy link
Member

Is there any way to avoid the TP impact in min opts?

I believe it's a result of enabling backwards navigation in the insGroup/instrDesc (#80840), and maintaining those data structures. It's not obvious how this could only be done for non-MinOpts (and whether it would be advisable if it could be done).

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Jit can generate pointless movs
6 participants