Load independent and minimal syntax sets when using --language #1787

Enselic · 2021-08-08T14:47:55Z

Ok, so now the #951 work is starting to get interesting.

This PR improves startup time in the following scenario:

--language is used
A syntax without any external dependencies is used (see complete list below)

To keep small and digestible PRs, that is the only scenario that this PR improves startup time for. The plan is to improve startup time for all code paths in the upcoming PR.

The binary size only increases by ~400 kB as a result of this PR. That is because each SyntaxSet contains only one SyntaxDefinition, so it is not possible to optimize further (with current syntect data structures).

List of syntaxes that this PR (should) improve startup time for:

"ActionScript", "Apache Conf", "AppleScript", "ARM Assembly", "Assembly
(x86_64)", "Batch File", "BibTeX", "C", "C#", "Cabal", "CMakeCache",
"CoffeeScript", "Comma Separated Values", "CpuInfo", "CSS", "D", "Dart Analysis
Output", "Dart", "Diff", "Dockerfile", "DotENV", "Erlang", "F#", "Fortran (Fixed
Form)", "Fortran (Modern)", "Fortran Namelist", "Friendly Interactive Shell
(fish)", "fstab", "GLSL", "gnuplot", "Go", "GraphQL", "Groff/troff", "group",
"Haskell", "Highlight non-printables", "hosts", "INI", "Java Properties",
"Jinja2", "JSON", "jsonnet", "Kotlin", "LaTeX Log", "Lean", "Less", "Lisp",
"LLVM", "log", "Lua", "MATLAB", "MemInfo", "NAnt Build File", "Nim", "Ninja",
"Nix", "Pascal", "passwd", "Plain Text", "PowerShell", "Private Key", "Protocol
Buffer (TEXT)", "Puppet", "R", "Rego", "Regular Expression", "requirements.txt",
"resolv", "Robot Framework", "Rust", "Scala", "SML", "Solidity", "SQL",
"Strace", "Stylus", "Swift", "SystemVerilog", "Tcl", "Terraform", "TeX", "TOML",
"TypeScript", "TypeScriptReact", "varlink" , "Verilog", "VimL", "Vyper", "XML",
"YAML", "Zig"

Some example benchmarks:

hyperfine -L bat bat-pr,bat '{bat} -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat-pr -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt`	22.3 ± 1.3	20.2	27.2	1.00
`bat -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt`	108.3 ± 1.9	104.4	111.5	4.86 ± 0.30

hyperfine -L bat bat-pr,bat '{bat} -f --language c tests/benchmarks/test-src/miniz.c'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat-pr -f --language c tests/benchmarks/test-src/miniz.c`	89.8 ± 2.3	85.9	96.4	1.00
`bat -f --language c tests/benchmarks/test-src/miniz.c`	171.8 ± 1.2	170.0	174.2	1.91 ± 0.05

hyperfine --export-markdown /dev/tty -L bat bat-pr,bat '{bat} -f --language rust examples/simple.rs'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat-pr -f --language rust examples/simple.rs`	22.1 ± 1.2	19.9	25.7	1.00
`bat -f --language rust examples/simple.rs`	107.6 ± 2.5	103.5	111.3	4.88 ± 0.28

NOTE: I am not updating syntaxes.bin, so e.g. the HTTP syntax is missing from that file, but present in independent_syntax_sets.bin, which can be a bit confusing unless you know about it.

NOTE: I am reserving the right to change the format of the new binary files in incompatible ways, until we have made a release. After we have made a release, we should try to be backwards compatible, of course

I have not yet thoroughly verified this change for edge cases etc, but all regression tests pass, and I'm pretty happy with the code, so this is is certainly ready for a real code review round.

sharkdp · 2021-08-15T11:45:26Z

I took a first look at the code. It looks great - thank you! I have a question before I get into a more detailed review: Why did you decide to perform the "offset and size" handling yourself, instead of trying to shift that work to serde?

I haven't looked into the details, but I could imagine that we could ask serde to serialize a HashMap<String, SerializedSyntaxSet> instead of serializing the concatenated SerializedSyntaxSets and a lookup HashMap<String, OffsetAndSize> in addition. Even if there would be "two layers" of serialization, that could maybe still work out?

Enselic · 2021-08-15T13:41:29Z

Interesting idea! I would be fine to explore that method when I get time before you do a detailed review. It seems like it should be quick to deserialize raw data, even if that ends up being measured in megs (which would be the case when independent_syntax_sets.bin contains SyntaxSets for all syntaxes). The only way to know is to try. If it only adds say 4ms (on my machine) it would perhaps be worth the simplicity.

So let me try this out before we decide what method to use, and before detailed code review 👍

sharkdp · 2021-08-15T17:10:25Z

It seems like it should be quick to deserialize raw data, even if that ends up being measured in megs (which would be the case when independent_syntax_sets.bin contains SyntaxSets for all syntaxes). The only way to know is to try. If it only adds say 4ms (on my machine) it would perhaps be worth the simplicity.

Hm, I wouldn't sacrifice this much, performance-wise. I was kind of hoping that [u8] (de)serialization would be zero-cost(?).

Enselic · 2021-08-16T04:38:20Z

Hmm yes serde does indeed support zero-copy deserialization: https://serde.rs/lifetimes.html

If I could get that to work that would simplify the code a lot

Definitely worth exploring more, so I'll do that

This significantly speeds up the startup time of bat, since only a single linked SyntaxDefinition is loaded for each file. The size increase of the binary is just ~400 kB. In order for startup time to be improved, the --language arg must be used, and it must match one of the following names: "Plain Text", "ActionScript", "AppleScript", "Batch File", "NAnt Build File", "C#", "C", "CSS", "D", "Diff", "Erlang", "Go", "Haskell", "JSON", "Java Properties", "BibTeX", "LaTeX Log", "TeX", "Lisp", "Lua", "MATLAB", "Pascal", "R", "Regular Expression", "Rust", "SQL", "Scala", "Tcl", "XML", "YAML", "Apache Conf", "ARM Assembly", "Assembly (x86_64)", "CMakeCache", "Comma Separated Values", "Cabal", "CoffeeScript", "CpuInfo", "Dart Analysis Output", "Dart", "Dockerfile", "DotENV", "F#", "Friendly Interactive Shell (fish)", "Fortran (Fixed Form)", "Fortran (Modern)", "Fortran Namelist", "fstab", "GLSL", "GraphQL", "Groff/troff", "group", "hosts", "INI", "Jinja2", "jsonnet", "Kotlin", "Less", "LLVM", "Lean", "MemInfo", "Nim", "Ninja", "Nix", "passwd", "PowerShell", "Protocol Buffer (TEXT)", "Puppet", "Rego", "resolv", "Robot Framework", "SML", "Strace", "Stylus", "Solidity", "Vyper", "Swift", "SystemVerilog", "TOML", "Terraform", "TypeScript", "TypeScriptReact", "Verilog", "VimL", "Zig", "gnuplot", "log", "requirements.txt", "Highlight non-printables", "Private Key", "varlink" Later commits will improve startup time for more code paths.

Enselic · 2021-09-08T15:15:16Z

With all preparatory work and investigations done, it is now time to re-visit this PR! 🎉
I did a force-push because the diff was so big from the old code.

The code probably needs some more polishing, but it is definitely ready for at least a high-level review. I would say it is probably also ready for a detailed review.

Turns out it is not necessary to do zero-copy deserialization. Deserialization into Vec<u8> is plenty fast. The reason it is convenient to use Vec<u8> is because with custom assets, it is a bit messy to find an owner to the data when using &'a [u8]. If we only had integrated assets, zero-copy deserialization would be trivial. In fact, I had that working at first, before I added support to custom assets. But again, it is not necessary to get good performance.

I have verified that the performance numbers are practically the same with this new code as they are in the PR description. Startup time in loop-through mode (see #1747) has become a tiny bit slower since we don't do lazy deserialization of MinimalSyntaxes. But even though it should be easy to fix, the gains are so small that I don't think it makes sense to bother at this point.

hyperfine --export-markdown /dev/tty -L bat bat-pr,bat '{bat} tests/examples/multiline.txt'

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat-pr tests/examples/multiline.txt`	13.1 ± 0.9	11.5	16.3	1.16 ± 0.15
`bat tests/examples/multiline.txt`	11.3 ± 1.2	9.6	17.0	1.00

Note that I have ended up changing terminology from "independent syntax sets" to "minimal syntax sets" because the latter is both easier to write, and more accurate.

Looking forward to your comments on this new code!

sharkdp

This looks great - Thank you so much! Really looking forward to this.

I added a few (very minor) review comments.

src/build_assets.rs

src/assets.rs

sharkdp · 2021-09-08T20:29:03Z

Another question, purely for my understanding: Is it true that we can potentially get (almost) rid of the 400KB overhead at some point? Once the syntaxes in minimal sets can be found by each available code-path/method, it shouldn't be required to add them again to the large syntax set, right?

Enselic · 2021-09-09T04:46:45Z

If everything goes according to plan, we will get rid of syntaxes.bin and keep everything in minimal_syntaxes.bin.

Note that the 400kB overhead is temporary. As we add more syntaxes to minimal_syntaxes.bin, it will become bigger. As long as we use syntaxes.bin as a fallback, it will probably make sense to keep all syntaxes there. But again, eventually it should be possible to get rid of syntaxes.bin entirely.

I am keeping the known remaining steps up to date in this comment: #951 (comment) (I know that GitHub does not send emails for edits of comments, but I still try to keep that comment up to date nevertheless)

Enselic · 2021-09-09T04:48:36Z

Left to do before this can be merged:

Add entry to CHANGELOG.md
Perform a last round of verification of the code

Enselic · 2021-09-09T18:36:51Z

I have now done some additional verification, and everything seems to work as it should. So I will merge this shortly.

I also updated CHANGELOG.md. But the changes are to be seen as preliminary. We will want to look over the Performance section when it is time to make the next release, I think.

Enselic mentioned this pull request Aug 8, 2021

Improve bat startup speed #951

Closed

keith-hall approved these changes Aug 8, 2021

View reviewed changes

Enselic added the needs-work label Aug 17, 2021

Enselic mentioned this pull request Aug 18, 2021

Don't take a HighlightingAssets detour to build assets #1802

Merged

Enselic mentioned this pull request Sep 7, 2021

Add regression testing for the custom assets functionality #1829

Merged

Enselic force-pushed the load-independent-syntax-set-when-language-is-explicit branch from 2403984 to a4fb754 Compare September 8, 2021 15:05

Enselic removed the needs-work label Sep 8, 2021

sharkdp approved these changes Sep 8, 2021

View reviewed changes

src/build_assets.rs Outdated Show resolved Hide resolved

src/assets.rs Outdated Show resolved Hide resolved

src/assets.rs Outdated Show resolved Hide resolved

src/assets.rs Outdated Show resolved Hide resolved

src/assets.rs Show resolved Hide resolved

fix some typos and misspellings

474d6e6

Enselic added 2 commits September 9, 2021 20:25

CHANGELOG.md: Add Performance section (preliminary)

abf32bf

Add a CHANGELOG.md entry for this PR

88be2b2

Enselic merged commit 9124271 into sharkdp:master Sep 9, 2021

Enselic deleted the load-independent-syntax-set-when-language-is-explicit branch September 9, 2021 18:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Load independent and minimal syntax sets when using --language #1787

Load independent and minimal syntax sets when using --language #1787

Enselic commented Aug 8, 2021 •

edited

Loading

sharkdp commented Aug 15, 2021

Enselic commented Aug 15, 2021

sharkdp commented Aug 15, 2021

Enselic commented Aug 16, 2021

Enselic commented Sep 8, 2021 •

edited

Loading

sharkdp left a comment

sharkdp commented Sep 8, 2021

Enselic commented Sep 9, 2021

Enselic commented Sep 9, 2021

Enselic commented Sep 9, 2021

Load independent and minimal syntax sets when using --language #1787

Load independent and minimal syntax sets when using --language #1787

Conversation

Enselic commented Aug 8, 2021 • edited Loading

sharkdp commented Aug 15, 2021

Enselic commented Aug 15, 2021

sharkdp commented Aug 15, 2021

Enselic commented Aug 16, 2021

Enselic commented Sep 8, 2021 • edited Loading

sharkdp left a comment

Choose a reason for hiding this comment

sharkdp commented Sep 8, 2021

Enselic commented Sep 9, 2021

Enselic commented Sep 9, 2021

Enselic commented Sep 9, 2021

Enselic commented Aug 8, 2021 •

edited

Loading

Enselic commented Sep 8, 2021 •

edited

Loading