Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load independent and minimal syntax sets when using --language #1787

Conversation

Enselic
Copy link
Collaborator

@Enselic Enselic commented Aug 8, 2021

Ok, so now the #951 work is starting to get interesting.

This PR improves startup time in the following scenario:

  • --language is used
  • A syntax without any external dependencies is used (see complete list below)

To keep small and digestible PRs, that is the only scenario that this PR improves startup time for. The plan is to improve startup time for all code paths in the upcoming PR.

The binary size only increases by ~400 kB as a result of this PR. That is because each SyntaxSet contains only one SyntaxDefinition, so it is not possible to optimize further (with current syntect data structures).

List of syntaxes that this PR (should) improve startup time for:

"ActionScript", "Apache Conf", "AppleScript", "ARM Assembly", "Assembly
(x86_64)", "Batch File", "BibTeX", "C", "C#", "Cabal", "CMakeCache",
"CoffeeScript", "Comma Separated Values", "CpuInfo", "CSS", "D", "Dart Analysis
Output", "Dart", "Diff", "Dockerfile", "DotENV", "Erlang", "F#", "Fortran (Fixed
Form)", "Fortran (Modern)", "Fortran Namelist", "Friendly Interactive Shell
(fish)", "fstab", "GLSL", "gnuplot", "Go", "GraphQL", "Groff/troff", "group",
"Haskell", "Highlight non-printables", "hosts", "INI", "Java Properties",
"Jinja2", "JSON", "jsonnet", "Kotlin", "LaTeX Log", "Lean", "Less", "Lisp",
"LLVM", "log", "Lua", "MATLAB", "MemInfo", "NAnt Build File", "Nim", "Ninja",
"Nix", "Pascal", "passwd", "Plain Text", "PowerShell", "Private Key", "Protocol
Buffer (TEXT)", "Puppet", "R", "Rego", "Regular Expression", "requirements.txt",
"resolv", "Robot Framework", "Rust", "Scala", "SML", "Solidity", "SQL",
"Strace", "Stylus", "Swift", "SystemVerilog", "Tcl", "Terraform", "TeX", "TOML",
"TypeScript", "TypeScriptReact", "varlink" , "Verilog", "VimL", "Vyper", "XML",
"YAML", "Zig"

Some example benchmarks:

hyperfine -L bat bat-pr,bat '{bat} -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt'
Command Mean [ms] Min [ms] Max [ms] Relative
bat-pr -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt 22.3 ± 1.3 20.2 27.2 1.00
bat -f --language kotlin ./tests/syntax-tests/source/Kotlin/test.kt 108.3 ± 1.9 104.4 111.5 4.86 ± 0.30
hyperfine -L bat bat-pr,bat '{bat} -f --language c tests/benchmarks/test-src/miniz.c'
Command Mean [ms] Min [ms] Max [ms] Relative
bat-pr -f --language c tests/benchmarks/test-src/miniz.c 89.8 ± 2.3 85.9 96.4 1.00
bat -f --language c tests/benchmarks/test-src/miniz.c 171.8 ± 1.2 170.0 174.2 1.91 ± 0.05
hyperfine --export-markdown /dev/tty -L bat bat-pr,bat '{bat} -f --language rust examples/simple.rs'  
Command Mean [ms] Min [ms] Max [ms] Relative
bat-pr -f --language rust examples/simple.rs 22.1 ± 1.2 19.9 25.7 1.00
bat -f --language rust examples/simple.rs 107.6 ± 2.5 103.5 111.3 4.88 ± 0.28

NOTE: I am not updating syntaxes.bin, so e.g. the HTTP syntax is missing from that file, but present in independent_syntax_sets.bin, which can be a bit confusing unless you know about it.

NOTE: I am reserving the right to change the format of the new binary files in incompatible ways, until we have made a release. After we have made a release, we should try to be backwards compatible, of course

I have not yet thoroughly verified this change for edge cases etc, but all regression tests pass, and I'm pretty happy with the code, so this is is certainly ready for a real code review round.

@Enselic Enselic mentioned this pull request Aug 8, 2021
@sharkdp
Copy link
Owner

sharkdp commented Aug 15, 2021

I took a first look at the code. It looks great - thank you! I have a question before I get into a more detailed review: Why did you decide to perform the "offset and size" handling yourself, instead of trying to shift that work to serde?

I haven't looked into the details, but I could imagine that we could ask serde to serialize a HashMap<String, SerializedSyntaxSet> instead of serializing the concatenated SerializedSyntaxSets and a lookup HashMap<String, OffsetAndSize> in addition. Even if there would be "two layers" of serialization, that could maybe still work out?

@Enselic
Copy link
Collaborator Author

Enselic commented Aug 15, 2021

Interesting idea! I would be fine to explore that method when I get time before you do a detailed review. It seems like it should be quick to deserialize raw data, even if that ends up being measured in megs (which would be the case when independent_syntax_sets.bin contains SyntaxSets for all syntaxes). The only way to know is to try. If it only adds say 4ms (on my machine) it would perhaps be worth the simplicity.

So let me try this out before we decide what method to use, and before detailed code review 👍

@sharkdp
Copy link
Owner

sharkdp commented Aug 15, 2021

It seems like it should be quick to deserialize raw data, even if that ends up being measured in megs (which would be the case when independent_syntax_sets.bin contains SyntaxSets for all syntaxes). The only way to know is to try. If it only adds say 4ms (on my machine) it would perhaps be worth the simplicity.

Hm, I wouldn't sacrifice this much, performance-wise. I was kind of hoping that [u8] (de)serialization would be zero-cost(?).

@Enselic
Copy link
Collaborator Author

Enselic commented Aug 16, 2021

Hmm yes serde does indeed support zero-copy deserialization: https://serde.rs/lifetimes.html

If I could get that to work that would simplify the code a lot

Definitely worth exploring more, so I'll do that

This significantly speeds up the startup time of bat, since only a single
linked SyntaxDefinition is loaded for each file. The size increase of the
binary is just ~400 kB.

In order for startup time to be improved, the --language arg must be used, and
it must match one of the following names:

"Plain Text", "ActionScript", "AppleScript", "Batch File", "NAnt Build File",
"C#", "C", "CSS", "D", "Diff", "Erlang", "Go", "Haskell", "JSON", "Java
Properties", "BibTeX", "LaTeX Log", "TeX", "Lisp", "Lua", "MATLAB", "Pascal",
"R", "Regular Expression", "Rust", "SQL", "Scala", "Tcl", "XML", "YAML", "Apache
Conf", "ARM Assembly", "Assembly (x86_64)", "CMakeCache", "Comma Separated
Values", "Cabal", "CoffeeScript", "CpuInfo", "Dart Analysis Output", "Dart",
"Dockerfile", "DotENV", "F#", "Friendly Interactive Shell (fish)", "Fortran
(Fixed Form)", "Fortran (Modern)", "Fortran Namelist", "fstab", "GLSL",
"GraphQL", "Groff/troff", "group", "hosts", "INI", "Jinja2", "jsonnet",
"Kotlin", "Less", "LLVM", "Lean", "MemInfo", "Nim", "Ninja", "Nix", "passwd",
"PowerShell", "Protocol Buffer (TEXT)", "Puppet", "Rego", "resolv", "Robot
Framework", "SML", "Strace", "Stylus", "Solidity", "Vyper", "Swift",
"SystemVerilog", "TOML", "Terraform", "TypeScript", "TypeScriptReact",
"Verilog", "VimL", "Zig", "gnuplot", "log", "requirements.txt", "Highlight
non-printables", "Private Key", "varlink"

Later commits will improve startup time for more code paths.
@Enselic Enselic force-pushed the load-independent-syntax-set-when-language-is-explicit branch from 2403984 to a4fb754 Compare September 8, 2021 15:05
@Enselic
Copy link
Collaborator Author

Enselic commented Sep 8, 2021

With all preparatory work and investigations done, it is now time to re-visit this PR! 🎉
I did a force-push because the diff was so big from the old code.

The code probably needs some more polishing, but it is definitely ready for at least a high-level review. I would say it is probably also ready for a detailed review.

Turns out it is not necessary to do zero-copy deserialization. Deserialization into Vec<u8> is plenty fast. The reason it is convenient to use Vec<u8> is because with custom assets, it is a bit messy to find an owner to the data when using &'a [u8]. If we only had integrated assets, zero-copy deserialization would be trivial. In fact, I had that working at first, before I added support to custom assets. But again, it is not necessary to get good performance.

I have verified that the performance numbers are practically the same with this new code as they are in the PR description. Startup time in loop-through mode (see #1747) has become a tiny bit slower since we don't do lazy deserialization of MinimalSyntaxes. But even though it should be easy to fix, the gains are so small that I don't think it makes sense to bother at this point.

hyperfine --export-markdown /dev/tty -L bat bat-pr,bat '{bat} tests/examples/multiline.txt'  
Command Mean [ms] Min [ms] Max [ms] Relative
bat-pr tests/examples/multiline.txt 13.1 ± 0.9 11.5 16.3 1.16 ± 0.15
bat tests/examples/multiline.txt 11.3 ± 1.2 9.6 17.0 1.00

Note that I have ended up changing terminology from "independent syntax sets" to "minimal syntax sets" because the latter is both easier to write, and more accurate.

Looking forward to your comments on this new code!

@Enselic Enselic removed the needs-work label Sep 8, 2021
Copy link
Owner

@sharkdp sharkdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great - Thank you so much! Really looking forward to this.

I added a few (very minor) review comments.

src/build_assets.rs Outdated Show resolved Hide resolved
src/assets.rs Outdated Show resolved Hide resolved
src/assets.rs Outdated Show resolved Hide resolved
src/assets.rs Outdated Show resolved Hide resolved
src/assets.rs Show resolved Hide resolved
@sharkdp
Copy link
Owner

sharkdp commented Sep 8, 2021

Another question, purely for my understanding: Is it true that we can potentially get (almost) rid of the 400KB overhead at some point? Once the syntaxes in minimal sets can be found by each available code-path/method, it shouldn't be required to add them again to the large syntax set, right?

@Enselic
Copy link
Collaborator Author

Enselic commented Sep 9, 2021

If everything goes according to plan, we will get rid of syntaxes.bin and keep everything in minimal_syntaxes.bin.

Note that the 400kB overhead is temporary. As we add more syntaxes to minimal_syntaxes.bin, it will become bigger. As long as we use syntaxes.bin as a fallback, it will probably make sense to keep all syntaxes there. But again, eventually it should be possible to get rid of syntaxes.bin entirely.

I am keeping the known remaining steps up to date in this comment: #951 (comment) (I know that GitHub does not send emails for edits of comments, but I still try to keep that comment up to date nevertheless)

@Enselic
Copy link
Collaborator Author

Enselic commented Sep 9, 2021

Left to do before this can be merged:

  • Add entry to CHANGELOG.md
  • Perform a last round of verification of the code

@Enselic
Copy link
Collaborator Author

Enselic commented Sep 9, 2021

I have now done some additional verification, and everything seems to work as it should. So I will merge this shortly.

I also updated CHANGELOG.md. But the changes are to be seen as preliminary. We will want to look over the Performance section when it is time to make the next release, I think.

@Enselic Enselic merged commit 9124271 into sharkdp:master Sep 9, 2021
@Enselic Enselic deleted the load-independent-syntax-set-when-language-is-explicit branch September 9, 2021 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants