Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auditing use of unsafe rust #398

Closed
pchickey opened this issue Jun 26, 2024 · 5 comments
Closed

Auditing use of unsafe rust #398

pchickey opened this issue Jun 26, 2024 · 5 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@pchickey
Copy link

Hi - I am new to the logos ecosystem.

I don't want to cast any doubt on the correctness of logos or that anyone should choose to use it - it looks like its a remarkable project that is clearly a great choice for many users.

I am upstreaming a crate which uses logos to generate a lexer ( bytecodealliance/wasmtime#8872 ) into wasmtime. As part of accepting logos to be used as a transitive dependency in the wasmtime project, I need to certify that it meets cargo-vet's safe-to-deploy criteria: https://mozilla.github.io/cargo-vet/built-in-criteria.html#safe-to-deploy .

The lexer will be handling untrusted input, and after spending some time examining the way logos codegen works, I don't feel that I can certify that the use of unsafe rust is sound: while I don't have any evidence that it is unsound, the code generator, and the code it generates, is too complex for me to reasonably declare that any use of the logos derive macro is fully sound, as would be implied by marking it as safe-to-deploy.

My rough understanding, from reading the code generator and the author's blog, is that performance is a huge goal of logos, and it has achieved very high performance. In my use case, I don't really care about performance: the inputs to the lexer are small and infrequent, and if the lexer was one or two orders of magnitude slower, that would be fine. However, I do care about correctness, to the point of being very conservative in what dependencies I can accept.

One path forward might be to have an alternative code generator for the logos macros that uses entirely safe rust. Have the logos authors ever considered this approach? Otherwise, I will have to rewrite my lexer by hand, and in doing so I will lose composability with other logos lexers in our ecosystem.

@jeertmans jeertmans added the question Further information is requested label Jun 27, 2024
@jeertmans
Copy link
Collaborator

Hello @pchickey! Thanks for your comment on using unsafe Rust.

I am not the author of the code itself, so I cannot really guarantee that the use of unsafe code is done safely, nor that it is actually important at every place it is used.
But I guess that @maciejhirsz knows that better than me, and could maybe give his opinion.

I think an important first step would be to investigate the use of the unsafe keyword across all crates, and maybe list them here. I know some Rust tools have been developed to this end. Second, we could analyze each use-case separately, and see if we can simply go back to a safe alternative without loss of performances.

The solution to have a opt-in for safe (or unsafe) Rust could be obtained via a feature, #[cfg(feature = safe-only)] for example, but I wonder if that is that easy to implement, or if the compile-time might increase a lot.

This question needs further investigation, and help from anyone is more than welcome!

@jeertmans jeertmans added the help wanted Extra attention is needed label Jun 27, 2024
@davidkern
Copy link
Contributor

davidkern commented Aug 19, 2024

I also have an application for logos handling untrusted input, so I took some time today to audit the usage of unsafe code in logos.

The logos crate itself does not pull in any external dependencies (yay!), so the only source of unsafety is the crate itself. Of these, there are four areas using unsafe: read_byte_unchecked, slice_unchecked, Chunk::from_ptr, and ManuallyDrop.

  1. read_byte_unchecked - these are defined in LexerInternal and Source traits, but not actually used anywhere. Disposition: remove entirely (See edit note below, this is actually used in the generator code)
  2. slice_unchecked - defined by the Source trait, and that trait already defines a safe alternative slice. Disposition: feature flag
  3. from_ptr - defined by the Chunk trait. Disposition: add a from_slice function and feature flag the usage
  4. ManuallyDrop - wraps the temporary storage of token in the parser. Disposition: use Option::take instead and feature flag the usage

Beyond that, there is an additional safety issue with the current implementation. The Source trait may be implemented by the user as a safe trait. However the data returned by the functions of this trait are ultimately used for pointer manipulation in unsafe code. In particular if the source trait implements the len, find_boundary, or is_boundary functions incorrectly, then this can lead to out-of-bounds access in the unsafe code.

If the unsafe code is enabled, then the entire Source trait should be marked as an unsafe trait to indicate to the implementer that incorrect implementation can cause UB.

I've implemented all of the above, except for marking the Source trait unsafe.

There is a decision to be made around including allow_unsafe in the default features (unsafe opt-out), or not and allow for it to be opt-in. Or even, less conservatively, removing unsafe code entirely. I have some opinions but there are pros/cons to each option which warrant a discussion.

Regarding benchmarks, I hoped there would be a clear difference which would help make that decision. But compilers are smart and there are surprises! I'll include benchmark results on the PR.


EDIT: I realized logos-codegen also has two places where unsafe shows up. bytes_to_regex_string is internal to the code generator and doesn't get emitted to the generated macro code. The other, in fork_read, does emit an unsafe block - which uses the read_byte_unchecked method I thought was unused.

@davidkern
Copy link
Contributor

@pchickey The forbid_unsafe feature in #413 is now ready for review. I think this should address the unsafe concern that you raised in this issue. If you have a moment to check it against your PR I'm happy to make any changes needed before that gets merged.

Should just need to add a forbid_unsafe feature to your logos dependency and both the unsafe code internal to the logos crate, and the unsafe code generated by the proc-macro will be disabled.

Some benchmarking showed that there wasn't a massive throughput difference, but the performance outcome will depend greatly on your specific grammar.

@pchickey
Copy link
Author

@davidkern Thanks so much David for your work on this feature! @jeertmans Can you ping me when a new release is made, so I can move bytecodealliance/wasmtime#8872 forward?

@jeertmans
Copy link
Collaborator

Done with v0.14.2 @pchickey ;-) Note that you can also subscribe to new release notification if you want!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants