Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACP: BufReader::peek #417

Closed
lolbinarycat opened this issue Jul 24, 2024 · 13 comments
Closed

ACP: BufReader::peek #417

lolbinarycat opened this issue Jul 24, 2024 · 13 comments
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@lolbinarycat
Copy link

lolbinarycat commented Jul 24, 2024

Proposal

Problem statement

It is often desirable to look ahead a small amount in an unseekable stream, such as a unix pipe.

Motivating examples or use cases

the main usecase is inspecting the magic number of a file, such as is trying to be done here, but it could also be useful for some simple parsers.

Solution sketch

add a method BufReader::peek that has the signature fn peek(&mut self, n: usize) -> io::Result<&[u8]>

the method would work as followed:

  1. clear from the front of the buffer any consumed bytes
  2. do read calls on the underlying Read object to fill the buffer until it reaches any of: the length requested, the BufReader's capacity, or end of file
  3. return the slice of this new buffer

Alternatives

  1. give it the same signature as Read::read (adds an extra copy, but has nice parity)
  2. a Peek trait that abstracts over all streams that can do a limited lookahead
  3. instead of limiting the lookahead to the capacity of the buffer, automatically grow the buffer if a larger lookahead is requested (less performant and more complexity)
  4. leave this up to other crates (this would require those other crates to reimplement basically all of BufReader)

Links and related work

original thread linked previously

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@lolbinarycat lolbinarycat added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Jul 24, 2024
@the8472
Copy link
Member

the8472 commented Jul 24, 2024

This doesn't differ much from fill_buf, so what's the important distinguishing factor here?

@lolbinarycat
Copy link
Author

using peek is more robust since it will perform multiple reads if needed, similar to read_exact, but without the behavior of throwing an error at end of file. this is important for properly handling line-buffered ptys and other edge cases.

(sidenote: when looking for a function to compare it to i was surprised that rust doesn't appear to have an equivalent to Go's io.ReadFull, which performs a similar task of doing multiple underlying reads, but doesn't error on end of file like read_exact)

@the8472
Copy link
Member

the8472 commented Jul 24, 2024

clear from the front of the buffer any consumed bytes

So, backshifting? Yeah, people have asked for this.

Though I think a backshift-and-read-once-more function might be a possible middle ground. If the read doesn't read enough you can call it again without having to fill the entire buffer.
That could help with variable-length messages where the caller doesn't know how long it's going to be.

@kennytm
Copy link
Member

kennytm commented Jul 24, 2024

  1. a Peek trait that abstracts over all streams that can do a limited lookahead

The other obvious Peek candidates are TcpStream and UdpSocket, but I think these have quite different semantic than the current suggestion.

  1. instead of limiting the lookahead to the capacity of the buffer, automatically grow the buffer if a larger lookahead is requested (less performant and more complexity)

None of the BufReader methods resizes the buffer. The buffer is actually now implemented as a Box<[MaybeUninit<u8>]> rather than a Vec<u8>, so clearly it is intended to be staying unchanged.

@the8472
Copy link
Member

the8472 commented Jul 24, 2024

The buffer is actually now implemented as a Box<[MaybeUninit]> rather than a Vec, so clearly it is intended to be staying unchanged.

It used to be a vec. So that's more of an implementation detail and could be changed. But then we should have explicit resize methods. Introducing it through implicit resizing would be odd.

@kennytm
Copy link
Member

kennytm commented Jul 24, 2024

the method would work as followed:

  1. clear from the front of the buffer any consumed bytes
  2. do read calls on the underlying Read object to fill the buffer until it reaches any of: the length requested, the BufReader's capacity, or end of file
  3. return the slice of this new buffer

I think the peek(n) method should instead do the following:

  1. if self.buffer().len() <= n, just return buffer[..n].
  2. otherwise, clear the consumed bytes (memmove the buffer to the beginning), and then refill the buffer
  3. return buffer[..min(n, capacity)]

Step 1 is considering that, if the buffer is large enough to support peeking n bytes it should not modify the buffer position, to preserve the performance of seek_relative(-1).

Step 2 may refill the buffer up to capacity or up to n. I prefer it to refill up to capacity because that's also what the normal fill_buf() method does.

@lolbinarycat
Copy link
Author

None of the BufReader methods resizes the buffer. The buffer is actually now implemented as a Box<[MaybeUninit]> rather than a Vec, so clearly it is intended to be staying unchanged.

yep i looked at the implementation, that's why this is an alternative and not the main proposal

I think the peek(n) method should instead do the following:

1. if `self.buffer().len() <= n`, just return `buffer[..n]`.

2. otherwise, clear the consumed bytes (memmove the buffer to the beginning), and then refill the buffer

3. return `buffer[..min(n, capacity)]`

seems like it handles that one edge case a lot better with no real downside, nice!

@the8472
Copy link
Member

the8472 commented Jul 30, 2024

We discussed this during today's libs-API meeting. We're generally ok with adding this method, but the tracking issue should note the open questions whether requesting an n larger the remaining capacity should result in an error or a panic or return a short slice and also whether it should read the full spare capacity or only as much as is requested.


Speaking for myself, I think we should also have lower level building blocks backshift and read_more and will file a separate ACP for that.

@the8472 the8472 closed this as completed Jul 30, 2024
@the8472 the8472 added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label Jul 30, 2024
@blueglyph
Copy link

blueglyph commented Aug 4, 2024

I first posted that in the tracking issue, but I think here is a better place. What I'm quoting below is the question in that tracking issue.

what should happen if n > capacity? options are: short slice, error, or panic.

Another possibility that may be more in line with other methods reading more than one item, like read, read_until, read_line, etc., and also more like TcpStream::peek(&self, buf: &mut [u8]) -> Result<usize>, is to give a mutable reference to a buffer and return a Result:

impl BufReader {
    fn peek(&mut self, n: usize, buf: &mut [u8]) -> io::Result<usize>
}

I'm suggesting a Result<usize> to indicate that either the peek went well but the size could still be different than n or an error occurred when trying to perform the peek. Maybe there are other possibilities, like returning an error if the number of bytes isn't n (Result<()>), though it's less informative (unless the error type details what went wrong...).

It could also be convenient to have it as a trait rather than a implementation of only BufReader.

@lolbinarycat
Copy link
Author

@blueglyph you seem to be suggesting using alternatives #1 and #2 together.

@blueglyph
Copy link

@lolbinarycat Heh, you're right. I first saw the tracking issue and the questions I quoted, without the alternatives. I replied there, then I saw there was a proposal for it here, quickly moved the suggestion and misread the alternatives in my hurry.

Sorry. Just ignore my earlier comment.

FWIW, I also think returning a shorter slice instead of an error is more useful and perhaps more conventional. Errors should be reserved for unexpected I/O issues, but when someone peeks further without knowing the total size, it's hardly an error to give n even if there are fewer items. On the other hand, a programmer might forget to check the length (which is another reason why I prefer read's signature).

PS: I had also thought of the [Peekable](https://doc.rust-lang.org/std/iter/struct.Peekable.html#method.peek) mentioned earlier for TcpStream and UdpSocket, but it's an iterator.

@lolbinarycat
Copy link
Author

On the other hand, a programmer might forget to check the length (which is another reason why I prefer read's signature).

with a function like read, forgetting to check the length means reusing the old contents of the buffer, a silent logic error (a very nasty class of error)

with a function like the proposed peek, forgetting to check the length will simply panic.

@blueglyph
Copy link

with a function like read, forgetting to check the length means reusing the old contents of the buffer, a silent logic error (a very nasty class of error)

with a function like the proposed peek, forgetting to check the length will simply panic.

That's a good point. The length is returned explicitly, but ignoring it could be undetected. And maybe it's easier to ignore it when n is already given in argument.

Note that you can use the #[must_use] attribute on a function or trait method declaration to produce a warning if the return value is ignored. I don't think you can trigger an error, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

4 participants