-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change UnbufferedCharStream to use code points #1796
Change UnbufferedCharStream to use code points #1796
Conversation
8e90f0b
to
dd8ac6a
Compare
@@ -82,7 +82,7 @@ public UnbufferedCharStream() { | |||
/** Useful for subclasses that pull char from other than this.input. */ | |||
public UnbufferedCharStream(int bufferSize) { | |||
n = 0; | |||
data = new char[bufferSize]; | |||
data = new int[bufferSize]; | |||
} | |||
|
|||
public UnbufferedCharStream(InputStream input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this ends up calling new InputStreamReader(input)
without specifying a Charset
. That means this logic depends on the client's $LANG
environment variable or equivalent, which is usually not what anyone wants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we should add an optional charset arg on one or more of the arguments. Also, doesn't the fill() method assume UTF-X? I.e., it would not work with some other encoding, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! We absolutely should add a Charset
arg, or there's no way to use anything but the default in the calling environment. (We should probably also change to default to UTF-8..)
fill()
uses the InputStreamReader
to get UTF-16 encoded char
s. It will work with any encoding, although there's no way to specify an encoding other than the environment default with the current UnbufferedCharStream
API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, great! could you please make the changes? Does that mean we change this field to InputStream?
protected Reader input;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll make the changes! Nope, we'll still use a Reader
, we'll just use a different InputStreamReader
constructor which allows us to specify a Charset
.
dd8ac6a
to
8108b34
Compare
@@ -183,8 +206,8 @@ public int LA(int i) { | |||
int index = p + i - 1; | |||
if ( index < 0 ) throw new IndexOutOfBoundsException(); | |||
if ( index >= n ) return IntStream.EOF; | |||
char c = data[index]; | |||
if ( c==(char)IntStream.EOF ) return IntStream.EOF; | |||
int c = data[index]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, just realized this can be tidied up to just return data[index];
.
add(c); | ||
if (c > Character.MAX_VALUE || c == IntStream.EOF) { | ||
add(c); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder that our "style" is else starts a line ;)
Fixes #1795 .
This changes the Java implementation of
UnbufferedCharStream
to use code points, not UTF-16 code units, following the changes toCharStreams
.For now, I didn't bother putting this in the
CharStreams
interface, but if we end up changing that to use aBuilder
style, we could make anunbuffered()
option.I updated the tests. Note the weirdness around
\uFFFF
— before, we were relying on(char)IntStream.EOF
converting-1
to\uFFFF
, but now we no longer have that, so the test code explicitly appends\uFFFF
.