Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change UnbufferedCharStream to use code points #1796

Merged
merged 1 commit into from
Mar 29, 2017

Conversation

bhamiltoncx
Copy link
Contributor

@bhamiltoncx bhamiltoncx commented Mar 29, 2017

Fixes #1795 .

This changes the Java implementation of UnbufferedCharStream to use code points, not UTF-16 code units, following the changes to CharStreams.

For now, I didn't bother putting this in the CharStreams interface, but if we end up changing that to use a Builder style, we could make an unbuffered() option.

I updated the tests. Note the weirdness around \uFFFF — before, we were relying on (char)IntStream.EOF converting -1 to \uFFFF, but now we no longer have that, so the test code explicitly appends \uFFFF.

@@ -82,7 +82,7 @@ public UnbufferedCharStream() {
/** Useful for subclasses that pull char from other than this.input. */
public UnbufferedCharStream(int bufferSize) {
n = 0;
data = new char[bufferSize];
data = new int[bufferSize];
}

public UnbufferedCharStream(InputStream input) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this ends up calling new InputStreamReader(input) without specifying a Charset. That means this logic depends on the client's $LANG environment variable or equivalent, which is usually not what anyone wants.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should add an optional charset arg on one or more of the arguments. Also, doesn't the fill() method assume UTF-X? I.e., it would not work with some other encoding, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! We absolutely should add a Charset arg, or there's no way to use anything but the default in the calling environment. (We should probably also change to default to UTF-8..)

fill() uses the InputStreamReader to get UTF-16 encoded chars. It will work with any encoding, although there's no way to specify an encoding other than the environment default with the current UnbufferedCharStream API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, great! could you please make the changes? Does that mean we change this field to InputStream?

    protected Reader input;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll make the changes! Nope, we'll still use a Reader, we'll just use a different InputStreamReader constructor which allows us to specify a Charset.

@bhamiltoncx bhamiltoncx force-pushed the unbuffered-char-stream-code-points branch from dd8ac6a to 8108b34 Compare March 29, 2017 20:14
@parrt parrt added this to the 4.7 milestone Mar 29, 2017
@parrt parrt merged commit 802f4c4 into antlr:master Mar 29, 2017
@@ -183,8 +206,8 @@ public int LA(int i) {
int index = p + i - 1;
if ( index < 0 ) throw new IndexOutOfBoundsException();
if ( index >= n ) return IntStream.EOF;
char c = data[index];
if ( c==(char)IntStream.EOF ) return IntStream.EOF;
int c = data[index];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, just realized this can be tidied up to just return data[index];.

add(c);
if (c > Character.MAX_VALUE || c == IntStream.EOF) {
add(c);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder that our "style" is else starts a line ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants