Change UnbufferedCharStream to use code points #1796

bhamiltoncx · 2017-03-29T19:40:49Z

Fixes #1795.

This changes the Java implementation of UnbufferedCharStream to use code points, not UTF-16 code units, following the changes to CharStreams.
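
To make the code point versus code unit distinction concrete, here is a minimal sketch using only standard JDK methods. The class name `CodePointDemo` is hypothetical and not part of ANTLR; it only shows why a `char`-based buffer cannot hold one element per character once input leaves the Basic Multilingual Plane.

```java
// Hypothetical demo class; not part of the ANTLR runtime.
class CodePointDemo {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
        // encodes it as a surrogate pair (two chars).
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600
    }
}
```

A stream that indexes by code point would report two symbols for this input, while one that indexes by UTF-16 code unit would report three.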

For now, I didn't bother putting this in the CharStreams interface, but if we end up changing that to use a Builder style, we could make an unbuffered() option.

I updated the tests. Note the weirdness around \uFFFF: before, the tests relied on (char)IntStream.EOF converting -1 to \uFFFF. With an int buffer that conversion no longer happens, so the test code now appends \uFFFF explicitly.
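
The \uFFFF behaviour can be reproduced in isolation. The sketch below uses a local `EOF` constant standing in for `IntStream.EOF` (-1, as stated above); the class and method names are hypothetical and only illustrate the narrowing cast, not the actual test code.

```java
// Hypothetical demo class; not part of the ANTLR runtime or its tests.
class EofCastDemo {
    // Stand-in for IntStream.EOF, which is -1.
    static final int EOF = -1;

    /** Check against a char buffer: EOF and U+FFFF are indistinguishable. */
    static boolean looksLikeEofInCharBuffer(char c) {
        return c == (char) EOF;
    }

    /** Check against an int buffer: only the real sentinel matches. */
    static boolean isEofInIntBuffer(int c) {
        return c == EOF;
    }

    public static void main(String[] args) {
        // Narrowing -1 to char keeps the low 16 bits, giving U+FFFF.
        System.out.println((char) EOF == '\uFFFF');               // true
        System.out.println(looksLikeEofInCharBuffer('\uFFFF'));   // true
        System.out.println(isEofInIntBuffer(0xFFFF));             // false
        System.out.println(isEofInIntBuffer(EOF));                // true
    }
}
```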

bhamiltoncx · 2017-03-29T19:43:17Z

runtime/Java/src/org/antlr/v4/runtime/UnbufferedCharStream.java

@@ -82,7 +82,7 @@ public UnbufferedCharStream() {
 	/** Useful for subclasses that pull char from other than this.input. */
 	public UnbufferedCharStream(int bufferSize) {
 		n = 0;
-		data = new char[bufferSize];
+		data = new int[bufferSize];
 	}

 	public UnbufferedCharStream(InputStream input) {


Note this ends up calling new InputStreamReader(input) without specifying a Charset. That means this logic depends on the client's $LANG environment variable or equivalent, which is usually not what anyone wants.

It seems like we should add an optional charset arg to one or more of the constructors. Also, doesn't the fill() method assume UTF-X? I.e., it would not work with some other encoding, right?

Yes! We absolutely should add a Charset arg, or there's no way to use anything but the default in the calling environment. (We should probably also change the default to UTF-8.)

fill() uses the InputStreamReader to get UTF-16 encoded chars. It will work with any encoding, although there's no way to specify an encoding other than the environment default with the current UnbufferedCharStream API.

OK, great! Could you please make the changes? Does that mean we change this field to InputStream?

protected Reader input;

Sure, I'll make the changes! Nope, we'll still use a Reader; we'll just use a different InputStreamReader constructor, one that lets us specify a Charset.
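
As a rough illustration of the point in this thread, the sketch below wraps an `InputStream` with the `InputStreamReader(InputStream, Charset)` constructor from the JDK. The class `CharsetReaderDemo` and its helper methods are hypothetical; this is not the eventual UnbufferedCharStream API, just a demonstration that the decoded result depends on the charset passed in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the ANTLR runtime.
class CharsetReaderDemo {
    /** Decodes the stream with an explicit charset instead of the platform default. */
    static Reader open(InputStream input, Charset charset) {
        return new InputStreamReader(input, charset);
    }

    /** Decodes the given bytes with the given charset and returns the text. */
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = open(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // bytes 0xC3 0xA9
        System.out.println(decode(utf8, StandardCharsets.UTF_8).length());      // 1
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

The same two bytes decode to one char under UTF-8 and two chars under ISO-8859-1, which is why relying on the environment default is fragile.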

bhamiltoncx · 2017-03-29T20:30:24Z

runtime/Java/src/org/antlr/v4/runtime/UnbufferedCharStream.java

@@ -183,8 +206,8 @@ public int LA(int i) {
        int index = p + i - 1;
        if ( index < 0 ) throw new IndexOutOfBoundsException();
 		if ( index >= n ) return IntStream.EOF;
-        char c = data[index];
-        if ( c==(char)IntStream.EOF ) return IntStream.EOF;
+        int c = data[index];


Oops, just realized this can be tidied up to just return data[index];.
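
A simplified sketch of why the tidy-up works: once the buffer holds ints, a stored EOF is already -1, so no char comparison is needed. The class below is hypothetical and preloads its whole input, unlike the real stream, which fills lazily; only the index arithmetic is taken from the hunk above.

```java
// Hypothetical, simplified model of lookahead over an int buffer.
// The real UnbufferedCharStream fills its buffer on demand; this does not.
class IntLookaheadDemo {
    // Stand-in for IntStream.EOF.
    static final int EOF = -1;

    private final int[] data;
    private final int n;
    private int p = 0;

    IntLookaheadDemo(int[] codePoints) {
        data = codePoints.clone();
        n = data.length;
    }

    /** Returns the code point i symbols ahead (1-based), or EOF past the end. */
    int LA(int i) {
        int index = p + i - 1;
        if (index < 0) throw new IndexOutOfBoundsException();
        if (index >= n) return EOF;
        // No special case needed: U+FFFF and EOF are distinct int values.
        return data[index];
    }

    /** Advances one symbol, stopping at the end of the data. */
    void consume() {
        if (p < n) p++;
    }
}
```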

parrt · 2017-03-29T20:56:29Z

runtime/Java/src/org/antlr/v4/runtime/UnbufferedCharStream.java

-				add(c);
+				if (c > Character.MAX_VALUE || c == IntStream.EOF) {
+					add(c);
+				} else {


reminder that our "style" is else starts a line ;)
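
For readers unfamiliar with how UTF-16 chars from a Reader become code points, here is a minimal sketch of surrogate pairing using standard `Character` methods. It is not ANTLR's fill() or nextChar() code: the class name, the `readAll` helper, and the choice to throw on unpaired surrogates are all assumptions made for illustration. It follows the "else/catch starts a line" style mentioned above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the ANTLR runtime.
class SurrogatePairingDemo {
    // Stand-in for IntStream.EOF.
    static final int EOF = -1;

    /** Reads one full code point from the reader, or EOF at end of input. */
    static int nextCodePoint(Reader reader) throws IOException {
        int c = reader.read();
        if (c == EOF) {
            return EOF;
        }
        char ch = (char) c;
        if (Character.isHighSurrogate(ch)) {
            int next = reader.read();
            if (next == EOF || !Character.isLowSurrogate((char) next)) {
                throw new IllegalArgumentException("unpaired high surrogate");
            }
            return Character.toCodePoint(ch, (char) next);
        }
        if (Character.isLowSurrogate(ch)) {
            throw new IllegalArgumentException("unpaired low surrogate");
        }
        return c;
    }

    /** Returns all code points in the text, followed by a trailing EOF. */
    static int[] readAll(String text) {
        List<Integer> out = new ArrayList<>();
        try (Reader reader = new StringReader(text)) {
            int cp;
            do {
                cp = nextCodePoint(reader);
                out.add(cp);
            } while (cp != EOF);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }
}
```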

bhamiltoncx force-pushed the unbuffered-char-stream-code-points branch from 8e90f0b to dd8ac6a on March 29, 2017 19:41

bhamiltoncx mentioned this pull request Mar 29, 2017

What to do with UnbufferedCharStream for 4.7? #1795

Closed

Change UnbufferedCharStream to use 32-bit Unicode code points and 32-bit buffer

8108b34

bhamiltoncx force-pushed the unbuffered-char-stream-code-points branch from dd8ac6a to 8108b34 on March 29, 2017 20:14

parrt added this to the 4.7 milestone Mar 29, 2017

parrt added lexers unicode labels Mar 29, 2017

parrt merged commit 802f4c4 into antlr:master Mar 29, 2017

This was referenced Mar 29, 2017

C#: Change UnbufferedCharStream to use 32-bit Unicode code points and 32-bit buffer #1798

Merged

Small tidy-up for Java UnbufferedCharStream #1799

Merged

parrt reviewed Mar 29, 2017

bhamiltoncx mentioned this pull request Mar 29, 2017

Use UTF-8 by default in UnbufferedCharStream and allow specifying charset #1800

Merged
