Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF #1633

Merged
merged 2 commits into from
Feb 23, 2017

Conversation

bhamiltoncx
Copy link
Contributor

@bhamiltoncx bhamiltoncx commented Jan 27, 2017

Fixes #276 .

This used to be a WIP PR, but it's now ready for review.

This PR introduces a new extended Unicode escape \u{10ABCD} in ANTLR4 grammars to support Unicode literal values > U+FFFF.

The serialized ATN represents any atom or range with a Unicode value > U+FFFF as a set. Any such set is serialized in the ATN with 32-bit arguments.

I bumped the UUID, since this changes the serialized ATN format.

I included lots of tests and made sure everything is passing on Linux, Mac, and Windows.

@sharwell
Copy link
Member

sharwell commented Jan 27, 2017

📝 I'm looking into the possibility of implementing this without changing the serialized ATN at all. I'm not quite done but I've started the work in my unicode-stream branch.

Note that I haven't implemented the decodeCodePoints method, but manipulation of the ATN during deserialization is nearing common practice for me. 😆

@bhamiltoncx
Copy link
Contributor Author

bhamiltoncx commented Jan 27, 2017 via email

@sharwell
Copy link
Member

sharwell commented Jan 27, 2017

@bhamiltoncx To preserve compatibility, I put the decode behind a deserialization option which defaults to false. I'll have to make the option available for the ATN instance which appears in generated code (so it can be set to true), but that step would appear after the initial step of just getting things working.

@bhamiltoncx
Copy link
Contributor Author

@sharwell: I took a look at your branch. I think there are two places we need to support Unicode values > U+FFFF:

  1. Serialized as the start or end value of one of the sets
  2. Serialized as an argument to an edge

Without changing the serialized ATN format, I'm not convinced we can store Unicode SMP values > U+FFFF in either of these locations.

Even using UTF-16 to encode Unicode SMP values as a series of 16-bit values, the problem is that we can only store a single UTF-16 value as a start/end value of a set or as an argument to an edge.

As far as I can tell, the existing ATN format doesn't have any way we can bypass that restriction. Please let me know if I missed something!

@sharwell
Copy link
Member

I'm guessing we actually don't need to handle them in the ATN for SetTransition or RangeTransition.

Serialization

AtomTransition

  • If the label is up to U+FFFF, serialize it normally
  • If the label is above U+FFFF, serialize a sequence of the high surrogate followed by the low surrogate

RangeTransition

  1. Separate the range into an Interval containing the values up to U+FFFF, and an Interval containing the values above U+FFFF
  2. Serialize the RangeTransition as only containing the first interval (values up to U+FFFF)
  3. Serialize the second interval using at most three parallel branches. Let the range be [min, max], where min={H1}{L1} and max={H2}{L2}.
    1. If H1=H2, serialize one transition of the form ('\uH1' '\uL1'..'\uL2')

    2. If H1=H2-1, serialize the block

      ( '\uH1' '\uL1'..'\uDFFF'
      | '\uH2' '\uDF00'..'\uL2'
      )
      
    3. If H1<H2-1, serialize the block

      ( '\uH1' '\uL1'..'\uDFFF'
      | '\uH1+1'..'\uH2-1' '\uDF00'..'\uDFFF'
      | '\uH2' '\uDF00'..'\uL2'
      )
      

SetTransition, NotSetTransition

  1. Split the IntervalSet into two, one containing the values from the set up to U+FFFF, and the other containing values over U+FFFF
  2. Serialize the SetTransition using the first IntervalSet (values below U+FFFF)
  3. Serialize the remaining values as a sequence of ranges, as described above for RangeInterval

NotSetTransition

  1. If the maximum value in the not-set is at most U+FFFF, serialize it normally.
  2. Otherwise, negate the set over the range [Character.MIN_CODE_POINT, Character.MAX_CODE_POINT] and serialize it as a SetTransition.

WildcardTransition

Serialize as-is.

Deserialization

  1. Identify the three relevant patterns seen when serializing RangeTransition, and in each case collapse the sequence of transitions into a single RangeTransition

📝 There is no need to reconstruct the fully optimized set transitions. Literal values above U+FFFF would likely be infrequent in the grammar, so the number of excess range transitions would likely be minimal. Even if the input contained these values, they are most likely to be handled by NotSetTransition and WildcardTransition. However, it is also possible to collapse these transitions, a feature I've actually implemented in my fork.

@bhamiltoncx
Copy link
Contributor Author

If the label is above U+FFFF, serialize a sequence of the high surrogate followed by the low surrogate

If I understand correctly, changing the sequence from one 16-bit value to two 16-bit values means we'll break compatibility with existing deserializers which expect all sequences to have a single 16-bit value.

If that's correct, we might as well change the UUID, right?

It seems like the main advantage of your proposal is a smaller serialized ATN for grammars which contain no values > U+FFFF. Offline, I had earlier asked @parrt if serialized ATN size was important, and he said it wasn't that important, which is why I didn't pursue that.

Overall, it seems like your proposal will work, but it's a lot more complicated than extending set and transition values from 16 to 32 bits — especially implementing it in all the runtimes will be a good chunk of work.

We can definitely improve the size of the serialized ATN in a few ways if that becomes a bottleneck.

@sharwell
Copy link
Member

sharwell commented Jan 27, 2017

If I understand correctly, changing the sequence from one 16-bit value to two 16-bit values means we'll break compatibility with existing deserializers which expect all sequences to have a single 16-bit value.

I mean serialize two transitions instead of just one, as a sequential pair.

@bhamiltoncx
Copy link
Contributor Author

I mean serialize two transitions instead of just one.

I see! Yes, that proposal would definitely maintain binary compatibility with the existing ATN format.

I still think keeping a 16-bit restriction on sets and transition arguments may not be the most useful goal. Do we know if it's worthwhile? Are there any tests we can run to determine this?

@sharwell
Copy link
Member

sharwell commented Jan 27, 2017

If we're changing the ATN format, we could still keep the overhead down for all existing grammars to a single int. When serializing the IntervalSet sets, order them so all sets which contain no explicit values above U+FFFF appear first, followed by sets which do contain values above U+FFFF.

  1. Total number of sets
  2. Number of sets with no values over U+FFFF ← This is the only new serialized value for existing grammars
  3. Array of sets with no values over U+FFFF, using current serialization form for IntervalSet
  4. Array of sets with values over U+FFFF, using new serialization form (current grammars would not have any values in this array)

Then we always serialize transitions containing an explicit value over U+FFFF as a SetTransition (this is trivial, since AtomTransition ⊆ RangeTransition ⊆ SetTransition). During deserialization, if a set only has one element we construct an AtomTransition, and if it only has one interval we construct a RangeTransition.

@bhamiltoncx
Copy link
Contributor Author

If we're changing the ATN format, we could still keep the overhead down for all existing grammars to a single int.

Definitely agreed!

Then we always serialize transitions containing an explicit value over U+FFFF as a SetTransition.

Interesting. Is there a reason we don't do this today in general for all transitions? Is it another storage optimization?

@sharwell
Copy link
Member

Interesting. Is there a reason we don't do this today in general for all transitions? Is it another storage optimization?

Yes, we avoid storing interval sets for atoms and ranges by inlining those values instead.

@bhamiltoncx
Copy link
Contributor Author

I haven't yet implemented @sharwell's suggestions to reduce the size of the serialized ATN, but the WIP branch should now work for atoms > U+FFFF, sets containing values > U+FFFF, and ranges containing values > U+FFFF.

Lots of low-hanging fruit to fix, but the basic tests are now passing!

@parrt
Copy link
Member

parrt commented Jan 27, 2017

Great news Ben! I'll be poking around this weekend.

@parrt
Copy link
Member

parrt commented Jan 29, 2017

While we're at it, guys, should we open ANTLR up to parsing arbitrary 32-bit int streams? I can see limiting unicode 32 to be \u{...} but perhaps \xABCDABCD for 32-bit values? Seems like we may need a grammar-level option like encoding=unicode32 or some such anyway so perhaps encoding=int ?? In that case, getText() would be meaningless or we could just define it as the little or big endian sequence of 16bit words.

@bhamiltoncx
Copy link
Contributor Author

should we open ANTLR up to parsing arbitrary 32-bit int streams?

This is a great question. I think it'd be fine to have a grammar option to say "this is not unicode, but a stream of x-bit-wide units" so people could specify what type of input they intend to parse.

That would improve life for folks parsing 8-bit binary formats as well.

Probably that work should be separate from this work.

@parrt
Copy link
Member

parrt commented Jan 30, 2017

Agreed, let's just keep 32-bit int parsing in the back of our minds. IntStream was a good start that we did long ago so I figured we might as well go whole hog at some point.

@bhamiltoncx
Copy link
Contributor Author

Man, Windows is not fun! Python 3.5 on Windows doesn't actually support writing Unicode to stdout via print(). They only got that working on Python 3.6 and later. Lots of workarounds for me.. :)

http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console/32176732#32176732

@bhamiltoncx
Copy link
Contributor Author

Oh, Windows..

https://msdn.microsoft.com/en-us/library/system.console(v=vs.110).aspx

Unicode Support for the Console
In general, the console reads input and writes output by using the current console code page, which the system locale defines by default.

Ugh, well, okay, at least there's a workaround..

To display Unicode characters to the console. you set the OutputEncoding property to either UTF8Encoding or UnicodeEncoding.

Geez, I swear I tried that.. reads further

Display of characters outside the Basic Multilingual Plane (that is, of surrogate pairs) is not supported

headdesk

(FYI, Mono gets this right; setting Console.OutputEncoding = Encoding.UTF8 works just fine on Linux. It seems only Windows can't handle writing Unicode code points > U+FFFF via Console.WriteLine().)

@bhamiltoncx
Copy link
Contributor Author

Thank goodness this stuff is open-source. Looks like for some reason UTF8 and Unicode are different:

https://github.com/dotnet/corefx/blob/bffef76f6af208e2042a2f27bc081ee908bb390b/src/System.Console/src/System/ConsolePal.Windows.cs#L123

I'll try Unicode, which seems to mean UTF-8 on Linux with Mono.

@parrt
Copy link
Member

parrt commented Feb 6, 2017

Oh man, those rascals!

@bhamiltoncx
Copy link
Contributor Author

Hmm, even Unicode isn't working; the C# tests are still failing on Windows.

I guess that's because under the hood, Windows' interprocess I/O (in our case, stdout) for C# <-> Java communication still has to convert Unicode to bytes, so it's going to use whatever is the system default code page (probably CP1252 if this is US English Windows).

Mono doesn't seem to have a problem getting this right, but Windows is proving to be a challenge.

My options are:

  1. Ensure Java explicitly sets the Windows console code page to UTF-8 before we run unit tests
  2. Have the C# tests write to a file instead of stdout
  3. ???

@parrt
Copy link
Member

parrt commented Feb 6, 2017

For (1.) is there a generic way to set the console to UTF-8 that works on all platforms?

@bhamiltoncx
Copy link
Contributor Author

bhamiltoncx commented Feb 6, 2017 via email

@bhamiltoncx
Copy link
Contributor Author

OK, I'm going to give up on trying to get C# tests on Windows to write Unicode via stdout, and write to a file instead.

@parrt
Copy link
Member

parrt commented Feb 7, 2017

ok, how big a change will it be in the test rig?

@mike-lischke
Copy link
Member

Yes, just saw it and love it. You are really going the extra round to get this fully done. Respect.

@parrt
Copy link
Member

parrt commented Feb 21, 2017

Yeah, it's amazing work! I think we'll need to update the doc too. Do you have a complete list of new features? Is it "just" \u{...} and \p{...}?

@bhamiltoncx
Copy link
Contributor Author

bhamiltoncx commented Feb 21, 2017 via email

@mike-lischke
Copy link
Member

Sorry, I must have missed something. I don't see a CodePointInputStream class in the patch nor can I find it in the current code (everything is merged already to ANTLR4 master, right?). What is this class for? I don't think we need to change e.g. the C++ implementation of the ANTLRInputStream class, as it already handles UTF-32.

@bhamiltoncx
Copy link
Contributor Author

bhamiltoncx commented Feb 22, 2017 via email

@parrt
Copy link
Member

parrt commented Feb 22, 2017

BTW, Sam suggested we verify:

  1. Verify binary parsing applications still work as expected (byte streams instead of char streams)
  2. Verify lexer performance on very large inputs, e.g. the Java lexer over the JDK source

@bhamiltoncx
Copy link
Contributor Author

OK! 1. should be pretty easy to add a unit test for, I'll do that.
For 2., I'd love some tips, I'm not really sure how to do that.

@parrt
Copy link
Member

parrt commented Feb 22, 2017 via email

@bhamiltoncx
Copy link
Contributor Author

OK, sent out WIP #1688 to add new lexer charset escapes, so:

[\p{Lowercase_Letter}]

matches both a-z as well as 𝐚-𝐳.

I'll add the tests for the "binary" parsing feature next.

@bhamiltoncx
Copy link
Contributor Author

Added a test for a "binary" grammar, works fine.

@bhamiltoncx
Copy link
Contributor Author

Here's the difference in size of the serialized ATN for LargeLexer:

Before change

% wc L.java                                                    
  17549   28074 1343154 L.java

After change

% wc L.java                                                          
  17549   28074 1343156 L.java

Looks pretty good to me (2 bytes difference, which is what I'd expect from the extra serialized int for the "0 SMP sets").

@parrt
Copy link
Member

parrt commented Feb 23, 2017

Ok, i see nothing super scary in the change list and @bhamiltoncx has been meticulous and fastidious. He has lots of tests on the new \u{...} feature and it tests backward compatibility with existing grammars that manually specify UTF-16 sequences. Also has a binary parsing test. I'm going to merge and then we can all test on our pet grammars. Perhaps @teverett will try it out on grammars-v4 repo.

@parrt parrt added this to the 4.6.1 milestone Feb 23, 2017
@parrt parrt merged commit 4ec294f into antlr:master Feb 23, 2017
@bhamiltoncx bhamiltoncx deleted the unicode branch February 23, 2017 22:08
@bhamiltoncx
Copy link
Contributor Author

Great! I'll follow up on any issues and take a look at the documentation.

@teverett
Copy link
Member

@parrt I'll try this on grammars-v4. It might be best to try it after the issue with this commit has been repaired.

#1537

@KvanTTT
Copy link
Member

KvanTTT commented Feb 24, 2017

@teverett we should repair warnings and errors in existing grammars. See issue update.

@jlettvin
Copy link

jlettvin commented Jun 23, 2017 via email

@parrt
Copy link
Member

parrt commented Jun 23, 2017

so one of the built-in sets is off by one?

@bhamiltoncx
Copy link
Contributor Author

bhamiltoncx commented Jun 23, 2017 via email

@jlettvin
Copy link

jlettvin commented Jun 23, 2017 via email

@parrt
Copy link
Member

parrt commented Jun 23, 2017

ok, i'll let @teverett grab the latest.

@teverett
Copy link
Member

teverett commented Jul 8, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants