
Better C++ runtime error for non-UTF-8 content #3280

Closed
iterumllc opened this issue Sep 15, 2021 · 14 comments


@iterumllc

Passing a file with non-UTF-8 to a C++-runtime-compiled Antlr 4 parser typically yields an error like:

libc++abi.dylib: terminating with uncaught exception of type std::range_error: wstring_convert: from_bytes error

In this situation a user would greatly benefit from knowing 1) that their file has non-UTF-8 content (nothing in the error message indicates this) and 2) the line number and offset, relative to the preceding valid UTF-8, of the first non-UTF-8 bytes.
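For reference, the exception above can be reproduced outside of ANTLR with a few lines of standard C++. This is a minimal sketch of the codecvt-based conversion path; `utf8ToUtf32` is a hypothetical name, not the runtime's actual API, and `std::wstring_convert` is deprecated since C++17 but still available:

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper mirroring a codecvt-based UTF-8 decode.
// std::wstring_convert throws std::range_error ("wstring_convert:
// from_bytes error") when the input is not valid UTF-8.
std::u32string utf8ToUtf32(const std::string& bytes) {
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
  return converter.from_bytes(bytes);
}
```

Note that `std::range_error` carries no position information at all, which is why the resulting message is so unhelpful.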

@mike-lischke
Member

@parrt This is a valid and good FR.

@parrt
Member

parrt commented Oct 14, 2021

Hi @mike-lischke is there a PR associated with this request?

@iterumllc
Author

When I filed this I hadn't looked at the code in any detail but I have a little time right now and decided to take a look.

It looks to me like this problem may be a configuration issue. The error I get comes from utf32_to_utf8, which calls wstring_convert. The latter should only be called if USE_UTF8_INSTEAD_OF_CODECVT is not defined, and utfcpp may have a better error message.

This is all confusing to me because I had a number of build problems related to utfcpp and I just verified that the package is pulled into runtime/Cpp/runtime/thirdparty/utfcpp. But looking closer it appears that USE_UTF8_INSTEAD_OF_CODECVT is not defined for the build of the parser/generator itself.

Just defining it on its own leads to include path problems. So I'm going to try to fix that temporarily and see if utf8's copy method throws a more useful exception. If it does I'll either try to throw a draft PR together or put more information in this issue.

@iterumllc
Author

Compiling with USE_UTF8_INSTEAD_OF_CODECVT defined isn't very easy with the supplied ExternalAntlr4Cpp.cmake, but the resulting error is:

terminate called after throwing an instance of 'utf8::invalid_utf8'
  what():  Invalid UTF-8

Which is at least an improvement.

The "Introductory Sample" on the main utfcpp page uses getline() and utf8::find_invalid() to do the rest, which would be an easy enough fallback to add in StringUtils.h after catching the exception. I would be willing to do a PR for that if someone else can sort out the utfcpp build issues -- it's not clear why the package is pulled down and built but not used, and I don't think I'm the right person to answer that.
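The fallback described above can be sketched even without utfcpp. Below is a simplified, self-contained validator (the names are hypothetical; it skips some overlong and surrogate checks that utf8::find_invalid performs) that locates the first bad byte and maps it to a line/column for the error message:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Simplified stand-in for utf8::find_invalid: returns the offset of the
// first invalid UTF-8 byte in `s`, or std::string::npos if `s` is valid.
std::size_t findInvalidUtf8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t len;
    if (b < 0x80) len = 1;                                  // ASCII
    else if ((b & 0xE0) == 0xC0 && b >= 0xC2) len = 2;      // reject overlong C0/C1
    else if ((b & 0xF0) == 0xE0) len = 3;
    else if ((b & 0xF8) == 0xF0 && b <= 0xF4) len = 4;      // reject leads > U+10FFFF
    else return i;                                          // invalid lead byte
    if (i + len > s.size()) return i;                       // truncated sequence
    for (std::size_t j = 1; j < len; ++j) {
      if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80) return i;
    }
    i += len;
  }
  return std::string::npos;
}

// Map a byte offset to a 1-based line number and byte column for reporting.
std::pair<std::size_t, std::size_t> lineAndColumn(const std::string& s,
                                                  std::size_t offset) {
  std::size_t line = 1, col = 1;
  for (std::size_t i = 0; i < offset; ++i) {
    if (s[i] == '\n') { ++line; col = 1; } else { ++col; }
  }
  return {line, col};
}
```

After catching the conversion exception, the caller could rescan the raw bytes with something like this and report "invalid UTF-8 at line L, byte column C" instead of the bare what() string.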

@mike-lischke
Member

@parrt I don't know of any PR for this issue.

@jcking
Collaborator

jcking commented Dec 10, 2021

@mike-lischke The new standalone UTF-8 handling in #3398 could easily be updated to keep track of the byte offset (code units) and code point offset, and include them in the error message.

@iterumllc
Author

If somebody else doesn't first I may take a look once #3398 is merged.

@mike-lischke
Member

@iterumllc That would be great if you could add the error handling you want and create a new PR! In the meantime we can close this issue here, right?

@iterumllc
Author

I pulled over and built a copy of the Antlr 4 repository matching 43fb4c2, then built a project against that same revision of the C++ runtime. Everything compiles OK, but when I run it I get an exception thrown by std::call_once() at line 31 of ATNDeserializationOptions.cpp, so I can't currently see the UTF-8-specific error.

That line was changed towards the end of October. I'm on Arch Linux compiling with gcc 10.3.0. This project's use of Antlr 4 is nontrivial but pretty vanilla. I'm willing to look into this more, but probably not solo. @jcking any advice?

(BTW @parrt may be amused to know that the project is makeotfexe in Adobe's AFDKO and the port to Antlr 4 was from PCCTS 1.33.)

@jcking
Collaborator

jcking commented Dec 13, 2021

It's probably because pthread is not linked into whatever you are trying to run. You may need to pass -pthread when compiling and linking.
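For a typical GCC/Clang invocation that would look something like the following (the file names, include path, and library name are placeholders for an actual project layout, not prescribed by ANTLR):

```shell
# -pthread affects both the compile and the link step,
# so pass it in a combined compile-and-link command (or in both).
g++ -std=c++17 -pthread main.cpp \
    -I/usr/local/include/antlr4-runtime \
    -lantlr4-runtime -o myparser
```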

@parrt
Member

parrt commented Dec 13, 2021

@iterumllc wow! A blast from 30 years ago :)

@iterumllc
Author

iterumllc commented Dec 13, 2021

@jcking I'll play around with that but:

  1. Grepping around, I don't see any docs changes related to pthread or CMake's FindThreads, and I seem to remember that linking POSIX Threads varies by OS. (Presumably with CMake one would sort this out using target_link_libraries?)
  2. I'm not sure it makes sense to impose this requirement on builds that aren't intended to be threaded. Is there a reason it's not #ifdefed?
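With CMake, the portable route is indeed FindThreads rather than a hard-coded -pthread flag. A sketch for a consuming project (the target names here are placeholders, and this is an assumption about the consumer's build, not ANTLR's own CMake files):

```cmake
# Portable POSIX-threads linkage for a target using the ANTLR 4 C++ runtime.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(myparser PRIVATE antlr4_static Threads::Threads)
```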

@iterumllc
Author

Adding lib pthread worked in my test environment, so I now get:

terminate called after throwing an instance of 'antlr4::IllegalArgumentException'
  what():  UTF-8 string contains an illegal byte sequence

@mike-lischke Like the earlier utfcpp-based error, this meets criterion 1 of my report but not criterion 2. It also doesn't display the relevant filename, so one may not know whether the error is in a top-level file or three included files down.

Finding the byte sequence in question then becomes a platform-specific adventure for the (not necessarily sophisticated) user.

In our own use case we have many files out in the world that have strict 7-bit-ASCII contents except for comment regions, which could contain comment text in any encoding the author preferred at the time over the past 30 years. (This is because the previous parser would just strip everything between the # and the newline.) We don't see it as a problem to require that comments be updated to UTF-8 but we do see it as a problem not to give any help in finding the problematic contents.

I therefore wouldn't close this issue myself but it's your queue.

Having looked a bit at the code, the best way to tackle this is probably to use the new lenient flag and do something intelligent with the added character. That would make the content before the error available to the lexer, so the error can be emitted in the normal error context (which may or may not be manipulated by the caller, as when using the input stream to implement include files).

The tricky question is how to represent the problem at the token level. Eventually it would be preferable for such errors to be recoverable. With lenient one might already be able to implement that at the tokenizer level by looking for the added character, but then the grammar would be C++-specific. Maybe for now that character should hard-map to a token similar to EOF, but one that can't be accepted by the parser, to force an error at the right point. If the printable text of that token is something like "MALFORMED_UTF8" the patch could be pretty simple.

@parrt
Member

parrt commented Dec 13, 2021

Ok, thanks @iterumllc closing per your comment.

@parrt parrt closed this as completed Dec 13, 2021