
Better C++ runtime error for non-UTF-8 content #3280

Closed
iterumllc opened this issue Sep 15, 2021 · 14 comments


@iterumllc

Passing a file with non-UTF-8 to a C++-runtime-compiled Antlr 4 parser typically yields an error like:

libc++abi.dylib: terminating with uncaught exception of type std::range_error: wstring_convert: from_bytes error

In this situation a user would greatly benefit from knowing 1) that their file has non-UTF-8 content (nothing in the error message indicates this) and 2) the line number and offset, relative to the preceding valid UTF-8, of the first non-UTF-8 bytes.
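For reference, the exception above can be reproduced outside of ANTLR with a few lines of standard C++. This is a minimal sketch of the codecvt-based conversion path; `utf8ToUtf32` is a hypothetical name, not the runtime's actual API, and `std::wstring_convert` is deprecated since C++17 but still available:

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper mirroring a codecvt-based UTF-8 decode.
// std::wstring_convert throws std::range_error ("wstring_convert:
// from_bytes error") when the input is not valid UTF-8.
std::u32string utf8ToUtf32(const std::string& bytes) {
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
  return converter.from_bytes(bytes);
}
```

Note that `std::range_error` carries no position information at all, which is why the resulting message is so unhelpful.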

@mike-lischke
Member

@parrt This is a valid and good FR.

@parrt
Member

parrt commented Oct 14, 2021

Hi @mike-lischke is there a PR associated with this request?

@iterumllc
Author

When I filed this I hadn't looked at the code in any detail but I have a little time right now and decided to take a look.

It looks to me like this problem may be a configuration issue. The error I get comes from utf32_to_utf8, which calls wstring_convert. The latter should only be called if USE_UTF8_INSTEAD_OF_CODECVT is not defined, and utfcpp may have a better error message.

This is all confusing to me because I had a number of build problems related to utfcpp and I just verified that the package is pulled into runtime/Cpp/runtime/thirdparty/utfcpp. But looking closer it appears that USE_UTF8_INSTEAD_OF_CODECVT is not defined for the build of the parser/generator itself.

Just defining it on its own leads to include path problems. So I'm going to try to fix that temporarily and see if utf8's copy method throws a more useful exception. If it does I'll either try to throw a draft PR together or put more information in this issue.

@iterumllc
Author

Compiling with USE_UTF8_INSTEAD_OF_CODECVT defined isn't very easy with the supplied ExternalAntlr4Cpp.cmake, but the resulting error is:

terminate called after throwing an instance of 'utf8::invalid_utf8'
  what():  Invalid UTF-8

Which is at least an improvement.

The "Introductory Sample" on the main utfcpp page uses getline() and utf8::find_invalid() to do the rest, which would be an easy enough fallback to add in StringUtils.h after catching the exception. I would be willing to do a PR for that if someone else can sort out the utfcpp build issues -- it's not clear why the package is pulled down and built but not used, and I don't think I'm the right person to answer that.
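The fallback described above can be sketched even without utfcpp. Below is a simplified, self-contained validator (the names are hypothetical; it skips some overlong and surrogate checks that utf8::find_invalid performs) that locates the first bad byte and maps it to a line/column for the error message:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Simplified stand-in for utf8::find_invalid: returns the offset of the
// first invalid UTF-8 byte in `s`, or std::string::npos if `s` is valid.
std::size_t findInvalidUtf8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t len;
    if (b < 0x80) len = 1;                                  // ASCII
    else if ((b & 0xE0) == 0xC0 && b >= 0xC2) len = 2;      // reject overlong C0/C1
    else if ((b & 0xF0) == 0xE0) len = 3;
    else if ((b & 0xF8) == 0xF0 && b <= 0xF4) len = 4;      // reject leads > U+10FFFF
    else return i;                                          // invalid lead byte
    if (i + len > s.size()) return i;                       // truncated sequence
    for (std::size_t j = 1; j < len; ++j) {
      if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80) return i;
    }
    i += len;
  }
  return std::string::npos;
}

// Map a byte offset to a 1-based line number and byte column for reporting.
std::pair<std::size_t, std::size_t> lineAndColumn(const std::string& s,
                                                  std::size_t offset) {
  std::size_t line = 1, col = 1;
  for (std::size_t i = 0; i < offset; ++i) {
    if (s[i] == '\n') { ++line; col = 1; } else { ++col; }
  }
  return {line, col};
}
```

After catching the conversion exception, the caller could rescan the raw bytes with something like this and report "invalid UTF-8 at line L, byte column C" instead of the bare what() string.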

@mike-lischke
Member

@parrt I don't know of any PR for this issue.

@jcking
Collaborator

jcking commented Dec 10, 2021

@mike-lischke The new standalone UTF-8 handling in #3398 could easily be updated to keep track of the byte offset (code units) and code point offset, and include them in the error message.

@iterumllc
Author

If somebody else doesn't first I may take a look once #3398 is merged.

@mike-lischke
Member

@iterumllc That would be great if you could add the error handling you want and create a new PR! In the meantime we can close this issue here, right?

@iterumllc
Author

I pulled over and built a copy of the Antlr 4 repository matching 43fb4c2, then built a project against that same revision of the C++ runtime. Everything compiles OK, but when I run it I get an exception thrown by std::call_once() at line 31 of ATNDeserializationOptions.cpp, so I can't currently see the UTF-8-specific error.

That line was changed towards the end of October. I'm on Arch Linux compiling with gcc 10.3.0. This project's use of Antlr 4 is nontrivial but pretty vanilla. I'm willing to look into this more, but probably not solo. @jcking any advice?

(BTW @parrt may be amused to know that the project is makeotfexe in Adobe's AFDKO and the port to Antlr 4 was from PCCTS 1.33.)

@jcking
Collaborator

jcking commented Dec 13, 2021

It's probably because pthread is not linked into whatever you are trying to run. You may need to pass -pthread when compiling and linking.
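For a typical GCC/Clang invocation that would look something like the following (the file names, include path, and library name are placeholders for an actual project layout, not prescribed by ANTLR):

```shell
# -pthread affects both the compile and the link step,
# so pass it in a combined compile-and-link command (or in both).
g++ -std=c++17 -pthread main.cpp \
    -I/usr/local/include/antlr4-runtime \
    -lantlr4-runtime -o myparser
```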

@parrt
Member

parrt commented Dec 13, 2021

@iterumllc wow! A blast from 30 years ago :)

@iterumllc
Author

iterumllc commented Dec 13, 2021

@jcking I'll play around with that but:

  1. Grepping around, I don't see any docs changes related to pthread or CMake's FindThreads, and I seem to remember that linking POSIX Threads varies by OS. (Presumably with CMake one would sort this out using target_link_libraries?)
  2. I'm not sure it makes sense to impose this requirement on builds that aren't intended to be threaded. Is there a reason it's not #ifdefed?
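With CMake, the portable route is indeed FindThreads rather than a hard-coded -pthread flag. A sketch for a consuming project (the target names here are placeholders, and this is an assumption about the consumer's build, not ANTLR's own CMake files):

```cmake
# Portable POSIX-threads linkage for a target using the ANTLR 4 C++ runtime.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(myparser PRIVATE antlr4_static Threads::Threads)
```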

@iterumllc
Author

Adding lib pthread worked in my test environment, so I now get:

terminate called after throwing an instance of 'antlr4::IllegalArgumentException'
  what():  UTF-8 string contains an illegal byte sequence

@mike-lischke Like the earlier utfcpp-based error, this meets criterion 1 of my report but not criterion 2. It also doesn't display the relevant filename, so one may not know whether the error is in a top-level file or three included files down.

Finding the byte sequence in question then becomes a platform-specific adventure for the (not necessarily sophisticated) user.

In our own use case we have many files out in the world that have strict 7-bit-ASCII contents except for comment regions, which could contain comment text in any encoding the author preferred at the time over the past 30 years. (This is because the previous parser would just strip everything between the # and the newline.) We don't see it as a problem to require that comments be updated to UTF-8 but we do see it as a problem not to give any help in finding the problematic contents.

I therefore wouldn't close this issue myself but it's your queue.

Having looked a bit at the code, the best way to tackle this is probably to use the new lenient flag and do something intelligent with the added character. That would make the content before the error available to the lexer, so the error can be emitted in the normal error context (which may or may not be manipulated by the caller, as when using the input stream to implement include files).

The tricky question is how to represent the problem at the token level. Eventually it would be preferable for such errors to be recoverable. With lenient one might already be able to implement that at the tokenizer level by looking for the added character, but then the grammar would be C++-specific. Maybe for now that character should hard-map to a token similar to EOF, but one that can't be accepted by the parser, to force an error at the right point. If the printable text of that token is something like "MALFORMED_UTF8" the patch could be pretty simple.

@parrt
Member

parrt commented Dec 13, 2021

Ok, thanks @iterumllc closing per your comment.

@parrt parrt closed this as completed Dec 13, 2021