Better C++ runtime error for non-UTF-8 content #3280
Comments
@parrt This is a valid and good FR.
Hi @mike-lischke is there a PR associated with this request?
When I filed this I hadn't looked at the code in any detail but I have a little time right now and decided to take a look. It looks to me like this problem may be a configuration issue. The error I get is from
This is all confusing to me because I had a number of build problems related to utfcpp and I just verified that the package is pulled into
Just defining it on its own leads to include path problems. So I'm going to try to fix that temporarily and see if utf8's copy method throws a more useful exception. If it does I'll either try to throw a draft PR together or put more information in this issue.
Compiling with
Which is at least an improvement. The "Introductory Sample" from the main utfcpp page includes the use of
@parrt I don't know of any PR for this issue.
@mike-lischke The new standalone UTF-8 handling in #3398 could be easily updated to keep track of the byte offset (code units), code point offset and include it in the error message.
If somebody else doesn't first I may take a look once #3398 is merged.
@iterumllc That would be great if you could add the error handling you want and create a new PR! In the meantime we can close this issue here, right?
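To make the suggestion above concrete, here is a minimal sketch of a decoder that tracks both offsets and puts them in the exception message. It is not the code from #3398 and not the ANTLR runtime API: `InvalidUtf8Error` and `decodeUtf8` are hypothetical names invented for this illustration, and C++17 is assumed for `std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical exception type (an assumption, not an ANTLR class).
class InvalidUtf8Error : public std::runtime_error {
public:
    InvalidUtf8Error(std::size_t bytePos, std::size_t codePointPos)
        : std::runtime_error("invalid UTF-8 at byte offset " + std::to_string(bytePos) +
                             " (code point offset " + std::to_string(codePointPos) + ")"),
          byteOffset(bytePos), codePointOffset(codePointPos) {}

    const std::size_t byteOffset;       // code units consumed before the bad sequence
    const std::size_t codePointOffset;  // code points decoded before the bad sequence
};

// Hypothetical helper: decodes UTF-8 to UTF-32, throwing at the first ill-formed sequence.
inline std::u32string decodeUtf8(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t minCp = 0;  // smallest code point allowed for this length (rejects overlong forms)
        if (b0 < 0x80)                { len = 1; cp = b0;        minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minCp = 0x10000; }
        else throw InvalidUtf8Error(i, out.size());  // stray continuation byte or 0xF8..0xFF

        if (in.size() - i < len) throw InvalidUtf8Error(i, out.size());  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw InvalidUtf8Error(i, out.size());
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, values above U+10FFFF, and UTF-16 surrogates.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw InvalidUtf8Error(i, out.size());

        out.push_back(cp);
        i += len;
    }
    assert(i == in.size());  // every byte was consumed by a complete sequence
    return out;
}
```

The reported byte offset is the start of the ill-formed sequence, and the code point offset is simply the number of code points decoded so far, so a caller can map either one back to a position in the input.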
I pulled over and built a copy of the Antlr 4 repository matching 43fb4c2. Then I built a project against that same revision of the Cpp runtime. Everything compiles OK but when I run I get an exception thrown by
That line was changed towards the end of October. I'm on Arch Linux compiling with gcc 10.3.0. This project's use of Antlr 4 is not trivial but pretty vanilla. I'm willing to look into this more but probably not solo. @jcking any advice? (BTW @parrt may be amused to know that the project is
It's probably because pthread is not linked to whatever you are trying to run. You may need to pass
@iterumllc wow! A blast from 30 years ago :)
@jcking I'll play around with that but:
Adding lib
@mike-lischke Like the earlier utfcpp-based error this meets criterion 1 of my report but not criterion 2. It also doesn't display the relevant filename so one may not know if one is at a top-level file or 3 included files down. Finding the byte sequence in question then becomes a platform-specific adventure for the (not necessarily sophisticated) user.
In our own use case we have many files out in the world that have strict 7-bit-ASCII contents except for comment regions, which could contain comment text in any encoding the author preferred at the time over the past 30 years. (This is because the previous parser would just strip everything between the
I therefore wouldn't close this issue myself but it's your queue. Having looked a bit at the code the best way to tackle this is probably to use the new
The tricky question is how to represent the problem at the token level. Eventually it would be preferable such errors were recoverable. With
OK, thanks @iterumllc, closing per your comment.
Passing a file with non-UTF-8 content to a parser built against the Antlr 4 C++ runtime typically yields an error like:
In this situation a user would greatly benefit from knowing 1) that their file has non-UTF-8 content (nothing in the error message indicates this) and 2) the line number and offset of the first non-UTF-8 bytes, counted relative to the valid UTF-8 text that precedes them.
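For illustration, the two requested pieces of information could be computed by a pre-check along the following lines. This is only a sketch under assumptions: `Utf8Problem`, `wellFormedLength` and `findFirstInvalidUtf8` are hypothetical names, not part of the runtime, C++17 is assumed, lines are counted by `\n` only, and columns are counted in code points.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type (an assumption, not an ANTLR class).
struct Utf8Problem {
    std::size_t byteOffset;  // 0-based offset of the first ill-formed byte sequence
    std::size_t line;        // 1-based line, counted in the valid text before it
    std::size_t column;      // 1-based column in code points on that line
};

// True if byte k exists and lies in [lo, hi].
inline bool inRange(std::string_view s, std::size_t k, unsigned lo, unsigned hi) {
    if (k >= s.size()) return false;
    const unsigned b = static_cast<unsigned char>(s[k]);
    return b >= lo && b <= hi;
}

// Length (1-4) of the well-formed UTF-8 sequence starting at s[i], or 0 if ill-formed.
// The ranges follow the well-formed byte sequence table in the Unicode Standard.
inline std::size_t wellFormedLength(std::string_view s, std::size_t i) {
    assert(i < s.size());
    const unsigned b = static_cast<unsigned char>(s[i]);
    auto tail = [&](std::size_t k) { return inRange(s, k, 0x80, 0xBF); };
    if (b <= 0x7F) return 1;
    if (b >= 0xC2 && b <= 0xDF) return tail(i + 1) ? 2 : 0;
    if (b >= 0xE0 && b <= 0xEF) {
        const unsigned lo = (b == 0xE0) ? 0xA0 : 0x80;  // reject overlong forms
        const unsigned hi = (b == 0xED) ? 0x9F : 0xBF;  // reject surrogates
        return (inRange(s, i + 1, lo, hi) && tail(i + 2)) ? 3 : 0;
    }
    if (b >= 0xF0 && b <= 0xF4) {
        const unsigned lo = (b == 0xF0) ? 0x90 : 0x80;  // reject overlong forms
        const unsigned hi = (b == 0xF4) ? 0x8F : 0xBF;  // reject values above U+10FFFF
        return (inRange(s, i + 1, lo, hi) && tail(i + 2) && tail(i + 3)) ? 4 : 0;
    }
    return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Returns the position of the first ill-formed sequence, or nullopt if s is valid UTF-8.
inline std::optional<Utf8Problem> findFirstInvalidUtf8(std::string_view s) {
    std::size_t line = 1, column = 1;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = wellFormedLength(s, i);
        if (n == 0) return Utf8Problem{i, line, column};
        if (s[i] == '\n') { ++line; column = 1; } else { ++column; }
        i += n;
    }
    return std::nullopt;
}
```

A caller that knows which file it is reading could combine the result with the filename to print something like "non-UTF-8 content at line L, column C (byte offset N)" before handing the buffer to the parser.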