Number types' FromStr impl should recognize Unicode minus #130315

Open
Enyium opened this issue Sep 13, 2024 · 13 comments
Labels
A-unicode Area: Unicode C-discussion Category: Discussion or questions that doesn't represent real issues. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Enyium

Enyium commented Sep 13, 2024

You can say the following about the character currently exclusively recognized as minus by FromStr impls:

  • Its official Unicode name is U+002D HYPHEN-MINUS : hyphen, dash, minus sign (copied from BabelMap).
  • The general typographical reality of font designs is that it's a hyphen and not a minus sign.
  • The usage-wise reality is that it's widely used as a replacement character for a real minus sign.

In comparison, U+2212 is a dedicated minus sign:

  • Official Unicode name: U+2212 MINUS SIGN
  • Matches the horizontal bar of a plus sign.
  • GitHub, e.g., uses it to display the red "lines deleted" values. Wikipedia and LaTeX equation renderings also use it.
  • In HTML, you have the entity &minus; for it.

Benefits of adding support:

  • If the FromStr implementations of number types (i32, f64, etc.) supported U+2212 MINUS SIGN in addition to U+002D HYPHEN-MINUS as a minus sign, UI frameworks, for example, would have an easier time implementing text boxes that display the typographically more pleasing real minus sign: they could simply convert the text content to the corresponding number.
  • Apps that don't check user input further, but directly try to parse it as a number and rely on the returned Result, would no longer confuse end users who paste a number with a Unicode minus into the app and get an error.
  • Also, since it's a dedicated minus sign, and not just a common replacement character, it logically follows in my opinion that it should be supported.
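
To make the second benefit concrete, here is a minimal sketch of what an application has to do today. `parse_with_unicode_minus` is a hypothetical app-side helper, not part of the standard library, and it assumes that only a single leading U+2212 should be treated like U+002D.

```rust
use std::num::ParseIntError;

/// Hypothetical app-side helper: treats one leading U+2212 MINUS SIGN
/// like U+002D HYPHEN-MINUS, then defers to the standard `FromStr` impl.
fn parse_with_unicode_minus(input: &str) -> Result<i32, ParseIntError> {
    match input.strip_prefix('\u{2212}') {
        // Re-attach an ASCII hyphen-minus in front of the remaining text.
        Some(rest) => format!("-{rest}").parse(),
        None => input.parse(),
    }
}

fn main() {
    // Current behavior: only U+002D is accepted as the negative sign.
    assert_eq!("-20".parse::<i32>(), Ok(-20));
    assert!("\u{2212}20".parse::<i32>().is_err());

    // With the workaround, both spellings give the same number.
    assert_eq!(parse_with_unicode_minus("\u{2212}20"), Ok(-20));
    assert_eq!(parse_with_unicode_minus("-20"), Ok(-20));

    // A doubled sign is still rejected by the standard parser.
    assert!(parse_with_unicode_minus("\u{2212}-20").is_err());
}
```

Only the first character is rewritten, so a U+2212 anywhere else in the input still produces an error, as it does today.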

I'm not familiar with this, but I want to point out that the Wikipedia article "Plus and minus signs" also talks about ⁒ as a minus sign (U+2052 COMMERCIAL MINUS SIGN). Perhaps, this should also be supported. But I don't know whether it's regularly set off from the number with some space character.

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Sep 13, 2024
@workingjubilee
Member

In comparison, U+2212 is a dedicated minus sign:

  • GitHub, e.g., uses it to display the red "lines deleted" values. Wikipedia and LaTeX equation renderings also use it.

I don't think this would weigh in as a positive factor for us adding it to FromStr for the signed numeric types. The opposite, actually.

@Enyium
Author

Enyium commented Sep 14, 2024

It means you see this character in the wild. In #49746, someone said:

I accidentally used a unicode minus sign (−) instead of a dash (-). This happened to me when I pasted a constant from Wikipedia.

This could also happen with input users paste into Rust apps.

Why would it even be an argument for the opposite?!

@workingjubilee
Member

Because if you copy it from a diff, then for a diff that looks like

- 20
+ 45

these should not be parsed as the integers [-20, 45].

@Enyium
Author

Enyium commented Sep 14, 2024

You have spaces there between sign and number. Even with a hyphen as a minus sign, this currently gives Err(ParseIntError { kind: InvalidDigit }), and I wasn't proposing to change that.
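
The error mentioned here can be checked with a short sketch. `i32_error_kind` is a hypothetical helper introduced only for this illustration; `IntErrorKind` and `ParseIntError::kind` are from the standard library.

```rust
use std::num::IntErrorKind;

/// Hypothetical helper: returns the error kind of a failed `i32` parse,
/// or `None` if the input parsed successfully.
fn i32_error_kind(input: &str) -> Option<IntErrorKind> {
    input.parse::<i32>().err().map(|e| e.kind().clone())
}

fn main() {
    // A space between the sign and the digits is rejected as an invalid digit.
    assert_eq!(i32_error_kind("- 20"), Some(IntErrorKind::InvalidDigit));
    // Without the space, the hyphen-minus is accepted as the sign.
    assert_eq!(i32_error_kind("-20"), None);
    // U+2212 currently fails the same way, with or without a space.
    assert_eq!(i32_error_kind("\u{2212}20"), Some(IntErrorKind::InvalidDigit));
}
```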

In any case, the end user would need to have a basic sense of what they're copying and where they're pasting it.

@workingjubilee
Member

@Enyium That was only for the sake of readability, a diff can also include

-20
+45

@workingjubilee
Member

I am only making this observation because I do think your request is reasonable, and I am slightly perplexed why you included extraneous data that seems like it could undermine the strength of your proposal.

I'm not familiar with this, but I want to point out that the Wikipedia article "Plus and minus signs" also talks about ⁒ as a minus sign (U+2052 COMMERCIAL MINUS SIGN). Perhaps, this should also be supported. But I don't know whether it's regularly set off from the number with some space character.

There are many alternative numeric notations. There are many alternative "commercial minus signs", not just that one. Almost invariably, such graphemes tend to have many subtle variations or reuses. Extending FromStr beyond the set of actual "this is a minus sign that looks like the minus sign that Rust already recognizes" would allow people to FromStr something that looks like %20 and get -20 instead of, say, 0.2, which could be what they expect, incorrectly or no. And if they come from a context that is not European, they may not expect that glyph but another glyph to be interpreted as a "minus sign", and then we have to be locale-aware, and... well...

I think it would be inappropriate for Rust to attempt to guess what exact cultural context the FromStr impls must live in. In general Rust has strived to be Unicode-aware but locale-agnostic, deferring locale-sensitive tasks to libraries like icu4x. It seems that, in the spirit of this, it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical and often rendered the same. And yes, many fonts render hyphen-minus identically to U+2212, it's not like there's a law against doing so.

@Enyium
Author

Enyium commented Sep 14, 2024

That was only for the sake of readability, a diff can also include

-20
+45

+45 is also already parsed as the number 45, right? Why would that be an argument against supporting something that is by definition the minus sign? (Also, your -20 contains a hyphen, which would already be parsed as the number −20, if someone were to paste it somewhere where Rust's FromStr::from_str() would be caused to run.)
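
A short sketch of the sign handling referred to here, using only the standard `FromStr` impls. `parse_both` is a hypothetical helper introduced for this illustration.

```rust
/// Hypothetical helper: parses the same text as `i32` and as `f64`,
/// discarding the error details.
fn parse_both(input: &str) -> (Option<i32>, Option<f64>) {
    (input.parse::<i32>().ok(), input.parse::<f64>().ok())
}

fn main() {
    // An explicit leading plus is already accepted.
    assert_eq!(parse_both("+45"), (Some(45), Some(45.0)));
    // The ASCII hyphen-minus is accepted as the negative sign.
    assert_eq!(parse_both("-20"), (Some(-20), Some(-20.0)));
    // U+2212 MINUS SIGN is currently rejected by both parsers.
    assert_eq!(parse_both("\u{2212}20"), (None, None));
}
```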

I have no problem with U+2052 COMMERCIAL MINUS SIGN (⁒) not gaining support. I just saw it in the Wikipedia article. If it was warranted to support this, which I don't know, adding support for U+2212 MINUS SIGN would be a good time to add support for this also.

would allow people to FromStr something that looks like %20 and get -20

I can't follow you there. Nobody talked about the percent sign. You'd only see U+2052 COMMERCIAL MINUS SIGN (⁒) when it was intended to be used in the minus role (or when having it to do with gibberish).

In general Rust has strived to be Unicode-aware but locale-agnostic

At least supporting U+2212 MINUS SIGN should harmonize with that.

it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical

That's what my issue is about.

I don't know whether ⁒ is used as a sign for negative numbers or only an operator between operands. In the first case, and if it's never set off from the number with a space, maybe support would be warranted.

And yes, many fonts render hyphen-minus identically to U+2212, it's not like there's a law against doing so.

This stood out to me on SoundCloud. The font that they use for remaining play time has a relatively long dash; but it's just a hyphen code-point-wise. But in my perception, my statement holds true for the majority of fonts.

@workingjubilee
Member

Also, your -20 contains a hyphen, which would already be parsed as the number −20, if someone were to paste it somewhere where Rust's FromStr::from_str() would be caused to run.)

Does it? I copied it out of the GitHub UI.

@Enyium
Author

Enyium commented Sep 14, 2024

In diffs, GitHub uses the hyphen (like this code point is also used in code instead of fancy characters); but this text is also in a monospace font. On a page like this, I was referring to the red number on the top right (not in a monospace font).

@workingjubilee
Member

workingjubilee commented Sep 14, 2024

Ah, I see. I suppose I misunderstood, then.

Anyway, the problem with the "commercial minus sign" is that the glyphs that semantically mean commercial minus sign include e.g. △ and ▲ if Wikipedia is to be believed. But I know that above and beyond such meanings, those glyphs definitely have a wide variety of other meanings attributed to them, including in the language which supposedly uses them as commercial minus signs (Japanese).

And Wikipedia goes on to state this about the obelus-like symbol in question:

The symbol is also used in the margins of letters to indicate an enclosure, where the upper point is sometimes replaced with the corresponding number.[1]

The Uralic Phonetic Alphabet uses commercial minus signs to denote borrowed forms of a sound.[1]

In Finland, it is used as a symbol for a correct response (the check mark indicates an incorrect response).[1][5]

So regarding this:

I can't follow you there. Nobody talked about the percent sign. You'd only see U+2052 COMMERCIAL MINUS SIGN (⁒) when it was intended to be used in the minus role (or when having it to do with gibberish).

I, personally, would hesitate to suggest that the Finnish deal in gibberish.

@Enyium
Author

Enyium commented Sep 14, 2024

Okay, it's rather strange that something defined as COMMERCIAL MINUS SIGN is also used in these other manners. So, in the spirit of not supporting in a narrow use case something with such a variety of uses, this code point can be ruled out for support, it seems.

But could I win you over regarding the support for U+2212 MINUS SIGN?

@CAD97
Contributor

CAD97 commented Sep 14, 2024

it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical and often rendered the same

If that's the goal, it might make sense to use the Unicode compatibility equivalence relation between characters. UAX #15 §1.1:

Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.

That sounds like the property you're describing, and means we don't have to determine our own set. Instead, we would essentially parse from the NFKC normalization of the input string.

I didn't check with any implementation, but visually searching UnicodeData.txt (I did not look at the context-sensitive mappings in SpecialCasing.txt) I believe:

  • U+2212 MINUS SIGN is not compatible with U+002D HYPHEN-MINUS
  • U+FE63 SMALL HYPHEN-MINUS and U+FF0D FULLWIDTH HYPHEN-MINUS are compatible with U+002D HYPHEN-MINUS
  • U+207B SUPERSCRIPT MINUS [1] and U+208B SUBSCRIPT MINUS [2] are compatible with U+2212 MINUS SIGN

Although further inspection shows that “smart quotes” aren't considered compatible with "straight quotes" either, so despite my first thought maybe NFKC isn't the correct data to be considering for this purpose after all.
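
The consequence of those mappings can be sketched without a normalization library. The `fold_sign_compat` function below is a hypothetical, hand-copied excerpt covering only the four sign characters listed above; it is not a real NFKC implementation, which would also change many other characters.

```rust
/// Hypothetical excerpt of compatibility mappings for the sign characters
/// discussed above. NOT a real NFKC implementation.
fn fold_sign_compat(c: char) -> char {
    match c {
        // SMALL HYPHEN-MINUS and FULLWIDTH HYPHEN-MINUS map to HYPHEN-MINUS.
        '\u{FE63}' | '\u{FF0D}' => '-',
        // SUPERSCRIPT MINUS and SUBSCRIPT MINUS map to MINUS SIGN.
        '\u{207B}' | '\u{208B}' => '\u{2212}',
        other => other,
    }
}

/// Folds every character, then defers to the standard `FromStr` impl.
fn parse_folded(input: &str) -> Option<i32> {
    let folded: String = input.chars().map(fold_sign_compat).collect();
    folded.parse().ok()
}

fn main() {
    // Fullwidth hyphen-minus folds to U+002D, so this parses.
    assert_eq!(parse_folded("\u{FF0D}20"), Some(-20));
    // U+2212 has no mapping to U+002D, so it still fails after folding.
    assert_eq!(parse_folded("\u{2212}20"), None);
    // Superscript minus folds to U+2212, which fails for the same reason.
    assert_eq!(parse_folded("\u{207B}20"), None);
}
```

Under these assumptions, compatibility folding alone would not make U+2212 parse; the parser would still need to accept U+2212 itself.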


As a minimal bar, ICU does permit changing the character used as the negative affix for their number formatter; asking to recognize alternate negative signs is within the reality of what Unicode recognizes (but the default is still U+002D).

I was, however, unable to locate information on what alternate minus sign affixes are actually in use by locale data, or Unicode information on parsing numbers from text instead of formatting numbers to text. The information probably exists and should be referenced here, but I ran out of time to continue looking for it.

Footnotes

  1. Unicode 1.0 called it SUPERSCRIPT HYPHEN-MINUS

  2. Unicode 1.0 called it SUBSCRIPT HYPHEN-MINUS

@workingjubilee
Member

Okay, it's rather strange that something defined as COMMERCIAL MINUS SIGN is also used in these other manners.

In that regard I cannot help but agree. Human behavior is very strange.

@lolbinarycat lolbinarycat added A-unicode Area: Unicode T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-discussion Category: Discussion or questions that doesn't represent real issues. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Sep 14, 2024
5 participants