Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to_ascii_upper, to_ascii_lower and eq_ignore_ascii_case in std::ascii #8231

Merged
merged 1 commit into from
Aug 7, 2013

Conversation

SimonSapin
Copy link
Contributor

Original pull request: Add str.to_ascii_lower() and str.to_ascii_upper() methods in std::str.

@brson
Copy link
Contributor

brson commented Aug 2, 2013

On first glance this seems fairly specialized to Servo's use cases - CSS and HTML. I'd be curious to know if there are others. We need to be cautious about adding features to both vec and str as they are already vast modules.

@Kimundi
Copy link
Member

Kimundi commented Aug 2, 2013

Also, we already have std::ascii (Which I added because my attempt to add to_ascii_{lower, upper} got rejected because str functions should work for all unicode strings.)

See also #5822

@SimonSapin
Copy link
Contributor Author

I use to_ascii_lower() for ASCII case-insensitive matching, which needs to work on any Unicode strings, not just ASCII:

http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#ascii-case-insensitive
http://dev.w3.org/csswg/css-syntax/#ascii-case-insensitive
http://encoding.spec.whatwg.org/#ascii-case-insensitive

It is gonna be needed in a bunch of different libraries. The options are:

  • Duplicate the code
  • Have it in one of these libraries, with every other (otherwise unrelated) library depend on the first
  • Make a tiny library that everything else depend on
  • Have it in libstd or libextra

I’d really prefer to avoid all but the last of these options.

Would it be better to have these in the std::ascii module rather than std::str? Both functions still have Unicode strings that may contain non-ASCII characters as argument and return value, but they’re kinda ASCII-related.

@Kimundi
Copy link
Member

Kimundi commented Aug 2, 2013

If they go in std, than placing them into std::ascii would probably the best.

@thestinger
Copy link
Contributor

I'm fine with it being implemented in ascii, just not as a string method imported in the prelude.

@SimonSapin
Copy link
Contributor Author

Updated pull request: have functions in std::ascii instead of &str methods in std::str. Also add eq_ascii_lower.

eq_ascii_lower(a, b) is the same as to_ascii_lower(a) == b, but without allocating a new string and copying a. It only makes sense when b is already ASCII lower-case itself. It’s typically used with a literal string for b. I’d like it to also check that b is indeed lower case, but only when testing/debugging. I previously had something guarded by #[cfg(test)] but this indicates whether libstd is being tested, while I want to check if the user’s program is being tested. Suggestions?

Maybe I should not worry about it and replace it with ascii_ieq(a, b) that is the same as ``to_ascii_lower(a) == to_ascii_lower(b)`.

@Kimundi
Copy link
Member

Kimundi commented Aug 3, 2013

Could you unify the UPPER_MAP and LOWER_MAP case folding with the case folding for Ascii?

Both seem to do the same thing, so having two wildly different implementations in the same module seems kinda unnecessary. (As an aside: is there a specific reason why you need a lookup table for ascii here? Isn't case in the ascii range just defined by a single bit?)

Same thing with function naming: Consistency would be nice here too.

Lastly, it would be nice if you could methodify the new functions where applicable (add a StrAsciiExt trait or so).

@SimonSapin
Copy link
Contributor Author

About naming/unification: The Ascii and AsciiStr traits are supposed to contain ASCII data only, as indicated by assert!(self.is_ascii()); thorough the code. This is different: it needs to operate on full Unicode &str strings that can contain non-ASCII characters (which are just unchanged.)

Still, please indicate the naming / kind of interface that you prefer and I can update this pull request.

About implementation: Yes, case in the ASCII range is a single bit. These function can be implemented in at least two ways: with a lookup in a 256-byte vector, or with a bit flip conditioned by two comparisons. I don’t really know which is faster or otherwise better, but I went with the former as it seems to be what CPython and my system’s libc are doing. I could change Ascii.to_upper and Ascii.to_lower to match.

@Kimundi
Copy link
Member

Kimundi commented Aug 3, 2013

Sure, not talking about converting it to Ascii, just using a common naming scheme.
For example eq_ignore_case() for Ascii and ascii_eq_ignore_case() as extension method on &str, rather than ascii_ieq() for the latter. (On a shorter name if you can find one.)

All the Ascii type does is encapsulate the invariant that it is a u8 with the highest bit always unset (== 7 bit data), you can convert between that and u8 with casting where appropiate.

No idea which one is faster, so maybe just use the bitflip one because it is shorter?

@SimonSapin
Copy link
Contributor Author

Another way to implement this is with a match on a range pattern (leaving to the compiler how to implement that.) A micro-benchmark show that the lookup table is 1.2x ~ 2.2x faster than either other solution: https://gist.github.com/SimonSapin/6156068#file-summary

Upadated PR: change Ascii.to_lower and Ascii.to_upper to use the lookup tables, change eq_ascii_lower (equivalent to to_ascii_lower(a) == b) to eq_ascii_ignore_case (equivalent to to_ascii_lower(a) == to_ascii_lower(b).)

@SimonSapin
Copy link
Contributor Author

On noes! The build failed because of foreach. Updated PR for language changes, rebased on master.

bors added a commit that referenced this pull request Aug 6, 2013
Original pull request: Add str.to_ascii_lower() and str.to_ascii_upper() methods in std::str.
@bors bors closed this Aug 7, 2013
@bors bors merged commit 88b21f5 into rust-lang:master Aug 7, 2013
@SimonSapin SimonSapin deleted the ascii-upper-lower-case branch August 7, 2013 10:08
@Kimundi Kimundi mentioned this pull request Sep 9, 2013
flip1995 pushed a commit to flip1995/rust that referenced this pull request Jan 17, 2022
Fix `implicit_clone` for `&&T`

fixes rust-lang#8227

changelog: Don't lint `implicit_clone` on `&&T`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants