Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Stabilize std::{c_str, c_vec} #494

Merged
merged 4 commits into from
Jan 2, 2015
Merged

Conversation

alexcrichton
Copy link
Member

Stabilize the std::{c_str, c_vec} modules by re-working their interfaces and
refocusing each primitive for one particular task. The three broad categories of
interoperating with C will work via:

  1. If you have a Rust string/byte slice which needs to be given to C, then the
    CString type will be used to statically guarantee that a terminating nul
    character and no interior nuls exist.
  2. If C hands you a string which you want to inspect, but not own, then a helper
    function will assist in converting the C string to a byte slice.
  3. If C hands you a string which you want to inspect and own, then a helper type
    will consume ownership and will act as a Box<[u8]> in essence.

Stabilize the `std::{c_str, c_vec}` modules by re-working their interfaces and
refocusing each primitive for one particular task. The three broad categories of
interoperating with C will work via:

1. If you have a Rust string/byte slice which needs to be given to C, then the
   `CString` type will be used to statically guarantee that a terminating nul
   character and no interior nuls exist.

2. If C hands you a string which you want to inspect, but not own, then a helper
   function will assist in converting the C string to a byte slice.

3. If C hands you a string which you want to inspect and own, then a helper type
   will consume ownership and will act as a `Box<[u8]>` in essence.
@alexcrichton
Copy link
Member Author

Note that this is largely an alternative to #435


impl CString {
pub fn from_slice(s: &[u8]) -> CString { /* ... */ }
pub fn from_vec(s: Vec<u8>) -> CString { /* ... */ }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be Option<CString>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose for this to take the route of Path where it panics by default as we expect that to be the overwhelming default and the invariant can always be checked beforehand to prevent a panic.

@petrochenkov
Copy link
Contributor

A couple of comments:

Regarding CString constructors:
If Rust String already meet the requirement "string contains only one NUL and it is in the final position" then there should be no (re)allocations, I think it's an important requirement and it is missed.
from_vec and from_vec_unchecked can be easily corrected to meet this requirement - they just shouldn't panic or push 0 if the original string contains a single zero in the end.
from_slice in its turn always allocate and can't be corrected. Maybe owning CString should go with its non-owning companion CStr - zero terminated slice? It will not be a significant burden since the implementation of wrappers in question is pretty minimal.
Edit: On second thought, if you want to start keeping terminating NUL in your String you'd better turn it into CString immediately. But then you lose UTF-8-ness guarantee and all the convenience methods..

Also, if we are working with CString there is a chance that our function is called from C code, then panics are unacceptable and CString should probably grow some non-panicing versions of its methods in the future.

It would be nice to have a convenience constructor, taking &str instead of &[u8], it seems to be the most common use. As an example:

extern {
    fn c_function(s: *const c_char);
}

fn rust_wrapper(s: &str) {
    // Too verbose, at least as_ptr should work on CString due to Deref
    unsafe { c_function(CString::from_slice(s.as_bytes()).as_ptr()) }
    // A bit nicer
    unsafe { c_function(CString::from_str(s).as_ptr()) }
}

The scenario 3 - "A C type was handed to Rust, and Rust owns it" and LibcDtor in particular are dangerous. You can't reliably free memory allocated in another library - what if used versions of libc differ? Should CVec really encourage this dubious practice?

Also, could you explicitly mention why CString doesn't maintain UTF-8 invariant, unlike String, despite the similar name?
And a nitpick: c_char is always one byte by definition, but byte can be larger than 8 bit.

Everything else seems okay :)

@thestinger
Copy link

@alexcrichton: Calling free on a pointer allocated by a library is not allowed on Windows. I don't think APIs that are fundamentally broken on Windows in most cases and that encourage writing code that's broken there belong in the standard library.

@thestinger
Copy link

I think stabilizing these would be premature. The niche for interacting with a C string API that's not a path is so small that it's nearly non-existent. It's rarely correct because C strings are incompatible with both binary data and modern text. UTF-8 allows inner NUL so it's not possible to convert Rust strings to C strings in general. Features should not be in the standard libraries if there are widespread, correct use cases for them. I think this belongs in a repository outside of the standard library, at least until there is proof that it is widely applicable.

@alexcrichton
Copy link
Member Author

@petrochenkov the major reason that String does not have a terminating 0 is that the views into it, &str, are not guaranteed to have terminating zeroes. As a result we moved away from this (we used to do this!) to not having a terminating nul character.

In terms of panics this is somewhat of an orthogonal issue to CString as it must be handled at the boundary regardless (if there's mixed C/Rust on the stack), so I don't think it's too too relevant to the design of an API such as this.

We generally have been moving away from providing various convenience functions for types like &[u8] and &str in favor of funneling everything into one method, but I imagine that we could do something like implement FromStr or add the method later on if it looks like it's really needed.

I'm also not sure I see what you mean about case 3. Despite it being rare, and despite it having dangers, this case does happen and it would be nice to have facilities for it somehow. Do you think it's rare enough that it doesn't warrant inclusion in the standard library itself?


@thestinger to be clear do you think that the standard library shouldn't provide CVec at all? As in, the third use case I outlined above you consider to be so rare that it's not worth catering to it at this time? It seems that @petrochenkov thinks the same, so it may be good to take the more conservative route for now.

Do you also think that we shouldn't stabilizing anything related to this area of programming? I'm not sure I agree that this is necessarily a niche use case in that one of the major targets for rust is the "embedded programming" case where Rust is talking to other languages via a C-like API. In cases like this transferring strings across boundaries is quite a common task, and just within the standard distribution itself the fallout is quite nontrivial (as you can see in the branch above). Do you have reservations, other than panicking constructors, which lead you to believe these are too unstable for inclusion at this time?

@ghost
Copy link

ghost commented Dec 4, 2014

@alexcrichton Much of CString's use cases could be trivially merged into String with just an ensure_null_terminated(). I'd be surprised if CVec had any users anywhere, it's easy enough to slap a destructor on a tuple struct.

There's definitely need for a typesafe libc/ffi helper/half-unsafe-but-compiles-away wrapper library, but it wouldn't need libstd at all.

@erickt
Copy link

erickt commented Dec 4, 2014

@Jurily: I don't think we could use String::ensure_null_terminated as there are C libraries that take non- utf8 strings, like on windows. Plus it'd be kind of weird to pass a "foo\0bar" rust string to C, where rust sees a string of 7 characters, and C sees a string of 3 characters.

@alexcrichton: A couple things:

  • It's a little odd that c_str::from_raw_buf is a function, but CVec::from_c_str is a method. Can we just use one style?
  • Should we also have c_str::from_raw_buf_len<'a>(raw: &'a *const libc::c_char, len: libc::size_t)?
  • If I'm reading this RFC right, CString is supposed to support loaning a pointer from Rust->C, and CVec is C->Vec. Neither appear to support allocating some memory in Rust, filling it with something, then transferring ownership of that memory to C. Can we support that? We could do it by renaming the current CVec::new to CVec::from_buf_len, adding a LibcCtor trait and a CVec::new(uint) constructor that mallocs the pointer for us, and adding an unsafe CVec::unwrap method that pulls out the pointer and size for us to pass along to C.
  • Should we be asserting that there are no nulls in a string, or returning an Option<CString>?

@petrochenkov
Copy link
Contributor

@alexcrichton

the major reason that String does not have a terminating 0 is that the views into it, &str, are not guaranteed to have terminating zeroes. As a result we moved away from this (we used to do this!) to not having a terminating nul character.

I'm not talking about always keeping terminating 0 in String, like Rust did and C++ does, but about manually pushing 0 to string when it is convenient an later turning String into CString (without allocation) when it is necessary. Personally, I didn't used CString in FFI code, but always used the combination "String push(0) + as_ptr()". "ensure_null_terminated() + as_ptr()", as @Jurily suggests, may be even better.

I suggest to answer the question: How CString will be used in practice?
I see only one common case here - safe Rust wrapper around C function and I can think of two kinds of wrapper interfaces:

fn c_function(s: *const c_char) {}

// Wrapper of 1st kind, does all the NUL-checking/pushing job for you, *may* reallocate, but not necessarily
fn rust_wrapper1(s: Vec<u8>) {
    ensure_null_terminated(&mut s); // This helper function can be standardized
    c_function(s.as_ptr());
}

// Wrapper of 2nd kind, never reallocates, you do all the NUL-checking/pushing yourself
// Null-terminatedness of the argument is stated explicitly in the interface, CStr - null terminated slice
fn rust_wrapper2(s: &CStr) {
    c_function(s.as_ptr());
}

// Obtains CStr from [u8] for passing it to wrapper 2
// These helper functions can be standardized too (under saner names, of course)
fn check_null_terminatedness_with_panic(&[u8]) -> &CStr
fn check_null_terminatedness_with_option(&[u8]) -> Optional<&CStr>
fn check_null_terminatedness_with_unchecked(&[u8]) -> &CStr

Suddenly, there's no CString in the picture.

I'm also not sure I see what you mean about case 3. Despite it being rare, and despite it having dangers, this case does happen and it would be nice to have facilities for it somehow. Do you think it's rare enough that it doesn't warrant inclusion in the standard library itself?

If the main use of CVec is to call libc::free on memory allocated in another library, then it is hardly usable on Windows and shouldn't live in the standart library. (More general RAII helper with custom "deleter" in the vein of unique_resource would be nice, but it's a separate story.)

@thestinger
Copy link

@thestinger to be clear do you think that the standard library shouldn't provide CVec at all? As in, the third use case I outlined above you consider to be so rare that it's not worth catering to it at this time? It seems that @petrochenkov thinks the same, so it may be good to take the more conservative route for now.

I don't think it's worth providing any form of this with memory management. It doesn't have common use cases that are correct and it's not portable. Carrying around a function pointer is much uglier than simply using a new type for the specific use case.

Do you also think that we shouldn't stabilizing anything related to this area of programming? I'm not sure I agree that this is necessarily a niche use case in that one of the major targets for rust is the "embedded programming" case where Rust is talking to other languages via a C-like API.

The most common use case for C strings is for file paths, because \0 inside of them is forbidden. Rust already has types for dealing with these, but that doesn't mean they're compatible with the allocator used by that foreign library. Calling free on data allocated via malloc in another library is not correct in every environment. The simplest example is on Windows where it's absolutely forbidden as there is more than one C runtime library / allocator and statically linking it into dynamic libraries is the norm.

C strings are incompatible with binary data and only support a subset of encodings like UTF-8. They're not correct in general, and it's advisable to do cross-language interactions with a pointer and length anyway.

In cases like this transferring strings across boundaries is quite a common task, and just within the standard distribution itself the fallout is quite nontrivial (as you can see in the branch above). Do you have reservations, other than panicking constructors, which lead you to believe these are too unstable for inclusion at this time?

I've explained why I have reservations about it. It's not correct in general and not portable.

@thestinger
Copy link

Carrying around a function pointer is much uglier than simply using a new type for the specific use case.

Using an unboxed closure would make you end up with a different type for each usage. It's not comparable to custom types.

@alexcrichton
Copy link
Member Author

@Jurily I believe that @erickt pointed out the reason why which is to give a guarantee that there are no interior nul bytes as well as precisely one trailing nul byte, which largely rules out using String as the compatibility layer here.


@erickt we talked in person but I'll respond inline as well:

It's a little odd that c_str::from_raw_buf is a function, but CVec::from_c_str is a method. Can we just use one style?

This is because from_raw_buf does not return a CString and from_c_str returns a CVec. This is mostly following constructor conventions that we have today. I'm leaning towards removing CVec entirely tough so it may become a moot point.

Should we also have c_str::from_raw_buf_len<'a>(raw: &'a *const libc::c_char, len: libc::size_t)?

We should! That's what slice::from_raw_buf is for though :)

If I'm reading this RFC right, CString is supposed to support loaning a pointer from Rust->C, and CVec is C->Vec. Neither appear to support allocating some memory in Rust, filling it with something, then transferring ownership of that memory to C. Can we support that?

Due to the comments on this thread, it sounds like while this is a possible pattern that it's not one the standard library should support at this time. I'm thinking of removing CVec entirely from this proposal (or perhaps leaving the entire module as #[experimental].

Should we be asserting that there are no nulls in a string, or returning an Option<CString>?

I've chosen to assert instead of check as it follows what Path does, it's an easy-to-check invariant, it's clearly documented, and there's a function to bypass all checks entirely.


@petrochenkov I don't quite understand your code snippet you've got there. If you'd like an answer to your question of how CString is used in practice I'd encourage you to read over the patch I prepared (linked in the RFC). I do not believe that we can have a null-terminated slice type and most of your functions are just methods on CString already, so it's unclear to me why you're advocating removal of CString entirely.


@thestinger I think I'm ok removing CVec entirely, but I'd like to clarify your position on CString. From your comments it sounds like you're assuming CString is calling free and malloc which is very specifically removed as part of this RFC. When I mentioned "transferring" earlier I more meant "passing" in that we're passing a Rust string into C (no transference of ownership)

I'm not really sure I understand your stance on "C strings are not correct in general" because they're a fact that we have to live with. Many C libraries do not pass around a pointer length pair but instead consume and pass C strings. I'm also focusing on the "borrowing" aspect where ownership of strings is not crossing boundaries, just the contents for inspection. It is true that a C strings are not UTF-8 nor arbitrary data, and it is also true that you may not wish to write new C apis with a C string-like interface, but it is a fact that many existing libraries do. These libraries need to be able to interoperate with Rust, and the purpose of CString is to provide a way of passing strings from Rust to C (not transferring ownership) and to provide a way of interpreting strings passed to Rust from C (again, not transferring ownership).

@alexcrichton
Copy link
Member Author

I've updated with a revision to remove CVec entirely.

@alexcrichton
Copy link
Member Author

The conversation seems to be peaking here, so I'm going to be looking to merge this soon pending further comments.

cc @wycats, @erickt, @petrochenkov, @thestinger, @mzabaluev

@ghost
Copy link

ghost commented Dec 11, 2014

String cannot contain nulls, and we should revise the docs to reflect that. Do we seriously expect any non-Rust code to produce or consume nulls correctly when the only interoperable language in existence explicitly relies on their absence? UTF-8 gained popularity precisely because it Just Works(tm) with the infinite amount of code already using C strings. Isn't it a good enough indicator that not even rustc can deal with nulls in the source?

It doesn't matter that libstd does what the standard says, a full strlen-allocate-copy-free cycle is unacceptable. People will assume non-null and they will null-terminate String to avoid the overhead. We have inline assembly, you can't stop us.

Rust happily assumes the system locale is UTF-8 anyway, why is this such a big issue?

@alexcrichton
Copy link
Member Author

@Jurily I'm not sure I understand your concerns. The String and str types can indeed contain nulls today as they're considered valid UTF-8, so are you saying that we should revise the implementations to reject this form of interior nul bytes?

Also, I'm not sure what you mean by strlen-allocate-copy-free because this RFC currently removes CVec, and CString has nothing to do with strlen and provides constructors which don't involve any form of allocation.

I do believe the assertion that most String types will not have an interior nul byte (which is why I chose the CString constructors to panic in this case), but can you elaborate on why you think users will null-terminate the String type? This will cause the string to be interpreted differently in Rust as it is in C, which I imagine could be highly confusing.

Can you also explain where locales come into this? The purpose of the CString type is to pass a string into C from Rust. Many C libraries require this, and I'm unaware of how locales come into play here.

@thestinger
Copy link

@Jurily: UTF-8 includes inner NUL and C strings only support a subset of it. Using C strings is not a common use case, because it's rarely correct. It does not work for either binary or text. It only works for domain specific encodings without inner NUL like file paths. Rust is doing the same thing as other languages with real strings like Java, C#, Go, Python, etc.

@thestinger
Copy link

when the only interoperable language in existence explicitly relies on their absence?

It does not rely on their absence. Buffers represented as a pointer and length are very common in C. Even legacy software has support for this because it's a necessity for dealing with binary data. C strings are not interoperable because they can't be mixed with real UTF-8 string implementations. I don't know of a string implementation in a modern language without support for Unicode's inner NUL.

@ghost
Copy link

ghost commented Dec 11, 2014

@alexcrichton Locales are interesting in this case because C strings are not UTF-8 in any standard. Their encoding is defined by the system locale, which is often UTF-8 nowadays, but not required. If we can ignore that complexity in favor of the common case, why not interior nulls?

A String with no interior nulls but null-terminated in len + 1 looks exactly the same to Rust and C, except that C substrings necessarily run to null. We could wrap that invariant with a CSlice<'a>(*const c_char) that would also solve the lifetime and safety issues of raw c_char pointers.

@thestinger I'm not sure why you dismiss the filesystem API as not common. It's not even just paths that are null-terminated strings, so are the command line, envvars and half a century's worth of C. We can't get rid of them without writing our own OS and porting everything. Whether or not it's good design, it exists.

Existing code is already aware of the distinction between text and binary , I don't see how we could unify that even with interior nulls. The "UTF-8 goes here" and "pointer + size" camps don't seem to overlap much. Text takes no size, binary doesn't require valid UTF-8.

The gtk family deals only with null-terminated UTF-8, glib is used by half of the Linux desktop. The C++ std::string::c_str() guarantees null-termination and is UTF-8 depending on system locale. Qt is the only major C++ library I'm aware of that doesn't use std::string, they have QByteArray::constData() which serves the same purpose. QString is UTF-16 internally. BeOS/Haiku uses UTF-8 exclusively, also null-terminated.

Is there a major code base that produces or expects interior nulls in UTF-8 that doesn't also take random binary data?

@thestinger
Copy link

I'm not sure why you dismiss the filesystem API as not common. It's not even just paths that are null-terminated strings, so are the command line, envvars and half a century's worth of C. We can't get rid of them without writing our own OS and porting everything. Whether or not it's good design, it exists.

There is no UTF-8 guarantee for any of those so it has no relevance to Rust's String type. It already has a Path type to cope with paths, and there aren't other common cases. Rust can treat command-line arguments as read-only so using &[u8] works fine. Environment variables need to be dealt with inside of a global lock scope (where &[u8] works) or via copying (Vec<u8> or Box<[u8]>) so those don't present any issue.

Existing code is already aware of the distinction between text and binary , I don't see how we could unify that even with interior nulls. The "UTF-8 goes here" and "pointer + size" camps don't seem to overlap much. Text takes no size, binary doesn't require valid UTF-8.

C string APIs rarely have a UTF-8 guarantee / requirement. They are only able to handle a subset of UTF-8 anyway, so they couldn't be directly mixed with real Unicode strings even if they did.

The gtk family deals only with null-terminated UTF-8

In general it doesn't enforce or require a specific string encoding. It has Unicode manipulation functions but strings throughout GTK/glib aren't guaranteed to be UTF-8.

glib is used by half of the Linux desktop.

This doesn't have much to do with whether text is assumed / guaranteed to be UTF-8.

The C++ std::string::c_str() guarantees null-termination and is UTF-8 depending on system locale.

The std::string type can be used for binary data. It has no encoding guarantees and allows inner NUL. In fact, it's intended to be used that way. It has support for taking a pointer / length instead of a C string in every place where that conversion is possible. I don't think it directly uses the locale for anything.

You have to explicitly use the facet stuff for anything locale dependent, and it's almost entirely useless because it operates on wchar_t which is UCS2 on Windows, not UCS4. The C++11 char16_t / char32_t support did not include this kind of library support.

Qt is the only major C++ library I'm aware of that doesn't use std::string, they have QByteArray::constData() which serves the same purpose

QByteArray allows allows inner NUL and makes no guarantee about encoding.

QString is UTF-16 internally

QString is done with pointer/length just like std::string, QByteArray and other real Unicode string types. UTF-16 isn't compatible with C strings either.

@thestinger
Copy link

Is there a major code base that produces or expects interior nulls in UTF-8 that doesn't also take random binary data?

Yes, there's lot of C code handling UTF-8 text correctly. There's also a huge amount of code treating everything as a binary blob. It's not possible to take views into C strings without dynamic memory allocation so they're unusable in many niches. There are hacks like PATH_MAX but they waste a lot of space - and in the case of PATH_MAX, it's not actually an upper bound! It's possible to create paths larger than PATH_MAX.

I don't really understand what you're trying to argue. Rust is not going to regress from existing modern languages by supporting only a subset of UTF-8. There used to be a NUL at the end of String but it was dropped because it was a major complexity / performance problem for all code. The fact that it can speed up interoperability with legacy code in some cases is nice, but it doesn't make up for the fact that it's adding a byte to all strings and forcing lots of complexity / branching everywhere. In general you can't convert C strings to String anyway. C strings rarely have a UTF-8 guarantee. They can contain everything short of \0 or in the case of paths, /.

@thestinger
Copy link

@alexcrichton: The on-stack conversion optimization will be entirely redundant once there's a small vector type. I don't think an API solely for that purpose should be stabilized.

@thestinger
Copy link

I really don't think this is a problem that the standard library should take on. Calling into C code is fundamentally unsafe, especially if a raw pointer is being passed as a parameter. The most efficient way of doing interoperability with C strings is including it at the end of the string literals or pushing it to the end of Vec<u8> / String as-needed. Copying is always required for taking views, because that's just how C strings are.

Converting a string / vector with an inner NUL is a logic error (unintentional truncation) but I don't think paying the cost of an O(n) search at all of the FFI boundaries is going to be acceptable in general, especially for large texts. It's far better to do the work to maintain that guarantee up-front and avoid paying the cost everywhere.

@mzabaluev
Copy link
Contributor

Some concerns raised in #435 that I think should be noted here as well:

  • libc::c_char is not guaranteed to be one byte wide, furthermore, it's not necessarily signed in C (it's rather accidental that the Rust library defines it as i8 on ARM, for example). If the C string support module is retained in std, dependence on the architecture-dependent type alias may be considered a portability issue. This also effectively perpetuates libc as a public dependency of the standard Rust library - we may not expect that Rust displaces C in our lifetimes, but hey, what is a horizon for future-proof...
  • Lack of a zero-copy path for passing strings into C that were earlier obtained from C. It's not certainly a bad problem in practice, but I expect it might be for performance-sensitive code closely working with C APIs. To be polymorphic and accept Rust strings as well, such code would reinvent a lot of wheel that is currently covered by ToCStr.
  • Need to copy strings obtained (with ownership transfer) from C before they can be stored as keys for hash tables, sorted collections, etc. Note that for these use cases there is no clear advantage in having a pre-computed length, so the original C string buffers can be efficient in implementing Ord, Hash etc. with the current design of CString.

@thestinger
Copy link

@mzabaluev: C does require that sizeof(char) == 1. It doesn't require that a sizeof unit ("byte") is 8 bits, so CHAR_BITS exists. However, POSIX requires that it is 8 bits. Rust already makes a pervasive assumption that it is 8 bits. It's not something that Rust is designed to support, just as it's not capable of dealing with a system where size_t and uintptr_t are different sizes. It's not realistic to support this kind of thing without a drastic increase in complexity because Rust can't just use undefined behaviour as a solution to the problems it creates.

It has nothing to do with future proofing the language. It has to do with legacy systems that C had to support but that Rust is never going to support.

@thestinger
Copy link

Lack of a zero-copy path for passing strings into C that were earlier obtained from C. It's not certainly a bad problem in practice, but I expect it might be for performance-sensitive code closely working with C APIs. To be polymorphic and accept Rust strings as well, such code would reinvent a lot of wheel that is currently covered by ToCStr.

Code using C strings isn't very performance aware anyway. See below.

I do think that an O(n) check for interior nulls on every conversion is way too expensive. I don't think there's a use case for the API proposed here. It forces high performance costs (copies, scans) and doesn't really provide better ergonomics than adding the NUL by hand.

I don't think it belongs in the standard library and all of the disagreeing voices about the design back up that opinion. If there isn't a clear way to do it properly, then it should be left for third party libraries until a time when there's strong consensus on a specific solution.

Need to copy strings obtained (with ownership transfer) from C before they can be stored as keys for hash tables, sorted collections, etc. Note that for these use cases there is no clear advantage in having a pre-computed length, so the original C string buffers can be efficient in implementing Ord, Hash etc. with the current design of CString.

That's not true. A pre-computed length nearly always improves performance relative to a sentinel at the end. It's much friendlier to compiler auto-vectorization and CPU pipelining. C strings are less efficient nearly across the board before you even get to the major issues like inability to do slicing without copies. C strings mandate pervasive dynamic allocation or wasteful over-allocation via oversized static buffers.

@mzabaluev
Copy link
Contributor

I do think that an O(n) check for interior nulls on every conversion is way too expensive.

Isn't it the whole reason for CString in this draft, so that one can assert the C string-ness once during construction and then reuse the value without checking?

I don't think it belongs in the standard library and all of the disagreeing voices about the design back up that opinion.

I totally agree with that.

A pre-computed length nearly always improves performance relative to a sentinel at the end. It's much friendlier to compiler auto-vectorization and CPU pipelining.

Let's specifically consider the two traits I mentioned. Note that one reason for implementing them on CString (as in the current c_str) is to avoid a string copy in cases of owned strings used as collection keys and the like, so working with copied String values already carries the handicap.

  • Ord: a lexicographical byte-by-byte comparison is needed in both cases.
  • Hash: needs an extra strlen pass for a C string (let's not go into feeding the bytes piecemeal into the hasher), but the hash pass will then work over the data primed in cache. When length is pre-computed up front, the two passes may be spread far, reducing the caching benefit.

All things considered, I would not declare conversion to slices, much less to owned copies, a clearly better alternative here without some testing.

And those are basically the only operations I'd like to be directly available on CString. Everything involving the string length should be relegated to converted slices, or at least put behind some interface kink making the computation cost more noticeable (so, no .as_bytes() directly on CString).

C strings mandate pervasive dynamic allocation or wasteful over-allocation via oversized static buffers.

That's one irreducible issue with the design proposed here. The current design of ToCStr allows cutting some corners in the Rust-to-C direction, hence my arguing for keeping it.

@mzabaluev
Copy link
Contributor

@thestinger If the bit width of libc::c_char is fixed at 8, perhaps it should be decreed that it is always defined as i8 in Rust regardless of the signedness of char in the C ABI. Then std can use libc::c_char in public interfaces without a portability hazard.

@thestinger
Copy link

Ord: a lexicographical byte-by-byte comparison is needed in both cases.

It doesn't matter that semantically a byte-by-byte comparison is required. It can and will be auto-vectorized and pipelined. Using a length instead of a sentinel makes the code run faster. It's exactly what I said above:

That's not true. A pre-computed length nearly always improves performance relative to a sentinel at the end. It's much friendlier to compiler auto-vectorization and CPU pipelining. C strings are less efficient nearly across the board before you even get to the major issues like inability to do slicing without copies. C strings mandate pervasive dynamic allocation or wasteful over-allocation via oversized static buffers.

@thestinger
Copy link

Isn't it the whole reason for CString in this draft, so that one can assert the C string-ness once during construction and then reuse the value without checking?

That's not what I'm talking about. The CString type here provides an O(n) conversion from the normal string / vector types to a C string. It's the wrong place to do the check. As I said above:

The most efficient way of doing interoperability with C strings is including it at the end of the string literals or pushing it to the end of Vec / String as-needed. Copying is always required for taking views, because that's just how C strings are.

It doesn't make sense to do an O(n) scan rather than just upholding the guarantee. At most it should check for a trailing \0, and even that is more than you would do if you wrote better code without this API. It's like verifying that the slice is sorted in a binary search function.

@thestinger
Copy link

All things considered, I would not declare conversion to slices, much less to owned copies, a clearly better alternative here without some testing.

I didn't say anything like that.

@thestinger
Copy link

That's one irreducible issue with the design proposed here. The current design of ToCStr allows cutting some corners in the Rust-to-C direction, hence my arguing for keeping it.

Doing it as I suggested fully eliminates the performance cost and eliminates the API surface of the standard library. It doesn't make sense to define reusable interfaces when they're slow and don't provide an improvement over doing it by hand.

@thestinger
Copy link

If the bit width of libc::c_char is fixed at 8, perhaps it should be decreed that it is always defined as i8 in Rust regardless of the signedness of char in the C ABI. Then std can use libc::c_char in public interfaces without a portability hazard.

I don't think requiring that it is 8 bits (like POSIX and Windows) is related to whether or not it should match the platform definition. Rust just won't work on platforms without 8 bit bytes or without either little or big endian byte ordering (PDP endian).

I think it makes sense for the definitions to match how they are defined by the platform. It could be a distinct type rather than an alias but it would be more complicated... and I don't think it's important.

@alexcrichton
Copy link
Member Author

@thestinger

I really don't think this is a problem that the standard library should take on.

While in theory C strings would be a pointer/length pair going across the boundary, in reality there are many many C libraries which use a null-terminated string to interoperate with foreign code. The standard library needs to support this fact of life both for its own purposes as well as for the benefit of users of Rust. One of the main targets for Rust is being embedded into other applications or languages where having strings cross the boundary is largely just a requirement.

You seem to have concerns about the unsafety of this type and operation, but this is all pushed on to the user if necessary (e.g. calling the ffi function at the boundary). The CString type itself is not unsafe, and it's supporting a widespread idiom in C. You seem to also have performance reservations, but this is why there is an unchecked constructor which allows you to bypass the assumptions if you already understand them. The CString type also does not impose any allocations or extra overhead by allowing it to consume ownership of the underlying Vec<u8>.

Converting a string / vector with an inner NUL is a logic error error (unintentional truncation) but I don't think paying the cost of an O(n) search at all of the FFI boundaries is going to be acceptable in general

We may have to disagree on the point about allowing truncation, but this is why there is an unchecked constructor.


@mzabaluev

Lack of a zero-copy path for passing strings into C that were earlier obtained from C.

This is true! I expect this idiom to be fairly rare, however. We could grow support for this over time, but for now I think it's ok to focus solely on the lending C -> Rust and Rust -> C cases, not the more flavorful C -> Rust -> C cases or transferring ownership cases.

Also note that the CString in and functions in the standard library are not the end-all-be-all of APIs. Crates can always evolve their own types to handle these sorts of perhaps more niche use cases.

Need to copy strings obtained (with ownership transfer) from C before they can be stored as keys for hash tables, sorted collections, etc

Note that this RFC is now explicitly not handling transferring ownership across language boundaries, and bytes can always be copied with .to_vec().

@mzabaluev
Copy link
Contributor

The most efficient way of doing interoperability with C strings is including it at the end of the string literals or pushing it to the end of Vec / String as-needed.

To think of it, there could be functions to promote a Vec / String into CString (as in, the Rust-to-C string transfer type) reusing the buffer:

pub fn take_bytes(src: Vec<u8>) -> Option<CString> { /* ... */ }
pub unsafe fn take_bytes_unchecked(src: Vec<u8>) -> CString { /* ... */ }
pub fn take_string(src: String) -> Option<CString> { /* ... */ }
pub unsafe fn take_string_unchecked(src: String) -> CString { /* ... */ }

@mzabaluev
Copy link
Contributor

... or better, generically, a MoveToCStr trait that would be implemented for owned strings, slice references, Path, MaybeOwned, and any other type that could be converted to a C string with some optimization.


* `std::str::from_c_str` - this function should be replaced with
`c_str::from_raw_buf` plus one of `str::from_utf8` or
`str::from_utf8_unchecked`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_raw_buf takes *c_char but from_utf8 produces Vec<u8>. Seems like there is another step here. What is it?

@mzabaluev
Copy link
Contributor

FYI, I have now incorporated many ideas from this RFC and the discussion into #435.

@alexcrichton
Copy link
Member Author

I've also added a commit now indicating that the new functionality will instead be provided under std::ffi, not std::c_string.

@aturon
Copy link
Member

aturon commented Jan 2, 2015

After consideration of this RFC and #435, and the discussions on both, the core team has decided to go with this more conservative design for now. Note that the RFC places the module under a new std::ffi module, where we expect to grow a set of FFI-related tools over time (including the new os_str module). Furthermore, if it turns out that other modes of ownership and destructors are needed, we can gradually migrate toward a richer design like #435.

Thanks everyone for pitching in to the design here, and @mzabaluev for your work on the counter-proposal!

Tracking issue.

@dgrunwald
Copy link
Contributor

I like that the general idea of this design (CString = Vec<u8> + 0-invariant) makes CString very similar to the normal String type (String = Vec<u8> + UTF-8-invariant).

However, the design feels incomplete, and has inconsistent naming in some places (e.g. from_slice should be renamed to from_bytes as it takes &[u8]).

The most problematic use case is a safe FFI wrapper that passes on a string to a C function.
I can do:

fn f(s: &CString) {
   ffi::f(s.as_ptr());
}

but this forces callers to provide a heap-allocated string.

If I instead make my function accept &[u8] or &[c_char], I no longer have the guarantee about 0-termination, so my function becomes unsafe or needs to validate the string.

I feel that, just as String has a corresponding borrowed type &str (= &[u8] + UTF-8-invariant), this new CString needs a corresponding slice type &CStr (=&[u8] + 0-invariant).

I'm currently writing an implementation of CStr as a library type, but I think this type belongs in std::ffi; and that <CString as Deref>::Target should be changed to CStr. Shall I write a new RFC for my proposed changes?

@mzabaluev
Copy link
Contributor

@dgrunwald, have you looked at my project for the rejected RFC #435?

My CStrArg is a type meant for function parameters enforcing a null-termination invariant on string data coming from Rust. It does not borrow, but it has a 'static variant for \0-terminated literals (I'll add macros for constructing the values out of any string or bytestring literals). The main reason for not having a borrowed variant is that a NUL in a Rust slice, aside from the literal case, should result in a conversion error, so any borrowed slice needs to be copied into a Vec for appending. Conversely, you can consume a String or a Vec<u8> to get a CStrArg just at the cost of appending the NUL.

I plan to release the crate as c_str after I have gotten through some weird type system errors with the current nightly.

@dgrunwald
Copy link
Contributor

@mzabaluev: I looked at that, but I don't quite like that approach. You usually shouldn't pass ownership of strings from rust to C or back -- if the C library is statically linked, it might use a different libc::free() than Rust does. On Windows, it's quite normal to have multiple C runtimes in the same process.

Here's my approach for a DST CStr.
I think this is a simple solution, and quite elegant due to the CString/String CStr/str parallels. The only problem is that from_bytes_with_nul_unchecked isn't guaranteed to work due to the undefined rust struct layout. Well, that and the DST ICEs.

@mzabaluev
Copy link
Contributor

@dgrunwald: Sometimes you don't have a choice: a library function hands you an allocated string and tells you to free it with another function, typically provided by the same library. This is what GLib does a lot, for example. To address this, there are generic destructors, and no default destructor on the CString so the developer has to be aware about the deallocation method.

Regarding the DST borrow: as I said, you can't borrow your way out of the fact that Rust strings are not NUL-terminated, and an inner NUL in a dynamically created Rust string cannot be passed to C without causing interpretation problems and potential security exploits. You can append every owned string you are given with a NUL, but then the string becomes useless on the Rust side unless further sliced or truncated back, so you might as well consume it towards the invariant-enforcing value.

@mzabaluev
Copy link
Contributor

@dgrunwald, thanks for prodding me in the right direction. I have now implemented deref/borrow introducing my innovative irrelevantly sized type.

@mzabaluev
Copy link
Contributor

Filed a follow-on RFC PR: #592

@Centril Centril added the A-ffi FFI related proposals. label Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ffi FFI related proposals.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants