How can I convert OsStr to &[u8]/Vec<u8> on Windows?

There is no defined interface for getting the bytes of an OsStr on Windows in Rust 1.16. The actual implementation of OsStr delegates to system-specific code. On *nix, this is a wrapper around a Vec<u8>; on Windows, this is a wrapper around a Wtf8Buf. While Wtf8Buf is implemented with a Vec<u8>, that implementation detail is not exposed. More detail about WTF-8 is available on its website, which includes this quote, emphasis mine:

On Windows (which uses potentially ill-formed UTF-16 in its APIs), the Rust standard library uses WTF-8 internally for OS strings, but **does not expose the WTF-8 byte sequences**.

The "problem" is that on different platforms, there's no unified concept of a "string" when it comes to passing it to an operating system interface. On *nix, usually interfaces accept something almost like UTF-8, except they don't handle embedded NUL values. On Windows, it depends on if you are calling the W or A variant of the API, although the W variant is strongly preferred.

This is made more difficult because libraries may use encodings different from the OS's. This is especially true when a C library written for *nix is used on Windows: it is almost guaranteed to accept a pseudo-UTF-8 string and then perform some sort of lossy transformation to call the right underlying API.

Rust avoids all that by providing the opaque types OsStr and OsString.


If you need to pass an OsStr to a function that accepts UTF-8 data, you must first convert it to a String or &str; then you can get the bytes from that. If you need to pass it to a function that accepts an LPCWSTR, you first need to convert it to a Vec<u16> and then pass a pointer to that buffer to the Windows API. You can see an example of how Rust itself does this.
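As a minimal sketch of the two routes (the function names here are illustrative, not from any library):

```rust
use std::ffi::OsStr;

/// UTF-8 route: succeeds only when the OsStr is valid Unicode.
fn utf8_bytes(s: &OsStr) -> Option<&[u8]> {
    s.to_str().map(str::as_bytes)
}

/// Wide-string route (Windows only): collect the u16 units and append
/// the NUL terminator that LPCWSTR expects, much like the standard
/// library's internal to_u16s helper does.
#[cfg(windows)]
fn to_wide_nul(s: &OsStr) -> Vec<u16> {
    use std::os::windows::ffi::OsStrExt;
    // Real code should reject embedded NULs first, since they would
    // silently truncate the string on the C side.
    s.encode_wide().chain(std::iter::once(0)).collect()
}
```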


The point of OsStr is that its very representation is OS-specific. The implementation is somewhat convoluted for technical reasons (@Shepmaster's answer provides more details), but you can think of it like this:

  • on POSIX systems, OsStr boils down to &[u8], because POSIX functions accept and return byte strings;
  • on Windows, OsStr can be thought of as an &[u16], because Win32 Unicode functions accept and return strings as arrays of 16-bit units (a sketch of both views follows this list).
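Here is that sketch: the standard library exposes each view through a platform extension trait. This is a hedged illustration, not a portability recipe:

```rust
use std::ffi::OsStr;

#[cfg(unix)]
fn inspect(s: &OsStr) {
    use std::os::unix::ffi::OsStrExt;
    // On POSIX this is a free view of the underlying bytes.
    let bytes: &[u8] = s.as_bytes();
    println!("{} bytes", bytes.len());
}

#[cfg(windows)]
fn inspect(s: &OsStr) {
    use std::os::windows::ffi::OsStrExt;
    // On Windows there is no &[u16] view (the storage is WTF-8),
    // so the units are produced by an iterator instead.
    let units: Vec<u16> = s.encode_wide().collect();
    println!("{} u16 units", units.len());
}
```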

Since native Windows APIs accept sequences of 16-bit "wide characters"¹, that is what OsStr is designed to store. While an OsStr could be converted to bytes, inasmuch as anything can be converted to bytes, such a representation is not useful because those bytes would be meaningful neither to the user nor to the system. This is why OsStr does not provide a method to retrieve its contents as bytes on Windows. However, it does provide OsStr::encode_wide(), which iterates over the underlying u16 values and is what you need when calling Win32 APIs. In the other direction, OsString::from_wide() creates an OsString from a slice of u16 values.
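A small demonstration of the round-trip, assuming a Windows target; the 0xD800 value is an unpaired surrogate, legal in a file name but not valid UTF-16:

```rust
#[cfg(windows)]
fn roundtrip() {
    use std::ffi::OsString;
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    // "fo" followed by an unpaired surrogate.
    let units: Vec<u16> = vec![0x0066, 0x006F, 0xD800];
    let s = OsString::from_wide(&units);

    // Not valid Unicode, so a UTF-8 conversion is impossible...
    assert!(s.to_str().is_none());

    // ...but the u16 round-trip is lossless.
    let back: Vec<u16> = s.encode_wide().collect();
    assert_eq!(units, back);
}
```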

It is up to you to decide how your persistence layer will deal with this difference between platforms. Rust's OsStr provides the tools needed to implement the round-trip, but the code will necessarily differ between platforms. For example, serde resolves the difference by effectively treating OsString as enum OsString { Unix(Vec<u8>), Windows(Vec<u16>) }.
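A sketch of that idea (the type and function names here are hypothetical, not serde's actual API):

```rust
use std::ffi::OsString;

// A portable representation in the spirit of serde's approach.
enum PortableOsString {
    Unix(Vec<u8>),
    Windows(Vec<u16>),
}

#[cfg(unix)]
fn to_portable(s: OsString) -> PortableOsString {
    use std::os::unix::ffi::OsStringExt;
    // On POSIX the bytes can be taken directly, without copying.
    PortableOsString::Unix(s.into_vec())
}

#[cfg(windows)]
fn to_portable(s: OsString) -> PortableOsString {
    use std::os::windows::ffi::OsStrExt;
    // On Windows the u16 units must be collected from the iterator.
    PortableOsString::Windows(s.encode_wide().collect())
}
```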


¹ Windows wide character strings are sometimes described as UTF-16 because that is how they are interpreted at a higher level, but this is not correct for all OS strings. A Windows file name can contain u16 values that do not form valid UTF-16, such as unpaired surrogates, and still be usable. This is why it is not possible to represent Windows strings as bytes by, e.g., converting them to UTF-8.

Tags: string, rust