Rust Strings - rFronteddu/general_wiki GitHub Wiki
We discuss strings in the context of collections because strings are implemented as a collection of bytes, plus some methods to provide useful functionality when those bytes are interpreted as text.
What is a String
Rust has only one string type in the core language, which is the string slice str that is usually seen in its borrowed form &str. String slices are references to some UTF-8 encoded string data stored elsewhere. String literals, for example, are stored in the program’s binary and are therefore string slices.
The String type, which is provided by Rust’s standard library rather than coded into the core language, is a growable, mutable, owned, UTF-8 encoded string type. Both String and string slices are UTF-8 encoded.
Creating a New String
Many of the same operations available with Vec are available with String as well because String is actually implemented as a wrapper around a vector of bytes with some extra guarantees, restrictions, and capabilities.
let mut s = String::new();
This line creates a new, empty string called s, into which we can then load data. Often, we’ll have some initial data with which we want to start the string. For that, we use the to_string method, which is available on any type that implements the Display trait, as string literals do.
let data = "initial contents";
let s = data.to_string();
// the method also works on a literal directly:
let s = "initial contents".to_string();
We can also use the function String::from to create a String from a string literal.
let s = String::from("initial contents");
Updating a String
A String can grow in size and its contents can change, just like the contents of a Vec, if you push more data into it. In addition, you can conveniently use the + operator or the format! macro to concatenate String values.
We can grow a String by using the push_str method to append a string slice. The push_str method takes a string slice because we don’t necessarily want to take ownership of the parameter.
let mut s = String::from("foo");
s.push_str("bar");
If the push_str method took ownership of s2, we wouldn’t be able to print its value on the last line. However, this code works as we’d expect!
let mut s1 = String::from("foo");
let s2 = "bar";
s1.push_str(s2);
println!("s2 is {s2}");
The push method takes a single character as a parameter and adds it to the String.
let mut s = String::from("lo");
s.push('l');
Concatenation with the + Operator or the format! Macro
Often, you’ll want to combine two existing strings. One way to do so is to use the + operator
let s1 = String::from("Hello, ");
let s2 = String::from("world!");
let s3 = s1 + &s2; // note s1 has been moved here and can no longer be used
The reason s1 is no longer valid after the addition, and the reason we used a reference to s2, has to do with the signature of the method that’s called when we use the + operator. The + operator uses the add method, whose signature looks something like this:
fn add(self, s: &str) -> String {
In the standard library, you’ll see add defined using generics and associated types. Here, we’ve substituted in concrete types, which is what happens when we call this method with String values. First, s2 has an &, meaning that we’re adding a reference of the second string to the first string. This is because of the s parameter in the add function: we can only add a &str to a String; we can’t add two String values together. But wait—the type of &s2 is &String, not &str, as specified in the second parameter to add. So why does the code above compile? The reason we’re able to use &s2 in the call to add is that the compiler can coerce the &String argument into a &str. When we call the add method, Rust uses a deref coercion, which here turns &s2 into &s2[..]. Because add does not take ownership of the s parameter, s2 will still be a valid String after this operation.
Second, we can see in the signature that add takes ownership of self, because self does not have an &. This means s1 in the code above will be moved into the add call and will no longer be valid after that. So, although let s3 = s1 + &s2; looks like it will copy both strings and create a new one, this statement instead does the following:
- add takes ownership of s1,
- it appends a copy of the contents of s2 to s1,
- and then it returns back ownership of s1.
If s1 has enough capacity for s2, then no memory allocations occur. However, if s1 does not have enough capacity for s2, then s1 will internally make a larger memory allocation to fit both strings.
If we need to concatenate multiple strings, the behavior of the + operator gets unwieldy:
let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");
let s = s1 + "-" + &s2 + "-" + &s3;
For combining strings in more complicated ways, we can instead use the format! macro:
let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");
let s = format!("{s1}-{s2}-{s3}");
The format! macro works like println!, but instead of printing the output to the screen, it returns a String with the contents.
Indexing into Strings
In many other programming languages, accessing individual characters in a string by referencing them by index is a valid and common operation. However, if you try to access parts of a String using indexing syntax in Rust, you’ll get an error.
3 | let h = s1[0];
| ^ string indices are ranges of `usize`
The error and the note tell the story: Rust strings don’t support indexing.
Internal Representation
A String is a wrapper over a Vec. An index into the string’s bytes will not always correlate to a valid Unicode scalar value.
To demonstrate, consider this invalid Rust code:
let hello = "Здравствуйте";
let answer = &hello[0];
When encoded in UTF-8, the first byte of З is 208 and the second is 151, so it would seem that answer should in fact be 208, but 208 is not a valid character on its own. Returning 208 is likely not what a user would want if they asked for the first letter of this string; however, that’s the only data that Rust has at byte index 0. Users generally don’t want the byte value returned, even if the string contains only Latin letters: if &"hello"[0] were valid code that returned the byte value, it would return 104, not h.
The answer, then, is that to avoid returning an unexpected value and causing bugs that might not be discovered immediately, Rust doesn’t compile this code at all and prevents misunderstandings early in the development process.
Bytes and Scalar Values and Grapheme Clusters
Another point about UTF-8 is that there are actually three relevant ways to look at strings from Rust’s perspective: as bytes, scalar values, and grapheme clusters (the closest thing to what we would call letters).
If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is stored as a vector of u8 values that looks like this:
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]
That’s 18 bytes and is how computers ultimately store this data. If we look at them as Unicode scalar values, which are what Rust’s char type is, those bytes look like this:
['न', 'म', 'स', '्', 'त', 'े']
There are six char values here, but the fourth and sixth are not letters: they’re diacritics that don’t make sense on their own. Finally, if we look at them as grapheme clusters, we’d get what a person would call the four letters that make up the Hindi word:
["न", "म", "स्", "ते"]
Rust provides different ways of interpreting the raw string data that computers store so that each program can choose the interpretation it needs, no matter what human language the data is in.
A final reason Rust doesn’t allow us to index into a String to get a character is that indexing operations are expected to always take constant time (O(1)). But it isn’t possible to guarantee that performance with a String, because Rust would have to walk through the contents from the beginning to the index to determine how many valid characters there were.
Slicing Strings
Indexing into a string is often a bad idea because it’s not clear what the return type of the string-indexing operation should be: a byte value, a character, a grapheme cluster, or a string slice. If you really need to use indices to create string slices, therefore, Rust asks you to be more specific.
Rather than indexing using [] with a single number, you can use [] with a range to create a string slice containing particular bytes:
let hello = "Здравствуйте";
let s = &hello[0..4];
Here, s will be a &str that contains the first four bytes of the string. Earlier, we mentioned that each of these characters was two bytes, which means s will be Зд.
If we were to try to slice only part of a character’s bytes with something like &hello[0..1], Rust would panic at runtime in the same way as if an invalid index were accessed in a vector. You should use caution when creating string slices with ranges, because doing so can crash your program.
Methods for iterating over Strings
The best way to operate on pieces of strings is to be explicit about whether you want characters or bytes. For individual Unicode scalar values, use the chars method. Calling chars on “Зд” separates out and returns two values of type char, and you can iterate over the result to access each element:
for c in "Зд".chars() {
println!("{c}");
}
This code will print the following:
З
д
Alternatively, the bytes method returns each raw byte, which might be appropriate for your domain:
for b in "Зд".bytes() {
println!("{b}");
}
This code will print the four bytes that make up this string:
208
151
208
180