When working with C++ strings, it’s essential to understand the difference between std::string and std::wstring, and the character types char and wchar_t on which they are based.
std::string vs. std::wstring
std::string is an instantiation of the std::basic_string template with char, while std::wstring uses wchar_t. The difference between these two types lies in the size and encoding of the characters they hold.
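For illustration, here is a minimal sketch (the variable names are chosen for this example) showing that both types are instantiations of the same std::basic_string template and expose the same interface:

#include <iostream>
#include <string>
#include <type_traits>

int main() {
    std::string  narrow = "hello";   // std::basic_string<char>
    std::wstring wide   = L"hello";  // std::basic_string<wchar_t>

    // Both offer the same member functions; only the character type differs.
    static_assert(std::is_same<std::string,  std::basic_string<char>>::value,
                  "std::string is basic_string<char>");
    static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
                  "std::wstring is basic_string<wchar_t>");

    std::cout << narrow.size() << " " << wide.size() << "\n"; // prints "5 5"
}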
char vs. wchar_t
The char type typically holds an 8-bit character, sufficient for ASCII characters. On the other hand, wchar_t is intended for wide characters. Its size varies by platform: 4 bytes on Linux and 2 bytes on Windows.
Unicode and Character Encoding
Neither char nor wchar_t is directly tied to Unicode, which adds complexity. For instance, on Linux systems such as Ubuntu, char strings are typically encoded in UTF-8 (the default locale on most modern distributions), allowing them to handle Unicode characters seamlessly. This means a std::string on Linux can hold Unicode text, as illustrated in the following code:
#include <cstring>   // strlen
#include <cwchar>    // wcslen
#include <iostream>

int main() {
    // Narrow string: on Linux this is UTF-8, so 'é' occupies two bytes.
    const char text[] = "olé";
    std::cout << "sizeof(char) : " << sizeof(char) << "\n";
    std::cout << "text : " << text << "\n";
    std::cout << "sizeof(text) : " << sizeof(text) << "\n";
    std::cout << "strlen(text) : " << strlen(text) << "\n";
    std::cout << "text(ordinals) :";
    for (size_t i = 0, iMax = strlen(text); i < iMax; ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }
    std::cout << "\n\n";

    // Wide string: wchar_t is 4 bytes on Linux, 2 bytes on Windows.
    const wchar_t wtext[] = L"olé";
    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    // std::cout cannot print a wchar_t string directly; std::wcout can, but
    // mixing narrow and wide output on the same stream is not reliable everywhere.
    std::cout << "wtext : UNABLE TO CONVERT NATIVELY.\n";
    std::wcout << L"wtext : " << wtext << L"\n";
    std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext) : " << wcslen(wtext) << "\n";
    std::cout << "wtext(ordinals) :";
    for (size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i) {
        // Cast through a type wide enough for any wchar_t value to avoid truncation.
        std::cout << " " << static_cast<unsigned long>(wtext[i]);
    }
    std::cout << "\n\n";
}
The output demonstrates that std::string on Linux can hold UTF-8 encoded Unicode text. Note that strlen reports 4 rather than 3: it counts bytes, not characters, and 'é' occupies two bytes in UTF-8.
Windows Encoding
Windows handles encoding differently. Legacy applications use char strings with various code pages, not necessarily UTF-8, while Unicode-aware applications use wchar_t strings encoded in UTF-16. Therefore, std::wstring is the more appropriate choice for Unicode on Windows, though conversions between char and wchar_t strings are often necessary.
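As a hedged sketch of such a conversion, the following uses the Win32 MultiByteToWideChar function to turn a UTF-8 std::string into a UTF-16 std::wstring; the helper name utf8_to_wide is chosen here for illustration and is not part of any library:

#include <windows.h>
#include <string>

// Convert a UTF-8 encoded std::string to a UTF-16 std::wstring on Windows.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    // First call: query the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(static_cast<size_t>(len), L'\0');
    // Second call: perform the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

WideCharToMultiByte performs the reverse conversion in the same two-call pattern.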
Memory Considerations
UTF-32 always uses 4 bytes per code point, while UTF-8 and UTF-16 are more memory-efficient for most text. UTF-8 usually uses less memory than UTF-16 for Western languages but can use more for others, such as Chinese or Japanese.
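As an illustration of these size differences, the following sketch compares how many bytes the string "olé" from the earlier example occupies in each encoding (assuming a UTF-8 source and execution character set for the narrow literal, as on Linux; the u"" and U"" literals require C++11):

#include <iostream>

int main() {
    const char     utf8[]  = "olé";   // UTF-8:  'o' and 'l' take 1 byte each, 'é' takes 2
    const char16_t utf16[] = u"olé";  // UTF-16: one 2-byte code unit per character here
    const char32_t utf32[] = U"olé";  // UTF-32: one 4-byte code point per character

    // Subtract the terminating null character from each array size.
    std::cout << "UTF-8  bytes: " << sizeof(utf8)  - sizeof(char)     << "\n"; // 4
    std::cout << "UTF-16 bytes: " << sizeof(utf16) - sizeof(char16_t) << "\n"; // 6
    std::cout << "UTF-32 bytes: " << sizeof(utf32) - sizeof(char32_t) << "\n"; // 12
}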
Conclusion
Choosing between std::string and std::wstring depends on the platform:
- On Linux, prefer std::string due to native UTF-8 support.
- On Windows, prefer std::wstring for Unicode applications.
For cross-platform code, the choice depends on the toolkit or framework used. Understanding these differences ensures efficient and correct handling of text in C++ applications.
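One common pattern, shown here only as a hedged sketch (the alias name native_string is not standard and is chosen for this example), is to select the platform's preferred string type with a conditional alias and keep conversions at the boundaries:

#include <string>

#if defined(_WIN32)
// Windows: wide strings, UTF-16.
using native_string = std::wstring;
#else
// Linux and most Unix-like systems: narrow strings, UTF-8.
using native_string = std::string;
#endif

int main() {
    native_string greeting(3, 'x');  // same interface on both platforms
    return static_cast<int>(greeting.size()) - 3;  // returns 0
}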