diff --git a/language/types/string.xml b/language/types/string.xml index f2745ab1ae..76d853ed36 100644 --- a/language/types/string.xml +++ b/language/types/string.xml @@ -8,8 +8,8 @@ A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support. See - utf8_encode and utf8_decode for some - basic Unicode functionality. + details of the string + type. @@ -989,6 +989,112 @@ echo "\$foo==$foo; type is " . gettype ($foo) . "
\n"; + + + + Details of the String Type + + + The string in PHP is implemented as an array of bytes and an + integer indicating the length of the buffer. It has no information about how + those bytes translate to characters, leaving that task to the programmer. + There are no limitations on the values the string can be composed of; in + particular, bytes with value 0 (“NUL bytes”) are allowed + anywhere in the string (however, a few functions, said in this manual not to + be “binary safe”, may hand off the strings to libraries that ignore data + after a NUL byte). + + + This nature of the string type explains why there is no separate “byte” type + in PHP – strings take this role. Functions that return non-textual data – for + instance, arbitrary data read from a network socket – will still return + strings. + + + Given that PHP does not dictate a specific encoding for strings, one might + wonder how string literals are encoded. For instance, is the string + "á" equivalent to "\xE1" (ISO-8859-1), + "\xC3\xA1" (UTF-8, C form), + "\x61\xCC\x81" (UTF-8, D form) or any other possible + representation? The answer is that the string will be encoded in whatever fashion + it is encoded in the script file. Thus, if the script is written in + ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, + this does not apply if Zend Multibyte is enabled; in that case, the script + may be written in an arbitrary encoding (which is explicitly declared or is + detected) and then converted to a certain internal encoding, which is then + the encoding that will be used for the string literals. + Note that there are some constraints on the encoding of the script (or on the + internal encoding, should Zend Multibyte be enabled) – this almost always + means that this encoding should be a compatible superset of ASCII, such as + UTF-8 or ISO-8859-1. 
Note, however, that state-dependent encodings where + the same byte values can be used in initial and non-initial shift states + may be problematic. + + + Of course, in order to be useful, functions that operate on text may have to + make some assumptions about how the string is encoded. Unfortunately, there + is much variation on this matter throughout PHP’s functions: + + + + + Some functions assume that the string is encoded in some (any) single-byte + encoding, but they do not need to interpret those bytes as specific + characters. This is the case of, for instance, substr, + strpos, strlen or + strcmp. Another way to think of these functions is + that they operate on memory buffers, i.e., they work with bytes and byte + offsets. + + + + + Other functions are passed the encoding of the string, and possibly also + assume a default if no such information is given. This is the case of + htmlentities and the majority of the + functions in the mbstring extension. + + + + + Others use the current locale (see setlocale), but + operate byte-by-byte. This is the case of strcasecmp, + strtoupper and ucfirst. + This means they can be used only with single-byte encodings, as long as + the encoding is matched by the locale. For instance, + strtoupper("á") may return "Á" if the + locale is correctly set and á is encoded with a single + byte. If it is encoded in UTF-8, the correct result will not be returned + and the resulting string may or may not be returned corrupted, depending + on the current locale. + + + + + Finally, some functions may just assume the string is using a specific encoding, + usually UTF-8. This is the case of most functions in the + intl extension and in the + PCRE extension + (in the last case, only when the u modifier is used). + Although this is due to their special purpose, the function + utf8_decode assumes a UTF-8 encoding and the + function utf8_encode assumes an ISO-8859-1 encoding. 
+ + + + + + Ultimately, this means writing correct programs using Unicode depends on + carefully avoiding functions that will not work and that most likely will + corrupt the data, and using instead the functions that do behave correctly, + generally from the intl and + mbstring extensions. + However, using functions that can handle Unicode encodings is just the + beginning. No matter the functions the language provides, it is essential to + know the Unicode specification. For instance, a program that assumes there are + only uppercase and lowercase letters is making a wrong assumption. + +
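The byte-oriented behaviour described in the added section can be illustrated with a short PHP sketch. It is not part of the patch. The variable names are hypothetical, and the literals use the same escape sequences as the text so that the result does not depend on the encoding of the script file.

```php
<?php
// Illustrative sketch only; all variable names are hypothetical.
// Escape sequences pin down the exact bytes regardless of file encoding.
$latin1  = "\xE1";         // "á" in ISO-8859-1
$utf8Nfc = "\xC3\xA1";     // "á" in UTF-8, C form (precomposed)
$utf8Nfd = "\x61\xCC\x81"; // "á" in UTF-8, D form ("a" + combining accent)

// strlen() reports the length of the byte buffer, not a character count.
$byteLengths = array(strlen($latin1), strlen($utf8Nfc), strlen($utf8Nfd));

// NUL bytes are ordinary string content.
$withNul   = "a\0b";
$nulLength = strlen($withNul);

// substr() works with byte offsets, so it can split a multi-byte sequence.
$firstByte = substr($utf8Nfc, 0, 1);

// PCRE treats the subject as UTF-8 only when the u modifier is given.
$bytesMatched      = preg_match_all('/./', $utf8Nfc, $matches);
$codePointsMatched = preg_match_all('/./u', $utf8Nfc, $matches);

var_dump($byteLengths, $nulLength, bin2hex($firstByte),
         $bytesMatched, $codePointsMatched);
```

The three literals all render as "á" yet occupy one, two and three bytes, and `substr()` returns the lone lead byte `0xC3`, which is not valid UTF-8 on its own. Only the pattern with the `u` modifier sees the C form as a single unit.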