diff --git a/language/types/string.xml b/language/types/string.xml
index f2745ab1ae..76d853ed36 100644
--- a/language/types/string.xml
+++ b/language/types/string.xml
@@ -8,8 +8,8 @@
A string is series of characters, where a character is
the same as a byte. This means that PHP only supports a 256-character set,
and hence does not offer native Unicode support. See
- utf8_encode and utf8_decode for some
- basic Unicode functionality.
+ details of the string
+ type.
@@ -989,6 +989,112 @@ echo "\$foo==$foo; type is " . gettype ($foo) . " \n";
+
+
+
+ Details of the String Type
+
+
+ The string in PHP is implemented as an array of bytes and an
+ integer indicating the length of the buffer. It has no information about how
+ those bytes translate to characters, leaving that task to the programmer.
+ There are no limitations on the values the string can be composed of; in
+ particular, bytes with value 0 (“NUL bytes”) are allowed
+ anywhere in the string. However, a few functions, described in this manual as
+ not “binary safe”, may hand off the strings to libraries that ignore data
+ after a NUL byte.
+
+
+ This nature of the string type explains why there is no separate “byte” type
+ in PHP – strings take this role. Functions that return non-textual data – for
+ instance, arbitrary data read from a network socket – will still return
+ strings.
+
+
+ Given that PHP does not dictate a specific encoding for strings, one might
+ wonder how string literals are encoded. For instance, is the string
+ "á" equivalent to "\xE1" (ISO-8859-1),
+ "\xC3\xA1" (UTF-8, C form),
+ "\x61\xCC\x81" (UTF-8, D form) or any other possible
+ representation? The answer is that the string will be encoded in whatever fashion
+ it is encoded in the script file. Thus, if the script is written in
+ ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However,
+ this does not apply if Zend Multibyte is enabled; in that case, the script
+ may be written in an arbitrary encoding (which is explicitly declared or is
+ detected) and then converted to a certain internal encoding, which is then
+ the encoding that will be used for the string literals.
+ Note that there are some constraints on the encoding of the script (or on the
+ internal encoding, should Zend Multibyte be enabled) – this almost always
+ means that this encoding should be a compatible superset of ASCII, such as
+ UTF-8 or ISO-8859-1. However, state-dependent encodings in which
+ the same byte values can be used in initial and non-initial shift states
+ may be problematic.
+
+
+ Of course, in order to be useful, functions that operate on text may have to
+ make some assumptions about how the string is encoded. Unfortunately, there
+ is much variation on this matter throughout PHP’s functions:
+
+
+
+
+ Some functions assume that the string is encoded in some (any) single-byte
+ encoding, but they do not need to interpret those bytes as specific
+ characters. This is the case of, for instance, substr,
+ strpos, strlen or
+ strcmp. Another way to think of these functions is
+ that they operate on memory buffers, i.e., they work with bytes and byte
+ offsets.
+
+
+
+
+ Other functions are passed the encoding of the string, and may also
+ assume a default if no such information is given. This is the case of
+ htmlentities and the majority of the
+ functions in the mbstring extension.
+
+
+
+
+ Others use the current locale (see setlocale), but
+ operate byte-by-byte. This is the case of strcasecmp,
+ strtoupper and ucfirst.
+ This means they can be used only with single-byte encodings, as long as
+ the encoding is matched by the locale. For instance,
+ strtoupper("á") may return "Á" if the
+ locale is correctly set and á is encoded with a single
+ byte. If it is encoded in UTF-8, the correct result will not be returned
+ and the resulting string may or may not be returned corrupted, depending
+ on the current locale.
+
+
+
+
+ Finally, some functions just assume the string is using a specific encoding,
+ usually UTF-8. This is the case of most functions in the
+ intl extension and in the
+ PCRE extension
+ (in the latter case, only when the u modifier is used).
+ Although this is due to their special purpose, the function
+ utf8_decode assumes a UTF-8 encoding and the
+ function utf8_encode assumes an ISO-8859-1 encoding.
+
+
+
+
+
+ Ultimately, this means writing correct programs using Unicode depends on
+ carefully avoiding functions that will not work and that most likely will
+ corrupt the data, and instead using the functions that do behave correctly,
+ generally from the intl and
+ mbstring extensions.
+ However, using functions that can handle Unicode encodings is just the
+ beginning. No matter the functions the language provides, it is essential to
+ know the Unicode specification. For instance, a program that assumes there is
+ only uppercase and lowercase is making a wrong assumption.
+
+