Improve utf8_decode() and utf8_encode() documentation

This rewrites the descriptions of both to clarify that they convert 
specifically between ISO-8859-1 and UTF-8, adds a warning about
confusion with Windows-1252, and adds helpful "See also" links to
other character set conversion functions. Additionally, the
behaviour for invalid characters in utf8_decode() was clarified,
and the description of the UTF-8 binary encoding was removed.

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@340506 c90b9560-bf6c-de11-be94-00142212c4b1
Andrea Faulds 2016-10-17 14:31:39 +00:00
parent a9d0248cd3
commit 838941f6cc
2 changed files with 45 additions and 51 deletions

@ -16,9 +16,27 @@
<methodparam><type>string</type><parameter>data</parameter></methodparam>
</methodsynopsis>
<para>
This function decodes <parameter>data</parameter>, assumed to be
<literal>UTF-8</literal> encoded, to <literal>ISO-8859-1</literal>.
This function converts the string <parameter>data</parameter> from the
<literal>UTF-8</literal> encoding to <literal>ISO-8859-1</literal>. Bytes
in the string which are not valid <literal>UTF-8</literal>, and
<literal>UTF-8</literal> characters which do not exist in
<literal>ISO-8859-1</literal> (that is, characters above
<literal>U+00FF</literal>) are replaced with <literal>?</literal>.
</para>
<note>
<para>
Many web pages marked as using the <literal>ISO-8859-1</literal> character
encoding actually use the similar <literal>Windows-1252</literal> encoding,
and web browsers will interpret <literal>ISO-8859-1</literal> web pages as
<literal>Windows-1252</literal>. <literal>Windows-1252</literal> features
additional printable characters, such as the Euro sign
(<literal>€</literal>) and curly quotes (<literal>“</literal>
<literal>”</literal>), instead of certain <literal>ISO-8859-1</literal>
control characters. This function will not convert such
<literal>Windows-1252</literal> characters correctly. Use a different
function if <literal>Windows-1252</literal> conversion is required.
</para>
</note>
</refsect1>
<refsect1 role="parameters">
@ -29,7 +47,7 @@
<term><parameter>data</parameter></term>
<listitem>
<para>
An UTF-8 encoded string.
A UTF-8 encoded string.
</para>
</listitem>
</varlistentry>
@ -48,7 +66,10 @@
&reftitle.seealso;
<para>
<simplelist>
<member><function>utf8_encode</function> (contains an explanation of UTF-8 encoding)</member>
<member><function>utf8_encode</function> - Performs the reverse conversion</member>
<member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
<member><function>iconv</function> - Converts between various character encodings</member>
<member><function>recode_string</function> - Converts between various character encodings</member>
</simplelist>
</para>
</refsect1>
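
The replacement behaviour described in the new utf8_decode() text can be approximated with the Python sketch below. The helper name `utf8_to_latin1` is hypothetical and this is not PHP's implementation; in particular, the number of `?` characters PHP emits for a malformed multi-byte sequence may differ from what Python's decoder produces.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Approximate the documented behaviour of utf8_decode() (hypothetical helper)."""
    # Bytes that are not valid UTF-8 become U+FFFD while decoding ...
    text = data.decode("utf-8", errors="replace")
    # ... and anything ISO-8859-1 cannot represent (U+FFFD included) becomes "?".
    return text.encode("iso-8859-1", errors="replace")


print(utf8_to_latin1("café".encode("utf-8")))  # U+00E9 exists in ISO-8859-1
print(utf8_to_latin1("€".encode("utf-8")))     # U+20AC is above U+00FF
print(utf8_to_latin1(b"\xff"))                 # a byte that is not valid UTF-8
```

The first call keeps the accented letter as a single ISO-8859-1 byte, while the other two inputs collapse to `?`.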

@ -13,53 +13,23 @@
<methodparam><type>string</type><parameter>data</parameter></methodparam>
</methodsynopsis>
<para>
This function encodes the string <parameter>data</parameter> to
<literal>UTF-8</literal>, and returns the encoded version.
<literal>UTF-8</literal> is a standard mechanism used by
<acronym>Unicode</acronym> for encoding <glossterm>wide
character</glossterm> values into a byte stream.
<literal>UTF-8</literal> is transparent to plain <abbrev>ASCII</abbrev>
characters, is self-synchronized (meaning it is possible for a program to
figure out where in the bytestream characters start) and can be used with
normal string comparison functions for sorting and such. PHP encodes
<literal>UTF-8</literal> characters in up to four bytes, like this:
<table>
<title>UTF-8 encoding</title>
<tgroup cols="3">
<thead>
<row>
<entry>bytes</entry>
<entry>bits</entry>
<entry>representation</entry>
</row>
</thead>
<tbody>
<row>
<entry>1</entry>
<entry>7</entry>
<entry>0bbbbbbb</entry>
</row>
<row>
<entry>2</entry>
<entry>11</entry>
<entry>110bbbbb 10bbbbbb</entry>
</row>
<row>
<entry>3</entry>
<entry>16</entry>
<entry>1110bbbb 10bbbbbb 10bbbbbb</entry>
</row>
<row>
<entry>4</entry>
<entry>21</entry>
<entry>11110bbb 10bbbbbb 10bbbbbb 10bbbbbb</entry>
</row>
</tbody>
</tgroup>
</table>
Each <replaceable>b</replaceable> represents a bit that can be
used to store character data.
This function converts the string <parameter>data</parameter> from the
<literal>ISO-8859-1</literal> encoding to <literal>UTF-8</literal>.
</para>
<note>
<para>
Many web pages marked as using the <literal>ISO-8859-1</literal> character
encoding actually use the similar <literal>Windows-1252</literal> encoding,
and web browsers will interpret <literal>ISO-8859-1</literal> web pages as
<literal>Windows-1252</literal>. <literal>Windows-1252</literal> features
additional printable characters, such as the Euro sign
(<literal>€</literal>) and curly quotes (<literal>“</literal>
<literal>”</literal>), instead of certain <literal>ISO-8859-1</literal>
control characters. This function will not convert such
<literal>Windows-1252</literal> characters correctly. Use a different
function if <literal>Windows-1252</literal> conversion is required.
</para>
</note>
</refsect1>
<refsect1 role="parameters">
@ -89,7 +59,10 @@
&reftitle.seealso;
<para>
<simplelist>
<member><function>utf8_decode</function></member>
<member><function>utf8_decode</function> - Performs the reverse conversion</member>
<member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
<member><function>iconv</function> - Converts between various character encodings</member>
<member><function>recode_string</function> - Converts between various character encodings</member>
</simplelist>
</para>
</refsect1>
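
The Windows-1252 note can be illustrated with a short Python sketch. Both helper names are hypothetical, and the code only mirrors the conversions as the documentation describes them, not PHP's internals. It shows that the byte 0x80 becomes a control character when read as ISO-8859-1, but the Euro sign when read as Windows-1252.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Read every byte as an ISO-8859-1 code point, as utf8_encode() is documented to do."""
    return data.decode("iso-8859-1").encode("utf-8")


def cp1252_to_utf8(data: bytes) -> bytes:
    """Same conversion, but reading the input as Windows-1252 instead."""
    return data.decode("cp1252").encode("utf-8")


# 0x80 is the Euro sign in Windows-1252 and a control character in ISO-8859-1.
euro_byte = b"\x80"
print(latin1_to_utf8(euro_byte))  # UTF-8 for U+0080, not for the Euro sign
print(cp1252_to_utf8(euro_byte))  # UTF-8 for U+20AC, the Euro sign
```

Input that really is Windows-1252 therefore needs a converter that knows about that encoding, such as the mb_convert_encoding or iconv functions listed in the "See also" section.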