Improve documentation of string encoding conversion functions

- Move utf8_encode and utf8_decode into the strings chapter, since
  they were moved out of the XML extension in 7.2
- Recommend mb_convert_encoding, iconv, and UConverter::transcode
  when mentioning encoding in passing
- Document UConverter::transcode, based on examination of source
  and upstream ICU docs
- Make the language used more consistent, e.g. "convert" rather
  than "encode"/"decode", "encoding" rather than "charset"

Closes GH-1418.
Rowan Tommins 2022-04-04 11:24:24 +01:00 committed by GitHub
parent 8b0e03372d
commit 99d758bd25
11 changed files with 259 additions and 60 deletions

View file

@ -1401,7 +1401,7 @@ it is inserted with (e.g.) <function xmlns="http://docbook.org/ns/docbook">DOMNo
<emphasis>could</emphasis> be called statically, but would issue an <constant>E_DEPRECATED</constant> error.
As of PHP 8.0.0 calling this method statically throws an <classname>Error</classname> exception</para>'>
<!ENTITY dom.malformederror '<para xmlns="http://docbook.org/ns/docbook">While malformed HTML should load successfully, this function may generate <constant>E_WARNING</constant> errors when it encounters bad markup. <link linkend="function.libxml-use-internal-errors">libxml&apos;s error handling functions</link> may be used to handle these errors.</para>'>
<!ENTITY dom.note.utf8 '<note xmlns="http://docbook.org/ns/docbook"><para>The DOM extension uses UTF-8 encoding. Use <function>utf8_encode</function> and <function>utf8_decode</function> to work with texts in ISO-8859-1 encoding or <link linkend="ref.iconv">iconv</link> for other encodings.</para></note>'>
<!ENTITY dom.note.utf8 '<note xmlns="http://docbook.org/ns/docbook"><para>The DOM extension uses UTF-8 encoding. Use <function>mb_convert_encoding</function>, <methodname>UConverter::transcode</methodname>, or <function>iconv</function> to handle other encodings.</para></note>'>
<!ENTITY dom.note.json '<note xmlns="http://docbook.org/ns/docbook"><para>When using <function>json_encode</function> on a <classname>DOMDocument</classname> object the result will be that of encoding an empty object.</para></note>'>
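As an illustration of the revised note, the following sketch converts ISO-8859-1 input to UTF-8 before handing it to the DOM extension. It assumes the dom and mbstring extensions are loaded; the variable names and the element name `name` are hypothetical and not part of this commit.

```php
<?php
// Assumption: the input arrives as ISO-8859-1 bytes; this is 'Zoë'.
$latin1 = "\x5A\x6F\xEB";

// Convert to UTF-8, the encoding the DOM extension works with.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n"; // 5a6fc3ab

// Build a small document (hypothetical element name) from the UTF-8 text.
$doc = new DOMDocument('1.0', 'UTF-8');
$element = $doc->createElement('name');
$element->appendChild($doc->createTextNode($utf8));
$doc->appendChild($element);

// The text read back from the tree is the same UTF-8 string.
var_dump($doc->documentElement->textContent === $utf8); // bool(true)
```

UConverter::transcode or iconv could be substituted for the conversion step, with the argument order adjusted to match each function's signature.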

View file

@ -3,7 +3,7 @@
<refentry xml:id="function.iconv" xmlns="http://docbook.org/ns/docbook">
<refnamediv>
<refname>iconv</refname>
<refpurpose>Convert string to requested character encoding</refpurpose>
<refpurpose>Convert a string from one character encoding to another</refpurpose>
</refnamediv>
<refsect1 role="description">
@ -15,8 +15,7 @@
<methodparam><type>string</type><parameter>string</parameter></methodparam>
</methodsynopsis>
<para>
Performs a character set conversion on the string
<parameter>string</parameter> from <parameter>from_encoding</parameter>
Converts <parameter>string</parameter> from <parameter>from_encoding</parameter>
to <parameter>to_encoding</parameter>.
</para>
</refsect1>
@ -29,7 +28,7 @@
<term><parameter>from_encoding</parameter></term>
<listitem>
<para>
The input charset.
The current encoding used to interpret <parameter>string</parameter>.
</para>
</listitem>
</varlistentry>
@ -37,14 +36,14 @@
<term><parameter>to_encoding</parameter></term>
<listitem>
<para>
The output charset.
The desired encoding of the result.
</para>
<para>
If you append the string <literal>//TRANSLIT</literal> to
<parameter>to_encoding</parameter> transliteration is activated. This
If the string <literal>//TRANSLIT</literal> is appended to
<parameter>to_encoding</parameter>, then transliteration is activated. This
means that when a character can't be represented in the target charset,
it can be approximated through one or several similarly looking
characters. If you append the string <literal>//IGNORE</literal>,
it may be approximated through one or several similarly looking
characters. If the string <literal>//IGNORE</literal> is appended,
characters that cannot be represented in the target charset are silently
discarded. Otherwise, <constant>E_NOTICE</constant> is generated and the function
will return &false;.
@ -64,7 +63,7 @@
<term><parameter>string</parameter></term>
<listitem>
<para>
The string to be converted.
The &string; to be converted.
</para>
</listitem>
</varlistentry>
@ -75,10 +74,22 @@
<refsect1 role="returnvalues">
&reftitle.returnvalues;
<para>
Returns the converted string&return.falseforfailure;.
Returns the converted string,&return.falseforfailure;.
</para>
</refsect1>
<refsect1 role="notes">
&reftitle.notes;
<note>
<para>
The character encodings and options available depend on the installed implementation
of iconv. If the argument to <parameter>from_encoding</parameter>
or <parameter>to_encoding</parameter> is not supported on the current system, &false;
will be returned.
</para>
</note>
</refsect1>
<refsect1 role="examples">
&reftitle.examples;
<para>
@ -111,7 +122,15 @@ Notice: iconv(): Detected an illegal character in input string in .\iconv-exampl
</para>
</refsect1>
<refsect1 role="seealso">
&reftitle.seealso;
<para>
<simplelist>
<member><function>mb_convert_encoding</function></member>
<member><methodname>UConverter::transcode</methodname></member>
</simplelist>
</para>
</refsect1>
</refentry>
<!-- Keep this comment at the end of the file
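The new note on implementation-dependent behaviour can be sketched as follows. This is a minimal illustration, not part of the commit: the encoding name `NO-SUCH-ENCODING` is a deliberately invalid, hypothetical value, and the results of `//TRANSLIT` and `//IGNORE` are dumped without an expected output because they vary between iconv implementations.

```php
<?php
// 'Zoë' in UTF-8; every character exists in ISO-8859-1.
$utf8 = "\x5A\x6F\xC3\xAB";
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);
echo bin2hex($latin1), "\n"; // 5a6feb

// '€' does not exist in ISO-8859-1. What //TRANSLIT and //IGNORE produce
// depends on the installed iconv implementation, so the results are only
// dumped here, with diagnostics suppressed, rather than assumed.
$euro = "\xE2\x82\xAC";
var_dump(@iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $euro));
var_dump(@iconv('UTF-8', 'ISO-8859-1//IGNORE', $euro));

// Hypothetical encoding name that no implementation should recognise:
// the conversion fails and false is returned.
$result = @iconv('UTF-8', 'NO-SUCH-ENCODING', $utf8);
var_dump($result); // bool(false)
```

Checking the return value with `=== false` is the safer pattern, since an empty string is a legitimate conversion result.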

View file

@ -86,8 +86,6 @@ Array
[19] => xml_parser_free
[20] => xml_parser_set_option
[21] => xml_parser_get_option
[22] => utf8_encode
[23] => utf8_decode
)
]]>
</screen>

View file

@ -3,7 +3,7 @@
<refentry xml:id="uconverter.transcode" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
<refnamediv>
<refname>UConverter::transcode</refname>
<refpurpose>Convert string from one charset to another</refpurpose>
<refpurpose>Convert a string from one character encoding to another</refpurpose>
</refnamediv>
<refsect1 role="description">
@ -16,11 +16,8 @@
<methodparam choice="opt"><type class="union"><type>array</type><type>null</type></type><parameter>options</parameter><initializer>&null;</initializer></methodparam>
</methodsynopsis>
<para>
Converts <parameter>str</parameter> from <parameter>fromEncoding</parameter> to <parameter>toEncoding</parameter>.
</para>
&warn.undocumented.func;
</refsect1>
<refsect1 role="parameters">
@ -30,7 +27,7 @@
<term><parameter>str</parameter></term>
<listitem>
<para>
The &string; to be converted.
</para>
</listitem>
</varlistentry>
@ -38,7 +35,7 @@
<term><parameter>toEncoding</parameter></term>
<listitem>
<para>
The desired encoding of the result.
</para>
</listitem>
</varlistentry>
@ -46,7 +43,7 @@
<term><parameter>fromEncoding</parameter></term>
<listitem>
<para>
The current encoding used to interpret <parameter>str</parameter>.
</para>
</listitem>
</varlistentry>
@ -54,7 +51,15 @@
<term><parameter>options</parameter></term>
<listitem>
<para>
An optional &array;, which may contain the following keys:
<simplelist>
<member>
<literal>'to_subst'</literal> - the substitution character to use
in place of any character of <parameter>str</parameter> which cannot
be encoded in <parameter>toEncoding</parameter>. If specified, it must
represent a single character in the target encoding.
</member>
</simplelist>
</para>
</listitem>
</varlistentry>
@ -64,10 +69,110 @@
<refsect1 role="returnvalues">
&reftitle.returnvalues;
<para>
Returns the converted string&return.falseforfailure;.
</para>
</refsect1>
<refsect1 role="examples">
&reftitle.examples;
<example>
<title>Converting from UTF-8 to UTF-16 and back</title>
<programlisting role="php">
<![CDATA[
<?php
$utf8_string = "\x5A\x6F\xC3\xAB"; // 'Zoë' in UTF-8
$utf16_string = UConverter::transcode($utf8_string, 'UTF-16BE', 'UTF-8');
echo bin2hex($utf16_string), "\n";
$new_utf8_string = UConverter::transcode($utf16_string, 'UTF-8', 'UTF-16BE');
echo bin2hex($new_utf8_string), "\n";
?>
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
005a006f00eb
5a6fc3ab
]]>
</screen>
</example>
<example>
<title>Invalid characters in input</title>
<para>
If the input string contains a sequence of bytes which is not valid in
the encoding specified by <parameter>fromEncoding</parameter>, they are
replaced by Unicode code point U+FFFD (Replacement Character) before
converting to <parameter>toEncoding</parameter>.
</para>
<programlisting role="php">
<![CDATA[
<?php
$invalid_utf8_string = "\xC3"; // incomplete multi-byte UTF-8 sequence
$utf16_string = UConverter::transcode($invalid_utf8_string, 'UTF-16BE', 'UTF-8');
echo bin2hex($utf16_string), "\n";
?>
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
fffd
]]>
</screen>
</example>
<example>
<title>Characters which cannot be encoded</title>
<para>
If the input string contains characters which cannot be represented
in <parameter>toEncoding</parameter>, they are replaced with a single
character. The default character to use depends on the encoding, and
can be controlled using the <literal>'to_subst'</literal> option.
</para>
<programlisting role="php">
<![CDATA[
<?php
$utf8_string = "\xE2\x82\xAC"; // € (Euro Sign) does not exist in ISO 8859-1
// Default replacement in ISO 8859-1 is "\x1A" (Substitute)
$iso8859_1_string = UConverter::transcode($utf8_string, 'ISO-8859-1', 'UTF-8');
echo bin2hex($iso8859_1_string), "\n";
// Specify a replacement of '?' ("\x3F") instead
$iso8859_1_string = UConverter::transcode(
$utf8_string, 'ISO-8859-1', 'UTF-8', ['to_subst' => '?']
);
echo bin2hex($iso8859_1_string), "\n";
// Since ISO 8859-1 cannot map U+FFFD, invalid input is also replaced by to_subst
$invalid_utf8_string = "\xC3"; // incomplete multi-byte UTF-8 sequence
$iso8859_1_string = UConverter::transcode(
$invalid_utf8_string, 'ISO-8859-1', 'UTF-8', ['to_subst' => '?']
);
echo bin2hex($iso8859_1_string), "\n";
?>
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
1a
3f
3f
]]>
</screen>
</example>
</refsect1>
<refsect1 role="seealso">
&reftitle.seealso;
<para>
<simplelist>
<member><function>mb_convert_encoding</function></member>
<member><function>iconv</function></member>
</simplelist>
</para>
</refsect1>
</refentry>
<!-- Keep this comment at the end of the file

View file

@ -3,7 +3,7 @@
<refentry xml:id="function.mb-convert-encoding" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
<refnamediv>
<refname>mb_convert_encoding</refname>
<refpurpose>Convert character encoding</refpurpose>
<refpurpose>Convert a string from one character encoding to another</refpurpose>
</refnamediv>
<refsect1 role="description">
@ -15,9 +15,8 @@
<methodparam choice="opt"><type class="union"><type>array</type><type>string</type><type>null</type></type><parameter>from_encoding</parameter><initializer>&null;</initializer></methodparam>
</methodsynopsis>
<para>
Converts the character encoding of <parameter>string</parameter>
to <parameter>to_encoding</parameter>
from optionally <parameter>from_encoding</parameter>.
Converts <parameter>string</parameter> from <parameter>from_encoding</parameter>,
or the current internal encoding, to <parameter>to_encoding</parameter>.
If <parameter>string</parameter> is an &array;, all its &string; values will be
converted recursively.
</para>
@ -31,7 +30,7 @@
<term><parameter>string</parameter></term>
<listitem>
<para>
The &string; or &array; being encoded.
The &string; or &array; to be converted.
</para>
</listitem>
</varlistentry>
@ -39,7 +38,7 @@
<term><parameter>to_encoding</parameter></term>
<listitem>
<para>
The type of encoding that <parameter>string</parameter> is being converted to.
The desired encoding of the result.
</para>
</listitem>
</varlistentry>
@ -47,15 +46,20 @@
<term><parameter>from_encoding</parameter></term>
<listitem>
<para>
Is specified by character code names before conversion. It is either
an <type>array</type>, or a comma separated enumerated list.
If <parameter>from_encoding</parameter> is not specified, the internal
encoding will be used.
<!-- link to internal encoding info -->
The current encoding used to interpret <parameter>string</parameter>.
Multiple encodings may be specified as an &array; or comma separated
list, in which case the correct encoding will be guessed using the
same algorithm as <function>mb_detect_encoding</function>.
</para>
<para>
See <link linkend="mbstring.supported-encodings">supported
encodings</link>.
If <parameter>from_encoding</parameter> is &null; or not specified, the
<link linkend="ini.mbstring.internal-encoding">mbstring.internal_encoding setting</link>
will be used if set, otherwise the <link linkend="ini.default-charset">default_charset setting</link>.
</para>
<para>
See <link linkend="mbstring.supported-encodings">supported encodings</link>
for valid values of <parameter>to_encoding</parameter>
and <parameter>from_encoding</parameter>.
</para>
</listitem>
</varlistentry>
@ -142,7 +146,7 @@ $str = mb_convert_encoding($str, "UTF-7", "EUC-JP");
/* Auto detect encoding from JIS, eucjp-win, sjis-win, then convert str to UCS-2LE */
$str = mb_convert_encoding($str, "UCS-2LE", "JIS, eucjp-win, sjis-win");
/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
/* If mbstring.language is "Japanese", "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
$str = mb_convert_encoding($str, "EUC-JP", "auto");
?>
]]>
@ -156,6 +160,8 @@ $str = mb_convert_encoding($str, "EUC-JP", "auto");
<para>
<simplelist>
<member><function>mb_detect_order</function></member>
<member><methodname>UConverter::transcode</methodname></member>
<member><function>iconv</function></member>
</simplelist>
</para>
</refsect1>
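The reworded parameter descriptions can be illustrated with a short sketch. It assumes the mbstring extension is loaded, and it changes the internal encoding at runtime purely for demonstration; the array keys are hypothetical.

```php
<?php
$latin1 = "\x5A\x6F\xEB"; // 'Zoë' in ISO-8859-1

// Explicit source encoding.
$explicit = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($explicit), "\n"; // 5a6fc3ab

// With from_encoding omitted, the current internal encoding is used.
$previous = mb_internal_encoding();
mb_internal_encoding('ISO-8859-1');
$implicit = mb_convert_encoding($latin1, 'UTF-8');
echo bin2hex($implicit), "\n"; // 5a6fc3ab
mb_internal_encoding($previous); // restore the earlier setting

// Arrays are converted recursively.
$converted = mb_convert_encoding(
    ['name' => $latin1, 'nested' => [$latin1]],
    'UTF-8',
    'ISO-8859-1'
);
echo bin2hex($converted['nested'][0]), "\n"; // 5a6fc3ab
```

Passing the source encoding explicitly is usually the more predictable choice, because the result then does not depend on configuration.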

View file

@ -83,6 +83,9 @@ echo recode_string("us..flat", "The following character has a diacritical mark:
The GNU Recode documentation of your installation for detailed
instructions about recode requests.
</member>
<member><function>mb_convert_encoding</function></member>
<member><methodname>UConverter::transcode</methodname></member>
<member><function>iconv</function></member>
</simplelist>
</para>
</refsect1>

View file

@ -4,8 +4,8 @@
<refnamediv>
<refname>utf8_decode</refname>
<refpurpose>
Converts a string with ISO-8859-1 characters encoded with UTF-8
to single-byte ISO-8859-1
Converts a string from UTF-8 to ISO-8859-1, replacing invalid or unrepresentable
characters
</refpurpose>
</refnamediv>
@ -20,9 +20,10 @@
<literal>UTF-8</literal> encoding to <literal>ISO-8859-1</literal>. Bytes
in the string which are not valid <literal>UTF-8</literal>, and
<literal>UTF-8</literal> characters which do not exist in
<literal>ISO-8859-1</literal> (that is, characters above
<literal>ISO-8859-1</literal> (that is, code points above
<literal>U+00FF</literal>) are replaced with <literal>?</literal>.
</para>
<note>
<para>
Many web pages marked as using the <literal>ISO-8859-1</literal> character
@ -62,6 +63,42 @@
</para>
</refsect1>
<refsect1 role="examples">
&reftitle.examples;
<example>
<title>Basic examples</title>
<programlisting role="php">
<![CDATA[
<?php
// Convert the string 'Zoë' from UTF-8 to ISO 8859-1
$utf8_string = "\x5A\x6F\xC3\xAB";
$iso8859_1_string = utf8_decode($utf8_string);
echo bin2hex($iso8859_1_string), "\n";
// Invalid UTF-8 sequences are replaced with '?'
$invalid_utf8_string = "\xC3";
$iso8859_1_string = utf8_decode($invalid_utf8_string);
var_dump($iso8859_1_string);
// Characters which don't exist in ISO 8859-1, such as
// '€' (Euro Sign) are also replaced with '?'
$utf8_string = "\xE2\x82\xAC";
$iso8859_1_string = utf8_decode($utf8_string);
var_dump($iso8859_1_string);
?>
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
5a6feb
string(1) "?"
string(1) "?"
]]>
</screen>
</example>
</refsect1>
<refsect1 role="changelog">
&reftitle.changelog;
<para>
@ -77,8 +114,8 @@
<row>
<entry>7.2.0</entry>
<entry>
This function has been moved to the core of PHP, and therefore lifting the requirement
on the XML extension for this function to be available.
This function has been moved from the XML extension to the core of PHP.
In previous versions, it was only available if the XML extension was installed.
</entry>
</row>
</tbody>
@ -91,10 +128,10 @@
&reftitle.seealso;
<para>
<simplelist>
<member><function>utf8_encode</function> - Performs the reverse conversion</member>
<member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
<member><function>iconv</function> - Converts between various character encodings</member>
<member><function>recode_string</function> - Converts between various character encodings</member>
<member><function>utf8_encode</function></member>
<member><function>mb_convert_encoding</function></member>
<member><methodname>UConverter::transcode</methodname></member>
<member><function>iconv</function></member>
</simplelist>
</para>
</refsect1>

View file

@ -3,7 +3,7 @@
<refentry xmlns="http://docbook.org/ns/docbook" xml:id="function.utf8-encode">
<refnamediv>
<refname>utf8_encode</refname>
<refpurpose>Encodes an ISO-8859-1 string to UTF-8</refpurpose>
<refpurpose>Converts a string from ISO-8859-1 to UTF-8</refpurpose>
</refnamediv>
<refsect1 role="description">
@ -16,7 +16,15 @@
This function converts the string <parameter>string</parameter> from the
<literal>ISO-8859-1</literal> encoding to <literal>UTF-8</literal>.
</para>
<note>
<para>
This function does not attempt to guess the current encoding of the provided
string; it assumes it is encoded as ISO-8859-1 (also known as "Latin 1")
and converts to UTF-8. Since every sequence of bytes is a valid ISO-8859-1
string, this never results in an error, but will not result in a useful string
if a different encoding was intended.
</para>
<para>
Many web pages marked as using the <literal>ISO-8859-1</literal> character
encoding actually use the similar <literal>Windows-1252</literal> encoding,
@ -55,6 +63,29 @@
</para>
</refsect1>
<refsect1 role="examples">
&reftitle.examples;
<example>
<title>Basic example</title>
<programlisting role="php">
<![CDATA[
<?php
// Convert the string 'Zoë' from ISO 8859-1 to UTF-8
$iso8859_1_string = "\x5A\x6F\xEB";
$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
5a6fc3ab
]]>
</screen>
</example>
</refsect1>
<refsect1 role="changelog">
&reftitle.changelog;
<para>
@ -70,8 +101,8 @@
<row>
<entry>7.2.0</entry>
<entry>
This function has been moved to the core of PHP, and therefore lifting the requirement
on the XML extension for this function to be available.
This function has been moved from the XML extension to the core of PHP.
In previous versions, it was only available if the XML extension was installed.
</entry>
</row>
</tbody>
@ -84,10 +115,10 @@
&reftitle.seealso;
<para>
<simplelist>
<member><function>utf8_decode</function> - Performs the reverse conversion</member>
<member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
<member><function>iconv</function> - Converts between various character encodings</member>
<member><function>recode_string</function> - Converts between various character encodings</member>
<member><function>utf8_decode</function></member>
<member><function>mb_convert_encoding</function></member>
<member><methodname>UConverter::transcode</methodname></member>
<member><function>iconv</function></member>
</simplelist>
</para>
</refsect1>
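The new note's warning about input in an unexpected encoding can be sketched as below. This is an illustration rather than part of the commit: the helper `latin1_to_utf8` is hypothetical, and it uses mb_convert_encoding with an explicit ISO-8859-1 source as a stand-in for utf8_encode, on the assumption that the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: performs the same ISO-8859-1 to UTF-8 conversion
// that utf8_encode() is documented to perform.
function latin1_to_utf8(string $string): string
{
    return mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}

$utf8 = "\x5A\x6F\xC3\xAB"; // 'Zoë', already encoded as UTF-8

// No error occurs, but each of the two bytes of 'ë' is treated as a
// separate ISO-8859-1 character and converted on its own.
$twice = latin1_to_utf8($utf8);
echo bin2hex($twice), "\n"; // 5a6fc383c2ab

// Converting back from UTF-8 to ISO-8859-1 recovers the original bytes.
$restored = mb_convert_encoding($twice, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)
```

The doubled output is valid UTF-8, which is why this kind of mistake tends to go unnoticed until the text is displayed.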

View file

@ -101,6 +101,8 @@
<function name="trim" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="ucfirst" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="ucwords" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="utf8_decode" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="utf8_encode" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="vfprintf" from="PHP 5, PHP 7, PHP 8"/>
<function name="vprintf" from="PHP 4 &gt;= 4.1.0, PHP 5, PHP 7, PHP 8"/>
<function name="vsprintf" from="PHP 4 &gt;= 4.1.0, PHP 5, PHP 7, PHP 8"/>

View file

@ -66,9 +66,9 @@ echo $packet;
<note>
<para>
If you want to serialize non-ASCII characters you have to convert
your data to UTF-8 first (see <function>utf8_encode</function> and
<function>iconv</function>).
Strings should be encoded in UTF-8; to handle other encodings, convert
the string first using <function>mb_convert_encoding</function>,
<methodname>UConverter::transcode</methodname>, or <function>iconv</function>.
</para>
</note>
</section>

View file

@ -4,8 +4,6 @@
Do NOT translate this file
-->
<versions>
<function name="utf8_decode" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="utf8_encode" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="xml_error_string" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="xml_get_current_byte_index" from="PHP 4, PHP 5, PHP 7, PHP 8"/>
<function name="xml_get_current_column_number" from="PHP 4, PHP 5, PHP 7, PHP 8"/>