=head1 Name

Catalyst::UTF8 - All About UTF8 and Catalyst Encoding

=head1 Description

Starting in 5.90080 L<Catalyst> will enable UTF8 encoding by default for
text like body responses.  In addition we've made a ton of fixes around encoding
and utf8 scattered throughout the codebase.  This document attempts to give
an overview of the assumptions and practices that L<Catalyst> uses when
dealing with UTF8 and encoding issues.  You should also review the
Changes file, L<Catalyst::Delta> and L<Catalyst::Upgrading> for more.

We attempt to describe all relevant processes, give some advice,
and explain where we have made exceptions in order to respect our commitment
to backwards compatibility.

=head1 UTF8 in Controller Actions

Using UTF8 characters in your Controller classes and actions.

=head2 Summary

In this section we will review changes to how UTF8 characters can be used in
controller actions, how they look in the debugging screens (and your logs),
and how you construct L<URI> objects to actions with UTF8 paths
(or using UTF8 args or captures).

=head2 Unicode in Controllers and URLs

    package MyApp::Controller::Root;

    use utf8;
    use base 'Catalyst::Controller';

    sub heart_with_arg :Path('♥') Args(1)  {
      my ($self, $c, $arg) = @_;
    }

    sub base :Chained('/') CaptureArgs(0) {
      my ($self, $c) = @_;
    }

      sub capture :Chained('base') PathPart('♥') CaptureArgs(1) {
        my ($self, $c, $capture) = @_;
      }

        sub arg :Chained('capture') PathPart('♥') Args(1) {
          my ($self, $c, $arg) = @_;
        }

=head2 Discussion

In the example controller above we have constructed two matchable URL routes:

    http://localhost/root/♥/{arg}
    http://localhost/base/♥/{capture}/♥/{arg}

The first one is a classic Path type action and the second uses Chaining, and
spans three actions in total.  As you can see, you can use unicode characters
in your Path and PathPart attributes (remember to use the C<utf8> pragma to allow
these multibyte characters in your source).  The two constructed matchable routes
would match the following incoming URLs:

    (heart_with_arg) -> http://localhost/root/%E2%99%A5/{arg}
    (base/capture/arg) -> http://localhost/base/%E2%99%A5/{capture}/%E2%99%A5/{arg}

The path part C<%E2%99%A5> is URL encoded unicode (assuming you are hitting this with
a reasonably modern browser).  It's basically what goes over HTTP when you type a
browser location that has the unicode 'heart' in it.  However we will use the unicode
symbol in your debugging messages:

    [debug] Loaded Path actions:
    .-------------------------------------+--------------------------------------.
    | Path                                | Private                              |
    +-------------------------------------+--------------------------------------+
    | /root/♥/*                           | /root/heart_with_arg                 |
    '-------------------------------------+--------------------------------------'

    [debug] Loaded Chained actions:
    .-------------------------------------+--------------------------------------.
    | Path Spec                           | Private                              |
    +-------------------------------------+--------------------------------------+
    | /base/♥/*/♥/*                       | /root/base (0)                       |
    |                                     | -> /root/capture (1)                 |
    |                                     | => /root/arg                         |
    '-------------------------------------+--------------------------------------'

And if the requested URL uses unicode characters in your captures or args (such as
C<http://localhost/base/♥/♥/♥/♥>) you should see the arguments and captures as their
unicode characters as well:

    [debug] Arguments are "♥"
    [debug] "GET" request for "base/♥/♥/♥/♥" from "127.0.0.1"
    .------------------------------------------------------------+-----------.
    | Action                                                     | Time      |
    +------------------------------------------------------------+-----------+
    | /root/base                                                 | 0.000080s |
    | /root/capture                                              | 0.000075s |
    | /root/arg                                                  | 0.000755s |
    '------------------------------------------------------------+-----------'

Again, remember that we are displaying the unicode character and using it to match actions
containing such multibyte characters BUT over HTTP you are getting these as URL encoded
bytes.  For example if you looked at the L<PSGI> C<$env> value for C<REQUEST_URI> you
would see (for the above request):

    REQUEST_URI => "/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5"

So on the incoming request we decode both the URL encoding and the UTF8 so that we can match
and display unicode characters.  This makes it straightforward to use these types of
multibyte characters in your actions and see them incoming in captures and arguments.  Please
keep this in mind if you are doing, for example, regular expression matching, length determination
or other string comparisons: you will need to treat these incoming variables as decoded
character strings, not raw bytes.  For example in the following action:

        sub arg :Chained('capture') PathPart('♥') Args(1) {
          my ($self, $c, $arg) = @_;
        }

when C<$arg> is "♥" you should expect C<length($arg)> to be C<1> since it is indeed one character,
although it will take more than one byte to store.
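
The difference between byte length and character length can be sketched outside of
L<Catalyst> using only the core L<Encode> module.  This is an illustration of the idea,
not a description of how L<Catalyst> implements it internally; the variable names are
made up for the example.

    use strict;
    use warnings;
    use Encode ();

    # One path segment as it arrives over HTTP: percent encoded UTF8 bytes.
    my $raw = '%E2%99%A5';

    # Undo the percent encoding; the result is still a string of bytes.
    (my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # Decode the UTF8 bytes into a character string.  LEAVE_SRC keeps
    # $bytes intact so that both lengths can be compared afterwards.
    my $chars = Encode::decode('UTF-8', $bytes,
        Encode::FB_CROAK | Encode::LEAVE_SRC);

    printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
    # prints "bytes: 3, characters: 1"

The value your action receives corresponds to C<$chars> here, which is why
C<length($arg)> reports characters rather than bytes.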

=head2 UTF8 in constructing URLs via $c->uri_for

For the reverse (constructing meaningful URLs to actions that contain multibyte characters
in their paths or path parts, or when you want to include such characters in your captures
or arguments) L<Catalyst> will do the right thing (again just remember to use the C<utf8>
pragma).

    use utf8;
    my $url = $c->uri_for( $c->controller('Root')->action_for('arg'), ['♥','♥']);

When you stringify this object (for use in a template, for example) it will automatically
do the right thing regarding utf8 encoding and url encoding.

    http://localhost/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5

Again, what you want here is a properly URL encoded version of the path.  In this case your string
length will reflect the URL encoded bytes, not the character length, because ultimately what you
send over the wire via HTTP needs to be bytes.
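
The two steps involved (UTF8 encoding, then URL encoding) can be sketched by hand with the
core L<Encode> module.  This is a simplified illustration under the assumption that only the
URL "unreserved" characters are left unescaped; it is not the code path that C<uri_for>
itself uses, and the variable names are invented for the example.

    use strict;
    use warnings;
    use Encode ();

    my $segment = "\x{2665}";    # the unicode heart, a single character

    # Characters must first become UTF8 bytes ...
    my $bytes = Encode::encode('UTF-8', $segment);

    # ... and then every byte outside the unreserved set is percent encoded.
    (my $escaped = $bytes) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;

    print "$escaped\n";              # prints "%E2%99%A5"
    print length($escaped), "\n";    # prints "9"

One character becomes three bytes and then nine ASCII characters on the wire, which is
the length you will observe on the stringified URL.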

=head1 UTF8 in GET Query and Form POST

What Catalyst does with UTF8 in your GET and classic HTML Form POST

=head2 UTF8 in URL query and keywords

The same rules that we find in URL paths also cover URL query parts.  That is,
if one types a URL like this into the browser:

    http://localhost/example?♥=♥♥

then when it goes 'over the wire' to your application server it's going to be sent as
percent encoded bytes:

    http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5

When L<Catalyst> encounters this we decode the percent encoding and the utf8
so that we can properly display this information (such as in the debugging
logs or in a response.)

    [debug] Query Parameters are:
    .-------------------------------------+--------------------------------------.
    | Parameter                           | Value                                |
    +-------------------------------------+--------------------------------------+
    | ♥                                   | ♥♥                                   |
    '-------------------------------------+--------------------------------------'

All the values and keys that are part of $c->req->query_parameters will be
utf8 decoded.  So you should not need to do anything special to take those
values/keys and send them to the body response (since as we will see later
L<Catalyst> will do all the necessary encoding for you).

Again, remember that the values of your parameters are now decoded into Unicode strings, so
for example you'd expect the result of C<length> to reflect the character length, not
the byte length.

Just like with arguments and captures, you can use utf8 literals (or utf8
strings) in $c->uri_for:

    use utf8;
    my $url = $c->uri_for( $c->controller('Root')->action_for('example'), {'♥' => '♥♥'});

When you stringify this object (for use in a template, for example) it will automatically
do the right thing regarding utf8 encoding and url encoding.

    http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5

Again, what you want is a properly URL encoded version of this, since ultimately what you
send over the wire via HTTP needs to be bytes (not unicode characters).

Remember if you use any utf8 literals in your source code, you should use the
C<use utf8> pragma.

B<NOTE:> Assuming UTF-8 in your query parameters and keywords may be an issue if you have
legacy code where you created URLs in templates manually and used an encoding other than UTF-8.
In these cases you may find that Catalyst 5.90080 and later will decode them incorrectly.  For
backwards compatibility we offer three configuration settings, here described in order of
precedence:

C<do_not_decode_query>

If true, then do not try to character decode any wide characters in your
request URL query or keywords.  You will need to handle this manually in your action code
(although if you choose this setting, chances are you already do this).

C<default_query_encoding>

This setting allows one to specify a fixed value for how to decode your query, instead of using
the default, UTF-8.

C<decode_query_using_global_encoding>

If this is true we decode using whatever you set C<encoding> to.
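
As a rough sketch, these settings go into the application configuration like any other.
Only the three key names come from the text above; the application name C<MyApp> and the
C<ISO-8859-1> value are assumptions for illustration, and you would normally enable just
one of them.

    package MyApp;
    use Catalyst;

    __PACKAGE__->config(
        # Highest precedence: leave the query string undecoded.
        # do_not_decode_query => 1,

        # Otherwise decode the query with a fixed legacy encoding
        # (the encoding name here is only an example value).
        default_query_encoding => 'ISO-8859-1',

        # Lowest precedence: reuse whatever 'encoding' is set to.
        # decode_query_using_global_encoding => 1,
    );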

=head2 UTF8 in Form POST

In general most modern browsers will follow the specification, which says that POSTed
form fields should be encoded in the same way that the document was served with.  That means
that if you are using modern Catalyst and serving UTF8 encoded responses, a browser is
supposed to notice that and encode the form POSTs accordingly.

As a result, since L<Catalyst> now serves UTF8 encoded responses by default,
you can mostly rely on incoming form POSTs to be so encoded.  L<Catalyst> will make this
assumption and decode accordingly (unless you explicitly turn off encoding).  If you are
running Catalyst in developer debug mode, then you will see the correct unicode characters in
the debug output.  For example if you generate a POST request:

    use utf8;
    use HTTP::Request::Common;    # provides POST
    use Catalyst::Test 'MyApp';

    my $res = request POST "/example/posted", ['♥'=>'♥', '♥♥'=>'♥'];

Running in CATALYST_DEBUG=1 mode you should see output like this:

    [debug] Body Parameters are:
    .-------------------------------------+--------------------------------------.
    | Parameter                           | Value                                |
    +-------------------------------------+--------------------------------------+
    | ♥                                   | ♥                                    |
    | ♥♥                                  | ♥                                    |
    '-------------------------------------+--------------------------------------'

And if you had a controller like this:

    package MyApp::Controller::Example;

    use utf8;    # the action below contains a UTF8 literal
    use base 'Catalyst::Controller';

    sub posted :POST Local {
        my ($self, $c) = @_;
        $c->res->content_type('text/plain');
        $c->res->body("hearts => ${\$c->req->body_parameters->{'♥'}}");
    }

The following test case would pass:

    use Test::More;
    use Encode 2.21 'decode_utf8';
    is decode_utf8($res->content), 'hearts => ♥';

In this case we decode so that we can print and compare strings with multibyte characters.

B<NOTE>  In some cases a browser may not follow the specification and so may not set the form POST
encoding based on the server response.  Catalyst itself doesn't attempt any workarounds, but one
common approach is to use a hidden form field with a UTF8 value (you might be familiar with
this from how Ruby on Rails has HTML form helpers that do that automatically).  In that case
some browsers will send UTF8 encoded data if they notice the hidden input field contains such a
character.  Also, you can add an HTML attribute to your form tag which many modern browsers
will respect to set the encoding (accept-charset="utf-8").  And lastly there are some javascript
based tricks and workarounds for even more odd cases (searching the web for this will return
a number of approaches).  Hopefully as more compliant browsers become popular these edge cases
will fade.

B<NOTE>  It is possible for a form POST multipart request (normally a file upload) to contain
inline content with mixed content character sets and encodings.  For example one might create
a POST like this:

    use utf8;
    use HTTP::Request::Common;

    my $utf8 = 'test ♥';
    my $shiftjs = 'test テスト';
    my $req = POST '/root/echo_arg',
        Content_Type => 'form-data',
          Content =>  [
            arg0 => 'helloworld',


encoding for L<Catalyst>:

L<Catalyst::View::TT>, L<Catalyst::View::Mason>, L<Catalyst::View::HTML::Mason>,
L<Catalyst::View::Xslate>

See L<Catalyst::Upgrading> for additional information on L<Catalyst> extensions that require
upgrades.

In general, for the common views you should not need to do anything special.  If your actual
template files contain UTF8 literals you should set configuration on your View to enable that.
For example in TT, if your template has actual UTF8 characters in it you should do the following:

    MyApp::View::TT->config(ENCODING => 'utf-8');

However L<Catalyst::View::Xslate> wants to do the UTF8 encoding for you (we assume that the
authors of that view did this as a workaround to the fact that until now encoding was not core
to L<Catalyst>).  So if you use that view, you either need to tell it to not encode, or you need
to turn off encoding for Catalyst.

    MyApp::View::Xslate->config(encode_body => 0);

or

    MyApp->config(encoding=>undef);

Preference is to disable it in the View.

Other views may be similar.  You should review View documentation and test during upgrading.
We tried to make sure most common views worked properly and noted all workarounds, but if we
missed something please alert the development team (instead of introducing a local hack into
your application that will mean nobody will ever upgrade it...).

=head2 Setting the response from an external PSGI application.

L<Catalyst::Response> allows one to set the response from an external L<PSGI> application.
If you do this, and that external application sets a character set on the content-type, we
C<clear_encoding> for the rest of the response.  This is done to prevent double encoding.

B<NOTE> Even if the character set of the content type is the same as the encoding set in
$c->encoding, we still skip encoding.  This is a regrettable difference from the general rule
outlined above, where if the current character set is the same as the current encoding, we
encode anyway.  Nevertheless I think this is the correct behavior since the earlier rule exists
only to support backward compatibility with L<Catalyst::View::TT>.

In general if you want L<Catalyst> to handle encoding, you should avoid setting the content
type character set since Catalyst will do so automatically based on the requested response
encoding.  It's best to request alternative encodings by setting C<< $c->encoding >>, and if you really
want manual control of encoding you should always call C<< $c->clear_encoding >> so that programmers who
come after you are very clear as to your intentions.
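
A minimal sketch of both approaches follows.  The controller name, the action names, the
Shift_JIS encoding and the response text are all assumptions made up for this example;
C<< $c->encoding >> and C<< $c->clear_encoding >> are the methods named above.

    package MyApp::Controller::Example;

    use utf8;
    use base 'Catalyst::Controller';
    use Encode ();

    # Let Catalyst encode the body, but with a different encoding.
    sub other_encoding :Local {
        my ($self, $c) = @_;
        $c->res->content_type('text/plain');
        $c->encoding(Encode::find_encoding('Shift_JIS'));
        $c->res->body('テスト');
    }

    # Take manual control: clear the encoding, then send bytes yourself.
    sub manual_encoding :Local {
        my ($self, $c) = @_;
        $c->clear_encoding;
        $c->res->content_type('text/plain; charset=UTF-8');
        $c->res->body(Encode::encode('UTF-8', 'テスト'));
    }

In the first action the body stays a character string and Catalyst does the encoding; in the
second the body is already bytes and Catalyst is told to leave it alone.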

=head2 Disabling default UTF8 encoding

You may encounter issues with your legacy code running under default UTF8 body encoding.  If
so you can disable this with the following configuration setting:

    MyApp->config(encoding=>undef);

Where C<MyApp> is your L<Catalyst> subclass.

If you do not wish to disable all the Catalyst encoding features, you may disable specific
features via two additional configuration options: C<skip_body_param_unicode_decoding>
and C<skip_complex_post_part_handling>.  The first will skip any attempt to decode POST
parameters when creating body parameters, and the second will skip creation of instances
of L<Catalyst::Request::PartData> in the case that the multipart form upload contains parts
with a mix of content character sets.
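
For illustration, a configuration fragment that turns both of these features off might look
like the following (the application name C<MyApp> is an assumption; set only the option
you actually need):

    package MyApp;
    use Catalyst;

    __PACKAGE__->config(
        # Leave POST parameters as undecoded bytes.
        skip_body_param_unicode_decoding => 1,

        # Do not create Catalyst::Request::PartData objects for
        # multipart parts that carry their own character set.
        skip_complex_post_part_handling  => 1,
    );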

If you believe you have discovered a bug in UTF8 body encoding, I strongly encourage you to
report it (and not try to hack a workaround in your local code).  We also recommend that you
regard such a workaround as a temporary solution.  It is ideal if L<Catalyst> extension
authors can start to count on L<Catalyst> doing the right thing for encoding.

=head1 Conclusion

This document has attempted to be a complete review of how UTF8 and encoding works in the
current version of L<Catalyst> and also to document known issues, gotchas and backward
compatible hacks.  Please report issues to the development team.

=head1 Author

John Napiorkowski L<jjnapiork@cpan.org|mailto:jjnapiork@cpan.org>

=cut
