gyulai3y ago

Getting strings to have the right encodings should be easy. On the last Perl codebase I touched, it proved impossible for all practical purposes.

wazoox3y ago

It's markedly easier than with Python, though. Here's a short script that will recode a file with mixed iso-8859-1 and utf8 data into proper utf8:

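    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Encode qw( decode FB_QUIET );
    
    # Read raw bytes; write properly encoded UTF-8.
    binmode STDIN, ':bytes';
    binmode STDOUT, ':encoding(UTF-8)';
    
    my $out;
    
    while ( <> ) {
        $out = '';
        while ( length ) {
            # With FB_QUIET, decode stops at the first malformed byte and
            # removes the bytes it has consumed from $_.
            $out .= decode( "utf-8", $_, FB_QUIET );
            # The byte that stopped it is treated as iso-8859-1; decoding
            # through the substr lvalue removes that byte from $_ as well.
            $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
        }
        print $out;
    }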

gyulaiOP3y ago

Thanks for posting the happily ignorant code snippet that I have been waiting for.

The problem is that Perl internally represents strings as sequences of numbers. Not even sequences of bytes, but sequences of numbers that could either be codepoints or the bytes that result from encoding such a sequence of codepoints. As a developer you are perfectly free to assume either one at any given point in your codebase. It's not even clear that either of the two is particularly "preferred" at large, or a best practice, or anything like that.

To make things worse, there is no way to know which is which, i.e. a string itself is happily ignorant about the assumptions that people will/should make about it. And Perl will happily concatenate strings making different kinds of assumptions, or double- or triple-encode them as you please, or decode something that hasn't been encoded in the first place.

This leads to jumbles of numbers that aren't anything in particular. They simply work well enough for sloppy programmers to not realize when they are making mistakes, but badly enough to almost guarantee that encoding errors will crop up on users' screens regularly.
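A minimal sketch of what that looks like in practice. The sample word and the variable names are only for illustration; the calls are the stock `Encode` functions:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Illustrative sample only: "hôpital", written with an escape so the
# encoding of the source file itself does not matter.
my $chars = "h\x{f4}pital";               # 7 codepoints
my $bytes = encode( "UTF-8", $chars );    # 8 UTF-8 bytes

# Encoding bytes that are already encoded is accepted without complaint.
my $double = encode( "UTF-8", $bytes );   # 10 bytes

# So is gluing a character string to a byte string.
my $mixed = $chars . $bytes;              # 15 numbers: part codepoints, part bytes

# Decoding the double-encoded data once yields a "character" string that
# compares equal to the raw byte string from above.
my $once = decode( "UTF-8", $double );

printf "chars=%d bytes=%d double=%d mixed=%d\n",
    length($chars), length($bytes), length($double), length($mixed);
print "decoded-once string equals the raw bytes\n" if $once eq $bytes;
```

Run under `use warnings`, none of these steps produces a warning, and the final comparison succeeds: nothing in the values themselves records which of them hold codepoints and which hold bytes.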

Now, given that this is how the language works, be my guest jumping into a 100k LOC Perl codebase that dozens of programmers have touched over a decade, passing around and munging together strings not just within their own codebase, but also using strings stored to and retrieved from elsewhere, in some cases places where no one knows anymore where they initially came from or where they will ultimately go.

wazoox3y ago

> Thanks for posting the happily ignorant code snippet that I have been waiting for.

Thank you for being so civil. IMO displaying a badly encoded string beats crashing on a runtime error most of the time. I'd rather see "hÃ´pital" than "Error 500", if you will. Maybe don't assume your personal assumptions carry any validity outside of your own choices, preferences, or uses.

I can imagine the difficulty of working with a huge codebase that has never been refactored and maybe even predates utf-8, but where would you be if it had been written in Python 2.5 originally?

gyulaiOP3y ago

But that's precisely the point: the Python community realized that something was fundamentally broken in Python 2 and went through a painful transition process. Transitioning to Python 3 meant getting your house in order where string encodings were concerned.

Any Python programmer would tell you: Starting a new project in 2022 in Python 2.5 is professional malpractice.

But that's what the original post seems to be saying: That Perl 5 has somehow managed to fix any of what was fundamentally wrong with it. ...and that couldn't be further from the truth. And people in this thread are saying that maybe they should have another look into Perl 5 as a serious option for starting out a new codebase in 2022. ...and that's a very bad idea.

Sure: If you started out a new codebase in Perl 5 in 2022, there are coding standards you could adopt to avoid getting yourself into a pickle where string encodings are concerned. But without the interpreter helping you out on that front, it'll produce ugly code, and take mental discipline and disciplined code reviewing practices on a team. It's solving a problem that Python solves for you so much more easily and effectively. You could go with Perl 6 / Raku, but why would you? What does it have to recommend it over Python or Ruby, other than a Perl programmer's nostalgia for being a little Perl-like?
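As a rough sketch of what one such coding standard might look like: decode at the point where data comes in, encode at the point where it goes out, and mark which is which in the variable name. The helper names `text_from_bytes` and `bytes_from_text` and the `_text` suffix are hypothetical, made up for this example; only the `Encode` calls are real.

```perl
use strict;
use warnings;
use Encode qw( decode encode FB_CROAK LEAVE_SRC );

# Hypothetical helper: call on everything that comes in from outside.
# FB_CROAK makes malformed UTF-8 die here instead of slipping through.
sub text_from_bytes {
    my ($bytes) = @_;
    return decode( "UTF-8", $bytes, FB_CROAK | LEAVE_SRC );
}

# Hypothetical helper: call on everything that leaves the program.
sub bytes_from_text {
    my ($text) = @_;
    return encode( "UTF-8", $text, FB_CROAK | LEAVE_SRC );
}

# Convention (an assumption of this sketch): a "_text" suffix marks
# decoded strings, and only such strings are ever combined.
my $name_text  = text_from_bytes( "h\xc3\xb4pital" );   # as if read from a socket
my $label_text = "[" . $name_text . "]";
print bytes_from_text( $label_text ), "\n";
```

The interpreter enforces none of this. Passing an already decoded string to the output helper twice still works without complaint, so the convention only holds for as long as every reviewer keeps checking it.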

You could say the transition from Perl 5 to Perl 6 is just like the transition from Python 2 to Python 3. The difference is: Perl is simply late by at least a decade.

The point that the article is trying to refute, namely that Perl is for dinosaurs, in my mind just absolutely stands.

teddyh3y ago

> I'd rather see "hÃ´pital" than "Error 500"

The debate between weak typing and strong typing is as old as the hills. But in much of the modern era, strong typing, of which Python is an example, seems to have decidedly prevailed.
