Ruby 1.9 made me remember how much I hate the concept of encodings

Update: You can follow the discussion on Hacker News

I guess I won’t make a lot of friends by saying this, but my first impression of Ruby 1.9 was awful. Since we had to configure a new server, we thought it was the perfect occasion to install Ruby 1.9 and see how our current web applications behave. I can’t comment on any of the new features in 1.9 because we got bored after a while and decided to switch back to 1.8.
There is one huge problem with 1.9, and it is how it manages encodings. Sure, every Ruby fanatic will tell you that it is “cleaner”, “more robust”, “safer”, “cleverer” or whatever… but it breaks working applications! So in my book this is a problem. I know it is normal to refactor some aspects of your code when a new version of a programming language comes out, but… this? All this pain for what, encodings? What’s the benefit already? No, seriously, tell me, because I wake up every morning having to remember what they are, what purpose they serve and why application developers still have to worry about them after all these years.

What I do know is that with 1.8 you could mix different encodings in the same string instance, and the worst that could happen was some weird-looking characters in the resulting web page. Ruby 1.9 does things differently: it throws an exception in your face (there is a concrete sketch of both the failure and the proposed fix right after the excerpt below). Here is a great article that explains what is happening with string encodings in 1.9. Beware though: although the article says that there is a solution, it is a hypothetical solution. Here is the excerpt:

Even better, Ruby already has a mechanism that is mostly designed for this purpose. In Ruby 1.9, setting Encoding.default_internal tells Ruby to encode all Strings crossing the barrier via its IO system into that preferred encoding. All we’d need, then, is for maintainers of database drivers to honor this convention as well.

So, unless I’m just not getting something (which is highly possible because encodings always confuse the heck out of me), there is no real solution other than to wait for database driver developers to honor the Encoding.default_internal setting of Ruby 1.9.
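
To make the complaint concrete, here is a minimal sketch of the kind of failure 1.9 introduces (the strings are made up for illustration, but this is roughly what our applications ran into):

    # encoding: utf-8
    # In 1.8 a string was just bytes, so mixing encodings "worked";
    # in 1.9 every String carries an encoding and incompatible mixes raise.
    utf8   = "héllo"                                  # UTF-8 source literal
    latin1 = "h\xE9llo".force_encoding("ISO-8859-1")  # same word as Latin-1 bytes

    begin
      utf8 + latin1
    rescue Encoding::CompatibilityError => e
      puts e.message  # incompatible character encodings: UTF-8 and ISO-8859-1
    end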
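
And here is a sketch of the mechanism the excerpt describes, assuming a hypothetical Latin-1 file named legacy.txt; whether database drivers will honor the same setting is exactly the open question:

    # Setting Encoding.default_internal asks Ruby's IO layer to transcode
    # everything it reads into that encoding.
    Encoding.default_internal = Encoding::UTF_8

    # "legacy.txt" is a hypothetical file declared as Latin-1 externally.
    File.open("legacy.txt", "r:ISO-8859-1") do |f|
      line = f.gets
      puts line.encoding  # => UTF-8, transcoded on the way in
    end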
A small rant about encodings
I dream of a world without war or hunger, where I don’t have to care about character encodings when I’m programming! Why on earth do we have all those different encodings in 2010? Why not make one huge encoding table, UTF-16384, containing every single character in the universe, so we can forget about this crazy concept of different encodings and pretend that it never existed? Would one big, fat, unique encoding table cause huge performance issues everywhere? I might be mistaken, but I really doubt it would.

18 thoughts on “Ruby 1.9 made me remember how much I hate the concept of encodings”

  1. Believe it or not, I had a problem with encodings yesterday in .NET with JavaScript. It took me a long time to find out how to change the encoding of a file with .NET (I had to Google it), and then I had to choose among a hundred encodings in a list. WHY????

  2. Failing silently on bad data is one sure way to end up with rubbish data. UTF-8 and UTF-16 can already represent all human character sets. The choice between the two typically depends on where you are from: Asian scripts are more efficiently represented in UTF-16, and Western scripts in UTF-8. Ruby suffers in that its author is biased towards home-grown encodings like Shift JIS for efficiency reasons, standards and interoperability be damned.
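    A quick way to see the efficiency point from 1.9 itself (the sample strings are arbitrary):

        # encoding: utf-8
        ja = "こんにちは"                      # five Japanese characters
        puts ja.encode("UTF-8").bytesize     # => 15 (three bytes each)
        puts ja.encode("UTF-16BE").bytesize  # => 10 (two bytes each)

        en = "hello"                         # five ASCII letters
        puts en.encode("UTF-8").bytesize     # => 5  (one byte each)
        puts en.encode("UTF-16BE").bytesize  # => 10 (two bytes each)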

  3. You start to care when your name contains characters outside ASCII. Using just ASCII is not an option when you can’t write your name with ASCII in a way that would be pronounced properly. With UTF-8 vs. UTF-16, you care when all the characters you usually use are outside a single byte. If you don’t care about the rest of the world that has to deal with these issues, then don’t use a product that isn’t targeted just at English-speaking countries. And you can’t say Ruby 1.9 sucks when it’s the database drivers that don’t play ball yet.

  4. The following drivers have already been updated with the fix:
    do_sqlite3
    do_mysql
    do_postgres
    sqlite3
    mongodb
    Fixes have also been applied to 1.9’s YAML parser (psych) and Nokogiri. We’re waiting on fixes in mysql and pg, which should happen in the near future.
    People sometimes have a tendency to assume that hypothetical solutions will take ages to implement. In this case, I expect to have this wrapped up within a month, and before Ruby 1.9.2.final is released.

  5. > Why on earth do we have all those different encodings in 2010?
    > Why not make one huge encoding table, UTF-16384, containing
    > every single character in the universe, so we can forget about this
    > crazy concept of different encodings and pretend that it never existed?
    We do have that. It’s called Unicode. Unicode assigns a number to every character the Unicode committee has gotten to. UTF-8, UTF-16, and UTF-32 are simply different ways to encode those numbers into bytes that can be read back.
    Unfortunately, what you propose is an inherently human activity, and a hard one at that. They’ve been at it for quite some time, and there are still encodings that cannot be losslessly converted to Unicode and back. This is the same for UTF-8, UTF-16, and UTF-32, because UTF-8 can already encode the entire Unicode space.
    One issue is that multiple encodings can contain characters that map to the same Unicode character. For instance, we don’t need a separate “A” for the UTF-8 variant and the Latin-1 variant (Latin-1 is a common encoding that has the ASCII characters plus a bunch of stuff for Latin countries). They attempted to do this with the myriad encodings in Japan, in a process called “Han unification”. Unfortunately, the Japanese weren’t always happy with the results.
    This process could sometimes make it impossible to represent your name in Unicode. Imagine if your name was José and the Unicode committee decided that é and e were the same character, so a program which read in José spat back out Jose. You would not be happy with it.
    In short, it’s possible that this problem will eventually be solved, but it’s complicated and messy, and probably has many years to go before it’s completed.
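    The difference between the encoding forms is easy to see from 1.9 itself (the string is arbitrary):

        # encoding: utf-8
        # Same four Unicode code points, three different byte serializations.
        s = "José"
        puts s.encode("UTF-8").bytesize    # => 5  (é takes two bytes)
        puts s.encode("UTF-16BE").bytesize # => 8  (two bytes per character)
        puts s.encode("UTF-32BE").bytesize # => 16 (four bytes per character)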

  6. Every string has an implicit encoding, even if it’s just ASCII (be happy it’s not EBCDIC). Ruby 1.9 just makes this explicit, and allows you to convert strings to specific encodings, or not (remember, some encoding conversions are lossy).
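    For instance, a strict conversion raises rather than guessing, and you opt into the loss explicitly (a sketch in 1.9 syntax):

        # encoding: utf-8
        s = "José"               # UTF-8
        begin
          s.encode("US-ASCII")   # é has no US-ASCII equivalent, so this raises
        rescue Encoding::UndefinedConversionError => e
          puts e.message
        end

        # Opting into the loss explicitly, with a replacement character:
        puts s.encode("US-ASCII", :undef => :replace, :replace => "?")  # => Jos?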

  7. > but it breaks working applications
    Actually, it doesn’t. It makes *already broken* applications raise an exception instead of silently corrupting data.
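    Here is what “already broken” looks like, with made-up bytes:

        # encoding: utf-8
        # Latin-1 bytes mislabeled as UTF-8 -- exactly what 1.8 passed through silently.
        bytes = "Jos\xE9".force_encoding("UTF-8")
        puts bytes.valid_encoding?  # => false

        begin
          bytes =~ /é/              # character-aware operations trip over the bad byte
        rescue ArgumentError => e
          puts e.message            # => invalid byte sequence in UTF-8
        end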

  8. Jorg, that’s the most exaggerated comment I’ve seen so far. By your standards, every Rails application backed by a MySQL database encoded in Latin-1 (often the default encoding with MySQL) is broken. That’s a whole lot of broken apps!

  9. @Frank: Well, if the Rails app is presenting a UTF-8 web page and using a Latin-1 table for storage, it is a broken app – conceptually if not in practice. It was a disaster just waiting to happen; well, it happened. It’s not Ruby’s fault that people don’t configure their DBs properly.

  10. @Amadan, but what kind of disaster could happen if the website in question doesn’t need internationalization? These “local” websites/applications are not waiting for a disaster to happen… they would have continued to work for eternity. Now they don’t, and I think it is stretching it a bit too far to put all the blame on the developer.
    Call me crazy, but I prefer the silent “failing” of 1.8 over the unforgiving exception that is triggered with 1.9.1.

  11. @Yehuda,
    Thanks for the nice explanation. I’m not blaming anyone, in fact. I know Unicode is supposed to be the “ultimate encoding table” I am talking about; it just isn’t the case yet.
    In a perfect world we would not have to worry about character encodings at all. Machines should do whatever has to be done behind the scenes to make it all work without our intervention. I’m sure one day we’ll get there.

  12. You are absolutely right, Frank.
    Ruby 1.9.x forces everyone to deal with it.
    The encoding shit is still the number one reason why I can’t upgrade; I solved all the other problems with syntax changes.
    Ruby now forces me to deal with encodings. I was able to avoid that before.
    Now I have to read up on shit about encodings, and I KNOW I WILL NEVER NEED THEM, BECAUSE IN TEN YEARS I HAVE NEVER HAD TO CATER TO ENCODINGS I DID NOT CARE ABOUT.
    Now this choice has been taken away from Ruby users.

  13. Think of Ruby 1.9 as a different language that happens to be based on 1.8. I have numerous command-line programs that assume a string is simply a sequence of bytes, but to make them compatible with 1.9 I have to go around and change them to say so explicitly, e.g. by opening files in “r:BINARY” mode (which won’t work in 1.8). Also, the fact that “abc”[0] no longer returns the first byte of a string breaks a lot of existing programs, and it’s difficult to rewrite code that relies on the old behavior in a way that’s compatible with both 1.8 and 1.9 (see the sketch below).
    I imagine a lot of people will simply stick with 1.8 for the foreseeable future.
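    One way to stay compatible with both versions, assuming a hypothetical file blob.bin:

        # 1.9 strings respond to #encoding; 1.8 strings don't.
        mode = "".respond_to?(:encoding) ? "rb:BINARY" : "rb"
        data = File.open("blob.bin", mode) { |f| f.read }

        s = "abc"
        puts s[0, 1]        # => "a" on both 1.8 and 1.9 (bare s[0] differs)
        puts s.unpack("C")  # => 97, the first byte, on both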

