Re: recycling internationalized garbage
comp.lang.python archive

From: <aaronwmail-usenet@yahoo.com>
Date: Tue Mar 14 2006 - 16:18:06 CET

Regarding cleaning of mixed string encodings in
the discography search engine

http://www.xfeedme.com/discs/discography.html

Following </F>'s suggestion I came up with this:

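import codecs

utf8enc = codecs.getencoder("utf8")
utf8dec = codecs.getdecoder("utf8")
iso88591dec = codecs.getdecoder("iso-8859-1")

def checkEncoding(s):
    # Try strict UTF-8 first; anything that is not valid UTF-8
    # is assumed to be ISO-8859-1.
    try:
        (uni, dummy) = utf8dec(s)
    except UnicodeDecodeError:
        (uni, dummy) = iso88591dec(s, 'ignore')
    # Re-encode so the result is always UTF-8.
    (out, dummy) = utf8enc(uni)
    return out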

This works nicely for Nordic stuff like
"björgvin halldórsson - gunnar Þórðarson",
but Russian seems to turn into garbage
and I have no idea about Chinese.
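
A possible explanation for the Russian garbage, sketched below: ISO-8859-1
assigns a character to every byte value, so the fallback never fails and
any non-UTF-8 input comes out as Latin letters. The sketch assumes the
Russian entries were stored as windows-1251 (cp1251), which is only a
guess; check_encoding and its fallback parameter are hypothetical names
for illustration, not part of the code above.

```python
def check_encoding(raw, fallback="iso-8859-1"):
    """Return raw bytes re-encoded as UTF-8.

    Valid UTF-8 passes through unchanged; anything else is decoded
    with the (hypothetical) fallback codec before re-encoding.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback, "replace")
    return text.encode("utf-8")


russian = "Привет"
# Assumption: the source data was windows-1251, not ISO-8859-1.
raw = russian.encode("cp1251")

# The Latin-1 fallback "succeeds" but yields mojibake.
garbled = check_encoding(raw).decode("utf-8")

# Decoding with the codec the data was really written in recovers the text.
fixed = check_encoding(raw, fallback="cp1251").decode("utf-8")

print(repr(garbled))
print(repr(fixed))
```

If that is what is happening, a strict decode cannot tell the single-byte
encodings apart, since most of them accept almost any byte sequence. The
fallback would have to be chosen per record from some outside hint, such
as the language or origin of the entry.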

Unless someone has any other ideas I'm
giving up now.
   -- Aaron Watters

===

In theory, theory is the same as practice.
In practice it's more complicated than that.
  -- folklore
Received on Sun Apr 30 11:49:07 2006