Reliable character encodings conversion

Hubert Łępicki
Hi,

I am looking for a reliable and error-resistant way to convert character
encodings to UTF-8. Input encodings vary, and I have fairly good input
encoding detection in place.

I am using the Iconv library wrapper to convert texts to UTF-8, but it's
throwing an "Iconv::IllegalSequence" exception. The problem is that the
input texts are user-generated and sometimes contain mixed character
encodings.

Does anyone have any experience with this kind of situation, or can
anyone suggest alternative libraries?

Thanks,
Hubert

--
Pozdrawiam,
Hubert Łępicki
 -----------------------------------------------
[ http://hubertlepicki.com ]

Re: Reliable character encodings conversion

James Gray-7
On Sep 30, 2008, at 7:30 AM, Hubert Łępicki wrote:

> I am using the Iconv library wrapper to convert texts to UTF-8, but it's
> throwing an "Iconv::IllegalSequence" exception.

You can add a //TRANSLIT to the end of the "to" encoding to have Iconv  
attempt to convert characters to reasonable equivalents in that  
encoding. This is usually more helpful when your input is all one  
encoding and just has some characters that won't translate well (like  
a UTF-8 … going to ISO-8859-1).

Your case of mixed encodings is probably best handled with //IGNORE  
instead, which asks Iconv to skip over any characters that cannot be  
converted.  You will lose some data with this, but it will convert
what it can.

You can also use //TRANSLIT//IGNORE to convert what can be converted  
and skip the rest.
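
For readers on Ruby 1.9 or later, where Iconv is no longer part of the standard library (it was removed in Ruby 2.0), here is a rough sketch of the same two ideas using the built-in String#encode. This is a swapped-in technique, not Iconv itself: dropping unconvertible input stands in for //IGNORE, and substituting "?" is only a crude stand-in for //TRANSLIT. Both helper names are hypothetical.

```ruby
# Sketch only: uses Ruby's built-in String#encode (Ruby 1.9+), not Iconv.
# The Iconv form discussed above would look roughly like:
#   Iconv.conv("UTF-8//IGNORE", "ISO-8859-1", text)

# Hypothetical helper: convert `text` from the encoding named `from` to
# UTF-8, silently dropping anything that cannot be converted.
def to_utf8_ignoring(text, from)
  text.encode("UTF-8", from, invalid: :replace, undef: :replace, replace: "")
end

# Hypothetical helper: convert UTF-8 `text` to the encoding named `to`,
# substituting "?" for characters that have no equivalent there.
def from_utf8_with_fallback(text, to)
  text.encode(to, "UTF-8", invalid: :replace, undef: :replace, replace: "?")
end
```

As with //IGNORE, the first helper loses data: any byte that is not valid in the declared source encoding simply disappears from the output.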

Hope that helps.

James Edward Gray II

Re: Reliable character encodings conversion

Hubert Łępicki
2008/9/30 James Gray <[hidden email]>:

> On Sep 30, 2008, at 7:30 AM, Hubert Łępicki wrote:
>
>> I am using the Iconv library wrapper to convert texts to UTF-8, but it's
>> throwing an "Iconv::IllegalSequence" exception.
>
> You can add a //TRANSLIT to the end of the "to" encoding to have Iconv
> attempt to convert characters to reasonable equivalents in that encoding.
> This is usually more helpful when your input is all one encoding and just
> has some characters that won't translate well (like a UTF-8 … going to
> ISO-8859-1).
>
> Your case of mixed encodings is probably best handled with //IGNORE instead,
> which asks Iconv to skip over any characters that cannot be converted.  You
> will lose some data with this, but it will convert what it can.
>
> You can also use //TRANSLIT//IGNORE to convert what can be converted and
> skip the rest.
>

Thanks, //IGNORE//TRANSLIT seems to help a bit, but it's not perfect.
I am losing characters such as the British pound sign that were placed
in text encoded as us-ascii, for example. Is there some smart library
out there that can help with common problems like this one?
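
One possible explanation, offered as an assumption rather than a diagnosis: a pound sign stored as the single byte 0xA3 is not valid US-ASCII, so a converter told to ignore errors simply drops it. A minimal sketch of a workaround is to try the detected encoding strictly first and fall back to a wider one such as ISO-8859-1. It uses Ruby's built-in String#encode (Ruby 1.9+) rather than Iconv, and convert_with_fallbacks is a hypothetical helper name.

```ruby
# Hypothetical helper: try each candidate source encoding in order and
# return the first strict conversion to UTF-8 that succeeds.
# Returns nil when no candidate can decode the input.
def convert_with_fallbacks(bytes, candidates)
  candidates.each do |name|
    begin
      # Strict conversion: raises on any byte that is invalid in `name`.
      return bytes.encode("UTF-8", name)
    rescue EncodingError
      next # this candidate does not fit, try the next one
    end
  end
  nil
end

# The detector said US-ASCII, but the text contains byte 0xA3.
# US-ASCII fails strictly, so ISO-8859-1 is tried and the pound survives.
convert_with_fallbacks("\xA35", ["US-ASCII", "ISO-8859-1"])
```

The order of the candidates matters: single-byte encodings such as ISO-8859-1 accept every byte, so they belong at the end of the list.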

I have noticed that there is the ICU (http://www.icu-project.org/)
library for C++, which I could use if it's any smarter. Has anyone had
any experience with it?

Best,
H.

> Hope that helps.
>
> James Edward Gray II
>




Re: Reliable character encodings conversion

James Gray-7
On Sep 30, 2008, at 8:20 AM, Hubert Łępicki wrote:

> 2008/9/30 James Gray <[hidden email]>:
>> On Sep 30, 2008, at 7:30 AM, Hubert Łępicki wrote:
>>
>>> I am using the Iconv library wrapper to convert texts to UTF-8, but it's
>>> throwing an "Iconv::IllegalSequence" exception.
>>
>> You can add a //TRANSLIT to the end of the "to" encoding to have  
>> Iconv
>> attempt to convert characters to reasonable equivalents in that  
>> encoding.
>> This is usually more helpful when your input is all one encoding  
>> and just
>> has some characters that won't translate well (like a UTF-8 …  
>> going to
>> ISO-8859-1).
>>
>> Your case of mixed encodings is probably best handled with //IGNORE  
>> instead,
>> which asks Iconv to skip over any characters that cannot be  
>> converted.  You
>> will lose some data with this, but it will convert what it can.
>>
>> You can also use //TRANSLIT//IGNORE to convert what can be  
>> converted and
>> skip the rest.
>>
>
> Thanks, //IGNORE//TRANSLIT seems to help a bit - but it's not perfect.

You listed those backwards.  Is that really what you tried?  Does  
reversing them make any difference?

James Edward Gray II

Re: Reliable character encodings conversion

Bugzilla from mailing.mr@gmail.com
You can use the RChardet library.

Here's what I use:

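require 'iconv'     # needed for Iconv.iconv below
require 'rchardet'

# Written for Ruby 1.8. Ruby 1.9 added a built-in String#encoding, and
# Iconv was removed from the standard library in Ruby 2.0, so these
# definitions would clash on newer versions.
class String
   # Encoding name for this string, guessed on first use.
   def encoding
     @encoding ||= guess_encoding
   end

   # Override the guessed encoding.
   def encoding=(new)
     @encoding = new
   end

   # Convert in place from the current (guessed or assigned) encoding.
   def convert_to(new)
     self.replace(Iconv.iconv(new, encoding, self)[0])
     @encoding = new
   end

   # Ask the detector for its best guess and remember it.
   def guess_encoding
     @encoding = CharDet.guess(self)["encoding"]
   end

   # this enables "foo".convert "us-ascii" => "utf-8"
   # (:us-ascii is not a valid symbol literal, so use strings)
   def convert(hash)
     from = hash.keys[0]
     to = hash[from]
     self.replace(Iconv.iconv(to.to_s, from.to_s, self)[0])
   end
end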

It handles translating pretty well :)
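
On current Ruby versions the same hash-style call can be sketched without reopening String and without Iconv, which avoids the clash with the built-in String#encoding. The module and method names below are hypothetical, and the sketch swaps in String#encode for Iconv.iconv; it does no detection of its own.

```ruby
# Hypothetical module: a non-monkey-patching take on the hash-style
# convert shown above, built on String#encode (Ruby 1.9+).
module EncodingTools
  module_function

  # mapping is a one-pair hash of the form { from_name => to_name }.
  # Returns a new string; the input is left untouched.
  # Raises EncodingError if the bytes are not valid in from_name.
  def convert(text, mapping)
    from, to = mapping.first
    text.encode(to.to_s, from.to_s)
  end
end

# Usage: the byte 0xE9 is read as ISO-8859-1 and re-encoded as UTF-8.
EncodingTools.convert("caf\xE9", "ISO-8859-1" => "UTF-8")
```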
