elegantgasra.blogg.se - Ruby change text encoding

Right now, I take in external data and call force_encoding on it. If the data contains invalid bytes, nothing happens at that moment, but an exception will be raised at some indeterminate point in the future, at virtually any point in my program's execution. Instead, I want to take action immediately, as soon as the string enters my program. Sometimes I'll want an exception to be raised right away. Other times I'll want to replace invalid bytes with a replacement char; it depends on the context of my application. I could explain the particular context in which I personally want replacement chars, but I think it is a general thing people will want to do in a variety of contexts. This is why String#encode includes the :invalid => :replace option, right? Note that this option in String#encode is for invalid bytes in the source, not for inability to transcode; :undef => :replace is for that. If you might want to replace invalid bytes when transcoding (and indeed I think people often do; it seems quite common), then why wouldn't you sometimes want to do it without transcoding too? I am not familiar with the internal implementation of String#encode with :invalid => :replace. But if it's possible for String#encode to identify invalid bytes in the source encoding when transcoding, and do it sufficiently correctly, it seems like it should be possible to do it without transcoding too: step through all the bytes, do exactly what String#encode does to identify illegal bytes in the source encoding, and just omit the subsequent transcode step. The implementation in String#encode, applied without transcoding, is exactly what would be appropriate for consistency.
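The difference between the two String#encode options mentioned above can be sketched as follows. This is only an illustration: the sample strings and variable names are made up for the example, and the default replacement characters shown are the ones String#encode picks when no :replace string is given.

```ruby
# 0xFF can never appear in well-formed UTF-8, so this string is invalid
# in its SOURCE encoding. (.b makes a mutable binary copy of the literal.)
broken = "abc\xFFdef".b.force_encoding("UTF-8")

# Without options, transcoding an invalid source raises.
begin
  broken.encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  invalid_error = e.class
end

# :invalid => :replace substitutes the bad byte during transcoding.
# Transcoding back to UTF-8 just makes the result easy to inspect.
repaired = broken.encode("UTF-16LE", :invalid => :replace).encode("UTF-8")

# This string is perfectly valid UTF-8, but "é" has no US-ASCII equivalent.
accented = "caf\u00E9"

begin
  accented.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  undef_error = e.class
end

# :undef => :replace covers that second, unrelated failure mode.
ascii = accented.encode("US-ASCII", :undef => :replace)
```

Here `repaired` contains U+FFFD in place of the bad byte, while `ascii` contains "?" in place of the unmappable character. Only the first case is about bad input; the second is about a limitation of the target encoding.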

I don't know whether this functionality should be provided by String#encode itself, even when the source encoding is the same as the destination encoding, or whether it needs to be a new method, say #validate_encoding. I wrote a pure-Ruby partial implementation showing what I need, but it's not as full-featured as the relevant functions in #encode for transcoding, and it's probably much, much slower too. I think the use case is very common. It is for me anyway, and I don't think I'm unique! I am taking in input from an external source which I believe is UTF-8 (for example; it could be any encoding). It is advertised as UTF-8, and as far as I know it is. But, like all input, it cannot be guaranteed reliable. It may have errors in it, or corrupt bytes. It may have been mis-entered at some point in its history, or have had its encoding misrepresented.
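The author's own pure-Ruby code is not reproduced here. As a rough idea of what such a helper could look like, here is a minimal sketch. The method name is borrowed from the #validate_encoding suggestion above; the keyword arguments and their defaults are assumptions made for this example, not an existing Ruby API, and the default replacement character assumes a Unicode encoding such as UTF-8.

```ruby
# Hypothetical helper (not part of Ruby): check a string against the
# encoding it is already tagged with, without transcoding it.
#   invalid: :raise   -> raise ArgumentError on bad bytes (assumed default)
#   invalid: :replace -> return a copy with bad bytes replaced
def validate_encoding(str, invalid: :raise, replace: "\uFFFD")
  return str if str.valid_encoding?

  unless invalid == :replace
    raise ArgumentError, "invalid byte sequence in #{str.encoding}"
  end

  # each_char hands back malformed bytes as "characters" that fail
  # valid_encoding?, so they can be swapped out one at a time.
  str.each_char.map { |ch| ch.valid_encoding? ? ch : replace }.join
end

# Usage: repair or reject input right where it enters the program.
input = "abc\xFFdef".b.force_encoding("UTF-8")
clean = validate_encoding(input, invalid: :replace)
```

As the text says of the original, a character-by-character loop like this is likely far slower than a built-in, and it covers only a fraction of what String#encode offers.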

I have a string which ought to be, say, UTF-8:

    string = something.force_encoding("UTF-8")

However, like all input from an external source that I don't have complete control over, it's possible that it contains invalid bytes. I'd like to check it right away, sometimes raising right away, sometimes using :invalid/:replace functionality similar to String#encode. As far as I can tell, Ruby gives me no way to do it. This does not work; it's a no-op even when there are invalid bytes:

    string.encode("UTF-8") # Does NOT raise even if there are bad bytes
    string.encode("UTF-8", :invalid => :replace) # Does NOT replace bad bytes

So this is a feature request for a built-in way to do this. It is actually a pretty common thing to want to do: sometimes strings come from external sources that are not what they claim they are, and it's very useful to be able to check/validate them, and possibly repair them, right away, rather than waiting for an "invalid byte sequence" error to crop up at some indeterminate point in the future.
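The delayed failure described above can be reproduced in a few lines. This is a sketch with made-up sample data; the regular expression match simply stands in for whatever unrelated operation happens to touch the string later.

```ruby
# Raw bytes as they might arrive from a file or socket (sample data).
# 0xFF can never appear in well-formed UTF-8.
raw = "name=J\xFFrg".b

# force_encoding only relabels the bytes. It does not inspect them,
# so it never raises, even for malformed input.
string = raw.force_encoding("UTF-8")

# The damage is already detectable at this point...
valid_on_arrival = string.valid_encoding?

# ...but if nobody checks, the error surfaces later, somewhere else.
begin
  string =~ /rg/
  late_error = nil
rescue ArgumentError => e
  late_error = e.message
end
```

`valid_on_arrival` is false here, and `late_error` holds the "invalid byte sequence" message the text refers to, raised by code that had nothing to do with reading the input.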

If I use the String#encode feature to transcode from one encoding to another, then bad (invalid) bytes in the source encoding will raise, or else I can pass in the :invalid and :replace options to tell it to do something different with bad bytes in the source encoding. Sometimes, though, I do not want to transcode to a new encoding.