Monday, November 2, 2015

MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence

This error means the parser expected UTF-8 but hit a byte that is not valid at that point in a multi-byte sequence, which suggests that the file contains a mix of encodings. It turns out the file was generated by aggregating content from various sources, each of which used a different encoding. Ideally, the process that generates the file would be corrected so that everything is encoded consistently, but the file is provided by a third party, so we're not in a position to make such changes.

So I'll demonstrate how to correct these errors using some Linux commands.

Identifying the error

To identify the offending character, we can use the iconv command as follows:

[root@webserver failed]# iconv -f utf-8 bad_file.xml  -o /dev/null
iconv: illegal input sequence at position 25691275

Here we can see that the offending character is at position 25691275 (a byte offset into the file).

To view the text around this position, read slightly past it with head and keep only the last 100 bytes with tail:

head -c 25691310 bad_file.xml | tail -c 100
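
When the bad byte is not printable, it can be easier to look at the raw bytes in hex. The following is a minimal sketch on a small sample file. The file name sample_bad.txt and the offset are made up for illustration, and it assumes the position that iconv reports is a zero-based byte offset.

```shell
# Sample file with one invalid byte: 0xFF never appears in valid UTF-8.
printf 'good text \377 more text\n' > sample_bad.txt

# Offset of the bad byte in this sample (hypothetical stand-in for the
# position that iconv reports).
offset=10

# Skip to the offset, take 4 bytes, and print them as hex.
hex=$(tail -c +"$((offset + 1))" sample_bad.txt | head -c 4 | od -An -tx1)
echo "bytes at offset $offset:$hex"
```

The first hex value shown is the byte that iconv stopped on, followed by the three bytes after it.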

Removing the bad characters

If you can get away with removing the bad characters without having to replace them, then you can use this command. The -c option tells iconv to silently discard any characters it cannot convert:


iconv -f utf8 -t utf8 -c bad_file.xml > fixed.xml
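
Before handing the cleaned file back to the XML parser, it is worth checking that it really is valid UTF-8 and how much was thrown away. Here is a rough sketch on a small sample file; the names sample_bad.txt and sample_fixed.txt are hypothetical, and the `|| true` is there because some iconv versions exit with a non-zero status when they discard bytes.

```shell
# Sample file with one invalid byte (0xFF is never valid in UTF-8).
printf 'good text \377 more text\n' > sample_bad.txt

# Drop anything that is not valid UTF-8.
iconv -f utf-8 -t utf-8 -c sample_bad.txt > sample_fixed.txt || true

# Re-run iconv without -c: it succeeds only if the file is now clean.
if iconv -f utf-8 -t utf-8 sample_fixed.txt > /dev/null 2>&1; then
    status=valid
else
    status=invalid
fi

# Compare sizes to see how many bytes were discarded.
removed=$(( $(wc -c < sample_bad.txt) - $(wc -c < sample_fixed.txt) ))
echo "$status, $removed byte(s) removed"
```

If the number of removed bytes is large, the file probably contains whole passages in another encoding, and dropping them may lose real content.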

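If you want to fix a batch of files, you can use this command (the ../fixed directory must already exist). The file name is passed to bash as an argument rather than pasted into the script, so names containing quotes or other special characters are handled safely:


find . -type f -exec bash -c 'iconv -f utf8 -t utf8 -c "$1" > ../fixed/"$1"' _ {} \;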
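
If you would rather leave files that are already valid untouched and see which ones were actually repaired, a plain loop is another option. This is only a sketch under assumed names: the incoming and fixed directories and the two sample files are made up for illustration.

```shell
# Hypothetical layout: originals in ./incoming, cleaned copies in ./fixed.
mkdir -p incoming fixed
printf 'ok\n' > incoming/a.xml
printf 'bad \377 byte\n' > incoming/b.xml

for f in incoming/*.xml; do
    out="fixed/$(basename "$f")"
    if iconv -f utf-8 -t utf-8 "$f" > /dev/null 2>&1; then
        # Already valid UTF-8: copy it over unchanged.
        cp "$f" "$out"
    else
        # Invalid bytes found: discard them (iconv may exit non-zero here).
        iconv -f utf-8 -t utf-8 -c "$f" > "$out" || true
        echo "repaired: $f"
    fi
done
```

The loop runs iconv twice on broken files, which is slower than the find one-liner but gives a list of the files that needed fixing.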

Reference:

http://www.martinaulbach.net/linux/command-line-magic/41-dealing-with-inconsistent-or-corrupt-character-encodings
