This error suggests that the generated file contains text in more than one encoding. It turned out that the file was assembled from content aggregated from various sources, each of which used a different encoding. Ideally, the way the file is generated would be corrected so that everything is encoded consistently, but the file is provided by a third party, so we're not in a position to make such changes.
So I'll demonstrate how to correct these errors using some Linux commands.
Identifying the error
To identify the offending character, we can use the iconv command as follows:
[root@webserver failed]# iconv -f utf-8 bad_file.xml -o /dev/null
iconv: illegal input sequence at position 25691275
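Here we can see that the offending character is at position 25691275, which is a byte offset into the file. Note that iconv stops at the first illegal sequence it finds, so there may be more bad characters further on.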
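To view the text around this position, use the following command. It reads up to 35 bytes past the offset and then keeps only the last 70 bytes, so you see the offending character in context instead of the whole file up to that point:
head -c 25691310 bad_file.xml | tail -c 70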
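When the offending bytes are not printable, a hex dump of the same region can be easier to read. The following is a minimal sketch, assuming an od that supports the -A, -t, -j and -N options (as GNU coreutils does); show_context and its RADIUS argument are hypothetical names introduced here for illustration, not part of iconv or of the original workflow.

```shell
# show_context FILE OFFSET [RADIUS]
# Hypothetical helper: hex-dump the bytes surrounding a byte offset.
# The body runs in a subshell, so its variables do not leak to the caller.
show_context() (
    file=$1
    offset=$2
    radius=${3:-16}
    # Start RADIUS bytes before the offset, but never before byte 0.
    start=$(( offset > radius ? offset - radius : 0 ))
    count=$(( offset - start + radius ))
    # -A d : print file offsets in decimal, matching iconv's message
    # -t x1: print each byte as two hex digits
    # -j   : skip to the start offset; -N: dump only COUNT bytes
    od -A d -t x1 -j "$start" -N "$count" "$file"
)

# Example call (not executed here):
#   show_context bad_file.xml 25691275
```

Each output line begins with the decimal offset of its first byte, so the byte reported by iconv can be found by counting across from there.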
Removing the bad characters
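If you can get away with removing the bad characters without having to replace them, then you can use this command. The -c option tells iconv to discard any characters it cannot convert instead of stopping with an error: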
iconv -f utf8 -t utf8 -c bad_file.xml > fixed.xml
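Because the bad bytes are discarded silently, it can be worth checking how much was removed and that the result is now clean. A minimal sketch under those assumptions follows; count_dropped_bytes is a hypothetical helper name, and it simply compares file sizes, so it only tells you how many bytes were lost, not where.

```shell
# count_dropped_bytes ORIGINAL CLEANED
# Hypothetical helper: print how many bytes the cleaned copy is missing
# compared with the original.
count_dropped_bytes() (
    orig_size=$(wc -c < "$1")
    new_size=$(wc -c < "$2")
    echo $(( orig_size - new_size ))
)

# Example calls (not executed here):
#   count_dropped_bytes bad_file.xml fixed.xml
#   iconv -f utf-8 -t utf-8 fixed.xml -o /dev/null   # should now report no error
```

A small number suggests a few stray bytes were removed; a large one suggests that whole passages were in another encoding, in which case converting them would be better than deleting them.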
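If you want to fix a batch of files, you can use this command. It assumes that the ../fixed directory already exists, along with any subdirectories that find descends into. The file name is passed to bash as an argument ($1) rather than pasted into the script text, so names containing quotes or other special characters are handled safely:
find . -type f -exec bash -c 'iconv -f utf8 -t utf8 -c "$1" > ../fixed/"$1"' _ {} \;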
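If the files are spread over subdirectories, the output tree has to be created as well. The following is a rough sketch of one way to do that, not a drop-in replacement; fix_tree is a hypothetical helper name, and it assumes the destination directory lies outside the source directory so that find does not pick up its own output.

```shell
# fix_tree SRC_DIR DEST_DIR
# Hypothetical helper: mirror SRC_DIR into DEST_DIR, stripping invalid
# UTF-8 sequences from every regular file along the way.
fix_tree() (
    mkdir -p "$2" || exit 1
    # Resolve DEST_DIR to an absolute path before changing directory.
    dest=$(cd "$2" && pwd) || exit 1
    cd "$1" || exit 1
    # File names are passed as arguments, never spliced into the script.
    find . -type f -exec sh -c '
        dest=$1; shift
        for path do
            mkdir -p "$dest/$(dirname "$path")"
            iconv -f utf-8 -t utf-8 -c "$path" > "$dest/$path"
        done
    ' sh "$dest" {} +
)

# Example call (not executed here):
#   fix_tree . ../fixed
```

Some iconv builds exit with a non-zero status when -c has discarded something, so the exit status of fix_tree is not a reliable sign of failure; re-running iconv without -c on the output files is a safer check.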
Reference:
http://www.martinaulbach.net/linux/command-line-magic/41-dealing-with-inconsistent-or-corrupt-character-encodings