Python - dealing with mixed-encoding files

I have a file that is mostly UTF-8, but some Windows-1252 characters have found their way into it as well.

I created a table to map the Windows-1252 (cp1252) characters to their Unicode counterparts, and I would like to use it to fix the mis-encoded characters, e.g.

cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)

But attempting the replacement this way raises a UnicodeDecodeError, e.g.:

"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Any ideas on how to deal with this?

Answers

Answer 1

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

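import codecs

# Last byte offset handled, so the same position is never processed twice.
last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    # In Python 2, index 1 of the exception is the byte string being decoded.
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    # Reinterpret the single offending byte as cp1252.
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)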
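The handler above is Python 2 code. As a rough illustration only, here is a minimal Python 3 sketch of the same idea: let the UTF-8 decoder run, and whenever it rejects some bytes, decode just those bytes as cp1252. The handler name `cp1252_fallback` and the sample bytes are assumptions made for this example, not part of the original answer.

```python
import codecs

def cp1252_fallback(error):
    """Decode the bytes that the UTF-8 decoder rejected as cp1252 instead."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    return bad.decode("cp1252", errors="replace"), error.end

# "cp1252_fallback" is a hypothetical handler name chosen for this sketch.
codecs.register_error("cp1252_fallback", cp1252_fallback)

# Sample data: valid UTF-8 followed by the same text encoded as cp1252.
mixed = "maçã ".encode("utf-8") + "maçã … ".encode("cp1252")
text = mixed.decode("utf-8", errors="cp1252_fallback")
print(text)
```

Because the handler is stateless and resumes at `error.end`, no global position tracking is needed here. A cp1252 byte sequence that happens to form valid UTF-8 would still be decoded as UTF-8, so this remains a heuristic.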

Answer 2

Thanks to jsbueno, a number of other Google searches, and some trial and error, I solved it this way.

#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
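For comparison, a minimal Python 3 sketch of the same lossy approach. It assumes the input is a `bytes` object that should be read as UTF-8 (the one-liner above relies on Python 2's default codec instead), and the sample bytes are made up for illustration.

```python
# Hypothetical sample: valid UTF-8 ("café ") followed by a stray cp1252 ellipsis byte.
raw = b"caf\xc3\xa9 \x85"

# Undecodable bytes become U+FFFD, which is then swapped for a plain "?".
cleaned = raw.decode("utf-8", errors="replace").replace("\ufffd", "?")
print(cleaned)
```

As in the original one-liner, the stray byte is discarded rather than repaired.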

This version allows a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.

import codecs    
replacement = {
   '85' : '...',           # u'\u2026' ... character.
   '96' : '-',             # u'\u2013' en-dash
   '97' : '-',             # u'\u2014' em-dash
   '91' : "'",             # u'\u2018' left single quote
   '92' : "'",             # u'\u2019' right single quote
   '93' : '"',             # u'\u201C' left double quote
   '94' : '"',             # u'\u201D' right double quote
   '95' : "*"              # u'\u2022' bullet
}

#This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition   # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")

Basically, I try to decode it as UTF-8. Any bytes that fail I simply convert to hex, so that I can display them or look them up in a table of my own.

This is not pretty, but it lets me make sense of messed-up data.
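A hedged Python 3 sketch of the same table-plus-hex fallback follows. The names `ASCII_STAND_INS` and `ascii_or_hex` are hypothetical, and unlike the code above it looks up each failing byte individually rather than the hex string of the whole failing span.

```python
import codecs

# Hypothetical table: cp1252 byte value -> ASCII stand-in (mirrors the table above).
ASCII_STAND_INS = {
    0x85: "...",             # ellipsis
    0x91: "'", 0x92: "'",    # single quotes
    0x93: '"', 0x94: '"',    # double quotes
    0x95: "*",               # bullet
    0x96: "-", 0x97: "-",    # en dash, em dash
}

def ascii_or_hex(error):
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Known bytes become their stand-in; unknown ones become two hex digits.
    fixed = "".join(ASCII_STAND_INS.get(b, "%02x" % b) for b in bad)
    return fixed, error.end

codecs.register_error("ascii_or_hex", ascii_or_hex)

raw = b"wait\x85 \x93ok\x94 \xe7"
result = raw.decode("utf-8", errors="ascii_or_hex")
print(result)
```

The trailing `0xe7` byte is not in the table, so it shows up as the hex digits `e7`, which can then be searched for or added to the table.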
