Fastest way to strip punctuation from a unicode string in Python

I am trying to efficiently strip punctuation from a unicode string. With a regular string, using mystring.translate(None, string.punctuation) is clearly the fastest approach. However, this code breaks on a unicode string in Python 2.7. As the comments on this answer explain, the translate method can still be used, but it has to be given a dictionary. When I use that implementation, I find that the performance of translate drops dramatically. Here is my timing code (copied mostly from this answer):

As my results show, the unicode implementation of translate performs horribly:

My question is whether there is a faster way to implement translate for unicode (or any other method) that would outperform regex.
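For reference, the dictionary-based translate mentioned above can be sketched roughly as follows. This is a minimal illustration written for Python 3 (an assumption; the question targets Python 2.7, where the same dict is passed to unicode.translate). The names ASCII_PUNCT_TABLE and strip_ascii_punctuation are hypothetical, not taken from the original code.

```python
import string

# Assumption: Python 3, where str.translate takes a dict keyed by code point.
# Mapping a code point to None deletes that character.
ASCII_PUNCT_TABLE = {ord(ch): None for ch in string.punctuation}


def strip_ascii_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return text.translate(ASCII_PUNCT_TABLE)


print(strip_ascii_punctuation("Hello, world! (It's me.)"))
# prints: Hello world Its me
```

The table is built once, outside the function, so repeated calls only pay for the translate itself.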

Answers

Answer 1

The current test script is flawed, because it does not compare like with like.

For a fairer comparison, all the functions must be run with the same set of punctuation characters (i.e. either all ascii, or all unicode).
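One way to keep the comparison fair is to derive every method's input from a single punctuation string. The sketch below assumes Python 3; punctuation_chars is a hypothetical helper name used only for illustration.

```python
import string
import sys
import unicodedata


def punctuation_chars(full_unicode=False):
    """Return the punctuation characters to strip, as one string."""
    if not full_unicode:
        return string.punctuation
    # Every code point whose Unicode general category starts with 'P'.
    return ''.join(
        chr(i) for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith('P')
    )


punc = punctuation_chars(full_unicode=True)

# Derive each method's lookup structure from the same `punc`, so that
# all of them strip exactly the same characters.
exclude = set(punc)                      # for the "set" method
table = dict.fromkeys(map(ord, punc))    # code point -> None, for translate
```

Note that the two sets are not nested: a few characters in string.punctuation, such as '$' and '+', are classified as symbols (category 'S') rather than punctuation, so the full unicode set built this way does not contain them.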

When that is done, the regex and replace methods do much worse with the full set of unicode punctuation characters.

For full unicode, it looks like the "set" method is the best. However, if you only want to remove the ascii punctuation characters from unicode strings, it may be best to encode, translate, and decode (depending on the length of the input string).
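Both approaches can be sketched as below. This assumes Python 3, where bytes.translate accepts None as the table when only deleting bytes; the function names are hypothetical.

```python
import string

ASCII_PUNCT_BYTES = string.punctuation.encode('ascii')


def strip_with_set(text, exclude):
    """The "set" method: keep only characters that are not in exclude."""
    return ''.join([ch for ch in text if ch not in exclude])


def strip_ascii_via_bytes(text):
    """Encode to UTF-8, delete the ASCII punctuation bytes, decode again."""
    # In UTF-8, byte values below 0x80 only ever encode ASCII characters,
    # so deleting them cannot corrupt a multi-byte sequence.
    raw = text.encode('utf-8')
    return raw.translate(None, ASCII_PUNCT_BYTES).decode('utf-8')
```

The byte-level version leaves non-ascii punctuation such as the curly apostrophe untouched, which is why it only suits the ascii-only case.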

The "replace" method can also be substantially improved by doing a containment test before attempting each replacement (depending on the precise make-up of the string).
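A minimal sketch of that idea, assuming Python 3 and using the hypothetical name strip_with_replace:

```python
def strip_with_replace(text, punc):
    """Remove each character of punc from text, one replace call at a time."""
    for ch in punc:
        # Only call replace for characters that actually occur in the text.
        if ch in text:
            text = text.replace(ch, '')
    return text


print(strip_with_replace("a,b.c!", ",.!"))
# prints: abc
```

Whether the extra test pays off depends on how many of the punctuation characters actually appear in the input; with a large punctuation set and text that contains few of them, most iterations reduce to a single membership check.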

Here are some sample results from a re-hash of the test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...

set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...

set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...

set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

And here is the re-hashed script:

# -*- coding: utf-8 -*-

import re, string, timeit
import unicodedata
import sys


# String from this article: www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    # Note: this only ever strips ascii punctuation, even in the unicode run.
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

# Defaults for the unicode runs: name of the string to time, and regex template.
txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch):None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

# Build the set and the regex from the same punctuation string as the other methods.
exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s  as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')
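The script above is written for Python 2. For anyone re-running the comparison on Python 3, the timing helper could be sketched along these lines. This is an assumption-laden port, not part of the original script: this time_func takes the function and text directly instead of importing names from __main__.

```python
import timeit


def time_func(name, func, text, n=10000):
    """Time n calls of func(text) and print the result as 'name: seconds'."""
    # The lambda adds a small, constant overhead to every measurement.
    elapsed = timeit.timeit(lambda: func(text), number=n)
    print('%s: %s' % (name, elapsed))
    return elapsed


# Example with a stand-in function; substitute the real test functions.
time_func('upper', str.upper, 'some sample text', n=1000)
```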