Поиск комбинаций стеблей и концов

У меня есть отображения "стеблей" и "окончаний" (могут быть не правильные слова), которые выглядят так:

all_endings = {
 'birth': set(['place', 'day', 'mark']), 
 'snow': set(['plow', 'storm', 'flake', 'man']),
 'shoe': set(['lace', 'string', 'maker']),
 'lock': set(['down', 'up', 'smith']),
 'crack': set(['down', 'up',]),
 'arm': set(['chair']),
 'high': set(['chair']),
 'over': set(['charge']),
 'under': set(['charge']),
}

Но гораздо дольше, конечно. Я также сделал соответствующий словарь по-другому:

all_stems = {
 'chair': set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'up': set(['lock', 'crack', 'vote']),
 'down': set(['lock', 'crack', 'fall']),
 'smith': set(['lock']),
 'place': set(['birth']),
 'day': set(['birth']),
 'mark': set(['birth']),
 'plow': set(['snow']),
 'storm': set(['snow']),
 'flake': set(['snow']),
 'man': set(['snow']),
 'lace': set(['shoe']),
 'string': set(['shoe']),
 'maker': set(['shoe']),
}

Теперь я попытался найти алгоритм, чтобы найти совпадение двух или более "стеблей", которые соответствуют двум или более "концам". Выше, например, он будет соответствовать вниз и вверх с блокировкой и трещиной, в результате чего

lockdown
lockup
crackdown
crackup

Но не включая 'upvote', 'downfall' or 'locksmith' (и это то, что вызывает у меня самые большие проблемы). Я получаю ложные срабатывания вроде:

pancake
cupcake
cupboard

Но я просто обойдусь "петлями". (Пун предназначался), и я, кажется, никуда не денутся. Я был бы признателен за любой удар в правильном направлении.

До сих пор запутанный и бесполезный код, который вы, вероятно, должны просто игнорировать:

findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:

MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)

Ответы

Ответ 1

Это не особенно красиво, но это довольно просто, если вы разложите словарь на два списка и используете явные индексы:

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])

Вывод:

$ python stems_and_endings.py 
2 matches found
lockdown
lockup
crackdown
crackup

В основном все, что мы делаем, это итерация через каждый набор по очереди и сравнение его с каждым оставшимся набором, чтобы увидеть, есть ли более двух совпадений. Нам никогда не приходится пытаться выполнять наборы, которые падают раньше текущего набора, потому что они уже были сопоставлены в предыдущей итерации. Остальное (индексирование и т.д.) - это просто бухгалтерский учет.

Ответ 2

Я думаю, что способ избежать этих ложных срабатываний - удалить кандидатов без слов в пересечении стеблей. Если это имеет смысл:(

Пожалуйста, посмотрите и, пожалуйста, дайте мне знать, если я что-то упустил.

#using all_stems and all_endings from the question

#this function is declared at the end of this answer
two_or_more_stem_combinations = get_stem_combinations(all_stems)
print "two_or_more_stem_combinations", two_or_more_stem_combinations
#this print shows ... [set(['lock', 'crack'])] 

for request in two_or_more_stem_combinations:
    #we filter the initial index to only look for sets or words in the request
    candidates = filter(lambda x: x[0] in request, all_endings.items())

    #intersection of the words for the request
    words = candidates[0][1]
    for c in  candidates[1:]:
        words=words.intersection(c[1])

    #it handy to have it in a dict
    candidates = dict(candidates)

    #we need to remove those that do not contain 
    #any words after the intersection of stems of all the candidates
    candidates_to_remove = set()
    for c in candidates.items():
        if len(c[1].intersection(words)) == 0:
        candidates_to_remove.add(c[0])

    for key in candidates_to_remove:
        del candidates[key]

    #now we know what to combine
    for c in candidates.keys():
       print "combine", c , "with", words

Выход:

объединить блокировку с множеством (['down', 'up'])

объединить трещины с множеством (['down', 'up'])

Как вы можете видеть, это решение не содержит эти ложные срабатывания.

Изменить: сложность

И сложность этого решения не хуже, чем O (3n) в худшем сценарии - без учета доступа к словарям. А также для большинства исполнений первый фильтр сужает довольно много пространства решений.

Изменить: получение стеблей

Эта функция в основном исследует рекурсивно словарь all_stems и находит комбинации двух или более окончаний, для которых два или более стебля совпадают.

def get_stems_recursive(stems,partial,result,at_least=2):
    if len(partial) >= at_least:
        stem_intersect=all_stems[partial[0]]
        for x in partial[1:]:
           stem_intersect = stem_intersect.intersection(all_stems[x])
           if len(stem_intersect) < 2:
               return
        result.append(stem_intersect)

    for i in range(len(stems)):
        remaining = stems[i+1:]
        get_stems_recursive(remaining,partial + [stems[i][0]],result)


def get_stem_combinations(all_stems,at_least=2):
    result = []
    get_stems_recursive(all_stems.items(),list(),result)
    return result

two_or_more_stem_combinations = get_stem_combinations(all_stems)

Ответ 3

== Отредактированный ответ: ==

Ну, вот еще одна итерация для вашего рассмотрения с ошибками, которые я сделал в первый раз. На самом деле результатом является код, который еще короче и проще. В документе combinations говорится, что "если входные элементы уникальны, в каждой комбинации не будет повторяющихся значений", поэтому он должен только формировать и тестировать минимальное количество пересечений. Также представляется, что определение endings_by_stems не требуется.

from itertools import combinations

MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(stems_by_endings, 2):
    ending_words0 = stems_by_endings[word0]
    ending_words1 = stems_by_endings[word1]
    common_endings = ending_words0 & ending_words1
    if len(common_endings) >= MINMATCH:
        for stem in common_endings:
            print ' ', stem+word0
            print ' ', stem+word1

# all words with at least 2 endings in common:
#   lockdown
#   lockup
#   falldown
#   fallup
#   crackdown
#   crackup

== Предыдущий ответ ==

Я не пытаюсь много оптимизировать, но здесь несколько грубой силы, но короткого подхода, который сначала вычисляет "end_sets" для каждого слова стебля, а затем находит все слова стебля, которые имеют общие конечные_секунды, по крайней мере, указанное минимальное количество общих окончаний.

В заключительной фазе он печатает все возможные комбинации этих слов стебля + окончания, которые он обнаружил, которые соответствуют критериям. Я попытался сделать все имена переменных максимально описательными, чтобы было легко следовать.;-) Я также не учитывал определения all_endings' and 'all+stems.

from collections import defaultdict
from itertools import combinations

ending_sets = defaultdict(set)
for stem in all_stems:
    # create a set of all endings that have this as stem
    for ending in all_endings:
        if stem in all_endings[ending]:
            ending_sets[stem].add(ending)


MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(ending_sets, 2):
    ending_words0 = ending_sets[word0]
    ending_words1 = ending_sets[word1]
    if len(ending_words0) >= MINMATCH and ending_words0 == ending_words1:
        for stem in ending_words0:
            print ' ', stem+word0
            print ' ', stem+word1

# output
# all words with at least 2 endings in common:
#   lockup
#   lockdown
#   crackup
#   crackdown

Ответ 4

Если вы представляете свои связи в квадратных двоичных массивах (где 1 означает, что "x может следовать за y", например, и где другие элементы установлены в 0), то, что вы пытаетесь сделать, эквивалентно поиску "сломанные прямоугольники", заполненные единицами:

       ... lock  **0  crack  **1  ...

...    ...
down   ...  1     0     1     1
up     ...  1     1     1     1
...    ...

Здесь lock, crack и **1 (примерное слово) можно сопоставить с down и up (но не с словом **0). Сопряженные отношения рисуют прямоугольник 2x3, заполненный с помощью.

Надеюсь, это поможет!