An alternative to zip() for iterating over two iterators

I have two large (~100 GB) text files that I need to iterate over simultaneously.

zip() works well for smaller files, but I found out that it actually builds a list of lines from my two files, so every line ends up stored in memory. I don't need to do anything with the lines more than once.

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
    pass

Is there an alternative to zip() that acts as a generator and would let me iterate over these two files without using more than 200 GB of RAM?

Answers

Answer 1

In Python 2, itertools has a function, izip, that does this:

from itertools import izip
for i, j in izip(handle1, handle2):
    ...

If the files have different lengths, you can use izip_longest instead, since izip stops at the end of the shorter file.
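
For readers on Python 3: itertools.izip no longer exists there, and the built-in zip() already returns a lazy iterator, so it can be used directly. A minimal sketch, assuming a hypothetical helper name (combined_lines) and a made-up "do something" step that joins each pair of lines with a tab; neither comes from the question:

```python
def combined_lines(lines_a, lines_b):
    """Lazily yield one output line per pair of input lines.

    The helper name and the tab-joined output format are illustrative
    assumptions; replace the body of the loop with the real processing.
    """
    # In Python 3, zip() pulls one item from each iterable per step,
    # so only the current pair of lines is held in memory.
    for a, b in zip(lines_a, lines_b):
        yield a.rstrip("\n") + "\t" + b.rstrip("\n") + "\n"
```

Open file handles are iterables of lines, so two handles can be passed in directly and the result fed to the output file's writelines() method. Like izip, this stops at the end of the shorter input.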

Answer 2

You can use izip_longest like this to pad the shorter file with empty strings:

In Python 2.6:

from itertools import izip_longest
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

or in Python 3.1:

from itertools import zip_longest
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
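
To see the padding behaviour concretely, here is a small Python 3 sketch that uses in-memory io.StringIO objects as stand-ins for the two files. The sample contents are made up for illustration:

```python
import io
from itertools import zip_longest

# Stand-ins for two open text files of different lengths (made-up data).
short_file = io.StringIO("a1\na2\n")
long_file = io.StringIO("b1\nb2\nb3\n")

# zip_longest keeps going until the longer input is exhausted and
# substitutes fillvalue for the input that ran out first.
pairs = list(zip_longest(short_file, long_file, fillvalue=""))

# pairs == [("a1\n", "b1\n"), ("a2\n", "b2\n"), ("", "b3\n")]
```

Because lines read from a file keep their trailing newline, a genuinely blank line arrives as "\n", so the "" fill value can still be told apart from real content. Building a list is only for the demonstration; with 100 GB files you would loop over zip_longest directly.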

Answer 3

If you want to truncate to the shorter file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

try:
    while True:
        # next() raises StopIteration as soon as either file runs out
        i = next(handle1)
        j = next(handle2)

        # do something with i and j
        # write to an output file

except StopIteration:
    pass

finally:
    handle1.close()
    handle2.close()

Otherwise, to keep going until both files are exhausted:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

i_ended = False
j_ended = False
while True:
    try:
        i = next(handle1)
    except StopIteration:
        # pad the exhausted file with an empty string
        i, i_ended = '', True
    try:
        j = next(handle2)
    except StopIteration:
        j, j_ended = '', True
    if i_ended and j_ended:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()

or

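handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

while True:
    # readline() returns '' once a file has reached its end
    i = handle1.readline()
    j = handle2.readline()

    # stop before processing when both files are exhausted
    if not i and not j:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()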

Answer 4

In Python 3, izip_longest is actually called zip_longest.

from itertools import zip_longest

for i, j in zip_longest(handle1, handle2):
    ...

Answer 5

Something like this? It's wordy, but it seems to be what you're asking for.

It can be adapted to do something like a proper merge that matches keys between the two files, which is often more important than a simplistic zip. Also, it doesn't truncate, which is how the SQL OUTER JOIN algorithm behaves. Again, that differs from what zip does and is more typical for files.

def parallel(file1, file2):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None  # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1 = next(file1)
            except StopIteration:
                if1_more = False
        if if2_more:
            try:
                line2 = next(file2)
            except StopIteration:
                if2_more = False
        # Skip the final pass, where both files have just run out
        if if1_more or if2_more:
            yield line1, line2

with open("file1", "r") as file1:
    with open("file2", "r") as file2:
        for line1, line2 in parallel(file1, file2):
            pass  # process lines
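
The key-matching merge mentioned in this answer could look roughly like the sketch below. It is not the answerer's code. It assumes that both inputs are already sorted in ascending order by key, that keys are unique within each input, and that the key is the first tab-separated field of each line (compared as a string). The function name outer_join and the default key function are hypothetical.

```python
def outer_join(lines_a, lines_b, key=lambda line: line.split("\t", 1)[0]):
    """Yield (line_a, line_b) pairs matched by key; None marks a missing side.

    Assumptions (not from the original answer): both inputs are sorted
    ascending by key, keys are unique within each input, and the key is
    the first tab-separated field, compared as a string.
    """
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    a = next(iter_a, None)
    b = next(iter_b, None)
    while a is not None or b is not None:
        if b is None or (a is not None and key(a) < key(b)):
            # Key present only in the first input: advance that side.
            yield a, None
            a = next(iter_a, None)
        elif a is None or key(b) < key(a):
            # Key present only in the second input: advance that side.
            yield None, b
            b = next(iter_b, None)
        else:
            # Same key on both sides: emit the match and advance both.
            yield a, b
            a = next(iter_a, None)
            b = next(iter_b, None)
```

Only one line from each input is held at a time, so this stays as memory-friendly as parallel() above. Unlike the zip-style pairing, a line missing from one file no longer shifts every later pair out of alignment.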