Eliminate Overlap Between Two Text Blocks Using Python

I have two text files that slightly overlap, e.g.:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

As you can see, the last sentence of text1 and the first sentence of text2 overlap. I would like to get rid of this overlap, essentially deleting from the start of text2 the words that also appear at the end of text1.

To do so, I can extract the last sentence of text1:

text1_last_sentence = list(filter(None,text1.split(".")))[-1]

And the first sentence of text2:

text2_first_sentence = text2.split(".")[0]

... but now the question is:

How do I find the part of the first sentence of text2 that should stay in text2, and then put everything back together?

EDIT 1:

The expected output:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

EDIT 2:

Here is the complete code, followed by its output:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy.""" 

text1_last_sentence = list(filter(None,text1.split(".")))[-1]
text2_first_sentence = text2.split(".")[0]

print(text1_last_sentence, "\n")
print(text2_first_sentence, "\n")

The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in

theory or investigate a phenomenon in greater detail

Answer

Here is a way to do it that finds the largest possible overlap:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

def remove_overlap(text1, text2):
    """Return the part of text2 that doesn't overlap with the end of text1."""

    words1 = text1.split()
    words2 = text2.split()

    # nothing to compare if either text has no words
    if not words1 or not words2:
        return text2

    # all appearances of the last word of text1 in text2
    last_word_appearances = [index for index, word in enumerate(words2) if word == words1[-1]]
    # we look for the largest possible overlap, so start from the last appearance
    for n in reversed(last_word_appearances):
        # are the first n+1 words of text2 the same as the last n+1 words of text1?
        if words2[:n+1] == words1[-(n+1):]:
            # note: joining with single spaces normalizes the whitespace of text2
            return ' '.join(words2[n+1:])

    # no overlap found
    return text2


remove_overlap(text1, text2)
# 'greater detail.There are still some deficiencies in [...]
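
The word-based function above splits and rejoins text2, so any line breaks or repeated spaces in text2 are collapsed to single spaces. If that matters, one alternative is to compare characters instead of words. The following is only a sketch under that assumption: remove_overlap_chars is a name made up for this example, not part of the original answer, and because it works on characters it can also match a partial word (for example a text1 ending in "in" followed by a text2 starting with "inside").

```python
def remove_overlap_chars(text1, text2):
    """Return text2 without its longest prefix that is also a suffix of text1."""
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(min(len(text1), len(text2)), 0, -1):
        if text1.endswith(text2[:size]):
            return text2[size:]
    # no overlap found
    return text2


# Shortened versions of the texts from the question
text1 = "to test a proposed theory or investigate a phenomenon in"
text2 = "theory or investigate a phenomenon in greater detail.There are still"

rest = remove_overlap_chars(text1, text2)
print(repr(rest))    # the leading space of the remainder is preserved
print(text1 + rest)  # plain concatenation puts the two texts back together
```

Since the remainder keeps its original leading whitespace, text1 + rest rebuilds the combined text without adding a separator. With the word-based remove_overlap, the equivalent would be text1 + ' ' + remove_overlap(text1, text2).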
source: stackoverflow.com