Ad

Is There A Pythonic Way Of Filtering Substrings Of Strings In A List?

- 1 answer

I have a list with strings as below.

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]

And I want the list to be filtered as ["HelloWorld", "Foo", "Bar"], because others are substrings. I can do it like this, but don't think it's fast or elegant.

def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive

Is there any fast way to do it?

Ad

Answer

How about:

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
result = [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
print(result)

Counter to what was suggested in the comments:

from timeit import timeit


def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring2a(candidates):
    return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]


def filter_not_substring2b(candidates):
    return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring(xs), filter_not_substring2a(xs), filter_not_substring2b(xs))
print(timeit(lambda: filter_not_substring(xs)))
print(timeit(lambda: filter_not_substring2a(xs)))
print(timeit(lambda: filter_not_substring2b(xs)))

Result:

['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar']
1.5163685
4.6516653
3.8334089999999996

So, OP's solution is substantially faster, but filter_not_substring2b is still about 20% faster than 2a. So, putting the len comparison first doesn't save time.

For any production scenario, OP's function is probably optimal - a way to speed it up might be to bring the whole problem into C, but I doubt that would show great gains, since the logic is pretty straightforward already and I'd expect Python to do a fairly good job of it as well.

User @ming noted that OP's solution can be improved a bit:

def filter_not_substring_b(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a in b and a != b:
                break
        else:
            survive.append(a)
    return survive

This version of the function is somewhat faster, for me about 10-15%

Finally, note that this is only just faster than 2b, even though it is very similar to the optimised solution by @ming, but almost 3x slower than their solution. It's unclear to me why that would be - if anyone has fairly certain thoughts on that, please share in the comments:

def filter_not_substring_c(candidates):
    return [a for a in candidates if all(a not in b or a == b for b in candidates)]
Ad
source: stackoverflow.com
Ad