Fastest Way To Find Objects From A Collection Matched By Condition On String Member

- 1 answer

Suppose I have a collection (be it an array, generic List, or whatever is the fastest solution to this problem) of a certain class, let's call it ClassFoo:

class ClassFoo
{
    public string word;
    public float score;
    // ... etc ...
}

Assume there are going to be around 50,000 items in the collection, all in memory. Now I want to obtain, as fast as possible, all the instances in the collection that satisfy a condition on their word member, for example like this:

List<ClassFoo> result = new List<ClassFoo>();
foreach (ClassFoo cf in collection)
    if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
        result.Add(cf);

How do I get the results as fast as possible? Should I consider advanced indexing techniques and data structures?

The application domain for this problem is an autocompleter that takes a query and returns a collection of suggestions. Assume that the condition doesn't get any more complex than this. Assume also that there are going to be a lot of searches.



With the constraint that the condition clause can be "anything", then you're limited to scanning the entire list and applying the condition.
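Concretely, that exhaustive scan is what the question's loop already does; in LINQ form (a sketch, using the ClassFoo class from the question and some made-up sample words) it is just:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ClassFoo
{
    public string word;
    public float score;
}

class Program
{
    static void Main()
    {
        var collection = new List<ClassFoo>
        {
            new ClassFoo { word = "apple" },
            new ClassFoo { word = "map" },
            new ClassFoo { word = "plum" },
        };
        string query = "ap";

        // O(n) per query: every element is visited, whatever the predicate is.
        List<ClassFoo> result = collection
            .Where(cf => cf.word.StartsWith(query) || cf.word.EndsWith(query))
            .ToList();

        Console.WriteLine(result.Count); // "apple" (prefix) and "map" (suffix)
    }
}
```

That is perfectly fine at small sizes; the rest of this answer is about when and how to do better.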

If there are limitations on the condition clause, then you can look at organizing the data to more efficiently handle the queries.

For example, the code sample with the "byFirstLetter" dictionary doesn't help at all with an "endsWith" query.
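To make that point concrete, here is a sketch of the idea (byFirstLetter is the hypothetical dictionary referenced above; byLastLetter is my own addition, not from the question): a first-letter index narrows a StartsWith query, but EndsWith needs its own index, e.g. keyed on the last letter.

```csharp
using System.Collections.Generic;
using System.Linq;

class PrefixSuffixIndex
{
    // One bucket per first letter (serves StartsWith) and one per
    // last letter (serves EndsWith). Each query scans only its buckets.
    private readonly Dictionary<char, List<string>> byFirstLetter = new Dictionary<char, List<string>>();
    private readonly Dictionary<char, List<string>> byLastLetter = new Dictionary<char, List<string>>();

    public void Add(string word)
    {
        if (string.IsNullOrEmpty(word)) return;
        Bucket(byFirstLetter, word[0]).Add(word);
        Bucket(byLastLetter, word[word.Length - 1]).Add(word);
    }

    // Assumes a non-empty query.
    public IEnumerable<string> Query(string q)
    {
        var starts = Bucket(byFirstLetter, q[0]).Where(w => w.StartsWith(q));
        var ends = Bucket(byLastLetter, q[q.Length - 1]).Where(w => w.EndsWith(q));
        return starts.Union(ends);
    }

    private static List<string> Bucket(Dictionary<char, List<string>> index, char key)
    {
        List<string> bucket;
        if (!index.TryGetValue(key, out bucket))
            index[key] = bucket = new List<string>();
        return bucket;
    }
}
```

With roughly even letter distribution this cuts each scan to a fraction of the collection, but note the cost: two indexes to maintain on every insert, which is exactly the index-maintenance trade-off described below.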

So, it really comes down to what queries you want to do against that data.

In databases, this problem is the burden of the "query optimizer". In a typical database with no indexes, every query is obviously a table scan. As you add indexes to the table, the optimizer can use that data to make more sophisticated query plans that get to the data faster. That's essentially the problem you're describing.

Once you have a more concrete subset of the types of queries, you can make a better decision as to which structure is best. Also, you need to consider the amount of data. If you have a list of 10 elements, each less than 100 bytes, a scan of everything may well be the fastest thing you can do, since the data set is so small. Obviously that doesn't scale to 1M elements, but even clever access techniques carry costs in setup, maintenance (like index maintenance), and memory.

EDIT, based on the comment

If it's an auto completer and the data is static, then sort it and use a binary search. You're really not going to get faster than that.
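A sketch of the static case (the helper and its name are mine, and it assumes the words are kept in an ordinally sorted array): binary-search for the first word that is >= the query, then walk forward while the prefix still matches. This covers the StartsWith half; the EndsWith half can reuse the same trick on a second sorted array of reversed words.

```csharp
using System;
using System.Collections.Generic;

static class SortedPrefixSearch
{
    // 'words' must be sorted with StringComparer.Ordinal, so that the
    // binary search and the prefix walk agree on the ordering.
    public static List<string> StartingWith(string[] words, string prefix)
    {
        int i = Array.BinarySearch(words, prefix, StringComparer.Ordinal);
        if (i < 0) i = ~i; // no exact hit: ~i is the index of the first element >= prefix

        var result = new List<string>();
        while (i < words.Length && words[i].StartsWith(prefix, StringComparison.Ordinal))
            result.Add(words[i++]);
        return result;
    }
}
```

Every word sharing a prefix is contiguous in sorted order, so each query costs O(log n) for the search plus O(k) to collect the k matches.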

If the data is dynamic, then store it in a balanced tree, and search that. That's effectively a binary search, and it lets you keep adding data as it arrives.
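In .NET, SortedSet&lt;T&gt; is already backed by a balanced (red-black) tree, so the dynamic case can be sketched like this (the ordinal comparer and the '\uffff' upper bound are my choices, not from the answer above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        // Red-black tree underneath: inserts and lookups stay O(log n)
        // even while the data keeps changing.
        var words = new SortedSet<string>(StringComparer.Ordinal)
            { "apple", "apricot", "banana", "plum" };
        words.Add("appliance"); // dynamic insert, tree rebalances itself

        string prefix = "ap";
        // All words with this prefix form a contiguous range in sort order;
        // prefix + '\uffff' is an upper bound no "ap..." word exceeds ordinally.
        var matches = words.GetViewBetween(prefix, prefix + '\uffff').ToList();

        Console.WriteLine(string.Join(",", matches));
    }
}
```

GetViewBetween returns a live view of the range rather than a copy, so the prefix lookup itself does no scanning of the rest of the set.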

Anything else is some specialization on these concepts.