Which collection interface to use?

Let’s talk about when to use which collection type and why.

1. The guideline, again

Last time, we discussed the following guideline:

  • Use the most generic types possible for arguments,

  • Use the most specific types possible for return values.

Or, to be more precise:

  • Use the most generic types for arguments, while keeping the method’s signature honest,

  • Use the most specific types for return values, while keeping the method’s signature honest.

Here’s this guideline in one picture:

The guideline in one picture
The guideline in one picture

To summarize the reasoning behind it:

  • On the one hand, argument types provide the most utility when they are as generic as possible, because the method starts to apply to a wider range of input values.

  • On the other hand, return types provide the most utility when they are as specific as possible, because a specific return value exposes more functionality than a generic one.

2. Which collection type to return?

The use of collection types (and their interfaces) should also follow this guideline. You should also return the most specific and accept the most generic collection type.

Here’s the collection interface hierarchy in .NET:

Collection interface hierarchy in .NET
Collection interface hierarchy in .NET

Note that it’s quite simplified, but good enough for our purposes. You can find the complete hierarchy diagram in this article: C# Read-Only Collections and LSP.

Given this hierarchy, which collection type should you use as a return value?

Well, following the guideline, it should be the most specific type, but which one exactly? There are multiple leaves in this hierarchy tree:

  • IQueryable

  • IList

  • Array (such as int[])

  • IReadOnlyList

Let’s discuss them one by one.

3. IQueryable is a leaky abstraction

The first option is IQueryable. This interface allows you to query the database using LINQ expressions the same way you would query an in-memory collection:

Convenience of IQueryable interface
Convenience of IQueryable interface

The difference here is that LINQ expressions that work with IQueryable are executed in the database, not in-memory. ORMs such as EF Core and NHibernate implement an IQueryable provider, which translates LINQ expressions into SQL queries.

You can think of IQueryable as an abstraction that allows you to work with different data sources the same way.

IQueryable is particularly popular in repositories, where you can introduce a GetAll method, like this:

// Student repository
public IQueryable<Student> GetAll()
{
    return _context.Set<Student>();
}

And then use it to apply required filters in the controller:

// Student controller
public IEnumerable<StudentDto> GetAll()
{
    var students = _repository.GetAll()
        .Where(x => x.Email.EndsWith(".edu")); // Executed in the database

    /* ... */
}

This approach looks logical at first. The client code (in our case, the controller) can build filters on top of IQueryable to its own liking, depending on the scenario. In the above example, for instance, we are selecting all students with .edu emails. This filter gets translated into SQL and is executed in the database, so there’s no excess data transfer between the database and our API.

But the use of IQueryable as a return value has a significant drawback: this interface is a leaky abstraction. As we discussed last time, a leaky abstraction is an abstraction that requires you to know implementation details it ought to abstract.

When using IQueryable, you have to know which LINQ expressions can be converted into SQL and which can’t. Look at this code:

IQueryable is a leaky abstraction
IQueryable is a leaky abstraction

Here, the EndsWith method can be translated into SQL, but MapToDto can’t. That’s why we first call ToList. This method prompts EF Core to execute the query we’ve built up so far and retrieve the resulting data into memory. All the subsequent LINQ expressions are executed against that in-memory data.

And that is exactly an example of a leaky abstraction. When using IQueryable, we have to know which LINQ expressions are supported by EF Core; we can’t simply execute any LINQ expression of our choice. In other words, we have to know how EF Core implements its IQueryable provider.

IQueryable interface is too broad
IQueryable interface is too broad

And by the way, this is not the fault of EF Core (or NHibernate for that matter). ORMs do a great job supporting as many LINQ expressions as humanly possible. The problem is that the IQueryable interface itself is too broad. Its intention is to allow any LINQ expression to be translated into SQL, which is an impossible task to begin with. There’s an infinite number of LINQ expressions, and only so many of them can be converted into SQL.

Therefore, it is misleading to use IQueryable as part of the public API. It makes it seem as though you can apply any expression on top of this return value, but in reality, you have to keep in mind which LINQ expressions the ORM can understand.

The same is true when you accept an expression as an argument in a repository, like this:

// Student repository
public IQueryable<Student> GetAll(
    Expression<Func<Student, bool>> predicate)
{
    return _context.Set<Student>().Where(predicate);
}

Expressions, when used as part of the public API, are also leaky abstractions because they make it seem as though you can supply any LINQ expression to the method, whereas, once again, only a limited number of them can be understood by the ORM.

So, what’s the solution here?

The solution is to not use IQueryable (or expressions) as part of the repository’s public API. You can still use it internally, because the repository is aware of which ORM it works with and its limitations. But that knowledge shouldn’t cross the repository’s boundary. In other words, this knowledge shouldn’t leak to the clients of your repository; those clients shouldn’t be aware of the repository’s implementation details.

Here’s how we can do this:

// Student repository
public IEnumerable<Student> GetAll(string emailDomain)
{
    IQueryable<Student> queryable = _context.Set<Student>();

    if (string.IsNullOrWhiteSpace(emailDomain) == false)
    {
        queryable = queryable.Where(x => x.Email.EndsWith(emailDomain));
    }

    return queryable.AsEnumerable();
}

Notice that the filter is now represented as a plain emailDomain argument. There’s now no way to apply a filter that isn’t supported by the ORM. Also notice that we still use IQueryable, but only internally. We don’t leak it anywhere outside the repository.

The client code doesn’t have to know the internal implementation details of this method anymore. This method has become a good abstraction (as opposed to a leaky one).

4. IList and Array

Alright, so IQueryable is out of the question.

You shouldn't use IQueryable as a return type
You shouldn’t use IQueryable as a return type

What about the other 3 leaves in the collection hierarchy tree?

  • IList

  • Array (such as int[])

  • IReadOnlyList

List<T> vs IList<T>

Notice that I used the interface IList throughout this article and not the concrete type List.

The two are almost identical functionality-wise, and so even though technically, List is more concrete than IList, in practice, it doesn’t really matter which one you use.

Can we use IList or an array as the return value?

// Option 1
public IList<Student> GetAll(string emailDomain)

// Option 2
public Student[] GetAll(string emailDomain)

When deciding which type to use as a return value, we not only need to be cautious of leaking abstractions, but also need to consider whether this type correctly represents what the client can and can’t do with the return data.

For example, the client code can change a list of an array of students after it gets them from the repository:

IList<Student> students = _repository.GetAll(".edu");

students.Add(new Student());   // We can extend the collection
students[0] = null;            // Or modify it

But what does it mean to change that collection?

You may do that for two reasons:

  • You are modifying the collection in order to add or remove students from the database.

  • You are reusing the collection for some other purpose. For example, you need to filter these students using some criteria and instead of creating a new collection, you are removing students from the collection you already have.

Both of these reasons are invalid.

Returning a mutable collection makes it seem as though the students it represents can be modified by simply adding or removing an element from that collection, which is not the case.

The moment the information about students leaves the database (as soon as SELECT SQL statement is finished), that information becomes detached from the database and modifying it doesn’t do anything.

Note

If you are using an ORM (NHibernate or EF Core), then individual students in the collection are attached to the original data (rows in the database), and modifying them does affect that original data.

But that’s not the case for the collection of students. Modifying that collection doesn’t add or delete data from the database.

Because of that detachment, we shouldn’t use a mutable collection, since it doesn’t properly represent what the client can and can’t do with the data in the database.

And you also shouldn’t modify the collection in order to reuse it for some other purpose, like filtration. Unless you have severe performance limitations (which is almost never the case), you should just create a new collection with the filtered data instead. Unnecessary data modification often leads to a whole set of hard-to-debug issues, which you can avoid simply by make that data immutable.

And so we should only use immutable collection types for both arguments and return values. There are 3 such types in this hierarchy:

There are only 3 immutable collection types in .NET BCL
There are only 3 immutable collection types in .NET BCL

Following the guideline, we should:

  • Use IEnumerable for arguments as the most generic type possible,

  • Use IReadOnlyList for return values as the most specific type possible.

Here’s our GetAll method, refactored toward the use of IReadOnlyList:

// Student repository
public IReadOnlyList<Student> GetAll(string emailDomain)
{
    IQueryable<Student> queryable = _context.Set<Student>();

    if (string.IsNullOrWhiteSpace(emailDomain) == false)
    {
        queryable = queryable.Where(x => x.Email.EndsWith(emailDomain));
    }

    return queryable.ToList();
}

We use IReadOnlyList here instead of IEnumerable because IReadOnlyList provides more functionality out of the box. For example, it has the Count property and it also allows you to access the collection through an index. IEnumerable doesn’t have any of that, and you have to resort to LINQ extension methods to compensate for the lack of this functionality.

Note

IReadOnlyList isn’t necessarily immutable behind the scenes. The concrete class List also implements IReadOnlyList and in theory, you could cast IReadOnlyList back to List and mutate it.

But doing so ties your code to implementation details of the method returning IReadOnlyList and makes that code fragile. The client code shouldn’t rely on any information other than provided by the method signature and shouldn’t assume anything about the object other than what its type says. In other words, no downcast.

If you must change the returned collection, then just return IList instead of IReadOnlyList, or create a new collection based on the IReadOnlyList at hand.

5. IEnumerable is also a leaky abstraction

So, we should use IReadOnlyList as the collection return type. But why do so many well-known libraries use IEnumerable instead? Even though it provides less functionality compared to IReadOnlyList?

One reason is lazy evaluation. In .NET BCL, most LINQ statements are evaluated lazily when working with IEnumerable.

Look at this example:

int[] numbers = new int[] { 1, 2, 3 };

IEnumerable<int> r1 = numbers.Where(x => x > 1);  '1

IEnumerable<int> r2 = r1.Select(x => x + 1);      '2

int result = r2.Sum();                            '3

Here, the execution of lines '1 and '2 is postponed up to line '3. These LINQ statements are only evaluated when the whole thing needs to "collapse" in order to produce the end result. This happens when calling ToList(), Sum(), or similar.

If you need to maintain lazy evaluation, then use IEnumerable as the collection return type. Otherwise stick to IReadOnlyList.

5.1. IEnumerable with an ORM

Note that lazy evaluation also leads to IEnumerable being a leaky abstraction in some scenarios. Look at this code for example:

Lazy evaluation of IEnumerable with an ORM
Lazy evaluation of IEnumerable with an ORM

The code in the first line (and even in the second one) is only executed when we call ToList. If by that time the DbContext/ISession is already disposed of, we get an exception:

The DbContext gets disposed on at the end of the method
The DbContext gets disposed of at the end of the method

This is another example of a leaky abstraction. It’s a leaky abstraction because EF Core requires you to know whether the underlying database connection is still open at the time of IEnumerable evaluation.

In other words, EF Core imposes an additional, hidden precondition. This precondition is hidden because nowhere in the IEnumerable interface’s signature does it tell you to keep an eye on the database connection, and yet, EF Core does impose that requirement.

Note that the leakiness of the IEnumerable interface is a much smaller issue compared to IQueryable because the presence of a non-disposed DbContext (or ISession in case of NHibernate) is a reasonable assumption in most web applications. You will almost always have an active DbContext instance for the whole duration of the business operation; the DbContext is usually disposed of only at the end of that operation.

Still, the fact remains: lazy evaluation of database operations violates the LSP principle for the IEnumerable interface and makes the implementation of that interface a leaky abstraction. If you aren’t planning on using lazy evaluation, it’s best to stick to IReadOnlyList instead.

5.2. IEnumerable without an ORM

Note that implementations of IEnumerable often act as leaky abstractions even in scenarios without an ORM.

For example, BlockingCollection implements IEnumerable in such a way that it blocks the calling thread in method MoveNext() until another thread adds an element to that collection:

public void Test()
{
    BlockingCollection<int> collection = new BlockingCollection<int>();

    IEnumerator<int> enumerator = collection.GetConsumingEnumerable().GetEnumerator();

    bool moveNext = enumerator.MoveNext(); // The calling thread is blocked
}

And there are countless implementations of IEnumerable that exhibit similar behavior. For example, you could write an infinite random number generator that implements IEnumerable<int>, like this:

private IEnumerable<int> Test2()
{
    Random random = new Random();
    while (true)
    {
        yield return random.Next();
    }
}

If you call ToList on such an implementation, the program will hang in an infinite loop.

Note that technically, these implementations aren’t leaky abstractions, because the IEnumerable interface doesn’t promise collection finality (i.e. it doesn’t expose a Count property).

But still, in practice, most programmers expect a certain behavior from IEnumerable. Collection finality and non-blocking calls to MoveNext are part of that expectation. We can say that IEnumerable has an implicit contract that most programmers came to rely on, and some implementations of IEnumerable violate that contract and become leaky abstractions.

6. IEnumerable in library development

The guideline of returning the most specific type (such as IReadOnlyList) is relevant for enterprise application development, but the situation with library development is different.

If you are developing a library for a public use, you want to restrict its public API as much as possible to give yourself leeway to update that API. This includes both the number of available public methods and the types those methods return. In short, you want to promise as little as possible.

Returning IEnumerable over IReadOnlyList is preferable in such environment. It allows you to change the internal implementation from e.g List<T> to a LinkedList<T> without breaking backward compatibility.

But this doesn’t apply to regular enterprise application development. Since your team is the only user of your code, you don’t need to maintain backward compatibility and may refactor your code as needed if the API changes.

This is called the Open-Closed Principle. You can read more about it in this article: OCP vs YAGNI

7. Summary

  • Adhere to the following guideline.

    • Use IEnumerable for arguments as the most generic type possible.

    • Use IReadOnlyList for return values as the most specific type possible.

  • IQueryable is a leaky abstraction because it requires you to know which LINQ expressions the ORM can understand.

  • IList and Array aren’t suitable because they are mutable.

  • Keep in mind that some implementations of IEnumerable are also leaky abstractions.

Subscribe


I don't post everything on my blog. Don't miss smaller tips and updates. Sign up to my mailing list below.

Comments


comments powered by Disqus