When To Use Set vs. List In SOQL

(tl;dr at bottom)

Filter Speed

If you can use Map.keySet(), you'll get a Set back in about 1/10th of the time it takes to loop over those same records and add each value yourself. Of course, the Map<Id, SObject>(List<SObject>) constructor only keys by a record's own Id, not by related ID fields, which means for those you'd have to fall back to a loop. However, in cases where you just want to use the IDs of a list of records for the purpose of a query, you can use the list directly:

Account[] accounts = [SELECT Id FROM Account];
Contact[] contacts = [SELECT AccountId FROM Contact WHERE AccountId IN :accounts];

This takes linear time that's far smaller than constructing a Map, Set, or List of IDs first. Alternatively, you can often use a semi-join sub-select to the same effect:

Contact[] contacts = [SELECT AccountId FROM Contact];
Account[] accounts = [SELECT Id FROM Account WHERE Id IN (SELECT AccountId FROM Contact WHERE Id IN :contacts)];

Assuming the semi-join is not prohibited (e.g. you cannot use it with an OR filter), this is the fastest method available. If not, then binding the list directly is faster, followed by a Set obtained from Map.keySet(), followed by a plain Set or List built in a loop.

In other words, while generateParentIds() seems handy, it's actually slower than simply baking the logic directly into the query, if possible. If you have to choose between a List and a Set, there's no clear benefit to either, as they'll take essentially the same amount of time to filter records.
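
For reference, here is a minimal sketch of the two approaches mentioned at the top of this section: the Map constructor/keySet() shortcut, and the loop you fall back to for related ID values (the object and field names are purely illustrative):

// Fast path: the Map<Id, SObject> constructor keys records by their own Id,
// so keySet() hands back a Set<Id> with no explicit loop.
Map<Id, Account> accountsById = new Map<Id, Account>([SELECT Id, Name FROM Account]);
Set<Id> accountIds = accountsById.keySet();

// Fallback: related IDs (e.g. Contact.AccountId) can't be collected this way,
// so you loop over the records and add the values yourself.
Set<Id> relatedAccountIds = new Set<Id>();
for (Contact c : [SELECT AccountId FROM Contact WHERE AccountId != null]) {
    relatedAccountIds.add(c.AccountId);
}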

Size

Given a List and a Set that each contain the same unique values, both collections will be the same size, byte for byte. Once you start having duplicate values, the Set becomes more useful. However, for the volume of values you'd expect in a typical collection, there's little meaningful difference between a List and a Set. For example, in a trigger of 200 records, the maximum size of a list of ID values should be about 3,000 bytes (200 IDs at 15 characters each). As you can see, a handful of ID values don't use much storage. You won't see any benefit at all until you have non-unique values, and even then the savings only become meaningful when a few values are repeated many times.

Storage Speed

Here's where things get tricky. At very small object sizes and very small data sizes, a Set and a List are about equal in terms of speed, assuming you're using native types. For example, storing values into a Set of ID values is about the same speed as using a List, assuming there aren't many values.

Consider the following code:

Integer counter = 0;
Set<Id> values = new Set<Id>();
// Id[] values = new Id[0];
while(counter++<100000) {
    String k = String.valueOf(counter), v = '005000000000000'.left(15-k.length()); 
    values.add(v+k); 
}

Building 100,000 unique values into either a List or a Set shows very similar timings (~4.8 seconds). It's not until you get to absurd sizes (~1,000,000 values) that you start noticing a considerable penalty, and that's usually moot because anything even as complex as the above code will hit the CPU time limit far before processing 1,000,000 values.

Using this even simpler loop (this time with a collection of Integer values):

while(counter++<1000000) values.add(counter);

I get about 6 seconds for a List, and 7 seconds for a Set. And that's over 1,000,000 values. The difference works out to roughly 0.001 ms per item added. I promise, your users won't notice your choice of a Set over a List.
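
If you want to reproduce these numbers yourself, a rough sketch using Limits.getCpuTime() (run as anonymous Apex; the 100,000 count and the absolute timings are arbitrary, will vary by org, and larger counts risk hitting the CPU limit):

Integer started = Limits.getCpuTime();
List<Integer> asList = new List<Integer>();
for (Integer i = 0; i < 100000; i++) {
    asList.add(i);
}
System.debug('List build: ' + (Limits.getCpuTime() - started) + ' ms');

started = Limits.getCpuTime();
Set<Integer> asSet = new Set<Integer>();
for (Integer i = 0; i < 100000; i++) {
    asSet.add(i);
}
System.debug('Set build: ' + (Limits.getCpuTime() - started) + ' ms');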

Retrieval Speed

Here's where a Set really shines. Given these two blocks of code:

for(Case record:Trigger.new) {
    if(record.AccountId != null) {
        accountIds.add(record.AccountId);
    }
}

Or

for(Case record:Trigger.new) {
    accountIds.add(record.AccountId);
}
accountIds.remove(null);

The clear winner is the Set. The second version only works because Set.remove() removes by value, so the single null can be stripped once after the loop instead of checking a condition on every iteration. While it's harder to benchmark this sort of code, repeatedly checking a condition each iteration is linearly more expensive than checking just once, and the results can be pretty impressive on large lists (~100 ms or more in some extreme cases).
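
A minimal sketch of that difference. A List has no remove-by-value (List.remove() takes an index), so the equivalent cleanup needs a scan first; the Id literal below is just a placeholder:

// Set: strip the single null entry in one near-constant-time call.
Set<Id> accountIdSet = new Set<Id>{ null, (Id)'001000000000001' };
accountIdSet.remove(null);

// List: there is no remove-by-value, so you find the element first.
List<Id> accountIdList = new List<Id>{ null, (Id)'001000000000001' };
Integer i = accountIdList.indexOf(null);
if (i != -1) {
    accountIdList.remove(i);
}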

tl;dr

At normal volumes that you'd find within a trigger context and most types of page contexts, there's no clear benefit to using either a List or a Set in terms of memory or speed, with the sole exception that a Set can identify and remove bad values in near-constant time (far better than a List can). So: if you need to avoid a specific value, use a Set; if you need a specific order or possibly duplicate values, use a List; and if neither applies, just use whatever you feel like.

A query that uses IN with the same value more than once does not cause an error, so there's no point in worrying about uniqueness in normal filters. However, if you're truly concerned about query performance, consider using a semi-join sub-select, which is far faster than binding either a List or a Set. Obviously, this doesn't work on records not yet committed to the database, so you will often need to fall back to one type of collection or the other. I'm just advocating that you not worry about which you use as long as it gets the correct results.


Update: I missed what is probably the single most important part of the question in answering this - the "In SOQL" part. Perhaps consider this answer in terms of the other processing you are going to do with the collection before or after you pass it off to a SOQL query.

Summary:

Use a List when:

  • Ordering is important
  • You want to access elements by index (zero-based)
  • You want to be able to sort the elements
  • You need multiple dimensions

Use a Set when:

  • Ordering is not important
  • You want the elements to be unique (as defined by the set of fields for sObjects, or by the equals and hashCode methods for other types)
  • You only need to iterate over all the records or test whether an element is present in the collection. See the note* on ordering.
  • You want to do set-wise operations, such as union (addAll), intersection (retainAll), and relative complement (removeAll)
  • You already have a Map where the keys are the values you want, allowing you to use Map.keySet() rather than iterating and adding. This can be useful when combined with a SOQL query that directly populates the Map.

One important distinction is that a Set is an "unordered collection of elements", whereas a List maintains the order in which elements were added and can be sorted. This can be important in a number of scenarios where you want to work on records in a particular sequence. It also allows a List to be accessed by index, whereas a Set can only be iterated over. You can use a List interchangeably with array notation.

A set is an unordered collection—you can’t access a set element at a specific index. You can only iterate over set elements. Source
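
A quick sketch of that distinction (array notation, sorting, and index access on a List, versus iteration-only on a Set):

String[] names = new List<String>{ 'Charlie', 'Alice', 'Bob' };  // array notation and List are interchangeable
names.sort();
System.debug(names[0]);                  // 'Alice' -- access by index

Set<String> nameSet = new Set<String>(names);
for (String n : nameSet) {               // a Set can only be iterated
    System.debug(n);
}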

This partially comes down to the Set using the equals and hashCode methods to determine uniqueness. I've been caught out a few times with Sets and Map keys where a field change on an sObject changes its hashCode; as a result it can be added to the Set or the Map's keys again. So relying on a Set to remove duplicate values can be risky if there are intermediate modifications to field values on sObjects. The same is true if you are implementing hashCode yourself on a custom object.
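
A sketch of that pitfall (the exact behaviour may vary, but this is the general shape of the problem):

Account a = new Account(Name = 'Acme');
Set<Account> seen = new Set<Account>{ a };

System.debug(seen.contains(a));   // true -- hash computed from current field values

a.Name = 'Acme Corp';             // field change alters the computed hashCode
System.debug(seen.contains(a));   // may now be false -- the element is effectively lost

seen.add(a);                      // ...and the "same" record can be added again
System.debug(seen.size());        // likely 2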

Unlike most of Apex, the casing of a String in a Set<String> is significant. From the docs:

If the set contains String elements, the elements are case-sensitive. Two set elements that differ only by case are considered distinct.
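
For example:

Set<String> names = new Set<String>{ 'ACME', 'acme' };
System.debug(names.size());              // 2 -- both casings are kept
System.debug(names.contains('Acme'));    // false -- only an exact-case match is found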


As for memory usage, the definitive answer would be to monitor the heap size using a Set and a List in turn. That would give the best indication for your specific scenario.

It's been a while since I've done a data structures course, but it looks like the Apex Set is based on Java's HashSet:

Unlike Java, Apex developers do not need to reference the algorithm that is used to implement a set in their declarations (for example, HashSet or TreeSet). Apex uses a hash structure for all sets. Source

Quick empirical test:

List<String> ids = new List<String>{'foo','bar'};
System.debug('Heap: ' + Limits.getHeapSize() + '/' + Limits.getLimitHeapSize());

Heap: 1065/6000000

Set<String> ids = new Set<String>{'foo','bar'};
System.debug('Heap: ' + Limits.getHeapSize() + '/' + Limits.getLimitHeapSize());

Heap: 1065/6000000

This would suggest that there is no difference in terms of heap usage between the two. I suspect this may be due to the negligible size of my collections; there could be some form of optimization going on for small sets.
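
To actually see a difference you would likely need much larger collections; here is a sketch along those lines (the 100,000 count is arbitrary, and Limits.getHeapSize() reports an approximation):

Integer before = Limits.getHeapSize();
List<String> asList = new List<String>();
for (Integer i = 0; i < 100000; i++) {
    asList.add('value' + i);
}
System.debug('List heap delta: ' + (Limits.getHeapSize() - before));

before = Limits.getHeapSize();
Set<String> asSet = new Set<String>();
for (Integer i = 0; i < 100000; i++) {
    asSet.add('value' + i);
}
System.debug('Set heap delta: ' + (Limits.getHeapSize() - before));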


* Up until Summer '15 the ordering of items in Sets was non-deterministic. See Predictable Iteration Order for Unordered Collections.