Collection classification - dev-toolkt/DevToolkt GitHub Wiki
This is an opinionated classification of collections, focusing on the concept of order.
Used terms
Tangible types
I consider an object (or its type) "tangible" if it has a transparent structure allowing defining a hash operator or a deterministic comparator (not necessarily related to any "natural" order). This reasoning is not specific to any particular programming language, but rather to the current wave of general-purpose programming languages (such as C++, Java, C#, and Kotlin) as a whole.
For example, Int
is a tangible type. It also has a "natural" order (the order of integers). Calculating its hash and comparing it is trivial.
String
is also a tangible type. Calculating its hash and comparing it lexicographically is trivial. It can be argued that the lexicographic order is the "natural" order of strings, but it's not as clear as in the case of integers.
UserInfo {firstName: String, lastName: String}
(immutable) is also a tangible type. We can easily calculate its hash "for technical purposes", by combining the hashes of firstName
and lastName
with a symmetric hash operator (without assuming the order of fields). To compare such objects, we have to assume a fixed order once and for all, but it doesn't have to be considered the "natural" order.
Most primitive types are tangible, and immutable complex structural types (based on other tangible types) are also tangible. In a well-designed programming language, functions shouldn't be considered tangible. It's not the case for most mainstream programming languages.
Mutable objects with an immutable identity (sometimes called "an address") are also tangible.
Typically, hashability and comparability are separated into two separate traits or interfaces. I am not aware of any practical type that can be hashable but cannot be compared in an arbitrary yet deterministic way (or vice versa), so I prefer to think in terms of "tangibility."
Handle
A handle is something that can be used to distinguish a given element from other elements of the collection. It's a new identity in its own right. The integer 0
can be added twice to the same collection (if it allows duplicates), but both "copies" can be reachable by their respective handles. Handles can improve access performance.
Implementing handles is easier in the case of node-based linked data structures (linked lists, trees, graphs) than in the case of arrays. Providing a stable handle for an array-based structure requires some level of indirection.
Handles allow storing non-tangible types in collections, but they can also be used in the lookup-supporting collections of tangible types, potentially improving performance.
Lookup
"Lookup" is a way of finding a tangible element in a collection efficiently, i.e., that doesn't require iterating through the entire collection.
Collections
Unordered Bag
There's no inherent order of elements. There is no requirement for the elements to be tangible. Duplicates are allowed (this can be considered a consequence of allowing non-tangible types). Handles are provided.
Unordered Bag is similar to a multiset (e.g., MultiSet (Apache Commons), std::multiset
), but it doesn't require tangibility (in consequence, it doesn't provide operations like "get the number of occurenses of x in the bag").
Provided functionality:
- Adding an element (in exchange for a handle)
- Getting an element value (given a handle)
- Removing an element (given a handle)
- Iteration (without any guarantees regarding the order, even its stability)
- Although this sounds like not much, iteration without any order assumptions can be implemented to find the sum of all elements, the min/max of all elements, and many more
Implementation techniques:
- A linked list (naive but simple)
- An array with some helper bookkeeping structure
- Optional: an allocation scheme for re-using free slots in the array, which might provide great performance benefits for big bags with frequent content changes
Known implementations:
std:list
- It provides stable handles
- LinkedList (Java) is similar, but doesn't provide stable handles
Manually Ordered Bag
There's no requirement for element uniqueness. There is no requirement for the elements to be tangible. Handles are provided. The elements are given an order. That order can be changed.
Provided functionality:
- Adding an element (in exchange for a handle)
- Getting an element value (given a handle)
- Removing an element (given a handle)
- Swapping two neighbours in the order (given their handles)
- Iteration (in the current order)
- E.g., any commutative or non-commutative operation on the whole bag (sum, difference, concatenation...)
Implementation techniques:
- A linked list
- An array with some helper handle mapping structure
Known implementations:
std:list
- It provides stable handles
- LinkedList (Java) is similar, but doesn't provide stable handles
Manually Ordered Bag (with order statistic)
The semantics are the same, but in addition, these operations are provided:
- Lookup for a handle to the i-th element in the order
- Lookup for the element's index in the order (given a handle)
Implementation techniques:
- Order statistic tree (but in variant that's not a binary search tree, using the given order)
Known implementations: ?
Unordered Set
There's no inherent order of elements. There is a requirement for the elements to be tangible. Duplicates are not allowed. Handles are provided optionally.
Provided functionality:
- Adding an element
- Removing an element given a value
- Checking if a value is the collection's element
- Iteration (without any guarantees regarding the order, even its stability)
Provided optional functionality:
- Getting an element value (in exchange for a handle)
- Removing an element (given a handle)
Implementation techniques:
- A balanced ordered binary tree (using an arbitrary yet deterministic order)
- A hash table
- An array (for small sets)
Known implementations:
- TreeSet (Java) and similar classes in other programming languages
- Although it also "gives" you an order, you can ignore it
- HashSet (Java) and similar classes in other programming languages
Naturally Ordered Set
There is an inherent order to the elements. There is a requirement for the elements to be tangible. Duplicates are not allowed. Handles are provided optionally.
Provided functionality:
- Adding an element
- Removing an element given a value
- Checking if a value is the collection's element
- Iteration (in the natural order)
Provided optional functionality:
- Getting an element value (in exchange for a handle)
- Removing an element (given a handle)
Implementation techniques:
- A balanced ordered binary tree (using the natural order)
Known implementations:
- TreeSet (Java) and similar classes in other programming languages
Naturally Ordered Set (with order statistic)
The semantics are the same, but in addition, these operations are provided:
- Lookup for a handle to the i-th element in the order
- Lookup for the element's index in the order (given a handle)
Implementation techniques:
Known implementations: ?
Manually Ordered Set
There is a requirement for the elements to be tangible. Duplicates are not allowed. Handles are provided optionally. The elements are given an order. That order can be changed.
Provided functionality:
- Adding an element
- Removing an element given a value
- Checking if a value is the collection's element
- Swapping two neighbours in the order (given their values or handles)
- Iteration (in the current order)
Provided optional functionality:
- Getting an element value (in exchange for a handle)
- Removing an element (given a handle)
Implementation techniques:
- A manually ordered bag (e.g. a linked list) combined with a map in the form [element -> bag handle] to guarantee uniqueness
- Possibly there's no more efficient solution
Known implementations: ?
Manually Ordered Set (with order statistic)
The semantics are the same, but in addition, these operations are provided:
- Lookup for a handle to the i-th element in the order
- Lookup for the element's index in the order (given a handle)
Implementation techniques:
- A manually ordered bag with order statistic (order statistic tree) combined with a map in the form [element -> bag handle] to guarantee uniqueness
- Possibly there's no more efficient solution
Known implementations: ?