.. _adr-003: [ADR-003] ``PathSequence`` inherits from ``Set`` ================================================ :bdg-danger:`Rejected` Context and Problem Statement ----------------------------- This proposal is to change the class signature of ``PurePathSequence`` to change from ``PurePathSequence(Sequence[PathT_co])`` to ``PurePathSequence(Sequence[PathT_co], Set[PathT_co])``. In this proposal we consider the following use cases: * 10,000 files in a sequence is considered normal, if rare. 5,000 files is a more common use case. * Most of the time, a sequence will have been output by a DCC representing files in order, due to those files getting rendered in order. The sequence string is likely then stored somewhere as-is, and passed to pathseq in the same order. In other words, it's considered uncommon for the ranges in a range string to be given out of order. * Path sequences are most commonly used for looping over all paths in the sequence, or establishing which files in the sequence exist. Neither requires set operations, although we already deduplicate and sort the file numbers when establishing which exist on disk. We have no use case where the order of file numbers matters. Currently, equality checks benefit from a very basic form of normalisation: .. code-block:: pycon >>> from pathseq import PathSequence >>> seq1 = PathSequence("/path/to/images.1-2####.exr") >>> seq2 = PathSequence("/path/to/images.1,2####.exr") >>> seq1 == seq2 True By turning path sequences into a set, more complex forms of normalisation will allow two sequences with the same file number sets to also compare as equal. For example: .. code-block:: pycon >>> from pathseq import PathSequence >>> seq1 = PathSequence("/path/to/images.1-2,2-4####.exr") >>> seq2 = PathSequence("/path/to/images.1-4####.exr") >>> seq1 == seq2 True Considered Options ------------------ Normalisation ~~~~~~~~~~~~~ Give that we have no use case where the order of file numbers matters, the complex deduplication of file numbers could be implemented in one of two ways: with sorting or without sorting. With sorting would require the arithmetic sequences given to a file number set to be sorted, before being split such that they don't overlap. Given the use cases, most of the time the sorting will take :math:`n` time if we're using a :math:`\Omega(n)` time sort algorithm. Detecting whether two sequences overlap is simple — though does require in depth mathemtical concepts — for integers, but is more involved for decimals (see Appendix A). If maintaining this complex code ever becomes a problem, we would always resort to looping over every item in the sequences. Without sorting a naive implementation would iterate over every item in the given sequences, and place them into an ordered set to deduplicate them in :math:`O(n)` time with :math:`O(n)` memory. This favours the unusual case, where file numbers are given unordered because it takes :math:`O(n)` time instead of :math:`O(n\log n)` time, but at the cost of much more computation in the more common case where file numers are already ordered. Sorting would violate the concept of a sequence being ordered, and therefore normalisation via not sorting is considered superior. Decision Outcome ---------------- We will not turn PathSequences into sets. The performance cost and maintenance complexity introduced does not give rise to enough justified functionality. Consequences ------------ * The option to do more complex normalisation remains open. A subclass of ``PurePathSequence`` or a method on a ``PurePathSequence`` could perform more complex normalisation as something that the user opts into. * If use cases for set operations come to light, it would still possible to provide set operations without introducing breaking changes, by providing ``PurePathSet`` and/or ``OrderedPurePathSet`` classes, plus impure and loose equivalents. Appendix A ---------- Code to check if two arithmetic sequences are disjoint, implemented as methods on ``ArithmeticSequence``. .. code-block:: python def isdisjoint(self, other: Iterable[Any]) -> bool: if isinstance(other, ArithmeticSequence): return self._quick_isdisjoint(other) # Resort to a slow check return super().isdisjoint(other) def _quick_isdisjoint(self, other: Self) -> bool: # Shortcut if self.start > other.end or other.start < self.end: return True # For # a_n = a_1 + d_1(n-1) # b_m = b_1 + d_2(m-1) a = self b = other a1 = a.start b1 = b.start d1 = a.step d2 = b.step # the following condition must be met: # a_1 + d_1(n-1) = b_1 + d_2(m-1) # d_1 * n - d_2 * m = b_1 - a_1 # Calculate the difference between the first terms difference = b1 - a1 # Convert differences to integers for GCD calculation d1_int = int(d1) if d1 % 1 == 0 else None d2_int = int(d2) if d2 % 1 == 0 else None # Check if the GCD of the common differences divides the difference if d1_int is not None and d2_int is not None: gcd = math.gcd(d1_int, d2_int) if gcd == 0: return True if difference % gcd != 0: return True else: # Handle the case where differences are not integers # Convert the equation to integer coefficients assert isinstance(d1, decimal.Decimal) assert isinstance(d2, decimal.Decimal) a_exp = remove_exponent(d1).as_tuple().exponent assert isinstance(a_exp, int) b_exp = remove_exponent(d2).as_tuple().exponent assert isinstance(b_exp, int) scale_factor = 10 ** max(a_exp, b_exp) d1_scaled = int(d1 * scale_factor) d2_scaled = int(d2 * scale_factor) difference_scaled = int(difference * scale_factor) # Calculate the GCD of the scaled differences gcd_scaled = math.gcd(d1_scaled, d2_scaled) if gcd_scaled == 0: return True # Check if the scaled difference is divisible by the scaled GCD if difference_scaled % gcd_scaled != 0: return True # Find the smallest common term common_term_scaled = ( a1 * scale_factor + (difference_scaled // gcd_scaled) * d1_scaled ) common_term = decimal.Decimal(common_term_scaled) / decimal.Decimal( scale_factor ) # Check if the common term is within the bounds of both sequences if common_term <= a.end and common_term <= b.end: return False return True