Find subsequences of consecutive integers inside a list
You can use Split in this simple case:
list = {3, 4, 5, 6, 7, 10, 11, 12, 15, 16, 17, 19, 20, 21, 22, 23, 24, 42, 43, 44, 45, 46};
{Min[#], Max[#]} & /@ Split[list, #2 - #1 == 1 &]
The last argument to Split is a test that gives True only when neighboring elements differ by exactly 1. Wherever it gives False, the list is split. You can then use the Min/Max approach to find the ends of each run. First and Last will work too.
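As a quick check, the intermediate Split result for the list above can be inspected directly, and the First/Last variant compared against Min/Max. This is only a small sketch; the names runs, viaMinMax and viaFirstLast are introduced here for illustration and are not part of the original answer.

```mathematica
list = {3, 4, 5, 6, 7, 10, 11, 12, 15, 16, 17, 19, 20, 21, 22, 23, 24, 42, 43, 44, 45, 46};

(* runs of consecutive integers: split wherever the step is not 1 *)
runs = Split[list, #2 - #1 == 1 &]
(* {{3, 4, 5, 6, 7}, {10, 11, 12}, {15, 16, 17}, {19, 20, 21, 22, 23, 24}, {42, 43, 44, 45, 46}} *)

(* end points of each run *)
viaMinMax = {Min[#], Max[#]} & /@ runs
(* {{3, 7}, {10, 12}, {15, 17}, {19, 24}, {42, 46}} *)

(* each run is ascending, so First/Last give the same end points *)
viaFirstLast = {First[#], Last[#]} & /@ runs;
viaMinMax === viaFirstLast
(* True *)
```

Because every run produced by this test increases in steps of 1, its first element is its minimum and its last element is its maximum, which is why the two variants agree.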
Update:
Since the attention to this question/answer is rather surprising, let me point out one important thing: the crucial difference between Split and SplitBy. Both functions take a second argument that supplies a function determining where to split, but their behavior is completely different. By the way, the same is true for Gather and GatherBy.
The second argument to Split means that it
"treats pairs of adjacent elements as identical whenever applying the function test to them yields True",
whereas SplitBy does a completely different thing. It
"splits list into sublists consisting of runs of successive elements that give the same value when f is applied."
If you weren't aware of this, a closer look is surely advisable.
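To make the difference concrete, here is a minimal comparison on a short list. The list and the choice of EvenQ as the SplitBy function are just illustrative assumptions, not something taken from the question.

```mathematica
list = {1, 2, 3, 5, 6};

(* Split: the test sees PAIRS of adjacent elements *)
Split[list, #2 - #1 == 1 &]
(* {{1, 2, 3}, {5, 6}} *)

(* SplitBy: the function sees ONE element at a time;
   a new sublist starts whenever its value changes *)
SplitBy[list, EvenQ]
(* {{1}, {2}, {3, 5}, {6}} *)
```

A pairwise condition such as "differs by 1" therefore fits Split directly, while SplitBy needs a per-element quantity that stays constant within each run.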
UPDATE: After reading Ajasja's answer I realized that I was making this way more complicated than it needed to be. My new code is easily an order of magnitude faster than my prior code, or two orders faster than Split.
Split is a wonderfully clean method but it is not the fastest. Without resorting to C code one can get more than two orders of magnitude improvement on long lists with this:
intervals[a_List] :=
{a[[Prepend[# + 1, 1]]], a[[Append[#, -1]]]}\[Transpose] & @
SparseArray[Differences @ a, Automatic, 1]["AdjacencyLists"]
Compared to Split:
a = Delete[#, List /@ RandomSample[#, 15000]] & @ Range@1*^7;
(r1 = intervals[a]) // Timing // First
(r2 = {Min[#], Max[#]} & /@ Split[a, #2 - #1 == 1 &]) // Timing // First
r1 === r2
0.0624
7.005
True
A short description of how the method works:
Differences is used to find the step between each element and the next across the entire list.
I have used SparseArray Properties many times on this site: (1), (2), (3), (4), (5), (6), (7), (8), (9)
Here it is used as a well-optimized method to find the positions of all non-background elements in the differences list. I specify a background of 1 to find the positions of all other elements, which represent a step other than +1. (In later versions Pick is also well-optimized, so that becomes an option (10), but here we need to manipulate the position list itself, so this may be the best method even in later versions.)
Padding the position list (with Append and Prepend) with 1 and -1 is used to catch the first and last elements, respectively, of the original list. Adding one to every position (not appending) offsets the positions so that we get the elements on both sides of each jump. Finally, Transpose is used to pair off these values into the interval lists.
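The steps described above can be traced on a short list. This is a sketch that simply unrolls the one-line definition; the intermediate names d, jumps, starts and ends are hypothetical and introduced only for this walkthrough.

```mathematica
a = {3, 4, 5, 10, 11, 15};

(* step from each element to the next *)
d = Differences[a]
(* {1, 1, 5, 1, 4} *)

(* positions where the step differs from the background value 1 *)
jumps = SparseArray[d, Automatic, 1]["AdjacencyLists"]
(* {3, 5} *)

(* element just after each jump, plus the very first element *)
starts = a[[Prepend[jumps + 1, 1]]]
(* {3, 10, 15} *)

(* element just before each jump, plus the very last element *)
ends = a[[Append[jumps, -1]]]
(* {5, 11, 15} *)

Transpose[{starts, ends}]
(* {{3, 5}, {10, 11}, {15, 15}} *)
```

Note that an isolated value such as 15 comes out as a degenerate interval {15, 15}, just as it does with the Split approach.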
Recently, I had to solve exactly the same problem. But my data consisted of several hundred lists of $10^6$ elements. The profiler showed this was becoming a significant overhead for my applications, so I invested an hour into a faster implementation. Anyway, here is a bit more than another order of magnitude improvement over Mr. W's answer (and more than 300× faster than the naive Mathematica implementation):
The algorithm is quite simple: iterate through the list, and every time a difference other than 1 appears, i.e. (curr - prev) != 1, push prev as the closing part of the current interval and curr as the opening part of the next interval. Internal`Bag is used for O(1) insertion.
compiledGetContigIntervals =
Compile[{{ind, _Integer, 1}},
Block[{i, openInterval = 0, result = Internal`Bag[Most@{0}]},
openInterval = ind[[1]];(* the first opening interval *)
(* loop through all the indices and check for differences <> 1
If that is the case stuff the interval *)
Do[With[{curr = ind[[i]], prev = ind[[i - 1]]},
If[(curr - prev) != 1,
Internal`StuffBag[result, openInterval];
Internal`StuffBag[result, prev];
openInterval = curr;]]
, {i, 2, Length@ind}];
Internal`StuffBag[result, openInterval];
Internal`StuffBag[result, ind[[-1]]];
(* return the intervals *)
Partition[Internal`BagPart[result, All], 2]],
CompilationTarget -> "C", RuntimeOptions -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True,
"InlineCompiledFunctions" -> True,
"InlineExternalDefinitions" -> True}];
(If you don't have a C compiler, just leave out the options at the end of Compile.)
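For comparison, the same loop can be written in uncompiled top-level code, with Sow and Reap standing in for Internal`Bag. This is only a readable sketch of the algorithm for checking results on small inputs: the name contigIntervalsSketch is hypothetical, the function assumes a non-empty list, and it makes no claim to match the speed of the compiled version.

```mathematica
contigIntervalsSketch[ind_List] :=
 Module[{open = First[ind], closed},
  (* collect every interval that is closed by a jump *)
  closed = Reap[
     Do[
      If[ind[[i]] - ind[[i - 1]] != 1,
       Sow[{open, ind[[i - 1]]}]; (* close the current interval at prev *)
       open = ind[[i]]],          (* open the next interval at curr *)
      {i, 2, Length[ind]}]][[2]];
  (* Reap gives {} if nothing was sown, otherwise {{interval, ...}} *)
  Append[Join @@ closed, {open, Last[ind]}]]

contigIntervalsSketch[{3, 4, 5, 10, 11, 15}]
(* {{3, 5}, {10, 11}, {15, 15}} *)
```

The final Append mirrors the last two StuffBag calls in the compiled code: the interval that is still open when the loop ends has to be closed with the last element of the list.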
And now the timings:
a = Delete[#, List /@ RandomSample[#, 500]] &@Range@1*^7;
intervals[a_List] := {a[[Prepend[# + 1, 1]]], a[[Append[#, -1]]]}\[Transpose] &@
SparseArray[Differences@a, Automatic, 1]["AdjacencyLists"]
(r1 = compiledGetContigIntervals[a]) // AbsoluteTiming // First
(r2 = intervals[a]) // AbsoluteTiming // First
(r3 = {Min[#], Max[#]} & /@ Split[a, #2 - #1 == 1 &]) // AbsoluteTiming // First
r1 === r3
r2 === r3
(*0.040002*)
(*0.191011*)
(*14.636837*)
(*True*)
(*True*)
(If the code is compiled to the WVM rather than to C, the timing is 0.74 s.)