Optimizing (minimizing) the number of lines in a file: an optimization problem in line with permutations and agenda scheduling

I study algorithms as a hobby, and I agree with Caduchon that greedy is the way to go, though I believe this is, more accurately, the clique cover problem: https://en.wikipedia.org/wiki/Clique_cover

Some ideas on how to approach building cliques can be found here: https://en.wikipedia.org/wiki/Clique_problem

Clique problems are related to independent set problems.

Considering the constraints, and that I'm not familiar with MATLAB or R, I'd suggest this:

Step 1. Build the independent set data for each time slot. For each time slot, create a mapping (for fast lookup) of all records that have a 1 in that slot. None of these can be merged into the same row (they all need to end up in different rows). For example, for the first column (slot), the subset of the data looks like this:

id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0

The data would be stored as something like 0: Set(id1,id4,id8,id9,id16) (zero-indexed columns: we start at 0 instead of 1, though it probably doesn't matter here). The idea is to have O(1) lookup, since you may need to quickly tell that id2 is not in that set. You can also use nested hash tables for that, e.g. 0: { id1: true, id4: true }. Sets also allow set operations, which may help quite a bit when determining unassigned columns/ids.

In any case, none of these 5 can be grouped together. That means that, given that column, you must have at least 5 rows (and exactly 5 only if all the other records can be merged into those 5 without conflict).

Performance of this step is O(NT), where N is the number of individuals and T is the number of time slots.
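
Here is a minimal Python sketch of Step 1 (Python is my choice, not something from the question, and the name build_slot_index and the dict-of-lists layout are made up for illustration). It uses only the rows shown in this answer as input, so it is a toy check rather than the full data set.

```python
from collections import defaultdict

def build_slot_index(table):
    """Map each 0-indexed time slot to the set of ids that have a 1 in it."""
    slot_ids = defaultdict(set)
    for row_id, slots in table.items():      # O(N) rows
        for slot, busy in enumerate(slots):  # O(T) slots per row
            if busy:
                slot_ids[slot].add(row_id)
    return dict(slot_ids)

# Only the rows quoted in this answer (not the whole table).
table = {
    "id1":  [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "id4":  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "id5":  [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "id8":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "id9":  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "id12": [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "id16": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
}

slot_ids = build_slot_index(table)
print(slot_ids[0])           # the five ids that clash in the first slot
print("id5" in slot_ids[0])  # O(1) membership test, prints False
```

Because the values are real sets, something like slot_ids[3] - already_assigned (with already_assigned being whatever set of ids you track) gives the still-unassigned ids of a slot directly.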

Step 2. Now we have options for how to attack things. To start, we pick the time slot with the most individuals and use that as our starting point. That gives us the minimum possible number of rows. In this case we actually have a tie: the 2nd and 5th columns both have 7. I'm going with the 2nd, which looks like:

id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,0,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id12:0,1,1,1,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0
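
Picking the starting slot is just a column-sum and a max. A small sketch, again in Python with a made-up function name (busiest_slot) and only the seven rows listed above as input; ties go to the earliest slot, which is an arbitrary choice on my part:

```python
def busiest_slot(table):
    """Return (slot index, count) for the 0-indexed slot with the most 1s."""
    n_slots = len(next(iter(table.values())))
    counts = [sum(row[s] for row in table.values()) for s in range(n_slots)]
    best = max(range(n_slots), key=counts.__getitem__)  # first max wins ties
    return best, counts[best]

# The seven rows listed above (not the whole table).
table = {
    "id1":  [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "id4":  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "id5":  [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "id8":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "id9":  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "id12": [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "id16": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
}

slot, count = busiest_slot(table)
print(slot, count)  # 1 7  (0-indexed, i.e. the 2nd column, with 7 ids)
```

The count is also a lower bound on the number of merged rows, since every id in that slot needs its own row.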

Step 3. Now that we have our starting groups, we need to add to them while trying to avoid conflicts between new members and existing group members (which would require an additional row). This is where we get into NP-complete territory, as there are a ton of ways (roughly 2^N, to be more accurate) to assign things.

I think the best approach might be a randomized one, since you can run it as many times as you have time for and compare the results. So here is the randomized algorithm:

  1. Given the starting column and ids (1,4,5,8,9,12,16 above), mark this column and these ids as assigned.
  2. Randomly pick an unassigned column (time slot). For a perhaps "better" result, pick the one with the fewest (or most) unassigned ids. For a faster implementation, just loop over the columns.
  3. Randomly pick an unassigned id in that column. For a better result, perhaps pick the one with the most/fewest groups it could be assigned to. For a faster implementation, just pick the first unassigned id.
  4. Find all groups that the unassigned id could be assigned to without creating a conflict.
  5. Randomly assign it to one of them. For a faster implementation, just pick the first one that doesn't cause a conflict. If there are no groups without conflict, create a new group and assign the id to that group as its first id. The optimal result is that no new groups have to be created.
  6. Update the data for that row (make 0s into 1s as needed).
  7. Repeat steps 3-6 until no unassigned ids for that column remain.
  8. Repeat steps 2-7 until no unassigned columns remain.
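
The steps above can be sketched roughly as follows. This is my own Python rendering, not tested on the real data: merge_rows, conflicts and the five-row table are invented for illustration. Passing no rng gives the "faster implementation" (loop over columns, first id, first fitting group); passing a random.Random instance randomizes steps 2, 3 and 5.

```python
import random

def conflicts(a, b):
    """True if two 0/1 rows have a 1 in the same slot."""
    return any(x and y for x, y in zip(a, b))

def merge_rows(table, rng=None):
    """Return a list of (member ids, merged 0/1 row) groups."""
    n_slots = len(next(iter(table.values())))
    counts = [sum(row[s] for row in table.values()) for s in range(n_slots)]
    first = max(range(n_slots), key=counts.__getitem__)  # busiest slot first
    rest = [s for s in range(n_slots) if s != first]
    if rng is not None:
        rng.shuffle(rest)                                # step 2, randomized
    groups, assigned = [], set()
    for slot in [first] + rest:
        pending = [i for i, row in table.items()
                   if row[slot] and i not in assigned]
        if rng is not None:
            rng.shuffle(pending)                         # step 3, randomized
        for row_id in pending:
            row = table[row_id]
            fits = [g for g in groups if not conflicts(g[1], row)]  # step 4
            if fits:                                     # step 5
                members, merged = rng.choice(fits) if rng is not None else fits[0]
                members.append(row_id)
                for s, busy in enumerate(row):           # step 6
                    if busy:
                        merged[s] = 1
            else:
                groups.append(([row_id], list(row)))     # new group needed
            assigned.add(row_id)
    return groups

# Made-up toy input, NOT the data from the question.
rows = {
    "A": [1, 1, 0, 0],
    "B": [0, 1, 1, 0],
    "C": [0, 0, 1, 1],
    "D": [1, 0, 0, 0],
    "E": [0, 0, 0, 1],
}
for members, merged in merge_rows(rows):
    print(members, merged)
# ['A', 'C'] [1, 1, 1, 1]
# ['D', 'B', 'E'] [1, 1, 1, 1]
```

Rows that are all 0s never show up in any column and are simply left out of the result; whether that matters depends on your data.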

Example with the faster implementation approach, which gives an optimal result here (there cannot be fewer than 7 rows, and there are only 7 rows in the result).

First 3 columns: No unassigned ids (all have 0). Skipped.

4th Column: Assigned id13 to the id9 group (13=>9). id9 looks like this now, showing that the group that started with id9 now also includes id13:

id9 :1,1,0,1,1,1,0,0,0,0 (+id13)

5th Column: 3=>1, 7=>5, 11=>8, 15=>12

Now it looks like:

id1 :1,1,1,0,1,0,0,0,0,0 (+id3)
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,1,1,1,0,0,0 (+id7)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0

We'll just quickly look at the next columns and see the final result:

7th Column: 2=>1, 10=>4
8th column: 6=>5
Last column: 14=>4

So the final result is:

id1 :1,1,1,0,1,0,1,1,1,1 (+id3,id2)
id4 :1,1,1,1,1,0,1,1,0,1 (+id10,id14)
id5 :0,1,1,1,1,1,1,1,1,1 (+id7,id6)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0

Conveniently, even this "simple" approach allowed us to assign the remaining ids to the original 7 groups without conflict. This is unlikely to happen in practice with, as you say, "500-1000" ids and up to 30 columns, but it is far from impossible. Roughly speaking, 500 / 30 is about 17 and 1000 / 30 is about 33. So I would expect you to be able to get down to roughly 10-60 rows, with about 15-45 being likely, but it's highly dependent on the data and a bit of luck.


In theory, the performance of this method is O(NT), where N is the number of individuals (ids) and T is the number of time slots (columns). It takes O(NT) to build the data structure (basically converting the table into a graph). After that, each column requires checking and assigning at most O(N) individual ids, though they might be checked multiple times. In practice, since T is roughly sqrt(N) and performance improves as you go through the algorithm (similar to quicksort), it is likely O(N log N) or O(N sqrt(N)) on average, though it's probably more accurate to use O(E), where E is the number of 1s (edges) in the table. Each edge likely gets checked and iterated over a fixed number of times, so that is probably a better indicator.

The NP-hard part comes into play in working out which ids to assign to which groups such that no new groups (rows) are created, or as few new groups as possible are created. I would run the "fast implementation" and the "random" approaches a few times and see how many extra rows (beyond the known minimum) you end up with, and whether it's a small amount.
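
To make "run it a few times and count the extra rows" concrete, here is a compact restart loop. It is a simplified sketch (random id order plus first fit, rather than the column-by-column procedure), the names lower_bound, random_first_fit and best_of are mine, and the table is a made-up toy with at least one row:

```python
import random

def lower_bound(rows):
    """Busiest column: no solution can use fewer merged rows than this."""
    return max(sum(col) for col in zip(*rows.values()))

def random_first_fit(rows, rng):
    """Place ids in random order into the first group they do not clash with."""
    order = list(rows)
    rng.shuffle(order)
    groups = []  # list of (member ids, merged 0/1 row)
    for rid in order:
        row = rows[rid]
        for members, merged in groups:
            if not any(a and b for a, b in zip(merged, row)):
                members.append(rid)
                merged[:] = [a | b for a, b in zip(merged, row)]
                break
        else:
            groups.append(([rid], list(row)))
    return groups

def best_of(rows, tries=100, seed=0):
    """Keep the best of several random runs; stop early at the lower bound."""
    rng = random.Random(seed)
    bound = lower_bound(rows)
    best = None
    for _ in range(tries):
        groups = random_first_fit(rows, rng)
        if best is None or len(groups) < len(best):
            best = groups
        if len(best) == bound:
            break  # provably optimal, no point in trying further
    return best, len(best) - bound

# Made-up toy input, NOT the data from the question.
rows = {
    "A": [1, 1, 0, 0],
    "B": [0, 1, 1, 0],
    "C": [0, 0, 1, 1],
    "D": [1, 0, 0, 0],
    "E": [0, 0, 0, 1],
}
best, extra = best_of(rows)
print(len(best), "rows,", extra, "above the lower bound")
```

If extra comes back as 0 you are done; if it stays small after many tries, the result is at least close to the busiest-column bound.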


This problem, contrary to some comments, is not NP-complete, due to the restriction that "There cannot be two separated sequences in a line". This restriction implies that each line can be considered to represent a single interval. In this case, the problem reduces to a minimum coloring of an interval graph, which is known to be solved optimally by a greedy approach. Namely, sort the intervals in descending order of their ending times, then process the intervals one at a time in that order, always assigning each interval to the first color (i.e., consolidated line) that it doesn't conflict with, or to a new color if it conflicts with all previously assigned colors.
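
A short Python sketch of that greedy, under a few assumptions of mine: each line is given as a half-open interval [start, end) with positive length, the function name color_intervals is invented, and the sample values are the start/duration lists from the MiniZinc model posted elsewhere in this thread.

```python
def color_intervals(intervals):
    """intervals: dict id -> (start, end), half-open. Returns a list of lines."""
    # Descending end time, as described above.
    order = sorted(intervals, key=lambda i: intervals[i][1], reverse=True)
    earliest = []  # earliest start already placed on each color
    lines = []     # ids per color (consolidated line)
    for rid in order:
        start, end = intervals[rid]
        for c, first_start in enumerate(earliest):
            # Everything already on color c ends at or after `end`, so there
            # is no overlap exactly when it all starts at or after `end`.
            if first_start >= end:
                earliest[c] = start
                lines[c].append(rid)
                break
        else:
            earliest.append(start)  # clashes with every color: open a new one
            lines.append([rid])
    return lines

start    = [0, 6, 4, 0, 1, 8, 4, 0, 0, 6, 4, 1, 3, 9, 4, 1]
duration = [3, 4, 1, 5, 3, 2, 3, 4, 2, 2, 1, 3, 3, 1, 6, 8]
intervals = {i + 1: (s, s + d) for i, (s, d) in enumerate(zip(start, duration))}

lines = color_intervals(intervals)
print(len(lines))  # 7
for line in lines:
    print(line)
```

Unlike a plain count, this also tells you which original lines end up merged together.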


Consider a constraint programming approach. Here is a question very similar to yours: Constraint Programming: Scheduling with multiple workers.

A very simple MiniZinc model could look like this (sorry, no MATLAB or R):

include "globals.mzn";

% Smaller test instance:
%int: jobs = 4;
int: jobs = 16;
set of int: JOB = 1..jobs;

% Start slot and length of each line's block of 1s (fixed input data).
%array[JOB] of int: start = [0, 6, 4, 0];
%array[JOB] of int: duration = [3, 4, 1, 4];
array[JOB] of int: start = [0, 6, 4, 0, 1, 8, 4, 0, 0, 6, 4, 1, 3, 9, 4, 1];
array[JOB] of int: duration = [3, 4, 1, 5, 3, 2, 3, 4, 2, 2, 1, 3, 3, 1, 6, 8];

% Number of consolidated lines ("machines"); never more than one per job.
var 1..jobs: machines;

% Each job needs one machine; at most `machines` jobs may run at the same time.
constraint cumulative(start, duration, [1 | j in JOB], machines);

solve minimize machines;

This model does not, however, tell which jobs are scheduled on which machines.
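
As a cross-check on the objective value: since the start times are fixed, the minimum capacity that cumulative can accept should simply be the largest number of jobs running at the same time. A small Python sweep (the function name peak_overlap is mine; intervals are treated as half-open, so a job ending at t does not clash with one starting at t) computes that number for the same data:

```python
def peak_overlap(start, duration):
    """Largest number of jobs active at the same time."""
    events = []
    for s, d in zip(start, duration):
        events.append((s, 1))       # job begins
        events.append((s + d, -1))  # job ends
    events.sort()  # at equal times (t, -1) sorts before (t, 1): ends first
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak

start    = [0, 6, 4, 0, 1, 8, 4, 0, 0, 6, 4, 1, 3, 9, 4, 1]
duration = [3, 4, 1, 5, 3, 2, 3, 4, 2, 2, 1, 3, 3, 1, 6, 8]
print(peak_overlap(start, duration))  # 7
```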

Edit:

Another option would be to transform the problem into a graph coloring problem. Let each line be a vertex in a graph. Create an edge for every pair of overlapping lines (lines whose 1-segments overlap). Find a minimum coloring of the graph; the number of colors it uses is the chromatic number. The vertices of each color then represent a combined line in the original problem.

Graph coloring is a well-studied problem. For larger instances, consider a local search approach using tabu search or simulated annealing.
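
A rough Python sketch of that transformation, with a plain largest-degree-first greedy coloring standing in for a real solver. Note the assumptions: conflict_graph and greedy_coloring are names I made up, the table is a toy example, and a greedy coloring only gives an upper bound on the chromatic number in general, not necessarily the minimum.

```python
from itertools import combinations

def conflict_graph(rows):
    """Vertices are line ids; an edge joins two lines that share a 1-slot."""
    adj = {rid: set() for rid in rows}
    for a, b in combinations(rows, 2):
        if any(x and y for x, y in zip(rows[a], rows[b])):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def greedy_coloring(adj):
    """Color vertices in order of decreasing degree with the smallest free color."""
    color = {}
    for v in sorted(adj, key=lambda v: len(adj[v]), reverse=True):
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color

# Made-up toy input, NOT the data from the question.
rows = {
    "A": [1, 1, 0, 0],
    "B": [0, 1, 1, 0],
    "C": [0, 0, 1, 1],
    "D": [1, 0, 0, 0],
    "E": [0, 0, 0, 1],
}
color = greedy_coloring(conflict_graph(rows))
print(color)  # {'A': 0, 'B': 1, 'C': 0, 'D': 1, 'E': 1}
```

All lines sharing a color can be OR-ed together into one combined line.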