Finding all combinations based on multiple conditions for a large list

It is a classical operations research problem.

There are tons of algorithms that permit to find an optimal (or just a very good depending on the algorithm) solution :

  • Mixed-Integer Programming
  • Metaheuristics
  • Constraint Programming
  • ...

Here is a code that will find the optimal solution using MIP, ortools library and default solver COIN-OR :

from ortools.linear_solver import pywraplp
import pandas as pd


solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)    
cyclist_df = pd.read_csv('cyclists.csv')

# Variables

variables_name = {}
variables_team = {}

for _, row in cyclist_df.iterrows():
    variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
    if row['Ploeg'] not in variables_team:
        variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))

# Constraints

# Link cyclist <-> team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, solver.infinity())
    constraint.SetCoefficient(var, 1)
    for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
        constraint.SetCoefficient(variables_name[cyclist], -1)

# Max 4 cyclist per team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, 4)
    constraint.SetCoefficient(var, 1)

# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
    constraint_max_cyclists.SetCoefficient(cyclist, 1)

# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
    constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])    

# Objective 
objective = solver.Objective()
objective.SetMaximization()

for _, row in cyclist_df.iterrows():
    objective.SetCoefficient(variables_name[row['Naam']], row['Punten totaal:'])

# Solve and retrieve solution     
solver.Solve()

chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]

print(cyclist_df[cyclist_df.Naam.isin(chosen_cyclists)])

Prints :

    Naam                Ploeg                       Punten totaal:  Waarde
1   SAGAN Peter         BORA - hansgrohe            522             11.5
2   GROENEWEGEN         Dylan   Team Jumbo-Visma    205             11.0
8   VIVIANI Elia        Deceuninck - Quick Step     273             9.5
11  ALAPHILIPPE Julian  Deceuninck - Quick Step     399             9.0
14  PINOT Thibaut       Groupama - FDJ              155             8.5
15  MATTHEWS Michael    Team Sunweb                 323             8.5
22  TRENTIN Matteo      Mitchelton-Scott            218             7.5
24  COLBRELLI Sonny     Bahrain Merida              238             6.5
25  VAN AVERMAET Greg   CCC Team                    192             6.5
44  STUYVEN Jasper      Trek - Segafredo            201             4.5
51  CICCONE Giulio      Trek - Segafredo            153             4.0
82  TEUNISSEN Mike      Team Jumbo-Visma            255             3.0
83  HERRADA Jesús       Cofidis, Solutions Crédits  255             3.0
104 NIZZOLO Giacomo     Dimension Data              121             2.5
123 MEURISSE Xandro     Wanty - Groupe Gobert       141             2.0
151 TRATNIK Jan Bahrain Merida                      87              1.0

How does this code solve the problem ? As @KyleParsons said, it looks like the knapsack problem and can be modelized using Integer Programming.

Let's define variables Xi (0 <= i <= nb_cyclists) and Yj (0 <= j <= nb_teams).

Xi = 1 if cyclist n°i is chosen, =0 otherwise

Yj = n where n is the number of cyclists chosen within team j

To define the relation between those variable, you can model these constraints :

# Link cyclist <-> team
For all j, Yj >= sum(Xi, for all i where Xi is part of team j)

To select only 4 cyclists per team max, you create these constraints :

# Max 4 cyclist per team
For all j, Yj <= 4

To select 16 cyclists, here are the associated constraints :

# Min 16 cyclists 
sum(Xi, 1<=i<=nb_cyclists) >= 16
# Max 16 cyclists 
sum(Xi, 1<=i<=nb_cyclists) <= 16

The cost constraint :

# Max cost 
sum(ci * Xi, 1<=i<=n_cyclists) <= 100 
# where ci = cost of cyclist i

Then you can maximize

# Objective
max sum(pi * Xi, 1<=i<=n_cyclists)
# where pi = nb_points of cyclist i

Notice that we model the problem using linear objective and linear inequation constraints. If Xi and Yj would be continous variables, this problem would be polynomial (Linear programming) and could be solved using :

  • Interior point methodes (polynomial solution)
  • Simplex (non polynomial but more effective in practice)

Because these variables are integers (Integer Programming or Mixed Integer Programming), the problem is known as be part of NP_complete class (cannot be solved using polynomial solutions unless you are a genious). Solvers like COIN-OR use complex Branch & Bound or Branch & Cut methods to solve them efficiently. ortools provides a nice wrapper to use COIN with python. These tools are free & open source.

All these methods have the advantage of finding an optimal solution without iterating on all the possible solutions (and considerably reduce the combinatorics).


I add an other answer for your question :

The CSV I posted was actually modified, my original one also contains a list for each rider with their score for each stage. This list looks like this [0, 40, 13, 0, 2, 55, 1, 17, 0, 14]. I am trying to find the team that performs the best overall. So I have a pool of 16 cyclists, from which the score of 10 cyclists counts towards the score of each day. The scores for each day are then summed to get a total score. The purpose is to get this final total score as high as possible.

If you think I should edit my first post please let me know, I think that it is more clear like this because my first post is quite dense and answers the initial question.

Let's introduce a new variable :

Zik = 1 if cyclist i is selected and is one of the top 10 in your team on day k

You need to add these constraints to link variables Zik and Xi (variable Zik cannot be = 1 if cyclist i is not selected i.e if Xi = 0)

For all i, sum(Zik, 1<=k<=n_days) <= n_days * Xi

And these constraints to select 10 cyclists per day :

For all k, sum(Zik, 1<=i<=n_cyclists) <= 10

Finally, your objective could be written like this :

Maximize sum(pik * Xi * Zik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k

And here is the thinking part. An objective written like this is not linear (notice the multiplication between the two variables X and Z). Fortunately, there are both binaries and there is a trick to transform this formula to its linear form.

enter image description here

Let's introduce again new variables Lik (Lik = Xi * Zik) to linearize the objective.

The objective can now be written like this and be linear :

Maximize sum(pik * Lik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k

And we need now to add these constraints to make Lik equal to Xi * Zik :

For all i,k : Xi + Zik - 1 <= Lik
For all i,k : Lik <= 1/2 * (Xi + Zik)

And voilà. This is the beauty of mathematics, you can model a lot of things with linear equations. I presented advanced notions and it is normal if you don't understand them at first glance.


I simulated the score per day column on this file.

Here is the Python code to solve the new problem :

import ast
from ortools.linear_solver import pywraplp
import pandas as pd


solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
cyclist_df = pd.read_csv('cyclists_day.csv')
cyclist_df['Punten_day'] = cyclist_df['Punten_day'].apply(ast.literal_eval)

# Variables
variables_name = {}
variables_team = {}
variables_name_per_day = {}
variables_linear = {}

for _, row in cyclist_df.iterrows():
    variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
    if row['Ploeg'] not in variables_team:
        variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))

    for k in range(10):
        variables_name_per_day[(row['Naam'], k)] = solver.IntVar(0, 1, 'z_{}_{}'.format(row['Naam'], k))
        variables_linear[(row['Naam'], k)] = solver.IntVar(0, 1, 'l_{}_{}'.format(row['Naam'], k))

# Link cyclist <-> team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, solver.infinity())
    constraint.SetCoefficient(var, 1)
    for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
        constraint.SetCoefficient(variables_name[cyclist], -1)

# Max 4 cyclist per team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, 4)
    constraint.SetCoefficient(var, 1)

# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
    constraint_max_cyclists.SetCoefficient(cyclist, 1)

# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
    constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])

# Link Zik and Xi
for name, cyclist in variables_name.items():
    constraint_link_cyclist_day = solver.Constraint(-solver.infinity(), 0)
    constraint_link_cyclist_day.SetCoefficient(cyclist, - 10)
    for k in range(10):
        constraint_link_cyclist_day.SetCoefficient(variables_name_per_day[name, k], 1)

# Min/Max 10 cyclists per day
for k in range(10):
    constraint_cyclist_per_day = solver.Constraint(10, 10)
    for name in cyclist_df.Naam:
        constraint_cyclist_per_day.SetCoefficient(variables_name_per_day[name, k], 1)

# Linearization constraints 
for name, cyclist in variables_name.items():
    for k in range(10):
        constraint_linearization1 = solver.Constraint(-solver.infinity(), 1)
        constraint_linearization2 = solver.Constraint(-solver.infinity(), 0)

        constraint_linearization1.SetCoefficient(cyclist, 1)
        constraint_linearization1.SetCoefficient(variables_name_per_day[name, k], 1)
        constraint_linearization1.SetCoefficient(variables_linear[name, k], -1)

        constraint_linearization2.SetCoefficient(cyclist, -1/2)
        constraint_linearization2.SetCoefficient(variables_name_per_day[name, k], -1/2)
        constraint_linearization2.SetCoefficient(variables_linear[name, k], 1)

# Objective 
objective = solver.Objective()
objective.SetMaximization()

for _, row in cyclist_df.iterrows():
    for k in range(10):
        objective.SetCoefficient(variables_linear[row['Naam'], k], row['Punten_day'][k])

solver.Solve()

chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]

print('\n'.join(chosen_cyclists))

for k in range(10):
    print('\nDay {} :'.format(k + 1))
    chosen_cyclists_day = [name for (name, day), variable in variables_name_per_day.items() 
                       if (day == k and variable.solution_value() > 0.5)]
    assert len(chosen_cyclists_day) == 10
    assert all(chosen_cyclists_day[i] in chosen_cyclists for i in range(10))
    print('\n'.join(chosen_cyclists_day))

Here are the results :

Your team :

SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
BENOOT Tiesj
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro
GRELLIER Fabien

Selected cyclists per day

Day 1 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús

Day 2 :
SAGAN Peter
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
NIZZOLO Giacomo
MEURISSE Xandro

Day 3 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
MATTHEWS Michael
TRENTIN Matteo
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús

Day 4 :
SAGAN Peter
VIVIANI Elia
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús

Day 5 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
CICCONE Giulio
HERRADA Jesús

Day 6 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike

Day 7 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro

Day 8 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo
MEURISSE Xandro

Day 9 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
TEUNISSEN Mike
HERRADA Jesús

Day 10 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
PINOT Thibaut
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo

Let's compare the results of answer 1 and answer 2 print(solver.Objective().Value()):

You get 3738.0 with the first model, 3129.087388325567 with the second one. The value is lower because you select only 10 cyclists per stage instead of 16.

Now if keep the first solution and use the new scoring method, we get 3122.9477585307413

We could consider that the first model is good enough : we didn't have to introduce new variables/constraints, model stays simple and we got a solution almost as good as the complex model. Sometimes it is not necessary to be 100% accurate and a model can be solved more easily and quickly with some approximations.