Configuration management: push versus pull based topology

Solution 1:

In case it is of interest to anyone, I can at least give a user experience report, having made my first use of Ansible's out-of-the-box push capability for patch management of multi-host, mission-critical systems in the Amazon cloud. To explain my preconceptions or biases: I prefer Ruby at the automation scripting level, and in the past I have set up projects to use master-agent Puppet configuration per project Vpc. So my experience runs counter to any past prejudices I may have had.

My recent experience was very favourable to dynamic push onto a changing estate of anywhere from dozens to many hundreds of servers, which can scale up or down, be terminated and be refreshed. In my situation a simple Ansible 1.7 ad hoc command was all that I needed to make the patch. However, given how effective it was to set up an AnsibleController (on a t2.micro) per Vpc for the purpose, I intend to extend the technique to more complex requirements in future.

So let me return to the question asked in this thread: pros and cons of push in a dynamically changing estate.

The assumptions about the kind of server estate I targeted were:

  • No assumption that IP addresses or Amazon-generated local hostnames would be long lasting - they can both come and go
  • All instances were created from machine images which already allowed ssh access by a single privileged administrative user
  • Servers would be distinguished, and potentially partitioned into groups by function or by stage of development (e.g. test or prod), through launch-specific Amazon tags with agreed conventional names
  • Linux and Windows servers would be patched separately, with different ad hoc commands, so simply allowing Linux-specific logins to fail when contacting a Windows server was perfectly acceptable

With these conditions in mind, it is very simple to create a machine image of an AnsibleController, drop it into numerous Vpcs and configure it (with credentials) in situ within the existing server accounts. Each instance created from the image automates two things:

  1. A cron job to push the patch to running servers at regular intervals, so that the whole estate is reached continually
  2. A way of computing the Ansible inventory at every such interval.

The second item can be made relatively sophisticated if needed (via the Info structure of the Ansible inventory). But if sophistication is not needed, here is a very straightforward example of a script that lists the private IP addresses of all running Amazon EC2 instances at each cron interval and directs the results into an appropriate inventory file (e.g. /etc/ansible/hosts) …

#!/bin/bash
# Assumes aws-cli/1.3.4 Python/2.6.9 Linux/3.4.73-64.112.amzn1.x86_64 or greater
# http://aws.amazon.com/releasenotes/8906204440930658
# To check the installed version: yum list aws-cli
# Assumes that the server is equipped with AWS keys and is able to access some or all
# instances in the account in which it is running.
# Prints a list of host IPs, each on a separate line.
# If an argument is passed, it is treated as the filename (relative or absolute
# path) to which the list is written.

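function list-of-ips {
    # instance-state-code 16 means "running". JSON output is requested explicitly
    # because the grep/awk steps depend on it. awk strips quotes and commas from the
    # second field and skips the first matching line; sort -u then removes duplicates
    # even when they are not adjacent (plain uniq would miss those).
    /usr/bin/aws ec2 describe-instances --output json --filters '[ {"Name": "instance-state-code", "Values": [ "16" ] } ]' | grep -w PrivateIpAddress | awk  '{x=$2; gsub("\"","", x); gsub(",","", x); if(x && FNR!=1){print x;}}' | sort -u
}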

if [ -n "$1" ]; then
   list-of-ips > "$1"
else
   list-of-ips
fi
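
One thing to watch when a cron job redirects straight into the inventory file is that a failed AWS call leaves the file empty until the next run. A minimal sketch of a safer refresh follows, assuming the generator prints one IP per line on stdout; `refresh_inventory` is a hypothetical helper name, not part of Ansible or aws-cli.

```shell
#!/bin/bash
# Sketch only: refresh_inventory is a hypothetical helper.
# Usage: refresh_inventory <target-file> <generator-command> [args...]
refresh_inventory() {
    local target="$1"
    shift
    local tmp
    # Create the temporary file next to the target so the final mv is a
    # rename on the same filesystem.
    tmp="$(mktemp "${target}.XXXXXX")" || return 1

    # Run the generator; keep the old inventory if it fails or prints nothing.
    if "$@" > "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$target"
    else
        rm -f "$tmp"
        echo "inventory refresh failed; keeping existing $target" >&2
        return 1
    fi
}
```

With the script above saved under some path of your choosing, the cron job could call something like `refresh_inventory /etc/ansible/hosts /path/to/that/script` before running the ad hoc command. Note that mktemp creates the file readable only by its owner, so the permissions may need adjusting if another user runs Ansible.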

The only caveat for this use case is that the patch command must be idempotent, since the cron job applies it repeatedly to servers that have already been patched. It is worth testing this beforehand, as part of making sure that the patch does exactly what is intended.
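
To make the idempotence requirement concrete, here is a small sketch of a patch step that is safe to repeat. The helper name `ensure_line` and the example setting are illustrative assumptions, not taken from the patch described above.

```shell
#!/bin/bash
# Sketch only: ensure_line is a hypothetical helper illustrating idempotence.
# It appends a line to a file only when that exact line is not already present,
# so running it once or many times leaves the file in the same state.
ensure_line() {
    local line="$1" file="$2"
    touch "$file" || return 1
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

By contrast, a bare `echo "..." >> file` pushed every interval would add a duplicate line on each run. Ansible's built-in modules are generally designed to behave like the guarded version, whereas raw shell commands need this kind of check written by hand.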

So to sum up, I have illustrated a use case where dynamic push is effective against the goals I set. It is a repeatable solution (in the sense of being encapsulated in an image which can be rolled out in multiple accounts and regions). In my experience to date, the dynamic push technique is much easier to set up and get into action than the alternatives in the toolsets available to us at the moment.

Solution 2:

The problem with push based systems is that you have to have a complete model of the entire architecture on the central push node. You can't push to a machine that you don't know about.

It can obviously work, but it takes a lot of work to keep it in sync.

Using things like MCollective, you can convert Puppet and other CM tools into a push based system. Generally, it's trivial to convert a pull system to a push based one, but not always simple to go the other way.

There is also the question of organizational politics. A push based system puts all the control in the hands of the central admins. It can be very hard to manage complexity that way. I think the scaling issue is a red herring; either approach scales if you just look at the number of clients. In many ways push is easier to scale. However, dynamic configuration does more or less imply that you have at least a pull version of client registration.
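
As a rough illustration of what "a pull version of client registration" could look like, the sketch below has each host announce itself and lets the push node build its target list from recent announcements. The shared registry directory and both function names are hypothetical; in practice the announcement might be an API call or a cloud tag rather than a file.

```shell
#!/bin/bash
# Sketch only: the registry directory and both function names are hypothetical.

# Run on each managed host (e.g. at boot and from cron): record "I exist".
register_host() {
    local registry="$1" host="$2"
    mkdir -p "$registry" && touch "$registry/$host"
}

# Run on the push node: list hosts whose registration was refreshed within
# the last <max-age> minutes, one per line.
live_hosts() {
    local registry="$1" max_age_min="$2"
    find "$registry" -type f -mmin "-$max_age_min" -exec basename {} \; | sort
}
```

Hosts that stop refreshing their entry simply age out of the list, so the push node never needs a hand-maintained model of the estate.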

Ultimately, it's about which system matches the workflow and ownership in your organization. As a general rule, pull systems are more flexible.


Solution 3:

This is an old post, but interestingly enough history repeats itself.

Now embedded IoT devices need configuration management, and the infrastructure and network topology seem to be even more complex, with firewalls, NATs and even mobile networks in the mix.

The push or pull decision is again just as important, but the number of devices is even higher. When we developed our IoT embedded device configuration management tool qbee.io, we selected a pull-based approach with an agent founded on promise theory. That means the agent pulls configuration and converges autonomously to the desired state. The advantage is that configuration is actively maintained even if the master server is down, and the system does not need to track which device has received which configuration change. In addition, it is often difficult to know what the local network conditions for the device are, so we do not care until the device pings the server.

A further argument for a pull-based solution in the embedded case is the long lifecycle of these devices. If a device fails and is replaced by a spare (e.g. on an oil rig), the spare will immediately receive the configuration for its specific group and converge towards it. If, for example, ssh keys are rotated every 6 months for security reasons, then the most recent valid key for the spare device group will automatically be applied.
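
The pull-and-converge idea can be sketched in a few lines of shell. This is only an illustration of the general pattern, not how qbee.io or any particular agent is implemented; the function name is hypothetical, and local files stand in for the server copy, the agent's cache and the live configuration.

```shell
#!/bin/bash
# Sketch only: pull_and_converge is a hypothetical function. A real agent would
# fetch the desired state over the network instead of reading a local file.
# Arguments: <server-copy> <local-cache> <live-config>
pull_and_converge() {
    local server="$1" cache="$2" live="$3"
    # Pull: refresh the local cache when the server copy is reachable.
    if [ -r "$server" ]; then
        cp "$server" "$cache" || return 1
    fi
    # No cached state yet means there is nothing to converge towards.
    [ -r "$cache" ] || { echo "no-state"; return 0; }
    # Converge: repair the live config only when it differs from the cache.
    if [ -f "$live" ] && cmp -s "$cache" "$live"; then
        echo "ok"
    else
        cp "$cache" "$live" && echo "changed"
    fi
}
```

Because the agent converges against its cached copy, drift is still repaired while the server is unreachable, and a freshly installed spare device picks up the current state on its first successful pull.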

It will be interesting to see how this discussion develops over the years, especially with containers and disposable infrastructure as an alternative to systems that maintain configuration over a longer period of time.