Thoughts on setting up Orchestrator in a Production environment



There have been several posts on setting up and using orchestrator.  Most of these are quite simple and do not discuss in detail some of the different choices you may want to consider when setting up orchestrator in a real production environment. As I have been using orchestrator for some time I thought it would be good to discuss how a real production orchestrator setup might be achieved. So here are some thoughts which I would like to share.


The basics of setting up orchestrator are to set up the orchestrator app and configure it to write to a backend MySQL database.

Configuration requires telling orchestrator how to find the MySQL instances you want to monitor and perhaps to forget old servers that are no longer being used. For a small setup you may be happy to do this by hand but adding automation hooks to the provisioning or decommissioning process of MySQL hosts can come in handy. You have the choice of using the command line:

or using the http interface and URLs like

according to which method is easiest to set up. Note: discovery of new servers in an existing replication chain should not be necessary, as orchestrator will normally be able to figure this out on its own.
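As a rough illustration of the automation hooks mentioned above, the sketch below builds discover and forget URLs that a provisioning or decommissioning script could call. The base address and the `/api/<action>/<host>/<port>` path layout are assumptions made for this example, so check them against the API of the orchestrator version you run.

```python
from urllib.parse import quote


def instance_url(base_url: str, action: str, host: str, port: int = 3306) -> str:
    """Build an orchestrator API URL for a single MySQL instance.

    The "/api/<action>/<host>/<port>" layout is an assumption for this
    sketch; verify it against your orchestrator version.
    """
    if action not in ("discover", "forget"):
        raise ValueError(f"unsupported action: {action!r}")
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"{base_url.rstrip('/')}/api/{action}/{quote(host, safe='')}/{port}"


if __name__ == "__main__":
    base = "http://orchestrator.example.com:3000"  # hypothetical address
    # Provisioning hook: tell orchestrator about a new server.
    print(instance_url(base, "discover", "db-new-01.example.com"))
    # Decommissioning hook: make orchestrator forget an old server.
    print(instance_url(base, "forget", "db-old-07.example.com", 3307))
```

Keeping the URL construction in one small function makes it easy to call from whatever tooling already drives host provisioning.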

Handling Master or Intermediate master failover

Failover behaviour also needs to be configured. While orchestrator is able to adjust the replication topology if a master or intermediate master fails, additional external tasks are sometimes needed to complete the failover process, and this is more so for a primary master. It may also be useful to notify appropriate people or systems before and after the failover is dealt with.

This is handled in orchestrator.conf.json with the following settings:

which will run scripts on the active orchestrator node to achieve the desired configuration changes. You can use these hooks to do tasks such as:

  • notify people or systems of the issue that’s been seen
  • change the configuration of external systems which need to be aware of a master or intermediate master failure
  • tell the applications of the failure and where to find the new master

All of these tasks will be specific to your environment but there’s plenty of freedom here to hook orchestrator in even if it is not directly aware of “the outside world”.
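A minimal sketch of what such a hook script could look like, assuming the hook command line passes the failure type, the failed host and (optionally) the promoted host as positional arguments. That argument order and the script name `notify.py` are hypothetical conventions for this example, not something orchestrator prescribes.

```python
import sys


def build_message(failure_type: str, failed_host: str, successor_host: str = "") -> str:
    """Format a one-line notification for a failover hook."""
    message = f"[orchestrator] {failure_type} on {failed_host}"
    if successor_host:
        message += f"; promoted {successor_host}"
    return message


if __name__ == "__main__":
    # Hypothetical invocation from a hook:
    #   notify.py FAILURE_TYPE FAILED_HOST [SUCCESSOR_HOST]
    args = sys.argv[1:4]
    if len(args) >= 2:
        # A real hook would send this to chat, email or a paging system.
        print(build_message(*args))
    else:
        print("usage: notify.py FAILURE_TYPE FAILED_HOST [SUCCESSOR_HOST]", file=sys.stderr)
```

The same message builder can be reused for the "before" and "after" hooks, with the successor only known in the latter case.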

Selection of servers to be eligible [intermediate] masters

You may have special servers, such as those used for testing, or located in a different part of your network, which you do not want to promote to be a master or intermediate master. Orchestrator allows you to indicate this with settings such as

This works pretty well and covers almost all situations where certain servers need special handling.
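To illustrate the idea of excluding servers by hostname, here is a minimal sketch that assumes the exclusion is expressed as a list of regular expressions matched against hostnames (orchestrator has a `PromotionIgnoreHostnameFilters` setting along these lines, but check its exact semantics in your version). The patterns and hostnames are hypothetical, and the function mimics the concept rather than orchestrator's own matching code.

```python
import re


def promotable(hosts, ignore_patterns):
    """Return the hosts that match none of the ignore patterns."""
    compiled = [re.compile(pattern) for pattern in ignore_patterns]
    return [host for host in hosts if not any(rx.search(host) for rx in compiled)]


if __name__ == "__main__":
    hosts = [
        "db1.prod.example.com",
        "db2.test.example.com",  # testing box, should never be promoted
        "db3.dc2.example.com",   # remote part of the network
    ]
    print(promotable(hosts, [r"\.test\.", r"\.dc2\."]))
```

Running a check like this against your inventory before changing the configuration is a cheap way to confirm that a pattern excludes only the servers you intend.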

Failover PromotionRules

For larger setups where there are more servers in the cluster you may prefer orchestrator to fail over to one or more specific servers, and there are some promotion rules you can apply to adjust which servers are preferred as candidates when a failure occurs.

Currently this is configured on a per MySQL instance basis by giving each instance one of the types Prefer, Neutral (the default value) or Must Not. (The code does have two other options, Must and Prefer Not, but these are not implemented.)

Configuration can be done on the command line via:

though here the configured default promotion rule (Prefer) is used. You can also use the http interface, where you can explicitly state the required promotion rule, using:

It is also possible to pull out the promotion rules as a bulk operation using the url:

This is convenient if you want to configure this centrally rather than individually on each MySQL instance.
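A sketch of what central configuration might look like: a single mapping of instance to promotion rule, turned into one http call per instance. The `/api/register-candidate/<host>/<port>/<rule>` path and the lowercase rule spellings (`prefer`, `neutral`, `must_not`) are assumptions for this example, and the hostnames are hypothetical.

```python
# The three rule types described above as implemented (spelling assumed).
IMPLEMENTED_RULES = ("prefer", "neutral", "must_not")


def register_candidate_url(base_url: str, instance: str, rule: str) -> str:
    """Build the URL that sets a promotion rule for one "host:port" instance."""
    if rule not in IMPLEMENTED_RULES:
        raise ValueError(f"unsupported promotion rule: {rule!r}")
    host, _, port = instance.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {instance!r}")
    return f"{base_url.rstrip('/')}/api/register-candidate/{host}/{port}/{rule}"


def plan(base_url: str, rules: dict) -> list:
    """Return the URLs to call, in a stable order, for a central rule map."""
    return [register_candidate_url(base_url, inst, rule)
            for inst, rule in sorted(rules.items())]


if __name__ == "__main__":
    central_rules = {
        "db1.example.com:3306": "prefer",
        "db9-test.example.com:3306": "must_not",
    }
    for url in plan("http://orchestrator.example.com:3000", central_rules):
        print(url)
```

Rejecting unknown rule names up front avoids silently sending one of the unimplemented types.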

High Availability Setup

If you really care about your MySQL servers not failing you probably also care about orchestrator itself not failing, so what can be done to make this service more reliable?

Orchestrator itself comprises two parts: the orchestrator application and the MySQL backend it writes to.

As far as the orchestrator app is concerned it is easy to configure more than one server. All apps use the same configuration and talk to the same MySQL backend database. They co-operate by writing to a common table in the backend database and electing a leader (or active node) which actively polls all known MySQL instances. The other nodes are running but otherwise idle. Should the elected leader stop working, another app will be chosen and take over the process of checking all MySQL instances. So setting up more than one app is very straightforward, and it is usually good to set up orchestrator app servers in the same locations or datacentres where your MySQL servers are running.
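The election idea can be sketched with an in-memory stand-in for the common table. This is only an illustration of lease-style leader election under assumed rules (a node keeps leadership while it keeps checking in, and another node takes over once the lease expires); it is not orchestrator's actual implementation, and the class and lease length are hypothetical.

```python
class ElectionTable:
    """In-memory stand-in for a shared election table in the backend database."""

    def __init__(self, lease_seconds: float = 10.0):
        self.lease_seconds = lease_seconds
        self.leader = None      # name of the current active node
        self.last_seen = 0.0    # time of the leader's last check-in

    def attempt(self, node: str, now: float) -> bool:
        """Try to become or remain leader; return True if `node` leads."""
        lease_expired = now - self.last_seen > self.lease_seconds
        if self.leader is None or self.leader == node or lease_expired:
            self.leader = node
            self.last_seen = now
            return True
        return False


if __name__ == "__main__":
    table = ElectionTable(lease_seconds=10)
    print(table.attempt("app-a", now=0))   # app-a becomes the active node
    print(table.attempt("app-b", now=5))   # app-b stays passive
    print(table.attempt("app-b", now=20))  # app-a went quiet, app-b takes over
```

With a real database the check and update would need to happen in a single atomic statement so that two nodes cannot both win.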

Once you have more than one orchestrator app running it is convenient to use some sort of load balancing technology to make orchestrator visible via a single URL. This process works quite nicely as normal usage of the GUI can work on any of the orchestrator nodes, even if the active monitoring only takes place on one of them. This is where it may be convenient to add an authentication and https layer, neither of which is handled directly by orchestrator but which can easily be added using something like nginx.


The web status page is very convenient as it shows you the apps which are running, their version and which node is the active node. You can see an example below on some testing servers I use:

orchestrator web status

Orchestrator’s handling of the backend MySQL server going away perhaps deserves a comment. Orchestrator has a backend database and expects it to be there, so configuring a single MySQL server as orchestrator’s backend is probably not ideal. Standard MySQL replication will give you a spare and I think that for most cases this is in practice good enough.

If the “orchestrator db” master fails it is unlikely that orchestrator will be able to fix this. The paranoid may like to consider using something like Galera, MySQL Cluster or even the new MySQL Group Replication (and InnoDB Cluster when it is released), but all that orchestrator really cares about is being able to write to a backend database so it can store state and use that state later. Additional auditing, logging, and history information is kept but none of this is critical, and write rates on the backend are generally low unless the number of instances you monitor is very high. So you can either adjust the orchestrator configuration to talk to a different MySQL host or, alternatively, have the configuration use a virtual IP or DNS CNAME, which gives you the flexibility to make quick changes without needing to adjust the orchestrator configuration itself.

While I use standard MySQL replication to provide a spare backend, I also keep a record of the MySQL instances (host:port), so even with a completely broken setup I can feed this information into an empty orchestrator configuration via the discovery interface and have orchestrator working again in a few seconds. A convenient URL is designed to simplify this task.
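A minimal sketch of replaying such a saved record, assuming it is a plain text file with one `host:port` per line (comments and blank lines allowed, port defaulting to 3306). The file format, the base address and the `/api/discover/<host>/<port>` path are assumptions for this example; IPv6 literals are not handled.

```python
def parse_instances(text: str) -> list:
    """Parse "host:port" lines into (host, port) tuples."""
    instances = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        host, sep, port = line.rpartition(":")
        if not sep:                          # no port given, assume default
            host, port = line, "3306"
        instances.append((host, int(port)))
    return instances


def discovery_urls(base_url: str, instances: list) -> list:
    """One discovery URL per saved instance (path layout is an assumption)."""
    base = base_url.rstrip("/")
    return [f"{base}/api/discover/{host}/{port}" for host, port in instances]


if __name__ == "__main__":
    saved = "db1.example.com:3306\n# old spare\ndb2.example.com:3307\ndb3.example.com\n"
    for url in discovery_urls("http://orchestrator.example.com:3000", parse_instances(saved)):
        print(url)
```

Because orchestrator normally works out the rest of a replication chain itself, feeding in one instance per cluster may already be enough, though replaying the full list is harmless.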

So all in all the HA setup is quite easy to get going. The good thing about it is that you can then upgrade any of the nodes just by stopping it, replacing the binaries and restarting, without having to worry about the “MySQL Failover Service” not being available.

People may wonder why this matters so much. If your setup is small then the chances of the master or intermediate master failing are also quite low. As your environment grows so does the chance of a failure occurring. I see failures, sometimes more than once a day, and prefer orchestrator to be running so I do not have to deal with these failures manually.

Monitoring Orchestrator

What’s required to monitor orchestrator? Basically you want to check that the orchestrator process is running and that the http web interface responds on each of the boxes individually (especially if you run several app servers).
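A minimal sketch of the per-box http check, using only the Python standard library. The node addresses are hypothetical, and which path you probe is up to you; the point is to hit each app server directly rather than going through the load balancer.

```python
from urllib.error import URLError
from urllib.request import urlopen


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status or malformed URL.
        return False


def check_nodes(urls, probe=http_ok) -> dict:
    """Probe every node individually and report which ones respond."""
    return {url: probe(url) for url in urls}


if __name__ == "__main__":
    nodes = [
        "http://orch-app-1.example.com:3000/",  # hypothetical app servers
        "http://orch-app-2.example.com:3000/",
    ]
    for url, healthy in check_nodes(nodes).items():
        print(f"{url}: {'OK' if healthy else 'FAILED'}")
```

The `probe` parameter exists so the check can be exercised without a network, or swapped for whatever your monitoring system already provides.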

Orchestrator itself also supports graphite and can provide you information on internal activity such as the number of successful or failed discovery processes (polling MySQL servers) and also read and write activity to the backend MySQL store. However if you’re not using graphite this is more tricky.

I have made some code changes to provide further, more detailed metrics on the time taken to poll and check each of the monitored MySQL servers. I had experienced some load issues due to the number of servers being monitored, and these timing metrics helped identify where to focus to fix this. The metrics are available via a raw http api call and, for simplicity, aggregate values can be retrieved for the last few seconds. This makes tying into any external monitoring system much easier.
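To show the kind of aggregation meant here, the sketch below summarises a list of per-instance poll times. The input format (a plain list of milliseconds) is hypothetical and is not orchestrator's actual metrics output; it only illustrates why aggregate values are easier to feed into an external monitoring system than raw samples.

```python
import math
from statistics import mean, median


def aggregate(samples_ms: list) -> dict:
    """Summarise poll timings (milliseconds) into a few headline numbers."""
    if not samples_ms:
        return {"count": 0}
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "count": len(ordered),
        "mean": mean(ordered),
        "median": median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical timings for the last few seconds of discovery polls.
    print(aggregate([12, 15, 11, 240, 14, 13]))
```

A high maximum or 95th percentile alongside a low median points at a few slow instances rather than a general load problem.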

Some of these patches have been passed back upstream on GitHub and further patches should arrive shortly. Adding these metrics allowed me to identify bottlenecks in orchestrator when monitoring a large number of systems and, together with colleagues, performance enhancements for this sort of situation have been fed back upstream.


I hope that this article helps provide a bit more insight into what might be worth thinking about when setting up orchestrator for the first time in a production environment. Feel free to contact me if more detail is needed or something is not clear enough.