Monitoring Inmanta Service Orchestrator

July 8, 2022 by Bart Vanbrabant

Inmanta Service Orchestrator takes on the responsibility of making sure the state of your infrastructure and services is always what you want it to be. We call it the desired state or intent. This goal is achieved through different methods, and there are knobs that you can turn and tune.

But how do you make sure your orchestrator is tuned?

Of course, we can manually look for these knobs and turn them if necessary, which is no fun. Or, as strong proponents of automation, we automate as much as possible to make our lives easier and to catch mistakes dead in their tracks.

Inmanta provides a set of Nagios checks for monitoring Inmanta Service Orchestrator. These checks can be used with various monitoring stacks like Nagios, Icinga, and Telegraf (TICK stack).

In this post, I will explain what to monitor and how.

Prerequisites

In order to demonstrate how to use the scripts for monitoring Inmanta Service Orchestrator, I’ll be using Telegraf as my monitoring agent. Of course, you can proceed with your existing monitoring solution or choose a stack that you are most comfortable with to achieve the same goal.

Before continuing, you have to set up Telegraf and Influxdb using their official documentation.

In the end, make sure Telegraf is able to communicate with Influxdb successfully.
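
A quick way to verify this, assuming a local Influxdb listening on its default port 8086 and Telegraf running as a systemd service (adjust the host, port, and service name to your setup):

curl -i http://localhost:8086/ping   # Influxdb replies with HTTP 204 No Content when it is up
systemctl status telegraf            # the agent should be active, with no write errors in its log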

  • Next, clone the repository that contains the check scripts on your Telegraf machine:
git clone https://github.com/inmanta/orchestrator_monitoring.git
  • Change into the repository directory:
cd orchestrator_monitoring/
  • Install the Python requirements:
pip3 install -r requirements.txt
  • Make the scripts executable:
chmod +x check_*.py
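
As a quick sanity check that the scripts and their dependencies are installed correctly, you can ask one of them for its usage information (assuming the scripts expose the usual --help flag; consult the repository if the arguments differ):

./check_env.py --help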

Environment settings

Each environment on the orchestrator has a number of settings. Some values are convenient for testing but not suited for production use (such as leaving protected_environment disabled).

These specific settings should be monitored:

  • auto_deploy
  • server_compile
  • purge_on_delete
  • push_on_auto_deploy
  • protected_environment
  • agent_trigger_method_on_auto_deploy

The meaning and intent of all these settings can be found in our documentation.

The check_env.py script checks the environment settings against the recommended values; you can also run it by hand first, as shown below. It takes two parameter values:

  1. base_url: The URL of your orchestrator, e.g. http://172.30.0.3:8888/
  2. env: The environment name you want to monitor
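
Because Telegraf reads the output as data_format = "nagios", the script is expected to behave like a standard Nagios plugin: it prints a status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A manual test run, using the same example values as in the configuration below, could look like this:

./check_env.py --base_url http://172.30.0.3:8888/ --env dev
echo $?   # the exit code tells you which state the check reported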

In order to run this script and retrieve the information, add the following lines to Influxdb’s Load Data > Telegraf > system configurator file:

[[inputs.exec]]
  commands = ["/home/orchestrator_monitoring/check_env.py --base_url http://172.30.0.3:8888/ --env dev"]
  timeout = "5s"
  name_suffix = "_inmanta_env"
  data_format = "nagios"

This will pull in the results, and you can query them in the Data Explorer tab, like the image below:

Ultimately you can save this as a dashboard cell, task or variable and create an alert, which I will demonstrate at the end of this post.

You can repeat this procedure for your other environments or orchestrators.

Agents status

Agents carry an important responsibility: when a deployment agent is down or paused for maintenance, it cannot deploy resources, which can cause service provisioning to get stuck.

The check_agents.py script checks the status of all agents inside an environment. It takes two parameter values:

  1. base_url: The URL of your orchestrator, e.g. http://172.30.0.3:8888/
  2. env_id: The environment ID you want to monitor (see the note below on how to look it up)
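
If you do not know the environment ID, the easiest place to find it is the web console, where it is part of the URL when you open an environment. You can also list environments through the REST API; a hedged sketch, assuming the v2 endpoint /api/v2/environment and an orchestrator without authentication (check the API reference for your version, and add your token if authentication is enabled):

curl http://172.30.0.3:8888/api/v2/environment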

In order to run this script and retrieve the information, add the following lines to Influxdb’s Load Data > Telegraf > system configurator file:

[[inputs.exec]]
  commands = ["/home/orchestrator_monitoring/check_env.py --base_url http://172.30.0.3:8888/ --env dev"]
  timeout = "5s"
  name_suffix = "_inmanta_env"
  data_format = "nagios"

In case an agent is down or paused, we will get a result similar to the image below:

Failed compiles

If the compiler is unable to compile, new services cannot be created, updated, or deleted. This is a severe condition that should not occur in production and requires immediate investigation.

Let’s also check for failed compiles in the past 24 hours using check_compiles.py by adding another entry in the system configurator file.

This check takes two parameter values:

  1. base_url: The URL of your orchestrator, e.g. http://172.30.0.3:8888/
  2. env_id: The environment ID you want to monitor
[[inputs.exec]]
  commands = ["/home/orchestrator_monitoring/check_compiles.py --base_url http://172.30.0.3:8888/ --env_id 977f6148-86af-4975-9fb2-fb684aefffc3"]
  timeout = "5s"
  name_suffix = "_inmanta_compiles"
  data_format = "nagios"

In case there are any compilation failures, we will see an error like in the previous steps, otherwise the status will be OK:

Failed resources

Sometimes, individual resources cannot be deployed, e.g. because the credentials used by the orchestrator have expired. When the orchestrator runs unsupervised, it is important to have automated alerting when this happens.

Next, let’s gather all the failed resources using check_resources.py by adding another entry in the system configurator file.

This check takes two values:

  1. base_url: The URL of your orchestrator, e.g. http://172.30.0.3:8888/
  2. env_id: The environment ID you want to monitor
[[inputs.exec]]
  commands = ["/home/orchestrator_monitoring/check_resources.py --base_url http://172.30.0.3:8888/ --env_id 977f6148-86af-4975-9fb2-fb684aefffc3"]
  timeout = "5s"
  name_suffix = "_inmanta_resources"
  data_format = "nagios"

This check will return the URL of any failed resources for further analysis as well.

Failed services

Service instances can also go into a failure mode. Usually, this is picked up by the OSS/BSS system as well, but it is good to also have an alert for this.

Using check_services.py we can add another entry in the system configurator file to that end.

This check takes two values:

  1. base_url: The URL of your orchestrator, e.g. http://172.30.0.3:8888/
  2. env_id: The environment ID you want to monitor
[[inputs.exec]]
  commands = ["/home/orchestrator_monitoring/check_services.py --base_url http://172.30.0.3:8888/ --env_id 977f6148-86af-4975-9fb2-fb684aefffc3"]
  timeout = "5s"
  name_suffix = "_inmanta_services"
  data_format = "nagios"

This check retrieves service instances whose lifecycle state is labeled as danger. If it finds any, we get the service name, its state, and the diagnostics URL.

Putting it all together

Now that we have all the required checks, let's define an alert for the agent check:

In Influxdb’s WebUI, head to the Alerts page, then click Create and choose Threshold Check:

Define your query similar to the image below:
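
In the query builder you are essentially selecting the state field that Telegraf's nagios parser produces: the plugin exit code is stored as a state field (0 to 3) in a nagios_state measurement, which the name_suffix from the agent check turns into nagios_state_inmanta_agents. For reference, a roughly equivalent query from the influx CLI could look like this (a sketch only: the bucket name telegraf is an assumption, and the CLI must already be configured with your organisation and token):

influx query 'from(bucket: "telegraf")
  |> range(start: -15m)
  |> filter(fn: (r) => r._measurement == "nagios_state_inmanta_agents" and r._field == "state")
  |> last()'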

Next, click on Configure check to define the thresholds:

After saving the configuration by clicking the blue checkmark in the top-right corner, head to the Alert History tab, where you can view all current alerts:

From here, you can take any approach you like to consolidate these alerts and have the team in charge respond to them.

Conclusion

Monitoring your infrastructure, and especially the system/orchestrator that manages it, is essential to ensure smooth and uninterrupted operation.

The examples showcased in this post demonstrate the toolset available to our clients for monitoring Inmanta Service Orchestrator; they can be tailored to your specific environment and use cases.

There is also another script that runs all the checks discussed in this post against your orchestrator and prints the results in the terminal. However, it is not compliant with the Nagios format and cannot be used with Influxdb; it is provided purely as a reference.
