Throughout the Distribt course, we have seen that we are going to create microservices to perform different actions within a larger system.
In the post about the SAGA pattern, we saw how a simple transaction in the monolithic world becomes very complex with microservices, and I mentioned how important it is to have a clear picture of everything that has happened in our system, and when it happened.
1 - Observability in our distributed system
As I just mentioned, it is important to know when and how an event has occurred in our system. There are three main ways to do this.
- Logs: These tell us about a specific error or a specific event in the system. We covered them in a post a while ago.
- Metrics: These tell us the raw statistics about the system. For example, a metric could be the load time of a particular page on our site or how many times it has loaded.
- Traces: The context for why things happen. The journey of an action throughout our system, from start to finish. For example, consider the saga use case for creating an order; each event would have a property called “traceId”, and if an error occurs, we only need to search for that ID in our monitoring system to find all related events.
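To make the traceId idea concrete, here is a minimal console sketch using .NET's built-in `Activity` API, which OpenTelemetry builds on. The `ActivitySource` name and the `CreateOrder` operation are made up for the example:

```csharp
using System;
using System.Diagnostics;

public static class TraceIdDemo
{
    // Hypothetical source name; in the real saga every service would join the same trace.
    private static readonly ActivitySource Source = new("Distribt.Orders");

    public static void Main()
    {
        // Without a listener StartActivity returns null; in a real app OpenTelemetry registers one for us.
        ActivitySource.AddActivityListener(new ActivityListener
        {
            ShouldListenTo = _ => true,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData
        });

        using Activity? activity = Source.StartActivity("CreateOrder");

        // Every event published while this activity is open carries the same TraceId,
        // so searching for it in the monitoring system returns the whole journey.
        Console.WriteLine($"traceId: {activity?.TraceId}");
    }
}
```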
2 - What is OpenTelemetry?
When we have a distributed system, we have multiple applications, and these applications may all be written in the same programming language or in different languages.
When building systems, collecting metrics is crucial so we can detect when something is slow or failing before it impacts the user.
Before OpenTelemetry, each language generated metrics differently, and every service you used wanted to receive them in a different way. For example, New Relic wanted to receive data in format A and Prometheus wanted them in format B, etc.
This caused several problems: having to implement the same thing multiple times, or being unable to switch providers without rewriting everything from scratch.
This is where OpenTelemetry comes in. It standardizes the way applications send metrics, so that regardless of the programming language, the information is generated in the same way.
As you can see in the image, there is a component called the OpenTelemetry Collector, which is basically the service that collects, processes, and exports this data, since OpenTelemetry is not a backend application but rather a standard that provides APIs and SDKs for different languages, as well as a way to export the information.
By the way, it’s not in the image, but a virtual machine, CI/CD pipeline, or other services in your infrastructure can also generate metrics.
2.1 - OpenTelemetry Collector
The OpenTelemetry Collector is an application or service that we deploy alongside our own applications so that they can send their telemetry data (metrics, traces, and logs) to it.
There are two ways to deploy the collector:
- Agent: The collector runs alongside the application, on the same host. This is usually done with a sidecar or a DaemonSet if you use k8s, etc.
- Gateway: The collector is deployed as a regular service.
You can find more information on the official website.
- Note: In our example, we will see it deployed in Docker as a service.
3 - Visualizing metrics
So far, we've only been exporting the information, but we aren’t doing anything with it. We need to process and store it so we can later visualize it—this is where Prometheus, Grafana, and Zipkin come in. That said, you can use whichever services you want or those provided by your cloud provider; the idea remains the same.
3.1 - What is Prometheus?
The first thing we need is a backend to store the information, process the data, and act as a database.
This is where Prometheus comes in—a time-series database that acts as the storage for our monitoring systems and metrics.
And its interface is nothing special, in fact it's a bit basic:
Note: We can send data directly from the application to Prometheus without passing through OpenTelemetry; as always, it depends on the infrastructure you want to implement.
3.2 - What is Grafana?
Once we have the information stored and processed, we’ll want to see it—but not just as plain text, but with cool graphs and tables. This is where Grafana comes in: it’s the user interface that lets us see the metrics.
As you can see, I’m not great at making graphs, but the data is there to create some cool charts.
3.3 - What is Zipkin?
Much like Grafana gives us charts and diagrams about our system’s or application's metrics, Zipkin lets us see the traceability of the calls, where each call goes, and where it takes more or less time.
NOTE: In this post I won’t show how to add custom traces or custom metrics, since that’s out of scope and would be too long. I will show these actions in future standalone posts.
4 - Do we need OpenTelemetry?
Now that I’ve explained that OpenTelemetry is a standard and that the real work of storing, processing, and displaying the data is done by Prometheus and Grafana (or other services), the question is obvious.
Should we implement OpenTelemetry in our services?
Well, it depends a bit—like everything, there are always pros and cons.
A pro is that if you switch providers, for example migrating from Prometheus and Grafana to New Relic, you won’t need to change the code.
But at the same time, New Relic understands Prometheus output. This means that if you export to Prometheus and switch to New Relic, you don't have to change anything either.
- Note: I don’t know how other services behave; these are the ones I have experience with.
A con is that, for OpenTelemetry to work, we need to have our service/sidecar configured, which requires additional work both in maintaining the service and in the system resources we need to assign.
So we have two options: go the "quick" route and send the data from the application straight to the backend system (Prometheus), or use an intermediary that normalizes the data (OpenTelemetry).
As always, it depends on the resources you want to invest.
5 - Configure OpenTelemetry in .NET
For this example, I'm going to use OpenTelemetry. This means our applications will use the OpenTelemetry collector, and Prometheus will connect to it to read and process the information.
5.1 - Creating the infrastructure for observability
The first thing we’ll do is set up all the configuration needed for our system. We'll go to our docker compose file and add the containers for OpenTelemetry, Prometheus, Grafana, and Zipkin:

```yaml
opentelemetry-collector:
  image: otel/opentelemetry-collector:latest
  container_name: open_telemetry_collector
  command: [ "--config=/etc/otel-collector-config.yaml" ]
  volumes:
    - ./tools/telemetry/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    - ./tools/telemetry/logs:/etc/output:rw # Store the logs (not committed in git)
  ports:
    - "8888:8888" # Prometheus metrics exposed by the collector
    - "8889:8889" # Prometheus exporter metrics
    - "4317:4317" # OTLP gRPC receiver

prometheus:
  image: bitnami/prometheus
  container_name: prometheus
  volumes:
    - ./tools/telemetry/prometheus.yaml:/etc/prometheus/prometheus.yml
  ports:
    - 9090:9090

grafana:
  image: grafana/grafana
  container_name: grafana
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=admin
    - GF_USERS_ALLOW_SIGN_UP=false
  volumes:
    - ./tools/telemetry/grafana_datasources.yaml:/etc/grafana/provisioning/datasources/all.yaml
  ports:
    - 3000:3000

zipkin:
  container_name: zipkin-traces
  image: openzipkin/zipkin:latest
  ports:
    - "9411:9411"
```
As you can see, inside the OpenTelemetry container we are using a configuration file called `otel-collector-config.yaml`, which contains configuration such as the available protocols and where the collected information will be exported:
```yaml
# https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver
receivers:
  otlp:
    protocols:
      grpc:

# Configure exporters
exporters:
  # Export prometheus endpoint
  prometheus:
    endpoint: "0.0.0.0:8889"
  # log to the console
  logging:
  # Export to zipkin
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"
    format: proto
  # Export to a file
  file:
    path: /etc/output/logs.json

# https://opentelemetry.io/docs/collector/configuration/#processors
processors:
  batch:

# https://opentelemetry.io/docs/collector/configuration/#service
# https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/design.md#pipelines
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, zipkin]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]
    logs:
      receivers: [otlp]
      processors: []
      exporters: [logging, file]
```
This is the Prometheus config file (`prometheus.yaml`). As we can see, we are telling it to scrape the OpenTelemetry collector container to receive the data:
```yaml
scrape_configs:
  - job_name: 'collect-metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['opentelemetry-collector:8889']
      - targets: ['opentelemetry-collector:8888']
```
And finally, the Grafana config file, where we specify Prometheus as a data source and its location:

```yaml
datasources:
  - name: 'prometheus'
    type: 'prometheus'
    access: 'proxy'
    url: 'http://prometheus:9090'
```
- Note: The username and password to connect to Grafana is `admin:admin`.
Now you can run `docker-compose up` to start our infrastructure.
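Once the containers are running, you can check that everything is reachable in the browser: Prometheus at http://localhost:9090, Grafana at http://localhost:3000, and Zipkin at http://localhost:9411 (the ports we mapped in the docker compose above).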
5.2 - Implementing OpenTelemetry in C# code
The first thing we need to do is add the following packages:
- OpenTelemetry.Exporter.OpenTelemetryProtocol
- OpenTelemetry.Extensions.Hosting (prerelease)
- OpenTelemetry.Instrumentation.AspNetCore (prerelease)
If instead of the `Exporter.OpenTelemetryProtocol` package we install the `Exporter.Prometheus` package, we can send the information directly to Prometheus without passing through the collector.
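For reference, here is a minimal sketch of that direct approach. I'm assuming the `OpenTelemetry.Exporter.Prometheus.AspNetCore` package; it is prerelease at the time of writing and its API has moved around between versions, so check the version you install:

```csharp
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetryMetrics(metrics => metrics
    .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("MyServiceName"))
    .AddAspNetCoreInstrumentation()
    // Instead of pushing to the collector via OTLP, the app itself exposes the metrics.
    .AddPrometheusExporter());

var app = builder.Build();

// Exposes a /metrics endpoint that Prometheus can scrape directly.
app.UseOpenTelemetryPrometheusScrapingEndpoint();

app.Run();
```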
- Note: If you’re in the Distribt project, this package is added inside the `Distribt.Shared.Setup` project, so all our applications will have OpenTelemetry implemented by default.
With these packages, we can include tracing, metrics, and logs.
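Before looking at each one, this is roughly how the three extension methods defined in the next sections could be wired up in a service's `Program.cs` (a sketch; the method names are the ones we define below, the rest is standard minimal-hosting boilerplate):

```csharp
var builder = WebApplication.CreateBuilder(args);

// Extension methods defined in sections 5.2.1, 5.2.2, and 5.2.3.
builder.Services.AddTracing(builder.Configuration);
builder.Services.AddMetrics(builder.Configuration);
builder.Host.AddLogging(builder.Configuration);

var app = builder.Build();
app.Run();
```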
5.2.1 - Add Tracing to a .NET application with OpenTelemetry
We only need to use the `AddOpenTelemetryTracing()` method and provide the necessary configuration in the builder:
```csharp
public static void AddTracing(this IServiceCollection serviceCollection, IConfiguration configuration)
{
    serviceCollection.AddOpenTelemetryTracing(builder => builder
        .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService(configuration["AppName"]))
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(exporter =>
        {
            //TODO: call the discovery service to retrieve the correct URL dynamically
            exporter.Endpoint = new Uri("http://localhost:4317");
        }));
}
```
As you can see, we are providing a name for the service, which we will have configured in the `appsettings.json` file of each of our applications (an `"AppName"` entry, as read by `configuration["AppName"]` above).
5.2.2 - Add Metrics to a .NET application with OpenTelemetry
Similar to the previous case, we need to use the .AddOpenTelemetryMetrics()
method:
```csharp
public static void AddMetrics(this IServiceCollection serviceCollection, IConfiguration configuration)
{
    serviceCollection.AddOpenTelemetryMetrics(builder => builder
        // Configure the resource attribute `service.name` to MyServiceName
        .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("MyServiceName"))
        // Add metrics from the AspNetCore instrumentation library
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(exporter =>
        {
            //TODO: call the discovery service to retrieve the correct URL dynamically
            exporter.Endpoint = new Uri("http://localhost:4317");
        }));
}
```
5.2.3 - Add Logs to a .NET application with OpenTelemetry
Similar to the previous case, we need to use the .ConfigureLogging()
method:
```csharp
public static void AddLogging(this IHostBuilder builder, IConfiguration configuration)
{
    builder.ConfigureLogging(logging => logging
        // The next line is optional: it removes the other logging providers
        .ClearProviders()
        .AddOpenTelemetry(options =>
        {
            options.IncludeFormattedMessage = true;
            options.SetResourceBuilder(ResourceBuilder.CreateDefault().AddService(configuration["AppName"]));
            options.AddConsoleExporter();
        }));
}
```
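If we want these logs to reach the collector (whose `logs` pipeline we configured earlier) instead of only the console, the `OpenTelemetry.Exporter.OpenTelemetryProtocol` package also provides an `AddOtlpExporter` overload for the logging options. A minimal sketch of that variation (the `AddOtlpLogging` name is made up for the example):

```csharp
public static void AddOtlpLogging(this IHostBuilder builder, IConfiguration configuration)
{
    builder.ConfigureLogging(logging => logging
        .AddOpenTelemetry(options =>
        {
            options.IncludeFormattedMessage = true;
            options.SetResourceBuilder(ResourceBuilder.CreateDefault().AddService(configuration["AppName"]));
            // Send the logs to the collector over OTLP instead of printing them to the console.
            options.AddOtlpExporter(exporter => exporter.Endpoint = new Uri("http://localhost:4317"));
        }));
}
```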
In the Distribt solution this step is optional, since we already saw how to configure the application to use Graylog.
And that’s it! If we run the applications, we can see the result. The images you saw in this post are from the actual output.
6 - Adding observability to other parts of our infrastructure
We can add observability to many, or even all, parts of the infrastructure. If you use cloud services, they usually come prepared for observability, traces, etc., out of the box, with little or nothing to configure.
6.1 - Connecting RabbitMQ with Prometheus and Grafana
In another post about distributed systems, we covered what a service bus is—specifically, RabbitMQ. Now, we'll see how to add information to Prometheus/Grafana.
The first thing we need to do in our infrastructure is modify the service in the `docker-compose` file to mount a file called `enabled_plugins` through volumes:
```yaml
rabbitmq:
  image: rabbitmq:3.8.34-management-alpine # management version needed to be able to have a user interface
  container_name: rabbitmq
  ports:
    - 5672:5672
    - 15672:15672
  volumes:
    - ./tools/rabbitmq/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf
    - ./tools/rabbitmq/definitions.json:/etc/rabbitmq/definitions.json
    - ./tools/rabbitmq/enabled_plugins:/etc/rabbitmq/enabled_plugins
```
This file contains a list of plugins we will enable in RabbitMQ. In this case, the one we care about is `rabbitmq_prometheus`, but I've enabled a few more here:

```erlang
[rabbitmq_prometheus, rabbitmq_amqp1_0, rabbitmq_management, rabbitmq_web_dispatch, rabbitmq_management_agent, rabbitmq_stomp].
```
Finally, we update our `prometheus.yaml` file to add the new target to collect information from (the `rabbitmq_prometheus` plugin exposes its metrics on port 15692):
```yaml
scrape_configs:
  - job_name: 'collect-metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['opentelemetry-collector:8889']
      - targets: ['opentelemetry-collector:8888']
      - targets: ['rabbitmq:15692']
```
Now we can run `docker-compose up -d`.
Once it's running, there is a manual step to complete; remember that in production you'll only do this once.
We need to import the RabbitMQ dashboard from the official Grafana page into our Grafana instance, which works because Grafana has a large community that shares dashboards like this one.
When you import it, make sure to change the datasource to Prometheus, since that's the one with the relevant information.
And here is the final result; as we can see in the upper corner, it shows us the number of queues available in RabbitMQ.
If we run the applications and generate some events, we can see how the rest of the charts also change:
Conclusion
In this post, we've seen what OpenTelemetry is and how it relates to observability.
How to use OpenTelemetry with .NET and Prometheus.
How to use Prometheus with Grafana.
How to use OpenTelemetry with Zipkin to visualize traces.
How to add observability to other parts of our infrastructure.
If you run into any problem, you can add a comment below or contact me through the website's contact form.