19. Telemetry - Metrics with Grafana

So we're coming to the end.

Now, in this section, where we've been focusing on telemetry features, we started with Kiali, and I promise you really don't need to do anything special to make Kiali work. As long as you have Istio running, Kiali will work, and you don't need to do anything special to your code. That gives a good high-level view of exactly how your microservices, containers, or pods are connected together.

And the traffic lines between them are incredibly valuable. We will be going into some more detail on Kiali

when we come to traffic management later in the course. In fact, I'll constantly be using Kiali to show the results of any changes that we make. Very, very powerful.

And we've seen Jaeger distributed tracing: incredibly useful and powerful, with the potentially massive problem that, unfortunately, you do have to make changes to your code for this to work, or at least for it to work in a meaningful way.

So all of that is great.

But once you've identified that you have a problem with a particular service, let's say this staff service pod is performing really badly, then you can, in Kiali, go into the details of that service.

And you can look at the inbound metrics, which will give you graphs showing the request volume, request durations, and request and response sizes.

And you get the same data for the outbound metrics as well. It looks like, on this particular service, the staff service is the end of a chain.

And it's not making any outbound calls, by the look of it.

But there is another user interface that you can use.

And that's Grafana.

And you will find this by visiting your minikube IP address, followed by :31002.

Once again, that's just a NodePort that I've set up for you. In the chapter towards the end of the course on installing Istio,

I'll show you how I did that.

But it's really not very complicated at all.
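
Just to make that address concrete, here is a minimal sketch in Python. The function name and the example IP are only placeholders of mine, and I'm assuming plain HTTP; the one value taken from this course setup is the port, 31002.

```python
def grafana_url(minikube_ip: str, node_port: int = 31002) -> str:
    """Build the Grafana address from the minikube IP and the NodePort."""
    # Assumption: the NodePort serves plain HTTP, not HTTPS.
    return f"http://{minikube_ip}:{node_port}"


# 192.168.49.2 is only a placeholder; use the output of `minikube ip`.
print(grafana_url("192.168.49.2"))
```

Paste the printed address into your browser and you should land on the Grafana home page.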

Now, I'm not going to talk very long at all about Grafana.

Because I'm sure many of you will have already used Grafana.

Grafana is a very popular open source graphing framework that can show things such as charts and graphs.

And it presents everything in the form of lovely dashboards. You can integrate Grafana with a huge range of different applications.

It's not just for Kubernetes.

But many of you will be familiar with the fact that often, when we want to monitor a Kubernetes cluster, for example the health of the nodes in the cluster, it's very, very common to use Grafana connected to Prometheus. Prometheus, if you don't know it, is deployed into a cluster.

And it's responsible for scraping metrics.

So it will find out things like, for example, the CPU usage of all of the physical nodes in your cluster. Prometheus gathers that together, and you connect Grafana to it to give you some lovely charts.
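
If you're curious what "connecting Grafana to Prometheus" amounts to, it's essentially Grafana sending queries to Prometheus's HTTP API. Here is a minimal sketch in Python that only builds such a query URL. The address is hypothetical (9090 is Prometheus's default port), and `up` is just Prometheus's built-in "is this scrape target alive" metric, chosen as a harmless example.

```python
from urllib.parse import urlencode


def prometheus_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for Prometheus's HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical address; adjust it to wherever your Prometheus is exposed.
print(prometheus_query_url("http://localhost:9090", "up"))
```

Grafana does the same sort of thing behind every panel, with much longer queries, and then draws the results.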

So if you're not familiar with that, then I recommend you check out any good Kubernetes course, which might well have a section on Prometheus and Grafana.

If you have used it, then you might be thinking, well, why does Istio also ship with Grafana? The reason is that this is just a different set of monitoring tools. You've probably been using it for monitoring, as I say, the physical hardware infrastructure that you've deployed your Kubernetes cluster to. Istio is concerned with much lower-level details of the traffic moving between your pods.

And so they've produced a set of sort of standalone Grafana charts that give you information about the traffic between the pods.

And that's what we have here.

So even if you've used it before, you will find this a different set of charts.

It's fairly self-explanatory how to use this.

So I do invite you to have a play around for yourself with this and you'll soon get to grips with it.

But I will, just in case you've not used it before, go through the basics of Grafana.

And the main thing is you'll find a link top left here, the Home link, and if you drop that down, you will find a folder for Istio.

Now I've just clicked on there.

Don't be surprised if your minikube is struggling.

If you're kind of short of resources, you might get some slow responses here. I've noticed in rehearsing that it can be a bit sluggish.

But assuming it's working okay, you have what looks like a large collection of dashboards here. At the time of recording,

there are eight of them.

Actually, from your application's point of view, really only two of them are useful.

Those are the Istio service dashboard and the Istio workload dashboard.

And it's a bit like Kiali, in that they've got this distinction between services and workloads.

Let's start with the workload,

so the Istio workload dashboard.

And the first thing you want to do is to set the correct namespace, rather than leaving it on its default.

And you will find then a list of all of your workloads.

Now remember, that doesn't quite mean pod. It means a collection of pods that are grouped together for a particular service.

So for example, I could look at the staff service right here.

One minor change that they've made in the recent version of these dashboards is that they've compacted all of the information that appears on this chart into three subheadings.

And by default, these three subheadings have not been expanded.

So to get all the information, you'll need to click on each of these headers, to get information on the outbound services, the inbound workloads, and the general information.

And really, there's a lot of duplication across these user interfaces. You could probably have got this information from Kiali.

But it's down to your own taste, really. Many people prefer Grafana

for this level of detail: the charts are nicer, they refresh better, and it can show a wider range of data.

But it's showing things such as the incoming request volumes and durations, and the success rate, which is definitely useful:

how many requests returned non-5xx responses.

At present, this is healthy, at 100%.

So that's good.
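
To be clear about what that success rate means, here is a minimal sketch in Python: the share of requests whose response code is outside the 5xx range. The counts are made up, and the function is just my illustration, not how the dashboard is actually implemented. As far as I know, the dashboard derives the figure from Istio's `istio_requests_total` metric in Prometheus.

```python
def success_rate(requests_by_code: dict[int, int]) -> float:
    """Percentage of requests whose response code is not in the 5xx range."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 100.0  # assumption: treat "no traffic" as healthy
    ok = sum(
        count
        for code, count in requests_by_code.items()
        if not 500 <= code <= 599
    )
    return 100.0 * ok / total


# Made-up counts, purely for illustration.
print(success_rate({200: 950, 404: 30, 503: 20}))  # 98.0
```

Notice that the 404s still count as successes here. Only server-side errors, the 5xx codes, pull the figure down.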

I'll let you explore all of the little graphs in detail.

But some very important points about Grafana.

The top right is showing the period that we're monitoring here. By default, it's just the last five minutes.

If you want the last hour, for example, then you need to drop this down and select the last hour.

And the refresh period is set by default to 10 seconds. I did find, in rehearsing again, that I was running out of resources on minikube and everything started slowing down.

And I thought it was wise to extend that to maybe refresh every 30 seconds or every minute.

And you do that using this drop-down here.

So I'll set that to 30 seconds.
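
As an aside, Grafana also accepts the time range and refresh period as query parameters on the dashboard URL (`from`, `to`, and `refresh`), which is handy for bookmarks. Here is a minimal sketch in Python. The helper name is mine, and the dashboard path in the example is a placeholder rather than the real one, so copy the actual URL from your browser.

```python
from urllib.parse import urlencode


def with_time_range(dashboard_url: str, last: str = "5m", refresh: str = "10s") -> str:
    """Append Grafana's from/to/refresh query parameters to a dashboard URL."""
    params = {"from": f"now-{last}", "to": "now", "refresh": refresh}
    return f"{dashboard_url}?{urlencode(params)}"


# Placeholder address and dashboard path, for illustration only.
base = "http://192.168.49.2:31002/d/EXAMPLE/istio-workload-dashboard"
print(with_time_range(base, last="1h", refresh="30s"))
```

The defaults in the sketch mirror what I'm seeing on screen: the last five minutes, refreshing every 10 seconds.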

Now if you have a particular graph that you're interested in, such as the request duration, you can hover here, drop down, and select the View button.

And that will zoom into that particular graph.

And if you repeat that, then it will minimise it back onto the dashboard.

So this could be very useful and helpful.

You can see a big gap here, for example. I think that was just when I redeployed all of my pods to remove that automatic header propagation.

And then when the system came back, it looked like there was a kind of a period of thrashing as the new pods came in.

Let's zoom into a smaller region.

And you can do this on any of the graphs by clicking and dragging a particular area of interest.

And that's zoomed me in, and it will actually change the range up here.

There's more detail, basically, a little further down.

So I'd like to go into detail about this graph, incoming request duration by source. This will actually split the requests into separate lines, depending on where the requests came from.

Now, I've picked a bad one, actually, thinking about it, with the staff service, because, going back to Kiali, it looks like all the requests to the staff service come from a single source, this position tracker.

But the position tracker looks more interesting, because it's getting requests from the position simulator workload and the API gateway workload. Whatever they are, you don't need to worry about what they are.

But we can see there are two routes of traffic coming into position tracker.

So we can verify that if we switch to position tracker.

And we've got very similar data and very similar information here.

But just to show you this incoming request duration by source, I'll zoom in on that.

This is potentially a bit more interesting, but it's also quite cluttered.

Just to get rid of this peak, I think I'll zoom into this middle section here.

Yeah, it's definitely cluttered, this, but the reason it's cluttered is we've got one, two, three, four, five, six, seven, eight different lines.

And the reason there are so many is that it's showing me the 50th, 90th, 95th, and 99th percentiles.

I'm not entirely sure why they do the 95th here, but they do. So we've got four different percentiles, but it's also split up by where the requests came from.

So we've got the 95th percentile, for example, of the traffic coming in from the API gateway, and if I click on this key here, it will just show me that.

But I've also got the 95th percentile for the position simulator, which is right here.

So two different incoming routes.

If I wanted to compare these two 95th percentiles, I could hold down a modifier key. I'm on Windows, so I'm holding down the Ctrl key here; I guess it would be the Command key on Mac.

If I click that, then it's allowing me to compare those two metrics, which could be quite useful if you know there's something in particular not performing.
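
If you want to see why that graph ends up with eight lines, here is a minimal sketch in Python: four percentiles for each of two sources. The sample durations are made up, and I'm using a simple nearest-rank percentile on raw values. That is a simplification of mine, since Grafana works from Prometheus histogram data rather than individual request timings.

```python
from collections import defaultdict

PERCENTILES = (50, 90, 95, 99)


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, -(-p * len(sorted_values) // 100))  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def durations_by_source(samples: list[tuple[str, float]]) -> dict[tuple[str, int], float]:
    """Map (source, percentile) to a duration: one entry per line on the graph."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for source, duration_ms in samples:
        grouped[source].append(duration_ms)
    return {
        (source, p): percentile(sorted(values), p)
        for source, values in grouped.items()
        for p in PERCENTILES
    }


# Made-up request durations in milliseconds for the two sources.
samples = [("api-gateway", float(i)) for i in range(1, 101)]
samples += [("position-simulator", float(2 * i)) for i in range(1, 101)]

lines = durations_by_source(samples)
print(len(lines))  # 8: two sources times four percentiles
print(lines[("api-gateway", 95)], lines[("position-simulator", 95)])
```

Comparing the two 95th percentile entries is the same comparison I'm making by selecting those two keys in the legend.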

So that's pretty much it, really. Again, going to the Istio folder here, there's a corresponding service dashboard, and you need to select the service you're interested in. We'll just look at one of our services; let's go for, I don't know, the Fleetman staff service. It's pretty much duplicated information, really; you could get all of this from the workloads view as well.

So that's really it for those two dashboards.

And I would suggest you're probably going to be using these dashboards if you want to drill into a particular area of concern,

maybe after you've used Kiali,

maybe after you've used Jaeger. The remaining dashboards are more concerned with monitoring Istio itself.

For example, there is at the time of recording a dashboard for the control plane.

Now, on earlier versions of this course, this particular dashboard was actually spread across, I don't know, four or five different dashboards, which reflected the architecture of Istio at the time.

But now that the architecture of the control plane is simpler, so too is this dashboard. This isn't really a dashboard that I need to look at too often; it's pretty low level, to be honest.

However, having said that, if you have any kind of performance problems with Istio, let's say it's taking up too much memory or CPU on your cluster, I think you might find the Istio performance dashboard to be useful.

Now, this is giving the kind of information that you'd actually get from a standard monitoring package.

So it's the kind of things you're going to be familiar with: memory usage, CPU usage, disk usage, and so on.

So you could get that from a standard monitoring package.

But if you do have problems with Istio's performance, you might find this useful.

Again, it's not something that I've particularly used extensively.

Certainly in the more recent versions, Istio just performs pretty well.

However, there is one new dashboard, I don't know at what point this was introduced.

But I do find this one particularly useful.

It's called the Istio mesh dashboard.

And if we look at that one, it's a very simple dashboard.

But it gives you a very good top level view of what's happening across your entire architecture.

Just at a glance, I can see that in this system right now, the global success rate,

in other words, the share of non-5xx responses across all of the requests that are pinging around my system, is currently running at 100%.

So that is a very useful metric.

It's also going to give you information about the virtual services and destination rules that you have set up.

We don't have any yet, purely because we haven't got there yet; that's coming up in the next section of the course, on traffic management.

In fact, most of these "not applicable" entries on rows two and three are the kinds of things we're going to be adding as we go through the course.

I think this is a very useful top level dashboard to give you a very quick and very rough overview of the health of your entire system.

And that really completes all of the dashboards. There is one further one, called the Wasm extension dashboard.

Now, this is a method of extending Istio, which is a very advanced feature.

It's not something I'm currently planning to cover on this course. It might be something I add in the future, or, more likely, it would be something that would fit well on an advanced Istio course. For me, the day-to-day dashboards are the service dashboard and the workload dashboard.

And as I say, you might also find the very high-level Istio mesh dashboard useful as well.

That's it for this section of the course. I hope you'll find these telemetry features to be really useful for any project.

But in the next section of the course, we're going to start looking at the features of Istio that allow us to influence the behaviour of our system.

We're going to start by looking at how to manage the traffic flow through your system in the section called traffic routing.

I'll see you for that.
