15. Telemetry - Distributed Tracing Overview

And now for the second of the major telemetry tools in Istio. Kiali is great, but it gives a very high-level picture; we can't actually look at the details of individual requests.

So now we're going to bring in Jaeger, which is a distributed tracing framework that is integrated into Istio. Using Istio's telemetry data, it allows us to track a single request that we send into the system.

And we can get a full graph of the chain of events that the request caused, along with detailed timing data and detailed information about exactly where that request went.

So this course isn't specifically about distributed tracing; I could definitely do a full course on tracing.

There's quite a lot of movement on this in the industry at the moment.

There's been quite a lot of competition between different tracing frameworks, and there is an effort in the industry to put together a single unified API for tracing called OpenTracing.

And you can find that website here.

And the goal of that is to build a vendor-neutral API for different tracing frameworks.

And you can just see this scrolling area here where they're showing different tracing frameworks.

I haven't heard of most of these; SkyWalking and inspectIT I've certainly never seen before, but one of them is Jaeger.

And Jaeger was a framework that was originally built by Uber.

If we just wait long enough, there's another one here called Zipkin, which is a bit older than Jaeger, and was originally developed by Twitter, I think.

So the reason I've picked those two out in particular is they are supported by Istio, and support is already built in.

I've only used Jaeger.

So that's the one I'm going to lead with on this course. You'll find the details of Zipkin are pretty straightforward.

So what's distributed tracing all about? Well, the idea is, if you have a system with multiple components collaborating together, as of course we do on a Kubernetes system, then it's very beneficial to be able to take a particular request that might be handled by several of those software components, all being called in a big, long chain.

And it would be very beneficial to see exactly which components took part in that chain, and how long each of the steps took.

So that's distributed tracing in general.

And you don't have to have Kubernetes or Istio to do this. You can imagine that you have a system where you have, say, a JavaScript front end calling a Java back end. Well, you could go ahead and download one of these tracing frameworks.

And what you typically do is you get a client library for your particular language.

So we could get a client library for the JavaScript and a client library for the Java.

And you would then build those libraries into your applications, which is absolutely fine.

But what we're looking for with a service mesh is for all of this to happen down at the proxy level.

And you're probably not surprised to find out that that's exactly what we're going to get with Istio. We don't need to use these client libraries at all.

That's because Istio has already built a distributed tracing framework into the underlying infrastructure.

There is one catch with this.

And this is one area in the course where we are potentially going to have to make changes to our application code.

But I don't want to get bogged down in that right now.

So I'm going to hide that detail.

And we'll cover that on the last video of this section of the course.

And the video will be called something like why you need to propagate headers.

So there is that catch, but the advantage of using Istio is we're not going to need to use these client libraries.

So we have distributed tracing built into Istio, and there is support for both Zipkin and Jaeger.

And both of them come with front ends.

So I'm here on the Jaeger front end and you can do the same.

Whatever your minikube IP address is, I have set up port 31001.

Just visit that IP address on that port and you should land on the Jaeger front end.

Now of course, I'm assuming that you have the sample system up and running. If not, go back to the previous section where we were looking at Kiali.

And there are steps there on how to get this system up and running.

I've left my system running for a while now.

So I think we'll find some results if we go into Jaeger. What you can do is do a search for so-called traces.

Now I need to explain what all of that means.

But let's just click around first of all. You can start with the service drop-down here; just select any of your services. I suggest possibly the web app service.

And then you can click Find Traces.

So we have effectively here a search engine.

And you should be getting back some kind of results, I happen to have 13 traces.

So what we need to do then is understand what a trace is.

And there's also some corresponding jargon called a span.

And to demonstrate all of that, I'm going to send a request into this system.

Now, if I click on any of these vehicles on the left-hand side, that is going to send an HTTP request to the server side of this system.

And I would like to know exactly what the results of that request were. What hops did it make, what microservices did it call, and how did those microservices perform? So I'm working with some insider knowledge here, in that I implemented this system myself.

So I know that that particular use case of getting the information for a vehicle is implemented via three microservices.

And the initial request goes into a microservice called the web app pod.

It doesn't do all of the work by itself. I can't remember exactly what it does.

But it will eventually, at some point, be making a call to this other pod called the API gateway pod.

And in the terminology, we call that an upstream pod.

So with these arrows going in this direction, we're talking about upstream requests.

Again, we don't need to care about what the API gateway is actually doing.

But again, in order to fulfil the request, it's going to send a further request, again upstream, to another pod called the staff service. There'll be some logic in there, and then these arrows going in the opposite direction, downstream, are showing the responses, or you can think of it as the data flow being returned.

So that's going to return, I assume, some data back to here, and this pod might then do some further processing. It then returns back to the original pod, which again might itself do some further processing before the results are returned back to the front end. I'm sure all of that is very natural and normal to you.

What I'm leading towards, though, is showing you how Jaeger will represent this in the user interface.

Now I want you to imagine that we've put a stopwatch on the entire process.

So the original request came in, we started a stopwatch, and then this entire flow has happened with logic in three pods.

And we've had two requests and two responses.

And let's say the overall timing for that was, I don't know, 10 seconds.

So I'm showing here on the bottom half of the caption, a timing diagram.

So we want to know really, how was that 10 seconds split up between the three microservices? If we just forget about all of this for the moment, I'll just grey that side of the picture out.

After all, the client, the JavaScript front end, can only see the web app pod.

So, you know, they don't really care about the back-end stuff.

To them, they made a request, and they got a response back in 10 seconds.

But we want to break that down.

So when the request was made, presumably the web app did some processing, did some thinking.

And, I mean, it wouldn't really be these sort of figures.

But let's say that thinking time was a second.

And then after that thinking time, it sent the request out to the API gateway, then all of this stuff happened.

But then when control returned back to the web app, maybe the web app did some further processing before finally delivering the results.

Let's say that was a second as well.

So I can represent that now on my timing diagram.

And this is how things kind of looked in the web app.

If we ignore everything else, there was one second of thinking time, and then it made a request and then had to wait eight seconds, then it got the response.

And then it did one second's further processing time.

It's not obvious from this timeline exactly which bits were the web app's processing, and which were the upstream services' processing.

So I'll break this down further into lanes.

Now, what I'm heading towards here is the ultimate structure that you're going to see in the Jaeger UI.

This is saying, then, that the timing was: the web app started, did one second of processing, and then passed control up to the upstream pods.

That's this request here: the upstream pods did eight seconds of work.

We don't know yet how that breaks down.

And then control returned back to the web app, which did its one second's worth of processing.

If you understand that, then you've really got enough to understand Jaeger. You can probably guess that we would also break this eight seconds down as well.

If we look at the API gateway pod, when it receives its request, again, it might do some processing. I'll keep things simple, and say that's going to be a second's worth of processing. It then sends out the request, waits for the response, and then maybe does another second's worth of processing before returning back.

So now we've got each microservice appearing in its own lane, and we can see at a glance how much work each microservice was doing at each step of the process.

And it looks like, then, assuming the staff service is the last in the chain, that it was taking six seconds to implement its logic.
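
If it helps to see that arithmetic written down, here is a minimal Python sketch of the lane breakdown. The service names and the start and end times are just the made-up figures from the stopwatch example, and the sketch assumes each pod makes its upstream calls one after another, so the time spent waiting can simply be subtracted.

```python
# Hypothetical timings (in seconds) taken from the stopwatch example.
# Each entry: lane name -> (start, end, name of the calling lane or None).
spans = {
    "webapp": (0, 10, None),
    "api-gateway": (1, 9, "webapp"),
    "staff-service": (2, 8, "api-gateway"),
}


def own_processing_time(spans):
    """Return the seconds each service spent working rather than waiting."""
    own = {}
    for name, (start, end, _parent) in spans.items():
        # Time spent waiting on the upstream services this one called directly.
        waiting = sum(
            child_end - child_start
            for child_start, child_end, child_parent in spans.values()
            if child_parent == name
        )
        own[name] = (end - start) - waiting
    return own


for service, seconds in own_processing_time(spans).items():
    print(f"{service}: {seconds}s of its own processing")
```

With these figures the web app and the API gateway each account for two seconds of their own work (one second before the upstream call and one second after), and the staff service accounts for the remaining six.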

So probably something quite questionable going on there.

That's it really. If you understand that description, this is pretty much how the charts are going to be shown in Jaeger. There is just one slight difference.

And that is, if we were looking at this chart in Jaeger, we would actually see the results like this: the initial service would have a timing bar which encompasses the entire process.

And then what you're doing with each further lane downwards is you're drilling down into the detail.

So don't think that this means that the overall process took 10 plus 8 plus 6, which is 24 seconds.

It's not. This is saying the overall process was 10 seconds, of which eight seconds came from the API gateway and six seconds came from the staff service.
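
To make the nesting concrete, here is a rough Python sketch that draws the three bars as text, one character per second. The lane names and timings are the hypothetical figures from the example, and the drawing is only an approximation of what a tracing UI shows, not Jaeger's actual rendering.

```python
# Hypothetical lanes for the example: (lane label, start second, end second).
lanes = [
    ("webapp", 0, 10),
    ("api-gateway", 1, 9),
    ("staff-service", 2, 8),
]


def render_timeline(lanes):
    """Draw one row per lane, using one character per second."""
    total = max(end for _, _, end in lanes)
    rows = []
    for name, start, end in lanes:
        bar = " " * start + "#" * (end - start) + " " * (total - end)
        rows.append(f"{name:<14}|{bar}| {end - start}s")
    return rows


for row in render_timeline(lanes):
    print(row)

# The overall time is the length of the top bar, not the sum of all the bars.
overall = lanes[0][2] - lanes[0][1]
print(f"overall: {overall}s")
```

Each lower bar sits inside the one above it, which is why the bars must not be added together: the eight seconds and the six seconds are already counted within the ten.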

Obviously, this is quite simplified.

It might have been that, after receiving the response from the staff service, which is sent back to the web app, the web app then sent a further request outwards.

And you'd just see that as a continuation along the timeline.

This entire graph, which is tracking the progress of an entire request as it goes through several microservices in the system, is called a trace.

Each individual timing element here is called a span.

So in this example, we have a single trace, and it can be broken down into three spans.
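
As a way of picturing the two terms, here is a small Python sketch of how spans might be collected into traces. The field names and IDs are hypothetical and chosen only for illustration; the one thing it relies on is the general idea that every span produced by the same request carries the same trace ID, and that a span can point at the span that caused it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span from the same request
    span_id: str
    parent_id: Optional[str]  # None for the first span in the trace
    service: str
    duration_s: int


def group_into_traces(spans):
    """Collect spans by trace ID: one trace per original request."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


# Hypothetical IDs and durations, following the three-pod example.
reported = [
    Span("t1", "a", None, "webapp", 10),
    Span("t1", "b", "a", "api-gateway", 8),
    Span("t1", "c", "b", "staff-service", 6),
]

traces = group_into_traces(reported)
for trace_id, trace_spans in traces.items():
    print(f"trace {trace_id}: {len(trace_spans)} spans")
```

Searching for traces in the front end is, loosely speaking, asking for these groups: one entry per request, each of which opens up into its spans.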

Now, that is pretty much what you're going to see over in Jaeger when we try this for real. There is a slight complexity in that, of course, all of this is happening via proxies.

And every single one of our pods has a proxy inside it.

So I've redrawn the diagram with the proxies in place.

And this time, the arrows are just representing the requests; I'm not showing the responses.

Although in theory we're going to have three spans, because there were three requests in the chain, we're actually going to end up with more than that in the final results.

And that's because it will track the requests from the containers to the sidecars.

We'll be going over to Jaeger in a few moments.

And I'm just saying this so you're not surprised that you often see doubling up of spans.

And that's because when you make a request from one container to another, it is going through the proxy.

So usually you see this doubling up of spans except for the very last pod in a chain.
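
Here is a rough Python sketch of that doubling, under a deliberately simplified assumption: each pod's proxy reports one span for the request arriving at the pod, and one more for any request the pod sends upstream. The pod names and span labels are hypothetical, and the spans in a real trace depend on how the mesh is set up.

```python
def expected_spans(chain):
    """List the spans for a chain of pods under the simplified model above."""
    spans = []
    for index, pod in enumerate(chain):
        # The pod's proxy sees the request arriving.
        spans.append(f"{pod} (request arriving)")
        if index < len(chain) - 1:
            # The pod's proxy also sees the request it sends upstream.
            spans.append(f"{pod} -> {chain[index + 1]} (request leaving)")
    return spans


chain = ["webapp", "api-gateway", "staff-service"]
for label in expected_spans(chain):
    print(label)
print(f"{len(expected_spans(chain))} spans for {len(chain)} pods")
```

Under this model the last pod in the chain contributes only one span, because it never calls upstream, so three pods give five spans rather than three.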

I'm just saying this so that when we look at a real Jaeger trace in a few moments, you'll understand why there's quite a lot of extra spans that you might not have been expecting.

