08. Hands on Demo - Finding Performance Problems

So welcome back.

And we're in the middle of solving this rather strange problem with the system.

And we're really struggling because we don't know much about this system.

But Kiali has given us this visualisation of the system.

And it's using the data that Istio is gathering via the proxies.

So this is really wonderful: already we have a very strong clue as to what the problem is. It's something between this service and what we now know is an external service called Fleetman driver monitoring.

So it would be good to look in a bit more detail at what's going on in these requests.

And you've seen that we have a traffic graph here, which is showing us that, over time, it looks like about half of the requests are failing and half are succeeding.

Which is useful, I suppose: it's saying there's something wrong, but it's not really explaining why the system is running so slowly. Why is everything seizing up? And I guess I could click on the response codes here.

Yeah.

Okay.

Some 503s, some 504s.

I guess in some circumstances that could be useful, but not really here.

What I'm feeling is I'd like to know how well these requests are performing. We know some are working, but they might be running really slowly.

I've got to be careful here; I don't want to show you too much in this, as this is just a warm-up.

But I did notice that if you go to the drop-down here, I can drop this down.

And it allows me to change what's being displayed on the lines.

I can, for example, select requests per second.

And it shows me how many requests are going along these lines per second.

Okay, that might be useful in some circumstances.

And there's also a requests percentage, which shows really how the traffic is splitting up.

So if the traffic was coming down here, a third of it is going that way.

Two thirds is going this way.

Hmm, maybe not useful in this case.

We'll look at that in detail later.

But there is also an option for response time.

Now I'm discovering that when I select that, I'm not getting any information on the graph.

And to be honest, I'm not sure if this is just a bug in Kiali.

So that would have been useful, but it does give me a chance to show you another UI that we get as part of Istio.

And you will find this UI once again, at the same IP address that you've been using so far.

That's the minikube IP address, this time followed by :31001.

Now, that's a number that I've made up, don't expect to find that on your projects.

And once again, when we install Istio from scratch, I will show you how I set that up.

But if you go there, it takes us to this what looks like a search page.

This UI is called Jaeger.

And this is tracing.

This allows us to look at what's happening to requests as they go through the entire system.

We can look at things like timing: how long did a particular request take? Now, this is so powerful, I'm really looking forward to exploring this in detail later in the course.

For now, we know that there's something wrong from the staff service, as the staff service is making outward requests.

So if you select the staff service from the drop-down list and then click on Find Traces, I'm not going to go into any more detail other than showing you that up top here there is a histogram.

And it's showing the response times of the requests over a certain time period.

And the thing I really want to show you is that here on the left-hand side, the scale of the graph is going up to 30 seconds.

So it actually looks like these dots here are showing that some requests, in fact most requests, are taking of the order of 30 seconds to complete: incredibly slow HTTP requests somewhere.

Okay, this one here took about 10 seconds.

And there are some that are coming back far quicker.

Now we can, and we will later in the course, look in detail at what all these entries are and how we can drill into this and find further information.

But I think for now, that's probably telling me enough to say that if we go back to the Kiali console, it looks like most of these requests that are going across here, even if they're successful, are taking so long that presumably this staff service is seizing up, and that's causing the knock-on effect of the entire system running slowly.

I think what we have here is an example of a cascading failure.

That's where one small problem, maybe here, is causing a knock-on effect, potentially across the entire system.

Now, I do have a chapter coming up on this course where we'll be addressing what to do to prevent cascading failures.

Now, the right solution here is to install something called a circuit breaker.

And it really is recommended that all microservice architectures should have circuit breakers in place on every single service.

So that's coming up.

But for now, it may well be the case that we've discovered this problem late at night, we don't have a coder to hand to fix the problem.

And we need to get things moving as quickly as possible.

Now, it might be that this external call is just some nice to have feature that's not particularly serious.

And it would be absolutely fine for us to take this out of service. If we take this out of service, the rest of the system should start performing correctly.

And you know, we might be missing a feature, whatever this driver monitoring is doing will not work anymore.

But then it doesn't look like it's working anyway.

As you probably know, in a microservice architecture, it should be absolutely viable that any microservice can be taken out of action, and the rest of the system can still continue working, albeit possibly in a degraded form.

So given this is an emergency, I'm suggesting that we somehow take this service out of action.

How can we do that? Remember, this driver monitoring is just an external call; it's not a pod that we're in control of. If I do kubectl get po, there is no pod called driver monitoring.

So I can't just kind of switch a pod off.

What would be really good would be if I could kind of intervene here.

Even if this staff service has got a line of code in it that says call driver monitoring, it would be good if we could configure things somehow just to say, okay, we're going to drop that request, we're not going to do anything with it.

Well, you can probably guess that we can do exactly that, when we're working with a service mesh.

Let's go back to our original pictures.

And here's a microservice pod.

So you can think of this as the staff service and the staff service is making a call to this external service in the outside world.

It's not calling another pod, it's calling a URL outside the cluster.

Well, that's not a problem, because exactly the same thing applies that I've been talking about up until now, when that container makes the call, even though it's an external call, it's still going to be routed or directed through the proxy.

In other words, through the service mesh, and we're in control of this proxy; we can configure it.

So we could provide some YAML configuration to this proxy to say, actually, do you know what, when we're calling that external service, let's just drop it, let's just return an error. We could return an HTTP 500 immediately back to the container.

We could even return an HTTP 200, possibly.

Or we could say, Okay, we'll let the proxy make the call.

But if we don't get a response within a second, say, which should be the normal response time, then we'll terminate the connection and immediately return.

So we've got lots of options.
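To make the first of those options a little more concrete, here is a rough sketch of how "just return an HTTP 500 immediately" might be written using Istio's fault injection on a virtual service. This is not the configuration used in the demo: the resource name and the host `legacy.example.com` are placeholders made up for illustration, and it assumes the external host is already known to the mesh.

```yaml
# Sketch only: the name and host below are hypothetical placeholders,
# not the values from the course's YAML file.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: driver-monitoring-abort-example   # hypothetical name
spec:
  hosts:
    - legacy.example.com                   # placeholder for the external host
  http:
    - fault:
        abort:
          httpStatus: 500                  # the proxy answers with a 500 itself
          percentage:
            value: 100                     # apply to every request
      route:
        - destination:
            host: legacy.example.com
```

With something like this in place, the proxy would answer on behalf of the external service, so the calling container gets its error back straight away instead of waiting.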

I think for this demo, it sounds like adding in a timeout might be a good idea.

How do we do that? Well, I'm just going to show you how that would work.

If you open up the YAML file from before, which is where we've got our full application stack, and go down to around line 207, you'll see this is the definition of that external service called Fleetman driver monitoring.

This is an Istio-specific concept, quite a complicated one, that we'll be exploring in detail.

It's really just a YAML block that's describing the fact that we're calling this external service.

And for the first time now we can actually see the name of that service.

It's this public URL here.

Now, of course, for the demo, this is just an external URL endpoint that I've set up that we can all use.

It's supposed to be representing a legacy system.

Virtual services are used for much more than that, but for now that will do; it's one of the key things about a virtual service.

I think for the first time on this course we're on the Istio reference manual here.

I'm only showing this just to say that, yeah, we can do request timeouts.

And if I go further down, I've got to pick through this, it does show us that on this virtual service YAML, we can add in a timeout field, which is in seconds.

So we want to do exactly the same. The difficult thing is getting the indenting right.

So it needs to be in line with this entry here for route.

So comparing with what we've got, it looks like we put it in line with route here.

So I suggest we add a new line on line 280.

And we can configure that we want a timeout.

And I'm going to set that to be one second.
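As a rough sketch of what that edit might look like, assuming a virtual service shaped like the one in the Istio reference manual: the resource name and the host `legacy.example.com` are placeholders, not the real values from the course file. The point to notice is only the indentation, with `timeout` as a sibling of `route`.

```yaml
# Sketch only: the name and host below are hypothetical placeholders,
# not the values from the course's YAML file.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: driver-monitoring-example   # hypothetical name
spec:
  hosts:
    - legacy.example.com            # placeholder for the external host
  http:
    - route:
        - destination:
            host: legacy.example.com
      timeout: 1s                   # same indent level as `route`, not nested inside it
```

If the timeout is indented one level too deep, it ends up under the destination rather than on the HTTP route, which is why the indenting is the fiddly part.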

So we're making a major change here without having to touch any of the containers that we've deployed into this service mesh.

And I'm hoping that if we apply that change, we should see an improvement in the performance of the system.

So let's do kubectl apply on that YAML file.

And there should be one line reporting that a virtual service has been configured.

And it is a little bit hard to tell.

But it does seem now that reports are coming in more frequently.

I think we can illustrate this a bit better by doing a full hard refresh.

And if you remember, previously it was taking a long time for this left-hand panel to populate.

And now that panel is populating within about five seconds, which is actually normal.

Now, we can't be 100% sure that this is working okay.

But if I go back into Kiali, and maybe restrict this to the last minute, we see a difference now.

Now this might not look good, the fact we've got a big red line running across there.

But this is actually showing that our change works.

Remember, previously about half were working and half were not working.

But they were all taking too long.

Whereas now 100% of them are failing.

And my suspicion is they're failing because we're hitting that timeout.

In fact, if we look at the response code, HTTP 504 is indeed a gateway timeout. We can probably get better information if we go back into the Jaeger UI on port 31001 and repeat that search.

And wow, yeah, there is a massive difference.

Now if we look at the histogram here, all of these requests are now around a second.

Some have gone a little bit over a second, that's something like 1.1 seconds or something.

But they're all certainly nothing like the 30 seconds that we were seeing before.

And it does look now like the vehicles are moving a lot more smoothly.

And I'm noticing that the speeds are coming through a lot higher as well.

There's a lot more that we could do here other than a timeout.

But I think this demo has gone on long enough now.

But my aim in this demo, and I hope it succeeded, is to give you a flavour for what Istio can give us.

There's lots more to come.

But without Istio this problem would have been very, very difficult to solve.

By using Kiali, which is gathering data that's being provided by the Istio proxies, together with a little bit of this Jaeger tracing, we were able to find an underlying problem that would have been very difficult to find otherwise.

And we were also able to put in, okay, possibly a bodge to just time this out.

Clearly there's an underlying problem that should be fixed.

This driver monitoring is clearly in need of replacement.

Given this was an emergency, it has at least allowed us to provide a temporary solution to that emergency.

My intention here was to give you a flavour and a taste for what Istio can do for you.

But there are a lot more features in Istio, so for the rest of this course we're going to go into a lot more detail on the rest of Istio's features.

So have a good break and I'll see you in the next video.
