When Scheduled API Calls Fail Inside an EKS Pod

Overview

Nothing went wrong during development, but after the server was deployed to real EKS pods, the scheduled jobs stopped running correctly. The pod logs in Argo CD showed the error below: Fetch Error: TypeError: fetch failed, with cause ConnectTimeoutError: Connect Timeout Error.

[In Argo CD: Applications > api > click a created pod to view its logs]

[Nest] 1  - 08/15/2025, 2:15:00 AM     LOG [DistributedLockService] Lock acquired: cron:lock:trackingChannels by fyc-api-77c8c57977-r6g48-1-1755224100002
[trackingChannels] Starting execution with lock: cron:lock:trackingChannels
[updateChannelPool] Lock acquisition failed, skipping execution
Fetch Error: TypeError: fetch failed
    at node:internal/deps/undici/undici:12618:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ChzzkChannelRepository.findById (/app/node_modules/chzzk-z/dist/lib/chzzk/apis/channel.repository.js:15:16)
    at async ChzzkChannel.findById (/app/node_modules/chzzk-z/dist/lib/chzzk/channel.js:15:16)
    at async ChzzkRepository.getChannelById (/app/dist/src/chzzk/chzzk.repository.js:22:25)
    at async BatchService.trackingChannel (/app/dist/src/batch/batch.service.js:54:30)
    at async BatchService.trackingChannels (/app/dist/src/batch/batch.service.js:49:13)
    at async BatchService.value (/app/dist/src/common/decorators/batch-only.decorator.js:21:40)
    at async CronJob.<anonymous> (/app/node_modules/@nestjs/schedule/dist/schedule.explorer.js:96:17) {
  cause: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (node:internal/deps/undici/undici:7760:28)
      at node:internal/deps/undici/undici:7716:50
      at Immediate._onImmediate (node:internal/deps/undici/undici:7748:13)
      at process.processImmediate (node:internal/timers:478:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
  }
}
[trackingChannels] Execution failed: TypeError: Cannot read properties of undefined (reading 'channelName')
    at BatchService.isUpdateChannel (/app/dist/src/batch/batch.service.js:74:50)
    at BatchService.trackingChannel (/app/dist/src/batch/batch.service.js:55:31)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async BatchService.trackingChannels (/app/dist/src/batch/batch.service.js:49:13)
    at async BatchService.value (/app/dist/src/common/decorators/batch-only.decorator.js:21:40)
    at async CronJob.<anonymous> (/app/node_modules/@nestjs/schedule/dist/schedule.explorer.js:96:17)
[Nest] 1  - 08/15/2025, 2:15:11 AM   ERROR [Scheduler] TypeError: Cannot read properties of undefined (reading 'channelName')
[Nest] 1  - 08/15/2025, 2:15:11 AM     LOG [DistributedLockService] Lock released: cron:lock:trackingChannels

A network error?

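The error indicates that the application cannot connect to the API servers

https://chzzk.naver.com

https://openapi.chzzk.naver.com

so the next step is to check whether these hosts can be reached directly from inside the instance, or whether only the application fails to reach them.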

Connecting to the bastion server

(base) dominic@Dominic-MacBook-Pro .aws % aws ssm start-session --target i-00166642031aa6b50

Starting session with SessionId: dumin-gfh832piuvbxbxdfxazi6ff9ny
sh-5.2$ source ~/.bashrc
ssm-user@ip-10-0-2-173 /usr/bin$ 

Checking the pods

kubectl get po -n chat-ingestor
---------------------------------------------------------------------------------
NAME                                 READY   STATUS    RESTARTS        AGE
fyc-chat-ingestor-5798879d96-6ctqv   1/1     Running   0               17h
fyc-chat-ingestor-5798879d96-p54qj   1/1     Running   0               17h
fyc-chat-ingestor-5798879d96-vnrcm   1/1     Running   0               17h
fyc-chat-ingestor-5798879d96-zz5ds   1/1     Running   0               17h
test-csi-pod                         1/1     Running   162 (10m ago)   6d18h

Connecting to a pod

kubectl -n chat-ingestor exec -it fyc-chat-ingestor-5798879d96-6ctqv -- sh

Ping test

/app/apps/ws-chat-ingestor # ping chzzk.naver.com
PING chzzk.naver.com (110.93.151.136): 56 data bytes
^C
--- chzzk.naver.com ping statistics ---
8 packets transmitted, 0 packets received, 100% packet loss
/app/apps/ws-chat-ingestor # ping openapi.chzzk.naver.com
PING openapi.chzzk.naver.com (110.93.151.166): 56 data bytes

Naver appears to block ping (ICMP) by policy, so the 100% packet loss here is unrelated to the actual problem.
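Because ping is blocked, an HTTPS request is a more meaningful reachability check than ICMP. The following is a minimal sketch, not the project's actual code: `probe`, `describeError`, `FetchLike`, and the 5-second default timeout are hypothetical names and values chosen for illustration. It assumes a Node.js runtime with a global `fetch`, which the undici frames in the stack trace suggest.

```typescript
// Minimal fetch-like signature so a fake can be injected in tests (hypothetical helper type).
type FetchLike = (
  url: string,
  init?: { method?: string; signal?: AbortSignal },
) => Promise<{ status: number }>;

interface ProbeResult {
  url: string;
  ok: boolean;
  detail: string; // HTTP status on success, error code or message on failure
}

// Returns the low-level code (e.g. UND_ERR_CONNECT_TIMEOUT) from the error's cause, if present.
function describeError(err: unknown): string {
  const cause = (err as { cause?: { code?: string } } | null)?.cause;
  if (cause?.code) return cause.code;
  return err instanceof Error ? err.message : String(err);
}

// Sends one GET request and reports whether any HTTP response came back.
async function probe(
  url: string,
  timeoutMs = 5000, // arbitrary default, an assumption
  fetchImpl: FetchLike = fetch,
): Promise<ProbeResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { method: "GET", signal: controller.signal });
    // Any HTTP status, even 4xx, shows that the TCP and TLS connection succeeded.
    return { url, ok: true, detail: `HTTP ${res.status}` };
  } catch (err) {
    return { url, ok: false, detail: describeError(err) };
  } finally {
    clearTimeout(timer);
  }
}
```

Running `probe("https://openapi.chzzk.naver.com")` from a script inside the pod and again from the bastion host would show whether the timeout is specific to the pod network or also reproduces elsewhere.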

Some instances did run the schedule successfully?

If the job has succeeded even once, the problem is unlikely to be at the network level.

While looking into this, I came across a different error log:

ECONNRESET?

Fetch Error: TypeError: fetch failed
    at node:internal/deps/undici/undici:13510:13
    at ChzzkChannelRepository.findById (/Users/dominic/Playground/fyc-api/node_modules/chzzk-z/lib/chzzk/apis/channel.repository.ts:24:12)
    at ChzzkChannel.findById (/Users/dominic/Playground/fyc-api/node_modules/chzzk-z/lib/chzzk/channel.ts:20:12)
    at ChzzkRepository.getChannelById (/Users/dominic/Playground/fyc-api/src/chzzk/chzzk.repository.ts:16:21)
    at BatchService.trackingChannel (/Users/dominic/Playground/fyc-api/src/batch/batch.service.ts:65:26)
    at BatchService.trackingChannels (/Users/dominic/Playground/fyc-api/src/batch/batch.service.ts:59:7)
    at BatchService.value (/Users/dominic/Playground/fyc-api/src/common/decorators/batch-only.decorator.ts:38:28)
    at CronJob.<anonymous> (/Users/dominic/Playground/fyc-api/node_modules/@nestjs/schedule/dist/schedule.explorer.js:96:17) {
  [cause]: Error: read ECONNRESET
      at TLSWrap.onStreamRead (node:internal/stream_base_commons:216:20) {
    errno: -54,
    code: 'ECONNRESET',
    syscall: 'read'
  }
}

The ECONNRESET part is what matters: the connection was reset by the remote side while the response was being read.
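Both stack traces wrap the real failure in a generic `TypeError: fetch failed`, and the useful detail sits in `cause.code`. A small helper can pull that code out so the two cases can be logged or handled separately. This is only a sketch: `getCauseCode`, `isRetryable`, and `RETRYABLE_CODES` are hypothetical names, and treating these two codes as retryable is an assumption, not something the logs confirm.

```typescript
// Codes seen in the two logs; treating them as transient is an assumption.
const RETRYABLE_CODES = new Set(["ECONNRESET", "UND_ERR_CONNECT_TIMEOUT"]);

// Extracts `cause.code` from an error such as "TypeError: fetch failed".
function getCauseCode(err: unknown): string | undefined {
  if (!(err instanceof Error)) return undefined;
  const cause = (err as Error & { cause?: unknown }).cause;
  if (typeof cause === "object" && cause !== null && "code" in cause) {
    const code = (cause as { code?: unknown }).code;
    return typeof code === "string" ? code : undefined;
  }
  return undefined;
}

// True only when the underlying cause is one of the codes listed above.
function isRetryable(err: unknown): boolean {
  const code = getCauseCode(err);
  return code !== undefined && RETRYABLE_CODES.has(code);
}
```

With this in place, a `catch` block around the channel lookup could log `getCauseCode(err)` instead of the whole stack, which makes it easier to count how many requests end in a reset versus a connect timeout.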

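The job requests information for about 344 channels at once, which is probably what triggers the problem.

If we assume that Naver applies a rate limit when a single client sends too many requests, this explanation is quite plausible.

This is hard to verify from our side, so the way forward is to run the job again with fewer requests or a lower request frequency.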
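One way to lower the request rate is to process the channels in small batches with a pause between batches. The sketch below is an illustration under stated assumptions, not the fix that was actually deployed: `runThrottled`, `sleep`, `Settled`, and the option names are hypothetical, and suitable batch sizes and delays are unknown because Naver's limits are not documented here.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Per-item outcome, so one failed request does not abort the whole run.
type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

interface ThrottleOptions {
  batchSize: number; // how many requests may run concurrently
  delayMs: number; // pause between consecutive batches
}

// Runs `task` over `items` batch by batch, waiting `delayMs` between batches.
async function runThrottled<T, R>(
  items: readonly T[],
  task: (item: T) => Promise<R>,
  { batchSize, delayMs }: ThrottleOptions,
): Promise<Settled<R>[]> {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");

  // Converts a rejection into a value so the batch always completes.
  const settle = async (item: T): Promise<Settled<R>> => {
    try {
      return { ok: true, value: await task(item) };
    } catch (error) {
      return { ok: false, error };
    }
  };

  const results: Settled<R>[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(settle))));
    // Skip the pause after the final batch.
    if (i + batchSize < items.length) await sleep(delayMs);
  }
  return results;
}
```

A call might look like `runThrottled(channelIds, (id) => repository.getChannelById(id), { batchSize: 10, delayMs: 1000 })`, where `channelIds`, `repository`, and both numbers are placeholders to be tuned. Because failures come back as `{ ok: false }` entries, the caller can skip those channels instead of reading `channelName` from an undefined result.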