Multimodal Agent Harness

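Why Multimodal?

Some information cannot be expressed well in text alone. A UI bug is most naturally conveyed by a screenshot, a performance problem by a profiling chart, and a voice command by audio.

A multimodal agent integrates several input channels, so it perceives its environment and works in a way that is closer to how a person would.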

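Single-modality agent:
  "The button is misaligned" → the agent guesses from the code alone

Multimodal agent:
  Screenshot + "The button is misaligned" → the agent sees the actual screen and fixes it accurately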

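Multimodal Integration Architecture

Three Modalities

Modality	Input forms	Use cases
Vision	Screenshots, UI images, diagrams	UI bug fixes, design implementation, error-screen analysis
Code	Source code, diffs, ASTs	General code writing and modification
Audio	Voice commands, meeting recordings	Giving task instructions by voice, extracting requirements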

Modality Router

interface MultimodalInput {
  text?: string
  images?: ImageData[]
  audio?: AudioData
  code?: string
}

interface ImageData {
  base64: string
  mediaType: 'image/png' | 'image/jpeg' | 'image/webp'
}

// AudioData, AgentResponse, transcribeAudio, runVisionAgent, and runTextAgent
// are assumed to be defined elsewhere in the harness.
async function routeMultimodalTask(
  input: MultimodalInput
): Promise<AgentResponse> {
  const hasImages = (input.images?.length ?? 0) > 0

  // Audio input → transcribe to text first, then route again without the audio
  if (input.audio !== undefined) {
    const transcribed = await transcribeAudio(input.audio)
    // Keep any text that accompanied the audio instead of overwriting it
    const text = [input.text, transcribed].filter(Boolean).join('\n')
    return routeMultimodalTask({ ...input, text, audio: undefined })
  }

  // Images (with or without text) → use the VLM
  if (hasImages) {
    return await runVisionAgent(input)
  }

  // Text/code only → standard text agent
  return await runTextAgent(input)
}
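The precedence the router applies (audio first, then images, then plain text) can be checked without calling any model. The sketch below is one way to isolate that decision as a pure function; `RouteInput`, `Route`, and `selectRoute` are hypothetical names introduced here for illustration and are not part of the harness above.

```typescript
// Hypothetical, simplified input shape that mirrors MultimodalInput.
interface RouteInput {
  text?: string
  images?: unknown[]
  audio?: unknown
}

type Route = 'transcribe' | 'vision' | 'text'

// Pure routing decision: audio takes precedence, then images, then text.
function selectRoute(input: RouteInput): Route {
  if (input.audio !== undefined) return 'transcribe'
  if ((input.images?.length ?? 0) > 0) return 'vision'
  return 'text'
}

console.log(selectRoute({ text: 'fix the button', images: [{}] })) // vision
console.log(selectRoute({ text: 'rename this function' }))         // text
```

Keeping the decision separate from the model calls makes the ordering easy to unit test, which matters once more modalities are added.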

VLM-Based UI Analysis Agent

With a Vision Language Model (VLM), the agent can look at the actual screen while it works.

Screenshot Analysis Agent

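import Anthropic from '@anthropic-ai/sdk'
import * as fs from 'fs'

// Reads the API key from the ANTHROPIC_API_KEY environment variable
const client = new Anthropic()

async function analyzeUIBug(
  screenshotPath: string,
  bugDescription: string
): Promise<{ diagnosis: string; fixSuggestion: string }> {
  const imageData = fs.readFileSync(screenshotPath)
  const base64Image = imageData.toString('base64')

  const response = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 2048,
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'image',
            source: {
              type: 'base64',
              // Must match the actual file format; this assumes a PNG screenshot
              media_type: 'image/png',
              data: base64Image
            }
          },
          {
            type: 'text',
            text: `Analyze the following UI screenshot and diagnose the bug.
Bug description: ${bugDescription}

Respond in the following format:
1. Diagnosis: what is wrong
2. Fix suggestion: which CSS/code should be changed`
          }
        ]
      }
    ]
  })

  // Join all text blocks rather than assuming the first block is text
  const text = response.content
    .flatMap((block) => (block.type === 'text' ? [block.text] : []))
    .join('\n')

  // Split at the first "2." item; if the model ignored the format,
  // the whole reply is returned as the diagnosis
  const match = text.match(/^([\s\S]*?)\n\s*2\.\s*([\s\S]*)$/)
  const diagnosis = match ? match[1] : text
  const fixSuggestion = match ? match[2] : ''

  return {
    diagnosis: diagnosis.replace(/^\s*1\.\s*Diagnosis:\s*/i, '').trim(),
    fixSuggestion: fixSuggestion.replace(/^Fix suggestion:\s*/i, '').trim()
  }
}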
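The image block's `media_type` has to match the bytes that are actually sent, so a screenshot saved as JPEG or WebP should not be labeled as PNG. A minimal sketch of one way to derive it from the file name follows; `mediaTypeFromPath` is a hypothetical helper, and trusting the extension is an assumption (inspecting the file's leading bytes would be more robust).

```typescript
type ImageMediaType = 'image/png' | 'image/jpeg' | 'image/webp'

// Hypothetical helper: infer the media type from the file extension.
// ASSUMPTION: the extension reflects the real file format.
function mediaTypeFromPath(filePath: string): ImageMediaType {
  const ext = filePath.slice(filePath.lastIndexOf('.') + 1).toLowerCase()
  switch (ext) {
    case 'png':
      return 'image/png'
    case 'jpg':
    case 'jpeg':
      return 'image/jpeg'
    case 'webp':
      return 'image/webp'
    default:
      // Fail loudly instead of sending a mislabeled image
      throw new Error(`Unsupported image extension: ${ext}`)
  }
}

console.log(mediaTypeFromPath('screens/login-bug.PNG')) // image/png
```

The result could be passed as the `media_type` value in place of a hardcoded string.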

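Model Architecture Specialized by Cognitive Stage

This architecture mirrors human cognition by assigning a specialized model to each stage.

Five-Stage Cognitive Pipeline

Input received → Perception → Comprehension → Planning → Execution → Verification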

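Stage	Role	Suitable model
Perception	Parse input modalities, extract relevant information	VLM, ASR model
Comprehension	Identify intent, analyze context	Sonnet (balanced)
Planning	Decompose the task, set strategy	Opus (deep reasoning)
Execution	Write the actual code, call tools	Sonnet (speed + quality)
Verification	Validate output, run tests	Haiku (fast)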
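The stage-to-model assignment in the table can be kept as plain configuration so that it changes in one place. The sketch below is illustrative only: `Stage`, `ModelTier`, `STAGE_MODEL`, and `modelFor` are hypothetical names, and the tier labels are placeholders rather than real API model identifiers.

```typescript
type Stage =
  | 'perception'
  | 'comprehension'
  | 'planning'
  | 'execution'
  | 'verification'

// Placeholder tier labels, NOT API model IDs.
type ModelTier = 'vlm-or-asr' | 'sonnet' | 'opus' | 'haiku'

// One entry per stage; Record<Stage, ...> makes the compiler
// reject a missing or misspelled stage.
const STAGE_MODEL: Record<Stage, ModelTier> = {
  perception: 'vlm-or-asr',
  comprehension: 'sonnet',
  planning: 'opus',
  execution: 'sonnet',
  verification: 'haiku',
}

function modelFor(stage: Stage): ModelTier {
  return STAGE_MODEL[stage]
}

console.log(modelFor('planning')) // opus
```

A harness would then map each tier to a concrete model ID at startup, which keeps model upgrades out of the pipeline code.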

Pipeline Implementation

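interface CognitivePipelineResult {
  perception: PerceptionOutput
  comprehension: ComprehensionOutput
  plan: ExecutionPlan
  execution: ExecutionResult
  verification: VerificationResult
}

// The stage models, contextManager, and the *Output/*Result types are
// assumed to be defined elsewhere in the harness.
async function runCognitivePipeline(
  input: MultimodalInput,
  maxRetries = 2
): Promise<CognitivePipelineResult> {
  // 1. Perception: parse the multimodal input
  const perception = await perceptionModel.process(input)

  // 2. Comprehension: analyze intent and context
  const comprehension = await comprehensionModel.analyze({
    raw: perception,
    context: await contextManager.getRelevant(perception)
  })

  // 3. Planning: set the strategy with Opus
  const plan = await planningModel.createPlan(comprehension)

  // 4. Execution: do the actual work with Sonnet
  const execution = await executionModel.execute(plan)

  // 5. Verification: quick check with Haiku
  const verification = await verificationModel.verify(execution)

  if (!verification.passed && maxRetries > 0) {
    // On failure, rerun the whole pipeline with the errors added to the prompt.
    // maxRetries bounds the recursion so a persistent failure cannot loop forever.
    return runCognitivePipeline(
      {
        ...input,
        text: `Previous attempt failed: ${verification.errors.join(', ')}\nOriginal task: ${input.text ?? ''}`
      },
      maxRetries - 1
    )
  }

  // May still carry verification.passed === false once retries are exhausted
  return { perception, comprehension, plan, execution, verification }
}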

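Multimodal Harness Design Considerations

Image Processing Cost

A single screenshot can consume several times more tokens than the text prompt it accompanies. The figures below are illustrative; actual counts depend on the model and the image resolution.

Plain text request: ~500 input tokens
Same task + screenshot (1080p): ~2,500 input tokens (5x)

→ Image compression in a preprocessing step is essential
→ Crop out irrelevant UI regions
→ Downsample to a lower resolution (sufficient for most UI analysis)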
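To see why downsampling pays off, the image cost can be estimated from pixel dimensions. The sketch below assumes roughly 750 pixels per token, the rule of thumb given in Anthropic's vision documentation; treat that constant as an assumption, since other models and providers count differently and some resize large images on the server. `estimateImageTokens` and `scaledSize` are hypothetical helpers.

```typescript
// ASSUMPTION: ~750 pixels per token (rough heuristic; varies by model/provider).
function estimateImageTokens(
  width: number,
  height: number,
  pixelsPerToken = 750
): number {
  return Math.ceil((width * height) / pixelsPerToken)
}

// Dimensions after shrinking to at most maxWidth, keeping the aspect
// ratio and never enlarging a smaller image.
function scaledSize(
  width: number,
  height: number,
  maxWidth: number
): { width: number; height: number } {
  if (width <= maxWidth) return { width, height }
  const scale = maxWidth / width
  return { width: maxWidth, height: Math.round(height * scale) }
}

const original = estimateImageTokens(1920, 1080)
const small = scaledSize(1920, 1080, 1280)
const reduced = estimateImageTokens(small.width, small.height)

console.log(original) // 2765
console.log(reduced)  // 1229
```

Under this heuristic, shrinking a 1080p screenshot to 1280 pixels wide cuts the estimated image cost by more than half, because cost scales with area rather than width.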

Image Preprocessing Pipeline

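import sharp from 'sharp'

// Downscales and re-encodes an image, returning base64-encoded JPEG data.
// Callers must send the result with media type 'image/jpeg', whatever the
// source format was.
async function preprocessForAgent(
  imagePath: string,
  options: { maxWidth?: number; quality?: number } = {}
): Promise<string> {
  const { maxWidth = 1280, quality = 80 } = options

  const processed = await sharp(imagePath)
    // Shrink to maxWidth, keeping aspect ratio; never upscale smaller images
    .resize({ width: maxWidth, withoutEnlargement: true })
    // Lossy JPEG encoding; lower quality means fewer bytes
    .jpeg({ quality })
    .toBuffer()

  return processed.toString('base64')
}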
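Because `preprocessForAgent` always re-encodes to JPEG, its output should be sent as `image/jpeg` even when the source file was a PNG. A small sketch of wrapping the base64 string in an image content block of the shape used in the screenshot example; `SupportedMediaType`, `ImageBlock`, and `toImageBlock` are hypothetical names.

```typescript
type SupportedMediaType = 'image/png' | 'image/jpeg' | 'image/webp'

// Shape of a base64 image content block, as used in the messages payload.
interface ImageBlock {
  type: 'image'
  source: { type: 'base64'; media_type: SupportedMediaType; data: string }
}

// Hypothetical helper: defaults to JPEG because the preprocessing
// step re-encodes every image as JPEG.
function toImageBlock(
  base64: string,
  mediaType: SupportedMediaType = 'image/jpeg'
): ImageBlock {
  if (base64.length === 0) {
    throw new Error('Image data is empty')
  }
  return {
    type: 'image',
    source: { type: 'base64', media_type: mediaType, data: base64 },
  }
}

const block = toImageBlock('aGVsbG8=')
console.log(block.source.media_type) // image/jpeg
```

Building the block in one place avoids the mismatch where preprocessed JPEG data is labeled as PNG.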

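Summary

A multimodal Agent Harness integrates vision, code, and audio so that the agent can perceive its environment more richly. Use a stage-specialized model architecture to match each cognitive stage with a suitable model, and control cost with image preprocessing. The most valuable place to start is a VLM-based UI bug analysis agent.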