
Hive ACP v0.2.0: Notifications that never get lost

by Hugo Hernández Valdez

From orchestration to reliability

In part two I built a multi-agent orchestration system where Kiro and OpenCode work together, with a JobManager dispatching tasks in parallel and a knowledge graph persisting facts across sessions. It worked, but under real load subtle problems appeared that undermined trust in the system.

The worst one: you ask the bot to analyze a repository with 3 subagents in parallel. All 3 finish successfully. But you never receive the result. The bot goes silent. No error in the logs. No crash. Just... nothing.

This happened only when the timing lined up exactly: a job finished while the orchestrator was still processing another message. The result was queued and never drained. This release focuses on that class of problems: the ones that don't crash but make you distrust your own system.

The cost of untyped EventEmitter

Node.js EventEmitter is a silent time bomb. Everything compiles, a misspelled event name is silently ignored at runtime, and the bug manifests as "sometimes it doesn't work":

// This compiles perfectly. It's a bug.
client.emit("chnuk", text);        // typo → event never arrives
client.on("tool", (name: number) => {}); // wrong type → runtime crash
jobManager.emit("events", evt);    // plural → nobody listens

With 3 classes emitting events, listeners in 4 different files, and frequent refactors, it was only a matter of time before a typo caused an impossible-to-trace bug.

The fix: 40 lines that change everything

import { EventEmitter } from "node:events";

export class TypedEmitter<E extends Record<string, (...args: any[]) => void>> {
  private emitter = new EventEmitter();

  on<K extends keyof E & string>(event: K, listener: E[K]): this {
    this.emitter.on(event, listener as any);
    return this;
  }

  emit<K extends keyof E & string>(event: K, ...args: Parameters<E[K]>): boolean {
    return this.emitter.emit(event, ...args);
  }

  off<K extends keyof E & string>(event: K, listener: E[K]): this {
    this.emitter.off(event, listener as any);
    return this;
  }
}

Now each class declares exactly what events it emits and with what types:

export type AcpEvents = {
  chunk: (text: string) => void;
  tool: (name: string, toolCallId: string) => void;
  tool_update: (toolCallId: string, status: string) => void;
  turn_end: (text: string) => void;
  exit: (code: number | null) => void;
};

export class AcpClient extends TypedEmitter<AcpEvents> { ... }

The result: if I write client.emit("chnuk", text), the build fails. Not at runtime, not in production at 3am — it fails in my editor, with a red underline, before I save the file.

❌ Error: Argument of type '"chnuk"' is not assignable to parameter of type '"chunk" | "tool" | "tool_update" | "turn_end" | "exit"'
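The flip side of that error is that correct calls get full inference. The following is a minimal, self-contained sketch rather than the project's code: it repeats a trimmed TypedEmitter so it runs on its own, and DemoEvents and DemoClient are hypothetical names used only for illustration.

```typescript
import { EventEmitter } from "node:events";

type Listener = (...args: any[]) => void;

// Trimmed copy of the TypedEmitter idea from the post (on/emit only).
class TypedEmitter<E extends Record<string, Listener>> {
  private emitter = new EventEmitter();

  on<K extends keyof E & string>(event: K, listener: E[K]): this {
    this.emitter.on(event, listener as Listener);
    return this;
  }

  emit<K extends keyof E & string>(event: K, ...args: Parameters<E[K]>): boolean {
    return this.emitter.emit(event, ...args);
  }
}

// Hypothetical event map, a subset of AcpEvents.
type DemoEvents = {
  chunk: (text: string) => void;
  exit: (code: number | null) => void;
};

class DemoClient extends TypedEmitter<DemoEvents> {}

const client = new DemoClient();
const received: string[] = [];

// `text` is inferred as string from the event map; no annotation needed.
client.on("chunk", (text) => received.push(text.toUpperCase()));

const delivered = client.emit("chunk", "hello"); // a listener exists
const ignored = client.emit("exit", 0);          // no listener registered
```

The boolean returned by emit is the same one Node's EventEmitter returns: it reports whether any listener was registered for that event, which is handy in tests.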

One detail that cost me 20 minutes: an event map declared as an interface does not satisfy the Record<string, ...> constraint, because interfaces have no implicit index signature. The fix is to declare event maps with type instead of interface.
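To make the interface issue concrete, here is a small sketch. The helper functions are hypothetical and exist only to reproduce the constraint. The second helper shows one possible alternative, a self-referential constraint, which as far as I know also accepts interfaces; the post itself simply uses type aliases.

```typescript
type Listener = (...args: any[]) => void;

// Same constraint shape as TypedEmitter: requires a string index signature.
function eventNames<E extends Record<string, Listener>>(handlers: E): string[] {
  return Object.keys(handlers);
}

// Assumed alternative: constrain only the keys E actually declares.
function eventNamesLoose<E extends Record<keyof E, Listener>>(handlers: E): string[] {
  return Object.keys(handlers);
}

type AliasEvents = { chunk: (text: string) => void };
interface IfaceEvents {
  chunk: (text: string) => void;
}

const viaAlias: AliasEvents = { chunk: () => {} };
const viaIface: IfaceEvents = { chunk: () => {} };

const fromAlias = eventNames(viaAlias); // OK: type aliases get an implicit index signature

// eventNames(viaIface);
// Rejected by the compiler: index signature for type 'string' is missing in type 'IfaceEvents'.

const fromIface = eventNamesLoose(viaIface); // OK with the looser constraint
```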

The lost notifications bug

This was the most frustrating bug in the project. The flow:

1. User sends message → orchestrator starts processing (busy=true)
2. Subagent job finishes → results are queued
3. drainToAgent() sees client is busy → re-queues and returns null
4. Adapter receives null → sends generic fallback "Job finished"
5. Results stay in queue... forever

The user saw "📋 Job finished — 3/3 tasks completed" but never received the actual content. Results only appeared if the user sent another message afterwards (because the next prompt consumed them from the queue).
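The five steps above can be reproduced with a toy model. This is only a sketch of the failure mode under simplified assumptions: NotificationQueueSketch and its members are hypothetical names, not the project's actual classes.

```typescript
// Toy model of the bug: draining while busy returns null and leaves results queued.
class NotificationQueueSketch {
  busy = false;
  private queue: string[] = [];

  enqueue(result: string): void {
    this.queue.push(result);
  }

  // Mirrors step 3: if the client is busy, give up and keep the results queued.
  drainToAgent(): string[] | null {
    if (this.busy) return null;
    return this.queue.splice(0); // hand over everything and empty the queue
  }

  pending(): number {
    return this.queue.length;
  }
}

const agent = new NotificationQueueSketch();

agent.busy = true;                        // step 1: orchestrator is processing a message
agent.enqueue("3/3 tasks: full report");  // step 2: subagent job finishes

const whileBusy = agent.drainToAgent();   // steps 3-4: null, adapter sends the fallback
const stuck = agent.pending();            // step 5: the result is still queued

agent.busy = false;                       // orchestrator finishes, but nothing retries the drain
const onNextMessage = agent.drainToAgent(); // only the user's next message picks it up
```

Nothing in this model reacts to busy flipping back to false, which is exactly the gap the rest of this section closes.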

First attempt: timer-based backoff

const BACKOFF = [2_000, 4_000, 8_000, 16_000]; // ~30s total

for (let attempt = 0; attempt < BACKOFF.length; attempt++) {
  if (!entry.busy) break; // already free, stop waiting
  await new Promise((r) => setTimeout(r, BACKOFF[attempt]));
}

This worked for prompts that took 5 to 10 seconds. But if the user asked "analyze the entire repository" and the orchestrator took 2 minutes, the 30 seconds ran out and we were back to the same problem.

The right solution: don't guess, listen

The answer was obvious once I stopped thinking in timers: wait for the event to happen, don't guess when it will.

AcpPool now emits idle when a client transitions from busy to free:

setBusy(chatId: number, busy: boolean): void {
  const entry = this.pool.get(chatId);
  if (!entry) return;
  const wasBusy = entry.busy;
  entry.busy = busy;
  if (wasBusy && !busy) this.emit("idle", chatId);
}

And drainToAgent simply waits for that event:

if (entry.busy) {
  log.acp.debug({ chatId }, "client busy, waiting for idle event");

  const idle = await new Promise<boolean>((resolve) => {
    const timeout = setTimeout(() => { cleanup(); resolve(false); }, 5 * 60_000);
    const onIdle = (id: number) => {
      if (id !== chatId) return;
      cleanup();
      resolve(true);
    };
    const cleanup = () => { clearTimeout(timeout); this.off("idle", onIdle); };
    this.on("idle", onIdle);
  });

  if (!idle) {
    this.inject(chatId, queued); // 5 min with no response = something died
    return null;
  }
}

Doesn't matter if the prompt takes 3 seconds or 3 minutes. The drain executes immediately when the orchestrator becomes free. Zero polling, zero waste, zero lost notifications.

The 5-minute timeout is only a safety net for the extreme case where the process dies without emitting the event.
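Put together, the two snippets above amount to a "wait for idle with a safety timeout" helper. Here is a self-contained sketch of that pattern, assuming a plain EventEmitter and a hypothetical PoolSketch class; the real AcpPool holds more state than a busy flag.

```typescript
import { EventEmitter } from "node:events";

class PoolSketch extends EventEmitter {
  private busy = new Map<number, boolean>();

  setBusy(chatId: number, busy: boolean): void {
    const wasBusy = this.busy.get(chatId) ?? false;
    this.busy.set(chatId, busy);
    // Emit only on the busy -> free transition.
    if (wasBusy && !busy) this.emit("idle", chatId);
  }

  isBusy(chatId: number): boolean {
    return this.busy.get(chatId) ?? false;
  }

  // Resolves true when chatId becomes idle, false if timeoutMs elapses first.
  waitForIdle(chatId: number, timeoutMs: number): Promise<boolean> {
    if (!this.isBusy(chatId)) return Promise.resolve(true);
    return new Promise<boolean>((resolve) => {
      const onIdle = (id: number) => {
        if (id !== chatId) return; // another chat became idle; keep waiting
        cleanup();
        resolve(true);
      };
      const timer = setTimeout(() => {
        cleanup();
        resolve(false);
      }, timeoutMs);
      // Always remove both the timer and the listener so neither leaks.
      const cleanup = () => {
        clearTimeout(timer);
        this.off("idle", onIdle);
      };
      this.on("idle", onIdle);
    });
  }
}

const pool = new PoolSketch();
const outcome = { idle: null as boolean | null };

pool.setBusy(1, true);
pool.waitForIdle(1, 60_000).then((idle) => {
  outcome.idle = idle;
});

pool.setBusy(2, true);
pool.setBusy(2, false); // idle for chat 2: ignored by the waiter for chat 1
const listenersWhileWaiting = pool.listenerCount("idle");

pool.setBusy(1, false); // idle for chat 1: the waiter resolves and cleans up
const listenersAfter = pool.listenerCount("idle");
```

The cleanup function matters as much as the event itself: without it, every drain that waits would leave a listener behind, and a timed-out wait would still hold a live listener.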

From 590 lines to reusable modules

The TelegramAdapter was a 590-line file where these coexisted:

  • Markdown→HTML conversion with code block handling
  • Message splitting respecting UTF-16 surrogate pairs
  • Rate limiting with Telegram's RetryAfter
  • All the bot logic

The problem wasn't just length — when the Slack adapter arrives, it'll need throttling and splitting but can't import them without pulling in all of Telegram.

Extraction

src/utils/telegram-html.ts (78 lines):

// Public API only; implementations omitted.
export declare function escapeHtml(text: string): string;
export declare function mdToHtml(text: string): string;    // **bold** → <b>bold</b>
export declare function splitMessage(text: string, maxLen?: number): string[];
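For readers wondering what "respecting UTF-16 surrogate pairs" means in practice, here is a minimal sketch. It is not the project's splitMessage: the name splitMessageSketch is hypothetical, the 4096 default is an assumption based on Telegram's message length limit, and the real function may also prefer line or word boundaries.

```typescript
// Splits text into chunks of at most maxLen UTF-16 code units,
// never cutting between the two halves of a surrogate pair.
function splitMessageSketch(text: string, maxLen = 4096): string[] {
  if (maxLen < 2) throw new RangeError("maxLen must be at least 2");
  const parts: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxLen, text.length);
    if (end < text.length) {
      const last = text.charCodeAt(end - 1);
      // A high surrogate (0xD800-0xDBFF) at the cut point means its low
      // surrogate would land in the next chunk, so back up by one unit.
      if (last >= 0xd800 && last <= 0xdbff) end -= 1;
    }
    parts.push(text.slice(start, end));
    start = end;
  }
  return parts;
}

// "😀" takes two code units; a naive cut at index 3 would split it in half.
const chunks = splitMessageSketch("ab😀cd", 3);
```

Requiring maxLen to be at least 2 guarantees progress: a chunk can always hold one full surrogate pair, so the loop never stalls on a lone emoji.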

src/utils/throttle.ts (42 lines):

// Public API only; implementations omitted.
export declare class OutboundThrottle {
  wait(): Promise<void>;   // wait until allowed to send
  tryNow(): boolean;       // try without blocking
  defer(ms: number): void; // backoff for RetryAfter
}
export declare function getRetryAfter(err: unknown): number | null;

The adapter dropped to ~460 LOC — only Telegram logic. Utilities are importable from any future adapter.
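As a rough idea of how little an adapter-agnostic throttle needs, here is a sketch that matches the three method names above. The internals are assumptions: ThrottleSketch, the minimum-interval strategy, and the injectable clock are illustrative and not taken from src/utils/throttle.ts.

```typescript
class ThrottleSketch {
  private nextAllowed = 0; // timestamp (ms) when the next send is allowed
  private readonly minIntervalMs: number;
  private readonly now: () => number;

  // `now` is injectable so the behavior can be checked without real delays.
  constructor(minIntervalMs = 1000, now: () => number = () => Date.now()) {
    this.minIntervalMs = minIntervalMs;
    this.now = now;
  }

  // Claims a send slot if one is available; never blocks.
  tryNow(): boolean {
    const t = this.now();
    if (t < this.nextAllowed) return false;
    this.nextAllowed = t + this.minIntervalMs;
    return true;
  }

  // Pushes the next slot at least `ms` into the future (e.g. after RetryAfter).
  defer(ms: number): void {
    this.nextAllowed = Math.max(this.nextAllowed, this.now() + ms);
  }

  // Sleeps until a slot can be claimed.
  async wait(): Promise<void> {
    while (!this.tryNow()) {
      const delay = Math.max(this.nextAllowed - this.now(), 0);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage with a fake clock:
let fakeNow = 0;
const throttle = new ThrottleSketch(1000, () => fakeNow);

const firstSend = throttle.tryNow();   // slot free at t=0
const tooSoon = throttle.tryNow();     // still t=0, inside the interval
fakeNow = 1000;
const afterInterval = throttle.tryNow();
throttle.defer(5000);                  // server asked us to back off for 5s
fakeNow = 3000;
const duringBackoff = throttle.tryNow();
fakeNow = 6000;
const afterBackoff = throttle.tryNow();
```

Nothing here knows about Telegram, which is the point of the extraction: a Slack adapter could reuse the same class and only translate its own rate-limit errors into defer calls.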

The agent that couldn't save photos

A user reported: "I send an image to the bot and tell it to save it to the project. It says it can't access the file."

They were right. The flow was:

1. User sends photo
2. Adapter downloads as base64, passes to agent in prompt
3. Agent "sees" the image (can describe it, analyze it)
4. Agent tries to save it → has no access to the binary

The agent received the image as visual prompt content, but not as a file it could manipulate. This is the same limitation LLMs have with images in general: they "see" them but can't extract the binary back.

New tool: telegram_download_attachment

{
  name: "telegram_download_attachment",
  description: "Download the last attachment the user sent to /tmp and return the path.",
}

Now when the adapter processes a photo or document, it stores the file_id in context:

private setAttachment(chatId: number, fileId: string, fileName: string, mimeType: string): void {
  const ctx = this.activeCtx.get(chatId);
  if (ctx) ctx.attachment = { fileId, fileName, mimeType };
}

When the agent needs the file, it calls the tool. The tool downloads via Telegram API and returns the path:

/tmp/telegram-1715180400000-photo.jpg

From there the agent can copy it to the workspace, process it, or whatever it needs.
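A handler for such a tool could look roughly like the sketch below. It assumes direct calls to the Telegram Bot API (getFile, then the file download URL) through the global fetch available in recent Node versions; the function names, the Attachment type, and the file name sanitizing are hypothetical, and the project may well use a bot library instead.

```typescript
import { writeFile } from "node:fs/promises";

// Hypothetical shape, mirroring the fields stored by setAttachment.
type Attachment = { fileId: string; fileName: string; mimeType: string };

// Builds a path like /tmp/telegram-1715180400000-photo.jpg.
function tmpPathFor(fileName: string, now: number = Date.now()): string {
  // Keep only the last path segment so a crafted name cannot escape /tmp.
  const base = fileName.split(/[\\/]/).pop() || "attachment";
  return `/tmp/telegram-${now}-${base}`;
}

async function downloadAttachment(token: string, att: Attachment): Promise<string> {
  // Step 1: resolve the file_id into a temporary file_path.
  const metaUrl =
    `https://api.telegram.org/bot${token}/getFile?file_id=${encodeURIComponent(att.fileId)}`;
  const metaRes = await fetch(metaUrl);
  const meta = (await metaRes.json()) as { ok: boolean; result?: { file_path?: string } };
  if (!meta.ok || !meta.result?.file_path) throw new Error("getFile failed");

  // Step 2: download the bytes and write them to /tmp.
  const fileRes = await fetch(`https://api.telegram.org/file/bot${token}/${meta.result.file_path}`);
  if (!fileRes.ok) throw new Error(`download failed with status ${fileRes.status}`);
  const dest = tmpPathFor(att.fileName);
  await writeFile(dest, new Uint8Array(await fileRes.arrayBuffer()));
  return dest;
}

const examplePath = tmpPathFor("photo.jpg", 1715180400000);
const sanitizedPath = tmpPathFor("../../etc/passwd", 1);
```

Storing only the file_id in context and downloading on demand keeps the prompt path cheap: the binary is fetched only when the agent actually asks for it.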

Errors swallowed in silence

The promptLock serializes prompts — if you send two messages quickly, the second waits for the first to finish. But the error handler was:

this.promptLock = result.then(() => {}, () => {});
//                                       ^^^^^^^^ error swallowed

If a prompt failed (timeout, dead process, network error), the lock was released correctly but nobody found out: the logs showed nothing. The next prompt worked fine, and the previous error vanished into the void.

Now:

this.promptLock = result.then(() => {}, (err) => {
  log.acp.warn({ err: err?.message }, "prompt failed (lock released)");
});

One log call. The difference between "the bot sometimes doesn't respond and I don't know why" and "ah, the prompt failed due to timeout at 14:32".
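For context, this is roughly how a promise-chain lock like promptLock can serialize work while still surfacing failures. It is a sketch under assumptions: PromptLockSketch is a hypothetical class, and the failures array stands in for the log.acp.warn call.

```typescript
class PromptLockSketch {
  private lock: Promise<void> = Promise.resolve();
  readonly failures: string[] = []; // stand-in for a logger

  // Runs tasks strictly one after another, in call order.
  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.lock.then(task);
    // The lock must never reject, or every later task would be skipped.
    // Absorb the error here, but record it instead of discarding it.
    this.lock = result.then(
      () => {},
      (err: unknown) => {
        const message = err instanceof Error ? err.message : String(err);
        this.failures.push(message);
      },
    );
    return result; // the caller still sees the original outcome
  }
}

const promptLock = new PromptLockSketch();
const order: string[] = [];

const failing = promptLock.run(async () => {
  order.push("first");
  throw new Error("timeout");
});
const following = promptLock.run(async () => {
  order.push("second");
  return "ok";
});

failing.catch(() => {}); // the caller handles its own rejection
```

The empty rejection handler in the original was there for a good reason, keeping the chain alive. The bug was only that it did nothing else.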

Cleanup with strict TypeScript

I ran tsc --noUnusedLocals --noUnusedParameters and found:

src/acp/pool.ts(56,23): error TS6138: Property 'registry' is declared but its value is never read.
src/adapters/chat/telegram/adapter.ts(252,9): error TS6133: 'acpInstance' is declared but its value is never read.

Two ghost variables that survived previous refactors. Removed. The project now passes the strict check clean — and it should be part of CI.

Measurable impact

| Metric | v0.1.0 | v0.2.0 |
| --- | --- | --- |
| Adapter LOC | ~590 | ~460 (-22%) |
| Possible typing bugs in events | | 0 (compile-time) |
| Lost notifications due to busy | Frequent | Eliminated |
| Max wait time for drain | 30s (then lost) | Until done (event-driven) |
| Reusable modules for adapters | 0 | 2 |
| Dead code | 2 variables | 0 |
| Silent errors in promptLock | All of them | 0 (logged) |

Lessons

  1. Event-driven > polling: If you're doing setTimeout in a loop waiting for something to change, you should probably emit an event when it changes. The code is simpler and more efficient, and the timing edge cases shrink to a single safety timeout.

  2. 40 lines of infrastructure save hours of debugging: TypedEmitter is trivial to implement. The ROI is enormous — every event typo that TypeScript catches is a bug you won't debug in production.

  3. Extract utilities before you need them in two places: If you wait for duplication to extract, you already have two diverging implementations to reconcile. Proactive extraction is cheaper.

  4. Silent errors are worse than crashes: A crash tells you exactly what failed and where. A swallowed error leaves you with "sometimes it doesn't work" and hours of investigation.

What's next

  • Unit tests: TripleStore, splitMessage, mdToHtml and pool logic are perfect candidates
  • Timeout on acp.prompt(): The last blind spot — if the agent hangs, there's no limit
  • Slack adapter: With utilities extracted, it's just implementing ChatAdapter

Code remains at github.com/gouh/hive-acp.
