I have a complicated relationship with meetings. Some are genuinely useful — the ones where decisions get made and context gets shared. Most aren't. They're status updates that could've been a message, or calls where I'm only needed for five minutes out of forty-five.

So I built a meeting bot. My agent joins Google Meet and Zoom calls, records the audio, transcribes it with Whisper, and hands me a summary. I read a two-minute digest instead of sitting through an hour-long call.

How It Works

The bot is built as an OpenClaw skill — a set of instructions and scripts the agent follows when asked to join a meeting. Here's the pipeline:

1. Join the Call

The agent launches a headless browser (Chromium via Puppeteer), navigates to the meeting URL, and joins as a bot account. It has its own dedicated Google account — not mine — so it shows up with its own name. People in the meeting can see the bot joined, which matters for transparency. No secret recording.

Google Meet and Zoom have different join flows, different button layouts, different permission dialogs. The skill handles both, including dismissing popups, accepting permissions, and waiting for the meeting to actually start.
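The first step of that branching is just figuring out which platform a URL belongs to. A minimal sketch of that dispatch, assuming a hypothetical `detect_platform` helper (the real skill drives the join flow through Puppeteer; only the URL check is shown here):

```python
from urllib.parse import urlparse

def detect_platform(meeting_url: str) -> str:
    """Return which join flow to run for a given meeting URL."""
    host = urlparse(meeting_url).netloc.lower()
    if "meet.google.com" in host:
        return "google-meet"
    if "zoom.us" in host:
        return "zoom"
    raise ValueError(f"unsupported meeting platform: {meeting_url}")
```

Each return value would map to its own join script with that platform's button selectors and permission dialogs.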

2. Record Audio

Once in the call, the bot captures system audio using PulseAudio. It records to a WAV file — no compression, no quality loss. The recording runs until the meeting ends (detected by participant count dropping) or until a timeout is hit.
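The post doesn't show the actual stop logic, but the two conditions it describes (participant count dropping, or a timeout) reduce to a check like this hypothetical one:

```python
def should_stop_recording(participant_count: int,
                          elapsed_seconds: float,
                          timeout_seconds: float = 2 * 60 * 60) -> bool:
    """Stop when everyone else has left, or after a hard timeout."""
    # The bot itself counts as one participant, so <= 1 means
    # the humans are gone.
    if participant_count <= 1:
        return True
    return elapsed_seconds >= timeout_seconds
```

The two-hour default timeout is an assumption for the example, not a number from the skill.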

Getting audio capture right on a headless server was one of the trickier parts. There's no physical sound card — we use a virtual audio sink that captures what the browser outputs. It took some PulseAudio plumbing to get clean, consistent audio without echo or feedback loops.
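The plumbing boils down to two commands: create a null sink, then record that sink's monitor source. A sketch assuming the standard `pactl` and `parecord` CLI tools; the sink name `meetbot` and output filename are made up for the example:

```python
def pulse_setup_commands(sink_name: str = "meetbot"):
    """Commands to create a virtual sink and record its monitor source."""
    create_sink = [
        "pactl", "load-module", "module-null-sink",
        f"sink_name={sink_name}",
    ]
    # Chromium's output gets routed to the null sink, and recording
    # taps the sink's monitor source. Nothing is ever played out a
    # real device, which is also what prevents echo and feedback.
    record = [
        "parecord", f"--device={sink_name}.monitor",
        "--file-format=wav", "meeting.wav",
    ]
    return create_sink, record
```

In practice these would be run via `subprocess`, with the browser's audio stream moved onto the sink before recording starts.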

3. Transcribe with Whisper

After the recording stops, the audio file goes through OpenAI's Whisper model running locally. No API calls, no data leaving the server. The transcription includes timestamps, which makes it easy to jump to specific parts of the conversation later.

We use the tiny.en model for speed — a one-hour meeting transcribes in a few minutes on our hardware. For meetings where accuracy matters more than speed, we can bump up to the base or small model.
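The transcription step, sketched with the `openai-whisper` Python package. The `format_segments` helper is our own addition for illustration; the post only says the transcript includes timestamps:

```python
def format_segments(segments):
    """Render Whisper segments as '[MM:SS] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)

def transcribe(path, model_name="tiny.en"):
    import whisper  # pip install openai-whisper
    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

Swapping `tiny.en` for `base.en` or `small.en` is the accuracy-for-speed knob mentioned above.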

4. Summarize

The raw transcript goes to the LLM for summarization. The agent extracts:

- A one-paragraph executive summary
- The decisions that were made
- Action items, each with an owner

The summary lands in my workspace as a markdown file, and the agent can also send it to a Discord channel or wherever I need it.
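The actual prompt lives in the skill, but a hypothetical version of its construction looks like this, with the output sections taken from what the summary contains:

```python
def build_summary_prompt(transcript: str) -> str:
    """Assemble a summarization prompt around the raw transcript."""
    return (
        "Summarize this meeting transcript. Produce:\n"
        "1. A one-paragraph executive summary\n"
        "2. A bullet list of decisions\n"
        "3. Action items, each with an owner\n\n"
        f"Transcript:\n{transcript}"
    )
```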

What I Actually Get

A typical output looks like this: a one-paragraph executive summary, a bullet list of decisions, a list of action items with owners, and the full transcript with timestamps for reference. Reading it takes two minutes. The meeting it came from was an hour.

The action items are the real value. Half the time, meetings generate tasks that never get written down and slowly evaporate. Having the agent extract them means nothing falls through the cracks.

The Gotchas

It's not perfect. Some honest limitations:

- The tiny.en model trades accuracy for speed, so jargon-heavy calls may need a larger model
- Whisper on its own doesn't label speakers, so the transcript reads as one unattributed stream
- End-of-meeting detection relies on the participant count, with the timeout as a blunt fallback

When to Use It (and When Not To)

It works great for status updates, informational calls, and meetings where I'd otherwise be a passive listener.

It doesn't replace meetings where decisions need my input, or calls where I'm an active participant rather than an audience.

Building It as a Skill

The whole thing is packaged as an OpenClaw skill — a SKILL.md file with instructions, plus supporting scripts for browser automation, audio capture, and transcription. When I say "join this meeting," the agent reads the skill and follows the steps.
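The post doesn't show the actual SKILL.md, but a plausible sketch of its shape, with hypothetical script names, looks something like:

```markdown
---
name: meeting-bot
description: Join a Google Meet or Zoom call, record it, and summarize it.
---

# Meeting Bot

1. Detect the platform from the meeting URL.
2. Join the call with the bot's dedicated Google account (headless Chromium).
3. Start recording via the virtual PulseAudio sink, writing a WAV file.
4. When the meeting ends, transcribe the WAV locally with Whisper (tiny.en).
5. Summarize the transcript and write the digest to the workspace.
```

The scripts referenced by those steps do the heavy lifting; the SKILL.md is just the readable contract the agent follows.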

This is the pattern I keep coming back to: complex automation, packaged as a readable set of instructions that the agent executes. No custom training, no special infrastructure beyond what's already on the server. Just well-written instructions and scripts that work together.

The meeting bot took less than a day to build end-to-end. Most of that time was fighting PulseAudio and Chromium's headless audio quirks. The actual skill definition and Whisper integration were straightforward.

The best meeting is the one you don't attend. The second best is the one your AI attended and gave you the two-minute version.