I have a complicated relationship with meetings. Some are genuinely useful — the ones where decisions get made and context gets shared. Most aren't. They're status updates that could've been a message, or calls where I'm only needed for five minutes out of forty-five.
So I built a meeting bot. My agent joins Google Meet and Zoom calls, records the audio, transcribes it with Whisper, and hands me a summary. I read a two-minute digest instead of sitting through an hour-long call.
How It Works
The bot is built as an OpenClaw skill — a set of instructions and scripts the agent follows when asked to join a meeting. Here's the pipeline:
1. Join the Call
The agent launches a headless browser (Chromium via Puppeteer), navigates to the meeting URL, and joins as a bot account. It has its own dedicated Google account — not mine — so it shows up with its own name. People in the meeting can see the bot joined, which matters for transparency. No secret recording.
Google Meet and Zoom have different join flows, different button layouts, different permission dialogs. The skill handles both, including dismissing popups, accepting permissions, and waiting for the meeting to actually start.
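One way to keep the two flows manageable is to dispatch on the meeting URL and keep each platform's quirks in one config table. A minimal sketch — the selectors and button labels here are illustrative, not the skill's actual values:

```python
from urllib.parse import urlparse

# Hypothetical per-platform join settings. The real button labels and
# popups drift whenever Google or Zoom update their UIs.
PLATFORMS = {
    "meet.google.com": {
        "name": "google_meet",
        "join_button_labels": ["Join now", "Ask to join"],
        "dismiss_popups": ["Got it", "Dismiss"],
    },
    "zoom.us": {
        "name": "zoom",
        "join_button_labels": ["Join", "Join Audio by Computer"],
        "dismiss_popups": ["Accept Cookies", "Got it"],
    },
}

def detect_platform(meeting_url: str) -> dict:
    """Map a meeting URL to its join-flow config, or raise if unsupported."""
    host = urlparse(meeting_url).hostname or ""
    for domain, config in PLATFORMS.items():
        if host == domain or host.endswith("." + domain):
            return config
    raise ValueError(f"Unsupported meeting platform: {host}")
```

The browser-automation layer (Puppeteer, in our case) then just iterates over `join_button_labels` until one of them is clickable, which keeps UI churn confined to the table.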
2. Record Audio
Once in the call, the bot captures system audio using PulseAudio. It records to a WAV file — no compression, no quality loss. The recording runs until the meeting ends (detected when the participant count drops) or until a timeout is hit.
Getting audio capture right on a headless server was one of the trickier parts. There's no physical sound card — we use a virtual audio sink that captures what the browser outputs. It took some PulseAudio plumbing to get clean, consistent audio without echo or feedback loops.
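The plumbing boils down to two commands: load a PulseAudio null sink, then record from that sink's monitor source. A sketch that builds those commands — the sink name is arbitrary, and the exact module options here are illustrative rather than copied from the skill:

```python
def virtual_sink_command(sink_name: str = "meetbot") -> list[str]:
    """pactl command that creates a null sink; the browser is pointed at it,
    and its .monitor source exposes whatever the browser plays."""
    return ["pactl", "load-module", "module-null-sink",
            f"sink_name={sink_name}"]

def record_command(sink_name: str = "meetbot",
                   outfile: str = "meeting.wav") -> list[str]:
    """parecord reads the sink's monitor source and writes uncompressed WAV."""
    return ["parecord", f"--device={sink_name}.monitor",
            "--file-format=wav", outfile]
```

Because the sink is virtual, nothing is ever played through a speaker and re-captured by a mic, which is what eliminates the echo and feedback problems.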
3. Transcribe with Whisper
After the recording stops, the audio file goes through OpenAI's Whisper model running locally. No API calls, no data leaving the server. The transcription includes timestamps, which makes it easy to jump to specific parts of the conversation later.
We use the tiny.en model for speed — a one-hour meeting transcribes in a few minutes on our hardware. For meetings where accuracy matters more than speed, we can bump up to the base or small model.
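Whisper's output is a list of segments with start/end times, which is what makes the timestamped transcript cheap to produce. A sketch of the formatting step, with the actual openai-whisper calls shown in comments:

```python
def format_transcript(segments) -> str:
    """Render Whisper segments ({'start', 'end', 'text'} dicts)
    as timestamped lines like '[01:05] Next point'."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)

# With the openai-whisper package, the transcription itself looks roughly like:
#   import whisper
#   model = whisper.load_model("tiny.en")   # or "base"/"small" for accuracy
#   result = model.transcribe("meeting.wav")
#   print(format_transcript(result["segments"]))
```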
4. Summarize
The raw transcript goes to the LLM for summarization. The agent extracts:
- Key decisions — what was agreed on
- Action items — who's doing what, by when
- Discussion highlights — the important points, minus the filler
- Questions raised — anything left unresolved
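The extraction step is ultimately just a well-shaped prompt around the transcript. A minimal sketch — the wording and the character budget are illustrative, not the skill's actual prompt:

```python
SUMMARY_PROMPT = """\
You are summarizing a meeting transcript. Produce:

1. Key decisions - what was agreed on
2. Action items - who is doing what, and by when
3. Discussion highlights - the important points, minus the filler
4. Questions raised - anything left unresolved

Transcript:
{transcript}
"""

def build_summary_prompt(transcript: str, max_chars: int = 100_000) -> str:
    """Clip overly long transcripts to stay within the model's context,
    then fill in the template."""
    if len(transcript) > max_chars:
        transcript = transcript[:max_chars] + "\n[transcript truncated]"
    return SUMMARY_PROMPT.format(transcript=transcript)
```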
The summary lands in my workspace as a markdown file, and the agent can also send it to a Discord channel or wherever I need it.
What I Actually Get
A typical output looks like this: a one-paragraph executive summary, a bullet list of decisions, a list of action items with owners, and the full transcript with timestamps for reference. Reading it takes two minutes. The meeting it came from was an hour.
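Assembling that digest is plain string work. A sketch of the renderer, assuming the LLM step has already produced a structured summary (the field names here are hypothetical):

```python
def render_summary_md(title: str, summary: dict) -> str:
    """Assemble the digest: executive summary, decisions,
    action items with owners, then the timestamped transcript."""
    lines = [f"# {title}", "", summary["executive_summary"], "", "## Decisions"]
    lines += [f"- {d}" for d in summary["decisions"]]
    lines += ["", "## Action items"]
    for item in summary["action_items"]:
        lines.append(f"- [ ] {item['owner']}: {item['task']}")
    lines += ["", "## Transcript", summary["transcript"]]
    return "\n".join(lines)
```

Writing action items as markdown checkboxes means the digest doubles as a to-do list once it lands in the workspace.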
The action items are the real value. Half the time, meetings generate tasks that never get written down and slowly evaporate. Having the agent extract them means nothing falls through the cracks.
The Gotchas
It's not perfect. Some honest limitations:
- Audio quality matters. If people have bad mics or there's cross-talk, Whisper struggles. The transcription degrades noticeably with low-quality audio.
- Speaker identification is limited. Whisper transcribes words but doesn't always know who said them. For small meetings with distinct voices it's manageable — for large calls it gets murky. This is where our VoiceID work could eventually plug in.
- Join flow breaks. Google and Zoom change their UIs regularly. A button that was labeled "Join now" last week might be "Ask to join" this week. The skill needs periodic maintenance to handle UI changes.
- It's not invisible. The bot shows up as a participant. Some people are uncomfortable being recorded by a bot, even if the meeting was already being recorded. Social norms matter.
When to Use It (and When Not To)
It works great for:
- Status meetings where I need the outcome but not the discussion
- Large calls where I'm a passive listener
- Recording my own brainstorming sessions for later reference
- Meetings in different time zones that I can't attend live
It doesn't replace:
- Meetings where I need to actively participate and make decisions
- Sensitive conversations where recording isn't appropriate
- First meetings with new people — showing up as a bot is a bad first impression
Building It as a Skill
The whole thing is packaged as an OpenClaw skill — a SKILL.md file with instructions, plus supporting scripts for browser automation, audio capture, and transcription. When I say "join this meeting," the agent reads the skill and follows the steps.
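To make that concrete, here is roughly the shape such a skill takes. This is a sketch, not the actual file — the frontmatter fields, script names, and step wording are illustrative:

```markdown
---
name: meeting-bot
description: Join a Google Meet or Zoom call, record it, and deliver a summary.
---

# Meeting Bot

1. Run scripts/join.js with the meeting URL (headless Chromium via Puppeteer).
2. Start scripts/record.sh, which captures the virtual sink to a WAV file.
3. When the call ends, run scripts/transcribe.py (local Whisper, tiny.en).
4. Summarize the transcript: decisions, action items, highlights, open questions.
5. Write the digest to the workspace and post it where requested.
```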
This is the pattern I keep coming back to: complex automation, packaged as a readable set of instructions that the agent executes. No custom training, no special infrastructure beyond what's already on the server. Just well-written instructions and scripts that work together.
The meeting bot took less than a day to build end-to-end. Most of that time was fighting PulseAudio and Chromium's headless audio quirks. The actual skill definition and Whisper integration were straightforward.
The best meeting is the one you don't attend. The second best is the one your AI attended and gave you the two-minute version.