I wanted to build my own tools — not just use other people's software, but create things from scratch with a real understanding of the underlying theory. The problem: jumping into a new domain cold is brutal. You don't know what you don't know. You need to read papers, understand the state of the art, and figure out which approaches actually work before you write a single line of code.

So we built an academic research skill for our OpenClaw agent. Give it a topic, and it searches for papers, reads abstracts, identifies key findings, and compiles a literature review. It became the first step in every new project: before building anything, research what exists, understand the tradeoffs, then design something informed.

What's a Skill?

In OpenClaw, a skill is a markdown file (SKILL.md) that teaches the agent how to do something specific. It includes instructions, tool usage patterns, and scripts the agent can execute. When the agent encounters a task that matches a skill's description, it reads the skill file and follows the instructions.

Think of it as a runbook that the AI actually follows. No fine-tuning, no special training — just well-written instructions and supporting scripts.
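As a sketch, a skill file might look something like this. The frontmatter fields, section layout, and script names here are my illustration of the idea, not OpenClaw's canonical format:

```markdown
---
name: academic-research
description: Search academic databases and compile a literature review for a topic.
---

# Academic Research

When asked for a literature review on a topic:

1. Reformulate the topic into keyword variants and Boolean search queries.
2. Run the search scripts against Semantic Scholar and arXiv; collect
   titles, abstracts, citations, dates, and authors.
3. Filter and rank by relevance, recency, and citation impact.
4. Synthesize the top papers into a structured review. Cite only papers
   returned by the search scripts — never invent references.
```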

The Research Pipeline

The academic research skill works in stages:

1. Query Formulation

The agent takes a natural language topic ("How do transformer attention mechanisms handle long sequences?") and reformulates it into effective search queries for academic databases. This includes generating keyword variations and Boolean search strings.
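The expansion step can be sketched in a few lines. This is a hypothetical helper, not the skill's actual script: each core term becomes an OR-group of its synonyms, and the groups are AND-ed together in combinations:

```python
from itertools import combinations

def boolean_queries(topic_terms, synonyms, max_terms=3):
    """Expand core topic terms into Boolean search strings.

    topic_terms: key phrases extracted from the user's question.
    synonyms: dict mapping a term to alternative phrasings (OR-ed together).
    """
    # Each term becomes an OR-group of quoted phrase variants.
    groups = []
    for term in topic_terms:
        variants = [term] + synonyms.get(term, [])
        groups.append("(" + " OR ".join(f'"{v}"' for v in variants) + ")")
    # AND together every combination of 2..max_terms groups.
    queries = []
    for n in range(2, min(max_terms, len(groups)) + 1):
        for combo in combinations(groups, n):
            queries.append(" AND ".join(combo))
    return queries

queries = boolean_queries(
    ["transformer", "attention", "long sequences"],
    {"long sequences": ["long context", "sequence length"]},
)
```

Running the agent's reformulation through something like this yields queries such as `("transformer") AND ("attention")` up through the full three-group conjunction with every synonym included.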

2. Paper Discovery

Using APIs from sources like Semantic Scholar, arXiv, and Google Scholar, the agent searches for relevant papers. It collects titles, abstracts, citation counts, publication dates, and author information. The skill includes scripts that handle API pagination and rate limiting.
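The pagination and rate-limiting logic is generic enough to sketch without any particular API. Here the fetch function is injected, and a stub stands in for a real call (e.g. Semantic Scholar's paper-search endpoint takes `offset`/`limit` parameters); the skill's actual scripts may differ:

```python
import time

def paged_results(fetch_page, page_size=100, max_results=500, delay=1.0):
    """Collect results from a paginated search API, one page at a time.

    fetch_page(offset, limit) must return a list of records; a short or
    empty page signals the end. `delay` spaces out requests so we stay
    under the API's rate limit.
    """
    results = []
    offset = 0
    while offset < max_results:
        page = fetch_page(offset, page_size)
        results.extend(page)
        if len(page) < page_size:
            break  # last page reached
        offset += page_size
        time.sleep(delay)  # simple rate limiting between requests
    return results

# Stubbed fetch function standing in for a real API call.
def fake_fetch(offset, limit):
    total = 230  # pretend the API has 230 matching papers
    return [{"paperId": i} for i in range(offset, min(offset + limit, total))]

papers = paged_results(fake_fetch, page_size=100, delay=0.0)
```

With 230 simulated matches and a page size of 100, the helper makes three requests and stops on the short final page.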

3. Relevance Filtering

Not every result is useful. The agent reads abstracts and filters papers by relevance to the original query, recency, and citation impact. It builds a ranked shortlist of the most pertinent papers.
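A ranking like that can be approximated with a simple weighted score. The weights and the keyword-overlap relevance measure below are illustrative, not the skill's actual values (in practice the agent judges relevance by reading the abstract):

```python
import math
from datetime import date

def score_paper(paper, query_terms, today=date(2025, 1, 1)):
    """Rank a paper by keyword overlap, recency, and citation impact."""
    text = (paper["title"] + " " + paper["abstract"]).lower()
    relevance = sum(t.lower() in text for t in query_terms) / len(query_terms)
    years_old = max(today.year - paper["year"], 0)
    recency = 1.0 / (1.0 + years_old)        # decays with age
    impact = math.log1p(paper["citations"])  # diminishing returns on citations
    return 2.0 * relevance + 1.0 * recency + 0.5 * impact

papers = [
    {"title": "Efficient attention for long sequences", "abstract": "...",
     "year": 2023, "citations": 120},
    {"title": "A survey of RNNs", "abstract": "...",
     "year": 2015, "citations": 900},
]
ranked = sorted(papers,
                key=lambda p: score_paper(p, ["attention", "long sequences"]),
                reverse=True)
```

The log on citations keeps one heavily cited but off-topic survey from drowning out a recent, directly relevant paper.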

4. Synthesis

For the top papers, the agent extracts key findings, methodologies, and conclusions. It then synthesizes these into a coherent literature review — identifying common themes, contradictions, and gaps in the research.

5. Output

The final output is a structured markdown document: an executive summary, key findings organized by theme, a methodology comparison table, identified research gaps, and a full bibliography with links to the original papers.
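Assembling those sections is plain templating. A minimal sketch (section layout mirrors the output described above; the helper and its inputs are illustrative, and the methodology table is omitted for brevity):

```python
def render_review(topic, summary, themes, gaps, bibliography):
    """Assemble review sections into one markdown document."""
    lines = [f"# Literature Review: {topic}", "",
             "## Executive Summary", summary, "",
             "## Key Findings"]
    for theme, findings in themes.items():
        lines.append(f"### {theme}")
        lines += [f"- {f}" for f in findings]
    lines += ["", "## Research Gaps"]
    lines += [f"- {g}" for g in gaps]
    lines += ["", "## Bibliography"]
    lines += [f"- [{p['title']}]({p['url']})" for p in bibliography]
    return "\n".join(lines)

doc = render_review(
    "Long-sequence attention",
    "Sparse and linear attention dominate recent work.",
    {"Sparse attention": ["Local windows reduce cost substantially."]},
    ["Few head-to-head benchmarks on very long inputs."],
    [{"title": "Longformer", "url": "https://arxiv.org/abs/2004.05150"}],
)
```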

What Made It Hard

The technical implementation was straightforward. The hard part was getting the agent to think like a researcher. Early versions would just list papers and summarize them individually — basically a fancy search engine. The breakthrough came from explicit instructions about synthesis: "Don't just summarize each paper. Identify how they relate to each other. Where do they agree? Where do they disagree? What questions remain unanswered?"

The other challenge was quality control. Academic research requires accuracy — you can't hallucinate citations. The skill explicitly instructs the agent to only reference papers it actually found through the search APIs, with real DOIs and links. If it can't find enough papers, it says so rather than making them up.
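That guardrail can also be enforced mechanically. A sketch of the idea (illustrative, not the skill's exact code): every DOI the agent wants to cite must appear in the set of papers the search APIs actually returned:

```python
def verify_citations(cited_dois, retrieved_papers):
    """Split citations into (verified, unverified) against search results.

    Unverified DOIs get dropped or flagged rather than emitted, so the
    final bibliography can only contain papers the agent actually found.
    """
    found = {p["doi"] for p in retrieved_papers if p.get("doi")}
    verified = [d for d in cited_dois if d in found]
    unverified = [d for d in cited_dois if d not in found]
    return verified, unverified

retrieved = [{"doi": "10.48550/arXiv.1706.03762"}]
ok, bad = verify_citations(
    ["10.48550/arXiv.1706.03762", "10.1000/fake.doi"], retrieved)
```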

Real-World Usage: Building Our Own Tools

The whole point was to stop relying on off-the-shelf solutions and start building custom tools grounded in real research. The research skill became the first step in every build — understand the theory, then write the code.

VoiceID: Custom Speech-to-Text with Speaker Recognition

We wanted a speech-to-text system that could identify who was speaking — not just transcribe words, but lock transcription to a specific voice. Off-the-shelf STT doesn't do this. So the agent researched voice activity detection (VAD), speaker embedding models, and enrollment pipelines.

The research pointed us to Silero VAD for detecting speech segments, 3DSpeaker for generating voice embeddings, and Zipformer (via sherpa-onnx) for the actual transcription. We built VoiceID — a 131MB Android app that enrolls a speaker's voice, then only transcribes audio that matches their voiceprint. Ten iterations in a single day, from v1 to v10, each informed by what the research told us about embedding space alignment and audio pipeline consistency.

The critical breakthrough came from understanding a paper on speaker embeddings: enrollment audio and live audio must go through the exact same processing pipeline, or the embeddings drift apart. That insight — which came directly from the research phase — saved us hours of debugging.
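The fix amounts to routing both enrollment and live audio through one shared front-end before embedding, then gating transcription on similarity to the enrolled voiceprint. A toy sketch of that shape — the peak normalization, stand-in embedding, and threshold are all illustrative (the real app used 3DSpeaker embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def voiceprint(samples, embed):
    """One shared front-end for BOTH enrollment and live audio.

    If enrollment and live paths diverge, the embeddings drift apart.
    Peak normalization stands in for the real resample/denoise steps.
    """
    peak = max(abs(s) for s in samples) or 1.0
    return embed([s / peak for s in samples])

def is_enrolled_speaker(live_audio, enrolled_embedding, embed, threshold=0.7):
    """Transcribe only if the live segment matches the enrolled voiceprint."""
    return cosine(voiceprint(live_audio, embed), enrolled_embedding) >= threshold

# Toy stand-in embedding, just to make the sketch runnable.
def toy_embed(samples):
    half = len(samples) // 2
    return [sum(samples[:half]), sum(samples[half:])]

enrolled = voiceprint([0.2, 0.4, 0.4, 0.2], toy_embed)
same = is_enrolled_speaker([0.1, 0.2, 0.2, 0.1], enrolled, toy_embed)
```

Because both clips pass through the same normalization, the quieter live clip still lands on the enrolled voiceprint; feed raw samples into one path and normalized samples into the other and the match breaks, which is exactly the drift the paper warned about.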

Voice Cloning: Custom Text-to-Speech

On the flip side, we also built a custom TTS pipeline with voice cloning. The agent researched neural voice synthesis, few-shot voice cloning architectures, and the tradeoffs between quality and inference speed. The goal: give the agent its own voice — not a generic robotic one, but something with character.

The research helped us understand which models could clone a voice from just a few seconds of reference audio, how to handle prosody and emotion, and what the latency constraints looked like for real-time conversation. Without that research foundation, we would've been blindly testing models and hoping for the best.

The Pattern

Every tool we built followed the same loop: research → understand → build → iterate. The academic research skill made that first step fast and thorough. Each run produces a 2,000–4,000-word review with 10–20 cited papers — enough to make informed architectural decisions before writing a single line of code.

The research outputs also feed into the agent's self-learning system. Key findings get distilled into the knowledge base, so each tool we build makes the next one better-informed.

The Skill Pattern

What I love about this project is that it demonstrates the power of the skill pattern. We didn't retrain a model or build a complex RAG pipeline. We wrote a markdown file with clear instructions and a few Python scripts for API access. The LLM's existing reasoning ability handles the rest.

If you can write a good runbook for a human research assistant, you can write a skill for an AI agent. The format is the same — the reader is just different.