Ops Command Center v3.2.1
AIA-VT-2026 Ready
Created Jan 21, 2026

Voice-to-Text Pipeline: Spokenly + NVIDIA Parakeet

Build a high-performance dictation system with Spokenly and NVIDIA Parakeet. Speak naturally, get clean text—no editing required.

Tools
General
Joshua Schultz
-
Universal
Tags:
#voice-to-text #dictation #productivity #nvidia #spokenly #ai prompts #workflow
Article Content

I type at 80 WPM. I speak at 150. That’s nearly 2x throughput sitting on the table, unused, because raw transcription is rough—filler words, false starts, run-on sentences. You have to clean it up, which kills the speed advantage.

This pipeline solves that. Speak naturally. Get clean, professional text. One pass. No editing. Paste and send.

The Stack

ComponentRoleWhy This Choice
SpokenlyInput interface + prompt executionCross-platform (Mac, Windows, iOS, Android), works in any app, hotkey activation, runs cleanup prompt natively
NVIDIA ParakeetSpeech-to-text engineFast, accurate, runs locally, handles technical vocabulary, no cloud latency
Custom PromptText cleanup rulesBuilt into Spokenly’s pipeline—transcription and cleanup happen in one step

How It Works

You speak → Parakeet transcribes → Spokenly applies cleanup prompt → Clean text output

All in one pass. No separate tool, no copy-paste to an AI, no extra step. Spokenly handles the prompt execution internally using Parakeet.

  1. Activate Spokenly (hotkey or tap)
  2. Dictate naturally—filler words, restarts, thinking out loud—doesn’t matter
  3. Clean text appears—ready to use wherever your cursor is

Why This Combination

Spokenly gives you universal access plus native prompt support. Not app-specific—works in your IDE, browser, email client, Slack, anywhere you can type. The prompt runs inside Spokenly, so cleanup isn’t a separate workflow.

NVIDIA Parakeet runs locally with high accuracy. No cloud dependency, minimal latency, handles technical terms (code, acronyms, product names) better than most consumer transcription.

The cleanup prompt solves the last-mile problem. Raw transcription is usable but rough. Most cleanup approaches either do too little (leaving filler words) or too much (rewriting your voice into generic corporate speak). This prompt is calibrated to clean without sanitizing—and it runs automatically on every transcription.

The Cleanup Prompt

This goes in Spokenly’s prompt configuration. Two parts: system prompt and user prompt.

System Prompt

You are a transcription editor specializing in cleaning dictated text from busy executives. Your job is to transform raw speech-to-text into polished, readable content while preserving the speaker's authentic voice and intent.

Core principles:
1. PRESERVE VOICE - The speaker's word choices, tone, and style are intentional. Don't sanitize personality or rewrite their language.
2. MINIMAL INTERVENTION - Edit only what's necessary. If it reads fine, leave it alone.
3. MAINTAIN MEANING - Never alter, omit, or soften the substance of what was said.

What to remove:
- Filler words: um, uh, like, you know, basically, actually, kind of, sort of, I mean, right, so yeah
- False starts and self-corrections: "We need to—well actually we should—we need to focus on..."
- Duplicate phrases from thinking out loud: "The key thing is, the key thing is really..."
- Verbal tics and hedging that add no meaning

What to fix:
- Grammar errors introduced by speech-to-text
- Missing or incorrect punctuation
- Run-on sentences (break at natural pause points)
- Obvious misrecognitions from speech-to-text

What to keep:
- Original vocabulary and phrasing
- Technical terms and proper nouns exactly as spoken
- Intentional repetition used for emphasis
- The speaker's natural sentence structure and rhythm
- Strong language if used

Output formatting:
- Clean paragraphs with logical breaks by topic
- Use line breaks between distinct subjects or thoughts
- No headers, bullets, or formatting unless the speaker clearly indicated a list
- Present as continuous prose unless structure is explicitly requested

User Prompt

Clean this dictated text:

{{transcription}}

The {{transcription}} variable gets replaced with Parakeet’s raw output automatically.

Use Cases

Coding Workflows

  • Dictate code comments and docstrings
  • Speak through architecture decisions for documentation
  • Dictate commit messages and PR descriptions
  • Faster rubber duck debugging—explain the problem out loud, paste the explanation into Claude

AI Interaction

  • Speak complex prompts naturally instead of typing
  • Faster iteration on prompt engineering
  • Works especially well for long-context instructions
  • Chain with Claude: dictate → cleanup → paste into Claude → respond

Communication

  • Emails drafted at speaking speed
  • Slack messages without the typing tax
  • Client communications that sound like you, not a template
  • Meeting follow-ups captured while walking to the next meeting

Documentation

  • SOPs dictated while doing the work
  • Strategy docs captured in one pass
  • Meeting notes spoken immediately after
  • Code reviews—speak observations while reading, clean text appears

Performance Characteristics

Speed: Speaking is 3-4x faster than typing for most people. Even with cleanup latency (~1-2 seconds), net throughput is significantly higher for anything longer than a sentence.

Zero friction: No separate cleanup step. Speak and it’s done. This matters more than raw speed—switching context to clean up text kills flow.

Local processing: Parakeet runs on-device. No cloud dependency, no latency spikes, works offline, no privacy concerns about what you’re dictating.

Consistency: Same voice, same style, regardless of device or context. The prompt ensures output quality doesn’t depend on how carefully you speak.

Setup

  1. Install Spokenly on all devices
  2. Configure NVIDIA Parakeet as the transcription engine in Spokenly
  3. Add the system prompt and user prompt to Spokenly’s prompt configuration
  4. Assign activation trigger (see below)

Done. Total setup time: 10 minutes.

Activation Methods

The activation trigger matters more than you’d think. Keyboard shortcuts work, but there are faster options.

Keyboard: Right Command Key

Bind Spokenly to right Command (⌘). You probably never use it. Press to start, press again to stop. Hands stay on keyboard.

Pros: No extra hardware. Works immediately. Cons: Requires hand movement from typing position.

Foot Pedal + Always-On Mic

This is the fastest method. USB foot pedal (like the Olympus RS-28H or cheap Amazon options) triggers recording. Press with foot, speak, release.

Your hands never leave the keyboard. Combined with an always-on desk mic (Blue Yeti, Shure MV7, or even AirPods), the entire dictation happens without any hand movement.

Setup:

  1. Map foot pedal to trigger Spokenly’s hotkey (most pedals let you assign any keystroke)
  2. Position mic within 2 feet of your face
  3. Configure Spokenly for hold-to-record mode

Pros: Zero hand movement. Fastest possible activation. Natural start/stop rhythm. Cons: Extra hardware (~$30-80 for pedal). Desk setup only.

Logitech MX Mouse Custom Button

If you use a Logitech MX Master, assign one of the thumb buttons to Spokenly’s hotkey via Logi Options+.

Middle thumb button works well—easy to reach, distinct tactile feel, unlikely to hit accidentally.

Pros: Mouse is already in your hand. No extra hardware if you already use MX. Cons: Hand leaves keyboard (but not desk).

  1. Start: Keyboard shortcut (right ⌘) to validate the workflow
  2. Upgrade: MX mouse button if you’re already in that ecosystem
  3. Peak: Foot pedal + desk mic for maximum throughput

The foot pedal setup sounds ridiculous until you try it. Speaking without any hand involvement changes the cognitive load. You think, you speak, text appears. The physical act of typing is completely removed from the loop.

Calibration Tips

Speak at normal pace. Parakeet’s accuracy improves with natural speech patterns. Over-enunciating or speaking slowly makes it worse.

Don’t self-edit while speaking. The prompt handles filler words and restarts. Say “um” and “like” all you want—they get stripped.

Speak technical terms confidently. Parakeet handles technical vocabulary well, but mumbling proper nouns causes misrecognition.

Use the output as-is. If you’re editing the output, something’s wrong with the prompt. Tune the system prompt rather than fixing individual outputs.

Limitations

Noisy environments: Parakeet degrades in high-noise situations. Works fine with office background noise, struggles at coffee shops with music.

Multi-speaker: This is dictation, not transcription. It assumes one speaker. Group meeting transcription requires different tooling.

Specialized vocabulary: First few uses with domain-specific terms may need correction. Parakeet learns, but initial setup benefits from speaking technical terms clearly.

The Real Value

This isn’t about saving time on individual messages. It’s about removing the friction between thought and text.

Ideas that would never get written down—too much typing—now get captured. The threshold for “worth documenting” drops dramatically.

  • Quick architectural decision? Dictate it into a file, commit.
  • Insight from a client call? Speak it into Slack before you forget.
  • Complex prompt for Claude? Speak the whole thing, paste it in.

The best dictation system is one you actually use. The cleanup step was the blocker. Remove it, and speaking becomes the default input mode.


Speed: 150 WPM. Friction: Zero. Setup: 10 minutes. Output: Clean, professional text that sounds like you.

Back to AI Articles
Submit Work Order