Voice-to-Text Pipeline: Spokenly + NVIDIA Parakeet | Joshua Schultz Ops Command Center

I type at 80 WPM. I speak at 150. That’s nearly 2x throughput sitting on the table, unused, because raw transcription is rough—filler words, false starts, run-on sentences. You have to clean it up, which kills the speed advantage.

This pipeline solves that. Speak naturally. Get clean, professional text. One pass. No editing. Paste and send.

The Stack

Component	Role	Why This Choice
Spokenly	Input interface + prompt execution	Cross-platform (Mac, Windows, iOS, Android), works in any app, hotkey activation, runs cleanup prompt natively
NVIDIA Parakeet	Speech-to-text engine	Fast, accurate, runs locally, handles technical vocabulary, no cloud latency
Custom Prompt	Text cleanup rules	Built into Spokenly’s pipeline—transcription and cleanup happen in one step

How It Works

You speak → Parakeet transcribes → Spokenly applies cleanup prompt → Clean text output

All in one pass. No separate tool, no copy-paste to an AI, no extra step. Spokenly runs the cleanup prompt internally on Parakeet’s raw output.

Activate Spokenly (hotkey or tap)
Dictate naturally—filler words, restarts, thinking out loud—doesn’t matter
Clean text appears—ready to use wherever your cursor is
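The flow above is two stages chained together. Here is a minimal sketch of that shape, not Spokenly’s actual code: `dictate`, `fake_transcribe`, and `fake_cleanup` are hypothetical names, and the two stand-in functions only mimic what Parakeet and the cleanup prompt would do so the example runs without any model.

```python
from typing import Callable


def dictate(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    cleanup: Callable[[str], str],
) -> str:
    """Run one dictation pass: audio -> raw transcript -> cleaned text."""
    raw = transcribe(audio)   # the stage Parakeet handles
    return cleanup(raw)       # the stage the cleanup prompt handles


# Stand-ins (assumptions for illustration only, no real speech or language model).
def fake_transcribe(audio: bytes) -> str:
    return "um so we need to uh ship on friday"


def fake_cleanup(raw: str) -> str:
    fillers = {"um", "uh", "so"}
    words = [w for w in raw.split() if w not in fillers]
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


print(dictate(b"", fake_transcribe, fake_cleanup))
# We need to ship on friday.
```

The point of the shape: the caller only ever sees the final string, which is why there is no separate cleanup step to think about.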

Why This Combination

Spokenly gives you universal access plus native prompt support. Not app-specific—works in your IDE, browser, email client, Slack, anywhere you can type. The prompt runs inside Spokenly, so cleanup isn’t a separate workflow.

NVIDIA Parakeet runs locally with high accuracy. No cloud dependency, minimal latency, handles technical terms (code, acronyms, product names) better than most consumer transcription.

The cleanup prompt solves the last-mile problem. Raw transcription is usable but rough. Most cleanup approaches either do too little (leaving filler words) or too much (rewriting your voice into generic corporate speak). This prompt is calibrated to clean without sanitizing—and it runs automatically on every transcription.

The Cleanup Prompt

This goes in Spokenly’s prompt configuration. Two parts: system prompt and user prompt.

System Prompt

You are a transcription editor specializing in cleaning dictated text from busy executives. Your job is to transform raw speech-to-text into polished, readable content while preserving the speaker's authentic voice and intent.

Core principles:
1. PRESERVE VOICE - The speaker's word choices, tone, and style are intentional. Don't sanitize personality or rewrite their language.
2. MINIMAL INTERVENTION - Edit only what's necessary. If it reads fine, leave it alone.
3. MAINTAIN MEANING - Never alter, omit, or soften the substance of what was said.

What to remove:
- Filler words: um, uh, like, you know, basically, actually, kind of, sort of, I mean, right, so yeah
- False starts and self-corrections: "We need to—well actually we should—we need to focus on..."
- Duplicate phrases from thinking out loud: "The key thing is, the key thing is really..."
- Verbal tics and hedging that add no meaning

What to fix:
- Grammar errors introduced by speech-to-text
- Missing or incorrect punctuation
- Run-on sentences (break at natural pause points)
- Obvious misrecognitions from speech-to-text

What to keep:
- Original vocabulary and phrasing
- Technical terms and proper nouns exactly as spoken
- Intentional repetition used for emphasis
- The speaker's natural sentence structure and rhythm
- Strong language if used

Output formatting:
- Clean paragraphs with logical breaks by topic
- Use line breaks between distinct subjects or thoughts
- No headers, bullets, or formatting unless the speaker clearly indicated a list
- Present as continuous prose unless structure is explicitly requested

User Prompt

Clean this dictated text:

{{transcription}}

The {{transcription}} variable gets replaced with Parakeet’s raw output automatically.
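To make the substitution concrete, here is a small sketch of what that replacement amounts to. It is an illustration under my own assumptions, not Spokenly’s internals: `fill_user_prompt` is a hypothetical helper, and the raw transcript is an invented example.

```python
USER_PROMPT = "Clean this dictated text:\n\n{{transcription}}"


def fill_user_prompt(raw_transcript: str, template: str = USER_PROMPT) -> str:
    """Swap the {{transcription}} placeholder for the raw speech-to-text output."""
    return template.replace("{{transcription}}", raw_transcript.strip())


# Invented raw transcript with fillers and a duplicated phrase.
raw = "so um the key thing is, the key thing is really we need to, uh, ship the API on Friday"
print(fill_user_prompt(raw))
```

Given the system prompt rules above, a transcript like this should come back as something close to "The key thing is really we need to ship the API on Friday." The exact wording depends on the model running the prompt.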

Use Cases

Coding Workflows

Dictate code comments and docstrings
Speak through architecture decisions for documentation
Dictate commit messages and PR descriptions
Faster rubber duck debugging—explain the problem out loud, paste the explanation into Claude

AI Interaction

Speak complex prompts naturally instead of typing
Faster iteration on prompt engineering
Works especially well for long-context instructions
Chain with Claude: dictate → cleanup → paste into Claude → respond

Communication

Emails drafted at speaking speed
Slack messages without the typing tax
Client communications that sound like you, not a template
Meeting follow-ups captured while walking to the next meeting

Documentation

SOPs dictated while doing the work
Strategy docs captured in one pass
Meeting notes spoken immediately after
Code reviews—speak observations while reading, clean text appears

Performance Characteristics

Speed: Speaking at around 150 WPM is roughly 2x faster than fast typing (80 WPM) and 3-4x faster than average typing (about 40 WPM). Even with cleanup latency (~1-2 seconds), net throughput is significantly higher for anything longer than a sentence.
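To put numbers on that, here is a rough back-of-the-envelope sketch. It assumes my 80 WPM typing and 150 WPM speaking from the top of this post, plus the worst-case 2 seconds of cleanup latency. The function names are mine, for illustration only.

```python
def seconds_to_type(words: int, typing_wpm: float = 80) -> float:
    """Time to type a message of the given length."""
    return words / typing_wpm * 60


def seconds_to_dictate(
    words: int, speaking_wpm: float = 150, cleanup_s: float = 2.0
) -> float:
    """Time to speak the message plus a fixed cleanup delay."""
    return words / speaking_wpm * 60 + cleanup_s


def break_even_words(
    typing_wpm: float = 80, speaking_wpm: float = 150, cleanup_s: float = 2.0
) -> float:
    """Message length at which typing and dictating take the same time.

    Solves words * 60 / typing_wpm == words * 60 / speaking_wpm + cleanup_s.
    """
    return cleanup_s / (60 / typing_wpm - 60 / speaking_wpm)


print(round(seconds_to_type(100), 1))     # 75.0 seconds
print(round(seconds_to_dictate(100), 1))  # 42.0 seconds
print(round(break_even_words(), 1))       # 5.7 words
```

With those assumptions the break-even point is about six words. Anything longer than a short sentence is faster to dictate, and a 100-word message takes 42 seconds instead of 75.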

Zero friction: No separate cleanup step. Speak and it’s done. This matters more than raw speed—switching context to clean up text kills flow.

Local processing: Parakeet runs on-device, so transcription has no cloud dependency and no latency spikes, works offline, and keeps your audio on your machine.

Consistency: Same voice, same style, regardless of device or context. The prompt ensures output quality doesn’t depend on how carefully you speak.

Setup

Install Spokenly on all devices
Configure NVIDIA Parakeet as the transcription engine in Spokenly
Add the system prompt and user prompt to Spokenly’s prompt configuration
Assign activation trigger (see below)

Done. Total setup time: 10 minutes.

Activation Methods

The activation trigger matters more than you’d think. Keyboard shortcuts work, but there are faster options.

Keyboard: Right Command Key

Bind Spokenly to right Command (⌘). You probably never use it. Press to start, press again to stop. Hands stay on keyboard.

Pros: No extra hardware. Works immediately.
Cons: Requires hand movement from typing position.

Foot Pedal + Always-On Mic

This is the fastest method. USB foot pedal (like the Olympus RS-28H or cheap Amazon options) triggers recording. Press with foot, speak, release.

Your hands never leave the keyboard. Combined with an always-on mic (a desk mic like the Blue Yeti or Shure MV7, or even AirPods), the entire dictation happens without any hand movement.

Setup:

Map foot pedal to trigger Spokenly’s hotkey (most pedals let you assign any keystroke)
Position mic within 2 feet of your face
Configure Spokenly for hold-to-record mode
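The difference between hold-to-record (the pedal) and press-to-start, press-again-to-stop (the right ⌘ binding) is easiest to see as a tiny state machine. This is only an illustrative sketch; `Mode` and `Trigger` are hypothetical names, not anything from Spokenly.

```python
from enum import Enum


class Mode(Enum):
    TOGGLE = "toggle"  # press to start, press again to stop
    HOLD = "hold"      # record only while the key or pedal is held down


class Trigger:
    """Tracks whether recording is active for a given activation mode."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False

    def key_down(self) -> None:
        if self.mode is Mode.HOLD:
            self.recording = True                 # pressing always means "record"
        else:
            self.recording = not self.recording   # each press flips the state

    def key_up(self) -> None:
        if self.mode is Mode.HOLD:
            self.recording = False                # releasing always means "stop"
        # In TOGGLE mode, releasing the key changes nothing.


pedal = Trigger(Mode.HOLD)
pedal.key_down()
print(pedal.recording)  # True while the pedal is down
pedal.key_up()
print(pedal.recording)  # False once it is released
```

In hold mode the recording state simply mirrors the pedal, so a repeated key-down event from a held pedal cannot flip it off by accident. That is part of why hold-to-record pairs well with a foot pedal.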

Pros: Zero hand movement. Fastest possible activation. Natural start/stop rhythm.
Cons: Extra hardware (~$30-80 for pedal). Desk setup only.

Logitech MX Mouse Custom Button

If you use a Logitech MX Master, assign one of the thumb buttons to Spokenly’s hotkey via Logi Options+.

Middle thumb button works well—easy to reach, distinct tactile feel, unlikely to hit accidentally.

Pros: Mouse is already in your hand. No extra hardware if you already use MX.
Cons: Hand leaves keyboard (but not desk).

Recommended Progression

Start: Keyboard shortcut (right ⌘) to validate the workflow
Upgrade: MX mouse button if you’re already in that ecosystem
Peak: Foot pedal + desk mic for maximum throughput

The foot pedal setup sounds ridiculous until you try it. Speaking without any hand involvement changes the cognitive load. You think, you speak, text appears. The physical act of typing is completely removed from the loop.

Calibration Tips

Speak at normal pace. Parakeet’s accuracy improves with natural speech patterns. Over-enunciating or speaking slowly makes it worse.

Don’t self-edit while speaking. The prompt handles filler words and restarts. Say “um” and “like” all you want—they get stripped.

Speak technical terms confidently. Parakeet handles technical vocabulary well, but mumbling proper nouns causes misrecognition.

Use the output as-is. If you’re editing the output, something’s wrong with the prompt. Tune the system prompt rather than fixing individual outputs.

Limitations

Noisy environments: Parakeet degrades in high-noise situations. Works fine with office background noise, struggles at coffee shops with music.

Multi-speaker: This is dictation, not transcription. It assumes one speaker. Group meeting transcription requires different tooling.

Specialized vocabulary: Domain-specific terms may need correction the first few times you use them. Speak technical terms clearly, and if a term keeps getting misrecognized, address it in the system prompt rather than fixing each output.

The Real Value

This isn’t about saving time on individual messages. It’s about removing the friction between thought and text.

Ideas that would never get written down—too much typing—now get captured. The threshold for “worth documenting” drops dramatically.

Quick architectural decision? Dictate it into a file, commit.
Insight from a client call? Speak it into Slack before you forget.
Complex prompt for Claude? Speak the whole thing, paste it in.

The best dictation system is one you actually use. The cleanup step was the blocker. Remove it, and speaking becomes the default input mode.

Speed: 150 WPM. Friction: Zero. Setup: 10 minutes. Output: Clean, professional text that sounds like you.