Speech to Text

Record audio and transcribe with AI-powered speech recognition

Overview

STT (Speech to Text) lets you use your microphone as input for any compatible action. Record your voice, and the audio is sent to an AI speech recognition provider which returns the transcribed text. The result is then handled exactly like any other action output: pasted, saved for later use, or shown in a floating overlay.

Two built-in STT actions are included: Audio to Text (pure transcription) and Audio Instruct (spoken commands processed by an LLM). You can also configure custom STT actions. Every STT action requires an STT-capable connection to be set up first; see Connections for how to add one.

Recording Overlay

When you start recording — via hotkey, mouse trigger, or the PromptBar microphone button — a small floating overlay window appears near your cursor. It stays on top of all other windows and does not steal focus, so your current application remains active while you record.

Recording overlay with waveform

The overlay displays:

Element | Type | Description
Waveform canvas | Display | Real-time audio amplitude visualization that shows recording is active. Dims when paused or stopping.
Countdown timer | Display | Shows the remaining time (MM:SS) once fewer than 2 minutes remain. Turns red when less than 1 minute remains.
Cancel button (red X) | Button | Discards the recording entirely during recording, or cancels the active API transcription call during processing.
Pause button (yellow pause icon) | Button | Pauses the active recording. Audio captured before the pause is preserved. The icon changes to a play triangle when paused.
Resume button (play triangle) | Button | Shown in place of Pause while recording is paused. Click to continue capturing audio.
Finish button (green checkmark) | Button | Stops recording and sends all captured audio to the AI for transcription. The overlay switches to the processing spinner.

You can drag the recording overlay to reposition it anywhere on screen. It stays in the new position for the duration of the recording session.
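
The overlay behavior described above can be read as a small state machine plus a countdown rule. The following is only an illustrative sketch, not the application's actual code: the state names, the button identifiers, and the `press` and `countdown_label` helpers are hypothetical. The source does not say what Finish or Cancel do while paused, so those transitions are assumptions.

```python
from enum import Enum


class OverlayState(Enum):
    RECORDING = "recording"
    PAUSED = "paused"
    PROCESSING = "processing"  # spinner shown while the provider transcribes
    CLOSED = "closed"


# (state, button) -> next state. Pairs that are not listed are ignored.
TRANSITIONS = {
    (OverlayState.RECORDING, "pause"): OverlayState.PAUSED,
    (OverlayState.PAUSED, "resume"): OverlayState.RECORDING,
    (OverlayState.RECORDING, "finish"): OverlayState.PROCESSING,
    (OverlayState.RECORDING, "cancel"): OverlayState.CLOSED,
    (OverlayState.PROCESSING, "cancel"): OverlayState.CLOSED,
    # Assumption: Finish and Cancel are also accepted while paused.
    (OverlayState.PAUSED, "finish"): OverlayState.PROCESSING,
    (OverlayState.PAUSED, "cancel"): OverlayState.CLOSED,
}


def press(state: OverlayState, button: str) -> OverlayState:
    """Return the state after a button press; unknown presses change nothing."""
    return TRANSITIONS.get((state, button), state)


def countdown_label(remaining_seconds: int):
    """Return (text, color) for the countdown, or None while it is hidden.

    Assumption: "under 2 minutes" means strictly fewer than 120 seconds.
    """
    if remaining_seconds >= 120:
        return None
    minutes, seconds = divmod(max(remaining_seconds, 0), 60)
    color = "red" if remaining_seconds < 60 else "default"
    return f"{minutes:02d}:{seconds:02d}", color
```

For example, `press(OverlayState.RECORDING, "pause")` yields `OverlayState.PAUSED`, and `countdown_label(95)` yields `("01:35", "default")`.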

Audio Modes

STT actions operate in one of two modes, configured per connection in Settings > Connections:

Mode | Description | Use case
Transcribe | Speech is converted to text exactly as spoken. The raw transcription is the final output. No LLM processing is applied. | Dictating notes, messages, or any text where you want the output to match what you said
Instruct | Speech is treated as an instruction for the AI language model. The transcription is sent as input to the LLM, combined with your current text context, and the LLM produces the final response. | Dictating commands such as "Write a polite reply declining this meeting" and receiving a fully composed result

The audio mode is a property of the connection, not the action. You can create multiple connections with different modes and assign each to different STT actions.

The Instruct/Transcribe mode toggle is only shown for connections whose model is known to support audio input. It does not appear on standard Whisper-only endpoints.
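
The difference between the two modes comes down to what happens after the transcription is available. A minimal sketch, assuming a hypothetical `produce_output` helper and an `llm` callable that maps a prompt string to a response string; the prompt layout is invented for illustration and is not the application's real prompt:

```python
from typing import Callable


def produce_output(
    mode: str,
    transcript: str,
    context: str,
    llm: Callable[[str], str],
) -> str:
    """Turn a transcript into the final action output for the given audio mode."""
    if mode == "transcribe":
        # Transcribe: the raw transcription is the final output, no LLM call.
        return transcript
    if mode == "instruct":
        # Instruct: the spoken text is an instruction, combined with the
        # current text context and handed to the LLM (hypothetical layout).
        prompt = f"Instruction: {transcript}\n\nContext:\n{context}"
        return llm(prompt)
    raise ValueError(f"unknown audio mode: {mode!r}")
```

With a stand-in such as `llm=lambda prompt: "Thanks, but I can't attend."`, Transcribe mode returns the spoken words unchanged, while Instruct mode returns whatever the LLM composes from the instruction and context.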

Per-Action STT Configuration

Every STT action has its own independent configuration section. Open Settings > Actions, find an STT action card, and expand its STT Configuration section to access these settings.

Per-action STT configuration panel

Element | Type | Description
Audio mode toggle (Instruct / Transcribe) | Segmented button | Switches the STT connection between Instruct and Transcribe mode. Only visible for connections whose model supports audio input.
Microphone dropdown | Select | Choose a specific microphone device or leave on Default to use the system default. Only shown when more than one audio input device is detected.
Language chips (input languages) | Chip buttons | Shows the languages configured for this action. The active input language chip is highlighted in blue. Click a chip to make it the active input language. Click the X on a chip to remove that language.
Add language button | Button | Opens an inline form to add a language. Choose from a curated list or enter a custom BCP-47 language code and name.
Output language dropdown | Select | Select a language to auto-translate the transcription into after it is produced. Populated from the input languages added to this action. Leave empty to skip translation.
Sample rate selector | Select | Audio sample rate for microphone capture. Options: 16,000 Hz (recommended), 24,000 Hz, 32,000 Hz, 48,000 Hz.
Noise suppression toggle | Toggle | Reduces background noise. Useful in noisy environments.
Echo cancellation toggle | Toggle | Removes echo from speakers playing back into the microphone. Enable when speakers are active during recording.
Auto gain control toggle | Toggle | Automatically adjusts microphone volume to maintain consistent levels regardless of mic sensitivity or speaker distance.
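
One way to picture these settings is as a single record per action with a few consistency rules. The sketch below is hypothetical: the `SttActionConfig` class, its field names, and all default values are assumptions made for illustration. Only the allowed sample rates and the rule that the output language is drawn from the input languages come from the table above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sample rates offered by the selector, in Hz.
ALLOWED_SAMPLE_RATES = (16_000, 24_000, 32_000, 48_000)


@dataclass
class SttActionConfig:
    # All defaults below are illustrative assumptions.
    microphone: str = "Default"
    input_languages: List[str] = field(default_factory=lambda: ["en-US"])
    active_language: str = "en-US"         # BCP-47 code of the highlighted chip
    output_language: Optional[str] = None  # None means no translation
    sample_rate_hz: int = 16_000
    noise_suppression: bool = True
    echo_cancellation: bool = False
    auto_gain_control: bool = True

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the config is consistent."""
        issues = []
        if self.sample_rate_hz not in ALLOWED_SAMPLE_RATES:
            issues.append(f"unsupported sample rate: {self.sample_rate_hz} Hz")
        if self.active_language not in self.input_languages:
            issues.append("active language is not among the input languages")
        if (self.output_language is not None
                and self.output_language not in self.input_languages):
            issues.append("output language is not among the input languages")
        return issues
```

A config with `sample_rate_hz=44_100` would report one problem, because 44,100 Hz is not among the listed options.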

Starting a Recording

You can start an STT recording in three ways: with a hotkey, with a mouse trigger, or with the PromptBar microphone button.

Configure hotkeys and mouse triggers for STT actions in Settings > Actions.

Recording Limits

The following limits apply to all recordings:

Limit | Value
Maximum recording duration | 12 minutes (720 seconds). Recording stops automatically at this limit.
Countdown timer appears | At 2 minutes remaining (120 seconds)
Minimum audio size | 1,000 bytes. Recordings below this threshold (empty or silent) are rejected before being sent to the provider.
Maximum audio file size | 100 MB. Very long recordings at high sample rates may approach this limit.

The 16,000 Hz sample rate produces the smallest files and is the rate recommended by most STT providers. Use higher rates only if your provider or audio quality requires it.
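
To see how sample rate relates to the size limits, here is a rough back-of-the-envelope sketch. It assumes uncompressed 16-bit mono PCM and reads "100 MB" as 100,000,000 bytes. Neither the audio encoding nor the exact byte count is stated in this document, so treat the numbers as an upper-bound illustration; the function names are hypothetical.

```python
MAX_DURATION_S = 720             # 12 minutes
MIN_AUDIO_BYTES = 1_000
MAX_AUDIO_BYTES = 100 * 1_000_000  # assumption: decimal megabytes


def estimated_pcm_bytes(sample_rate_hz: int, duration_s: int,
                        bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of raw PCM audio: samples per second x sample width x channels x seconds."""
    return sample_rate_hz * bytes_per_sample * channels * duration_s


def check_audio_size(size_bytes: int) -> str:
    """Classify a recording against the minimum and maximum size limits."""
    if size_bytes < MIN_AUDIO_BYTES:
        return "too_small"   # empty or silent recording, rejected before upload
    if size_bytes > MAX_AUDIO_BYTES:
        return "too_large"
    return "ok"


# A full 12-minute recording under the stated assumptions:
for rate in (16_000, 24_000, 32_000, 48_000):
    size = estimated_pcm_bytes(rate, MAX_DURATION_S)
    print(f"{rate} Hz: {size / 1_000_000:.2f} MB -> {check_audio_size(size)}")
```

Under these assumptions a full-length recording is about 23 MB at 16,000 Hz and about 69 MB at 48,000 Hz, which is consistent with the note that long recordings at high sample rates may approach the limit.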

Supported Providers

Any connection with the STT capability can be used for speech-to-text actions. Which providers offer STT depends on which connections you have configured. Common STT-capable providers include OpenAI (Whisper), Azure OpenAI, and OpenAI-compatible endpoints that expose a transcription API.

See Connections for how to add an STT-capable connection and how to assign it as the default STT connection or to specific actions.
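
The relationship between per-action assignments and the default STT connection can be sketched as a simple lookup. This is a hypothetical model, not the application's data format: the `Connection` class, the `"stt"` capability label, and the rule that an action-specific assignment takes precedence over the default are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Connection:
    name: str
    capabilities: FrozenSet[str]  # e.g. frozenset({"chat", "stt"})


def resolve_stt_connection(
    action_id: str,
    connections: Dict[str, Connection],
    per_action: Dict[str, str],
    default_name: Optional[str],
) -> Optional[Connection]:
    """Pick the connection for an STT action.

    Tries the action-specific assignment first, then the default STT
    connection, and skips any candidate that lacks the STT capability.
    """
    for name in (per_action.get(action_id), default_name):
        conn = connections.get(name) if name else None
        if conn is not None and "stt" in conn.capabilities:
            return conn
    return None  # no STT-capable connection is configured
```

If the function returns `None`, the action cannot run until an STT-capable connection is added.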

Related Topics