| Title: | Native R 'torch' Implementation of 'OpenAI' 'Whisper' |
|---|---|
| Description: | Speech-to-text transcription using a native R 'torch' implementation of the 'OpenAI' 'Whisper' model <https://github.com/openai/whisper>. Supports multiple model sizes from tiny (39M parameters) to large-v3 (1.5B parameters) with integrated download from 'HuggingFace' <https://huggingface.co/> via the 'hfhub' package. Provides automatic speech recognition with optional language detection and translation to English. Audio preprocessing, mel spectrogram computation, and transformer-based encoder-decoder inference are all implemented in R using the 'torch' package. |
| Authors: | Troy Hernandez [aut, cre] (ORCID: <https://orcid.org/0009-0005-4248-604X>), cornball.ai [cph], OpenAI [cph] (Whisper model architecture and mel filterbank data (MIT license)) |
| Maintainer: | Troy Hernandez <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.0 |
| Built: | 2026-05-18 09:44:22 UTC |
| Source: | https://github.com/cornball-ai/whisper |
Apply BPE Merges
apply_bpe(tokens, merge_ranks)
tokens |
Character vector of tokens |
merge_ranks |
Named vector of merge rankings |
Character vector after BPE merges
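The merge step can be illustrated with a minimal base-R sketch. It is not the package's implementation: the function name `bpe_merge_sketch` is hypothetical, and it assumes that `merge_ranks` is keyed by the two tokens joined with a space (for example "l o"), with lower ranks merged first.

```r
# Hypothetical illustration of BPE merging; not the package's own code.
# Assumption: names(merge_ranks) are "left right" pairs, lower rank = earlier merge.
bpe_merge_sketch <- function(tokens, merge_ranks) {
  while (length(tokens) > 1L) {
    # Build the adjacent pairs and look up their ranks (NA if no rule exists).
    pairs <- paste(tokens[-length(tokens)], tokens[-1L])
    pair_ranks <- merge_ranks[pairs]
    if (all(is.na(pair_ranks))) break
    best <- which.min(pair_ranks)
    left <- tokens[best]
    right <- tokens[best + 1L]
    # Merge every non-overlapping occurrence of the best pair, left to right.
    merged <- character(0)
    i <- 1L
    while (i <= length(tokens)) {
      if (i < length(tokens) && tokens[i] == left && tokens[i + 1L] == right) {
        merged <- c(merged, paste0(left, right))
        i <- i + 2L
      } else {
        merged <- c(merged, tokens[i])
        i <- i + 1L
      }
    }
    tokens <- merged
  }
  tokens
}

toy_ranks <- c("l o" = 1L, "lo w" = 2L, "e r" = 3L)
bpe_merge_sketch(c("l", "o", "w", "e", "r"), toy_ranks)
```

With the toy rules above, the five characters collapse to the two tokens "low" and "er".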
Enforce Whisper timestamp generation constraints on logits.
apply_timestamp_rules(logits, generated, special, sample_begin)
logits |
Logit tensor (1, vocab) or (vocab) |
generated |
Integer vector of tokens generated so far |
special |
Special token IDs |
sample_begin |
Index where content tokens start in generated |
Modified logits tensor
Get Audio Duration
audio_duration(file)
file |
Path to audio file |
Duration in seconds
Main preprocessing function that converts audio to the mel spectrogram format expected by Whisper.
audio_to_mel(file, n_mels = 80L, device = "auto", dtype = "auto")
file |
Path to audio file, or numeric vector of audio samples |
n_mels |
Number of mel bins (80 for most models, 128 for large-v3) |
device |
torch device for output tensor |
dtype |
torch dtype for output tensor |
torch tensor of shape (1, n_mels, 3000) for 30s audio
    # Convert audio file to mel spectrogram
    audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
    mel <- audio_to_mel(audio_file)
    dim(mel)
Beam search decoding for Whisper. Maintains multiple hypotheses and selects the best one based on length-normalized log probability.
beam_search_decode(model, encoder_output, initial_tokens, tokenizer, beam_size = 5L, max_length = 448L, timestamps = FALSE, word_timestamps = FALSE, length_penalty = 1, patience = Inf, device)
model |
WhisperModel |
encoder_output |
Encoder hidden states (batch=1) |
initial_tokens |
Initial token tensor (batch=1) |
tokenizer |
Tokenizer |
beam_size |
Number of beams |
max_length |
Maximum output length |
timestamps |
Whether to allow timestamp tokens |
word_timestamps |
Whether to collect cross-attention weights |
length_penalty |
Length penalty exponent |
patience |
Patience factor (stop after patience*beam_size finished) |
device |
Device |
List with tokens, cross_attn_weights, sum_logprob, n_tokens
Inverts the GPT-2 byte-to-unicode mapping used by byte_to_token(). Cached after first call.
build_byte_decoder()
Named character vector mapping unicode codepoint (as string) to raw byte value
GPT-2/Whisper uses a specific byte-to-unicode mapping.
byte_to_token(byte)
byte |
Integer byte value (0-255) |
Character token
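For orientation, the GPT-2 byte-to-unicode table can be rebuilt in a few lines of base R. This is a sketch under the assumption that the package follows the standard GPT-2 scheme: printable bytes map to themselves, and the remaining bytes are assigned code points from 256 upward in ascending byte order. The name `byte_to_unicode_sketch` is hypothetical.

```r
# Hypothetical reconstruction of the GPT-2 byte-to-unicode table.
# Returns a character vector of length 256; element b + 1 is the token for byte b.
byte_to_unicode_sketch <- function() {
  # Bytes that are kept as their own code point.
  printable <- c(33:126, 161:172, 174:255)
  # All other bytes are shifted to code points 256, 257, ... in byte order.
  others <- setdiff(0:255, printable)
  codepoints <- integer(256)
  codepoints[printable + 1L] <- printable
  codepoints[others + 1L] <- 256L + seq_along(others) - 1L
  vapply(codepoints, intToUtf8, character(1))
}

tab <- byte_to_unicode_sketch()
utf8ToInt(tab[32 + 1])  # code point used for the space byte (288, i.e. U+0120)
tab[65 + 1]             # "A" maps to itself
```

Inverting this table (code point back to byte) gives the kind of mapping that build_byte_decoder() is described as returning.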
Clean Transcribed Text
clean_text(text)
text |
Raw decoded text |
Cleaned text
Ratio of raw to compressed text size. High values indicate repetitive or hallucinated output.
compression_ratio(text)
text |
Character string |
Numeric compression ratio
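A rough way to see why this ratio flags repetitive output is to compute it with base R's `memCompress()`. The sketch below assumes gzip-style compression of the UTF-8 bytes; the compressor the package actually uses is not stated here, and the name `compression_ratio_sketch` is hypothetical.

```r
# Hypothetical sketch: raw byte count divided by compressed byte count.
# Assumption: gzip-style compression of the UTF-8 encoded text.
compression_ratio_sketch <- function(text) {
  raw_bytes <- charToRaw(enc2utf8(text))
  length(raw_bytes) / length(memCompress(raw_bytes, type = "gzip"))
}

# A looping phrase compresses very well, so its ratio is far above 2.4.
repetitive <- strrep("the same phrase again ", 50)
compression_ratio_sketch(repetitive)

# Ordinary varied text compresses poorly, so its ratio stays low.
compression_ratio_sketch("A short, varied sentence with little redundancy.")
```

This is the quantity compared against compression_ratio_threshold (default 2.4) in the decoding functions.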
Compute STFT Magnitude
compute_stft(audio, n_fft = WHISPER_N_FFT, hop_length = WHISPER_HOP_LENGTH)
audio |
Numeric vector of audio samples |
n_fft |
FFT window size |
hop_length |
Hop length between frames |
Complex STFT matrix
DTW-based alignment of tokens to audio frames using cross-attention weights.
Compute Word-Level Timestamps
Use cross-attention weights and DTW alignment to assign timestamps to individual words.
compute_word_timestamps(tokens, cross_attn_weights, tokenizer, config, time_offset = 0, sample_begin = 4L)
tokens |
Integer vector of generated token IDs |
cross_attn_weights |
List of cross-attention weight tensors per decode step |
tokenizer |
Whisper tokenizer |
config |
Model configuration |
time_offset |
Time offset in seconds (for chunked audio) |
sample_begin |
Index where content tokens start in generated |
Data frame with word, start, end columns
Copy Weight if Exists
copy_if_exists(param, weights, name)
param |
Target parameter |
weights |
Weight dictionary |
name |
Weight name |
Create Decoder from Config
create_decoder(config)
config |
Model configuration from whisper_config() |
WhisperDecoder module
Create Encoder from Config
create_encoder(config)
config |
Model configuration from whisper_config() |
WhisperEncoder module
Create a mel filterbank matrix for converting STFT to mel spectrogram. Used when pre-computed filterbank is not available.
create_mel_filterbank_fallback(n_fft = WHISPER_N_FFT, n_mels = 80L, sample_rate = WHISPER_SAMPLE_RATE)
n_fft |
FFT size |
n_mels |
Number of mel bins |
sample_rate |
Audio sample rate |
Mel filterbank matrix (n_mels x n_freqs)
Reverses the GPT-2 byte-level encoding, converting unicode tokens back to raw UTF-8 bytes.
decode_bpe_bytes(text)
text |
Text with BPE byte tokens |
Decoded UTF-8 text
Decode Timestamp Token
decode_timestamp(token_id, model = "tiny")
token_id |
Token ID |
model |
Model name for correct token IDs |
Time in seconds
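The conversion behind a timestamp token is simple arithmetic. The sketch below assumes the usual Whisper convention that timestamp tokens advance in 20 ms steps (50 per second) and that the first timestamp token is 50364 for the multilingual vocabularies other than large-v3; both values and the name `timestamp_to_seconds_sketch` are assumptions for illustration, not taken from the package.

```r
# Hypothetical sketch of timestamp-token arithmetic.
# Assumptions: first timestamp token ID is 50364, and each step is 1/50 s.
timestamp_to_seconds_sketch <- function(token_id,
                                        timestamp_begin = 50364L,
                                        steps_per_second = 50) {
  (token_id - timestamp_begin) / steps_per_second
}

timestamp_to_seconds_sketch(50364L)          # start of the window: 0 s
timestamp_to_seconds_sketch(50364L + 1500L)  # end of a 30 s window: 30 s
```

Because token IDs differ between variants, the real function takes a model argument rather than a fixed offset.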
Try decoding at progressively higher temperatures until quality thresholds are met. At temperature 0, uses beam search (or greedy if beam_size=1). At temperature > 0, uses sampling with best-of.
decode_with_fallback(model, encoder_output, initial_tokens, tokenizer, temperatures = c(0, 0.2, 0.4, 0.6, 0.8, 1), beam_size = 5L, best_of = 5L, max_length = 448L, timestamps = FALSE, word_timestamps = FALSE, compression_ratio_threshold = 2.4, logprob_threshold = -1, length_penalty = 1, patience = Inf, device)
model |
WhisperModel |
encoder_output |
Encoder hidden states |
initial_tokens |
Initial token tensor |
tokenizer |
Tokenizer |
temperatures |
Numeric vector of temperatures to try |
beam_size |
Number of beams for temp=0 |
best_of |
Number of samples for temp>0 |
max_length |
Maximum output length |
timestamps |
Whether to allow timestamp tokens |
word_timestamps |
Whether to collect cross-attention weights |
compression_ratio_threshold |
Max compression ratio |
logprob_threshold |
Min average log probability |
length_penalty |
Length penalty for beam search |
patience |
Patience factor for beam search |
device |
Device |
List with tokens, cross_attn_weights, sum_logprob, n_tokens
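The fallback idea can be sketched without a model by passing in a stand-in decoder. Everything here is hypothetical: `fallback_sketch`, the `decode_fn` callback, and the `compression_ratio` field on its result are illustrative, and the acceptance test (average log probability and compression ratio both within their thresholds) is an assumption based on the parameter descriptions above.

```r
# Hypothetical sketch of temperature fallback; not the package's control flow.
# decode_fn(temp) must return a list with sum_logprob, n_tokens, compression_ratio.
fallback_sketch <- function(decode_fn,
                            temperatures = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                            compression_ratio_threshold = 2.4,
                            logprob_threshold = -1) {
  result <- NULL
  for (temp in temperatures) {
    result <- decode_fn(temp)
    result$temperature <- temp
    avg_logprob <- result$sum_logprob / max(result$n_tokens, 1L)
    too_repetitive <- result$compression_ratio > compression_ratio_threshold
    too_uncertain <- avg_logprob < logprob_threshold
    # Accept the first attempt that passes both quality checks.
    if (!too_repetitive && !too_uncertain) break
  }
  # If nothing passes, the last attempt is returned.
  result
}

# Stand-in decoder: low-confidence output below temperature 0.4, fine from 0.4 on.
fake_decode <- function(temp) {
  if (temp < 0.4) {
    list(sum_logprob = -30, n_tokens = 10L, compression_ratio = 1.2)
  } else {
    list(sum_logprob = -4, n_tokens = 10L, compression_ratio = 1.5)
  }
}

fallback_sketch(fake_decode)$temperature
```

In this toy run the first two temperatures fail the log-probability check, so the result from temperature 0.4 is kept.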
Detect the spoken language in an audio file using Whisper.
Detect Language
Identify the spoken language in an audio file. Uses Whisper's decoder to predict the most likely language token from the first 30 seconds of audio.
detect_language(file, model = "tiny", device = "auto", dtype = "auto", top_k = 5L, download = TRUE, verbose = TRUE)
file |
Path to audio file (WAV, MP3, etc.) |
model |
Model name: "tiny", "base", "small", "medium", "large-v3" |
device |
Device: "auto", "cpu", "cuda" |
dtype |
Data type: "auto", "float16", "float32" |
top_k |
Number of top language probabilities to return (default: 5) |
download |
If TRUE and model not present, prompt to download. |
verbose |
Print loading messages. |
List with language (two-letter code) and probabilities (named numeric vector of top-k language probs).
    if (model_exists("tiny")) {
      audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
      result <- detect_language(audio_file)
      result$language
      result$probabilities
    }
Core detection logic. Feed SOT token to decoder, read language logits.
detect_language_from_mel(model, mel, config, device, top_k = 5L)
model |
WhisperModel |
mel |
Mel spectrogram tensor |
config |
Model config |
device |
torch device |
top_k |
Number of top probabilities to return |
List with language code and probabilities
Internal function that runs language detection using a pre-loaded pipeline.
detect_language_from_pipeline(pipe, file, top_k = 5L)
pipe |
A whisper_pipeline object |
file |
Path to audio file, or numeric vector of audio samples |
top_k |
Number of top probabilities to return |
List with language code and probabilities
Download Tokenizer Files from HuggingFace
download_tokenizer_files(model)
model |
Model name |
Download Whisper model weights and tokenizer files from HuggingFace. In interactive sessions, asks for user consent before downloading.
download_whisper_model(model = "tiny", force = FALSE)
model |
Model name: "tiny", "base", "small", "medium", "large-v3" |
force |
Re-download even if exists |
Path to model directory (invisibly)
    if (interactive()) {
      # Download tiny model (smallest, ~150MB)
      download_whisper_model("tiny")
      # Download larger model for better accuracy
      download_whisper_model("small")
    }
Standard dynamic time warping on a cost matrix.
dtw_align(cost)
cost |
Numeric matrix (n_tokens x n_frames) |
Integer matrix with 2 columns (token_idx, frame_idx), 1-indexed
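As a reference point, textbook dynamic time warping on a cost matrix looks like the base-R sketch below. It is not the package's code: `dtw_sketch` is a hypothetical name, and the allowed moves (diagonal, up, left) and tie-breaking order are assumptions.

```r
# Hypothetical DTW sketch: accumulate costs, then backtrack from the last cell.
dtw_sketch <- function(cost) {
  n <- nrow(cost)
  m <- ncol(cost)
  # acc has one extra row and column of Inf so borders need no special cases.
  acc <- matrix(Inf, n + 1L, m + 1L)
  acc[1L, 1L] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      acc[i + 1L, j + 1L] <- cost[i, j] +
        min(acc[i, j], acc[i, j + 1L], acc[i + 1L, j])
    }
  }
  # Backtrack from (n, m) to (1, 1), always stepping to the cheapest predecessor.
  i <- n
  j <- m
  path <- matrix(c(i, j), ncol = 2L)
  while (i > 1L || j > 1L) {
    steps <- c(acc[i, j], acc[i, j + 1L], acc[i + 1L, j])  # diagonal, up, left
    move <- which.min(steps)
    if (move == 1L) {
      i <- i - 1L
      j <- j - 1L
    } else if (move == 2L) {
      i <- i - 1L
    } else {
      j <- j - 1L
    }
    path <- rbind(c(i, j), path)
  }
  colnames(path) <- c("token_idx", "frame_idx")
  path
}

# 3 tokens x 4 frames; low cost marks frames where a token is "active".
toy_cost <- rbind(c(0, 1, 1, 1),
                  c(1, 0, 0, 1),
                  c(1, 1, 1, 0))
dtw_sketch(toy_cost)
```

For this toy matrix the path assigns frame 1 to token 1, frames 2 and 3 to token 2, and frame 4 to token 3. For word timestamps, the cost would typically be derived from negated cross-attention weights.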
Ensure Tokenizer Files are Downloaded
ensure_tokenizer_files(model)
model |
Model name |
Path to vocab directory (directory containing vocab.json)
Replicate batch=1 KV cache to batch=beam_size.
expand_kv_cache(kv_cache, beam_size)
kv_cache |
List of per-layer KV caches (batch=1) |
beam_size |
Number of beams |
Expanded KV cache (batch=beam_size)
Extract Segments with Timestamps
extract_segments(tokens, tokenizer, time_offset = 0)
tokens |
Token IDs |
tokenizer |
Tokenizer |
time_offset |
Offset in seconds for chunk processing |
Data frame with start, end, text
Teacher-forcing decode: feed known token sequence one at a time, collecting cross-attention weights. Used by beam search when word_timestamps is needed.
forced_decode(model, encoder_output, token_ids, device)
model |
WhisperModel |
encoder_output |
Encoder hidden states |
token_ids |
Integer vector of all token IDs (including initial) |
device |
Device |
List of cross-attention weight lists (one per content step)
Build the initial token sequence for decoder input.
get_initial_tokens(language = "en", task = "transcribe", model = "tiny", timestamps = FALSE)
language |
Two-letter language code or NULL for auto |
task |
"transcribe" or "translate" |
model |
Model name for correct special token IDs |
timestamps |
Whether to include timestamps (internal use) |
Integer vector of initial token IDs
Get Model Cache Path
get_model_path(model)
model |
Model name |
Path to model directory in hfhub cache
Get Path to Model Weights
get_weights_path(model)
model |
Model name |
Path to safetensors file
Greedy Decoding
greedy_decode(model, encoder_output, initial_tokens, tokenizer, max_length = 448L, timestamps = FALSE, word_timestamps = FALSE, device)
model |
WhisperModel |
encoder_output |
Encoder hidden states |
initial_tokens |
Initial token tensor |
tokenizer |
Tokenizer |
max_length |
Maximum output length |
timestamps |
Whether to allow timestamp tokens |
word_timestamps |
Whether to collect cross-attention weights |
device |
Device |
Integer vector of generated tokens, or list with tokens and cross_attn_weights when word_timestamps is TRUE
Merge BPE subword tokens into whole words with timestamps.
group_into_words(token_ids, starts, ends, tokenizer)
token_ids |
Integer vector of text token IDs |
starts |
Numeric vector of token start times |
ends |
Numeric vector of token end times |
tokenizer |
Whisper tokenizer |
Data frame with word, start, end columns
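One plausible grouping rule, sketched below, is that a decoded piece beginning with a space opens a new word and any following pieces without a leading space are appended to it. This is an assumption, not the package's documented rule; it works on already-decoded strings rather than token IDs, and `group_words_sketch` is a hypothetical name. Languages written without spaces would need a different rule.

```r
# Hypothetical sketch: merge decoded subword pieces into words with time spans.
# Assumption: a piece starting with " " begins a new word.
group_words_sketch <- function(pieces, starts, ends) {
  new_word <- startsWith(pieces, " ")
  new_word[1L] <- TRUE                 # the first piece always opens a word
  word_id <- cumsum(new_word)
  data.frame(
    word = unname(trimws(vapply(split(pieces, word_id), paste,
                                character(1), collapse = ""))),
    start = unname(vapply(split(starts, word_id), min, numeric(1))),
    end = unname(vapply(split(ends, word_id), max, numeric(1))),
    stringsAsFactors = FALSE
  )
}

pieces <- c(" And", " so", " my", " fell", "ow", " Americans")
starts <- c(0.0, 0.3, 0.5, 0.7, 0.9, 1.0)
ends   <- c(0.3, 0.5, 0.7, 0.9, 1.0, 1.6)
group_words_sketch(pieces, starts, ends)
```

Here " fell" and "ow" are merged into "fellow", spanning 0.7 s to 1.0 s.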
Convert Hz to Mel Scale
hz_to_mel(hz)
hz |
Frequency in Hz |
Frequency in mel scale
Check if Token is Timestamp
is_timestamp_token(token_id, model = "tiny")
token_id |
Token ID |
model |
Model name for correct token IDs |
TRUE if timestamp token
List Downloaded Models
list_downloaded_models()
Character vector of downloaded model names
    list_downloaded_models()
List Available Models
list_whisper_models()
Character vector of model names
    list_whisper_models()
Load audio from file, convert to mono, resample to 16kHz.
load_audio(file)
file |
Path to audio file (WAV, MP3, etc.) |
Numeric vector of audio samples normalized to -1 to 1 range
    # Load included sample audio
    audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
    samples <- load_audio(audio_file)
    length(samples)
    range(samples)
Load Decoder Weights
load_decoder_weights(decoder, weights)
decoder |
WhisperDecoder module |
weights |
Named list of tensors |
Load Encoder Weights
load_encoder_weights(encoder, weights)
encoder |
WhisperEncoder module |
weights |
Named list of tensors |
Load the official Whisper mel filterbank from bundled CSV file.
load_mel_filterbank(n_mels = 80L)
n_mels |
Number of mel bins (80 or 128) |
Mel filterbank matrix (n_mels x n_freqs)
Load a Whisper model with weights from HuggingFace.
load_whisper_model(model = "tiny", device = "auto", dtype = "auto", download = FALSE, verbose = TRUE)
model |
Model name: "tiny", "base", "small", "medium", "large-v3" |
device |
Device to load model on ("auto", "cpu", "cuda") |
dtype |
Data type ("auto", "float16", "float32") |
download |
If TRUE and model not present, prompt to download |
verbose |
Print loading messages |
WhisperModel module
    # Load tiny model (requires prior download)
    if (model_exists("tiny")) {
      model <- load_whisper_model("tiny")
    }
Load Weights from Safetensors
load_whisper_weights(model, weights_path, verbose = TRUE)
model |
WhisperModel module |
weights_path |
Path to safetensors file |
verbose |
Print loading messages |
Apply a sliding median filter to a numeric vector.
medfilt1(x, width = 7L)
x |
Numeric vector |
width |
Filter width (must be odd) |
Filtered numeric vector of same length
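A sliding median can be written in a few lines of base R. The sketch below is illustrative only: `medfilt_sketch` is a hypothetical name, and padding the ends by repeating the first and last values is an assumption, since the package's edge handling is not described here.

```r
# Hypothetical sliding-median sketch.
# Assumption: edges are padded by repeating the first and last values.
medfilt_sketch <- function(x, width = 7L) {
  stopifnot(width %% 2L == 1L)
  half <- width %/% 2L
  n <- length(x)
  padded <- c(rep(x[1L], half), x, rep(x[n], half))
  # The window for output i covers padded[i .. i + width - 1].
  vapply(seq_len(n),
         function(i) as.numeric(median(padded[i:(i + width - 1L)])),
         numeric(1))
}

# A single spike is removed by a width-3 median.
medfilt_sketch(c(1, 1, 9, 1, 1, 1, 1), width = 3L)
```

The output has the same length as the input, matching the documented return value.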
Convert Mel Scale to Hz
mel_to_hz(mel)
mel |
Frequency in mel scale |
Frequency in Hz
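Several mel-scale definitions are in use, and this page does not say which one hz_to_mel() and mel_to_hz() implement. As an illustration only, the sketch below uses the common HTK-style formula; the package may instead use the Slaney variant, and the function names are hypothetical.

```r
# Hypothetical sketch using the HTK-style mel formula (an assumption).
hz_to_mel_sketch <- function(hz) 2595 * log10(1 + hz / 700)
mel_to_hz_sketch <- function(mel) 700 * (10^(mel / 2595) - 1)

# The two functions are inverses of each other.
freqs <- c(0, 440, 1000, 8000)
mel_to_hz_sketch(hz_to_mel_sketch(freqs))
```

Whichever variant is used, a mel filterbank is built by spacing filter centers evenly on the mel axis and mapping them back to Hz.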
Check if Model is Downloaded
model_exists(model)
model |
Model name |
TRUE if model weights exist locally
    model_exists("tiny")
    model_exists("large-v3")
Pad or Trim Audio to Fixed Length
pad_or_trim(audio, length = WHISPER_N_SAMPLES)
audio |
Numeric vector of audio samples |
length |
Target length in samples (default: 30s at 16kHz) |
Numeric vector of specified length
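The behavior described above amounts to truncating long input and zero-padding short input. A minimal sketch, assuming zeros are appended at the end and using the hypothetical name `pad_or_trim_sketch`:

```r
# Hypothetical sketch: force a sample vector to a fixed length.
# Default target is 30 s at 16 kHz = 480000 samples.
pad_or_trim_sketch <- function(audio, n_samples = 30L * 16000L) {
  n <- length(audio)
  if (n >= n_samples) {
    audio[seq_len(n_samples)]              # keep the first n_samples values
  } else {
    c(audio, numeric(n_samples - n))       # append zeros (silence)
  }
}

length(pad_or_trim_sketch(rep(0.5, 10), n_samples = 16L))  # padded to 16
length(pad_or_trim_sketch(rep(0.5, 40), n_samples = 16L))  # trimmed to 16
```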
Parse Device Argument
parse_device(device = "auto")
device |
Character or torch device. "auto" uses GPU if available. |
torch device object
Parse Dtype Argument
parse_dtype(dtype = "auto", device = whisper_device())
dtype |
Character or torch dtype. "auto" uses float16 on GPU, float32 on CPU. |
device |
torch device (used for auto selection) |
torch dtype
Reorder cached key-value tensors to match new beam ordering.
rearrange_kv_cache(kv_cache, beam_indices, device)
kv_cache |
List of per-layer KV caches |
beam_indices |
Integer tensor of beam indices (1-indexed) |
device |
Device |
Reordered KV cache
Temperature-scaled sampling decode. Fork of greedy_decode that uses categorical sampling instead of argmax.
sample_decode(model, encoder_output, initial_tokens, tokenizer, temperature = 0.6, max_length = 448L, timestamps = FALSE, word_timestamps = FALSE, device)
model |
WhisperModel |
encoder_output |
Encoder hidden states |
initial_tokens |
Initial token tensor (batch=1) |
tokenizer |
Tokenizer |
temperature |
Sampling temperature (must be > 0) |
max_length |
Maximum output length |
timestamps |
Whether to allow timestamp tokens |
word_timestamps |
Whether to collect cross-attention weights |
device |
Device |
List with tokens, cross_attn_weights, sum_logprob, n_tokens
Split audio longer than 30 seconds into overlapping chunks.
split_audio(file, chunk_length = 30, overlap = 1)
file |
Path to audio file |
chunk_length |
Chunk length in seconds |
overlap |
Overlap between chunks in seconds |
List of audio chunks (numeric vectors)
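To make the chunking arithmetic concrete, the sketch below computes chunk boundaries in samples. It assumes each chunk starts `chunk_length - overlap` seconds after the previous one and that the final chunk is simply shorter; the package's exact boundary handling may differ, and `chunk_bounds_sketch` is a hypothetical name.

```r
# Hypothetical sketch: 1-indexed sample ranges for overlapping chunks.
# Assumption: consecutive chunks start (chunk_length - overlap) seconds apart.
chunk_bounds_sketch <- function(n_samples, chunk_length = 30, overlap = 1,
                                sample_rate = 16000) {
  chunk <- chunk_length * sample_rate
  step <- (chunk_length - overlap) * sample_rate
  starts <- seq(1, n_samples, by = step)
  ends <- pmin(starts + chunk - 1, n_samples)
  # Drop chunks that begin after the previous chunk already reached the end.
  keep <- c(TRUE, head(ends, -1L) < n_samples)
  data.frame(start = starts[keep], end = ends[keep])
}

# 70 s of audio at 16 kHz: chunks begin at 0 s, 29 s, and 58 s.
chunk_bounds_sketch(70 * 16000)
```

The one-second overlap means each boundary region is seen by two chunks, which reduces the chance of cutting a word in half.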
Decode Token IDs to Text
tokenizer_decode(ids, id_to_token, special_tokens)
ids |
Integer vector of token IDs |
id_to_token |
Mapping from ID to token |
special_tokens |
Special token info |
Character string
Encode Text to Token IDs
tokenizer_encode(text, vocab, merge_ranks)
text |
Character string to encode |
vocab |
Vocabulary mapping (token -> id) |
merge_ranks |
Merge ranking for BPE |
Integer vector of token IDs
Transcribe speech from an audio file using Whisper. For repeated transcription, use whisper_pipeline() to load the model once.
transcribe(file, model = "tiny", language = NULL, task = "transcribe", timestamps = FALSE, word_timestamps = FALSE, beam_size = 1L, temperatures = 0, best_of = 1L, compression_ratio_threshold = 2.4, logprob_threshold = -1, length_penalty = 1, patience = Inf, device = "auto", dtype = "auto", verbose = TRUE)
file |
Path to audio file (WAV, MP3, etc.) |
model |
Model name: "tiny", "base", "small", "medium", "large-v3" |
language |
Language code (e.g., "en", "es"), or NULL (default) for auto-detection from the audio. |
task |
"transcribe" or "translate" (translate to English) |
timestamps |
If TRUE, return segment-level timestamps |
word_timestamps |
If TRUE, return word-level timestamps (implies timestamps) |
beam_size |
Number of beams for beam search (1 = greedy, default) |
temperatures |
Numeric vector of temperatures to try. 0 uses beam search or greedy; values > 0 use sampling. Multiple values enable fallback. |
best_of |
Number of samples per temperature > 0, keeping the best. |
compression_ratio_threshold |
Max compression ratio before fallback. |
logprob_threshold |
Min average log probability before fallback. |
length_penalty |
Length penalty exponent for beam search scoring. |
patience |
Patience factor for beam search (stop after patience*beam_size). |
device |
Device: "auto", "cpu", "cuda" |
dtype |
Data type: "auto", "float16", "float32" |
verbose |
Print progress messages |
List with text, language, and metadata. When timestamps=TRUE, includes segments data.frame with start, end, text columns. When word_timestamps=TRUE, includes words data.frame with word, start, end columns.
    if (model_exists("tiny")) {
      audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
      # Auto-detect language (default)
      result <- transcribe(audio_file, model = "tiny")
      result$language # "en"
      result$text
      # Explicit language
      result <- transcribe(audio_file, model = "tiny", language = "en")
      # With timestamps
      result <- transcribe(audio_file, model = "tiny", timestamps = TRUE)
      result$segments
      # Translate Spanish audio to English
      spanish_file <- system.file("audio", "allende.mp3", package = "whisper")
      result <- transcribe(spanish_file, model = "tiny", language = "es", task = "translate")
      result$text
    }
Transcribe Single Chunk
transcribe_chunk(file, model, tokenizer, config, language = NULL, task = "transcribe", timestamps = FALSE, word_timestamps = FALSE, beam_size = 1L, temperatures = 0, best_of = 1L, compression_ratio_threshold = 2.4, logprob_threshold = -1, length_penalty = 1, patience = Inf, time_offset = 0, device, dtype, verbose = TRUE)
file |
Audio file or mel spectrogram |
model |
WhisperModel |
tokenizer |
Tokenizer |
config |
Model config |
language |
Language code |
task |
Task type |
timestamps |
Return segment-level timestamps. |
word_timestamps |
Return word-level timestamps. |
beam_size |
Number of beams for beam search. |
temperatures |
Numeric vector of temperatures for fallback. |
best_of |
Number of samples per temperature > 0. |
compression_ratio_threshold |
Max compression ratio before fallback. |
logprob_threshold |
Min average log probability before fallback. |
length_penalty |
Length penalty exponent for beam search. |
patience |
Patience factor for beam search. |
time_offset |
Time offset in seconds for chunk processing. |
device |
Device |
dtype |
Dtype |
verbose |
Verbose output |
Transcription result
Process audio longer than 30 seconds in chunks.
transcribe_long(file, model, tokenizer, config, language, task, timestamps = FALSE, word_timestamps = FALSE, beam_size = 1L, temperatures = 0, best_of = 1L, compression_ratio_threshold = 2.4, logprob_threshold = -1, length_penalty = 1, patience = Inf, device, dtype, verbose)
file |
Audio file |
model |
WhisperModel |
tokenizer |
Tokenizer |
config |
Model config |
language |
Language |
task |
Task |
timestamps |
Return segment-level timestamps. |
word_timestamps |
Return word-level timestamps. |
beam_size |
Number of beams for beam search. |
temperatures |
Numeric vector of temperatures for fallback. |
best_of |
Number of samples per temperature > 0. |
compression_ratio_threshold |
Max compression ratio before fallback. |
logprob_threshold |
Min average log probability before fallback. |
length_penalty |
Length penalty exponent for beam search. |
patience |
Patience factor for beam search. |
device |
Device |
dtype |
Dtype |
verbose |
Verbose |
Combined transcription result
Transformer encoder for processing mel spectrograms.
Multi-Head Self-Attention
whisper_attention(n_state, n_head)
n_state |
Hidden dimension |
n_head |
Number of attention heads |
Get configuration for a Whisper model variant.
whisper_config(model = "tiny")
model |
Character. Model name: "tiny", "base", "small", "medium", "large-v3" |
List with model configuration parameters
    # Get tiny model configuration
    cfg <- whisper_config("tiny")
    cfg$n_mels
    cfg$n_audio_layer
    # Compare model sizes
    whisper_config("tiny")$n_text_layer
    whisper_config("large-v3")$n_text_layer
Full Whisper decoder: token embedding + positional embedding + transformer layers.
whisper_decoder(n_vocab, n_ctx, n_state, n_head, n_layer)
n_vocab |
Vocabulary size |
n_ctx |
Maximum context length |
n_state |
Hidden dimension |
n_head |
Number of attention heads |
n_layer |
Number of transformer layers |
Transformer decoder with cross-attention to encoder outputs.
Decoder Layer
Pre-norm transformer decoder layer with self-attention and cross-attention.
whisper_decoder_layer(n_state, n_head)
n_state |
Hidden dimension |
n_head |
Number of attention heads |
Utilities for managing torch devices and data types.
Get Default Device
Returns CUDA device if available, otherwise CPU.
whisper_device()
torch device object
    if (torch::torch_is_installed()) {
      device <- whisper_device()
      device$type
    }
Returns float16 on CUDA, float32 on CPU.
whisper_dtype(device = whisper_device())
device |
torch device |
torch dtype
    if (torch::torch_is_installed()) {
      dtype <- whisper_dtype()
      dtype
    }
Full Whisper encoder: Conv stem + positional encoding + transformer layers.
whisper_encoder(n_mels, n_ctx, n_state, n_head, n_layer)
n_mels |
Number of mel spectrogram bins |
n_ctx |
Maximum context length (1500 for 30s audio) |
n_state |
Hidden dimension |
n_head |
Number of attention heads |
n_layer |
Number of transformer layers |
Pre-norm transformer encoder layer.
whisper_encoder_layer(n_state, n_head)
n_state |
Hidden dimension |
n_head |
Number of attention heads |
Reverse lookup: convert a language token ID back to a two-letter code.
whisper_lang_from_id(token_id)
token_id |
Integer token ID (e.g., 50259 for English) |
Two-letter language code
Get Language Token ID
whisper_lang_token(lang = "en", model = "tiny")
lang |
Two-letter language code (e.g., "en", "es", "fr") |
model |
Model name for correct token IDs |
Token ID for the language
Returns the named integer vector mapping language codes to offsets.
whisper_language_table()
Named integer vector (language code -> offset from 50259)
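The offset table makes the forward and reverse language lookups a matter of addition and matching. The sketch below uses a four-entry excerpt whose ordering follows the reference Whisper language list (an assumption here, not read from the package); `lang_offsets_sketch`, `lang_token_sketch`, and `lang_from_id_sketch` are hypothetical names.

```r
# Hypothetical excerpt of the language offset table (assumed ordering).
lang_offsets_sketch <- c(en = 0L, zh = 1L, de = 2L, es = 3L)

# Forward lookup: language code -> token ID (base ID 50259 is English).
lang_token_sketch <- function(lang, base_id = 50259L) {
  stopifnot(lang %in% names(lang_offsets_sketch))
  base_id + unname(lang_offsets_sketch[lang])
}

# Reverse lookup: token ID -> language code (NA if outside the excerpt).
lang_from_id_sketch <- function(token_id, base_id = 50259L) {
  names(lang_offsets_sketch)[match(token_id - base_id, lang_offsets_sketch)]
}

lang_token_sketch("es")
lang_from_id_sketch(50261L)
```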
Full Whisper model combining encoder and decoder.
Whisper Model Module
whisper_model(config)
config |
Model configuration |
Main transcription API for Whisper.
Create a Whisper Pipeline
Load the model, tokenizer, and config once. Call $transcribe() repeatedly without reloading.
whisper_pipeline(model = "tiny", device = "auto", dtype = "auto", download = TRUE, verbose = TRUE)
model |
Model name: "tiny", "base", "small", "medium", "large-v3" |
device |
Device: "auto", "cpu", "cuda" |
dtype |
Data type: "auto", "float16", "float32" |
download |
If TRUE and model not present, prompt to download. |
verbose |
Print loading messages. |
A whisper_pipeline object with a $transcribe() method.
    if (model_exists("tiny")) {
      pipe <- whisper_pipeline("tiny")
      pipe$transcribe(system.file("audio", "jfk.mp3", package = "whisper"))
    }
Convert audio files to mel spectrograms for Whisper input.
Whisper Audio Constants
WHISPER_SAMPLE_RATE
An object of class integer of length 1.
Get special token IDs for a Whisper model. Token IDs differ between model variants (e.g., large-v3 has extra language tokens).
whisper_special_tokens(model = "tiny")
model |
Model name (default: "tiny") |
Named list of special token IDs
Byte-pair encoding tokenizer for Whisper models.
Create Whisper Tokenizer
Load or create a Whisper tokenizer from HuggingFace vocab files.
whisper_tokenizer(model = "tiny")
model |
Model name for vocab lookup |
Tokenizer object (list with encode/decode functions)
    # Load tokenizer (requires prior model download)
    if (model_exists("tiny")) {
      tok <- whisper_tokenizer("tiny")
      tok$encode("Hello world")
      tok$decode(c(50258, 50259, 50359, 50363))
    }