The Google Health AI team has released MedASR, an open-weights medical speech-to-text model that targets medical dictation and physician-patient conversations and is designed to plug directly into modern AI workflows.
What MedASR is and where it fits
MedASR is a speech-to-text model based on the Conformer architecture and pre-trained for medical dictation and transcription. It is positioned as a starting point for developers who want to build healthcare voice applications such as radiology dictation tools or visit note capture systems.
The model has 105 million parameters and accepts mono-channel audio at 16,000 Hz with 16-bit integer waveforms. It produces text-only output, so it drops directly into downstream natural language processing or generative models such as MedGemma.
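The input format above (mono, 16,000 Hz, 16-bit integers) can be illustrated with a small NumPy sketch. The helper name `to_mono_16k_int16` is hypothetical, and the linear interpolation is a stand-in for a proper resampler with anti-aliasing, so treat this as an illustration of the target format rather than a production preprocessing step.

```python
import numpy as np

TARGET_SR = 16000  # sample rate stated for MedASR input


def to_mono_16k_int16(samples, sample_rate):
    """Convert float audio in [-1, 1] to mono 16 kHz int16 samples.

    `samples` has shape (n,) for mono or (n, channels) for multi-channel audio.
    This helper is a hypothetical illustration, not part of any MedASR API.
    """
    x = np.asarray(samples, dtype=np.float64)
    if x.ndim == 2:
        x = x.mean(axis=1)  # downmix by averaging channels
    if sample_rate != TARGET_SR:
        n_out = int(round(len(x) * TARGET_SR / sample_rate))
        t_in = np.arange(len(x)) / sample_rate
        t_out = np.arange(n_out) / TARGET_SR
        # Linear interpolation only; a real resampler also low-pass filters.
        x = np.interp(t_out, t_in, x)
    x = np.clip(x, -1.0, 1.0)
    return np.round(x * 32767.0).astype(np.int16)
```

For example, one second of 48 kHz stereo audio comes back as a one-dimensional `int16` array of 16,000 samples.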
MedASR sits inside the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP and other domain-specific medical models that share common terms of use and a consistent governance story.
Training data and domain specialization
MedASR is trained on a diverse corpus of de-identified medical speech. The dataset includes about 5,000 hours of physician dictations and clinical conversations across radiology, internal medicine and family medicine.
The training data pairs audio segments with transcripts and metadata. Subsets of the conversational data are annotated with medical named entities including symptoms, medications and conditions. This gives the model strong coverage of clinical vocabulary and phrasing patterns that appear in routine documentation.
The model is English only, and most training audio comes from speakers for whom English is a first language and who were raised in the United States. The documentation notes that performance may be lower for other speaker profiles or noisy microphones and recommends fine-tuning for such settings.
Architecture and decoding
MedASR follows the Conformer encoder design. Conformer combines convolution blocks with self-attention layers so it can capture local acoustic patterns and longer-range temporal dependencies in the same stack.
The model is exposed as an automatic speech recognizer with a CTC-style interface. In the reference implementation, developers use AutoProcessor to create input features from waveform audio and AutoModelForCTC to produce token sequences. Greedy decoding is the default. The model can also be paired with an external six-gram language model, using beam search with a beam size of 8, to improve word error rate.
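To make the greedy CTC step concrete, here is a minimal sketch of the standard rule: take the best token per frame, collapse consecutive repeats, then drop blanks. The toy character vocabulary and `blank_id=0` are assumptions for illustration; the article does not describe MedASR's actual tokenizer or blank index.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding over per-frame score vectors.

    frame_scores: list of per-frame score lists, one score per vocab entry.
    vocab: list mapping token id to string (toy vocabulary, assumed here).
    blank_id: id of the CTC blank token (assumed to be 0 in this sketch).
    """
    # 1. Pick the highest-scoring token id in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Collapse runs of the same id into a single id.
    collapsed = [token for token, _ in groupby(best)]
    # 3. Remove blanks; a blank between two equal ids keeps both of them.
    return "".join(vocab[token] for token in collapsed if token != blank_id)
```

With the vocabulary `["<blank>", "c", "a", "t"]`, a frame-wise best path of `c c <blank> a a <blank> t` decodes to `cat`. Beam search with a language model replaces step 1 with a search over several candidate paths instead of committing to the single best token per frame.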
MedASR training uses JAX and ML Pathways on TPUv4p, TPUv5p and TPUv5e hardware. These systems provide the scale needed for large speech models and align with Google's broader foundation model training stack.
Performance on medical speech tasks
Key results, reported as word error rate (lower is better) with greedy decoding and with a six-gram language model, are:
- RAD DICT, radiologist dictation: MedASR greedy 6.6%, MedASR plus language model 4.6%, Gemini 2.5 Pro 10.0%, Gemini 2.5 Flash 24.4%, Whisper v3 Large 25.3%.
- GENERAL DICT, general and internal medicine: MedASR greedy 9.3%, MedASR plus language model 6.9%, Gemini 2.5 Pro 16.4%, Gemini 2.5 Flash 27.1%, Whisper v3 Large 33.1%.
- FM DICT, family medicine: MedASR greedy 8.1%, MedASR plus language model 5.8%, Gemini 2.5 Pro 14.6%, Gemini 2.5 Flash 19.9%, Whisper v3 Large 32.5%.
- Eye Gaze, dictation on 998 MIMIC chest X-ray cases: MedASR greedy 6.6%, MedASR plus language model 5.2%, Gemini 2.5 Pro 5.9%, Gemini 2.5 Flash 9.3%, Whisper v3 Large 12.5%.
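For readers unfamiliar with the metric behind these numbers, word error rate is the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation; the article does not say which text normalization the benchmarks apply before scoring, so none is applied here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

As an invented example (not taken from the benchmarks), scoring the hypothesis "no acute cardio pulmonary process" against the reference "no acute cardiopulmonary process" gives one substitution plus one insertion over four reference words, a WER of 0.5.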
Developer workflow and deployment options
A minimal pipeline example is:
from transformers import pipeline
import huggingface_hub
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")
pipe = pipeline("automatic-speech-recognition", model="google/medasr")
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
For more control, developers load AutoProcessor and AutoModelForCTC, resample audio to 16,000 Hz with librosa, move tensors to CUDA if available and call model.generate followed by processor.batch_decode.
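The lower-level flow described above could look roughly like the sketch below. The function name `transcribe` is hypothetical, and the use of `model.generate` follows the article's description; generic CTC models in Transformers usually expose logits and not `generate`, so this call is assumed to be specific to the MedASR checkpoint and has not been verified here. Running it downloads the model weights from the Hugging Face Hub.

```python
MODEL_ID = "google/medasr"
TARGET_SR = 16000  # sample rate the model expects


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe one audio file with explicit processor and model objects."""
    # Heavy dependencies are imported lazily so defining the function is cheap.
    import librosa
    import torch
    from transformers import AutoModelForCTC, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCTC.from_pretrained(model_id).to(device)

    # librosa.load returns mono float audio resampled to the requested rate.
    speech, sample_rate = librosa.load(audio_path, sr=TARGET_SR)

    inputs = processor(
        speech, sampling_rate=sample_rate, return_tensors="pt", padding=True
    ).to(device)

    # Assumption from the article: this checkpoint supports generate().
    with torch.no_grad():
        outputs = model.generate(**inputs)
    return processor.batch_decode(outputs)[0]
```

A call such as `transcribe("test_audio.wav")` would return the decoded text for a local file. Loading the processor and model once and reusing them across files is preferable in a real service; they are loaded inside the function here only to keep the sketch short.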
Key Takeaways
- MedASR is a lightweight, open-weights, Conformer-based medical ASR model: It has 105M parameters, is trained specifically for medical dictation and transcription, and is released under the Health AI Developer Foundations program as an English-only model for healthcare developers.
- Domain-specific training on about 5,000 hours of de-identified medical audio: MedASR is pre-trained on physician dictations and clinical conversations across specialties such as radiology, internal medicine and family medicine, which gives it strong coverage of clinical terminology compared to general-purpose ASR systems.
- Competitive or better word error rates on medical dictation benchmarks: On internal radiology, general medicine, family medicine and Eye Gaze datasets, MedASR with greedy or language model decoding matches or outperforms large general models such as Gemini 2.5 Pro, Gemini 2.5 Flash and Whisper v3 Large on word error rate for English medical speech.

