docs: fix README streaming example (runnable + actually streams) #2987
The Usage streaming snippet used input="chunk.wav" (a file that does not exist) and a single one-shot generate() call missing is_final / encoder_chunk_look_back / the chunk loop, so it neither ran nor demonstrated streaming. Replace it with the real chunk-by-chunk loop (matching the repo example examples/industrial_data_pretraining/paraformer_streaming/demo.py): read audio, iterate fixed-stride chunks, pass cache + is_final + look-back, print partial text per chunk. Verified on GPU (paraformer-zh-streaming): emits incremental text per chunk and reconstructs the full sentence.
Code Review
This pull request updates both the English and Chinese README files to provide a complete, realistic example of streaming real-time audio chunk-by-chunk using the soundfile library. The feedback suggests converting the loaded audio to mono if it has multiple channels to prevent potential shape mismatch errors during model inference.
import soundfile as sf
model = AutoModel(model="paraformer-zh-streaming", device="cuda")
result = model.generate(input="chunk.wav", cache={}, chunk_size=[0, 10, 5])
audio, sr = sf.read("speech.wav", dtype="float32")  # 16 kHz mono
If the user's speech.wav is a stereo audio file, sf.read will return a 2D array, which can cause shape mismatch errors during feature extraction or model inference. Converting the audio to mono if it has multiple channels makes the example more robust.
Suggested change:
- audio, sr = sf.read("speech.wav", dtype="float32")  # 16 kHz mono
+ audio, sr = sf.read("speech.wav", dtype="float32")
+ if audio.ndim > 1:
+     audio = audio[:, 0]  # Convert to mono if stereo
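The suggested `audio[:, 0]` keeps only the first channel and discards the rest. If averaging the channels is preferred, a minimal sketch could look like the following. It assumes the audio is a NumPy array shaped `(frames, channels)`, which is what `sf.read` returns for multi-channel files; `to_mono` is a hypothetical helper name, not part of soundfile or FunASR.

```python
import numpy as np


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Return 1-D audio, averaging channels when given (frames, channels).

    Hypothetical helper for illustration; not a soundfile or FunASR API.
    """
    if audio.ndim == 1:
        return audio  # already mono, nothing to do
    # Average across the channel axis and keep the original dtype (e.g. float32).
    return audio.mean(axis=1).astype(audio.dtype)


if __name__ == "__main__":
    stereo = np.array([[1.0, 3.0], [2.0, 4.0]], dtype=np.float32)
    print(to_mono(stereo))
```

Either approach yields a 1-D array; taking the first channel is cheaper, while averaging keeps content that is present in only one channel.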
import soundfile as sf
model = AutoModel(model="paraformer-zh-streaming", device="cuda")
result = model.generate(input="chunk.wav", cache={}, chunk_size=[0, 10, 5])
audio, sr = sf.read("speech.wav", dtype="float32")  # 16 kHz 单声道
If the user's speech.wav is a stereo audio file, sf.read will return a 2D array, which can cause shape mismatch errors during feature extraction or model inference. Converting the audio to mono if it has multiple channels makes the example more robust.
Suggested change:
- audio, sr = sf.read("speech.wav", dtype="float32")  # 16 kHz 单声道
+ audio, sr = sf.read("speech.wav", dtype="float32")
+ if audio.ndim > 1:
+     audio = audio[:, 0]  # Convert to mono if stereo
Problem
The Usage section's streaming example was broken and misleading:
chunk.wav is a placeholder file that doesn't exist.
It made a single one-shot generate() call missing is_final, encoder_chunk_look_back / decoder_chunk_look_back, and the chunk loop. A user can't learn streaming from it.
Fix
Replace it with the real chunk-by-chunk loop, matching the repo's own example (examples/industrial_data_pretraining/paraformer_streaming/demo.py): read the audio, iterate fixed-stride chunks, pass cache + is_final + look-back, and print partial text per chunk. Applied to both README.md and README_zh.md.
Verification
Ran the new snippet on GPU with paraformer-zh-streaming: it emits incremental text chunk by chunk and reconstructs the full sentence.
Only the streaming code example changes; no header/structure edits.
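For reviewers who want to check the loop logic without loading the model, the chunking described above (fixed-stride chunks, a shared cache, `is_final` set only on the last chunk) can be sketched in plain Python. This is an illustration under stated assumptions, not the README snippet itself: `iter_chunks`, `stream_transcribe`, and `fake_recognize` are hypothetical names, and `fake_recognize` stands in for the real `model.generate(...)` call.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def iter_chunks(samples: Sequence[float], stride: int) -> Iterator[Tuple[Sequence[float], bool]]:
    """Yield (chunk, is_final) pairs of at most `stride` samples each."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    # Ceiling division, with at least one (possibly empty) final chunk.
    total = max(1, (len(samples) + stride - 1) // stride)
    for i in range(total):
        yield samples[i * stride:(i + 1) * stride], i == total - 1


def stream_transcribe(
    samples: Sequence[float],
    stride: int,
    recognize: Callable[[Sequence[float], Dict, bool], str],
) -> List[str]:
    """Feed chunks to `recognize` one at a time, sharing a single cache."""
    cache: Dict = {}  # the same dict is passed on every call so state carries over
    partials = []
    for chunk, is_final in iter_chunks(samples, stride):
        partials.append(recognize(chunk, cache, is_final))
    return partials


def fake_recognize(chunk: Sequence[float], cache: Dict, is_final: bool) -> str:
    # Hypothetical stand-in for the model call: reports samples seen so far.
    cache["seen"] = cache.get("seen", 0) + len(chunk)
    return f"{cache['seen']}{'!' if is_final else ''}"


if __name__ == "__main__":
    # 25 samples with a stride of 10 gives chunks of 10, 10, and 5 samples.
    print(stream_transcribe(list(range(25)), 10, fake_recognize))
    # prints ['10', '20', '25!']
```

In the actual README snippet, the `recognize` step would be the `model.generate` call with `cache`, `is_final`, `chunk_size`, and the look-back arguments, and the stride would be derived from `chunk_size` and the sample rate.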