fix: detok review findings (null vocab guard + utf-8 tokens read) by LauraGPT · Pull Request #3005 · modelscope/FunASR

LauraGPT · 2026-06-20T08:49:21Z

Addresses the gemini-code-assist findings on the merged in-binary detok (#3004/#302).

Null guard on gguf_get_arr_str (funasr-sensevoice.cpp / funasr-paraformer.cpp): a malformed GGUF could return nullptr for a vocab element; assigning that to std::string is UB. Now coerces to empty string.
tokens.json read (export_paraformer_gguf.py): use with open(..., encoding="utf-8") — a context manager (no fd leak) and explicit UTF-8 so the non-ASCII tokens load correctly on Windows (default cp1252).

Not changed: adjacent VAD-segment texts are concatenated without a separator — that is intentional for Chinese, where the VAD splits fall mid-utterance and a space would fragment the sentence (CER is unaffected either way). --ids still gives raw ids if a caller wants per-segment structure.

Verified: rebuilt, both binaries still print the correct text (sample + full 002).

…s.json read)

gemini-code-assist

Code Review

This pull request improves vocabulary loading by explicitly using UTF-8 encoding when reading the JSON file in Python, and adds null checks when retrieving vocabulary strings from GGUF arrays in C++ (funasr-paraformer.cpp and funasr-sensevoice.cpp). The reviewer recommends formatting the long, single-line vocabulary extraction blocks in both C++ files to improve readability and adding defensive checks to ensure the array size is positive before resizing the vector.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-20T08:50:35Z

  m.c.enc_blocks=rdi("pf.enc.num_blocks",50);m.c.dec_blocks=rdi("pf.dec.num_blocks",16);m.c.dec_att=rdi("pf.dec.att_layer_num",16);
  m.c.dec3=rdi("pf.dec.decoders3",1);m.c.vocab=rdi("pf.vocab_size",8404);m.c.tail=rdf("pf.predictor.tail_threshold",0.45f);m.c.thresh=rdf("pf.predictor.threshold",1.0f);
-  std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"pf.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++)vocab[i]=gguf_get_arr_str(gg,ki,i);}}
+  std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"pf.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++){const char*s=gguf_get_arr_str(gg,ki,i); vocab[i]=s?s:"";}}}


The single-line block is extremely long (over 200 characters) and contains multiple nested statements, making it very difficult to read, maintain, and debug. Additionally, if gguf_get_arr_n returns a negative value or 0, calling vocab.resize directly can lead to undefined behavior or unnecessary allocation attempts. Formatting this block and adding a defensive check for nv > 0 improves both readability and robustness.

std::vector<std::string> vocab; { int ki = gguf_find_key(gg, "pf.vocab"); if (ki >= 0) { int nv = gguf_get_arr_n(gg, ki); if (nv > 0) { vocab.resize(nv); for (int i = 0; i < nv; i++) { const char* s = gguf_get_arr_str(gg, ki, i); vocab[i] = s ? s : ""; } } } }

gemini-code-assist · 2026-06-20T08:50:35Z

  int qi=gguf_find_key(gg,"sv.query_tokens"); int nq=qi<0?0:(int)gguf_get_arr_n(gg,qi);
  std::vector<int> qtok(nq); for(int i=0;i<nq;i++) qtok[i]=((const int32_t*)gguf_get_arr_data(gg,qi))[i];
-  std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"sv.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++)vocab[i]=gguf_get_arr_str(gg,ki,i);}}
+  std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"sv.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++){const char*s=gguf_get_arr_str(gg,ki,i); vocab[i]=s?s:"";}}}


The single-line block is extremely long (over 200 characters) and contains multiple nested statements, making it very difficult to read, maintain, and debug. Additionally, if gguf_get_arr_n returns a negative value or 0, calling vocab.resize directly can lead to undefined behavior or unnecessary allocation attempts. Formatting this block and adding a defensive check for nv > 0 improves both readability and robustness.

std::vector<std::string> vocab; { int ki = gguf_find_key(gg, "sv.vocab"); if (ki >= 0) { int nv = gguf_get_arr_n(gg, ki); if (nv > 0) { vocab.resize(nv); for (int i = 0; i < nv; i++) { const char* s = gguf_get_arr_str(gg, ki, i); vocab[i] = s ? s : ""; } } } }

fix: detok review findings (null gguf_get_arr_str guard + utf-8 token…

3e49658

…s.json read)

gemini-code-assist Bot reviewed Jun 20, 2026

View reviewed changes

LauraGPT merged commit 9cc0ebc into main Jun 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: detok review findings (null vocab guard + utf-8 tokens read)#3005

fix: detok review findings (null vocab guard + utf-8 tokens read)#3005
LauraGPT merged 1 commit into
mainfrom
fix/detok-robustness

LauraGPT commented Jun 20, 2026

gemini-code-assist Bot left a comment

gemini-code-assist Bot Jun 20, 2026

gemini-code-assist Bot Jun 20, 2026

Labels

1 participant

Uh oh!

Conversation

LauraGPT commented Jun 20, 2026

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Labels

1 participant