fix: detok review findings (null vocab guard + utf-8 tokens read)#3005
Conversation
There was a problem hiding this comment.
Code Review
This pull request improves vocabulary loading by explicitly using UTF-8 encoding when reading the JSON file in Python, and adds null checks when retrieving vocabulary strings from GGUF arrays in C++ (funasr-paraformer.cpp and funasr-sensevoice.cpp). The reviewer recommends formatting the long, single-line vocabulary extraction blocks in both C++ files to improve readability and adding defensive checks to ensure the array size is positive before resizing the vector.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| m.c.enc_blocks=rdi("pf.enc.num_blocks",50);m.c.dec_blocks=rdi("pf.dec.num_blocks",16);m.c.dec_att=rdi("pf.dec.att_layer_num",16); | ||
| m.c.dec3=rdi("pf.dec.decoders3",1);m.c.vocab=rdi("pf.vocab_size",8404);m.c.tail=rdf("pf.predictor.tail_threshold",0.45f);m.c.thresh=rdf("pf.predictor.threshold",1.0f); | ||
| std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"pf.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++)vocab[i]=gguf_get_arr_str(gg,ki,i);}} | ||
| std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"pf.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++){const char*s=gguf_get_arr_str(gg,ki,i); vocab[i]=s?s:"";}}} |
There was a problem hiding this comment.
The single-line block is extremely long (over 200 characters) and contains multiple nested statements, making it very difficult to read, maintain, and debug. Additionally, if gguf_get_arr_n returns a negative value or 0, calling vocab.resize directly can lead to undefined behavior or unnecessary allocation attempts. Formatting this block and adding a defensive check for nv > 0 improves both readability and robustness.
std::vector<std::string> vocab;
{
int ki = gguf_find_key(gg, "pf.vocab");
if (ki >= 0) {
int nv = gguf_get_arr_n(gg, ki);
if (nv > 0) {
vocab.resize(nv);
for (int i = 0; i < nv; i++) {
const char* s = gguf_get_arr_str(gg, ki, i);
vocab[i] = s ? s : "";
}
}
}
}| int qi=gguf_find_key(gg,"sv.query_tokens"); int nq=qi<0?0:(int)gguf_get_arr_n(gg,qi); | ||
| std::vector<int> qtok(nq); for(int i=0;i<nq;i++) qtok[i]=((const int32_t*)gguf_get_arr_data(gg,qi))[i]; | ||
| std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"sv.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++)vocab[i]=gguf_get_arr_str(gg,ki,i);}} | ||
| std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"sv.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++){const char*s=gguf_get_arr_str(gg,ki,i); vocab[i]=s?s:"";}}} |
There was a problem hiding this comment.
The single-line block is extremely long (over 200 characters) and contains multiple nested statements, making it very difficult to read, maintain, and debug. Additionally, if gguf_get_arr_n returns a negative value or 0, calling vocab.resize directly can lead to undefined behavior or unnecessary allocation attempts. Formatting this block and adding a defensive check for nv > 0 improves both readability and robustness.
std::vector<std::string> vocab;
{
int ki = gguf_find_key(gg, "sv.vocab");
if (ki >= 0) {
int nv = gguf_get_arr_n(gg, ki);
if (nv > 0) {
vocab.resize(nv);
for (int i = 0; i < nv; i++) {
const char* s = gguf_get_arr_str(gg, ki, i);
vocab[i] = s ? s : "";
}
}
}
}
Addresses the gemini-code-assist findings on the merged in-binary detok (#3004/#302).
gguf_get_arr_str(funasr-sensevoice.cpp / funasr-paraformer.cpp): a malformed GGUF could returnnullptrfor a vocab element; assigning that tostd::stringis UB. Now coerces to empty string.tokens.jsonread (export_paraformer_gguf.py): usewith open(..., encoding="utf-8")— a context manager (no fd leak) and explicit UTF-8 so the non-ASCII tokens load correctly on Windows (default cp1252).Not changed: adjacent VAD-segment texts are concatenated without a separator — that is intentional for Chinese, where the VAD splits fall mid-utterance and a space would fragment the sentence (CER is unaffected either way).
--idsstill gives raw ids if a caller wants per-segment structure.Verified: rebuilt, both binaries still print the correct text (sample + full 002).