Skip to content

fix: detok review findings (null vocab guard + utf-8 tokens read)#3005

Merged
LauraGPT merged 1 commit into
mainfrom
fix/detok-robustness
Jun 20, 2026
Merged

fix: detok review findings (null vocab guard + utf-8 tokens read)#3005
LauraGPT merged 1 commit into
mainfrom
fix/detok-robustness

Conversation

@LauraGPT

Copy link
Copy Markdown
Collaborator

Addresses the gemini-code-assist findings on the merged in-binary detok (#3004/#302).

  • Null guard on gguf_get_arr_str (funasr-sensevoice.cpp / funasr-paraformer.cpp): a malformed GGUF could return nullptr for a vocab element; assigning that to std::string is UB. Now coerces to empty string.
  • tokens.json read (export_paraformer_gguf.py): use with open(..., encoding="utf-8") — a context manager (no fd leak) and explicit UTF-8 so the non-ASCII tokens load correctly on Windows (default cp1252).

Not changed: adjacent VAD-segment texts are concatenated without a separator — that is intentional for Chinese, where the VAD splits fall mid-utterance and a space would fragment the sentence (CER is unaffected either way). --ids still gives raw ids if a caller wants per-segment structure.

Verified: rebuilt, both binaries still print the correct text (sample + full 002).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves vocabulary loading by explicitly using UTF-8 encoding when reading the JSON file in Python, and adds null checks when retrieving vocabulary strings from GGUF arrays in C++ (funasr-paraformer.cpp and funasr-sensevoice.cpp). The reviewer recommends formatting the long, single-line vocabulary extraction blocks in both C++ files to improve readability and adding defensive checks to ensure the array size is positive before resizing the vector.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

m.c.enc_blocks=rdi("pf.enc.num_blocks",50);m.c.dec_blocks=rdi("pf.dec.num_blocks",16);m.c.dec_att=rdi("pf.dec.att_layer_num",16);
m.c.dec3=rdi("pf.dec.decoders3",1);m.c.vocab=rdi("pf.vocab_size",8404);m.c.tail=rdf("pf.predictor.tail_threshold",0.45f);m.c.thresh=rdf("pf.predictor.threshold",1.0f);
std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"pf.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++)vocab[i]=gguf_get_arr_str(gg,ki,i);}}
std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"pf.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++){const char*s=gguf_get_arr_str(gg,ki,i); vocab[i]=s?s:"";}}}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The single-line block is extremely long (over 200 characters) and contains multiple nested statements, making it very difficult to read, maintain, and debug. Additionally, if gguf_get_arr_n returns a negative value or 0, calling vocab.resize directly can lead to undefined behavior or unnecessary allocation attempts. Formatting this block and adding a defensive check for nv > 0 improves both readability and robustness.

  std::vector<std::string> vocab;
  {
    int ki = gguf_find_key(gg, "pf.vocab");
    if (ki >= 0) {
      int nv = gguf_get_arr_n(gg, ki);
      if (nv > 0) {
        vocab.resize(nv);
        for (int i = 0; i < nv; i++) {
          const char* s = gguf_get_arr_str(gg, ki, i);
          vocab[i] = s ? s : "";
        }
      }
    }
  }
int qi=gguf_find_key(gg,"sv.query_tokens"); int nq=qi<0?0:(int)gguf_get_arr_n(gg,qi);
std::vector<int> qtok(nq); for(int i=0;i<nq;i++) qtok[i]=((const int32_t*)gguf_get_arr_data(gg,qi))[i];
std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"sv.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++)vocab[i]=gguf_get_arr_str(gg,ki,i);}}
std::vector<std::string> vocab; {int ki=gguf_find_key(gg,"sv.vocab"); if(ki>=0){int nv=gguf_get_arr_n(gg,ki); vocab.resize(nv); for(int i=0;i<nv;i++){const char*s=gguf_get_arr_str(gg,ki,i); vocab[i]=s?s:"";}}}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The single-line block is extremely long (over 200 characters) and contains multiple nested statements, making it very difficult to read, maintain, and debug. Additionally, if gguf_get_arr_n returns a negative value or 0, calling vocab.resize directly can lead to undefined behavior or unnecessary allocation attempts. Formatting this block and adding a defensive check for nv > 0 improves both readability and robustness.

  std::vector<std::string> vocab;
  {
    int ki = gguf_find_key(gg, "sv.vocab");
    if (ki >= 0) {
      int nv = gguf_get_arr_n(gg, ki);
      if (nv > 0) {
        vocab.resize(nv);
        for (int i = 0; i < nv; i++) {
          const char* s = gguf_get_arr_str(gg, ki, i);
          vocab[i] = s ? s : "";
        }
      }
    }
  }
@LauraGPT LauraGPT merged commit 9cc0ebc into main Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant