Please try native formats instead of disabling dynamic VRAM. #14577
CodeRabbit: No actionable comments were generated in the recent review (1 file selected for processing). Pre-merge checks: 4 passed, 1 failed (inconclusive).
So GGUF support will still be removed in the future. That's really bad news: GGUF uses about half the memory and disk space of FP8, and I think the quality is better even at Q4. GGUF is also faster than NVFP4 on cards older than the RTX 50 series.
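For a rough sense of the size comparison, here is a minimal Python sketch of weights-only storage per format. It assumes the ggml Q8_0 and Q4_0 block layouts (32 weights per block plus one fp16 scale, i.e. 8.5 and 4.5 bits per weight). Real GGUF files often mix quant types (K-quants, unquantized norm layers), and runtime VRAM use also depends on how the loader dequantizes, so treat these numbers as approximations. The 12B parameter count is hypothetical.

```python
# Approximate weight-storage cost per format (weights only; ignores file
# metadata, layers kept in higher precision, and activations).
# Q8_0 / Q4_0 follow ggml's block layouts: 32 weights + one fp16 scale per block.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "fp8": 8.0,
    "Q8_0": (2 + 32) * 8 / 32,  # 2-byte scale + 32 int8 values   -> 8.5 bpw
    "Q4_0": (2 + 16) * 8 / 32,  # 2-byte scale + 32 4-bit values  -> 4.5 bpw
}


def weight_gib(num_params: float, fmt: str) -> float:
    """Approximate size in GiB of num_params weights stored in format fmt."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


if __name__ == "__main__":
    n = 12e9  # hypothetical 12B-parameter model
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>5}: {weight_gib(n, fmt):6.2f} GiB")
    ratio = BITS_PER_WEIGHT["Q4_0"] / BITS_PER_WEIGHT["fp8"]
    print(f"Q4_0 / fp8 size ratio: {ratio:.4f}")
```

By this arithmetic Q4_0 weights take about 56% of the FP8 size (4.5 vs 8 bits per weight), so "about half" holds for weight storage, while Q8_0 is slightly larger than FP8.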
NVFP4 runs so badly on my RTX 4080 that even FP8 spilling out of VRAM (Ideogram at 4.2 MP) runs better, lol.
"Please try" is a lot different from "too bad, tough luck." So please try not to use euphemisms when making controversial changes.
I have seen detailed data showing Q8_0 GGUF soundly beating FP8 in perplexity. I have yet to see anyone produce data showing otherwise, but I would very much like to see it if anyone can.
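As background on why Q8_0 tends to stay close to the original weights, here is a minimal Python sketch of Q8_0-style block quantization: each block of 32 weights is stored as int8 codes plus one scale d = max|x| / 127, so the per-weight rounding error is at most d/2. Simplifications (assumptions of this sketch, not ggml's exact code): the scale stays a Python float instead of fp16, and Python's round() rounds halves to even, whereas ggml's reference code rounds halves away from zero. This only illustrates the format; it does not by itself show how Q8_0 compares to FP8 in perplexity, which is the empirical question raised here.

```python
import random

BLOCK_SIZE = 32  # weights per block in ggml's Q8_0 format


def quantize_q8_0(block):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    amax = max(abs(x) for x in block)
    d = amax / 127.0               # one scale per block
    inv_d = 1.0 / d if d else 0.0  # all-zero block -> all-zero codes
    return d, [round(x * inv_d) for x in block]


def dequantize_q8_0(d, codes):
    """Reconstruct approximate floats from a block scale and its int8 codes."""
    return [d * c for c in codes]


def max_abs_error(weights):
    """Largest |original - reconstructed| over all blocks of a weight list."""
    worst = 0.0
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        d, codes = quantize_q8_0(block)
        restored = dequantize_q8_0(d, codes)
        worst = max(worst, max(abs(a - b) for a, b in zip(block, restored)))
    return worst


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical layer: 4096 weights drawn from N(0, 0.02).
    weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
    print(f"max abs error: {max_abs_error(weights):.2e}")
```

Because each block gets its own scale, a single outlier only coarsens the 31 other weights in its block rather than the whole tensor.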
Quality issues aside, dynamic VRAM often interferes with compilation. Maybe this will change with native int8, but that's a huge unknown. All dynamic VRAM has done for me is introduce headaches and overhead: I literally lose 0.20 s to it shuffling the 20 MB TAE VAE around, which doesn't happen without dynamic VRAM. I am eager to try the CUTLASS/cuBLAS int8 kernels, though.
GGUF support won't be removed; it just has to be modified to work with the dynamic VRAM system to get most of the speed back. There is an open PR, so you are free to test or apply it.
My experience with dynamic VRAM has been negative overall, sorry to say. It solves a problem I don't have on my system and introduces many others.
Counterpoint: Dynamic VRAM solved my main gripes with ComfyUI memory management, and it actually works, unlike the previous smart memory, which (for me on Windows; Linux was and still is problem-free) only worked if I tuned the --reserve-vram flag per workflow or just left it inefficiently high. I recognize there have been issues, and perhaps still are on specific setups, but overall it's been an improvement for most users. For compile, I have a workaround in KJNodes: the advanced compile node has worked fine on the latest image models such as Ideogram4 and Krea2 with dynamic VRAM enabled. Getting a full graph has always been next to impossible in Comfy, so no change there, but dynamic VRAM without some extra compile guards definitely makes it worse; the version of my node that addresses that is already available to test. Officially supporting compile is a rather complex endeavor, so instead we've been applying targeted custom kernels to the hotspots where compile gave the biggest gains; on such models compile doesn't really give any benefit anymore. Can you elaborate on the TAE, though? 0.2 seconds? 20 seconds? As a live preview? Because that sounds like a bug that should be addressed in the TAE code.
That's great to hear. |
I use TAE as the final VAE instead of the "proper" one in a few workflows (WFs). With dynamic VRAM disabled, it either stays in VRAM or gets moved faster. When I enable aimdo, the workflow slows down by 0.20 s and I see the [tiny] VAE being shuffled around. These WFs are built for speed and well tested, so I notice performance regressions immediately.
Eh... I see complaints about it all the time on Reddit. For VRAM-limited ("vramlet") systems it probably solves OOMs and lets them run models they normally couldn't, albeit slowly. It hurt a bunch of custom nodes, too. I dread the day it can no longer be turned off. I do admit it's in much better shape than it used to be, but I've yet to get a single benefit from it. On my system I reserve no VRAM for the OS, so Comfy has free run of the GPU; I have plenty of system RAM, while loading from disk is slow. I'm 180 degrees from the target audience. At least we finally have the --high-ram flag.
I tried the native int8 in kitchen today, too. Both CUDA kernels are slower on my Turing card. The Triton path is, for some reason, gated on SM_80, even though it works when I comment that check out. After adding my previous autotune ( Compile under dynamic VRAM actually works, but again there is a slight penalty. Maybe the CUDA kernels are faster if you use xformers or some attention other than Sage? Triton always auto-compiles partially, so it could be related to that. I can't say exactly why adding a compile node afterwards drops half a second for the models I'm using, but it demonstrably does. A lot of the speed "fixes" target newer cards like Blackwell, which aren't in my budget; I'm still on Turing/Ampere.
No description provided.