How about layer-wise inference?
Andre Bieler
Mastering LLM (Large Language Model)
Oct 15, 2024
This doesn't really work in real time. It reduces VRAM usage significantly, but the tokens/sec throughput is unusable, even for chat-based applications.
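To put rough numbers on that trade-off, here is a back-of-envelope sketch. It assumes every layer's weights are streamed into VRAM once per generated token and that the transfer is the only cost (no compute, no KV cache, no overlap of transfer and compute). The function name `layerwise_estimate` and all figures (70B parameters, fp16, 80 layers, 25 GB/s transfer bandwidth) are illustrative assumptions, not measurements.

```python
def layerwise_estimate(n_params, bytes_per_param, n_layers, bandwidth_bytes_per_s):
    """Rough bounds for layer-wise inference (hypothetical helper).

    Assumes all layers are the same size and that each generated token
    requires streaming the full set of weights into VRAM exactly once.
    """
    model_bytes = n_params * bytes_per_param
    # Only one layer's weights live in VRAM at a time.
    layer_bytes = model_bytes / n_layers
    # Transfer time alone sets a floor on per-token latency.
    seconds_per_token = model_bytes / bandwidth_bytes_per_s
    return {
        "peak_weight_vram_gb": layer_bytes / 1e9,
        "max_tokens_per_sec": 1.0 / seconds_per_token,
    }


# Illustrative numbers only: 70B params, fp16 (2 bytes), 80 layers, 25 GB/s link.
est = layerwise_estimate(70e9, 2, 80, 25e9)
print(est)
```

Under these assumptions the weights need under 2 GB of VRAM at any moment, but the ceiling is well below one token per second, which is the point: the memory saving is real, the throughput is not usable for interactive chat.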