Large language models (LLMs) such as GPT and PaLM are changing how we work and interact, powering everything from programming assistants to general-purpose chatbots. However, serving these incredibly powerful models is very expensive, often about 10 times more costly than a traditional keyword search.
The hidden memory eater: the KV cache
LLMs are based on the Transformer architecture, which generates text one token at a time, writes Xrust. To do this efficiently, the model must remember the "context" of the previous tokens. That memory is stored in the so-called key-value (KV) cache. You can think of it as the LLM's short-term memory for a conversation.
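To make the idea concrete, here is a minimal, hypothetical sketch (plain NumPy, a single attention head, made-up dimensions and random weights, nothing from a real model) of why the cache grows by one key/value pair for every generated token:

```python
import numpy as np

# Toy single-head attention step with a KV cache.
# All shapes and weights below are illustrative assumptions.
d = 8                       # head dimension (assumption)
Wq = np.random.randn(d, d)  # projections of a hypothetical layer
Wk = np.random.randn(d, d)
Wv = np.random.randn(d, d)

kv_cache = {"k": [], "v": []}   # grows by one entry per generated token

def decode_step(x):
    """Attend the new token's query over all cached keys and values."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    kv_cache["k"].append(k)      # past keys/values are kept around,
    kv_cache["v"].append(v)      # so they never have to be recomputed
    K = np.stack(kv_cache["k"])  # (seq_len, d)
    V = np.stack(kv_cache["v"])
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V           # context vector for the next-token logits

for step in range(5):                       # generate 5 tokens
    token_embedding = np.random.randn(d)    # stand-in for a real embedding
    _ = decode_step(token_embedding)

print(len(kv_cache["k"]))        # 5 -- the cache grew with every token
```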
The problem is that this KV cache is huge, and it grows and shrinks dynamically over the lifetime of each request. Existing systems struggle with this because they typically store the KV cache in a single contiguous block of memory. That approach leads to two serious problems. The first is memory fragmentation:
- Internal fragmentation. Systems pre-allocate a large chunk of memory for each request based on the maximum possible output length (for example, 2048 tokens). If the request produces only a short output, most of the reserved memory goes unused, which leads to significant waste (a back-of-the-envelope sketch follows this list);
- External fragmentation. Because different requests reserve chunks of different sizes, GPU memory ends up scattered with small unusable gaps, which makes it hard to place new requests even when plenty of memory is nominally free. In existing systems, only 20.4–38.2% of KV cache memory is actually used to store token states; the rest is wasted.
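To get a feel for the scale of the waste, here is a rough estimate; the model configuration (roughly a 13B-parameter model served in fp16) and the output lengths are assumptions for illustration, not figures from the paper:

```python
# Back-of-the-envelope estimate of internal fragmentation.
# The configuration below is an assumption, not a measurement.
num_layers = 40
num_heads  = 40
head_dim   = 128
bytes_fp16 = 2

# Both K and V are cached for every layer and every token.
kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_fp16
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

max_output    = 2048   # slots reserved up front by a contiguous allocator
actual_output = 200    # tokens the request really generated (assumption)

reserved = max_output * kv_bytes_per_token
used     = actual_output * kv_bytes_per_token
print(f"reserved {reserved / 2**20:.0f} MiB, used {used / 2**20:.0f} MiB "
      f"({used / reserved:.0%} utilisation)")
```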
The second problem is the lack of memory sharing.
Advanced decoding methods, such as parallel sampling or beam search, often generate several outputs from a single input sequence, which means parts of the KV cache could be shared between them. However, existing systems cannot easily share that memory, because each sequence's KV cache lives in its own separate contiguous block.
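As a toy illustration of the duplication (pure Python, with invented prompt and sample lengths, reusing the per-token estimate from the sketch above):

```python
# Why per-sequence contiguous KV caches waste memory under parallel sampling.
# All sizes here are assumptions chosen only to show the effect.
prompt_len   = 512          # tokens shared by every sample (assumption)
sample_len   = 64           # tokens unique to each sample (assumption)
num_samples  = 8            # parallel samples requested from one prompt
kv_per_token = 800 * 1024   # bytes per token, from the earlier estimate

# Contiguous allocation: every sample carries its own copy of the prompt's KV.
no_sharing = num_samples * (prompt_len + sample_len) * kv_per_token

# Ideal sharing: the prompt's KV entries are stored once.
with_sharing = (prompt_len + num_samples * sample_len) * kv_per_token

print(f"without sharing: {no_sharing / 2**30:.1f} GiB")
print(f"with prompt sharing: {with_sharing / 2**30:.1f} GiB")
```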
Together, these inefficiencies severely limit the number of requests that can be processed simultaneously (the "batch size"), which directly reduces the system's throughput (how many tokens or requests it can process per second).
Enter PagedAttention. Researchers have developed a technique that solves these problems, taking virtual memory and paging in operating systems as its inspiration.
How PagedAttention works
PagedAttention divides each sequence's KV cache into fixed-size blocks. Since these KV blocks no longer have to be contiguous in physical memory, PagedAttention can allocate them dynamically, on demand. This practically eliminates internal fragmentation, because memory is allocated only as it is actually needed, and it removes external fragmentation, because all blocks are the same size.
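Below is a minimal sketch of the paging idea, assuming a toy in-memory block table; it illustrates on-demand allocation of same-sized blocks and is not vLLM's actual data structure:

```python
# Toy paged KV-cache allocator (illustrative only, not vLLM's implementation):
# the cache is carved into fixed-size blocks, and each sequence keeps a
# block table mapping its logical blocks to whatever physical blocks are free.
BLOCK_SIZE = 16  # tokens per KV block (the exact size is an assumption)

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full (or none exists yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            # Any free block will do: physical blocks need not be contiguous.
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_physical_blocks=1024)
for _ in range(40):              # a 40-token sequence occupies only 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])     # three block ids, allocated one at a time
```

Because every block holds the same number of token slots, any freed block can be reused by any other request, which is exactly why fixed-size blocks make external fragmentation disappear.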
LLMs devour a lot of memory, and PagedAttention solves this problem, Xrust concludes.
- If you liked this article, we recommend reading:
- Programming: Vibe Coding - revolution 2026
- Agentive and Physical AI - examples for dummies