vLLM Source Code: Disaggregated Architecture

1. Background

This article analyzes how vLLM's disaggregated (prefill/decode) architecture works.

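Disaggregated architectures for LLM inference have been a hot topic this year, and the author has previously written up some notes on the technique.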

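As one of the most popular LLM inference frameworks, vLLM iterates on features very quickly. The author has also previously written up some personal notes on vLLM.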

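The vLLM community currently has two PRs for a disaggregated architecture, listed below. This article uses the first one as the basis for walking through a simple implementation of disaggregated serving in vLLM.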

https://github.com/vllm-project/vllm/pull/8498
https://github.com/vllm-project/vllm/pull/9079

2. vLLM's disaggregated architecture

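This chapter covers the community implementation of the disaggregated architecture. The implementation is currently fairly simple and the code is compact: it has no complex scheduling, KV cache pool, or similar features, which makes it suitable for beginners.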

2.1 Overall architecture

quick start

With vLLM installed as required, running benchmarks/disagg_benchmarks/visualize_benchmark_results.py completes a test. Let us break it down briefly to see how it is actually used.

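We can see that three processes are started first, after which a single curl request is enough. The first process is a proxy, and the other two are the vLLM producer and consumer processes.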

Next comes the overall architecture of vLLM's disaggregated serving. The figure below is the design diagram from the original PR.

The architecture above works as follows:

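1. The system has three roles: Proxy API server, vLLM prefill, and vLLM decode.
2. When a request arrives, it first enters the Proxy API server, which sends it to vLLM prefill. vLLM prefill runs prefill, produces the KV cache and forwards it, and returns the first token to the Proxy API server.
3. The Proxy API server then sends the request to vLLM decode, which performs a drop_select operation. As its comment shows, this operation mainly fetches the KV cache and skips prefill. After drop_select, vLLM decode produces the first token.
4. As we can see, vLLM decode also produces the first token. All tokens that the Proxy API server finally returns actually come from vLLM decode; vLLM prefill is only responsible for producing the KV cache.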

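Point 4 does look a bit odd: why does vLLM decode still compute the first token, and how does its first-token computation differ from the one in vLLM prefill? These questions are answered later during the code analysis. Before that, let us look at the Proxy API server code. It is as simple as the architecture diagram: it just sends one request to vLLM prefill and one to vLLM decode.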
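
The proxy flow just described can be sketched as follows. This is not the PR's code: the endpoint URLs, the `handle_request` and `Sender` names, and the trick of capping the prefill request at one token are assumptions made for illustration, and the HTTP call is injected as a plain function so the sketch runs without a server.

```python
from typing import Callable, Dict

# Hypothetical endpoints; the real ports depend on how the two vLLM
# instances were launched.
PREFILL_URL = "http://localhost:8100/v1/completions"
DECODE_URL = "http://localhost:8200/v1/completions"

# A sender takes (url, json_body) and returns the JSON response.
Sender = Callable[[str, Dict], Dict]


def handle_request(request: Dict, send: Sender) -> Dict:
    """Forward one completion request through prefill, then decode."""
    # Step 1 (assumption): cap the prefill request at one token so the
    # prefill instance only runs prefill. Its response is discarded; what
    # matters is the KV cache it hands to the decode instance.
    prefill_request = dict(request)
    prefill_request["max_tokens"] = 1
    send(PREFILL_URL, prefill_request)

    # Step 2: send the unmodified request to the decode instance and
    # return its answer to the client.
    return send(DECODE_URL, request)
```

In a real deployment `send` would be an HTTP POST, and the decode response would normally be streamed back to the client rather than returned in one piece.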

2.2 Core components

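To summarize the request flow: there are three instances, Proxy API server, vLLM prefill, and vLLM decode (an instance can be thought of as a process, although in practice one instance may consist of several processes). A request first goes through prefill computation on vLLM prefill. Once that finishes, the KV cache is sent to vLLM decode, which carries out the subsequent decode phase. Let us go through the PR to see roughly what changes are needed to implement this.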

The PR mainly adds and modifies the files above, explained as follows:

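kv_pipe: as the name suggests, this is a communication channel. vLLM prefill uses this class to send messages to vLLM decode, and vLLM decode uses it to receive them. A message here is specifically a torch Tensor, and the underlying transport also uses torch's distributed APIs such as send and recv.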

from abc import ABC, abstractmethod
from typing import Optional

import torch


class KVPipeBase(ABC):
    """
    This class provides an interface for sending and receiving tensors, or
    None, by distributed communications.
    """

    @abstractmethod
    def send_tensor(self, tensor: Optional[torch.Tensor]) -> None:
        """Send a tensor, or None, via the pipe.
        
        Need to support sending None -- important for error handling.
        
        TODO: add a `key` argument so that we can use traditional 
        key-value database as the distributed communication mechanism behind 
        the pipe.

        Args:
            tensor (Optional[torch.Tensor]): The tensor to be sent. Can be None.

        Raises:
            NotImplementedError: This method must be implemented in subclasses.
        """
        raise NotImplementedError

    @abstractmethod
    def recv_tensor(self) -> Optional[torch.Tensor]:
        """Receive a tensor (can be None) from the pipeline.

        Returns:
            Optional[torch.Tensor]: The tensor received from the pipeline. Can 
                                    be None.

        Raises:
            NotImplementedError: This method must be implemented in subclasses.
        """
        raise NotImplementedError

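kv_buffer: kv_pipe sends plain torch Tensors, which is low-level and inconvenient, so a further abstraction is needed before upper layers can use it. That abstraction is the KV buffer. It has two functions, insert and drop_select: vLLM prefill uses insert to send KV data, and vLLM decode uses drop_select to receive it. The messages exchanged are input_tokens, roi, key, value, and hidden, at the granularity of a single attention layer.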

from abc import ABC, abstractmethod
from typing import List, Optional

import torch


class KVLookupBufferBase(ABC):
    """
    Abstract base class for a lookup buffer.

    This class provides an abstraction for a key-value (KV) cache lookup buffer.
    
    The key of the lookup buffer:
    - input_tokens: token IDs of the request
    - roi: a binary mask on top of input_tokens.
      - Purpose of roi: Since KV cache may only be available for a subset of 
        tokens in the input (for example, when vLLM is connected to an external 
        KV cache service), roi specifies the subset of tokens that the KV cache 
        is associated with.
      - NOTE: roi can be further extended to describe which part of KV the 
        current process is holding (each process may only hold a part of KV 
        due to TP and PP). This is not implemented for now.
        
    The value of the lookup buffer:
    - key: the key tensor in the KV cache
    - value: the value tensor in the KV cache
    - hidden: the final hidden state generated by model forwarding. This allows 
      vLLM to bypass further model forwarding by transmitting the hidden state.
    """

    @abstractmethod
    def insert(self, input_tokens: torch.Tensor, roi: torch.Tensor,
               key: torch.Tensor, value: torch.Tensor,
               hidden: torch.Tensor) -> None:
        """Insert into the lookup buffer.
        
        The functionality is similar to the following python statement
        ```
        buffer[input_tokens, roi] = [key, value, hidden]
        ```
        
        FIXME: in the future, we should only have two arguments, key and value,
        where key is a tensor dict and value is a tensor dict.
        
        FIXME: we should transmit both sampler outputs and the hidden states.

        Args:
            input_tokens (torch.Tensor): token IDs.
            roi (torch.Tensor): A binary mask on top of the input tokens
            key (torch.Tensor): The key tensor in the KV cache.
            value (torch.Tensor): The value tensor in the KV cache.
            hidden (torch.Tensor): The final hidden state tensor generated 
                                   during model forwarding to bypass model 
                                   forwarding.

        Raises:
            NotImplementedError: This method must be implemented in subclasses.
        """
        raise NotImplementedError

    @abstractmethod
    def drop_select(
            self, input_tokens: Optional[torch.Tensor],
            roi: Optional[torch.Tensor]) -> List[Optional[torch.Tensor]]:
        """Select and *drop* KV cache entries from the lookup buffer.
        
        The functionality is similar to the following python statements
        ```
        ret = buffer.pop(input_tokens, roi)
        return ret
        ```
        
        If `input_tokens` and `roi` is `None`, it means selecting any of the
        KV caches in the buffer, return, and remove it from the buffer, useful
        when offloading KV cache to KV cache storage service.

        Args:
            input_tokens (torch.Tensor): token IDs.
            roi (torch.Tensor): A binary mask on top of the input tokens

        Returns:
            List[Optional[torch.Tensor]]: A list of tensors. Can be None.

        Raises:
            NotImplementedError: This method must be implemented in subclasses.
        """
        raise NotImplementedError
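
To make the insert / drop_select semantics concrete, here is a minimal single-process sketch. It is an illustration only: `InMemoryLookupBuffer` is a hypothetical name, plain Python lists stand in for tensors, nothing is sent between processes, and the matching rule (exact equality of the tokens selected by roi) is a simplifying assumption rather than the PR's rule.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Sequence, Tuple

# (input_tokens, roi, key, value, hidden)
Entry = Tuple[List[int], List[bool], Any, Any, Any]


class InMemoryLookupBuffer:
    """Toy stand-in for a KV lookup buffer (hypothetical, single process)."""

    def __init__(self) -> None:
        self._entries: Deque[Entry] = deque()

    def insert(self, input_tokens: Sequence[int], roi: Sequence[bool],
               key: Any, value: Any, hidden: Any) -> None:
        # Roughly: buffer[input_tokens, roi] = [key, value, hidden]
        self._entries.append(
            (list(input_tokens), list(roi), key, value, hidden))

    def drop_select(self, input_tokens: Optional[Sequence[int]] = None,
                    roi: Optional[Sequence[bool]] = None) -> List[Any]:
        # Roughly: buffer.pop(input_tokens, roi). The matching entry is
        # removed, so each KV cache can be consumed only once.
        for index, entry in enumerate(self._entries):
            if self._matches(entry, input_tokens, roi):
                del self._entries[index]
                return list(entry)
        return [None] * 5  # nothing matched

    @staticmethod
    def _matches(entry: Entry, input_tokens: Optional[Sequence[int]],
                 roi: Optional[Sequence[bool]]) -> bool:
        if input_tokens is None:
            return True  # "select any entry"
        if roi is None:
            roi = [True] * len(input_tokens)
        wanted = [t for t, keep in zip(input_tokens, roi) if keep]
        stored = [t for t, keep in zip(entry[0], entry[1]) if keep]
        return wanted == stored
```

Calling `drop_select()` with no arguments pops whichever entry is oldest, which mirrors the "select any of the KV caches" case mentioned in the docstring above.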

vllm_adapter and model_runner tie the whole flow together: they drive vLLM prefill to send the KV cache after its forward pass, and vLLM decode to fetch the KV cache before its forward pass.

The overall flow through the components is shown in the figure below.

The figure is explained as follows:

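After receiving a request, both vLLM prefill and vLLM decode run ModelRunner's execute_model, which relies on the KV_transfer_agent class to send and receive the KV cache of all attention layers. KV_transfer_agent in turn relies on KVLookupBufferBase to send and receive the KV cache of a single attention layer. KVPipeBase provides the lowest-level tensor send and receive (there is also data-protocol framing and process control here; anyone who has built an RPC system will recognize it, and others can see it in the code later).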

2.3 kv_pipe

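The code in the figure above is the unit test for the KV pipe. There are two processes, rank 0 and rank 1, and every send on rank 0 has a matching receive on rank 1: the two always come as a pair.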

The code above is the implementation of kv_pipe's send_tensor method. Two points stand out:

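Sending in the prefill phase is asynchronous.
send_metadata mainly records the data type, the number of dimensions, and the shape of the tensor to be sent. This is easy to understand: prefill, as the sender, knows this information, but how would decode, as the receiver, know it? The answer is that decode first receives the metadata, which tells it the shape of the data to receive next.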

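The code above is the implementation of recv_tensor. It does indeed first receive the metadata and then receive the actual tensor. The figure below shows the send and receive thread stacks of the pipe class on the prefill and decode sides.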
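
The metadata-then-payload protocol can be sketched without torch. In the illustration below, `QueuePipe` is a hypothetical class, an in-process `queue.Queue` stands in for the distributed send/recv calls, and a tensor is modeled as a `(shape, flat_values)` pair. The point is only that the receiver cannot decode the raw payload until it has read the metadata, and that `None` is sent as metadata with no payload.

```python
import array
import queue
from math import prod
from typing import List, Optional, Tuple

# Stand-in for torch.Tensor: (shape, flat list of values).
Tensor = Tuple[Tuple[int, ...], List[float]]


class QueuePipe:
    """Hypothetical in-process pipe: metadata first, raw bytes second."""

    def __init__(self) -> None:
        self._channel: "queue.Queue" = queue.Queue()

    def send_tensor(self, tensor: Optional[Tensor]) -> None:
        if tensor is None:
            # None is encoded as empty metadata with no payload after it.
            self._channel.put(None)
            return
        shape, values = tensor
        # 1) metadata: element type and shape
        self._channel.put({"typecode": "d", "shape": tuple(shape)})
        # 2) payload: raw bytes, meaningless without the metadata
        self._channel.put(array.array("d", values).tobytes())

    def recv_tensor(self) -> Optional[Tensor]:
        metadata = self._channel.get()
        if metadata is None:
            return None
        raw = self._channel.get()
        values = array.array(metadata["typecode"])
        values.frombytes(raw)
        # The shape from the metadata tells us how many elements to expect.
        assert len(values) == prod(metadata["shape"])
        return metadata["shape"], values.tolist()
```

Because each send has exactly one matching receive, the two sides stay in lockstep as long as they agree on the order of messages, which is the pairing the unit test demonstrates.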

2.4 kv_buffer

Now look at the unit test for kv_buffer.

Again there is one prefill and one decode, split into two processes, but the data being sent is now abstracted into tokens, roi, key, value, and hidden.

The insert code on the prefill side is shown in the figure below.

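insert mainly calls _add_to_buffer, which is easy to understand: the data is first added to a local buffer. The thread-related code below it runs only once: if no thread has been created yet, one is created to run drop_select_handler in the background. This is essentially a producer-consumer model: insert keeps calling _add_to_buffer, while drop_select_handler keeps processing the buffer in the background. How it processes the buffer is shown in the code.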

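On the prefill side, drop_select_handler is a while loop that keeps processing buffer data but blocks on self.signal_pipe.recv_tensor(). In essence, many of the communication operations in drop_select_handler are paired with communication operations in drop_select on the decode side; the figure draws a line between each pair.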
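
The producer-consumer handshake between insert, drop_select_handler, and drop_select can be sketched with threads and queues. This is a simplified illustration under stated assumptions: `PrefillSideBuffer` and the two queue names are hypothetical, a single request queue replaces the separate signal and data exchanges, and entries are matched by exact token equality.

```python
import queue
import threading
from collections import deque
from typing import Any, Deque, Optional, Sequence, Tuple


class PrefillSideBuffer:
    """Hypothetical prefill-side buffer with a lazily started handler."""

    def __init__(self, request_pipe: queue.Queue,
                 data_pipe: queue.Queue) -> None:
        self._request_pipe = request_pipe  # decode -> prefill
        self._data_pipe = data_pipe        # prefill -> decode
        self._entries: Deque[Tuple[tuple, Any]] = deque()
        self._changed = threading.Condition()
        self._handler: Optional[threading.Thread] = None

    def insert(self, tokens: Sequence[int], kv: Any) -> None:
        with self._changed:
            self._entries.append((tuple(tokens), kv))  # local buffer only
            self._changed.notify_all()
            if self._handler is None:  # start the handler exactly once
                self._handler = threading.Thread(
                    target=self._drop_select_handler, daemon=True)
                self._handler.start()

    def _drop_select_handler(self) -> None:
        while True:
            wanted = self._request_pipe.get()  # blocks until decode asks
            if wanted is None:                 # shutdown signal
                return
            with self._changed:
                while True:
                    match = next(
                        (e for e in self._entries if e[0] == wanted), None)
                    if match is not None:
                        break
                    self._changed.wait()       # wait for a later insert
                self._entries.remove(match)
            self._data_pipe.put(match[1])      # paired with decode's get


def drop_select(request_pipe: queue.Queue, data_pipe: queue.Queue,
                tokens: Sequence[int], timeout: float = 5.0) -> Any:
    """Decode side: ask for the KV of `tokens`, then wait for the reply."""
    request_pipe.put(tuple(tokens))
    return data_pipe.get(timeout=timeout)
```

Each `put` on one side is matched by a `get` on the other, which is the pairing the figure illustrates with connecting lines.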

2.5 Model integration

A vLLM process calls different methods of KV_transfer_agent depending on whether it is prefill or decode. A prefill instance calls send_kv_caches_and_hidden_states after model execution; a decode instance calls recv_kv_caches_and_hidden_states before model execution.
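
The role-dependent hooks around the forward pass can be sketched as below. The two method names come from the text above, but their signatures here are simplified assumptions, and `FakeTransferAgent`, `execute_model`, and the string roles are hypothetical stand-ins rather than vLLM's ModelRunner code.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

Forward = Callable[[Sequence[int]], Tuple[Any, List[float]]]


class FakeTransferAgent:
    """Hypothetical in-process stand-in for the KV transfer agent."""

    def __init__(self) -> None:
        self._store: Dict[tuple, Tuple[Any, List[float]]] = {}

    def send_kv_caches_and_hidden_states(
            self, tokens: Sequence[int], kv: Any,
            hidden: List[float]) -> None:
        self._store[tuple(tokens)] = (kv, hidden)

    def recv_kv_caches_and_hidden_states(
            self, tokens: Sequence[int]
    ) -> Optional[Tuple[Any, List[float]]]:
        return self._store.pop(tuple(tokens), None)


def execute_model(role: str, tokens: Sequence[int], forward: Forward,
                  agent: FakeTransferAgent) -> List[float]:
    if role == "decode":
        # Decode: try to fetch KV and hidden state BEFORE the forward pass.
        received = agent.recv_kv_caches_and_hidden_states(tokens)
        if received is not None:
            _kv, hidden = received  # real code would store kv in the cache
            return hidden           # forward pass skipped entirely
    kv, hidden = forward(tokens)
    if role == "prefill":
        # Prefill: publish KV and hidden state AFTER the forward pass.
        agent.send_kv_caches_and_hidden_states(tokens, kv, hidden)
    return hidden
```

If nothing has been received, this sketch falls back to running the forward pass locally; whether the real implementation does the same is not something the text above states.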

The code of send_kv_caches_and_hidden_states is shown in the figure below.

The code of recv_kv_caches_and_hidden_states is shown in the figure below.

The left side receives the KV cache, and the right side stores it for later use by PagedAttention.

There is also a hidden tensor here. As analyzed earlier, the decode phase also needs to compute the first token, and decode can do so once it has received hidden.
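
Why the hidden state is enough to produce the first token can be shown with a toy example: the final hidden state is projected to vocabulary logits and a token is picked from them. The function name, the tiny `lm_head` matrix, and greedy (argmax) selection are all assumptions for illustration; vLLM's actual logits computation and sampler are more involved.

```python
from typing import List, Sequence


def first_token_from_hidden(hidden: Sequence[float],
                            lm_head: Sequence[Sequence[float]]) -> int:
    """Project a final hidden state to logits and pick the argmax token.

    `lm_head` has one row of weights per vocabulary entry (hypothetical).
    """
    logits: List[float] = [
        sum(w * h for w, h in zip(row, hidden)) for row in lm_head
    ]
    # Greedy choice (assumption): index of the largest logit.
    return max(range(len(logits)), key=logits.__getitem__)


# Toy 3-token vocabulary with a 2-dimensional hidden state.
toy_lm_head = [
    [0.5, 0.0],   # token 0 -> logit 0.5
    [0.0, 1.0],   # token 1 -> logit 2.0
    [1.0, -1.0],  # token 2 -> logit -1.0
]
token_id = first_token_from_hidden([1.0, 2.0], toy_lm_head)
```

No attention over the prompt is needed for this step, which is why the decode instance can skip the prefill forward pass once it holds both the KV cache and hidden.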

3. Summary

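This article only analyzes the main flow and leaves out some details; readers can debug further as needed. The implementation in this PR is currently a baseline version: features such as layer-wise communication and prefix caching are not yet covered.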

Source: https://mp.weixin.qq.com/s/7eckqRQrXMF3md1kDndY9g