Improving Real-Time Inference with Anchor Tokens

by Anchoring, October 10th, 2024

Too Long; Didn't Read

This section discusses how training models to compress information into anchor tokens can optimize inference by reducing keys/values caches. A new algorithm is introduced to manage cache efficiently during prefix processing and token generation in real-time.

Authors:

(1) Jianhui Pang, University of Macau; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab (nlp2ct.pangjh3@gmail.com);

(2) Fanghua Ye, University College London; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab (fanghua.ye.19@ucl.ac.uk);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab (corresponding author).

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks


3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References


A More Experimental Results

B Data Settings

3.3 Anchor-based Inference

By training the model to compress information into the anchor token of a natural language sequence, we can optimize the inference process by modifying the keys/values caching mechanism. Specifically, during inference, upon encountering an anchor token that condenses the comprehensive semantic information of preceding tokens in the current sequence, the model can reduce the keys/values caches by deleting the caches of non-anchor tokens within that sequence.
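The cache reduction described above can be sketched on a toy cache. This is a minimal illustration, not the authors' implementation: the names `CacheEntry`, `reduce_cache`, and `anchor_id` are hypothetical, and the cache is modeled as a flat list of per-token entries, whereas a real model stores key/value tensors per layer and per attention head. The sketch assumes that anchor-token entries are kept and that tokens after the most recent anchor stay cached because no anchor has summarized them yet.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    """One cached position: the token id plus its (toy) key and value vectors."""
    token_id: int
    key: List[float]
    value: List[float]


def reduce_cache(cache: List[CacheEntry], anchor_id: int) -> List[CacheEntry]:
    """Return a new cache with non-anchor entries removed up to the last anchor.

    Assumption (not from the paper's pseudocode): entries after the most
    recent anchor are kept, since no anchor has condensed them yet.
    """
    # Find the position of the most recent anchor token, if any.
    last_anchor = -1
    for i, entry in enumerate(cache):
        if entry.token_id == anchor_id:
            last_anchor = i

    if last_anchor < 0:
        # No anchor seen so far: nothing can be dropped.
        return list(cache)

    # Keep only anchor entries in the summarized span, then the open tail.
    kept = [e for e in cache[: last_anchor + 1] if e.token_id == anchor_id]
    return kept + cache[last_anchor + 1 :]
```

With a seven-token cache that contains two anchors, the reduced cache holds only the two anchor entries plus the single token that follows the second anchor, so the number of cached positions grows with the number of anchors rather than with the number of tokens.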


We introduce the inference method in Algorithm 1. The function “REDUCTION” in Line 1 is utilized to remove keys/values caches when the model processes prefix texts in Line 10 or generates an anchor token during the prediction of the next token in Line 16. This approach aims to reduce the keys/values caches for both prefix tokens and generated outputs during real-time inference.
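The control flow described above, with one reduction after the prefix and one each time an anchor token is generated, might look like the following sketch. It is an illustration under stated assumptions rather than a transcription of Algorithm 1: `anchor_inference`, `reduction`, `predict`, `anchor_id`, and `eos_id` are hypothetical names, the cache is simplified to a list of token ids standing in for per-token key/value tensors, and `predict` is a stand-in for a model forward pass that reads the current cache.

```python
from typing import Callable, List, Tuple


def reduction(cache: List[int], anchor_id: int) -> List[int]:
    """Drop non-anchor entries up to the most recent anchor (assumed behavior)."""
    if anchor_id not in cache:
        return list(cache)
    # Index of the last anchor in the cache.
    last = len(cache) - 1 - cache[::-1].index(anchor_id)
    kept = [t for t in cache[: last + 1] if t == anchor_id]
    return kept + cache[last + 1 :]


def anchor_inference(
    prefix: List[int],
    predict: Callable[[List[int]], int],
    anchor_id: int,
    eos_id: int,
    max_new_tokens: int,
) -> Tuple[List[int], List[int]]:
    """Generate tokens while shrinking the cache at anchor tokens.

    Returns the generated tokens and the final (reduced) cache.
    """
    # Prefix phase: cache the whole prefix, then reduce it once.
    cache = list(prefix)
    cache = reduction(cache, anchor_id)

    output: List[int] = []
    for _ in range(max_new_tokens):
        token = predict(cache)  # the hypothetical model sees only the reduced cache
        if token == eos_id:
            break
        output.append(token)
        cache.append(token)
        # Generation phase: reduce again whenever an anchor is produced.
        if token == anchor_id:
            cache = reduction(cache, anchor_id)
    return output, cache
```

Any callable that maps the current cache to a next token id can be passed as `predict`; a scripted stub that replays a fixed token sequence is enough to trace how the cache shrinks each time an anchor appears.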


This paper is available on arXiv under a CC BY 4.0 DEED license.