A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

09/14/2023
by Yeqi Gao, et al.

Large language models (LLMs) have played a pivotal role in revolutionizing many facets of our daily lives. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \big( \langle \langle \exp(\mathsf{A}_{j_0} x), \mathbf{1}_n \rangle^{-1} \exp(\mathsf{A}_{j_0} x), \, A_3 Y_{*,i_0} \rangle - b_{j_0,i_0} \big)^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$ and $b_{j_0,i_0} \in \mathbb{R}$ is the entry in the $j_0$-th row and $i_0$-th column of $B$; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer; the matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to accuracy $\epsilon$ and runs in $O\big( ( \mathcal{T}_{\mathrm{mat}}(n,n,d) + \mathcal{T}_{\mathrm{mat}}(n,d,d) + d^{2\omega} ) \log(1/\epsilon) \big)$ time. Here $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time to multiply an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
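To make the objective concrete, the following is a minimal NumPy sketch that evaluates $L(X,Y)$ in its equivalent matrix form, where the $j_0$-th inner product corresponds to the $j_0$-th row of the row-normalized matrix $\exp(A_1 X A_2^\top)$ multiplied by $A_3 Y$ and compared against $B$. This rewriting is our reading of the Kronecker-product/vectorization structure described above, not code from the paper; the function name `attention_loss` and the random test data are illustrative, and the paper's iterative greedy training algorithm is not reproduced here, only the loss evaluation.

```python
import numpy as np

def attention_loss(X, Y, A1, A2, A3, B):
    """Evaluate L(X, Y) for the one-layer attention objective.

    Matrix form: each row of exp(A1 @ X @ A2.T) is divided by its sum
    (the <exp(A_{j0} x), 1_n>^{-1} factor), multiplied by A3 @ Y, and
    compared with B entrywise in squared error.
    """
    S = np.exp(A1 @ X @ A2.T)              # n x n, exponentiated attention scores
    P = S / S.sum(axis=1, keepdims=True)   # row-normalize each row j0
    R = P @ (A3 @ Y) - B                   # n x d residual against B
    return float(np.sum(R ** 2))           # sum over j0 and i0 of squared entries

if __name__ == "__main__":
    # Small sanity check with random data; n and d are arbitrary here.
    rng = np.random.default_rng(0)
    n, d = 8, 4
    A1 = rng.standard_normal((n, d))
    A2 = rng.standard_normal((n, d))
    A3 = rng.standard_normal((n, d))
    X = rng.standard_normal((d, d))
    Y = rng.standard_normal((d, d))
    B = rng.standard_normal((n, d))
    print(attention_loss(X, Y, A1, A2, A3, B))
```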
