Linear Regression Model

1. Model Definition

Given a training set $T=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$, where $x^{(i)}=(x_1, x_2, ..., x_n)^T\in X=R^{n}$ and $y^{(i)}\in Y=R$, a linear regression model tries to learn a function that makes predictions through a linear combination of the attributes, namely

$$f(x)=w_1x_1+w_2x_2+...+w_nx_n+b$$

In vector form this is usually written as:

$$f(x)=w^T\cdot x+b$$

where $w=(w_1, w_2, ..., w_n)^T\in R^{n}$ and $b\in R$, such that $f(x)\simeq y$.

How do we determine $w$ and $b$? The key is how to measure the difference between $f(x)$ and $y$. The mean squared error is the most commonly used performance measure, so we can try to minimize it:

$$\min_{w,b} L(w,b)=\sum_{i=1}^m(f(x^{(i)})-y^{(i)})^2=\sum_{i=1}^m(w^T\cdot x^{(i)}+b-y^{(i)})^2$$

The mean squared error has a very intuitive geometric meaning: it corresponds to the familiar Euclidean distance. The method that solves for the model by minimizing the mean squared error is called the least square method. In linear regression, the least square method tries to find a straight line such that the sum of the Euclidean distances from all samples to the line is minimized.

The process of finding $w$ and $b$ that minimize $L(w,b)=\sum_{i=1}^m(f(x^{(i)})-y^{(i)})^2$ is called least-squares parameter estimation of the linear regression model. Let $w_0=b$ and $x_0=1$, so that $w=(w_0,w_1, w_2, ..., w_n)^T$ and $x=(x_0, x_1, x_2, ..., x_n)^T$; the problem then becomes:

$$f(x)=w^T\cdot x$$

$$\min_{w} L(w)=\sum_{i=1}^m(w^T\cdot x^{(i)}-y^{(i)})^2$$
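To make the objective concrete, here is a minimal NumPy sketch that evaluates $L(w)$ exactly as the sum above, with the bias folded in as $x_0=1$ (the toy data and the name `loss` are invented for illustration):

```python
import numpy as np

def loss(w, xs, ys):
    """Sum of squared errors L(w) = sum_i (w^T x_i - y_i)^2,
    where each x_i is already augmented with x_0 = 1."""
    return sum((w @ x - y) ** 2 for x, y in zip(xs, ys))

# toy data: y is roughly 1 + 2*x1
xs = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 2.0])]
ys = [1.1, 2.9, 5.2]

w = np.array([1.0, 2.0])     # w = (w_0, w_1) = (b, w_1)
print(loss(w, xs, ys))       # small but nonzero residual
```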

Taking the partial derivative with respect to $w_j$ gives:

$$\dfrac{\partial L(w)}{\partial w_j}=\dfrac{\partial \sum_{i=1}^m(w^T\cdot x^{(i)}-y^{(i)})^2}{\partial w_j}=\sum_{i=1}^m2(w^T\cdot x^{(i)}-y^{(i)})x^{(i)}_j
$$

Collecting these partial derivatives yields the gradient vector:

$$\nabla L(w)= 2\sum_{i=1}^m(w^T\cdot x^{(i)}-y^{(i)})x^{(i)}$$

Now define:

$$X= \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ ... \\ ( x^{(m)} )^T \end{bmatrix} = \begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & ... & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & ... & x^{(2)}_n \\ 1 & x^{(3)}_1 & x^{(3)}_2 & ... & x^{(3)}_n \\ ... \\ 1 & x^{(m)}_1 & x^{(m)}_2 & ... & x^{(m)}_n \end{bmatrix},\quad Y=\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ ... \\ y^{(m)} \end{bmatrix},\quad w=\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ ... \\ w_n \end{bmatrix}$$
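In code these matrices can be assembled as follows; this is a minimal NumPy sketch, and the toy data and names such as `X_raw` are invented for illustration:

```python
import numpy as np

# raw features without the bias: one row per sample, shape (m, n)
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([1.0, 3.1, 4.9, 7.2])           # targets, shape (m,)

# design matrix X: prepend a column of ones so that w_0 plays the role of b
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # shape (m, n+1)

w = np.zeros(X.shape[1])                      # w = (w_0, w_1, ..., w_n)
residual = X @ w - Y                          # the vector X·w - Y used below
print(X)
print(residual)
```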

Then:

$$X\cdot w= \begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & ... & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & ... & x^{(2)}_n \\ 1 & x^{(3)}_1 & x^{(3)}_2 & ... & x^{(3)}_n \\ ... \\ 1 & x^{(m)}_1 & x^{(m)}_2 & ... & x^{(m)}_n \end{bmatrix}\cdot \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ ... \\ w_n \end{bmatrix}=\begin{bmatrix} (x^{(1)})^T\cdot w \\ (x^{(2)})^T\cdot w \\ (x^{(3)})^T\cdot w \\ ... \\ (x^{(m)})^T\cdot w \end{bmatrix}=\begin{bmatrix} w^T \cdot x^{(1)} \\ w^T \cdot x^{(2)} \\ w^T \cdot x^{(3)} \\ ... \\ w^T \cdot x^{(m)} \end{bmatrix}$$

$$X\cdot w-Y =\begin{bmatrix} w^T \cdot x^{(1)}-y^{(1)} \\ w^T \cdot x^{(2)}-y^{(2)} \\ w^T \cdot x^{(3)}-y^{(3)} \\ ... \\ w^T \cdot x^{(m)}-y^{(m)} \end{bmatrix}$$

$$X^T\cdot (X\cdot w-Y )=\begin{bmatrix} x^{(1)} & x^{(2)} & x^{(3)} & ... & x^{(m)} \end{bmatrix}\cdot \begin{bmatrix} w^T \cdot x^{(1)}-y^{(1)} \\ w^T \cdot x^{(2)}-y^{(2)} \\ w^T \cdot x^{(3)}-y^{(3)} \\ ... \\ w^T \cdot x^{(m)}-y^{(m)} \end{bmatrix}=\sum_{i=1}^m(w^T\cdot x^{(i)}-y^{(i)})x^{(i)}$$

Comparing this with the gradient above, we obtain:

$$\nabla L(w)=2 X^T\cdot (X\cdot w-Y )$$
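As a sanity check, this vectorized gradient can be compared against a finite-difference approximation of $L(w)$. The following is a small NumPy sketch with made-up data:

```python
import numpy as np

def loss(w, X, Y):
    r = X @ w - Y
    return r @ r                       # (Xw - Y)^T (Xw - Y)

def grad(w, X, Y):
    return 2 * X.T @ (X @ w - Y)       # the closed-form gradient above

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
Y = rng.normal(size=5)
w = rng.normal(size=3)

# central finite-difference approximation of each partial derivative
eps = 1e-6
num = np.array([(loss(w + eps * np.eye(3)[j], X, Y)
                 - loss(w - eps * np.eye(3)[j], X, Y)) / (2 * eps)
                for j in range(3)])
print(np.allclose(num, grad(w, X, Y), atol=1e-4))   # True
```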

On the other hand, $L(w)$ can also be written as:

$$L(w)=(X\cdot w-Y)^T\cdot (X\cdot w-Y)=w^T X^TX w -2w^TX^TY+Y^T Y$$

ww求导,同样可以得到梯度。

To find the optimal solution, set the gradient to zero:

$$\nabla L(w)=2 X^T\cdot X\cdot w-2 X^T\cdot Y =0$$

which gives:

$$w=(X^T\cdot X)^{-1}\cdot X^T\cdot Y$$

The learned linear model is therefore:

$$f(x)=w^T\cdot x=x^T\cdot (X^T\cdot X)^{-1}\cdot X^T\cdot Y$$

X=(XTX)1XTX^\dagger =(X^T\cdot X)^{-1}\cdot X^T,称为伪逆(seudo-inverse),代入得到。

$$f(x) = x^T\cdot X^\dagger\cdot Y$$
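A minimal NumPy sketch of this closed form follows; `np.linalg.pinv` computes the pseudo-inverse $X^\dagger$ directly, and the toy data and variable names are invented for illustration:

```python
import numpy as np

# toy data: y is roughly 1 + 2*x1
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])             # design matrix with bias column
Y = np.array([1.0, 3.1, 4.9, 7.2])

w = np.linalg.pinv(X) @ Y               # w = X† Y
x_new = np.array([1.0, 4.0])            # new sample, augmented with x_0 = 1
print(w)                                # roughly [1.0, 2.0]: b ≈ 1, w_1 ≈ 2
print(x_new @ w)                        # prediction f(x_new), roughly 9
```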

2. Learning Algorithms

2.1 Normal Equations

Compute $w=(X^T\cdot X)^{-1}\cdot X^T\cdot Y$ directly, which immediately gives the model $f(x)=w^T\cdot x$.

However, computing a matrix inverse is very expensive ($X^T_{n\times m}\cdot X_{m\times n}=Z_{n\times n}$, so the cost grows rapidly with the number of features), so when the number of features is large, gradient descent is usually preferred.
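A small NumPy sketch of the normal-equation route (toy data; solving $X^TXw=X^TY$ with `np.linalg.solve` avoids forming the inverse explicitly):

```python
import numpy as np

# toy data: y is roughly 1 + 2*x1
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.0, 3.1, 4.9, 7.2])

# solve X^T X w = X^T Y instead of computing (X^T X)^{-1} explicitly
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(w)                                 # roughly [1.0, 2.0]

def f(x):
    """Model f(x) = w^T x, with x already augmented with x_0 = 1."""
    return w @ x

print(f(np.array([1.0, 4.0])))
```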

2.2 Batch Gradient Descent

Input: training set $T=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$, where $x^{(i)}\in X= R^n$, $y^{(i)}\in Y=R$, $i=1,2,...,m$, and a learning rate $\eta$ ($0<\eta\leqslant1$).

Output: $w=(w_1, w_2, ..., w_n)^T$ and $b$, i.e., the model $f(x)=w^T\cdot x+b$.

1) Augment each input $x^{(i)}$ to $x^{(i)}=(1, x_1, x_2, ..., x_n)^T$ and let $w_0=b$, so the output becomes $w=(w_0, w_1, w_2, ..., w_n)^T$.

2) Choose an initial $w^{(0)}=(w_0, w_1, w_2, ..., w_n)^T$.

3) Compute the gradient $X^T\cdot (X\cdot w^{(j)}-Y )$ (the constant factor 2 is omitted here), where $w^{(j)}$ is the result of the $j$-th iteration; the $(j+1)$-th iteration is then:

$$w^{(j+1)} \gets w^{(j)} - \eta X^T\cdot (X\cdot w^{(j)}-Y )$$

4) Return to step 3) until a stopping criterion is met or enough iterations have been run.

In batch gradient descent, every iteration uses all of the samples, so the per-step cost becomes large when the sample size is large. A minimal NumPy sketch of steps 1) to 4) is shown below.
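In this sketch the toy data, the fixed iteration count used as the stopping condition, and the name `batch_gradient_descent` are illustrative choices:

```python
import numpy as np

def batch_gradient_descent(X_raw, Y, eta=0.01, n_iters=5000):
    """Fit f(x) = w^T x + b by batch gradient descent.
    X_raw: (m, n) feature matrix without the bias column."""
    m, n = X_raw.shape
    X = np.hstack([np.ones((m, 1)), X_raw])   # step 1: add x_0 = 1
    w = np.zeros(n + 1)                        # step 2: initial w^(0)
    for _ in range(n_iters):                   # steps 3-4: iterate
        grad = X.T @ (X @ w - Y)               # gradient (factor 2 omitted)
        w = w - eta * grad
    return w[0], w[1:]                         # b, w

# toy data: y is roughly 1 + 2*x1
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([1.0, 3.1, 4.9, 7.2])
b, w = batch_gradient_descent(X_raw, Y)
print(b, w)          # roughly 1.0 and [2.0]
```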

2.3 Stochastic Gradient Descent

Replace step 3) above with:

3) Randomly pick one sample $x^{(i)}$, then:

$$w^{(j+1)} \gets w^{(j)} - \eta ((w^{(j)})^T\cdot x^{(i)}-y^{(i)})x^{(i)}$$

Repeat until enough iterations have been run.
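A corresponding NumPy sketch of the stochastic update (the toy data, uniform random sampling, and the fixed iteration count are illustrative assumptions):

```python
import numpy as np

def stochastic_gradient_descent(X_raw, Y, eta=0.01, n_iters=20000, seed=0):
    """Fit f(x) = w^T x + b using one randomly chosen sample per update."""
    rng = np.random.default_rng(seed)
    m, n = X_raw.shape
    X = np.hstack([np.ones((m, 1)), X_raw])    # add x_0 = 1
    w = np.zeros(n + 1)
    for _ in range(n_iters):
        i = rng.integers(m)                     # pick one sample at random
        w = w - eta * (w @ X[i] - Y[i]) * X[i]  # update on that sample only
    return w[0], w[1:]

X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([1.0, 3.1, 4.9, 7.2])
b, w = stochastic_gradient_descent(X_raw, Y)
print(b, w)          # close to 1.0 and [2.0], noisier than batch GD
```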
