I. Simple multiple linear regression:
data.txt
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
6,8.7,48.9,75,7.2
7,57.5,32.8,23.5,11.8
8,120.2,19.6,11.6,13.2
9,8.6,2.1,1,4.8
10,199.8,2.6,21.2,10.6
11,66.1,5.8,24.2,8.6
12,214.7,24,4,17.4
13,23.8,35.1,65.9,9.2
14,97.5,7.6,7.2,9.7
15,204.1,32.9,46,19
16,195.4,47.7,52.9,22.4
17,67.8,36.6,114,12.5
18,281.4,39.6,55.8,24.4
19,69.2,20.5,18.3,11.3
20,147.3,23.9,19.1,14.6
21,218.4,27.7,53.4,18
22,237.4,5.1,23.5,12.5
23,13.2,15.9,49.6,5.6
24,228.3,16.9,26.2,15.5
25,62.3,12.6,18.3,9.7
26,262.9,3.5,19.5,12
27,142.9,29.3,12.6,15
28,240.1,16.7,22.9,15.9
29,248.8,27.1,22.9,18.9
30,70.6,16,40.8,10.5
31,292.9,28.3,43.2,21.4
32,112.9,17.4,38.6,11.9
33,97.2,1.5,30,9.6
34,265.6,20,0.3,17.4
35,95.7,1.4,7.4,9.5
36,290.7,4.1,8.5,12.8
37,266.9,43.8,5,25.4
38,74.7,49.4,45.7,14.7
39,43.1,26.7,35.1,10.1
40,228,37.7,32,21.5
41,202.5,22.3,31.6,16.6
42,177,33.4,38.7,17.1
43,293.6,27.7,1.8,20.7
44,206.9,8.4,26.4,12.9
45,25.1,25.7,43.3,8.5
46,175.1,22.5,31.5,14.9
47,89.7,9.9,35.7,10.6
48,239.9,41.5,18.5,23.2
49,227.2,15.8,49.9,14.8
50,66.9,11.7,36.8,9.7
51,199.8,3.1,34.6,11.4
52,100.4,9.6,3.6,10.7
53,216.4,41.7,39.6,22.6
54,182.6,46.2,58.7,21.2
55,262.7,28.8,15.9,20.2
56,198.9,49.4,60,23.7
57,7.3,28.1,41.4,5.5
58,136.2,19.2,16.6,13.2
59,210.8,49.6,37.7,23.8
60,210.7,29.5,9.3,18.4
61,53.5,2,21.4,8.1
62,261.3,42.7,54.7,24.2
63,239.3,15.5,27.3,15.7
64,102.7,29.6,8.4,14
65,131.1,42.8,28.9,18
66,69,9.3,0.9,9.3
67,31.5,24.6,2.2,9.5
68,139.3,14.5,10.2,13.4
69,237.4,27.5,11,18.9
70,216.8,43.9,27.2,22.3
71,199.1,30.6,38.7,18.3
72,109.8,14.3,31.7,12.4
73,26.8,33,19.3,8.8
74,129.4,5.7,31.3,11
75,213.4,24.6,13.1,17
76,16.9,43.7,89.4,8.7
77,27.5,1.6,20.7,6.9
78,120.5,28.5,14.2,14.2
79,5.4,29.9,9.4,5.3
80,116,7.7,23.1,11
81,76.4,26.7,22.3,11.8
82,239.8,4.1,36.9,12.3
83,75.3,20.3,32.5,11.3
84,68.4,44.5,35.6,13.6
85,213.5,43,33.8,21.7
86,193.2,18.4,65.7,15.2
87,76.3,27.5,16,12
88,110.7,40.6,63.2,16
89,88.3,25.5,73.4,12.9
90,109.8,47.8,51.4,16.7
91,134.3,4.9,9.3,11.2
92,28.6,1.5,33,7.3
93,217.7,33.5,59,19.4
94,250.9,36.5,72.3,22.2
95,107.4,14,10.9,11.5
96,163.3,31.6,52.9,16.9
97,197.6,3.5,5.9,11.7
98,184.9,21,22,15.5
99,289.7,42.3,51.2,25.4
100,135.2,41.7,45.9,17.2
101,222.4,4.3,49.8,11.7
102,296.4,36.3,100.9,23.8
103,280.2,10.1,21.4,14.8
104,187.9,17.2,17.9,14.7
105,238.2,34.3,5.3,20.7
106,137.9,46.4,59,19.2
107,25,11,29.7,7.2
108,90.4,0.3,23.2,8.7
109,13.1,0.4,25.6,5.3
110,255.4,26.9,5.5,19.8
111,225.8,8.2,56.5,13.4
112,241.7,38,23.2,21.8
113,175.7,15.4,2.4,14.1
114,209.6,20.6,10.7,15.9
115,78.2,46.8,34.5,14.6
116,75.1,35,52.7,12.6
117,139.2,14.3,25.6,12.2
118,76.4,0.8,14.8,9.4
119,125.7,36.9,79.2,15.9
120,19.4,16,22.3,6.6
121,141.3,26.8,46.2,15.5
122,18.8,21.7,50.4,7
123,224,2.4,15.6,11.6
124,123.1,34.6,12.4,15.2
125,229.5,32.3,74.2,19.7
126,87.2,11.8,25.9,10.6
127,7.8,38.9,50.6,6.6
128,80.2,0,9.2,8.8
129,220.3,49,3.2,24.7
130,59.6,12,43.1,9.7
131,0.7,39.6,8.7,1.6
132,265.2,2.9,43,12.7
133,8.4,27.2,2.1,5.7
134,219.8,33.5,45.1,19.6
135,36.9,38.6,65.6,10.8
136,48.3,47,8.5,11.6
137,25.6,39,9.3,9.5
138,273.7,28.9,59.7,20.8
139,43,25.9,20.5,9.6
140,184.9,43.9,1.7,20.7
141,73.4,17,12.9,10.9
142,193.7,35.4,75.6,19.2
143,220.5,33.2,37.9,20.1
144,104.6,5.7,34.4,10.4
145,96.2,14.8,38.9,11.4
146,140.3,1.9,9,10.3
147,240.1,7.3,8.7,13.2
148,243.2,49,44.3,25.4
149,38,40.3,11.9,10.9
150,44.7,25.8,20.6,10.1
151,280.7,13.9,37,16.1
152,121,8.4,48.7,11.6
153,197.6,23.3,14.2,16.6
154,171.3,39.7,37.7,19
155,187.8,21.1,9.5,15.6
156,4.1,11.6,5.7,3.2
157,93.9,43.5,50.5,15.3
158,149.8,1.3,24.3,10.1
159,11.7,36.9,45.2,7.3
160,131.7,18.4,34.6,12.9
161,172.5,18.1,30.7,14.4
162,85.7,35.8,49.3,13.3
163,188.4,18.1,25.6,14.9
164,163.5,36.8,7.4,18
165,117.2,14.7,5.4,11.9
166,234.5,3.4,84.8,11.9
167,17.9,37.6,21.6,8
168,206.8,5.2,19.4,12.2
169,215.4,23.6,57.6,17.1
170,284.3,10.6,6.4,15
171,50,11.6,18.4,8.4
172,164.5,20.9,47.4,14.5
173,19.6,20.1,17,7.6
174,168.4,7.1,12.8,11.7
175,222.4,3.4,13.1,11.5
176,276.9,48.9,41.8,27
177,248.4,30.2,20.3,20.2
178,170.2,7.8,35.2,11.7
179,276.7,2.3,23.7,11.8
180,165.6,10,17.6,12.6
181,156.6,2.6,8.3,10.5
182,218.5,5.4,27.4,12.2
183,56.2,5.7,29.7,8.7
184,287.6,43,71.8,26.2
185,253.8,21.3,30,17.6
186,205,45.1,19.6,22.6
187,139.5,2.1,26.6,10.3
188,191.1,28.7,18.2,17.3
189,286,13.9,3.7,15.9
190,18.7,12.1,23.4,6.7
191,39.5,41.1,5.8,10.8
192,75.5,10.8,6,9.9
193,17.2,4.1,31.6,5.9
194,166.8,42,3.6,19.6
195,149.7,35.6,6,17.3
196,38.2,3.7,13.8,7.6
197,94.2,4.9,8.1,9.7
198,177,9.3,6.4,12.8
199,283.6,42,66.2,25.5
200,232.1,8.6,8.7,13.4
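Regression code:
% A = importdata('data.txt', ' ', 200);   % alternative loader; the data would then be in A.data
a = load('data.txt');
x1 = a(:, 2);
x2 = a(:, 3);
x3 = a(:, 4);
y = a(:, 5);
X = [ones(length(y), 1), x1, x2, x3];     % first column of ones is the intercept
[b, bint, r, rint, stats] = regress(y, X);
b, bint, stats                            % show coefficients, confidence intervals, fit statistics
rcoplot(r, rint)                          % residual case-order plot
tx = [230.1, 37.8, 69.2];                 % predictors of the first row of data.txt
b2 = [b(2), b(3), b(4)];
ty = b(1) + b2 * tx';
ty
This simply gives a prediction formula:
y = b(1) + b(2)*x1 + b(3)*x2 + b(4)*x3;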
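The prediction formula can be checked without the toolbox function `regress` by solving the same least-squares problem with the backslash operator from base MATLAB. The sketch below is only an illustration: it uses the first six rows of data.txt rather than the full file, and the variable names `ty_formula` and `ty_matrix` are made up for this example.

```matlab
% Columns 2-5 of the first six rows of data.txt (three predictors, then y).
a = [230.1 37.8 69.2 22.1;
      44.5 39.3 45.1 10.4;
      17.2 45.9 69.3  9.3;
     151.5 41.3 58.5 18.5;
     180.8 10.8 58.4 12.9;
       8.7 48.9 75.0  7.2];

y = a(:, 4);
X = [ones(size(a, 1), 1), a(:, 1:3)];   % intercept column first, as in the regress call

b = X \ y;                              % least-squares coefficients; b(1) is the intercept

tx = [230.1, 37.8, 69.2];               % one observation (x1, x2, x3)
ty_formula = b(1) + b(2)*tx(1) + b(3)*tx(2) + b(4)*tx(3);
ty_matrix  = [1, tx] * b;               % the same prediction in matrix form

fprintf('prediction: %.4f (formula) vs %.4f (matrix)\n', ty_formula, ty_matrix);
```

Each coefficient multiplies its own predictor, so `b(4)` (not `b(3)`) belongs to `x3`; the matrix form makes that pairing harder to get wrong.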
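II. Ridge regression
In essence, ridge regression adds a penalty on the size of the coefficients to least squares, which limits the influence that ill-conditioned or strongly correlated data have on the fit.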
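1. Problems with ordinary linear regression
Ordinary linear regression runs into trouble on more complex regression problems, mainly in two respects:
- Prediction accuracy: what matters here is the relation between the number of samples $m$ and the number of features $n$:
  - when $m \gg n$, least-squares regression has low variance;
  - when $m \approx n$, it overfits easily;
  - when $m < n$, least-squares regression gives no meaningful result.
- Interpretability: if the features in the model are correlated with one another, the model becomes more complex without becoming any easier to interpret. In that case we need to perform feature selection.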
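The case of fewer samples than features can be made concrete with a tiny example. The sketch below uses a made-up 2-by-3 design matrix and a hypothetical penalty value `lam`. It shows that X'X is rank deficient, so the least-squares normal equations have no unique solution, and that adding a multiple of the identity (the ridge penalty described in the next part) restores full rank.

```matlab
% Toy design matrix with m = 2 samples and n = 3 features (made-up values).
X = [1 2 3;
     4 5 6];
n = size(X, 2);

xTx = X' * X;                         % n-by-n, but its rank is at most m = 2
r_plain = rank(xTx);

lam = 0.1;                            % hypothetical penalty value
r_ridge = rank(xTx + lam * eye(n));   % adding lam*I makes the matrix invertible

fprintf('rank(X''X) = %d, rank(X''X + lam*I) = %d, n = %d\n', r_plain, r_ridge, n);
% prints: rank(X'X) = 2, rank(X'X + lam*I) = 3, n = 3
```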
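All of these problems come down to the variance and the bias of the model. The figure below illustrates the relationship:
(From: Machine Learning in Action)
Variance refers to how much models fitted on different samples differ from one another, while bias refers to the gap between the model's predictions and the data. We need to find a trade-off between the two.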
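2. The concept of ridge regression
There are generally three approaches to feature selection:
- Subset selection
- Shrinkage methods, also called regularization. These mainly include ridge regression and lasso regression.
- Dimension reduction
Ridge regression adds a regularization term to the squared error:
$$f(w) = \sum_{i=1}^{m} \Big( y_i - \sum_{j} x_{ij} w_j \Big)^2 + \lambda \sum_{j} w_j^2, \qquad \lambda > 0.$$
Choosing the value of $\lambda$ balances variance against bias: as $\lambda$ increases, the variance of the model decreases and its bias increases.
In matrix form, $f(w) = (y - Xw)^T (y - Xw) + \lambda w^T w$. Differentiating with respect to $w$ gives
$$\frac{\partial f}{\partial w} = -2 X^T (y - Xw) + 2 \lambda w.$$
Setting this to zero yields the value of $w$:
$$\hat{w} = (X^T X + \lambda I)^{-1} X^T y.$$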
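As a quick numerical check of the closed form w = (X'X + lam*I)^(-1) X'y, the sketch below evaluates the gradient -2X'(y - Xw) + 2*lam*w at that w and compares the weight norm with the unpenalized solution. The data and the value of `lam` are made up for illustration.

```matlab
% Toy data (made-up values): 5 samples, 2 features.
X = [1 2; 2 1; 3 4; 4 3; 5 6];
y = [3; 3; 7; 7; 11];
lam = 0.5;                                    % hypothetical penalty value

n = size(X, 2);
w = (X' * X + lam * eye(n)) \ (X' * y);       % closed-form ridge solution

grad = -2 * X' * (y - X * w) + 2 * lam * w;   % gradient of the penalized loss at w

w_ols = (X' * X) \ (X' * y);                  % lam = 0: ordinary least squares

fprintf('||grad|| = %.2e, ||w|| = %.4f, ||w_ols|| = %.4f\n', ...
        norm(grad), norm(w), norm(w_ols));
```

The gradient vanishes up to rounding error, and the ridge weights have a smaller norm than the least-squares weights, which is the shrinkage effect of the penalty.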
3. The experiment
We now examine how different values of $\lambda$ affect the model.
MATLAB code
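function [ w ] = ridgeRegression( x, y, lam )
    % Ridge regression weights: w = (x'*x + lam*I)^(-1) * x'*y
    xTx = x' * x;
    [m, n] = size(xTx);
    temp = xTx + eye(m, n) * lam;
    if det(temp) == 0
        disp('This matrix is singular, cannot do inverse');
        w = [];
        return;
    end
    w = temp \ (x' * y);    % solve the linear system instead of forming the inverse
end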
%% Ridge regression
clc;
% Load the data
data = load('data.txt');
[m, n] = size(data);
dataX = data(:, 2:4);   % features
dataY = data(:, 5);     % response

% Standardize: center y; center x and divide by its variance
yMeans = mean(dataY);
for i = 1:m
    yMat(i, :) = dataY(i, :) - yMeans;
end
xMeans = mean(dataX);
xVars = var(dataX);
for i = 1:m
    xMat(i, :) = (dataX(i, :) - xMeans) ./ xVars;
end

% Compute the weights for 30 values of lam
testNum = 30;
weights = zeros(testNum, n-2);
for i = 1:testNum
    w = ridgeRegression(xMat, yMat, exp(i-10));
    weights(i, :) = w';
end

% Plot the weights against log(lam)
hold on
axis([-9 20 -1.0 2.5]);
xlabel log(lam);
ylabel weights;
for i = 1:n-2
    x = -9:20;
    y(1, :) = weights(:, i)';
    plot(x, y);
end
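The plot shows that a reasonable fit appears at k = 5 (loop index i = 5, that is, lam = exp(5-10)), so we take the value of w obtained there:
% result output, i = 5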
w = ridgeRegression(xMat, yMat, exp(5-10));
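Because w is fitted on scaled data, a prediction for a raw observation has to apply the same scaling first and then add back the mean of y. The sketch below illustrates this with made-up data standing in for `dataX` and `dataY`; the ridge solution is computed inline so the block runs on its own, and `xNew`, `wRaw` and `b0` are names introduced only for this example.

```matlab
% Toy data (made-up values) standing in for dataX and dataY.
dataX = [1 10; 2 14; 3 11; 4 19; 5 16; 6 22];
dataY = [2.1; 3.9; 4.2; 7.8; 7.1; 9.9];

% Same scaling as the script above: center y; center x and divide by the variance.
xMeans = mean(dataX);
xVars  = var(dataX);
yMeans = mean(dataY);
xMat = bsxfun(@rdivide, bsxfun(@minus, dataX, xMeans), xVars);
yMat = dataY - yMeans;

lam = exp(5 - 10);
w = (xMat' * xMat + lam * eye(size(xMat, 2))) \ (xMat' * yMat);

% Prediction for a new raw observation: scale it like the training data,
% apply w, then undo the centering of y.
xNew = [3.5, 15];
yHat = ((xNew - xMeans) ./ xVars) * w + yMeans;

% Equivalent weights and intercept on the original feature scale.
wRaw = w ./ xVars';
b0   = yMeans - xMeans * wRaw;

fprintf('yHat = %.4f, b0 + xNew*wRaw = %.4f\n', yHat, b0 + xNew * wRaw);
```

Converting to `wRaw` and `b0` once is convenient when many raw observations need predictions, since the scaling no longer has to be repeated for each one.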
III. Another good ridge regression example
function [b,bint,r,rint,stats] = ridge1(Y,X,k)
    [n,p] = size(X);
    mx = mean(X);
    my = mean(Y);
    stdx = std(X);
    stdy = std(Y);
    idx = find(abs(stdx) < sqrt(eps));
    MX = mx(ones(n,1),:);
    STDX = stdx(ones(n,1),:);
    Z = (X - MX) ./ STDX;               % standardize the predictors
    Y = (Y - my) ./ stdy;               % standardize the response
    pseudo = sqrt(k*(n-1)) * eye(p);    % pseudo-observations that carry the ridge penalty
    Zplus = [Z; pseudo];
    Yplus = [Y; zeros(p,1)];
    [b,bint,r,rint,stats] = regress(Yplus,Zplus);
end
% The value 34048 in the second row is kept as printed; it may be a typo for 34.48.
x = [71.35 22.90  3.76 1158.18 12.20 55.87;
     67.92 34048 17.11 1494.38 19.82 56.60;
     79.38 24.91 33.60  691.56 16.17 92.78;
     87.97 10.18  0.73  923.04 12.15 24.66;
     59.03  7.71  3.58  696.92 13.50 61.81;
     55.23 22.94  1.34 1083.84 10.76 49.79;
     58.30 12.78  5.25 1180.36  9.58 57.02;
     67.43  9.59  2.92  797.72 16.82 38.29;
     76.63 15.12  2.55  919.49 17.79 32.07];
y = [28.46; 27.76; 26.02; 33.29; 40.84; 44.50; 28.09; 46.24; 45.21];
x'*x;
count = 0;
kvec = 0.1:0.1:1;
for k = 0.1:0.1:1
    count = count + 1;
    [b,bint,r,rint,stats] = ridge1(y,x,k);
    bb(:,count) = b;            % one column of coefficients per value of k
    stats1(count,:) = stats;
end
bb', stats1
plot(kvec', bb'), xlabel('k'), ylabel('b','FontName','Symbol')
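As the output and Figure 1 show, the ridge coefficient of every variable changes only slightly once k ≥ 0.7, so we can choose k = 0.7 and obtain the ridge regression equation
y = -0.2195*x1 - 0.1202*x2 - 0.2378*x3 - 0.2446*x4 + 0.2036*x5 - 0.2494*x6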
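The function ridge1 obtains the ridge estimate by appending p pseudo-observations to the standardized data and running an ordinary regression. The sketch below, on made-up values standing in for the standardized Z and Y, suggests why this works: least squares on the augmented data has the normal equations (Z'Z + lam*I) b = Z'Y with lam = k*(n-1). It uses the backslash operator in place of `regress` so that it needs no toolbox.

```matlab
% Toy data (made-up values) standing in for standardized Z and Y:
% n = 5 samples, p = 2 predictors.
Z = [-1.2  0.3;
     -0.4 -1.1;
      0.1  0.9;
      0.6 -0.8;
      0.9  0.7];
Y = [-1.0; -0.5; 0.2; 0.4; 0.9];
[n, p] = size(Z);

k = 0.7;
lam = k * (n - 1);                  % penalty implied by pseudo = sqrt(k*(n-1))*eye(p)

% Least squares on the augmented data, as ridge1 does.
Zplus = [Z; sqrt(lam) * eye(p)];
Yplus = [Y; zeros(p, 1)];
b_aug = Zplus \ Yplus;

% Direct ridge formula.
b_direct = (Z' * Z + lam * eye(p)) \ (Z' * Y);

fprintf('max difference: %.2e\n', max(abs(b_aug - b_direct)));
```

The two estimates agree up to rounding error, so the k in ridge1 is a penalty scaled by n - 1 rather than the raw penalty itself.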