- 2. Continuous Output
- 10. Equation of the Regression Line
- 17. Coding It Up
- 18. Age/Net Worth Regression in sklearn
- 19. Extracting Information from sklearn
- 21. Now You Practice Extracting Information
- 22. Linear Regression Errors
- 25. Minimizing the Sum of Squared Errors
- 27. Why Minimize SSE
- 28. Problems with Minimizing Errors
- 31. The R-squared Metric for Regression
- 34. What Data Suits Linear Regression
- 35. Comparing Classification and Regression
- 36. Multi-Variate Regression
- 37. Regression Mini-Project
- 47. Outliers Break Regression
2. Continuous Output
In regression the output must be continuous (a number), rather than a discrete class label:
10. Equation of the Regression Line
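The slides for this section are not reproduced here; the regression line has the familiar slope/intercept form, with m the slope and b the intercept (the same m and b that section 25 minimizes over):
y = m * x + b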
17. Coding It Up
See section 1.1 (Generalized Linear Models) of the sklearn documentation:
>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0,0],[1,1],[2,2]],[0,1,2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> reg.coef_
array([ 0.5, 0.5])
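Continuing the transcript with a quick check of my own (not from the lesson): the fitted intercept is essentially zero, so with coefficients [0.5, 0.5] the model predicts 0.5*x1 + 0.5*x2, and [3, 3] maps to roughly 3.0:
>>> reg.predict([[3, 3]])
array([ 3.])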
18. Age/Net Worth Regression in sklearn
studentMain.py
#!/usr/bin/python
import numpy
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
from studentRegression import studentReg
from class_vis import prettyPicture, output_image
from ages_net_worths import ageNetWorthData
ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()
reg = studentReg(ages_train, net_worths_train)
plt.clf()
plt.scatter(ages_train, net_worths_train, color="b", label="train data")
plt.scatter(ages_test, net_worths_test, color="r", label="test data")
plt.plot(ages_test, reg.predict(ages_test), color="black")
plt.legend(loc=2)
plt.xlabel("ages")
plt.ylabel("net worths")
plt.savefig("test.png")
output_image("test.png", "png", open("test.png", "rb").read())
studentRegression.py
def studentReg(ages_train, net_worths_train):
    ### import the sklearn regression module, create, and train your regression
    ### name your regression reg
    ### your code goes here!
    from sklearn import linear_model
    reg = linear_model.LinearRegression()
    reg.fit(ages_train, net_worths_train)
    return reg
- Output: a scatterplot of the training (blue) and test (red) points with the fitted regression line (image not reproduced)
19. Extracting Information from sklearn
...
reg.fit(ages_train, net_worths_train)
# predict the net worth of a 27-year-old from the trained model
print reg.predict([[27]])
# slope of the regression line
print reg.coef_
# intercept of the regression line
print reg.intercept_
# r-squared score: 1 is a perfect fit, higher is more accurate
print reg.score(ages_test, net_worths_test)
21. Now You Practice Extracting Information
- Extract the prediction, slope, and intercept from the regression, along with the scores on the training and test data.
import numpy
import matplotlib.pyplot as plt
from ages_net_worths import ageNetWorthData
ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(ages_train, net_worths_train)
### get Katie's net worth (she's 27)
### sklearn predictions are returned in an array, so you'll want to index into
### the output to get what you want, e.g. net_worth = predict([[27]])[0][0] (not
### exact syntax, the point is the [0] at the end). In addition, make sure the
### argument to your prediction function is in the expected format - if you get
### a warning about needing a 2d array for your data, a list of lists will be
### interpreted by sklearn as such (e.g. [[27]]).
km_net_worth = reg.predict([[27]])[0][0] ### fill in the line of code to get the right value
### get the slope
### again, you'll get a 2-D array, so stick the [0][0] at the end
slope = reg.coef_[0][0] ### fill in the line of code to get the right value
### get the intercept
### here you get a 1-D array, so stick [0] on the end to access
### the info we want
intercept = reg.intercept_[0] ### fill in the line of code to get the right value
### get the score on test data
test_score = reg.score(ages_test,net_worths_test) ### fill in the line of code to get the right value
### get the score on the training data
training_score = reg.score(ages_train,net_worths_train) ### fill in the line of code to get the right value
def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader.
    return {"networth": km_net_worth,
            "slope": slope,
            "intercept": intercept,
            "stats on test": test_score,
            "stats on training": training_score}
- Output:
{"slope": 6.473549549577059, "stats on training": 0.8745882358217186, "intercept": -14.35378330775552, "stats on test": 0.812365729230847, "networth": 160.43205453082507}
22. Linear Regression Errors
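The plots for this section are not reproduced here; the key definition is that the error of a data point is the vertical distance between the actual value and the value the line predicts:
error = actual y - predicted y = y - (m * x + b)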
25. Minimizing the Sum of Squared Errors
A good linear regression finds the m and b that minimize the sum of squared errors, SSE = Σ (y_i - (m * x_i + b))².
Two common algorithms minimize the SSE (a small sketch of the second follows the list):
- ordinary least squares (OLS), which is what sklearn uses for linear regression
- gradient descent
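Here is a minimal gradient-descent sketch of mine for fitting m and b; the toy data, learning rate, and iteration count are arbitrary illustration choices, not values from the course:
import numpy as np

def fit_line_gd(x, y, learning_rate=0.01, n_iter=10000):
    """Minimize SSE = sum((y - (m*x + b))**2) by gradient descent."""
    m, b = 0.0, 0.0
    n = float(len(x))
    for _ in range(n_iter):
        pred = m * x + b
        # partial derivatives of the SSE with respect to m and b
        grad_m = -2.0 * np.sum(x * (y - pred))
        grad_b = -2.0 * np.sum(y - pred)
        m -= learning_rate * grad_m / n
        b -= learning_rate * grad_b / n
    return m, b

# toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
print fit_line_gd(x, y)  # should approach (2.0, 1.0)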
27. Why Minimize SSE
The following example illustrates why the best regression model is the one that minimizes the squared errors:
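The lesson's figure is not reproduced here; this small numeric example of mine makes the same point. Take two points with the same x value, (0, 0) and (0, 2), and fit a horizontal line y = b. The total absolute error |b| + |2 - b| equals 2 for every b between 0 and 2, so minimizing absolute error cannot single out one best line. The SSE, b² + (2 - b)², is minimized only at b = 1, so minimizing squared errors gives a unique answer (and, being differentiable, the SSE is what makes OLS and gradient descent tractable).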
28. Problems with Minimizing Errors
In the two plots the fitted lines are equally good, but the right-hand dataset is larger, so its SSE comes out larger and the fit looks worse than the one on the left. The problem with the SSE as a quality measure is that it grows with the number of points, so comparing it across datasets of different sizes can be misleading.
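A quick demonstration of mine (not from the lesson): duplicating a dataset leaves the best-fit line unchanged but doubles the SSE:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])     # roughly y = x, with a little noise
m, b = np.polyfit(x, y, 1)             # least-squares fit
sse = np.sum((y - (m * x + b)) ** 2)
x2, y2 = np.tile(x, 2), np.tile(y, 2)  # every point appears twice
m2, b2 = np.polyfit(x2, y2, 1)         # identical line...
sse2 = np.sum((y2 - (m2 * x2 + b2)) ** 2)
print sse, sse2   # ...but sse2 == 2 * sse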
31. The R-squared Metric for Regression
To sidestep the problem above, the r-squared (R²) metric is used to evaluate a regression. It runs from 0 (worst) to 1 (best) for a reasonable fit, and because it does not grow with the amount of data, it can be compared across datasets of different sizes (on held-out data it can even dip below 0, as the mini-project output below shows):
# r-squared score: 1 is a perfect fit, higher is more accurate
print reg.score(ages_test, net_worths_test)
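To make the definition concrete, R² can be computed by hand and compared with reg.score; this sketch assumes the reg, ages_test, and net_worths_test from the earlier exercise:
import numpy as np
# R^2 = 1 - SS_res / SS_tot
y_true = np.array(net_worths_test).ravel()
y_pred = reg.predict(ages_test).ravel()
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares (the SSE)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print 1 - ss_res / ss_tot  # matches reg.score(ages_test, net_worths_test)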
34. What Data Suits Linear Regression
The first plot is a poor choice: y does not change with x at all, so x cannot be used to predict y.
In the fourth plot, m is 0 and b is fixed, i.e. y is a constant no matter how x changes.
35. Comparing Classification and Regression
- Supervised classification (Naive Bayes / SVM / decision trees): discrete output (class labels)
- Regression: continuous output (a number)
36. Multi-Variate Regression
The earlier examples predicted from a single input variable; when several variables are used together for the prediction, that is multivariate regression (a short sketch follows).
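For instance (a sketch of mine on made-up numbers, not data from the course), sklearn's LinearRegression accepts any number of input features and simply learns one coefficient per feature:
from sklearn import linear_model
# made-up data: predict net worth from age AND IQ (two features per sample)
features = [[25, 100], [35, 120], [45, 110], [55, 130]]
net_worths = [125, 180, 220, 280]
reg = linear_model.LinearRegression()
reg.fit(features, net_worths)
print reg.coef_               # one coefficient per input feature
print reg.predict([[40, 115]])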
37. Regression Mini-Project
#!/usr/bin/python
"""
Starter code for the regression mini-project.
Loads up/formats a modified version of the dataset
(why modified? we've removed some trouble points
that you'll find yourself in the outliers mini-project).
Draws a little scatterplot of the training/testing data
You fill in the regression code where indicated:
"""
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "r") )
### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "long_term_incentive"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )
### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer sklearn versions
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"
### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train,target_train)
print "slope:",reg.coef_
print "intercept:",reg.intercept_
print "score train:",reg.score(feature_train,target_train)
print "score test:",reg.score(feature_test,target_test)
### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color )
### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
- Output and plots
Results when predicting bonus from salary:
slope: [ 5.44814029]
intercept: -102360.543294
score train: 0.0455091926995
score test: -1.48499241737
Results when predicting bonus from long_term_incentive:
slope: [ 1.19214699]
intercept: 554478.756215
score train: 0.217085971258
score test: -0.59271289995
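Note that both test scores are negative. sklearn's score returns R², which compares the model against simply predicting the mean of the test targets; a score below zero means the fitted line does worse on the held-out data than that constant baseline.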
47. Outliers Break Regression
This is a preview of the next lesson, on identifying and removing outliers. Go back to the earlier setup where you used salary to predict bonus, and rerun the code to review the data. You may notice a small number of points falling outside the main trend: someone drawing a high salary (over $1 million!) while receiving a relatively small bonus. This is an example of an outlier, and we will focus on them in the next lesson.
A point like this can have a big effect on a regression: if it falls in the training set, it can significantly change the slope/intercept; if it falls in the test set, it can make the score far lower than it would otherwise be. As things stand, this point falls in the test set (and is probably dragging the score down). Let's play with it a bit and see what happens when it falls in the training set instead. Near the bottom of finance_regression.py, just before plt.xlabel(features_list[1]), add these two lines:
reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="b")
Now we draw two regression lines: one fit on the test data (with the outlier) and one fit on the training data (without it). Look at the plot now: a big difference, right? A single outlier can cause a large change.
What is the slope of the new regression line?
reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="y")
print "test slope:",reg.coef_
The resulting plot:
- Blue line: regression trained on the training data, predicting over the test data
- Yellow line: regression trained on the test data, predicting over the training data
- The test data contains the outlier, so the line fit on it is dragged toward that point, which is why its slope below is much smaller
The slopes, intercepts, and scores:
train slope: [ 5.44814029]
train intercept: -102360.543294
train score train: 0.0455091926995
train score test: -1.48499241737
#############
test slope: [ 2.27410114]
test intercept: 124444.388866
test score train: -0.123597985403
test score test: 0.251488150398