hmchzb19
Multivariate regression (multiple regression)

1. What if more than one variable influences the one you are interested in?
Example:
  Predicting the price of a car from its many attributes (body style, brand, mileage, number of doors, etc.), all of which can affect the price.

#Still use least squares
We just end up with one coefficient for each factor.
  For example, price = a + b(mileage) + c(doors)
  These coefficients indicate how important each factor is (only if the data is all normalized, so that the factors share a common scale).
  Get rid of the factors that don't matter.
We can still measure the fit with r-squared.
We need to assume the different factors are not themselves dependent on each other.
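
To make the formula above concrete, here is a minimal sketch with NumPy on made-up data (the numbers, the noise level and the "true" coefficients are assumptions chosen for illustration, not values from cars.xls). It fits price = a + b(mileage) + c(doors) by least squares and computes r-squared by hand.

```python
import numpy as np

# Synthetic, made-up data: 200 cars with a mileage and a door count.
rng = np.random.default_rng(0)
mileage = rng.uniform(5_000, 50_000, size=200)
doors = rng.choice([2, 4], size=200).astype(float)

# Assumed "true" relationship, chosen only for this illustration.
a_true, b_true, c_true = 30_000.0, -0.2, -1_500.0
price = a_true + b_true * mileage + c_true * doors + rng.normal(0, 500, size=200)

# Design matrix: a column of ones for the intercept a, then one column per factor.
X = np.column_stack([np.ones_like(mileage), mileage, doors])

# Ordinary least squares: minimise ||X @ coef - price||^2.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
a, b, c = coef

# r-squared: share of the variance in price that the fit explains.
residuals = price - X @ coef
r_squared = 1.0 - (residuals @ residuals) / np.sum((price - price.mean()) ** 2)

print(f"a={a:.1f}, b={b:.4f}, c={c:.1f}, r-squared={r_squared:.3f}")
```

The fitted b and c should land close to the assumed values. Note that b is tiny only because mileage is measured in large units, which is why coefficients can be compared for importance only after normalizing.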

2. Straight to the code.
Use pandas to read an Excel file. You may run into an ImportError.


    # statsmodels package
    # ImportError: Install xlrd >= 0.9.0 for Excel support, need xlrd.
    # Fix it from a shell (this is not Python code):
    #   apt-get install python3-xlrd

    import pandas as pd
    df = pd.read_excel("http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls")
    df.head()


1st version code


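    import statsmodels.api as sm
    from sklearn.preprocessing import StandardScaler

    scale = StandardScaler()
    # Model_ord must exist before it is selected: encode the model name as an integer code.
    df['Model_ord'] = pd.Categorical(df.Model).codes
    # .copy() so the assignment below does not write into a slice of df.
    X = df[['Mileage', 'Cylinder', 'Doors', 'Model_ord']].copy()
    y = df[['Price']]

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].to_numpy())
    print(X)
    # No sm.add_constant() here, so this model is fitted without an intercept.
    est = sm.OLS(y, X).fit()
    est.summary()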

2nd version


    # 2nd version code
    df['Model_ord'] = pd.Categorical(df.Model).codes
    X = df[['Mileage', 'Cylinder', 'Doors', 'Model_ord']]
    y = df[['Price']]
    # add_constant() adds the intercept column ("const" in the summary).
    X1 = sm.add_constant(X)
    est = sm.OLS(y, X1).fit()

    est.summary()
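
A side note on `pd.Categorical(df.Model).codes`: it turns each distinct string into an integer, numbered by the sorted order of the distinct values. The small sketch below uses hypothetical model names (not taken from cars.xls) to show what the codes look like.

```python
import pandas as pd

# Hypothetical model names, for illustration only.
models = pd.Series(["Malibu", "Cavalier", "Malibu", "Corvette"])

cat = pd.Categorical(models)
print(list(cat.categories))  # distinct values, in sorted order
print(list(cat.codes))       # position of each row's value in that order
```

Because the numbering is just alphabetical, the regression treats Model_ord as if it were a real numeric scale. That is a simplification worth keeping in mind when reading its coefficient.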
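The summary page below comes from the 2nd version (note the const row). The 1st version standardizes three of the columns and fits without an intercept, so its coefficients and R-squared will not match this table exactly.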
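
Whether two such fits agree depends on the intercept rather than on the scaling. The sketch below checks this on synthetic data with plain NumPy (the data and the `fit_r2` helper are assumptions for illustration). Standardizing the features leaves r-squared unchanged as long as a constant column is present, while dropping the constant changes it. For the no-constant case it uses the uncentered r-squared, which as far as I know is what statsmodels reports when the model has no constant.

```python
import numpy as np

def fit_r2(X, y, centered=True):
    """Least-squares fit of y on X; returns centered or uncentered r-squared."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    total = y - y.mean() if centered else y
    return 1.0 - (resid @ resid) / (total @ total)

# Made-up data: a mileage-like and a cylinder-like column.
rng = np.random.default_rng(1)
n = 300
raw = np.column_stack([rng.uniform(5_000, 50_000, n),
                       rng.choice([4.0, 6.0, 8.0], n)])
y = 10_000 - 0.15 * raw[:, 0] + 4_000 * raw[:, 1] + rng.normal(0, 3_000, n)

# Standardize each column to mean 0 and standard deviation 1.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
ones = np.ones((n, 1))

r2_raw_const = fit_r2(np.hstack([ones, raw]), y)
r2_scaled_const = fit_r2(np.hstack([ones, scaled]), y)
r2_scaled_no_const = fit_r2(scaled, y, centered=False)

print(r2_raw_const, r2_scaled_const, r2_scaled_no_const)
```

The first two numbers should be equal up to rounding, and the third should be clearly different.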



    """
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                  Price   R-squared:                       0.425
    Model:                            OLS   Adj. R-squared:                  0.422
    Method:                 Least Squares   F-statistic:                     147.4
    Date:                Sun, 08 Jul 2018   Prob (F-statistic):           2.10e-94
    Time:                        20:51:27   Log-Likelihood:                -8313.9
    No. Observations:                 804   AIC:                         1.664e+04
    Df Residuals:                     799   BIC:                         1.666e+04
    Df Model:                           4                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const       1.037e+04   1669.695      6.211      0.000    7093.297    1.36e+04
    Mileage       -0.1608      0.032     -4.966      0.000      -0.224      -0.097
    Cylinder    4726.1696    204.906     23.065      0.000    4323.952    5128.387
    Doors      -1742.6007    312.188     -5.582      0.000   -2355.406   -1129.795
    Model_ord   -309.4669     32.666     -9.474      0.000    -373.587    -245.346
    ==============================================================================
    Omnibus:                      193.097   Durbin-Watson:                   0.081
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              440.016
    Skew:                           1.288   Prob(JB):                     2.83e-96
    Kurtosis:                       5.549   Cond. No.                     1.37e+05
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 1.37e+05. This might indicate that there are
    strong multicollinearity or other numerical problems.
    """

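About warning [2] in the summary: a large condition number does not have to mean the factors are strongly correlated. It can also come from columns that live on very different scales (mileage in the tens of thousands next to door counts of 2 or 4). The sketch below illustrates this on made-up, independent columns using `np.linalg.cond`. It is an assumption that this matches how statsmodels computes its "Cond. No." figure, so treat the numbers as indicative only.

```python
import numpy as np

# Made-up, independent columns on very different scales.
rng = np.random.default_rng(2)
n = 500
mileage = rng.uniform(5_000, 50_000, n)   # tens of thousands
doors = rng.choice([2.0, 4.0], n)         # 2 or 4

# Design matrix with an intercept column, features left unscaled.
raw = np.column_stack([np.ones(n), mileage, doors])
cond_raw = np.linalg.cond(raw)

# Same features standardized to mean 0 and standard deviation 1.
features = np.column_stack([mileage, doors])
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
cond_scaled = np.linalg.cond(np.column_stack([np.ones(n), scaled]))

print(f"unscaled: {cond_raw:.3g}, standardized: {cond_scaled:.3g}")
```

If the condition number stays large after standardizing, correlated factors become the more likely explanation.
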
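Note the R-squared here: 1 is a perfect fit and 0 is a very poor one, so 0.425 is not too bad. If we use one column fewer, for example three columns with Cylinder dropped, the R-squared becomes much worse: 0.042.

X=df[['Mileage', 'Doors', 'Model_ord']]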
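
The same effect can be reproduced on synthetic data. In the sketch below (made-up numbers, with a hypothetical `r_squared` helper) the price depends mostly on the cylinder column, so leaving that column out makes r-squared collapse. With an intercept in the model, removing a column can never raise r-squared; the only question is how far it falls.

```python
import numpy as np

def r_squared(X, y):
    """r-squared of a least-squares fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Made-up data in which cylinders drive most of the price.
rng = np.random.default_rng(3)
n = 400
mileage = rng.uniform(5_000, 50_000, n)
cylinder = rng.choice([4.0, 6.0, 8.0], n)
doors = rng.choice([2.0, 4.0], n)
price = (8_000 - 0.15 * mileage + 4_500 * cylinder - 1_500 * doors
         + rng.normal(0, 6_000, n))

r2_full = r_squared(np.column_stack([mileage, cylinder, doors]), price)
r2_no_cyl = r_squared(np.column_stack([mileage, doors]), price)

print(f"all columns: {r2_full:.3f}, without cylinder: {r2_no_cyl:.3f}")
```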


    In [13]: df['Model_ord'] = pd.Categorical(df.Model).codes
        ...: X=df[['Mileage', 'Doors', 'Model_ord']]
        ...: y=df[['Price']]
        ...: X1=sm.add_constant(X)
        ...: est=sm.OLS(y, X1).fit()
        ...:
        ...: est.summary()
        ...:
    Out[13]:
    <class 'statsmodels.iolib.summary.Summary'>
    """
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  Price   R-squared:                       0.042
    Model:                            OLS   Adj. R-squared:                  0.038
    Method:                 Least Squares   F-statistic:                     11.57
    Date:                Sun, 08 Jul 2018   Prob (F-statistic):           1.98e-07
    Time:                        20:52:16   Log-Likelihood:                -8519.1
    No. Observations:                 804   AIC:                         1.705e+04
    Df Residuals:                     800   BIC:                         1.706e+04
    Df Model:                           3
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const       3.125e+04   1809.549     17.272      0.000    2.77e+04    3.48e+04
    Mileage       -0.1765      0.042     -4.227      0.000      -0.259      -0.095
    Doors      -1652.9303    402.649     -4.105      0.000   -2443.303    -862.558
    Model_ord    -39.0387     39.326     -0.993      0.321    -116.234      38.157
    ==============================================================================
    Omnibus:                      206.410   Durbin-Watson:                   0.080
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              470.872
    Skew:                           1.379   Prob(JB):                    5.64e-103
    Kurtosis:                       5.541   Cond. No.                     1.15e+05
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 1.15e+05. This might indicate that there are
    strong multicollinearity or other numerical problems.
    """


3. Finally


    # The table of coefficients above gives us the values to plug into an equation of the form B0 + B1 * mileage + B2 * cylinder + B3 * doors + B4 * model_ord
    # In this example it's pretty clear that the number of cylinders is more important than anything else, based on the coefficients.
    # Could we have figured that out earlier?
    y.groupby(df.Doors).mean()
    # Surprisingly, more doors does not mean a higher price, so it's not surprising that it