Statistics 5703 – Fall, 2009

Data Mining Methodology I


Instructor: Dr. Xiaogang Su

Office: Room 102, CC II

Phone: (407) 823-2940

Email: xiaosu@mail.ucf.edu

Prerequisite: STA 5103 and STA 5206

Website: http://pegasus.cc.ucf.edu/~xsu/CLASS/STA5703/

Announcement: Given the limited time, there will be no Project III in this course, so that you can concentrate on your final project.

Description of the Course: Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of observational data in order to discover patterns and models that are meaningful to the data owner. By applying data mining techniques, data miners can exploit patterns and behavior in the data and gain a deeper understanding of it. In business applications, the goal is to produce new knowledge that decision-makers can act upon. Data mining does this by using techniques such as logistic regression and decision trees to build a model of the real world from data collected from a variety of sources, including corporate transactions, customer histories and demographics, and external sources such as credit bureaus. The resulting model produces knowledge that can be used to support decision-making and to predict new business opportunities. This course covers data mining techniques such as clustering and decision trees, as well as the assessment of classification rules and the use of SAS Enterprise Miner.

Statistical Computing:

·         SAS and SAS Enterprise Miner 9.1 are available, but you must pay before they can be installed on your laptop. For more information, contact JoAnne Roche at (407) 823-5562 or the data mining lab, located in Room 350, MAP Building.

Syllabus and Grading Policy:

Range    Grade
94+      A
93-90    A-
89-87    B+
86-83    B
82-80    B-
79-77    C+
76-73    C
72-70    C-
69-67    D+
66-63    D
62-60    D-
59-0     F

Class Notes: The free Acrobat Reader is needed to view the PDF files properly.

 

    Notes                                                    Data Sets                    SAS Programs       R Programs
1   Data Mining – An Overview
2                                                            donors, myscore
3   Cluster Analysis                                         clexam1                      Chap3-Cluster.sas  R-chp3.R
4   Principal Components Analysis and Some Extensions                                     Chap4-PCA.sas      R-chp4-PCA.R
5   Multidimensional Scaling (MDS)                                                        Chap5-MDS.sas      R-chp5-MDS.R
6   Model Validation: Variable Selection and Regularization  icu.txt; icu-decription.txt                     R-Chp6-Logit.R
7   An Introduction to Tree-Based Methods
8   Splitting Criteria (1, 2)                                                                                R-Chp8.R
9   Tree Implementation (1, 2)                               general.sas7bdat                                R-chp9.R
10  Boosting                                                                                                 R-chp10-Boosting.R
11  Bagging and Random Forests                                                                               R-chp11-Bagging-RF.R
12  Feature Selection - RELIEF                                                                               R-chp12-RELIEF.R
13  PageRank of Google                                                                                       R-chp13-PageRank.R

 

Homework Assignments and Class Handouts:

 

Homework   Assignment      Other Related Files
1          Homework 1
2          Homework 2      organics.sas7bdat
3
4
Final      Final Project

 

 

Reading Articles and Materials

  1. Introduction to Data Mining and Knowledge Discovery, Third Edition. ISBN: 1-892095-02-5. 1999 by Two Crows Corporation
  2. Data Preprocessing presentation, taken from David Squire’s Data Mining Class.  
  3. Clustering
    1. Sarle, W. (1983). Cubic Clustering Criterion. Technical Report A-108, SAS Institute, Inc.
    2. Wong, M.A. and Lane, T. (1983). A k-th Nearest Neighbor Clustering Procedure. Journal of the Royal Statistical Society, Series B, 45: 362-368.
    3. Hartigan, J.A. and Wong, M.A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Applied Statistics, 28: 100-108.
    4. Hartigan, J.A. (1967). Representation of Similarity Matrices by Trees. Journal of the American Statistical Association, 62: 1140-1158.
    5. Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a dataset via the Gap statistic. Journal of the Royal Statistical Society, Series B, 63: 411-423.
  4. Difference between data mining and traditional statistical tools
    1. Friedman, J. (1991). Multivariate Adaptive Regression Splines, pages 1-10. Annals of Statistics, 19: 1-67.
    2. Friedman, J. (1997) “Data Mining and Statistics: What’s the Connection?” in Keynote Address, 29th Symposium on the Interface: Computing Science and Statistics.
    3. Hand, D. (1998). “Data Mining: Statistics and More?” The American Statistician, 52: 112-118.
    4. Hand, D. (1999). “Statistics and Data Mining: Intersecting Disciplines.” ACM SIGKDD, 1: 1-16.
  5. Multidimensional Scaling
    1. Buja, A. et al. (2004). "Interactive Data Visualization with Multidimensional Scaling".
  6. Decision Trees
    1. Murthy, S.K. (1997). Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery.
    2. Murthy, S.K. (1996). On Growing Better Decision Trees from Data.
    3. De Raedt, L. and Blockeel, H. (1997). Using Logical Decision Trees for Clustering. Proceedings of the 7th International Workshop on Inductive Logic Programming.
    4. Salzberg, S., Delcher, A.L., Fasman, K.H. and Henderson, J. (1997). A Decision Tree System for Finding Genes in DNA. Journal of Computational Biology.
    5. Mingers, J. (1989). An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning, 3: 319-342.
    6. Buntine, W. and Niblett, T. (1992). A Further Comparison of Splitting Rules for Decision-Tree Induction. Machine Learning, 8: 75-85.
  7. Pruning Trees
    1. Oliver, J.J. and Hand, D.J. (1995). On Pruning and Averaging Decision Trees. Proc. 12th Int. Conf. Machine Learning.
    2. Esposito, F., Malerba, D. and Semeraro, G. (1997). A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  8. Bagging, Boosting, and Random Forests
    1. Bauer, E. and Kohavi, R. (1999). An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36: 105-142.
    2. Breiman, L. (2001). Random Forests. Technical Report. Department of Statistics, UC Berkeley.
    3. Breiman, L. (1994). Bagging Predictors. Technical Report. Department of Statistics, UC Berkeley.
    4. Freund, Y. and Schapire, R. (1999). A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 14 (5): 771-780.
    5. Friedman, J.H. (March 1999b). "Stochastic Gradient Boosting." (software)
    6. Friedman, J.H. (Feb. 1999a). "Greedy Function Approximation: A Gradient Boosting Machine." (software)

 

Other Resources:

 

  • Other Implementations of Tree-Based Methods:
  1. CART from Salford Systems.
  2. Two R functions that fit classification and regression trees: tree() and rpart().
  3. PROC SPLIT