Statistics 5703 – Fall, 2005

Data Mining Methodology I

 

 

 

Instructor: Dr. Xiaogang Su
Office: Room 102, CC II
Phone: (407) 823-2940
Email: xiaosu@mail.ucf.edu
Office Hours: W 2:30-3:20 pm and R 3:30-4:20 pm
Prerequisites: STA 5103 and STA 5206
Website: http://pegasus.cc.ucf.edu/~xsu/CLASS/STA5703/

 

Description of the Course:  Data mining is the process of exploring and analyzing large quantities of observational data, by automatic or semiautomatic means, in order to discover patterns and models that are meaningful to the data owner. By applying data mining techniques, data miners can exploit patterns in the data and gain a better understanding of its underlying structure. The goal of data mining in business is to produce new knowledge that decision-makers can act upon. It does this by using techniques such as logistic regression and decision trees to build a model of the real world from data collected from a variety of sources, including corporate transactions, customer histories and demographics, and external sources such as credit bureaus. The resulting model produces knowledge that can be used to support decision-making and to predict new business opportunities. This course covers data mining techniques such as clustering and decision trees. It also covers the assessment of classification rules and the use of SAS Enterprise Miner.

Statistical Computing:  SAS Enterprise Miner 9.1 is available. To install the new version of SAS and SAS Enterprise Miner on your home computer, please bring 7 CDs and one floppy disk to the data mining lab, located in Room 350, MAP Building. Before you go, please check the lab's schedule.

Syllabus and the grading policy:

Range    Grade
94+      A
93-90    A-
89-87    B+
86-83    B
82-80    B-
79-77    C+
76-73    C
72-70    C-
69-67    D+
66-63    D
62-60    D-
59-0     F

Class Notes: The free Acrobat Reader is needed to view the PDF files.

 

Notes                                                             Data Sets          SAS Programs
1   Data Mining – An Overview
2   Understand the Structure of SAS Enterprise Miner              donors, myscore
3   Clustering                                                    clexam1, develop
4   Introduction to Tree-Structured Models (4.1, 4.2)             expcar             example4-1.sas
5   Decision Trees I - Tree Growing                                                  example5-1.sas
6   Decision Trees II - Tree Pruning
7   Decision Trees III - Tree Size Selection (7.1, 7.2)
8   An Example on Decision Trees with Binary Response (8.1, 8.2)  general
9   Trees for Multi-Class and Continuous Responses (9.1, 9.2)     housing
10  Incorporating Costs in SAS EM                                 campscr, camptrn
11  Auxiliary Use of Trees
12  Bagging, Boosting, and Random Forests                         pen                example12.sas
13  PROC SPLIT                                                                       example13.sas

 

 Homework assignments and Class Handouts:

 

Homework  Assignment      Other Related Files
1         Homework 1
2         Homework 2      PID
3         Homework 3      HMEQ
4         Homework 4      HW4DAT
5         Homework 5
6         Homework 6      ORGANICS
7         Homework 7      ABALONE
Final     Final Project

 

 Reading Articles and Materials

  1. Introduction to Data Mining and Knowledge Discovery, Third Edition. ISBN: 1-892095-02-5. 1999 by Two Crows Corporation
  2. Data Preprocessing presentation, taken from David Squire’s Data Mining Class.  
  3. Clustering
    1. Sarle, W. (1983). Cubic Clustering Criterion. Technical Report A-108, SAS Institute, Inc.
    2. Wong, M.A. and Lane, T. (1983). A k-th Nearest Neighbor Clustering Procedure. Journal of the Royal Statistical Society, Series B, 45: 362-368.
    3. Hartigan, J.A. and Wong, M.A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Applied Statistics, 28: 100-108.
    4. Hartigan, J.A. (1967). Representation of Similarity Matrices by Trees. Journal of the American Statistical Association, 62: 1140-1158.
  4. Difference between data mining and traditional statistical tools
    1. Pages 1-10 of Friedman, J. (1991). Multivariate Adaptive Regression Splines. Annals of Statistics, 19: 1-67.
    2. Friedman, J. (1997). “Data Mining and Statistics: What’s the Connection?” Keynote Address, 29th Symposium on the Interface: Computing Science and Statistics.
    3. Hand, D. (1998). “Data Mining: Statistics and More?” The American Statistician, 52: 112-118.
    4. Hand, D. (1999). “Statistics and Data Mining: Intersecting Disciplines.” ACM SIGKDD, 1: 1-16.
  5. Decision Trees
    1. Murthy, S.K. (1997). Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery.
    2. Murthy, S.K. (1996). On Growing Better Decision Trees from Data.
    3. De Raedt, L. and Blockeel, H. (1997). Using Logical Decision Trees for Clustering. Proceedings of the 7th International Workshop on Inductive Logic Programming.
    4. S. Salzberg, A. L. Delcher, K.H. Fasman, J. Henderson (1997). A Decision Tree System for Finding Genes in DNA. Journal of Computational Biology.
    5. J. Mingers. (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3:319-342.
    6. W. Buntine and T. Niblett. (1992). A Further Comparison of Splitting Rules for Decision-Tree Induction.  Machine Learning, 8: 75-85
  6. Pruning Trees
    1. J. J. Oliver and D. J. Hand. (1995). On Pruning and Averaging Decision Trees. Proc. 12th Int. Conf. Machine Learning.
    2. F. Esposito, D. Malerba, G. Semeraro. (1997). A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 
  7. Bagging, Boosting, and Random Forests
    1. Bauer, E. and Kohavi, R. (1999). An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36, 105-142.
    2. Breiman, L. (2001). Random Forests. Technical Report. Department of Statistics, UC Berkeley. 
    3. Breiman, L. (1994). Bagging Predictors. Technical Report. Department of Statistics, UC Berkeley. 
    4. Freund, Y. and Schapire, R. (1999). A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 14 (5): 771-780.  
    5. Friedman, J. H. "Stochastic Gradient Boosting." (March 1999b) (software)
    6. Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (Feb. 1999a) (software)

 

 Other Resources: 

 

  • Several Implementations of Tree-Based Methods:
  1. CART from Salford Systems.
  2. Two functions in R that fit classification and regression trees: tree() and rpart().