
Statistics 5703, Fall 2009
Data Mining Methodology I

Instructor: Dr. Xiaogang Su
Office: Room 102, CC II
Phone: (407) 823-2940
Email: xiaosu@mail.ucf.edu
Prerequisite: STA 5103 and STA 5206
Website: http://pegasus.cc.ucf.edu/~xsu/CLASS/STA5703/
Announcement: Because time is limited, there will be no Project III for the
course, so that you can concentrate on your final project.
Description of the Course: Data mining
is the exploration and analysis, by automatic or semiautomatic
means, of large quantities of observational data in order to discover
patterns and models that are meaningful to the data owner. By applying data
mining techniques, data miners can exploit patterns and behavior in the data
and gain greater insight into it. The goal of data mining
in business is to produce new knowledge that decision-makers can
act upon. It does this by using techniques such as logistic
regression and decision trees to build a model of the real world from data
collected from a variety of sources, including corporate transactions, customer
histories and demographics, and external sources such as credit bureaus.
The resulting model produces knowledge that can be used to support
decision-making and to predict new business opportunities. This course covers
data mining techniques such as clustering and decision trees. It also covers
the assessment of classification rules and the use of SAS Enterprise Miner.
Statistical Computing:
- SAS and SAS Enterprise Miner 9.1 are available, but you must pay before
they can be installed on your laptop computer. For more information, contact
JoAnne Roche at (407) 823-5562 or the data mining lab, which is located in
Room 350, MAP Building.
Syllabus and the grading policy:
Range | Grade
94+ | A
93-90 | A-
89-87 | B+
86-83 | B
82-80 | B-
79-77 | C+
76-73 | C
72-70 | C-
69-67 | D+
66-63 | D
62-60 | D-
59-0 | F
Class Notes: The Acrobat Reader, which is free, is needed to view the
PDF files.
Homework assignments and Class Handouts:
Reading Articles and Materials
- Introduction to Data Mining and Knowledge Discovery, Third Edition.
  Two Crows Corporation, 1999. ISBN: 1-892095-02-5.
- Data Preprocessing presentation, taken from David Squires' Data Mining class.
- Clustering
  - Sarle, W. (1983). Cubic Clustering Criterion. Technical Report A-108,
    SAS Institute, Inc.
  - Wong, M. A. and Lane, T. (1983). A k-th Nearest Neighbor Clustering
    Procedure. Journal of the Royal Statistical Society, Series B, 45: 362-368.
  - Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS136: A K-Means
    Clustering Program. Applied Statistics, 28: 100-108.
  - Hartigan, J. A. (1967). Representation of Similarity Matrices by Trees.
    Journal of the American Statistical Association, 62: 1140-1158.
  - Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number
    of clusters in a dataset via the Gap statistic. Journal of the Royal
    Statistical Society, Series B, 63: 411-423.
- Difference between data mining and traditional statistical tools
  - Pages 1-10 of Multivariate Adaptive Regression Splines by Friedman (1991).
    Annals of Statistics, 19: 1-67.
  - Friedman, J. (1997). Data Mining and Statistics: What's the Connection?
    Keynote Address, 29th Symposium on the Interface: Computing Science and
    Statistics.
  - Hand, D. (1998). Data Mining: Statistics and More? The American
    Statistician, 52: 112-118.
  - Hand, D. (1999). Statistics and Data Mining: Intersecting Disciplines.
    ACM SIGKDD, 1: 1-16.
- Multidimensional Scaling
  - Buja, A. et al. (2004). Interactive Data Visualization with
    Multidimensional Scaling.
- Decision Trees
  - Murthy, S. K. (1997). Automatic Construction of Decision Trees from Data:
    A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery.
  - Murthy, S. K. (1996). On Growing Better Decision Trees from Data.
  - De Raedt, L. and Blockeel, H. (1997). Using Logical Decision Trees for
    Clustering. Proceedings of the 7th International Workshop on Inductive
    Logic Programming.
  - Salzberg, S., Delcher, A. L., Fasman, K. H. and Henderson, J. (1997).
    A Decision Tree System for Finding Genes in DNA. Journal of Computational
    Biology.
  - Mingers, J. (1989). An Empirical Comparison of Selection Measures for
    Decision-Tree Induction. Machine Learning, 3: 319-342.
  - Buntine, W. and Niblett, T. (1992). A Further Comparison of Splitting
    Rules for Decision-Tree Induction. Machine Learning, 8: 75-85.
- Pruning Trees
  - Oliver, J. J. and Hand, D. J. (1995). On Pruning and Averaging Decision
    Trees. Proc. 12th Int. Conf. Machine Learning.
  - Esposito, F., Malerba, D. and Semeraro, G. (1997). A Comparative Analysis
    of Methods for Pruning Decision Trees. IEEE Transactions on Pattern
    Analysis and Machine Intelligence.
- Bagging, Boosting, and Random Forests
  - Bauer, E. and Kohavi, R. (1999). An Empirical Comparison of Voting
    Classification Algorithms: Bagging, Boosting, and Variants. Machine
    Learning, 36: 105-142.
  - Breiman, L. (2001). Random Forests. Technical Report. Department of
    Statistics, UC Berkeley.
  - Breiman, L. (1994). Bagging Predictors. Technical Report. Department of
    Statistics, UC Berkeley.
  - Freund, Y. and Schapire, R. (1999). A Short Introduction to Boosting.
    Journal of Japanese Society for Artificial Intelligence, 14 (5): 771-780.
  - Friedman, J. H. "Stochastic Gradient Boosting." (March 1999b) (software)
  - Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting
    Machine." (Feb. 1999a) (software)
Other Resources:
- Other Implementations of Tree-Based Methods:
  - CART from Salford Systems.
  - Two functions in R that fit classification and regression trees:
    tree() and rpart().
  - PROC SPLIT
- SAS V9 documentation is free on the web, in both HTML and PDF, at
  http://support.sas.com/v9doc/, or within your software:
  HELP -> SAS Help and Documentation.
- The introduction to SAS Enterprise Miner Software, available from SAS.
- George H. John. Enhancements to the Data Mining Process. PhD Thesis,
  Computer Science Department, School of Engineering, Stanford University,
  194pp., March 1997.
- For offline SAS help, you might want to install this HotFix, which applies
  only to SAS V8.2.
- I also recommend installing the SAS System Viewer.