Page 125 - Computational Statistics Handbook with MATLAB
in the data set, then robust statistical methods might be more
appropriate. In Chapter 10, we illustrate an example where a graphical
look at the data indicates the presence of outliers, so we use a
robust method of nonparametric regression.
• We have a random sample that will be used to develop a model.
This model will be included in our simulation of a process (e.g.,
simulating a physical process such as a queue). We can use EDA
techniques to help us determine how the data might be distributed
and what model might be appropriate.
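As one illustration of using EDA to judge how data might be distributed, the following is a minimal sketch of a normal quantile plot built from base MATLAB functions only. The sample x is simulated here purely for illustration (it is not data from the text), and the plotting positions (i - 0.5)/n are one common convention among several, chosen as an assumption.

```matlab
% Hypothetical sample for illustration only: 100 draws from N(10, 2^2).
rng(2);                            % fix the seed so the plot is reproducible
x = 10 + 2*randn(100,1);

% Order the sample and compute plotting positions (assumed convention).
xs = sort(x);
n  = numel(xs);
p  = ((1:n)' - 0.5) / n;

% Standard normal quantiles via the inverse error function (base MATLAB).
z = sqrt(2) * erfinv(2*p - 1);

% If the normal model is reasonable, the points fall near a straight line.
figure
plot(z, xs, 'o')
xlabel('Standard normal quantiles')
ylabel('Ordered sample values')
title('Normal quantile plot')

% Correlation between the two sets of quantiles as a rough numeric summary.
c = corrcoef(z, xs);
r = c(1,2);
fprintf('Correlation of quantiles: %.3f\n', r);
```

Strong curvature in such a plot, or isolated points far from the line at either end, would suggest that a normal model is doubtful or that outliers may be present.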
In this chapter, we will be discussing graphical EDA and how these
techniques can be used to gain information and insights about the data.
Some experts include techniques such as smoothing, probability density
estimation, clustering and principal component analysis in exploratory
data analysis. We agree that these can be part of EDA, but we do not
cover them in this chapter. Smoothing techniques are discussed in
Chapter 10 where we present methods for nonparametric regression.
Techniques for probability density estimation are presented in Chapter 8,
but we do discuss simple histograms in this chapter. Methods for
clustering are described in Chapter 9. Principal component analysis is
not covered in this book, because the subject is discussed in many
linear algebra texts [Strang, 1988; Jackson, 1991].
It is likely that some of the visualization methods in this chapter are
familiar to statisticians, data analysts and engineers. As we stated in
Chapter 1, one of the goals of this book is to promote the use of MATLAB
for statistical analysis. Some readers might not be familiar with the
extensive graphics capabilities of MATLAB, so we endeavor to describe the
most useful ones for data analysis. In Section 5.2, we consider
techniques for visualizing univariate data. These include such methods as
stem-and-leaf plots, box plots, histograms, and quantile plots. We turn
our attention to techniques for visualizing bivariate data in Section 5.3
and include a description of surface plots, scatterplots and bivariate
histograms. Section 5.4 offers several methods for viewing
multi-dimensional data, such as slices, isosurfaces, star plots, parallel
coordinates, Andrews curves, projection pursuit, and the grand tour.
5.2 Exploring Univariate Data
Two important goals of EDA are: 1) to determine a reasonable model for
the process that generated the data, and 2) to locate possible outliers
in the sample. For example, we might be interested in finding out whether
the distribution that generated the data is symmetric or skewed. We might
also like to know whether it has one mode or many modes. The univariate
visualization techniques presented here will help us answer questions
such as these.
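The two goals above can be sketched in a few lines of MATLAB. This is only an illustrative sketch under stated assumptions: the sample is simulated (right-skewed exponential variates), the skewness is computed by hand with the divide-by-n moment form, the quartiles are taken as medians of the lower and upper halves of the ordered data, and the 1.5 IQR rule is used as one common convention for flagging possible outliers. The histogram function assumes MATLAB R2014b or later; older releases would use hist instead.

```matlab
% Hypothetical right-skewed sample: exponential(1) variates generated
% by the inverse transform method, using base MATLAB only.
rng(1);                            % fix the seed for reproducibility
x = -log(rand(200,1));

% Sample skewness: third central moment over the 1.5 power of the
% second central moment (both normalized by n). Positive values
% indicate a longer right tail.
xc = x - mean(x);
g1 = mean(xc.^3) / mean(xc.^2)^1.5;

% Histogram to inspect symmetry and the number of modes.
figure
histogram(x, 15)
xlabel('x'), ylabel('Frequency')
title(sprintf('Sample skewness = %.2f', g1))

% Quartiles as medians of the lower and upper halves of the sorted data
% (assumed convention; other quartile definitions give slightly
% different values).
xs = sort(x);
n  = numel(xs);
q1 = median(xs(1:floor(n/2)));
q3 = median(xs(ceil(n/2)+1:end));

% Flag points more than 1.5 interquartile ranges beyond the quartiles.
fence = 1.5 * (q3 - q1);
isOut = (x < q1 - fence) | (x > q3 + fence);
fprintf('Skewness %.2f, %d possible outliers\n', g1, sum(isOut));
```

Points flagged this way are only candidates for closer inspection. For a skewed distribution such as this one, some flagged values may simply belong to the long tail rather than being true outliers.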
© 2002 by Chapman & Hall/CRC