Page 40 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R
book we will pay attention to the topic of model validation both in classification
and regression.
1.7 Datasets
A statistical data analysis project starts, of course, with the data collection task. The
quality with which this task is performed is a major determinant of the quality of
the overall project. Issues such as reducing the amount of missing data, recording
the pertinent documentation on what the problem is and how the data was collected,
and inserting an appropriate description of the meaning of the variables involved
must be adequately addressed.
Missing data – failure to obtain for certain objects/cases the values of one or
more variables – will always undermine the degree of certainty of the statistical
conclusions. Many software products provide means to cope with missing data.
The simplest is to code missing data with special numeric codes or tags, such as
“na” (“not available”), which are ignored when performing statistical analysis
operations. Another possibility is to substitute missing data with the average value
of the respective variable. Yet another solution is simply to remove the objects with
missing data. Whatever method is used, the quality of the project is always
impaired.
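The three ways of coping with missing data mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch: the small data matrix, the variable names and the helper functions are hypothetical, and `None` plays the role of the “na” tag.

```python
from statistics import mean

# Hypothetical data matrix: one dict per object/case, one key per variable.
# None marks a missing value (the role played by an "na" tag).
cases = [
    {"age": 34, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 51, "weight": None},
    {"age": 47, "weight": 64.5},
]
variables = ["age", "weight"]


def mean_ignoring_missing(rows, var):
    """Average of the available values of var; missing entries are skipped."""
    return mean(r[var] for r in rows if r[var] is not None)


def impute_with_mean(rows, names):
    """Return a copy of rows where each missing value is replaced by the
    average of the available values of that variable."""
    means = {v: mean_ignoring_missing(rows, v) for v in names}
    return [
        {v: (r[v] if r[v] is not None else means[v]) for v in names}
        for r in rows
    ]


def drop_incomplete(rows, names):
    """Keep only the cases that have a value for every variable."""
    return [r for r in rows if all(r[v] is not None for v in names)]


print(mean_ignoring_missing(cases, "age"))    # missing values neglected
print(impute_with_mean(cases, variables))     # mean substitution
print(drop_incomplete(cases, variables))      # removal of incomplete cases
```

Note that mean substitution leaves the variable average unchanged but artificially reduces its spread, while case removal shrinks the sample; either way some information is lost.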
The collected data should be stored in a tabular form (“data matrix”), usually
with the rows corresponding to objects and the columns corresponding to the
variables. A spreadsheet such as the one provided by EXCEL (a popular
application for WINDOWS systems) constitutes an adequate data storage
solution. An example is shown in Figure 2.1. It makes it easy to perform simple
calculations on the data and to store an accompanying data description sheet. It
also simplifies data entry operations for many statistical software products.
All the statistical methods explained in this book are illustrated with real-life
problems. The real datasets used in the book examples and exercises are stored in
EXCEL files. They are described in Appendix E and included in the book CD.
Dataset names correspond to the respective EXCEL file names. Variable identifiers
correspond to the column identifiers of the EXCEL files.
There are also many datasets available through the Internet which the reader
may find useful for practising the methods taught. We particularly recommend the
datasets of the UCI Machine Learning Repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html). In these (and other) datasets data is presented in text
file format. Conversion to EXCEL format is usually straightforward since EXCEL
provides means to read in text files with several types of column delimitation.
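As an illustration of how a delimited text file maps onto a data matrix, the following Python sketch parses a small text excerpt and writes it back with a header row of variable identifiers, a form that spreadsheet programs can generally open. The sample values, the column names, the `?` missing-value marker and the helper functions are all assumptions made for the example, not part of any particular dataset.

```python
import csv
import io

# Hypothetical excerpt of a text-format dataset: comma-delimited,
# no header row, "?" assumed to mark a missing value.
raw_text = "5.1,3.5,?,setosa\n6.2,2.9,4.3,versicolor\n"
column_names = ["sepal_length", "sepal_width", "petal_length", "species"]


def read_delimited(text, names, delimiter=",", missing="?"):
    """Parse delimited text into a list of rows (objects), each a dict
    mapping variable name to value; missing markers become None."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        rows.append(
            {n: (None if f == missing else f) for n, f in zip(names, fields)}
        )
    return rows


def to_csv_with_header(rows, names):
    """Write the rows as comma-separated text with a header row of
    variable identifiers; None is written as an empty cell."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = read_delimited(raw_text, column_names)
print(to_csv_with_header(rows, column_names))
```

Other column delimitations are handled by changing the `delimiter` argument, for instance `delimiter="\t"` for tab-separated files.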
1.8 Software Tools
There are many software tools for statistical analysis, covering a broad spectrum of
possibilities. At one end we find “closed” products where the user can only