Page 40 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R



           book we will pay attention to the topic of model validation both in classification
           and regression.



           1.7 Datasets

           A statistical data analysis project starts, of course, with the data collection
           task. The quality with which this task is performed is a major determinant of the
           quality of the overall project. Issues such as reducing the amount of missing data,
           recording the pertinent documentation on what the problem is and how the data was
           collected, and inserting the appropriate description of the meaning of the variables
           involved must be adequately addressed.
              Missing data – failure to obtain for certain objects/cases the values of one or
           more variables – will always undermine the degree of certainty of the statistical
           conclusions. Many software products provide means to cope with missing data. One
           is simply to code missing data with symbolic numbers or tags, such as “na” (“not
           available”), which are ignored when performing statistical analysis operations.
           Another possibility is to substitute missing data with the average values of the
           respective variables. Yet another solution is simply to remove objects with
           missing data. Whatever method is used, the quality of the project is always
           impaired.
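              The three ways of coping with missing data described above can be sketched in a
           few lines of Python. This is only an illustration under stated assumptions: the
           values are made up, None plays the role of the “na” tag, and the function names
           are hypothetical rather than taken from any of the software products discussed in
           this book.

```python
from statistics import mean, variance

# Hypothetical values of one variable for six cases; None marks a missing
# value (it plays the role of the "na" tag).
values = [4.0, None, 6.0, 8.0, None, 2.0]

def mean_ignoring_missing(xs):
    """Strategy 1: skip tagged missing values when computing a statistic."""
    return mean(x for x in xs if x is not None)

def substitute_mean(xs):
    """Strategy 2: replace each missing value by the mean of the observed ones."""
    m = mean_ignoring_missing(xs)
    return [m if x is None else x for x in xs]

def remove_missing(xs):
    """Strategy 3: drop the cases that have a missing value."""
    return [x for x in xs if x is not None]

observed_mean = mean_ignoring_missing(values)
imputed = substitute_mean(values)
complete = remove_missing(values)

print(observed_mean)
print(imputed)
print(complete)

# Sample variance after mean substitution versus after case removal.
print(variance(imputed), variance(complete))
```

              In this made-up example, mean substitution leaves the variable mean unchanged
           but yields a smaller sample variance than case removal, which is one way in which
           the handling of missing data can distort later conclusions.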
              The collected data should be stored in a tabular form (“data matrix”), usually
           with the rows corresponding to objects and the columns corresponding to the
           variables. A spreadsheet such as the one provided by EXCEL (a popular
           application of the WINDOWS systems) constitutes an adequate data storing
           solution. An example is shown in Figure 2.1. A spreadsheet allows one to easily
           perform simple calculations on the data and to store an accompanying data
           description sheet. It also simplifies data entry operations for many statistical
           software products.
              All the statistical methods explained in this book are illustrated with real-life
           problems. The real datasets used in the book examples and exercises are stored in
           EXCEL files. They are described in Appendix E and included on the book CD.
           Dataset names correspond to the respective EXCEL file names. Variable identifiers
           correspond to the column identifiers of the EXCEL files.
              There are also many datasets available through the Internet which the reader
           may find useful for practising the topics taught. We particularly recommend the
           datasets of the UCI Machine Learning Repository
           (http://www.ics.uci.edu/~mlearn/MLRepository.html). In these (and other) datasets
           data is presented in text file format. Conversion to EXCEL format is usually
           straightforward since EXCEL provides means to read in text files with several
           types of column delimitation.
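              As a rough illustration of what reading a delimited text file into a data matrix
           involves, the following Python sketch parses a small text sample. The sample
           contents, the “;” delimiter, the “?” missing-value tag and the function name are
           all assumptions made for this example; actual files may use other delimiters
           and tags.

```python
import csv
import io

# Hypothetical contents of a small text-format dataset: one case per line,
# values separated by ";", with "?" assumed to mark a missing value.
raw_text = "age;weight;class\n34;71.5;A\n?;64.0;B\n51;?;A\n"

def read_data_matrix(text, delimiter=";", missing_tag="?"):
    """Return (variable names, rows), with missing tags replaced by None.

    Rows correspond to cases and columns to variables; values are kept
    as strings, so any numeric conversion is left to the caller.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # first line holds the variable identifiers
    rows = [[None if cell == missing_tag else cell for cell in row]
            for row in reader]
    return header, rows

names, matrix = read_data_matrix(raw_text)
print(names)
for case in matrix:
    print(case)
```

              Passing a different delimiter argument (for instance "," or "\t") would handle
           other common column delimitations in the same way.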



           1.8 Software Tools

           There are many software tools for statistical analysis, covering a broad spectrum of
           possibilities. At one end we find “closed” products where the user can only