Page 146 - Data Architecture

Chapter 4.3: Parallel Processing
Fig. 4.3.8 shows the parsing of nonrepetitive data.

Fig. 4.3.8 Parsing nonrepetitive data.

The parsing of nonrepetitive data is an entirely different matter from the parsing of
repetitive data. In fact, the parsing of nonrepetitive data is often referred to as textual
disambiguation. There is much more to the reading of nonrepetitive data than merely
parsing it.

However it is done, nonrepetitive data are read and turned into a form that can be
managed by a database management system.

There is a very good reason why nonrepetitive data require much more than a parsing
algorithm. The reason is that context in nonrepetitive data hides in many complex
forms. For that reason, textual disambiguation is usually done external to the
nonrepetitive data in big data. (In other words, because of the inherent complexity of
nonrepetitive data, textual disambiguation is done outside of the database system that
manages big data.)

A related issue to parallel processing in the big data environment is the efficiency
of queries. As seen in Fig. 4.3.6, when a simple query is run against big data, the entire
set of data contained in big data must be parsed. Even though the data are managed in
parallel, such a full scan of the data consumes many machine resources.
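
The cost of such a full scan can be illustrated with a minimal sketch. The partition
layout, the "key=value" record format, and the function names below are assumptions made
for illustration only; they are not taken from any particular big data product. The point
is that even when the partitions are scanned in parallel, every record is still parsed to
answer a single simple query.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: three partitions of raw "key=value;..." records.
PARTITIONS = [
    ["id=1;city=Denver", "id=2;city=Austin"],
    ["id=3;city=Denver", "id=4;city=Boston"],
    ["id=5;city=Austin", "id=6;city=Denver"],
]

def parse(record):
    """Parse one raw record into a dict of fields."""
    return dict(field.split("=", 1) for field in record.split(";"))

def scan_partition(partition, field, value):
    """Parse every record in one partition; return (matches, records_parsed)."""
    matches = []
    parsed = 0
    for record in partition:
        row = parse(record)
        parsed += 1  # each record is parsed whether or not it matches
        if row.get(field) == value:
            matches.append(row)
    return matches, parsed

def full_scan_query(partitions, field, value):
    """Run the query on all partitions in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(pool.map(lambda p: scan_partition(p, field, value), partitions))
    matches = [row for part_matches, _ in results for row in part_matches]
    total_parsed = sum(parsed for _, parsed in results)
    return matches, total_parsed

matches, total_parsed = full_scan_query(PARTITIONS, "city", "Denver")
print(len(matches), "matches;", total_parsed, "records parsed")
```

In this sketch the parallelism shortens elapsed time, but the total work stays the same:
the number of records parsed equals the number of records stored, no matter how few of
them satisfy the query.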

An alternate approach is to scan the data once and create a separate index. This approach