Figure 5.1 Proposed system architecture.
(a) Preprocessing
Because of the informal nature of the medium, tweets are
often unstructured. They do not follow the rules of grammar and
may contain uniform resource locators (URLs), abbreviations,
special characters, and punctuation. Before processing the
tweets further, we propose to eliminate RT (retweet) markers,
nonalphanumeric characters, URLs, and @ mentions from the
tweets [93]. Moreover, stopwords such as a, and, the, and so on
are removed, as these words do not contribute to sentiment. The
stopwords can be removed using the stopword list provided in
Python NLTK, excluding negations (“not,” “nor,” “neither”) and
pronouns. We suggest not eliminating pronouns because the
authors of Ref. [75] have shown that dejected people tend to use
first-person singular pronouns considerably more often than other
pronouns. We then carry out tokenization, i.e., segregate the
tweets into their constituent words using the NLTK tokenizer. We
enumerate all the words remaining after tokenization in the
training data set and use this list of words to represent each
tweet as a sequence of indices, as sketched below.
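A minimal sketch of this preprocessing pipeline in Python, assuming NLTK's English stopword list and tokenizer are available (the punkt and stopwords corpora must be downloaded); the retained pronoun subset and the function names here are illustrative, not the authors' exact implementation:

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize    # requires nltk.download("punkt")

# Keep negations and (illustrative subset of) pronouns, as proposed above
KEEP = {"not", "nor", "neither",
        "i", "me", "my", "mine", "myself",
        "we", "us", "our", "you", "your",
        "he", "she", "it", "they", "them"}
STOPWORDS = set(stopwords.words("english")) - KEEP

def clean_tweet(tweet):
    """Remove RT markers, @ mentions, URLs, and nonalphanumeric characters."""
    tweet = re.sub(r"\bRT\b", " ", tweet)                  # retweet marker
    tweet = re.sub(r"@\w+", " ", tweet)                    # @ mentions
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)   # URLs
    tweet = re.sub(r"[^A-Za-z0-9\s]", " ", tweet)          # special characters
    return tweet.lower()

def tokenize(tweet):
    """Segregate a cleaned tweet into words, dropping stopwords."""
    return [t for t in word_tokenize(clean_tweet(tweet)) if t not in STOPWORDS]

def build_vocab(train_tweets):
    """Enumerate all words remaining after tokenization of the training set."""
    vocab = {}
    for tweet in train_tweets:
        for token in tokenize(tweet):
            vocab.setdefault(token, len(vocab) + 1)         # index 0 left for padding
    return vocab

def encode(tweet, vocab):
    """Represent a tweet as a sequence of vocabulary indices."""
    return [vocab[t] for t in tokenize(tweet) if t in vocab]
```

The resulting index sequences are what the word encoding stage described next operates on.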
(b) Word encoding
Input to the word encoding system is a group of tokens T = [t₁,
t₂, …, tₓ], where x denotes the time step. The one-hot encoding of