Arabic nested noun compound extraction based on linguistic features and statistical measures
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: | Penerbit Universiti Kebangsaan Malaysia, 2018 |
Online Access: | http://journalarticle.ukm.my/13773/ http://journalarticle.ukm.my/13773/1/25313-76332-1-PB.pdf |
Summary: | The extraction of Arabic nested noun compounds is significant for several research areas, such
as sentiment analysis, text summarization, word categorization, grammar checking, and
machine translation. Much research has studied the extraction of Arabic noun compounds
using linguistic approaches, statistical methods, or a hybrid of both. Many of the
existing approaches concentrate on the extraction of bigram or trigram noun compounds.
Nonetheless, extracting a 4-gram or 5-gram nested noun compound is a challenging task due
to morphological, orthographic, syntactic, and semantic variations. Several features, such as
unit-hood, contextual information, and term-hood, have an important effect on the efficiency
of noun compound extraction. Hence, there is a need to improve the effectiveness of
Arabic nested noun compound extraction. This paper therefore proposes a hybrid of a linguistic
approach and a statistical method to enhance the extraction of Arabic nested
noun compounds. The pre-processing phases include transformation,
tokenization, and normalisation. The linguistic approach
consists of part-of-speech tagging and named-entity patterns, whereas the
statistical methods consist of the NC-value, NTC-value,
NLC-value, and a combination of these association measures. The results
demonstrate that the combined association measures outperform the NLC-value,
NTC-value, and NC-value in nested noun compound extraction, achieving 90%,
88%, 87%, and 81% for bigrams, trigrams, 4-grams, and 5-grams, respectively. |
---|
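
The summary does not give formulas for the NC-value, NTC-value, or NLC-value, so the following is only a rough illustration of how a statistical measure can score nested candidates of 2 to 5 tokens. It uses the C-value, the measure that the NC-value of Frantzi et al. extends with context information; whether the paper uses exactly this formulation is an assumption. The function names (`ngrams`, `is_nested`, `c_value`) and the placeholder tokens are hypothetical. The candidates are assumed to have already passed a linguistic filter such as the part-of-speech patterns mentioned in the summary.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=2, n_max=5):
    """Yield every contiguous n-gram (as a tuple) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def is_nested(short, long):
    """Return True if `short` occurs contiguously inside the strictly longer `long`."""
    k = len(short)
    return k < len(long) and any(
        long[i:i + k] == short for i in range(len(long) - k + 1)
    )


def c_value(freq):
    """Score each candidate in `freq` (tuple of tokens -> corpus frequency).

    Non-nested candidate:  log2(|a|) * f(a)
    Nested candidate:      log2(|a|) * (f(a) - mean frequency of the longer
                           candidates that contain a)
    This is the standard C-value, used here as an assumed stand-in for the
    measures named in the summary.
    """
    scores = {}
    for cand, f in freq.items():
        containers = [f_b for b, f_b in freq.items() if is_nested(cand, b)]
        if containers:
            f = f - sum(containers) / len(containers)
        scores[cand] = math.log2(len(cand)) * f
    return scores


# Hypothetical, already-filtered candidate counts (placeholder tokens).
example_freq = Counter({
    ("w1", "w2"): 5,
    ("w1", "w2", "w3"): 2,
    ("w0", "w1", "w2"): 1,
})
example_scores = c_value(example_freq)
```

In this sketch, a short candidate that mostly appears inside longer candidates is penalised, which is the property that makes C-value-style measures suitable for nested compounds. The candidate list itself could be produced with `ngrams(tokens)` over a tokenized, normalised sentence before filtering.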