Comparing support vector machine and multinomial
naive Bayes for named entity classification of South
African languages
W. Fourie
Centre for Text Technology
North-West University, Potchefstroom Campus
Potchefstroom, South Africa
wildrich.fourie@nwu.ac.za
J.V. Du Toit & D.P. Snyman
School for Computer, Statistical and Mathematical Sciences
North-West University, Potchefstroom Campus
Potchefstroom, South Africa
{tiny.dutoit; dirk.snyman}@nwu.ac.za
Abstract—In this study, two classical machine learning
algorithms, multinomial naive Bayes and support vector
machines, are compared when applied to named entity
recognition for two South African languages, Afrikaans and
English.
The definition of a named entity was based on previous
definitions and deliberations in literature as well as the intended
purpose of classifying sensitive personal information in textual
data. For the purpose of this study, the best algorithm should be
able to deliver accurate results while requiring the least amount
of time to train the classification model. A binary nominal class was selected for the classifiers and the standard implementations of the algorithms were utilised; no parameter optimisation was done.
All the models achieved remarkable results in both ten-fold cross-validation and independent evaluations, with the support vector machine models outperforming the multinomial naive
Bayes models. The multinomial naive Bayes models, however,
required less time to train and would be more suited to low
resource implementations.
Keywords—binary class; cross-domain; named entity
classification; multilingual; multinomial naive Bayes; support
vector machines
I. INTRODUCTION
Digital textual data resources for South African languages are scarce compared to internationally available corpora [[1], [2], [3]]. In a bid to address this issue, the South African
Government’s Department of Arts and Culture (DAC) funded
and launched the National Centre for Human Language
Technologies’ (NCHLT) Resource Management Agency
(RMA; [4], [5]). The centre is modelled on similar centres internationally and provides a sustainable step towards supplying resources for research and development in Human Language Technology (HLT). The aim of the centre is to
provide a centralised platform for the distribution of Natural
Language Processing (NLP) resources such as text and audio
corpora [5]. One problem faced by such centres is the
anonymisation of private information contained in data sourced
from private companies, organisations and publishing houses.
During anonymisation, private and personal information
such as telephone numbers, addresses (residential, postal, e-
mail), values of currency and named entities (NEs) are
removed or replaced with predefined or generated information.
This is done to protect the individual or organisation from
attempts to derive the information by examining the publicly
published corpus. While numbers and addresses are easily identified using regular expressions and lists, the classification of NEs is a more challenging problem, given the plethora of organisations, names, surnames and other subjective entities such as, for example, president, colonel, health ministry, Autshumato project, Mona Lisa and Jurassic period.
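To illustrate this kind of pattern matching, a minimal sketch is given below. The patterns are simplified, hypothetical examples written for this discussion; they are not the rules used in this study or by any of the tools it mentions.

import re

# Hypothetical, simplified patterns for regex-detectable fields; a real
# anonymiser would need locale-specific rules and curated lists.
PATTERNS = {
    "phone":  re.compile(r"\b0\d{2}[ -]?\d{3}[ -]?\d{4}\b"),        # e.g. 012 555 5555
    "email":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "amount": re.compile(r"\bR\s?\d+(?:[ ,]\d{3})*(?:\.\d{2})?\b"),  # rand amounts
}

def find_structured_fields(text):
    """Return (label, match) pairs for the easily regex-detectable fields."""
    return [(label, m.group()) for label, rx in PATTERNS.items()
            for m in rx.finditer(text)]

print(find_structured_fields("Call 012 555 5555 or mail info@globalcorp.co.za"))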
Machine learning techniques have shown acceptable to remarkable results in NE classification [[6], [7], [8], [9],
[10], [11]], although Nadeau and Sekine [12] argue that
comparisons between results are difficult due to differences in
evaluation techniques. This study seeks to report the results of
applying two specific classification algorithms for NE
classification, with the aim of anonymisation, on two South
African languages. The article is organised as follows: a brief
overview of similar investigations is given in Section II,
followed by the experimental setup in Section III. Results from
the experiments are presented in Section IV and finally in
Section V, some conclusions are drawn.
II. RELATED WORK
Information extraction (IE) is the extraction of useful
information from raw data sets in order to aid in decision-
making and the automation of certain processes [13]. This
varied field includes disciplines such as image recognition, text
classification, biomedical classification and data mining. This
study will focus on one specific branch of text classification
known as named entity recognition and classification (NERC).
The aim of NERC systems is to recognise and classify predefined textual units, referred to as NEs [[7], [14], [15], [16], [17]]. The identified units are assigned to predefined NE classes and marked up accordingly. The sentence
“Mr. Kroon, from GlobalCorp, can be contacted directly at
012 555 5555.”
can be classified as
“<person>Mr. Kroon</person>, from <organisation>GlobalCorp</organisation>, can be contacted directly at <number>012 555 5555</number>.”
The tag set indicates a person, an organisation and a number. For removing confidential information from texts, classified units can be replaced by blank or randomised values from the same class (person, organisation and number).
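This replacement step can be sketched as follows; the tag names mirror the example above and the replacement pools are invented placeholders, not values used in the study.

import random
import re

# Invented placeholder pools; a real system would draw from curated lists.
REPLACEMENTS = {
    "person": ["Mr. Smith", "Ms. Dlamini"],
    "organisation": ["ExampleCorp", "ACME Ltd"],
    "number": ["011 123 4567"],
}

def anonymise(tagged_text):
    """Replace each <class>...</class> span with a random value of that class."""
    def swap(match):
        return random.choice(REPLACEMENTS[match.group(1)])
    return re.sub(r"<(person|organisation|number)>.*?</\1>", swap, tagged_text)

print(anonymise("<person>Mr. Kroon</person>, from <organisation>GlobalCorp"
                "</organisation>, can be contacted directly at "
                "<number>012 555 5555</number>."))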
The term NE was first defined by the sixth Message
Understanding Conference (MUC) in 1995 [[7], [12], [18],
[19]] and expanded for the seventh MUC [21]. The aim of the
NE shared tasks of the MUC-6 and MUC-7 conferences, in 1995 and 1997 respectively, was to task several teams with the NERC of supplied data sets. For the tasks, a structured
definition of an NE was provided together with training and
testing data as well as evaluation metrics [[12], [19], [21], [22],
[23]]. Marrero et al. [7] note that most current NERC systems are built on the basis for NEs laid out by the MUC shared tasks.
Puttkammer [10] details the only attempt at NERC for a
South African language (Afrikaans), aided by the use of
gazetteers [[11], [18]]. His hybrid system achieved an F1-
measure of 0.9474. The survey by Nadeau and Sekine [12] is recommended for further reading on the history and scope of NERC research. A recent and thorough overview of NERC research is provided in [7], which also discusses key faults of previous investigations.
Next, the experimental setup is explained by detailing the
definition of an NE, the algorithm selection, corpora used,
experimental toolkit and configuration as well as the evaluation
criteria.
III. EXPERIMENTAL SETUP
A. Definition of a named entity
The MUC defines the NE task as follows: “The Named
Entity task consists of three subtasks (entity names, temporal
expressions, number expressions). The expressions to be
annotated are ‘unique identifiers’ of entities (organisations,
persons, locations), times (dates, times), and quantities
(monetary values, percentages)” [[21], [22], [23]]. A set of
words and numbers representing a duration or point in time is
defined as a temporal expression. The definitions of Alhelbawy and Gaizauskas [24] and of Puttkammer [10] were based closely on the MUC definition. Although the MUC shared tasks delivered a reusable basis for the definition of an NE, multiple versions and deviations exist in previous work.
Borrega et al. [19] attribute the variation in NE definitions to the practical restrictions under which each NERC system must be implemented. The evolution of the definition to
suit the domain and purpose is evident in the literature and is,
according to Marrero et al. [7], “the only one constant” in the
aim to define an NE.
As with the MUCs, the definition used here is based on its intended purpose; additions to the definition are based on examination of the corpus. With the aim of identifying sensitive information, this study defines NEs as phrases that contain the names of persons, organisations, locations, times and quantities [[20], [23]]. The definition includes official status (president,
general, colonel), non-profit organisation (NPO) names, laws,
acts, product names, public holidays, seasons, scientific
measures, titles, government departments and forms,
educational institutions and courses, language names, past or
ongoing project names, denominators and values of currency,
dates in written and decimal form, telephone numbers, ID
numbers, any addresses (e-mail, website, residential, business,
home), and quantities. General knowledge terms and readily available information were not included in the NE definition. The following entities did not reveal specific
information in this domain: names of plants, animal and bug
species, scientific names; and general directions (north, east,
south, west).
A single NE constitutes the longest possible sequence of words that can be viewed as a single entity. For example, the sequence “14 Boom Street, Klerksdorp, South Africa” is recognised as a single NE since it describes a single entity.
Although most temporal expressions could be handled sufficiently in practice using language-specific regular expressions [19] and gazetteers [[11], [18]], the combination of these
techniques with an automated classification system could
improve the accuracy of an anonymisation system. This
definition forms a basis for the intended purpose of building a
working NERC system to annotate textual resources in the
English and Afrikaans languages.
Next, the selection of classification algorithms is discussed.
B. Algorithm selection
The support vector machine (SVM) is considered one of the most accurate general-purpose classifiers for pattern recognition, but
can be computationally expensive when faced with very large
data sets [25]. The technique was first proposed by Vapnik et al. [26] and further developed by Cortes and Vapnik [27]. SVMs do not rely on probabilities to build a classification system. Instead, data points are represented as high- but finite-dimensional vectors and assigned to one of two classes [[26], [27], [28]].
For the p-dimensional vectors an optimal (p-1)-dimensional
hyperplane is sought, one which maximises the distance or
margin between the different classes [[25], [26], [27]]. The
vectors that best define the separation of the classes are
designated as the support vectors and the optimal separating
hyperplane function is defined by these support vectors. Slack
variables as well as the kernel trick are applied when the data
cannot be separated linearly [[25], [28]]. The SVM algorithm
has been selected since classification only makes use of the
limited number of support vectors identified during the training
of the system. A small corpus might be enough to build a
functional and competitive classification system. The ability of
SVMs to generalise easily might make them adaptable between
domains and languages.
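For concreteness, the optimisation underlying this description can be stated in its standard textbook form (following [27]; the notation below is the conventional one and is not taken from this paper):

minimise (1/2)‖w‖² + C Σᵢ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0,

for training vectors xᵢ with class labels yᵢ ∈ {−1, +1}. The margin being maximised is 2/‖w‖, the slack variables ξᵢ absorb points that cannot be separated linearly, and a new point x is classified by the sign of w · x + b, a function determined entirely by the support vectors.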
Zhang [8] states that the naive Bayes (NB) type of Bayesian
network has delivered “surprisingly good classification
performance”, a belief supported by McCallum and Nigam
[29]. Traditionally, two first-order probabilistic naive Bayes
assumption-based models are used: the multivariate Bernoulli
model and the multinomial naive Bayes (MNB) model [29].
The multivariate Bernoulli model is based on the occurrence of
a text unit in a textual resource (document, paragraph,
sentence); the frequency and order of occurrences are not
considered, only whether a text unit is present or not. The
multinomial model is also not concerned with the order of the
text units in the resource, but it does include the frequencies of
occurrences. McCallum and Nigam [29] have demonstrated
that the multivariate Bernoulli model fares better for small
vocabularies but is outperformed by the multinomial model
before a vocabulary of 1000 words is reached. The multinomial
model also fares better with classifying text units that vary in
length. A formal definition of the naive Bayes probability
equation for NERC purposes is given in [30]. Informally, the probability that the currently inspected “word” (or sequence of words) is an NE is proportional to the prior probability of NEs in the text, multiplied by the product of the probabilities of each of its words occurring in an NE. Similar to SVMs, MNB algorithms have shown
remarkable results using small corpora for classification. The
MNB algorithm, however, is not as computationally complex as the SVM algorithm and is well suited to practical implementations.
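Restated in the standard notation of the event models in [29] (the formulation below is the textbook decision rule, given here for clarity rather than quoted from [30]):

P(NE | w₁, …, wₙ) ∝ P(NE) × P(w₁ | NE) × … × P(wₙ | NE),

with an analogous expression for the non-NE class. Here P(NE) is the prior probability of an NE and, in the multinomial model, each P(wᵢ | NE) is estimated from the frequency with which the word wᵢ occurs in NE-labelled units; the unit is assigned to whichever class yields the larger value.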
The MNB and SVM machine learning algorithms have
been shown to deliver reasonably acceptable classification
results while using minimal textual resources [[2], [6], [8],
[29]].
In the next section, the corpora and their attributes are discussed.
C. Corpora
Two separate data sets were obtained; the first a parallel
corpus of Afrikaans and English texts and the second an
annotated Afrikaans word corpus. The official ISO 639-3
language code [31] and ISO 3166-1 country code [32]
combinations for South African English (ENG-GB) and
Afrikaans (AFR-ZA) were used. The first corpus was provided
in 233 separate AFR-ZA and ENG-GB documents, aligned on
sentence level. The second corpus was provided in a single
comma-separated values (CSV) document. Each line contained a word and a Boolean value; the words followed in their original order from the government-domain texts.
The first corpus originated from a local magazine which
publishes in several languages. All of the separate documents
for each language were merged, in parallel, into two sentence-
aligned documents. Automatic annotation methods were first
used to retrieve an initial gazetteer from the given text. The
automatically annotated texts and gazetteers were then checked
by a native speaker of each language. The languages
were not checked in parallel although several similarities
existed such as person names, numbers, business and location
names. The revised gazetteers were then used to classify the
NEs in the original texts. This bootstrapping process was
repeated iteratively until all noticeable and discernible NEs
were classified. The annotated corpus is therefore not assumed to be of Gold Standard quality [33].
The Stanford Named Entity Recogniser (SNER, [34]) and
Autshumato Text Anonymiser (ATA, [35]) were used to
automatically annotate the corpus before the first iteration. The
SNER annotation used the supplied 7-class MUC, 4-class
CoNLL (defined by the Conference on Natural Language
Learning [20]) and 3-class combined models. The flexible
nature of the ATA application allowed the inclusion of
language specific (and non-specific) lists and rules for
classification. Currently the ATA application does not utilise
any machine learning model in the classification process; it
relies on user-supplied gazetteers and customisation of the
rules. The data was annotated incrementally, with each entity not recognised in a given pass being included in the custom lists for the next annotation iteration. Finally, the automatically
annotated sentence-level English and Afrikaans documents
were checked manually, and any entities falsely classified or
not classified, were corrected.
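A single pass of such gazetteer-based annotation can be sketched as follows; this is a minimal greedy longest-match illustration, not the exact matching procedure used by ATA or in this study.

def annotate_with_gazetteer(tokens, gazetteer, max_len=7):
    """Tag the longest gazetteer phrase starting at each position (one
    bootstrapping pass); unmatched tokens keep the 'O' (outside) tag."""
    tags, i = ["O"] * len(tokens), 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in gazetteer:
                tags[i:i + n] = ["NE"] * n
                i += n
                break
        else:  # no gazetteer phrase starts at position i
            i += 1
    return tags

gaz = {"Mr. Kroon", "GlobalCorp"}
print(annotate_with_gazetteer("Mr. Kroon works at GlobalCorp .".split(), gaz))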
At this stage it was noted that the number of classified NEs
differed between the Afrikaans and English texts. This could be
attributed to errors during translation or annotation. The longest
combination of words that represented an NE for both English
and Afrikaans was seven words. The annotated data was then
processed by splitting the texts into word n-gram windows
between 3-grams and 7-grams and outputting separate
documents for each language and n-gram. Three-gram
windows were chosen as the lowest granularity since they can
already be considered too small to sufficiently include the
context around a word [14]. Up to 7-gram granularity was
chosen since the longest single NE found in the data consisted
of seven words. Additionally, word-separated and sentence-
separated documents were created. Duplicates were not removed from any of the data sets so as not to distort the occurrence frequency, which should aid in disambiguation.
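The windowing step can be sketched as follows. Whether the study used overlapping (sliding) or disjoint windows is not stated; the sliding-window reading below is an assumption made for illustration.

def ngram_windows(tokens, n):
    """Produce word n-grams from a token sequence (assumed sliding
    windows), as used to build the 3-gram to 7-gram data sets."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Mr. Kroon can be contacted directly".split()
for n in range(3, 8):
    print(n, ngram_windows(tokens, n))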
Table III indicates the number of instances per language for
each class: textual units containing NEs and units not
containing NEs. The number of instances for each language differed at every level of granularity, indicating that for this data set not all NEs mapped directly across the languages, although many similarities did occur.
Next the experiment toolkit and implementations of the
algorithms are discussed.
D. The WEKA toolkit
The WEKA toolkit [[30], [36], [37]] was used in
conducting the experiments using the supplied implementations
of the MNB and SVM algorithms. The WEKA implementation
of an SVM classifier is applied through Platt’s Sequential
Minimal Optimisation (SMO) algorithm [36], which breaks up
the large complex quadratic programming (QP) problem posed
by SVM training [27] into smaller, more easily computable QP
problems [38].
The data was converted, with the aid of the WEKA toolkit, using a string-to-word vector filter. Each word found in the data is defined as an attribute, and the strings are converted to numeric arrays in which each value maps to one of these word attributes. The word attributes were not lowercased or balanced. When attributes are balanced, the frequency of occurrence is removed so as not to skew the model towards one particular class per instance. As several words are shared
among NEs and non-NEs, an unbalanced approach is required
for accurate classification. For example, consider the following
sentence: “Mr. Ward was allowed to visit the children’s ward.”
A person might have the surname Ward, a word which also refers to a specific room in a hospital. By removing separate occurrences
of the word “ward”, the instance of the word as a surname
would also be removed, resulting in the misclassification of
Mr. Ward instances.
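The effect of the case-sensitive conversion can be illustrated with an analogous scikit-learn sketch; the study itself used WEKA's string-to-word vector filter, and scikit-learn is substituted here purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer

# Case is preserved (lowercase=False) and raw term frequencies are kept,
# mirroring the unbalanced, case-sensitive setup described above: "Ward"
# and "ward" become distinct word attributes.
texts = ["Mr. Ward was allowed to visit the children's ward."]
vec = CountVectorizer(lowercase=False)
X = vec.fit_transform(texts)
print(dict(zip(vec.get_feature_names_out(), X.toarray()[0])))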
The techniques and metrics used to evaluate the algorithms
are discussed next.
E. Evaluation
The converted data sets for each combination of language
and granularity were fed to the WEKA toolkit and trained on
both the SVM and MNB algorithms using the default
parameters. The results were evaluated using a stratified ten-fold cross-validation test, producing a confusion matrix for each set. An explanation of a confusion matrix is given in Table I. True Positive (TP) is the number of units (n-gram, sentence, word) containing one or several NEs that were classified as containing NEs. False Positive (FP) is the number of units not containing NEs that were nevertheless classified as containing NEs. False Negative (FN) is the number of units containing NEs that were not classified as such, and True Negative (TN) is the number of units without NEs that were correctly classified as such.
TABLE I. CONFUSION MATRIX

Model \ Actual            Contains NE(s)        Does not contain NE(s)
Contains NE(s)            True Positive (TP)    False Positive (FP)
Does not contain NE(s)    False Negative (FN)   True Negative (TN)
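To make this training-and-evaluation loop concrete, a minimal analogous sketch is given below using scikit-learn rather than the WEKA toolkit that was actually used; the texts and labels are invented placeholders, and LinearSVC stands in for WEKA's SMO-based SVM.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder 3-gram-like strings with the binary contains-NE class.
texts = ["Mr. Kroon from GlobalCorp", "can be contacted directly",
         "012 555 5555 is the", "the weather was mild"] * 30
labels = [1, 0, 1, 0] * 30

for clf in (MultinomialNB(), LinearSVC()):
    pipe = make_pipeline(CountVectorizer(lowercase=False), clf)
    # Stratified ten-fold cross-validation, reporting the mean F1-measure.
    scores = cross_val_score(pipe, texts, labels, cv=10, scoring="f1")
    print(type(clf).__name__, round(scores.mean(), 3))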
The WEKA toolkit reports classification, n-fold cross-validation and independent test set evaluations in the industry-accepted precision, recall and F1-measures [[1], [2], [6], [10], [14], [39]], as originally defined in [40]. The formulas for recall and precision are given in (1) and (2). The F1-measure (3) is the weighted harmonic mean of recall and precision; an equal weighting is used in this study. In the case of n-fold cross-validation, the results of the iterations are averaged into the final results [36].
Recall (R) = TP / (TP + FN); (1)
Precision (P) = TP / (TP + FP); and (2)
F1-measure = 2(R × P) / (R + P). (3)
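Formulas (1) to (3) translate directly into code; the counts in the example call are illustrative only, not values from this study.

def prf(tp, fp, fn):
    """Compute recall (1), precision (2) and the equally weighted
    F1-measure (3) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

print(prf(tp=3900, fp=40, fn=60))  # illustrative counts only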
A statistical significance comparison was done between the 3-gram MNB and SVM models for the AFR-ZA corpus, utilising the Experimenter from the WEKA toolkit. The modified t-test evaluation method used is referred to by Bouckaert et al. [36] as the “corrected resampled T-Test”.
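For reference, the corrected resampled t-statistic, due to Nadeau and Bengio and implemented in the WEKA Experimenter, has the form (restated here from their published work rather than quoted from [36]):

t = d̄ / √((1/k + n₂/n₁) · σ̂d²),

where d̄ and σ̂d² are the mean and variance of the per-run differences in the evaluation metric over k resampling runs, and n₁ and n₂ are the training and test set sizes. The n₂/n₁ term corrects for the overlap between training sets across runs, which would otherwise overstate significance.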
IV. RESULTS
The time taken to train each model is given in Table II. These values are not precise, as various background processes can influence the time required to train a model. Problems with this measure become evident when the times are compared with the number of entities contained within each data set, shown in Table III. The 3-gram, 4-gram and 5-gram AFR-ZA data sets contained a total of 63254, 59283 and 55490 entities respectively, while showing fluctuations in the MNB times and increases in the SVM times. For the ENG-GB data, the MNB models recorded results similar to those for the AFR-ZA data, whereas the SVM models showed some drastic differences for the 6-gram and 7-gram data. Although the fluctuation of times indicates noticeable inaccuracy in the time-to-train measure, clear differences can still be observed between the investigated algorithms, which is sufficient to draw broad conclusions about the time required to train on a specific data set.
The results of the language/granularity evaluation are
shown in Table IV. The accuracy of the MNB and SVM
models decreased as the granularity level increased. The
biggest decline was noticed in the Afrikaans models; the SVM
declined from a 0.994 to 0.992 F1-measure and the MNB from
a 0.988 to 0.978 F1-measure, a difference of 0.002 and 0.010
respectively. Across the granularity levels, the worst (although
adequate) results were obtained from the word-level and
sentence-level models, again for Afrikaans. The SVM word
model achieved an F1-measure of 0.909 and the sentence
model an F1-measure of 0.923; with 0.904 and 0.913 for MNB
words and sentences. The SVM models fared better than the
MNB models across all of the granularity levels, although the
differences seem marginal. The SVM models required extensive computational resources and time to complete training, whereas the MNB algorithm delivered excellent results using minimal time to build the models.
In Afrikaans, the SVM models outperformed the MNB
models and the best SVM models were the 3-gram to 5-gram
models each obtaining an F1-measure of 0.994, which is quite
remarkable. The best results for the English data are the 3-gram
to 6-gram SVM models, each with an F1-measure of 0.995,
which is 0.6% better than the best MNB result, the 3-gram
model. The results of both the MNB and SVM models are
almost mirrored for both languages – which might be an
indication of the similarity of NEs found between the two
languages.
Taking all of the previous results into consideration, the 3-gram models are deemed the most suitable, delivering the best or equal-best F1-measure while requiring the least amount of training time for the SVM algorithm. Although
the 3-gram models have more class instances than other n-gram
models (Table III), the instances are shorter and less expensive
to convert and train. Based on these deliberations, the 3-gram
models are chosen as the most accurate and are used for the
independent test. The word-level models are also included since they delivered adequate results while requiring the least amount of training time, and can deliver a practical classifier.
The results from the granularity/language test are
suspiciously high: 99.5% for the best SVM model and 98.8%
for the best MNB model. To verify the accuracy of the results,
an independent test was conducted; the trained MNB and SVM
models for the 3-gram and word level AFR-ZA models were
used and evaluated on the annotated, government domain
corpus. The results of the experiments for each of the training
algorithms and data sets are also given in Table IV.
The MNB narrowly outperformed the SVM model and
achieved an F1-measure of 0.894 as opposed to 0.893 for the 3-
gram model. This model could thus be applied effectively across the two separate Afrikaans domains. The speed at which the
model could be trained also enables the use of this machine
learning algorithm in instantly re-trainable systems. It should
be noted that the word model also delivered surprisingly good
results, indicating that although the use of gazetteers can
greatly speed up the annotation process and aid in
classification, their explicit use is not required. A model trained on data annotated by means of gazetteers was able to accurately identify NEs in another data set without the use of the annotation gazetteers.
The results from the statistical significance test between the AFR-ZA 3-gram MNB and SVM models indicated that the difference in favour of the SVM model is statistically significant.
V. CONCLUSION
This study aimed to compare two statistical machine
learning algorithms at the task of identifying NEs contained in
textual resources for two South African languages, English and
Afrikaans. A binary nominal class was selected; a model need only determine whether an investigated textual unit is an NE or not. The algorithm must be expandable
to other domains, and not depend on language-specific
linguistic rules and definitions. The definition of an NE was
based on previous relevant definitions and expanded to include
occurrences in the domain-specific data.
Owing to the scarceness of aligned multilingual data for
South African languages, the choice of the domain was
necessitated by the availability of the data. A parallel aligned
English-Afrikaans magazine article corpus was obtained, as
well as an annotated Afrikaans corpus in the government
domain. The parallel corpus, originally in separate parallel
documents, was converted and annotated using an iterative,
bootstrap technique. Several data sets were produced from this
corpus to evaluate the best granularity to use when classifying
unknown text segments.
The choice of algorithms was based on their ability to suit
the restrictions in training data as well as previously reported
results. The SVM models only slightly outperformed the MNB models across all granularity levels and both languages. Because the SVM is computationally expensive, it would be suited to instances where the NERC system uses a fixed, pre-trained model. The MNB models delivered results nearly as high as the SVM
models with less time required to train the models. The word
and sentence models achieved reasonable results, and MNB
word and sentence models could easily be implemented as low-
resource, re-trainable, early NE detectors that could quickly
scan an incoming text. More accurate and expensive NERC
systems could then be launched if an NE was detected. The information contained within the grammatical structure is well preserved by the n-gram models, which deliver higher results.
For the practical application of anonymisation of private
information in textual resources, an MNB re-trainable 3-gram
model, with the assistance of gazetteers, will be used. The
MNB models deliver excellent results and use far fewer resources than their SVM counterparts, which allows them to be easily retrained on recently classified data.
This study was limited to focusing on two similar South
African languages. Studies of other related languages would
more clearly indicate cross-lingual adaptability of the
algorithms. The specific NE definition required for any study
limits its comparability to other similar systems and is reported
to limit these models to certain domains. Although a host of
other multilingual NERC approaches exist, remarkable results can be obtained with a good definition, an adequate corpus and classical classification algorithms such as SVM and MNB. It would also be interesting to extend this study to include more South African languages, especially those that share similarities. The development of NERC systems for all of
the South African languages could assist in building useful
annotated corpora for natural language processing and human
language technology research.
ACKNOWLEDGMENT
We wish to express our gratitude to Dr. Martin Puttkammer
and Dr. Roald Eiselen for their expert advice as well as to the
Centre for Text Technology (CTexT®) for providing the data.
REFERENCES
[1] D.P. Snyman, G.B. Van Huyssteen and W. Daelemans, “Cross-Lingual
Genre Classification for Closely Related Languages,” in Proc. PRASA,
2012, pp. 133-137.
[2] D.P. Snyman, G.B. Van Huyssteen and W. Daelemans, “Automatic
Genre Classification for Resource Scarce Languages,” in Proc. PRASA,
2011, pp. 132-137.
[3] A. Grover, G.B. Van Huyssteen and M. Pretorius, “The South African
human language technology audit,” Language Resources and
Evaluation, vol. 45, no. 3, 2011, pp. 271-288.
[4] CTexT (Centre for Text Technology). (2012). Resource Management
Agency Newsletter 1 of 2012 [Online]. Available:
http://rma.nwu.ac.za/images/stories/pdfs/News.RMA.Newsletter.1.0.1.M
HM.2012-12-11.pdf
[5] M. Muller. (2012, March 26). Good news for South African languages
[Online]. Available: http://www.researchsa.co.za/news.php?id=1053
[6] N. Jahan and S. Morwal, “Named Entity Recognition in Indian
languages: a survey,” Int. J. Engineering Sciences and Research
Technology, vol. 2, no. 4, 2013, pp. 925-929.
[7] M. Marrero, J. Urbano, S. Sánchez-Cuadrado, J. Morato and J.M.
Gómez-Berbís, “Named Entity Recognition: Fallacies, challenges and
opportunities,” Computer Standards & Interfaces, vol. 35, no. 5, 2013,
pp. 482-489.
[8] H. Zhang, “The optimality of naive Bayes,” in Proc. 7th Int. Florida
Artificial Intelligence Research Society (FLAIRS) Conf., AAAI, 2004,
pp. 3-9.
[9] X. Ma, “Toward a name entity aligned bilingual corpus,” in Proc.
LREC, 2010, pp. 17-23.
[10] M.J. Puttkammer, “Automatic Afrikaans tokenisation,” M.A.
dissertation, School of Languages, North-West Univ., Potchefstroom,
South Africa, 2006.
[11] A. Mikheev, M. Moens and C. Grover, “Named entity recognition
without gazetteers,” in Proc. 9th Conf. European chapter of the
Association for Computational Linguistics, ACL, 1999, pp. 1-8.
[12] D. Nadeau and S. Sekine, “A survey of named entity recognition and
classification,” Lingvisticae Investigationes, vol. 30, no. 1, 2007, pp. 3-
26.
[13] D. Jurafsky and J.H. Martin, Speech & language processing: an
introduction to natural language processing, computational linguistics,
and speech recognition, Prentice Hall, 2000.
[14] R. Al-Rfou and S. Skiena, “SpeedRead: A Fast Named Entity
Recognition Pipeline,” arXiv preprint arXiv:1301.2857, 2013.
[15] H.N. Goh, L.K. Soon and S.C. Haw, “Automatic identification of
protagonist in fairy tales using verb,” Advances in Knowledge
Discovery and Data Mining, P. Tan, S. Chawla, C.K. Ho and J. Baily
eds., Springer Berlin, 2012, pp. 395-406.
[16] M. Marcińczuk and M. Janicki, “Optimizing CRF-based model for
proper name recognition in Polish texts,” in Proc. Computational
Linguistics and Intelligent Text Processing, Springer, 2012, pp. 258-269.
[17] D.M. Nemeskey and E. Simon, “Automatically generated NE tagged
corpora for English and Hungarian,” in Proc. 4th Named Entity
Workshop, Association for Computational Linguistics (ACL), 2012, pp.
38-46.
[18] J. Nothman, N. Ringland, W. Radford, T. Murphy and J.R. Curran,
“Learning multilingual named entity recognition from Wikipedia,”
Artificial Intelligence, vol. 194, 2013, pp. 151-175.
[19] O. Borrega, M. Taulé and M.A. Marti, “What do we mean when we
speak about Named Entities,” in Proc. Corpus Linguistics Conference,
2007.
[20] E.F. Tjong Kim Sang and F. De Meulder, “Introduction to the CoNLL-
2003 shared task: Language-independent named entity recognition,” in
Proc. 7th Conf. on Natural Language Learning at HLT-NAACL 2003,
Association for Computational Linguistics, pp. 142-147.
[21] N. Chinchor and P. Robinson, “MUC-7 named entity task definition,” in
Proc. 7th Conference on Message Understanding (MUC-7), 1997.
[22] R. Grishman and B. Sundheim, “Message Understanding Conference-6:
A Brief History,” in Proc. COLING, Morgan Kaufman, 1996, pp. 466-
471.
[23] R. Grishman and B. Sundheim. (1995, March 21). Sixth Message
Understanding Conference (MUC-6): conference task definition
[Online]. Available:
http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_1.html
[24] A. Alhelbawy and R. Gaizauskas, “Named entity based document
similarity with svm-based re-ranking for entity linking,” in Proc.
Advanced Machine Learning Technologies and Applications, Springer,
2012, pp. 379-388.
[25] C.J. Van Heerden, “Efficient training of support vector machines and
their hyperparameters,” Ph.D. dissertation, School of Electrical,
Electronic and Computer Engineering, North-West Univ., Potchefstroom, South Africa, 2012.
[26] V.N. Vapnik, B.E. Boser and I.M. Guyon, “A training algorithm for
optimal margin classifiers,” in Proc. 5th Annu. Workshop on
Computational Learning Theory, ACM, 1992, pp. 144-152.
[27] C. Cortes and V.N. Vapnik, “Support-vector networks,” Machine
learning, vol. 20, no. 3, 1995, pp. 273-297.
[28] W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery,
Numerical Recipes: The art of scientific computing, 3rd ed. New York:
Cambridge University Press, 2007, pp. 883-898.
[29] A. McCallum and K. Nigam, “A comparison of event models for naive
bayes text classification,” in Proc. AAAI-98 workshop on learning for
text categorization, Madison, WI: Citeseer, 1998, vol. 752, pp. 41-48.
[30] W. Ertel, Introduction to artificial intelligence, N. Black ed., London,
UK: Springer, 2011, pp. 202-206.
[31] Codes for the representation of names of languages — Part 3: Alpha-3
code for comprehensive coverage of languages, ISO 639-3, 5 February,
2007.
[32] Codes for the representation of names of countries and their
subdivisions, ISO 3166-1 alpha-2, 1974.
[33] L. Wissler, M. Almashraee, D. Monett and A. Paschke, “The Gold
Standard in Corpus Annotation,” in Proc. IEEE Germany Student
Conference 2014 [Online]. Available: http://www.ieee-student-
conference.de/fileadmin/templateConf2014/images/papers/ieeegsc2014_
submission_3.pdf
[34] The Stanford Natural Language Processing Group. Stanford Named
Entity Recognizer (NER), ver. 1.2.8. Stanford, CA: Stanford University,
2013.
[35] CTexT. Autshumato Text Anonymiser (ATA), ver. 2.0.0.
Potchefstroom: North-West University, 2012.
[36] R.R. Bouckaert, E. Frank, M. Hall, P. Kirby, P. Reutemann, A. Seewald
and D. Seuse. (2013). WEKA Manual for Version 3-7-10 [Online].
Available:
http://ufpr.dl.sourceforge.net/project/weka/documentation/3.7.x/WekaM
anual-3-7-10.pdf
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I.H.
Witten, “The WEKA data mining software: an update,” ACM SIGKDD
explorations newsletter, vol. 11, no. 1, 2009, pp. 10-18.
[38] J.C. Platt, “Fast training of support vector machines using sequential
minimal optimization,” Advances in kernel methods, B. Schoelkopf, C.
Burges and A. Smola eds., MIT press, 1999, pp. 185-208.
[39] N. Kang, E.M. Van Mulligen and J.A. Kors, “Training text chunkers on
a silver standard corpus: can silver replace gold?,” BMC bioinformatics,
vol. 13, no. 1, 2012, pp. 17-22.
[40] C. Van Rijsbergen, Information retrieval, 2nd ed. London, UK:
Butterworth-Heinemann, 1979.
TABLE II. TIME TAKEN TO TRAIN EACH MODEL (IN SECONDS)

Dataset        MNB      SVM
AFR-ZA
3-gram         0.08     1200.34
4-gram         0.06     1367.74
5-gram         0.08     1445.65
6-gram         0.08     1348.62
7-gram         0.08     622.28
Words          0.00     90.17
Sentences      0.03     29.36
ENG-GB
3-gram         0.06     534.54
4-gram         0.08     698.04
5-gram         0.06     568.08
6-gram         0.06     952.43
7-gram         0.08     824.39
Words          0.00     77.48
Sentences      0.02     23.62
TABLE III. NUMBER OF INSTANCES PER LANGUAGE FOR EACH CLASS

              AFR-ZA                   ENG-GB                   AFR-ZA independent test
Granularity   NE     Not NE   Total    NE     Not NE   Total    NE     Not NE   Total
3-gram        4032   59222    63254    4133   61400    65533    4834   50620    55454
4-gram        4574   54709    59283    4683   56841    61524    -      -        -
5-gram        4906   50584    55490    5033   52635    57668    -      -        -
6-gram        5142   46748    51890    5278   48726    54004    -      -        -
7-gram        5284   43157    48441    5459   45034    50493    -      -        -
Words         521    7204     7725     469    6346     6815     2460   52997    55457
Sentences     985    2925     3910     923    2890     3813     -      -        -
TABLE IV. RESULTS FOR THE NAMED ENTITY RECOGNITION OF TWO LANGUAGES AND DIFFERENT GRANULARITIES

                                MNB                                 SVM
Language    Dataset     Precision   Recall   F1-measure   Precision   Recall   F1-measure
AFR-ZA      3-gram      0.988       0.988    0.988        0.994       0.994    0.994
            4-gram      0.986       0.985    0.985        0.994       0.994    0.994
            5-gram      0.983       0.983    0.983        0.994       0.994    0.994
            6-gram      0.983       0.983    0.983        0.993       0.993    0.993
            7-gram      0.978       0.977    0.978        0.992       0.992    0.992
            Words       0.934       0.934    0.904        0.938       0.936    0.909
            Sentences   0.912       0.914    0.913        0.927       0.926    0.923
ENG-GB      3-gram      0.988       0.988    0.988        0.995       0.995    0.995
            4-gram      0.986       0.986    0.986        0.995       0.995    0.995
            5-gram      0.984       0.983    0.983        0.995       0.995    0.995
            6-gram      0.981       0.980    0.981        0.995       0.995    0.995
            7-gram      0.979       0.978    0.978        0.994       0.994    0.994
            Words       0.933       0.933    0.903        0.939       0.936    0.910
            Sentences   0.922       0.923    0.922        0.930       0.930    0.928
AFR-ZA independent test
            3-gram      0.897       0.918    0.894        0.898       0.918    0.893
            Words       0.923       0.956    0.934        0.946       0.958    0.946