Scalable Biomedical Named Entity Recognition: Investigation of a
Database-Supported SVM Approach
This paper explores the scalability issues associated
with solving the Named Entity Recognition (NER) problem
using Support Vector Machines (SVM) and high-dimensional
features and presents two implementations to address these
issues. The NER domain chosen for these experiments is
biomedical publications, selected for its
importance and inherent challenges. The performance results of
a set of experiments conducted using existing binary and multiclass
SVM with increasing training data sizes are examined and
compared to results obtained using our new implementations.
Our baseline machine learning approach uses no prior
language- or domain-specific knowledge and achieves good
out-of-the-box accuracy measures that are comparable to those
obtained using more complex approaches. The training time of
multi-class SVM is reduced by several orders of magnitude,
which would make support vector machines a more viable and
practical machine learning solution for real-world problems
with large datasets. The first implementation, SVM-PerfMulti,
is a new instantiation of SVM-Struct v3.0 built as a
standalone C executable. The second implementation, SVM-MultiDB,
is an embedded database solution for both binary
and multi-class SVM, built as a server-side extension of
PostgreSQL.
Index Terms
Named entity recognition, support vector machines, database extension,
bioinformatics.