CADIAL - Intelligent search engine based
on the EUROVOC thesaurus
The CADIAL search engine is an intelligent search engine
designed for structured document retrieval, with support
for morphological normalisation of Croatian. The
search engine can be used on documents in any language and
the morphological procedures for any other language can be
easily added. The search engine also supports categorised
documents and enables search procedures to use categories
and any other meta-information to further improve search performance.
Advanced custom filtering is also supported, thus enabling
the construction of custom-faceted search procedures. The
search engine is designed for medium- to large-scale applications
and has been tested to perform quickly and efficiently on over
half a million structured documents. The interface for the
CADIAL search engine is web-based and is applied to Croatian
legislative documents, although the search engine can be integrated
into any other type of system (a web service or a desktop application)
and can be deployed on any collection of documents.
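
The combination of full-text search with category and meta-information filters can be illustrated with a minimal sketch. This is not the CADIAL implementation; the `Document` class, the `search` function and the `category` and `year` fields are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Arbitrary meta-information, e.g. category or year (hypothetical fields).
    meta: dict = field(default_factory=dict)


def search(docs, query, **facets):
    """Return ids of documents that contain every query term and match every facet."""
    terms = query.lower().split()
    hits = []
    for doc in docs:
        words = set(doc.text.lower().split())
        if not all(term in words for term in terms):
            continue
        # A facet is an exact-match constraint on one meta-information field.
        if all(doc.meta.get(key) == value for key, value in facets.items()):
            hits.append(doc.doc_id)
    return hits


docs = [
    Document("d1", "law on public procurement", {"category": "law", "year": 2007}),
    Document("d2", "regulation on public roads", {"category": "regulation", "year": 2007}),
    Document("d3", "law on public health", {"category": "law", "year": 2005}),
]

print(search(docs, "public"))
print(search(docs, "public", category="law"))
print(search(docs, "public", category="law", year=2007))
```

Each additional facet narrows the result set, which is the basic idea behind a faceted search procedure.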
The CADIAL search engine (cadial.hidra.hr)
enables cross-language document retrieval based on the Eurovoc
thesaurus. The search engine is one of the results of the
CADIAL project (www.cadial.org)
carried out in co-operation with the Department of Computer
Science, Katholieke Universiteit Leuven, HIDRA and the Department
of Linguistics, Faculty of Humanities and Social Sciences.
KTN Indexing System - automatic document classification and management
The KTN Indexing System is a software package that provides
the capability of automatic text classification and management.
It is primarily aimed at media clippings from newspaper
and magazine scans, but can be used on any textual data. The
system uses state-of-the-art machine learning methods to train
automatic classifiers, as well as language-specific resources
in order to maximise classification quality. It also enables
semi-automatic classification assisted by human experts, and
uses active learning techniques to clean legacy databases.
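
One common active learning strategy is uncertainty sampling, in which the items the classifier is least sure about are sent to a human expert first. The sketch below assumes this strategy purely for illustration; the text does not state which technique the KTN Indexing System uses, and the function name, clipping ids and probabilities are hypothetical.

```python
def least_confident(predictions, k):
    """Pick the k items whose highest class probability is lowest.

    predictions maps an item id to a dict of {category: probability}.
    """
    def confidence(item_id):
        return max(predictions[item_id].values())

    return sorted(predictions, key=confidence)[:k]


# Hypothetical classifier output for four clippings.
predictions = {
    "clip-1": {"sport": 0.95, "politics": 0.05},
    "clip-2": {"sport": 0.55, "politics": 0.45},
    "clip-3": {"sport": 0.10, "politics": 0.90},
    "clip-4": {"sport": 0.48, "politics": 0.52},
}

# The two least certain clippings are queued for human review.
print(least_confident(predictions, 2))  # ['clip-4', 'clip-2']
```

Reviewing the least certain items first tends to concentrate expert effort on records that are most likely to be mislabelled or ambiguous.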
Figure 1: CADIAL search engine
Figure 2: KTN Indexing - learning classification view
Figure 3: KTN Indexing - category view
eCADIS - system for automatic document indexing
eCADIS can assign keywords to documents fully automatically
(choosing those that best summarise the content of the document) or can facilitate
the assignment process, allowing human indexers to index documents
more efficiently and more consistently. The assignment process
is based on a controlled vocabulary, a thesaurus. In its present
version, eCADIS uses the Eurovoc thesaurus*, although it can
be adapted to any other thesaurus. Keywords (also called descriptors)
can facilitate document retrieval and can enable cross-language
retrieval, as in the case of Eurovoc.
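
The way descriptors enable cross-language retrieval can be sketched as follows: documents are indexed by language-independent descriptor ids, and a query label in any language is first mapped to its id. The descriptor ids, labels and document ids below are hypothetical and are not taken from Eurovoc or from eCADIS.

```python
# Hypothetical descriptor ids and labels; real Eurovoc ids and labels differ.
LABELS = {
    "D1": {"en": "public health", "hr": "javno zdravstvo"},
    "D2": {"en": "road transport", "hr": "cestovni promet"},
}

# Documents indexed by descriptor id, independently of their own language.
INDEX = {
    "doc-hr-1": {"D1"},
    "doc-hr-2": {"D2"},
    "doc-en-1": {"D1", "D2"},
}


def descriptor_for(label, lang):
    """Find the descriptor id whose label in the given language matches."""
    for desc_id, labels in LABELS.items():
        if labels.get(lang, "").lower() == label.lower():
            return desc_id
    return None


def retrieve(label, lang):
    """Return ids of documents indexed with the descriptor named by label."""
    desc_id = descriptor_for(label, lang)
    if desc_id is None:
        return []
    return sorted(doc for doc, descs in INDEX.items() if desc_id in descs)


print(retrieve("public health", "en"))    # ['doc-en-1', 'doc-hr-1']
print(retrieve("javno zdravstvo", "hr"))  # ['doc-en-1', 'doc-hr-1']
```

Because both labels resolve to the same descriptor id, an English and a Croatian query return the same documents.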
The eCADIS workstation system works using two parallel windows:
the Document window (Figure 4) and the Eurovoc browser window
(Figure 5).
The Document window displays the document that is being indexed
together with the results of its computational linguistic
processing (alphabetical and frequency lists of types, lemmas,
literal descriptors appearing in the text, word digrams, word
trigrams, and word tetragrams). The Eurovoc browser window
allows the user to freely browse through the hierarchy of
index terms of the Eurovoc thesaurus and to select the descriptors
to be assigned to the document.
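
The frequency lists and word n-gram lists shown in the Document window can be approximated with a few lines of code. This is a minimal sketch that assumes simple whitespace tokenisation; the actual eCADIS processing (including lemmatisation) is not described here, and the `ngrams` helper is a hypothetical name.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "the law on public health and the law on public roads"
tokens = text.split()

type_freq = Counter(tokens)                # frequency list of types
digram_freq = Counter(ngrams(tokens, 2))   # word digrams
trigram_freq = Counter(ngrams(tokens, 3))  # word trigrams
tetragram_freq = Counter(ngrams(tokens, 4))  # word tetragrams

print(sorted(type_freq))  # alphabetical list of types
print(type_freq["law"])
print(digram_freq[("public", "health")])
print(tetragram_freq[("the", "law", "on", "public")])
```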
eCADIS automatic indexing was achieved by applying machine
learning techniques to a number of manually indexed Croatian
official documents.
eCADIS suggests the descriptors that best describe the
meaning of the text, even when they are not literally
present in the document itself. In its off-line version,
eCADIS assigns descriptors fully automatically.
System for morphological normalisation
The morphological normalisation module enables morphologically-aware
search and thus improves user experience and search performance.
This is particularly important for Croatian as a morphologically
complex language. The normalisation module uses a morphological
lexicon to conflate the various inflectional variants of a
word into a single representative form. A wide-coverage lexicon
has been acquired automatically from raw corpora based on
a hand-crafted morphology model.
The morphology model is based on a flexible representation
framework that can readily be applied to other languages.
This makes the development of morphological normalisation
modules for other languages easy and cost-effective.
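
Lexicon-based conflation of inflectional variants can be sketched as a lookup that maps each word form to one representative form before matching. The lexicon fragment below is a hand-written, hypothetical example for the Croatian noun "zakon" (law); the real lexicon is acquired automatically and is far larger, and the function names are assumptions.

```python
# Hypothetical lexicon fragment: inflected form -> representative form.
LEXICON = {
    "zakon": "zakon",
    "zakona": "zakon",
    "zakonu": "zakon",
    "zakonom": "zakon",
    "zakoni": "zakon",
}


def normalise(token):
    """Map a token to its representative form; unknown tokens pass through."""
    token = token.lower()
    return LEXICON.get(token, token)


def matches(query, text):
    """True if every normalised query term occurs in the normalised text."""
    doc_terms = {normalise(t) for t in text.split()}
    return all(normalise(t) in doc_terms for t in query.split())


print(matches("zakon", "Izmjene zakona"))     # True
print(matches("zakonom", "Izmjene zakona"))   # True
print(matches("pravilnik", "Izmjene zakona")) # False
```

Without normalisation, the query "zakon" would not match a text containing only the genitive form "zakona".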
Figure 4: eCADIS document window
Figure 5: eCADIS EUROVOC window
Figure 6: Morphological normalisation module effect - integration in eCADIS system
* Eurovoc (http://europa.eu/eurovoc/)
is a multilingual thesaurus (with 1:1 translations of each
descriptor in some 30 languages) used in various government
institutions across the EU. It contains more than 6,000 descriptors
organised into a hierarchy of 21 general fields (politics,
law, economics, etc.) six levels deep. It is the basis for cross-language
retrieval of the official documentation of the EU.