:. Connect to CROATIA - let's do IT together .:

FER - Fakultet elektrotehnike
i računarstva

Unska 3
HR-10000 Zagreb, Croatia

Tel: +385 (0)1 6129 935
Fax: +385 (0)1 6129 653
www.fer.hr
bojana.dalbelo@fer.hr

CADIAL

KTN Indexing System

eCADIS

System for morphological normalisation

CADIAL - Intelligent search engine based on EUROVOC thesaurus

The CADIAL search engine is an intelligent search engine designed for structured document retrieval, with support for morphological normalisation of the Croatian language. It can be used on documents in any language, and morphological procedures for other languages can easily be added. The search engine also supports categorised documents, so search procedures can use categories and any other meta-information to further improve search performance. Advanced custom filtering is supported as well, which enables the construction of custom faceted search procedures. The search engine is designed for medium to large scale applications and has been tested on more than half a million structured documents, where it performed quickly and efficiently. The interface for the CADIAL search engine is web-based and is applied to Croatian legislative documents, although the search engine can be integrated into any other type of system (a web service or a desktop application) and can be deployed on any collection of documents.

The CADIAL search engine (cadial.hidra.hr) enables cross-language document retrieval based on the Eurovoc thesaurus. The search engine is one of the results of the CADIAL project (www.cadial.org), carried out in co-operation with the Department of Computer Science, Katholieke Universiteit Leuven, HIDRA and the Department of Linguistics, Faculty of Humanities and Social Sciences.
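The idea behind thesaurus-based cross-language retrieval can be illustrated with a minimal sketch. It assumes that documents are indexed by language-independent descriptor IDs and that each descriptor has one label per language (Eurovoc provides 1:1 translations of its descriptors). The descriptor IDs, labels, document names and function names below are hypothetical and are not taken from Eurovoc or from the CADIAL implementation.

```python
# Hypothetical thesaurus fragment: descriptor ID -> one label per language.
# The IDs and labels are illustrative only, not actual Eurovoc entries.
THESAURUS = {
    "D1": {"en": "environmental protection", "hr": "zaštita okoliša"},
    "D2": {"en": "public health", "hr": "javno zdravstvo"},
}

# Hypothetical index: documents are stored against descriptor IDs,
# not against words of any particular language.
DOCUMENT_INDEX = {
    "doc-hr-1": {"D1"},
    "doc-hr-2": {"D1", "D2"},
    "doc-hr-3": {"D2"},
}


def descriptor_for(label, language):
    """Return the descriptor ID whose label in `language` matches, or None."""
    wanted = label.casefold()
    for descriptor_id, labels in THESAURUS.items():
        if labels.get(language, "").casefold() == wanted:
            return descriptor_id
    return None


def search(label, language):
    """Return the sorted names of documents indexed with the named descriptor."""
    descriptor_id = descriptor_for(label, language)
    if descriptor_id is None:
        return []
    return sorted(
        doc for doc, ids in DOCUMENT_INDEX.items() if descriptor_id in ids
    )


# An English query and a Croatian query resolve to the same descriptor,
# so both retrieve the same documents.
english_hits = search("environmental protection", "en")
croatian_hits = search("Zaštita okoliša", "hr")
```

Because the lookup goes through the descriptor ID, the language of the query and the language of the documents do not need to match.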


KTN Indexing System - automatic document classification and management

The KTN Indexing System is a software package for automatic text classification and management. It is primarily aimed at processing media clippings from newspaper and magazine scans, but it can be used on any textual data. The system uses state-of-the-art machine learning methods to train automatic classifiers, together with language-specific resources, in order to maximise classification quality. It also enables semi-automatic classification assisted by human experts. Legacy databases are cleaned using active learning techniques.
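As an illustration of how active learning can direct expert attention, here is a minimal sketch of uncertainty sampling, one common active learning strategy. The brochure does not state which strategy the KTN Indexing System uses, so this is an assumption; the clipping names, the probabilities and the function name are hypothetical.

```python
def most_uncertain(probabilities, batch_size):
    """Return the IDs whose predicted probability is closest to 0.5.

    `probabilities` maps a document ID to the classifier's estimated
    probability that the document belongs to a given category.
    """
    ranked = sorted(
        probabilities,
        key=lambda doc_id: abs(probabilities[doc_id] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs for four clippings.
predicted = {"clip-1": 0.97, "clip-2": 0.52, "clip-3": 0.08, "clip-4": 0.41}

# The two clippings the classifier is least sure about go to a human expert.
to_review = most_uncertain(predicted, batch_size=2)
```

Documents the classifier is confident about are left alone, while the borderline ones are queued for manual review, and the reviewed labels can then be used to retrain the classifier.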



Figure 1
CADIAL search engine

Figure 2
KTN Indexing - learning classification view

Figure 3
KTN Indexing - category view



eCADIS - system for automatic document indexing

eCADIS can assign keywords that best summarise a document's content fully automatically, or it can support the assignment process so that human indexers work more efficiently and more consistently. The assignment process is based on a controlled vocabulary, a thesaurus. In its present version, eCADIS uses the Eurovoc thesaurus*, although it can be adapted to any other thesaurus. Keywords (also called descriptors) can facilitate document retrieval and can enable cross-language retrieval, as in the case of Eurovoc.
The eCADIS workstation system works using two parallel windows: the Document window (Figure 4) and the Eurovoc browser window (Figure 5).
The Document window displays the document that is being indexed together with the results of its computational linguistic processing (alphabetical and frequency lists of types, lemmas, literal descriptors appearing in the text, word digrams, word trigrams, and word tetragrams). The Eurovoc browser window allows the user to freely browse through the hierarchy of index terms of the Eurovoc thesaurus and to select the descriptors to be assigned to the document.
eCADIS automatic indexing was achieved by applying machine learning techniques to a number of manually indexed Croatian official documents.
eCADIS suggests the descriptors that best describe the meaning of the text, even when they do not appear literally in the document. In its off-line version, eCADIS assigns descriptors fully automatically.
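The word digram and trigram frequency lists shown in the Document window can be approximated with a short sketch. It assumes simple lowercasing and regular-expression tokenisation; the actual eCADIS processing pipeline is not described here, and the function names and sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_list(items):
    """Return (item, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(items)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample text.
tokens = tokenize("The law on trade and the law on customs")
digrams = frequency_list(word_ngrams(tokens, 2))
trigrams = frequency_list(word_ngrams(tokens, 3))
```

In this sample, "law on" and "the law" are the most frequent digrams and "the law on" is the most frequent trigram, which is the kind of recurring phrase an indexer would compare against thesaurus descriptors.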



System for morphological normalisation

The morphological normalisation module enables morphologically aware search and thus improves user experience and search performance. This is particularly important for Croatian, a morphologically complex language. The normalisation module uses a morphological lexicon to conflate the various inflectional variants of a word into a single representative form. A wide-coverage lexicon has been acquired automatically from raw corpora, based on a hand-crafted morphology model.
The morphology model is based on a flexible representation framework that can readily be applied to other languages. This makes the development of morphological normalisation modules for other languages easy and cost-effective.
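Lexicon-based conflation can be sketched as a simple lookup. The tiny lexicon below is a hypothetical hand-written fragment covering a few inflected forms of the Croatian noun "zakon" (law); the real lexicon is acquired automatically and is far larger, and the function names are assumptions for illustration.

```python
# Hypothetical lexicon fragment: inflected form -> representative form.
LEXICON = {
    "zakon": "zakon",
    "zakona": "zakon",
    "zakonu": "zakon",
    "zakonom": "zakon",
    "zakoni": "zakon",
}


def normalise(word):
    """Map a word to its representative form; unknown words pass through."""
    key = word.lower()
    return LEXICON.get(key, key)


def normalise_text(text):
    """Normalise every whitespace-separated word in a query or document."""
    return [normalise(word) for word in text.split()]


# A query and a document using different inflections reduce to the same
# form, so they can match at search time.
query_terms = normalise_text("zakonom")
document_terms = normalise_text("izmjene zakona")
```

Applying the same normalisation to both the query and the indexed text is what lets a search for one inflected form find documents containing another.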



Figure 4
eCADIS document window

Figure 5
eCADIS EUROVOC window

Figure 6
Effect of the morphological normalisation module - integration in the eCADIS system

* Eurovoc (http://europa.eu/eurovooc/) is a multilingual thesaurus (with 1:1 translations of each descriptor in some 30 languages) used in various government institutions across the EU. It contains more than 6,000 descriptors organised into a hierarchy, six levels deep, under 21 general fields (such as politics, law and economics). It is the basis for cross-language retrieval of the official documentation of the EU.