Wiem Lahbib

PhD Student
Higher School of Computer Sciences (ENSI)
La Manouba
Research laboratory: 
Research Profile: 
Research Subject Title: 
Arabic terminology extraction and enrichment for information retrieval
Supervisor(s)

Ph.D. Thesis description

Brief description

Terminologies are important resources for extracting and representing knowledge that is useful in Information Retrieval (IR), where they help improve the relevance of search results. We aim to study terminology extraction from multiple sources (textual vs. structured corpora, corpora vs. dictionaries, bilingual vs. monolingual corpora) and to use the extracted terms for IR based on controlled indexing. We then structure the terminologies along different knowledge organization axes. Finally, to study the impact of terminology on Arabic IR, we exploit the terminologies in query expansion.

Detailed description

Arabic terminology extraction requires several types of analysis, notably morphological and syntactic analysis. The morphological level involves the study of word forms and their characteristics, such as the grammatical category (part of speech, POS). However, Arabic morphology gives rise to many ambiguities, which make automatic analysis of Arabic considerably more difficult at both the morphological and syntactic levels.

Syntactic parsing, for our purposes, consists of noun phrase extraction. This process is also affected by ambiguities, especially because of the free word order of Arabic, which complicates the extraction of complex phrases. It is therefore crucial to define approaches for syntactic disambiguation. Our first challenge will thus be the construction of a syntactic analyzer and disambiguator that can resolve these ambiguities automatically.
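
To make the idea of noun phrase extraction concrete, the following is a minimal sketch, not the analyzer planned in the thesis. It assumes the text has already been POS-tagged and disambiguated, uses a hypothetical simplified tag set (NOUN, ADJ, VERB, PREP), and relies on a deliberately naive pattern: a noun followed by any number of nouns or adjectives. The function name `extract_noun_phrases` and the sample sentence are illustrative assumptions.

```python
def extract_noun_phrases(tagged_tokens, min_len=2):
    """Return maximal token runs that start with a NOUN and continue
    with NOUN or ADJ tags (a simplified, assumed pattern)."""
    phrases, current = [], []
    for token, tag in tagged_tokens:
        # An adjective only extends a phrase that a noun has already opened.
        if tag == "NOUN" or (tag == "ADJ" and current):
            current.append(token)
        else:
            if len(current) >= min_len:
                phrases.append(" ".join(current))
            current = []
    # Flush the last run if the sentence ends inside a phrase.
    if len(current) >= min_len:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical, already disambiguated input: (token, POS tag) pairs.
tagged = [
    ("يعتمد", "VERB"),
    ("نظام", "NOUN"),
    ("استرجاع", "NOUN"),
    ("المعلومات", "NOUN"),
    ("على", "PREP"),
    ("الفهرسة", "NOUN"),
    ("الآلية", "ADJ"),
]
phrases = extract_noun_phrases(tagged)
```

With this input the sketch yields two multiword candidates, one made of three nouns and one made of a noun and an adjective. A real analyzer would have to cope with ambiguous tags and free word order, which this pattern ignores.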

We note in particular that terminology extraction is not limited to monolingual text corpora; it also draws on dictionaries, structured corpora and bilingual corpora to improve coverage and precision. We therefore focus on bilingual terminology extraction, which allows people who are not native Arabic speakers to retrieve documents in Arabic. This focus is motivated by the scarcity of work in this field.

Finally, domain terminologies will be used as thematic indexes that allow users to navigate the information space, formulate their queries and expand them to enhance precision and recall. This is done through terminology organization, which makes it possible to (i) build an adjacency matrix measuring the similarity of domain terms; and (ii) define semantic relations between pairs of terms by studying the structure of noun phrases.
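
As an illustration of how a term similarity matrix could support query expansion, here is a minimal sketch under stated assumptions. The thesis description does not specify the similarity measure; cosine similarity over binary document-occurrence vectors is assumed here purely for illustration. The function names (`similarity_matrix`, `expand_query`), the threshold value and the toy documents are hypothetical.

```python
import math


def similarity_matrix(terms, documents):
    """Build a term-by-term matrix (dict of dicts) of cosine similarities.
    Each term is represented by a binary vector: 1 if it occurs in a document."""
    vectors = {t: [1 if t in doc else 0 for doc in documents] for t in terms}

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    return {a: {b: cosine(vectors[a], vectors[b]) for b in terms} for a in terms}


def expand_query(query_terms, matrix, threshold=0.5):
    """Add every term whose similarity to a query term reaches the threshold."""
    expanded = list(query_terms)
    for q in query_terms:
        for term, score in matrix.get(q, {}).items():
            if term not in expanded and score >= threshold:
                expanded.append(term)
    return expanded


# Toy data (assumed): each document is the set of domain terms it contains.
terms = ["فهرسة", "استرجاع", "استعلام", "ترجمة"]
documents = [
    {terms[0], terms[1]},
    {terms[0], terms[1], terms[2]},
    {terms[3]},
]
matrix = similarity_matrix(terms, documents)
expanded = expand_query([terms[0]], matrix, threshold=0.5)
```

In this toy example the query term is expanded with the two terms that co-occur with it, while the unrelated term is left out. Raising the threshold favors precision, and lowering it favors recall.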