The goal of this project is to enhance an existing hybrid indexing tool to improve both its efficiency (speed, resource usage) and its effectiveness (recall, precision, …). The JARIR group has been developing an Arabic Information Retrieval System (IRS) based on a hybrid index (Ben Guirat et al. 2016). The proposed approach builds a multilevel index whose hierarchical structure represents the semantic relations between the different word forms (root, verbal pattern, and stem). Starting from the existing tool, this project aims to:

  1. Develop a high-performance hybrid indexing tool that enhances the capacity of the IRS, using a tool developed by our group as the starting point.
  2. Integrate the proposed tool into the Terrier[1] IR platform using the BM25 model.
  3. Run experiments on a large-scale corpus (Arabic newswire LDC test collection).
  4. Use the MADAMIRA[2] and Alkhalil[3] tools to add the lemma unit to the hybrid index.

  5. Interpret the results based on performance measures and significance tests using TANAGRA[4].
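
To make the multilevel structure more concrete, here is a minimal Python sketch of a three-level index in which each root groups its patterns and each pattern groups its stems. It is only an illustration under stated assumptions: the class name `HybridIndex`, its methods, the nested-dictionary layout, and the transliterated sample entries are all hypothetical and are not taken from the group's existing tool or from Ben Guirat et al. (2016). In a real pipeline the (root, pattern, stem) analysis of each token would presumably come from a morphological analyzer rather than being typed by hand.

```python
from collections import defaultdict


class HybridIndex:
    """Toy index: root -> pattern -> stem -> {doc_id: term frequency}.

    Hypothetical layout for illustration only, not the group's actual tool.
    """

    def __init__(self):
        # Three nested dictionary levels, ending in a postings dict per stem.
        self._tree = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def add(self, doc_id, root, pattern, stem):
        """Record one occurrence of an already-analyzed token in doc_id."""
        postings = self._tree[root][pattern][stem]
        postings[doc_id] = postings.get(doc_id, 0) + 1

    def lookup(self, root, pattern=None, stem=None):
        """Merge postings below the most specific level that is given.

        Only a root: all forms derived from it. Root and pattern: all stems
        under that pattern. Root, pattern and stem: that single stem.
        """
        merged = defaultdict(int)
        # .get() avoids creating an empty entry for an unknown root.
        for pat, stems in self._tree.get(root, {}).items():
            if pattern is not None and pat != pattern:
                continue
            for st, postings in stems.items():
                if stem is not None and st != stem:
                    continue
                for doc_id, tf in postings.items():
                    merged[doc_id] += tf
        return dict(merged)


# Hand-written, transliterated sample analyses (assumed, for illustration).
index = HybridIndex()
index.add("d1", "ktb", "fa3ala", "katab")
index.add("d1", "ktb", "fa3ala", "katab")
index.add("d2", "ktb", "fi3aal", "kitaab")
index.add("d3", "drs", "fa3ala", "daras")

print(index.lookup("ktb"))                    # {'d1': 2, 'd2': 1}
print(index.lookup("ktb", pattern="fi3aal"))  # {'d2': 1}
```

Querying at the root level returns every document containing any form derived from that root, while adding a pattern or stem narrows the match. That trade-off between broader and narrower matching is what the hierarchy is meant to expose. A lemma level, as in goal 4, could be added as one more nesting step in the same way.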

[1] http://terrier.org/

[2] https://camel.abudhabi.nyu.edu/madamira/

[3] https://sourceforge.net/projects/alkhalil/

[4] http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html

