Kunuz Corpus (Alpha version)



This resource is an XMLized version of Sahih Albukhari, the most authentic hadith book. It has the following characteristics:

  • At the macro-logical level, the collection is structured hierarchically into sections (see Structure.xml). From this structure, we extracted a list of domains (Domains.txt) and an index linking domains to hadiths (Domain-Index.txt).
  • At the micro-logical level, the hadiths are fully tagged: narrator names, Quran verses and the real content of each hadith (metn) are the main elements recognized (see "Micro-Documents" folder). This resource may be used for Named Entity Recognition (NER) applications and, more generally, for information extraction from Arabic documents. The main tags are summarized as follows:



<Metn> or <M>

The real content of the hadith (المتن)


Chain of narrators (السند)




Title of a section


Quranic verse






The name of the author (i.e. Al-Bukhari)


Indicates the hadith contains a speech of the prophet PBUH.


The name of the prophet PBUH.


The names of the prophets.


The names of angels.


Names of Islamic doctrines


Names of religions and holy books


Names of places


Names of women


Comments of the author (i.e. Al-Bukhari)


Comments (e.g. definitions)


First narrator and metn




Part of a poem


Names of Quranic surahs or parts


Credibility comment


Part of a Quranic verse

The attribute "l" of each section tag indicates its level, e.g. <S l="1"> stands for a section of level one and <S l="2"> represents a sub-section.
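As an illustration of how the tagged micro-documents might be read, here is a minimal Python sketch using only the standard library. It relies on the two tags named above (<M>/<Metn> and <S l="...">); the sample string, the nesting of elements and the helper names section_levels and metn_texts are assumptions made for the example, not the actual layout of the files in the "Micro-Documents" folder.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Only the tag names <S>, <M>, <Metn> and the
# "l" attribute come from the corpus description; the nesting is assumed.
SAMPLE = """
<S l="1">
  <S l="2">
    <M>first metn text</M>
    <M>second metn text</M>
  </S>
  <S l="2">
    <Metn>third metn text</Metn>
  </S>
</S>
"""


def section_levels(root):
    """Return the level (attribute "l") of every <S> element, in document order."""
    return [int(section.get("l")) for section in root.iter("S")]


def metn_texts(root):
    """Return the text of every <M> or <Metn> element, in document order."""
    return [
        "".join(element.itertext()).strip()
        for element in root.iter()
        if element.tag in ("M", "Metn")
    ]


root = ET.fromstring(SAMPLE.strip())
print(section_levels(root))
print(metn_texts(root))
```

With a real file, ET.parse(path).getroot() would replace ET.fromstring, provided the file is well-formed XML with a single root element.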

  • We also provide full documents containing hadiths and their explanations in the TREC XML format (Full-Documents-Trec folder).
  • The collection is designed for assessing Information Retrieval (IR) systems. We collected a set of standard topics (Queries.xml) and relevance judgments (see Qrels.txt), following the standard topic development and sampling procedures of IR evaluation campaigns.
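To show how relevance judgments of this kind are typically used, the following Python sketch computes precision at k for one topic. It assumes that Qrels.txt follows the common four-column TREC qrels layout (topic, iteration, document identifier, relevance), which the description above does not confirm; the document identifiers, the sample judgments and the function names parse_qrels and precision_at_k are hypothetical.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each topic to the set of documents judged relevant.

    Assumes the four-column TREC qrels layout:
    topic, iteration, document id, relevance.
    """
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        if int(relevance) > 0:
            relevant[topic].add(doc_id)
    return relevant


def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical judgments and system output, for illustration only.
QRELS = [
    "1 0 H0001 1",
    "1 0 H0002 0",
    "1 0 H0003 1",
    "2 0 H0002 1",
]
qrels = parse_qrels(QRELS)
run = ["H0003", "H0002", "H0001", "H0009"]

print(precision_at_k(run, qrels["1"], 2))  # 0.5: one of the top two is relevant
```

To read the real file, the list QRELS would be replaced by the lines of Qrels.txt, after checking that its columns match the layout assumed here.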

The resource is freely available to the research community (Click here to download). It is distributed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. When using it, you are encouraged to cite:

Ben Khiroun, O., Ayed, R., Elayeb, B., Bounhas, I., Ben Saoud, N., and Evrard, F. Towards a New Standard Arabic Test Collection for Mono- and Cross-Language Information Retrieval, in Natural Language Processing and Information Systems - Proceedings of the 19th International Conference on Applications of Natural Language to Information Systems, NLDB'2014, Montpellier, France, June 18-20, 2014, pp. 168–171.

The project aims to build a multilingual standard IR collection. Queries and Qrels (relevance judgments) will be published later.


Access conditions: