|Title||A discriminative method for global query expansion and term reweighting using co-occurrence graphs|
|Publication Type||Journal Article|
|Year of Publication||In Press|
|Authors||Aklouche, B, Bounhas, I, Slimani, Y|
|Journal||Journal of Information Science|
|Keywords||BM25, Co-occurrence graph, information retrieval, Query Expansion, Semantic similarity, Term reweighting|
This paper presents a new Query Expansion (QE) method that aims to tackle term mismatch in Information Retrieval (IR). Previous research has shown that selecting good expansion terms which do not hurt retrieval effectiveness remains an open and challenging research question. Our method investigates how global statistics of term co-occurrence can be used effectively to enhance expansion term selection and reweighting. Specifically, we build a co-occurrence graph using a context window approach over the entire collection, thus adopting a global QE approach. Then, we employ a semantic similarity measure inspired by the Okapi BM25 model, which allows us to evaluate the discriminative power of words and to select relevant expansion terms based on their similarity to the query as a whole. The proposed method includes a reweighting step in which selected terms are assigned weights according to their relevance to the query. Moreover, our method does not require matrix factorization or complex text mining processes. It only requires simple co-occurrence statistics about terms, which reduces complexity and ensures scalability. Finally, it has two free parameters that may be tuned to adapt the model to the context of a given collection and to control co-occurrence normalization. Extensive experiments on four standard datasets in English (TREC Robust04 and Washington Post) and French (CLEF 2000 and 2003) show that our method improves both retrieval effectiveness and robustness in terms of various evaluation metrics and significantly outperforms competitive state-of-the-art baselines. We also investigate the impact of varying the number of expansion terms on retrieval results.
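The pipeline described in the abstract (window-based co-occurrence graph, BM25-inspired similarity, reweighting of the selected terms) can be illustrated with a minimal sketch. The abstract does not give the paper's actual formula, so the scoring below is an assumption: it transplants the standard Okapi BM25 saturation and length normalization onto co-occurrence counts. The function names, the window size, and the parameters `k1` and `b` (borrowed from BM25 as stand-ins for the two free parameters) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_cooccurrence_graph(docs, window=2):
    """Count how often two distinct terms occur within `window` positions."""
    graph = defaultdict(Counter)
    for tokens in docs:
        for i, term in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other != term:
                    graph[term][other] += 1
                    graph[other][term] += 1
    return graph


def bm25_like_similarity(graph, query_term, candidate, k1=1.2, b=0.75):
    """BM25-style score of `candidate` for one query term (assumed form)."""
    co = graph[query_term][candidate]  # co-occurrence count, 0 if absent
    if co == 0:
        return 0.0
    vocab_size = len(graph)
    # Terms that co-occur with few others are treated as more discriminative,
    # by analogy with BM25's inverse document frequency.
    degree = len(graph[candidate])
    idf = math.log(1 + (vocab_size - degree + 0.5) / (degree + 0.5))
    # Total co-occurrence mass plays the role of document length.
    strength = sum(graph[candidate].values())
    avg_strength = sum(sum(c.values()) for c in graph.values()) / vocab_size
    norm = k1 * (1 - b + b * strength / avg_strength)
    return idf * co * (k1 + 1) / (co + norm)


def expand_query(graph, query_terms, top_n=3, **params):
    """Score candidates against the whole query and return weighted terms."""
    scores = Counter()
    for q in query_terms:
        for cand in graph.get(q, {}):
            if cand not in query_terms:
                scores[cand] += bm25_like_similarity(graph, q, cand, **params)
    top = scores.most_common(top_n)
    if not top:
        return []
    best = top[0][1]
    # Reweighting step: scale scores so the best expansion term has weight 1.
    return [(term, score / best) for term, score in top]


docs = [
    ["query", "expansion", "term"],
    ["query", "expansion", "graph"],
    ["term", "mismatch", "retrieval"],
]
graph = build_cooccurrence_graph(docs, window=2)
expansion = expand_query(graph, ["query"], top_n=2)
```

In this toy collection, summing the per-term scores over all query terms is what makes the selection depend on the query as a whole rather than on each query term in isolation. How the paper combines and normalizes these scores may differ.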