Research Paper Recommendation by Exploring Co-citation Occurrences in Sections of Scientific Papers


Citation indexes and digital libraries index millions of research papers and make them available to the scientific community; however, finding the intended information in these huge repositories remains a challenge. Every day, the number of research papers in online digital libraries grows because of the many conferences, workshops, and journals organized throughout the world. In 2017, for example, PubMed, a digital library in the medical domain, contained 28 million research documents. Manually searching for relevant research papers in such a large collection is very difficult. This area has therefore attracted the attention of researchers worldwide, who propose and implement innovative techniques for recommending relevant papers.

The identification of relevant research papers has become an important research area. The research community has proposed more than 90 different approaches in the past 15 years. These approaches use different data sources, such as metadata, content, profile-based data, and the citations of research papers. Each technique has certain strengths and limitations, which are critically reviewed in this document.

One of the important approaches in this area is co-citation analysis, which considers two documents relevant to each other if they are co-cited in other scientific documents. The original approach used only the reference lists of scientific documents to make such observations. In recent years, however, the content of documents has also been exploited along with the reference list to enhance accuracy. These approaches include Citation Proximity Analysis (CPA), Citation Order Analysis (COA), and techniques based on byte offsets in the content of scientific papers. They consider co-citations at different levels of proximity and give more weight to documents that are co-cited closely together. However, documents closely co-cited in the “Methodology/Results” sections may be more relevant than those closely co-cited in the “Introduction/Discussion” sections. This thesis exploits the structural organization of scientific documents by assigning weights according to the importance of different generic sections, and investigates whether this approach increases the accuracy of identifying relevant papers.
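The idea of weighting co-citations by section can be illustrated with a minimal Python sketch. The weight values, the function name section_weighted_cocitation, and the input layout are assumptions made for illustration only; the abstract does not state the actual weights or scoring formula used in the thesis.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical section weights, assumed for illustration only;
# the abstract does not give the values used in the thesis.
SECTION_WEIGHTS = {
    "Introduction": 1.0,
    "Related Work": 1.5,
    "Methodology": 3.0,
    "Results": 3.0,
    "Discussion": 1.0,
}


def section_weighted_cocitation(citing_docs):
    """Score pairs of cited papers by the sections in which they are co-cited.

    citing_docs: list of dicts, one per citing document, mapping a generic
    section name to the list of paper ids cited in that section.
    Returns a dict mapping (paper_a, paper_b) tuples to accumulated weights.
    """
    scores = defaultdict(float)
    for doc in citing_docs:
        for section, cited in doc.items():
            weight = SECTION_WEIGHTS.get(section, 1.0)
            # Each pair counts once per section of each citing document.
            for a, b in combinations(sorted(set(cited)), 2):
                scores[(a, b)] += weight
    return dict(scores)


docs = [
    {"Introduction": ["A", "B"], "Methodology": ["A", "C"]},
    {"Methodology": ["A", "B", "C"]},
]
scores = section_weighted_cocitation(docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, the pair co-cited twice in a "Methodology" section ends up ranked above the pair co-cited once there and once in an "Introduction", which is the intuition the paragraph above describes.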

This work addresses the following research challenges, which constitute the contributions of the thesis: (1) identification of generic sections in the citing document; (2) identification of in-text citation patterns and frequencies in the citing document; and (3) design of an algorithm that uses evidence from these sources (section names, section weights, and co-citation frequencies) to identify and recommend relevant papers.
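The first two components can be pictured with a small sketch. The abstract does not describe how the thesis identifies sections or in-text citations, so the keyword map, the regular expression, and the function names generic_section and citation_frequencies below are all hypothetical; the sketch only handles numeric citation markers such as [3] or [1, 3-5].

```python
import re
from collections import Counter

# Hypothetical keyword map from heading text to generic section names.
# This is an assumed heuristic, not the method used in the thesis.
GENERIC_SECTIONS = [
    ("Introduction", ("introduction", "background", "motivation")),
    ("Related Work", ("related work", "literature review", "state of the art")),
    ("Methodology", ("method", "approach", "architecture", "design")),
    ("Results", ("result", "evaluation", "experiment")),
    ("Discussion", ("discussion", "conclusion", "future work")),
]


def generic_section(heading):
    """Map a raw section heading to a generic section name."""
    text = heading.lower()
    for name, keywords in GENERIC_SECTIONS:
        if any(keyword in text for keyword in keywords):
            return name
    return "Other"


# Matches numeric markers such as [3], [1, 2] or [1, 3-5].
NUMERIC_CITATION = re.compile(r"\[(\d+(?:\s*[,\-]\s*\d+)*)\]")


def citation_frequencies(text):
    """Count how often each numbered reference is cited in a text span."""
    counts = Counter()
    for match in NUMERIC_CITATION.finditer(text):
        for part in match.group(1).split(","):
            bounds = [int(piece) for piece in part.split("-")]
            # A range such as 3-5 expands to references 3, 4 and 5.
            for number in range(bounds[0], bounds[-1] + 1):
                counts[number] += 1
    return counts


freqs = citation_frequencies("As shown in [1, 3-5], and confirmed later [3].")
```

Real papers also use author-year and superscript citation styles, which a sketch like this does not cover.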

For each contribution, this thesis discusses the detailed architecture, dataset, and evaluation. First, the generic section identification component was designed, implemented, and evaluated against state-of-the-art approaches. The proposed approach was evaluated on two datasets consisting of 150 and 300 citing documents, respectively. Its aggregated F-score over both datasets was 92%, compared with 81% for the state-of-the-art technique. Second, the component for identifying in-text citation patterns and frequencies was implemented and evaluated. For this evaluation, two datasets were prepared from openly available digital libraries: the Journal of Universal Computer Science (J.UCS) and CiteSeerX. The proposed model outperformed the state-of-the-art approach, increasing the F-score from 0.58 to 0.97. The third contribution is section-wise co-citation analysis, which depends on the two earlier components and was designed to rank co-cited documents. For the evaluation, two benchmark rankings, based on JSD and cosine similarity, were selected for comparing the proposed and state-of-the-art approaches. The rankings produced by each approach were compared with the benchmarks using Spearman’s rank correlation and Kendall’s tau. The results show that the proposed approach outperformed the state-of-the-art techniques, namely standard co-citation and CPA based on byte offsets.
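To make the ranking comparison concrete, the following sketch computes Spearman's rank correlation and Kendall's tau between a benchmark ranking and a ranking produced by some approach. The score lists are invented toy values, not results from the thesis, and the simple formulas used here assume there are no tied scores.

```python
from itertools import combinations


def ranks(scores):
    """Return rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation for two score lists without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy similarity scores for four co-cited papers (invented values).
benchmark = [0.9, 0.7, 0.5, 0.3]   # e.g. a cosine-similarity benchmark
candidate = [0.8, 0.6, 0.7, 0.1]   # scores from the approach under test

rho = spearman_rho(benchmark, candidate)
tau = kendall_tau(benchmark, candidate)
```

A value closer to 1 for either measure means the approach orders the co-cited papers more like the benchmark does.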
