The Open University


CitationContextbasedDataset.tar.gz (6.33 GB)

Citation-Context Dataset (C2D)

Version 2 2018-08-15, 13:14
Version 1 2018-08-03, 12:19
Posted on 2018-08-15, 13:14; authored by Anita Khadka

We have released the first version of a citation-context-based dataset called C2D. It was created during an experiment for work that will be published as a short paper at RecSys 2018.

The C2D dataset was created from 2 million full-text open-access research publications obtained from CORE. It contains 53 million unique citation records. To construct C2D, we extracted citation information from each publication: the cited document's title, author(s), publication date, and citation-context. The assumptions behind citation-context extraction are described in more detail below:

First, we extracted the positions at which citations are mentioned, together with their citation-contexts, i.e. the text surrounding each citation. For our purposes, a citation-context consists of three sentences: the sentence in which the reference is cited, the preceding sentence, and the following sentence. When the citing sentence is at the start of a paragraph, the preceding sentence is not extracted; when it is at the end of a paragraph, the following sentence is not extracted.
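The three-sentence window described above can be sketched as follows. This is a minimal illustration, not the code used to build C2D: it assumes the paragraph has already been split into sentences, and the function name `citation_context` is hypothetical.

```python
from typing import List, Optional, Tuple


def citation_context(
    paragraph_sentences: List[str], citing_index: int
) -> Tuple[Optional[str], str, Optional[str]]:
    """Return (preceding, citing, following) sentences for one citation.

    Assumption: `paragraph_sentences` holds the sentences of a single
    paragraph, so the window never crosses a paragraph boundary.
    """
    # No preceding sentence when the citation opens the paragraph.
    before = paragraph_sentences[citing_index - 1] if citing_index > 0 else None
    # No following sentence when the citation closes the paragraph.
    has_next = citing_index + 1 < len(paragraph_sentences)
    after = paragraph_sentences[citing_index + 1] if has_next else None
    return before, paragraph_sentences[citing_index], after


paragraph = [
    "Recommender systems are widely used.",
    "Citation contexts improve recommendations [1].",
    "We build on this idea.",
]
middle = citation_context(paragraph, 1)
first = citation_context(paragraph, 0)
```

Here `middle` carries all three sentences, while `first` has `None` in the preceding slot because the citing sentence starts the paragraph.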

The dataset therefore has the following attributes:


  • ReferenceID - Unique identifier of the cited reference in a citing document.
  • SourceID - Unique identifier of the citing document.
  • ChapterNumber - Chapter number of the citing document where the reference ReferenceID is mentioned.
  • ParagraphNumber - Paragraph number of the citing document where the reference ReferenceID is mentioned.
  • SentenceNumber - Sentence number of the citing document where the reference ReferenceID is mentioned.
  • Title - Title of the reference ReferenceID.
  • PublishedDate - Date on which the reference ReferenceID was published.
  • Authors - Author(s) of the reference ReferenceID.
  • TextBeforeRefMention - Sentence just before the sentence in which the reference ReferenceID is cited.
  • TextWhereRefMention - Sentence in which the reference ReferenceID is cited.
  • TextAfterRefMention - Sentence just after the sentence in which the reference ReferenceID is cited.
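
For readers who want to hold these attributes in memory, the list above could be modelled roughly as below. This is only a sketch: the field types, the snake_case names, and the `context_text` helper are assumptions, and the on-disk format of the archive is not described here, so no loader is shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CitationRecord:
    """One C2D record; field types are assumed, not taken from the dataset."""
    reference_id: str                       # ReferenceID
    source_id: str                          # SourceID
    chapter_number: int                     # ChapterNumber
    paragraph_number: int                   # ParagraphNumber
    sentence_number: int                    # SentenceNumber
    title: str                              # Title
    published_date: str                     # PublishedDate (kept as raw text)
    authors: str                            # Authors (kept as raw text)
    text_before_ref_mention: Optional[str]  # TextBeforeRefMention
    text_where_ref_mention: str             # TextWhereRefMention
    text_after_ref_mention: Optional[str]   # TextAfterRefMention


def context_text(record: CitationRecord) -> str:
    """Join the available context sentences into one string."""
    parts = [
        record.text_before_ref_mention,
        record.text_where_ref_mention,
        record.text_after_ref_mention,
    ]
    # Skip sentences missing at paragraph boundaries.
    return " ".join(p for p in parts if p)


# Made-up values purely for illustration; not taken from the dataset.
example = CitationRecord(
    reference_id="ref-1", source_id="doc-1", chapter_number=1,
    paragraph_number=2, sentence_number=1, title="An Example Title",
    published_date="2010", authors="A. Author",
    text_before_ref_mention=None,
    text_where_ref_mention="Prior work studied this [1].",
    text_after_ref_mention="We extend it.",
)
joined = context_text(example)
```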
Please cite our paper if you use this dataset.


  • The uncompressed dataset is ~40 GB; the compressed size is ~6.7 GB.
  • Because requirements differ between users, we have released the raw version of the dataset. Please note that no data cleansing (such as special-character or stop-word removal) has been performed.
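
As one possible starting point for the cleansing mentioned above, the sketch below lowercases a context sentence, strips special characters, and drops stop words. The function name `clean_text` and the tiny stop-word set are illustrative assumptions; a real pipeline would likely use a fuller stop-word list and rules suited to its task.

```python
import re
from typing import FrozenSet, List

# Deliberately tiny, illustrative stop-word list (an assumption, not a standard).
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "the", "of", "in", "and", "is", "to"}
)


def clean_text(text: str, stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Lowercase, replace non-alphanumeric characters with spaces,
    then split on whitespace and drop stop words."""
    normalised = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [token for token in normalised.split() if token not in stop_words]


tokens = clean_text("The results of [12] are shown in Fig. 3!")
```

Note that this simple pattern also removes citation markers' brackets and any non-ASCII letters, which may or may not be desirable depending on the use case.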



    Faculty of Science, Technology, Engineering and Mathematics (STEM)

