Citation-Context Dataset (C2D)
The C2D dataset was created from 2 million full-text open-source research publications obtained from CORE. It contains 53 million unique citation records. To construct C2D, we extracted citation information from each publication: the cited document's title, author(s), publication date, and citation context. The assumptions behind the citation-context extraction are described in more detail below.
First, we extracted the position of each citation mention together with its citation context, that is, the text surrounding the citation. For our purposes, a citation context consists of three sentences: the sentence in which the reference is cited, the preceding sentence, and the following sentence. If the citing sentence is the first sentence of a paragraph, no preceding sentence is extracted; if it is the last sentence of a paragraph, no following sentence is extracted.
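The three-sentence window described above can be sketched as follows. This is a minimal illustration, not the extraction code used to build C2D: it assumes a paragraph has already been split into sentences, and the names `CitationContext` and `citation_context` are hypothetical.

```python
from typing import Optional, TypedDict


class CitationContext(TypedDict):
    before: Optional[str]   # None when the citing sentence opens the paragraph
    mention: str            # the sentence containing the citation
    after: Optional[str]    # None when the citing sentence closes the paragraph


def citation_context(paragraph: list[str], idx: int) -> CitationContext:
    """Return the citing sentence plus its neighbours within one paragraph.

    `paragraph` is a list of sentences (sentence splitting is assumed to have
    been done elsewhere); `idx` is the position of the citing sentence.
    """
    if not 0 <= idx < len(paragraph):
        raise IndexError("sentence index outside paragraph")
    # The window never crosses a paragraph boundary.
    before = paragraph[idx - 1] if idx > 0 else None
    after = paragraph[idx + 1] if idx + 1 < len(paragraph) else None
    return {"before": before, "mention": paragraph[idx], "after": after}
```

For a citation in the first sentence of a paragraph, `citation_context(paragraph, 0)["before"]` is `None`, which mirrors the rule that no preceding sentence is extracted at the start of a paragraph.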
The dataset therefore has the following attributes:
- ReferenceID - unique identifier of the cited reference in a citing document.
- SourceID - unique identifier of the citing document.
- ChapterNumber - chapter number in the citing document where the reference ReferenceID is mentioned.
- ParagraphNumber - paragraph number in the citing document where the reference ReferenceID is mentioned.
- SentenceNumber - sentence number in the citing document where the reference ReferenceID is mentioned.
- Title - title of the reference ReferenceID.
- PublishedDate - date on which the reference ReferenceID was published.
- Authors - author(s) of the reference ReferenceID.
- TextBeforeRefMention - sentence immediately before the sentence in which the reference ReferenceID is cited.
- TextWhereRefMention - sentence in which the reference ReferenceID is cited.
- TextAfterRefMention - sentence immediately after the sentence in which the reference ReferenceID is cited.
- The dataset is ~40 GB uncompressed and ~6.7 GB compressed.
- Because requirements differ between users, we have released the raw version of the dataset. Please note that no data cleansing (such as special-character or stop-word removal) has been performed.
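
Since the data is released raw, users will typically apply their own cleansing. The sketch below shows one possible way to hold a record in memory and strip special characters and stop-words from its citation context. It makes several assumptions: the on-disk file format is not described here, so the record is built by hand; the Python types of the fields are guesses; and `CitationRecord`, `full_context`, `clean`, and the tiny `STOP_WORDS` list are hypothetical names introduced for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny stop-word list; replace with a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


@dataclass
class CitationRecord:
    # Field names mirror the attribute list; the types are assumptions.
    ReferenceID: str
    SourceID: str
    ChapterNumber: int
    ParagraphNumber: int
    SentenceNumber: int
    Title: str
    PublishedDate: str
    Authors: str
    TextBeforeRefMention: Optional[str]  # absent at the start of a paragraph
    TextWhereRefMention: str
    TextAfterRefMention: Optional[str]   # absent at the end of a paragraph

    def full_context(self) -> str:
        """Join the available context sentences in reading order."""
        parts = [
            self.TextBeforeRefMention,
            self.TextWhereRefMention,
            self.TextAfterRefMention,
        ]
        return " ".join(p for p in parts if p)


def clean(text: str) -> list[str]:
    """Lowercase, keep only ASCII alphanumeric tokens, drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

A typical use would be `clean(record.full_context())`, which yields a token list suitable for downstream indexing or modelling. The regular expression drops all non-ASCII characters, which may be too aggressive for multilingual publications; adjust it to your needs.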