Dataset.zip (2.31 GB)

Career promotions, research publications, Open Access dataset

posted on 28.02.2022, 13:17 authored by Matteo Cancellieri, Nancy Pontika, David Pride, Petr Knoth, Hannah Metzler, Antonia Correia, Helene Brinken, Bikash Gyawali
This dataset compiles processed data on citations and references for research papers, including author, institution and open access information, for a selected sample of academics analysed using Microsoft Academic Graph (MAG) data and CORE. The data were collected between December 2019 and January 2020.

Six countries (Austria, Brazil, Germany, India, Portugal, United Kingdom and United States) were the focus of the six questions which make up this dataset. There is one csv file per country and per question (36 files in total).

More details about the creation of this dataset are available on the public ON-MERRIT D3.1 deliverable report.
The dataset combines two data sources: one part was created by analysing promotion policies across the target countries, and the other is a set of data points for understanding publishing behaviour. To facilitate analysis, the dataset is organised into the following seven folders:


PRT
The file "PRT_policies.csv" contains the information extracted from promotion, review and tenure (PRT) policies.


Q1: What % of papers coming from a university are Open Access?
- Dataset Name format: oa_status__countryname__papers.csv
- Dataset Contents: Open Access (OA) status of all papers of all the universities listed in Times Higher Education World University Rankings (THEWUR) for the given country. A paper is marked OA if at least one OA link is available. OA links are collected using the CORE Discovery API.
- Important considerations about this dataset:
- A paper with multiple authors is kept once for each distinct institution its authors belong to.
- The service we used to recognise whether a paper is OA, CORE Discovery, does not contain entries for all _paperids_ in MAG. As a result, some records in the dataset have neither a true nor a false value for the _is_OA_ field.
- Only records marked as true in the _is_OA_ field can be said to be OA. Records with false or no value in the _is_OA_ field have unknown status (i.e. they are not necessarily closed access).
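
The considerations above imply that _is_OA_ is effectively three-valued (true, false, empty), and that only true is informative. The following is a minimal sketch of how the Q1 percentage could be computed under that reading, using only the Python standard library. The `institution` column name, the lowercase `true`/`false` encoding and the sample rows are assumptions made for illustration; they are not taken from the dataset documentation.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for an oa_status__countryname__papers.csv file.
# `paperid` and `is_OA` are named in the description; `institution` is an assumed column.
SAMPLE = """paperid,institution,is_OA
1,univ_a,true
2,univ_a,false
3,univ_a,
4,univ_b,true
"""

def oa_share(rows):
    """Return {institution: (known_oa, total, share)}, counting only is_OA == true as OA."""
    oa, total = Counter(), Counter()
    for row in rows:
        inst = row["institution"]
        total[inst] += 1
        # false and empty both mean "unknown status", so only an explicit true counts.
        if row["is_OA"].strip().lower() == "true":
            oa[inst] += 1
    return {inst: (oa[inst], total[inst], oa[inst] / total[inst]) for inst in total}

shares = oa_share(csv.DictReader(io.StringIO(SAMPLE)))
for inst, (n_oa, n, share) in sorted(shares.items()):
    print(f"{inst}: {n_oa}/{n} papers known to be OA ({share:.0%})")
```

Because false and missing values are both of unknown status, the resulting share is best read as a lower bound on the true OA proportion.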

Q2: How are papers, published by the selected universities, distributed across the three scientific disciplines of our choice?

- Dataset Name format: fsid__countryname__papers.csv
- Dataset Contents: For the given country, all papers for all the universities listed in THEWUR with the information of _fieldofstudy_ they belong to.
- Important considerations about this dataset:
- MAG can associate a paper with multiple _fieldofstudyid_ values. If a paper belongs to more than one of our selected _fieldofstudyid_ values, a separate record was created for the paper for each of them.
- MAG assigns each _fieldofstudyid_ to a paper with a _score_. We keep only records whose _score_ is greater than 0.5.
- A paper with multiple authors is kept once for each distinct institution its authors belong to, so papers with authors from several universities are counted once towards each of the universities concerned.
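
One way to read the considerations above is sketched below: count distinct papers per _fieldofstudyid_, skipping any row whose _score_ does not exceed 0.5. The column names follow the field names used in the description, but the exact CSV header, the numeric ids and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for an fsid__countryname__papers.csv file.
# The ids and scores below are invented for the example.
SAMPLE = """paperid,fieldofstudyid,score
10,111,0.82
10,222,0.61
11,111,0.55
12,222,0.40
"""

THRESHOLD = 0.5  # the description keeps records whose score is more than 0.5

def papers_per_field(rows, threshold=THRESHOLD):
    """Count distinct papers per fieldofstudyid, ignoring low-score rows."""
    seen = set()
    counts = Counter()
    for row in rows:
        if float(row["score"]) <= threshold:
            continue  # defensive check; the published files should already be filtered
        key = (row["paperid"], row["fieldofstudyid"])
        if key not in seen:  # a paper repeated for several institutions counts once per field
            seen.add(key)
            counts[row["fieldofstudyid"]] += 1
    return counts

field_counts = papers_per_field(csv.DictReader(io.StringIO(SAMPLE)))
for field_id, n in sorted(field_counts.items()):
    print(field_id, n)
```

The ids can be translated to field names with the _fieldsofstudy_mag_.csv mapping file. Deduplicating on (paper, field) is a design choice for a country-level view; a per-university breakdown would keep the institution in the key instead.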

Q3: What is the gender distribution in authorship of papers published by the universities?

- Dataset Name format: author_gender__countryname__papers.csv
- Dataset Contents: All papers with their author names for all the universities listed in THEWUR.
- Important considerations about this dataset:
- When a paper has multiple collaborators (authors), only the records for collaborators from within the selected universities are kept.
- An external script was executed to determine the gender of the authors. The script is available here.

Q4: Distribution of staff seniority (= number of years from their first publication until the last publication) in the given university.

- Dataset Name format: author_ids__countryname__papers.csv
- Dataset Contents: For a given country, all papers for authors with their publication year for all the universities listed in THEWUR.
- Important considerations about this dataset:
- When a paper has multiple collaborators (authors), only the records for collaborators from within the selected universities are kept.
- Staff seniority can be calculated in various ways. The most straightforward option is to calculate it as _academic_age = MAX(year) - MIN(year)_ for each _authorid_.
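
The _academic_age_ formula above can be sketched in a few lines of standard-library Python. The `authorid` and `year` column names follow the description; the `paperid` column and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical rows standing in for an author_ids__countryname__papers.csv file.
SAMPLE = """authorid,paperid,year
a1,1,2005
a1,2,2012
a1,3,2019
a2,4,2018
"""

def academic_age(rows):
    """Return {authorid: MAX(year) - MIN(year)} over each author's papers."""
    first, last = {}, {}
    for row in rows:
        author, year = row["authorid"], int(row["year"])
        first[author] = min(year, first.get(author, year))
        last[author] = max(year, last.get(author, year))
    return {author: last[author] - first[author] for author in first}

ages = academic_age(csv.DictReader(io.StringIO(SAMPLE)))
for author, age in sorted(ages.items()):
    print(author, age)
```

Under this definition an author with a single paper, or with all papers in the same year, has an academic age of 0.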

Q5: Citation counts (incoming) for OA vs Non-OA papers published by the university.

- Dataset Name format: cc_oa__countryname__papers.csv
- Dataset Contents: OA status and OA links for all papers of all the universities listed in THEWUR and, for each of those papers, the count of incoming citations available in MAG.
- Important considerations about this dataset:
- CORE Discovery was used to establish the OA status of papers.
- A paper with multiple authors is kept once for each distinct institution its authors belong to.
- Only records marked as true in the _is_OA_ field can be said to be OA. Records with false or no value in the _is_OA_ field have unknown status (i.e. they are not necessarily closed access).
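
As a rough illustration of the Q5 comparison, the sketch below averages incoming citation counts for papers known to be OA against papers of unknown status. The `citationcount` column name, the lowercase `true`/`false` encoding and the sample rows are assumptions made for illustration; only _is_OA_ is named in the description.

```python
import csv
import io
from statistics import mean

# Hypothetical rows standing in for a cc_oa__countryname__papers.csv file.
SAMPLE = """paperid,is_OA,citationcount
1,true,10
2,true,4
3,false,3
4,,5
"""

def mean_citations_by_status(rows):
    """Return mean incoming citations for 'oa' and 'unknown' papers."""
    groups = {"oa": [], "unknown": []}
    for row in rows:
        # false and empty are grouped together because neither proves closed access.
        status = "oa" if row["is_OA"].strip().lower() == "true" else "unknown"
        groups[status].append(int(row["citationcount"]))
    return {status: mean(values) for status, values in groups.items() if values}

means = mean_citations_by_status(csv.DictReader(io.StringIO(SAMPLE)))
for status, value in sorted(means.items()):
    print(status, value)
```

Since a paper appears once per institution, a country-level comparison may first need deduplication on the paper identifier.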

Q6: Count of OA vs Non-OA references (outgoing) for all papers published by universities.

- Dataset Name format: rc_oa__countryname_-papers.csv
- Dataset Contents: Counts of all OA and unknown papers referenced by all papers published by all the universities listed in THEWUR.
- Important considerations about this dataset:
- CORE Discovery was used to establish the OA status of the papers being referenced.
- A paper with multiple authors is kept once for each distinct institution its authors belong to, so papers with authors from several universities are counted once towards each of the universities concerned.

Additional files:

- _fieldsofstudy_mag_.csv: this file contains a dump of the _fieldsofstudy_ table of MAG, mapping each id to its field of study name.


Funding

Observing and Negating Matthew Effects in Responsible Research and Innovation Transition

European Commission
