rcristin provides two functions to facilitate easy querying of the Cristin database through its REST API. This vignette shows you how to:
Use get_cristin_results() to get the registered results of a research institution, unit or individual researcher.
Use get_contributor_info() to extract further information about the institutional affiliations of the contributors to the results you extracted in the previous step.
With these two functions and some basic data handling you can answer many common questions about the reported results of an institution, a particular research project or an academic journal, to name a few of the options available.
First, we load rcristin. We also load the dplyr package from the tidyverse for easy data exploration:
library(rcristin)
library(dplyr)
get_cristin_results()
The most important function of rcristin is get_cristin_results(): it allows you to specify search terms that will (hopefully) return meaningful results from the database.
Basically, you pass the function one or several arguments that correspond to the search parameters of the API specification, for example institution or year_reported. For this example, we will fetch all results reported by the Centre for Sami Studies (SESAM) at UiT - The Arctic University of Norway since 2015.
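To make the correspondence between function arguments and API search parameters concrete, here is a minimal sketch of how named arguments might be turned into a query string. The helper build_query_url() is hypothetical and not part of rcristin, and the base URL is an assumption about the Cristin API endpoint. Nothing is sent over the network.

```r
# Hypothetical helper (not part of rcristin): shows how named arguments
# could map onto query parameters. The base URL is an assumption.
build_query_url <- function(..., base_url = "https://api.cristin.no/v2/results") {
  params <- list(...)
  # Percent-encode each value so it is safe to place in a URL
  values <- vapply(
    params,
    function(x) utils::URLencode(as.character(x), reserved = TRUE),
    character(1)
  )
  pairs <- paste0(names(params), "=", values)
  paste0(base_url, "?", paste(pairs, collapse = "&"))
}

# The two parameter names below are the ones mentioned in the text
example_url <- build_query_url(institution = "186", year_reported = 2020)
print(example_url)
```

How rcristin builds its requests internally may well differ. The point is only that each argument name stands for one search parameter.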
First, we need to establish the search parameters. In this case we want results for a particular unit that belongs to an institution, but we don’t want to download all results for the whole institution. We need to specify the unit parameter, as well as published_since.
UiT has the institutional ID 186 in the Cristin database, and a look at the units list for this institution shows that SESAM has the unit ID 186.33.85.0.
The call to get_cristin_results() looks as follows:
sesam_results <- get_cristin_results(
unit = "186.33.85.0",
published_since = 2015
)
This yields a tibble with 362 results, with a lot of metadata contained in the columns:
sesam_results
#> # A tibble: 362 x 56
#> cristin_result_~ result_url category_code title year_published date_published
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1214492 https://a~ MEDIAINTERVI~ "GEN~ 2015 2015-01-23T00~
#> 2 1210116 https://a~ ANTHOLOGYACA "Sav~ 2017 <NA>
#> 3 1229676 https://a~ READEROPINION "Hva~ 2015 2015-03-03T00~
#> 4 1258447 https://a~ ARTICLE "Hun~ 2015 <NA>
#> 5 1261075 https://a~ ARTICLE "\"S~ 2015 <NA>
#> # ... with 357 more rows, and 50 more variables: links <list>,
#> # open_access <chr>, original_language <chr>, issue <chr>,
#> # number_of_pages <chr>, international_standard_numbers <list>,
#> # year_printed <chr>, year_reported <chr>, year_online <chr>, volume <chr>,
#> # organiser <chr>, funding_sources <list>, projects <list>,
#> # import_sources <list>, category_name_en <chr>, contributors_url <chr>,
#> # contributors_count <int>, contributors_preview <list>, created_date <chr>,
#> # last_modified_date <chr>, channel_title <chr>,
#> # publisher_cristin_publisher_id <chr>, publisher_name <chr>,
#> # publisher_url <chr>, publisher_place <chr>,
#> # series_cristin_journal_id <chr>, series_name <chr>,
#> # series_international_standard_numbers <list>, series_nvi_level <chr>,
#> # chapters_url <chr>, journal_cristin_journal_id <chr>, journal_name <chr>,
#> # journal_international_standard_numbers <list>, journal_nvi_level <chr>,
#> # journal_publisher_cristin_publisher_id <chr>, journal_publisher_name <chr>,
#> # journal_publisher_place <chr>, journal_publisher_url <chr>,
#> # pages_from <chr>, pages_to <chr>, pages_count <chr>, event_name <chr>,
#> # event_location <chr>, event_date_from <chr>, event_date_to <chr>,
#> # event_arranged_by_name <chr>, part_of_url <chr>,
#> # classification_scientific_disciplines <list>,
#> # classification_keywords <list>, summary <chr>
This can be summarised in various ways. For example, by counting the number of results per year:
sesam_results %>%
count(year_published)
#> # A tibble: 6 x 2
#> year_published n
#> <chr> <int>
#> 1 2015 60
#> 2 2016 63
#> 3 2017 67
#> 4 2018 53
#> 5 2019 64
#> # ... with 1 more row
Looks like a pretty stable output over time. What are the types of results?
sesam_results %>%
count(category_name_en, sort = TRUE)
#> # A tibble: 33 x 2
#> category_name_en n
#> <chr> <int>
#> 1 Academic lecture 73
#> 2 Lecture 69
#> 3 Academic article 39
#> 4 Academic chapter/article/Conference paper 39
#> 5 Interview 36
#> # ... with 28 more rows
A lot of communication work in the form of lectures, but also chapters in books and academic articles.
get_contributor_info()
This function takes a data frame of Cristin results retrieved using get_cristin_results(), looks up the contributor information for each result in that data frame, and returns a data frame with author and affiliation data for each Cristin result ID. The returned data frame can then be joined to the original data frame using the result ID, or analysed as is.
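As a rough illustration of the join mentioned above, the sketch below uses two small toy tibbles in place of the real data. It assumes that both data frames carry the result ID in a column named cristin_result_id. That name appears in the results tibble, but its presence in the contributor data is an assumption here.

```r
library(dplyr)

# Toy stand-ins for the two tibbles; the real ones have many more columns.
# Assumption: both share a `cristin_result_id` key column.
results <- tibble(
  cristin_result_id = c("1", "2"),
  title = c("First result", "Second result")
)
contributors <- tibble(
  cristin_result_id = c("1", "1", "2"),
  unit_name = c("Unit A", "Unit B", "Unit A")
)

# One row per contributor affiliation, with the result metadata attached
results_with_contributors <- contributors %>%
  left_join(results, by = "cristin_result_id")

print(results_with_contributors)
```

Joining from the contributor side keeps one row per affiliation, which is usually what you want when counting collaborations.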
Continuing with our example, we want to see which institutions SESAM publishes together with. First, we filter out anything in sesam_results that is not a peer-reviewed publication:
sesam_pubs <- sesam_results %>%
filter(
category_code %in% c(
"CHAPTERACADEMIC", "ARTICLE",
"ANTHOLOGYACA", "COMMENTARYACA"
)
)
Then, we simply pass the whole tibble to get_contributor_info():
sesam_contributors <- get_contributor_info(sesam_pubs)
The 87 publications have a total of 241 affiliated contributors. How many publications have authors from other institutions?
sesam_copublishers <- sesam_contributors %>%
filter(
cristin_institution_id != 186
)
There are 30 publications with authors from institutions other than UiT, but some of these have many co-authors from other institutions.
sesam_copublishers %>%
count(unit_name, sort = TRUE)
#> # A tibble: 40 x 2
#> unit_name n
#> <chr> <int>
#> 1 Stockholm University 10
#> 2 Sámi University of Applied Sciences 9
#> 3 Unknown 9
#> 4 NORCE Samfunn 7
#> 5 Russian Academy of Sciences 7
#> # ... with 35 more rows
Not all the affiliations have informative unit names (“Sweden”, “New Zealand”, “Administrasjon”), but it is clear that the researchers of SESAM are just as likely to co-publish internationally as with other Norwegian universities.
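Because the contributor data has one row per affiliation, the number of rows and the number of publications are different things. A minimal sketch of how to tell them apart, using a toy tibble in place of sesam_copublishers and assuming the result ID column is named cristin_result_id:

```r
library(dplyr)

# Toy stand-in for sesam_copublishers: one row per contributor affiliation.
# Assumption: the result ID column is called `cristin_result_id`.
copublishers <- tibble(
  cristin_result_id = c("1", "1", "2", "3", "3", "3"),
  unit_name = c("Unit A", "Unit B", "Unit A", "Unit C", "Unit C", "Unit A")
)

# Rows count affiliations; distinct result IDs count publications
n_affiliations <- nrow(copublishers)
n_publications <- n_distinct(copublishers$cristin_result_id)

print(c(affiliations = n_affiliations, publications = n_publications))
```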
rcristin is great for exploring the contents of the Cristin database. Often, you will find yourself trying out different combinations of queries to get the data you are interested in. This is essential to get a better understanding of what data is available.
However, it also helps to remember that every call to the Cristin API results in a lookup in a database somewhere, and the transfer of data from a server to your computer. While Cristin is not a huge database, and the maintainers have plenty of processing capacity, it is good practice to avoid unnecessarily large or redundant API calls. Try to specify as many parameters as you can in the queries, and if you are trying out multiple queries in succession (for example looping through a long list of parameter combinations) always remember to test it out on a smaller subset of those queries before running the full process. This will help you catch any errors or tune the parameters to reduce the amount of data transferred.
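One way to follow this advice is sketched below: build the full list of parameter combinations, but run only the first few before committing to the whole thing. The function fetch_one() is a hypothetical offline stand-in for a call to get_cristin_results(), and the second unit ID is a placeholder, so nothing here touches the API.

```r
# Hypothetical grid of queries: every unit/year combination.
# The second unit ID is a placeholder, not a real Cristin unit.
query_grid <- expand.grid(
  unit = c("186.33.85.0", "PLACEHOLDER.UNIT.ID"),
  year_reported = 2015:2019,
  stringsAsFactors = FALSE
)

# Stand-in for the real API call so the sketch runs offline.
# In practice this would wrap get_cristin_results().
fetch_one <- function(unit, year_reported) {
  data.frame(unit = unit, year_reported = year_reported,
             stringsAsFactors = FALSE)
}

# Run one query per row of the grid and stack the results
run_queries <- function(grid) {
  pieces <- Map(fetch_one, grid$unit, grid$year_reported)
  do.call(rbind, unname(pieces))
}

# Dry run on the first two combinations only
trial <- run_queries(head(query_grid, 2))
print(trial)

# Only once the trial output looks right:
# full <- run_queries(query_grid)
```

Inspecting the trial output first makes it easier to catch a misspelled parameter or an unexpectedly large result before ten requests have been sent instead of two.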