Finding and using open access data
1 - Search open access data collections
The make_datasets_tb function from the ready4 library can be used to create a summary table of the open access datasets we curate in our ready4 Dataverse Collection.
x <- make_datasets_tb("ready4")
One way to inspect this information is to group contents by Dataverse Collection using the print_data function.
print_data(x,
           by_dv_1L_lgl = TRUE) %>%
  kableExtra::scroll_box(width = "100%")
Dataverse | Name | Description | Creator | Datasets |
---|---|---|---|---|
fakes | Fake Data For Instruction And Illustration | Fake data used to illustrate toolkits developed with the ready4 open science framework. | Orygen | 1, 2, 3, 4, 5 |
firstbounce | First Bounce | A ready4 framework model of platforms. Aims to identify opportunities to improve the efficiency and equity of mental health services. | Orygen | |
ready4fw | ready4 Framework | A collection of datasets that support implementation of the ready4 framework for open science computational models of mental health systems. | Orygen | 6 |
readyforwhatsnext | readyforwhatsnext | Data collections for the readyforwhatsnext mental health systems model. | Orygen | 7, 8 |
springtides | Springtides | A ready4 framework model of places. Synthesises geometry (boundary, coordinate) and spatial attribute (e.g. population counts, environmental characteristics, service identifier and model coefficients associated with areas) data. | Orygen | 9 |
springtolife | Spring To Life | A ready4 framework model of people. Models the characteristics, behaviours, relationships and outcomes of groups of individuals relevant to policymakers and service planners aiming to improve population mental health. | Orygen | 10 |
TTU | Transfer to Utility | A collection of transfer to utility datasets developed with the ready4 open science framework. | Orygen | 11 |
Alternatively, we can itemise individual Dataverse Datasets. When doing so, it makes sense to prepare separate views for toy datasets designed for instruction and real datasets appropriate for use in modelling.
Datasets appropriate for use in modelling projects can be returned by supplying the value “real” to the what_1L_chr argument of print_data.
print_data(x,
           what_1L_chr = "real") %>%
  kableExtra::scroll_box(width = "100%")
Title | Description | Dataverse | DOI |
---|---|---|---|
ready4 Framework Abbreviations and Definitions | This dataset contains resources that help ready4 Framework Developers adopt common standards and workflows. | ready4fw | 10.7910/DVN/RIQTKK |
readyforwhatsnext posters | A collection of poster summaries about the readyforwhatsnext project and its outputs. | readyforwhatsnext | 10.7910/DVN/QBZFQV |
Australian demographic input parameters for Springtides model | Geometry, spatial attribute and metadata inputs for the demographic module of the readyforwhatsnext model. The demographic module is a systems dynamics spatial simulation of area demographic characteristics. The current version of the model is quite rudimentary and is designed to be extended by other models developped with the ready4 open science mental health modelling tools. | readyforwhatsnext | 10.7910/DVN/JHSCDJ |
Springtides reports for Local Government Areas in the North West of Melbourne | This dataset is a collection of reports generated by a development version of the Springtides Model Of Places. Each report summarises prevalence projections for a specified mental disorder / mental health condition for a Local Government Area that is wholly or partially within the catchment area of the Orygen youth mental health service in North West Melbourne. As these reports were generated by a development version of the Springtides Model, these projections should be regarded as exploratory. | springtides | 10.7910/DVN/V3OKZV |
Modelling the online helpseeking choice of socially anxious young people | Models to predict the online helpseeking choices of socially anxious young people in Australia and replication code and documentation to implement the discrete choice experiment that generated the models. All study outputs were created with the aid of the mychoice R package (https://ready4-dev.github.io/mychoice). | springtolife | 10.7910/DVN/VGPIPS |
Transfer to AQoL-6D Utility Mapping Algorithms | Catalogues of models (and the programs that produced them) that can be used in conjunction with the youthu R package to predict AQoL-6D health utility (and thus, derive QALYs) from measures collected in youth mental health services. | TTU | 10.7910/DVN/DKDIB0 |
To view toy datasets, instead supply the value “fakes”.
print_data(x,
           what_1L_chr = "fakes") %>%
  kableExtra::scroll_box(width = "100%")
Title | Description | Dataverse | DOI |
---|---|---|---|
TTU (Transfer to Utility) R package - AQoL-6D vignette output | This dataset has been generated from fake data as an instructional aid. It is not to be used to inform decision making. | fakes | 10.7910/DVN/D74QMP |
TTU (Transfer to Utility) R package - EQ-5D vignette output | This dataset is provided as a teaching aid. It is the output of tools from the TTU R package, applied to a synthetic dataset (Fake Data) of psychological distress and psychological wellbeing. It is not to be used to support decision-making. | fakes | 10.7910/DVN/612HDC |
Synthetic (fake) youth mental health datasets and data dictionaries | The datasets in this collection are entirely fake. They were developed principally to demonstrate the workings of a number of utility scoring and mapping algorithms. However, they may be of more general use to others. In some limited cases, some of the included files could be used in exploratory simulation based analyses. However, you should read the metadata descriptors for each file to inform yourself of the validity and limitations of each fake dataset. To open the RDS format files included in this dataset, the R package ready4use needs to be installed (see https://ready4-dev.github.io/ready4use/ ). It is also recommended that you install the youthvars package ( | fakes | 10.7910/DVN/HJXYKQ |
ready4use R package vignette output | This dataset is provided so that others can compare the output they generate when implementing vignette code with that generated by the authors. | fakes | 10.7910/DVN/W95KED |
Specific R Package - AQoL-6D Vignette Output | This dataset is provided so that others can apply the algorithms we have developed, consistent with the principles of the ready4 open science framework for data synthesis and simulation in mental health. | fakes | 10.7910/DVN/GW7ZKC |
2 - Ingest data from an open access repository
The section below renders an R Markdown program from the Acumen website. You can use the following links to:
- view the tutorial on the Acumen website (adds useful hyperlinks to code blocks);
- view the source file for that article; and
- edit its contents (requires a GitHub account).
1. Objectives
On completion of this tutorial you should be able to:
- Understand basic concepts relating to the Australian Mental Health Systems Models Dataverse Collection; and
- Have the ability to search for, download and ingest files contained in Dataverse Datasets that are linked to by the Australian Mental Health Systems Models Dataverse Collection using two alternative approaches:
  - Using a web based interface; and
  - Using R commands.
2. Prerequisites
You can complete most of this tutorial without any specialist skills or software other than having a web-browser connected to the Internet. However, if you wish to try running the R code for finding and downloading files described in the last part of the tutorial, then you must have R (and ideally RStudio as well) installed on your machine. Instructions for how to install this software are available at https://rstudio-education.github.io/hopr/starting.html .
3. Concepts
Before searching for or retrieving data from the Australian Mental Health Systems Models Dataverse Collection, the following concepts are useful to understand:
- The Dataverse Project is “an open source web application to share, preserve, cite, explore, and analyze research data.” It is developed at Harvard’s Institute for Quantitative Social Science (IQSS). More information about the project is available on the Dataverse Project’s website.
- There are many Dataverse Installations around the world (85 at the time of writing this tutorial). Each Dataverse Installation is an instance of an organisation installing the Dataverse Project’s software on its own servers to create and manage online data repositories. At the time of writing there is one Australian Dataverse Installation listed on the Dataverse Project’s website, which is the Australian Data Archive.
- The Harvard Dataverse is a Dataverse Installation that is managed by Harvard University and is open to researchers from all disciplines anywhere in the world. More details are available from its website.
- A Dataverse Collection (frequently and confusingly also referred to as simply a “Dataverse”) is a part of a Dataverse Installation that a user can set up to host multiple “Dataverse Datasets” (see next bullet point). The Dataverse Datasets in a Dataverse Collection typically share common attributes (for example, they are in the same topic area or produced by the same group(s) of researchers), and a Dataverse Collection can be branded to a limited degree. Dataverse Collections will also contain descriptive metadata about the purpose and ownership of the collection.
- A Dataverse Dataset is a uniquely identified collection of files (some of which, again confusingly, can be tabular data files of the type that researchers typically refer to as “datasets”) within a Dataverse Collection. Each Dataverse Dataset will have a name, a Digital Object Identifier, a version number, citation information and details of the licensing/terms of use that apply to its contents.
- A Linked Dataverse Dataset is a Dataverse Dataset that appears in a Dataverse Collection’s list of contents without actually being in that Dataverse Collection (it is hosted in another Dataverse Collection and is potentially owned and controlled by another user).
- The Australian Mental Health Systems Models Dataverse Collection (which we will refer to as “our Dataverse Collection”) is a Dataverse Collection of Linked Dataverse Datasets within the Harvard Dataverse. We established our Dataverse Collection in the Harvard Dataverse because of the robustness and flexibility that this service provides. A factor in our choice of the Harvard Dataverse was that the aim of our Dataverse Collection is to promote easy access to non-confidential data relevant to modelling Australian mental health policy and service planning topics. The non-confidential nature of the data means that the additional administrative requirements that some other Dataverse Installations place on users were potentially unnecessary for our specific purposes. As a collection of Linked Dataverse Datasets, our Dataverse Collection can be used by modelling groups as both a centralised location to find relevant data and as an additional promotion / distribution channel to share Dataverse Datasets from their own Harvard Dataverse Collections without surrendering any control over the management of their data (they continue to curate their Dataverse Dataset and can modify Dataverse Dataset contents, metadata and terms of use as they see fit).
3. Search and download dataset files
There are multiple options for searching and downloading files contained in our Dataverse Collection. This tutorial discusses just two: one based on using a web browser and the other based on using R commands. For details on other options, consult the Harvard Dataverse user guide and (for more technical readers) the API guide.
3.1 Web browser approach
Searching and retrieving data from our Dataverse Collection via a web browser is very simple, and this method is suitable for low volume requests (i.e. occasional use) where reproducibility is not important.
To find and download data using your web browser, implement the following steps:
- Go to our Dataverse Collection at https://dataverse.harvard.edu/dataverse/openmind
- Search for the Linked Dataverse Dataset most of interest to you by using the tools provided on the landing page.
- Click on the link to your selected Dataverse Dataset. Note that by doing so you will leave our Dataverse Collection and be taken to the Dataverse Collection controlled by the Dataverse Dataset’s owner.
- (Optional) Click on the “Metadata”, “Terms” and “Versions” tabs or (if available) the Related Publication links to discover more about the dataset. When you are done, click on the “Files” tab to review the files contained in the Dataverse Dataset.
- Select the files that you wish to download using the checkboxes and click on the “Download” button.
- When prompted, review any terms of use you are presented with and either accept them or cancel the download as you feel appropriate.
More detail on some of the above steps is available in the following section of the Harvard Dataverse user guide: https://guides.dataverse.org/en/latest/user/find-use-data.html#finding-data
3.2 Using R commands
Relying on a web browser alone has two limitations: it is a manual approach that becomes inefficient for large numbers of data requests, and it is not reproducible (thereby limiting transparency about the specific data items / versions used in an analysis). It can therefore be desirable to explore alternatives that are based on programming commands. Programmatic approaches have the advantage of being more readily incorporated into automated and reproducible workflows.
There is a range of software tools in different languages that can be used to programmatically search and retrieve files in Dataverse Collections. More information on these resources is available on a dedicated page within the Dataverse Project’s documentation.
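To give a sense of what a programmatic request looks like before any client library is involved, the sketch below builds a query URL for the Dataverse Search API using base R only. It is a minimal illustration, not part of the original tutorial: `make_search_url` is a hypothetical helper, the search term is invented, and whether a collection-scoped query returns Linked Dataverse Datasets is an assumption that this tutorial does not confirm.

```r
# Hypothetical helper (not from any package): builds a Dataverse Search API URL.
# Nothing is sent over the network; the function only assembles a string.
make_search_url <- function(query_1L_chr,
                            server_1L_chr = "dataverse.harvard.edu",
                            collection_1L_chr = "openmind") {
  paste0("https://", server_1L_chr, "/api/search",
         "?q=", utils::URLencode(query_1L_chr, reserved = TRUE), # encode spaces etc.
         "&type=dataset",                   # restrict results to datasets
         "&subtree=", collection_1L_chr)    # restrict to one Dataverse Collection
}

# Illustrative search term (an assumption, chosen only for the example)
search_url_1L_chr <- make_search_url("youth mental health")
print(search_url_1L_chr)
```

The resulting URL could be opened in a browser or passed to a JSON parser; the R packages described next wrap this kind of request so that you do not have to assemble it by hand.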
One of these tools is dataverse, “the R Client for Dataverse Repositories”. The dataverse R package has a range of functions that are very helpful for general tasks relating to the search and retrieval of files contained in Dataverse Datasets. These functions are not the focus of this tutorial, but you can read more about them on the package’s documentation website (https://iqss.github.io/dataverse-client-r/).
The remainder of this tutorial is focused on the use of another R package called ready4use, which was created by Orygen to help manage open-source data for use in mental health models. The ready4use R package extends the dataverse R package and one of its applications is to ingest R objects stored in Dataverse Datasets in the “.Rds” file format directly into an R Session’s working memory. More information about ready4use is available on its documentation website.
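For readers unfamiliar with the “.Rds” format, the following base R sketch shows the underlying idea: a single R object is serialised to a file and later restored unchanged. It uses an invented toy data frame and a temporary file rather than anything from a Dataverse Dataset, and it does not reproduce how ready4use performs its ingest; objects with package-specific classes may additionally need those packages installed to be used fully.

```r
# A small stand-in data frame (invented values, not from any Dataverse Dataset)
toy_df <- data.frame(id = c("a", "b", "c"),
                     score = c(1L, 2L, 3L))

# Serialise the object to a temporary .Rds file
path_1L_chr <- tempfile(fileext = ".Rds")
saveRDS(toy_df, path_1L_chr)

# Restore it into working memory and confirm nothing changed
restored_df <- readRDS(path_1L_chr)
is_same_1L_lgl <- identical(toy_df, restored_df)
print(is_same_1L_lgl)

# Tidy up the temporary file
unlink(path_1L_chr)
```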
3.2.1 Install and load required R packages
As ready4use is still a development package, you may need to first install the devtools package to help install it. The following commands entered in your R console will do this.
utils::install.packages("devtools")
devtools::install_github("ready4-dev/ready4use")
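If you would rather check what is already installed before running the commands above, a check along the following lines may help. This is a sketch only: `find_missing_pkgs` is a hypothetical helper, and it reports missing packages without installing anything. Note that `requireNamespace` loads the namespace of any package that is found.

```r
# Hypothetical helper: returns the subset of package names that are not installed.
find_missing_pkgs <- function(pkgs_chr) {
  is_installed_lgl <- vapply(pkgs_chr,
                             requireNamespace,
                             logical(1),
                             quietly = TRUE)
  pkgs_chr[!is_installed_lgl]
}

# Check the two packages used in this step
missing_chr <- find_missing_pkgs(c("devtools", "ready4use"))
if (length(missing_chr) > 0) {
  message("Not yet installed: ", paste(missing_chr, collapse = ", "))
} else {
  message("All required packages are already installed.")
}
```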
We now load the ready4use package and the ready4 framework for youth mental health modelling that it depends on. The ready4 framework will have been automatically installed along with ready4use.
library(ready4)
library(ready4use)
3.2.2 Specify repository details
The next step is to create a Ready4useRepos object, which in this example we will call X, that contains the details of the Dataverse Dataset from which we wish to retrieve R objects. We need to supply three pieces of information to Ready4useRepos. Two of these items of information will be the same for any data item retrieved from our Dataverse Collection and are the Dataverse Collection identifier (which for us is “openmind”) and the server on which the containing Dataverse Installation is hosted (in our case “dataverse.harvard.edu”). The one item of information that will vary based on your requirements is the name / identifier (DOI) of the Dataverse Dataset from which we wish to retrieve data. In this example we are using the DOI for the “Synthetic (fake) youth mental health datasets and data dictionaries” Dataverse Dataset.
X <- Ready4useRepos(dv_nm_1L_chr = "openmind",
                    dv_server_1L_chr = "dataverse.harvard.edu",
                    dv_ds_nm_1L_chr = "https://doi.org/10.7910/DVN/HJXYKQ")
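The call above supplies the DOI as a full URL, whereas the dataset tables earlier on this page list bare DOIs. The sketch below shows one way to convert between the two forms with base R. `as_doi_url` is a hypothetical helper, not a ready4use function, and whether Ready4useRepos also accepts other DOI formats is not something this tutorial states.

```r
# Hypothetical helper: normalises a DOI given in bare, "doi:" or URL form
# into the "https://doi.org/..." form used in the example above.
as_doi_url <- function(doi_chr) {
  bare_chr <- sub("^(https?://(dx\\.)?doi\\.org/|doi:)", "", trimws(doi_chr))
  paste0("https://doi.org/", bare_chr)
}

# DOIs copied from the tables earlier on this page, in three input styles
doi_urls_chr <- as_doi_url(c("10.7910/DVN/HJXYKQ",
                             "doi:10.7910/DVN/DKDIB0",
                             "https://doi.org/10.7910/DVN/RIQTKK"))
print(doi_urls_chr)
```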
Having supplied the details of where the data is stored, we can now ingest the data we are interested in. We can either ingest all R objects in the selected Dataverse Dataset or just the objects that we specify. By default, R objects are ingested along with their metadata, but we can choose not to ingest the metadata.
3.2.3 Ingest all R objects from a Dataverse Dataset along with its metadata
To ingest all R objects in the dataset, we enter the following command.
Y <- ingest(X)
We can now create separate list objects for the ingested data and its metadata.
data_ls <- procureSlot(Y,"b_Ready4useIngest@objects_ls")
meta_ls <- procureSlot(Y,"a_Ready4usePointer@b_Ready4useRepos@dv_ds_metadata_ls$ds_ls")
We can itemise the data objects we have ingested with the following command.
names(data_ls)
#> [1] "eq5d_ds_dict" "eq5d_ds_tb" "ymh_clinical_dict_r3"
#> [4] "ymh_clinical_tb"
We can also see what metadata fields we have ingested.
names(meta_ls)
#> [1] "id" "datasetId" "datasetPersistentId"
#> [4] "storageIdentifier" "versionNumber" "versionMinorNumber"
#> [7] "versionState" "lastUpdateTime" "releaseTime"
#> [10] "createTime" "termsOfUse" "fileAccessRequest"
#> [13] "metadataBlocks" "files"
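Several of these fields are useful for documenting exactly which release of a Dataverse Dataset an analysis used. The sketch below assembles a short provenance string from three of the field names listed above. It runs on a mock list with invented values (an assumption made so that the example works without a network connection); with a real ingest you would substitute meta_ls for mock_meta_ls.

```r
# Mock metadata list: field names match the output above, values are invented.
mock_meta_ls <- list(datasetPersistentId = "doi:10.7910/DVN/HJXYKQ",
                     versionNumber = 1L,
                     versionMinorNumber = 0L,
                     versionState = "RELEASED")

# Combine major and minor version numbers into a single label
version_1L_chr <- paste0(mock_meta_ls$versionNumber, ".",
                         mock_meta_ls$versionMinorNumber)

# A one-line provenance note suitable for a report or log
provenance_1L_chr <- paste0(mock_meta_ls$datasetPersistentId,
                            " (version ", version_1L_chr, ", ",
                            mock_meta_ls$versionState, ")")
print(provenance_1L_chr)
```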
There can be a lot of useful information contained in this metadata list object. For example, we can retrieve descriptive information about the Dataverse Dataset from which we have ingested data.
meta_ls$metadataBlocks$citation$fields$value[[7]]$dsDescriptionValue$value
#> [1] "The datasets in this collection are entirely fake. They were developed principally to demonstrate the workings of a number of utility scoring and mapping algorithms. However, they may be of more general use to others. In some limited cases, some of the included files could be used in exploratory simulation based analyses. However, you should read the metadata descriptors for each file to inform yourself of the validity and limitations of each fake dataset. To open the RDS format files included in this dataset, the R package ready4use needs to be installed (see https://ready4-dev.github.io/ready4use/ ). It is also recommended that you install the youthvars package ( https://ready4-dev.github.io/youthvars/) as this provides useful tools for inspecting and validating each dataset."
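The command above selects the description by position (`[[7]]`), which would silently return a different field if the order of metadata fields changed. A possibly more robust alternative is to look the field up by name. The sketch below illustrates this on a mock object. It assumes that the `fields` element is a table with a `typeName` column, as in Dataverse's JSON metadata; that column is not shown in this tutorial, and `get_field_value` is a hypothetical helper.

```r
# Mock of the citation fields table (structure assumed, values invented)
mock_fields_df <- data.frame(typeName = c("title", "subject", "dsDescription"),
                             stringsAsFactors = FALSE)
mock_fields_df$value <- list("A mock title",
                             "A mock subject",
                             list(dsDescriptionValue = list(value = "A mock description.")))

# Hypothetical helper: return the value of the field with the given type name
get_field_value <- function(fields_df, type_nm_1L_chr) {
  idx_int <- which(fields_df$typeName == type_nm_1L_chr)
  if (length(idx_int) != 1) {
    stop("Expected exactly one field named '", type_nm_1L_chr, "'")
  }
  fields_df$value[[idx_int]]
}

# Look up the description by name rather than by position
description_1L_chr <- get_field_value(mock_fields_df,
                                      "dsDescription")$dsDescriptionValue$value
print(description_1L_chr)
```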
The metadata also contains descriptive information on each file in the Dataverse Dataset.
meta_ls$files$description[5]
#> [1] "A synthetic (fake) dataset representing clients in an Australian primary youth mental health service. This dataset was generated from parameter values derived from a sample of 1107 clients of headspace services using a script that is also included in this dataset. The purpose of this synthetic dataset was to allow the replication code for a utility mapping study (see: https://doi.org/10.1101/2021.07.07.21260129) to be run by those lacking access to the original dataset. The dataset may also have some limited value as an input dataset for purely exploratory studies in simulation studies of headspace clients, as its source dataset was reasonably representative of the headpace client population. However, it should be noted that the algorithm that generated this dataset only captures aspects of the joint distributions of the psychological and health utility measures. Other sample characteristic variables (age, gender, etc) are only representative of the source dataset when considered in isolation, rather than jointly."
3.2.4 Ingest all R objects from a Dataverse Dataset without metadata
If we wished to ingest only the R objects without metadata, we could have simply run the following command.
data_2_ls <- ingest(X,
                    metadata_1L_lgl = FALSE)
We can see that this ingest is identical to that made using the previous method.
identical(data_ls, data_2_ls)
#> [1] TRUE
3.2.5 Ingest selected R objects
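If we only want to ingest one specific object, we can supply its name.
ymh_clinical_tb <- ingest(X,
                          fls_to_ingest_chr = "ymh_clinical_tb",
                          metadata_1L_lgl = FALSE)
The output from an object specific call to the ingest method is the requested object.
head(ymh_clinical_tb)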
#> # A tibble: 6 × 43
#> fkClientID round d_interv…¹ d_age d_gen…² d_sex…³ d_sex…⁴ d_ATSI d_cou…⁵
#> <chr> <fct> <date> <int> <chr> <chr> <fct> <chr> <chr>
#> 1 Participant_1 Basel… 2020-03-22 14 Male Male Hetero… No Austra…
#> 2 Participant_2 Basel… 2020-06-15 19 Female Female Hetero… Yes Other
#> 3 Participant_3 Basel… 2020-08-20 21 Female Female Other NA NA
#> 4 Participant_4 Basel… 2020-05-23 12 Female Female Hetero… Yes Other
#> 5 Participant_5 Basel… 2020-04-05 19 Male Male Hetero… Yes Other
#> 6 Participant_6 Basel… 2020-06-09 19 Male Male Hetero… Yes Other
#> # … with 34 more variables: d_english_home <chr>, d_english_native <chr>,
#> # d_studying_working <chr>, d_relation_s <chr>, s_centre <chr>,
#> # c_p_diag_s <chr>, c_clinical_staging_s <chr>, k6_total <int>,
#> # phq9_total <int>, bads_total <int>, gad7_total <int>, oasis_total <int>,
#> # scared_total <int>, c_sofas <int>, aqol6d_q1 <int>, aqol6d_q2 <int>,
#> # aqol6d_q3 <int>, aqol6d_q4 <int>, aqol6d_q5 <int>, aqol6d_q6 <int>,
#> # aqol6d_q7 <int>, aqol6d_q8 <int>, aqol6d_q9 <int>, aqol6d_q10 <int>, …
We can also request to ingest multiple specified objects from a Dataverse Dataset.
data_3_ls <- ingest(X,
                    fls_to_ingest_chr = c("ymh_clinical_tb","ymh_clinical_dict_r3"),
                    metadata_1L_lgl = FALSE)
This last request produces a list of ingested objects.
names(data_3_ls)
#> [1] "ymh_clinical_dict_r3" "ymh_clinical_tb"
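Note that the names in this output are not in the order in which the objects were requested. When using the result in a script, it may therefore be safer to select objects by name than by position, and to check that everything requested was returned. The sketch below shows one way to do this. It uses a mock list of empty data frames in place of data_3_ls so that it runs without a network connection.

```r
# The object names requested in the call above
requested_chr <- c("ymh_clinical_tb", "ymh_clinical_dict_r3")

# Mock stand-in for data_3_ls (empty data frames, same names and order as the output above)
mock_data_ls <- list(ymh_clinical_dict_r3 = data.frame(),
                     ymh_clinical_tb = data.frame())

# Report any requested object that was not returned
missing_chr <- setdiff(requested_chr, names(mock_data_ls))
if (length(missing_chr) > 0) {
  stop("Objects not ingested: ", paste(missing_chr, collapse = ", "))
}

# Select by name so that the list follows the requested order
ordered_ls <- mock_data_ls[requested_chr]
print(names(ordered_ls))
```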