Fully and partially labeled data
Fully and partially labeled datasets serve a special role in the ASReview context. These datasets have review decisions for a subset of the records or for all records in the dataset.
Tip
Partially labeled data is useful in the Oracle mode, whereas fully labeled data is useful in the Simulation and Exploration modes.
Label format
For tabular datasets (e.g., CSV, XLSX), the dataset should contain a column called “included” or “label” (see Data format for all naming conventions). This column is filled with 1s or 0s for the records that are already screened or that experts selected to be used as prior knowledge. The value is left empty for records that you haven’t screened yet or that were newly added to the dataset.
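As an illustration of the tabular label format, the following minimal sketch writes a small CSV with an “included” column. The records and the “title” and “abstract” column names are hypothetical; only the “included” column and its 1/0/empty values come from the description above.

```python
import csv
import io

# Hypothetical records: "title" and "abstract" are illustrative column names.
# "included" is the label column described in the text:
# "1" = relevant, "0" = irrelevant, "" = not screened yet.
records = [
    {"title": "Study A", "abstract": "Relevant to the review.", "included": "1"},
    {"title": "Study B", "abstract": "Screened and excluded.", "included": "0"},
    {"title": "Study C", "abstract": "Not screened yet.", "included": ""},
]

# Write to an in-memory buffer; replace with open("dataset.csv", "w", newline="")
# to create a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "abstract", "included"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

The third record keeps an empty label cell, which is how an unscreened record is represented in this format.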
For the RIS file format, the labels ASReview_relevant, ASReview_irrelevant, and ASReview_not_seen can be stored with the N1 (Notes) tag. An example of a RIS file with labels in the N1 tag can be found in the ASReview GitHub repository.
All labels in this example are valid ways to label the data.
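To make the RIS labeling concrete, here is a minimal sketch that reads a label from the N1 tag of a single record. The record itself is hypothetical and much shorter than a real export, the helper label_from_ris is not part of ASReview, and representing “not seen” as None is a choice made for this sketch. Consult the example file in the ASReview GitHub repository for the authoritative layout.

```python
# Mapping from the N1 label strings named in the text to numeric labels.
# Using None for "not seen" is an assumption of this sketch.
LABELS = {
    "ASReview_relevant": 1,
    "ASReview_irrelevant": 0,
    "ASReview_not_seen": None,
}

# A hypothetical, minimal RIS record; real exports carry many more tags.
ris_record = """TY  - JOUR
TI  - Example study
N1  - ASReview_relevant
ER  -
"""


def label_from_ris(record):
    """Return 1, 0, or None based on the first ASReview label found in an N1 tag."""
    for line in record.splitlines():
        if line.startswith("N1"):
            # Everything after the first hyphen is the tag's value.
            _, _, value = line.partition("-")
            value = value.strip()
            if value in LABELS:
                return LABELS[value]
    # No recognized label: treat the record as unlabeled.
    return None


print(label_from_ris(ris_record))
```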
Note
Exported files containing labeling decisions can be imported into ASReview LAB again, after which all labels are recognized.
Partially labeled data
Tip
Useful for Oracle projects. Read more about Project modes.
Partially labeled datasets are datasets with a labeling decision for a subset of the records in the dataset and no decision for another subset.
A partially labeled dataset can be obtained by exporting results from ASReview LAB or other software. It can also be constructed in the format described above by merging a labeled dataset with new unlabeled records.
Partially labeled datasets are useful because ASReview LAB recognizes the labels as Prior Knowledge and uses them to train the first iteration of the active learning model.
Note
Merging labeled with unlabeled data should be done outside ASReview LAB, for example, with a citation manager.
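A short script is another way to do this merge outside ASReview LAB. The sketch below is a minimal illustration, assuming two small CSV inputs with a “title” column; the data and the helper merge_records are hypothetical, and the sketch does not check for duplicate records.

```python
import csv
import io

# Hypothetical inputs: a labeled export and a set of new, unlabeled records.
labeled_csv = "title,included\nStudy A,1\nStudy B,0\n"
unlabeled_csv = "title\nStudy C\nStudy D\n"


def merge_records(labeled_text, unlabeled_text, label_column="included"):
    """Combine labeled and unlabeled rows; new rows get an empty label."""
    labeled = list(csv.DictReader(io.StringIO(labeled_text)))
    unlabeled = list(csv.DictReader(io.StringIO(unlabeled_text)))
    for row in unlabeled:
        # An empty value marks the record as not screened yet.
        row[label_column] = ""
    return labeled + unlabeled


merged = merge_records(labeled_csv, unlabeled_csv)
for row in merged:
    print(row)
```

The result is a partially labeled dataset: the first two rows keep their decisions and the new rows have an empty label.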
Fully labeled data
Tip
Useful for Simulation and Exploration projects. Read more about Project modes.
Fully labeled datasets are datasets with a labeling decision for each record in the dataset. Fully labeled datasets are useful for exploration or simulation purposes (see also What is a simulation? and Project modes).
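The difference between fully and partially labeled data can be checked with a few lines of code. This is a minimal sketch, assuming the label column has already been read into a list of strings where "1" and "0" are decisions and an empty string means no decision; labeling_status is a hypothetical helper, not an ASReview function.

```python
def labeling_status(labels):
    """Classify a list of label strings ("1", "0", or "") by how complete it is."""
    labeled = sum(1 for value in labels if value in ("0", "1"))
    if labels and labeled == len(labels):
        return "fully labeled"
    if labeled == 0:
        return "unlabeled"
    return "partially labeled"


print(labeling_status(["1", "0", "0"]))  # every record has a decision
print(labeling_status(["1", "", "0"]))   # one record has no decision
print(labeling_status(["", ""]))         # no decisions at all
```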
Benchmark datasets
The ASReview research project collects fully labeled datasets published open access. The labeled datasets are PRISMA-based systematic reviews or meta-analyses on various research topics. They can be useful for teaching purposes or for testing the performance of (new) active learning models. The datasets and their metadata are available via the SYNERGY Dataset repository. In ASReview LAB, these datasets are found under “Benchmark Datasets”.
The Benchmark Datasets are directly available in the software. During the
Add Dataset step of the project setup, there is a panel
with all the datasets. The datasets can be selected and used directly.
Benchmark datasets are also available in simulations run from the command line (see Simulation via command line). Use the prefix synergy: followed by the identifier of the dataset (see the SYNERGY Dataset repository). For example, to use the Van de Schoot et al. (2018) dataset, use synergy:van_de_schoot_2018.
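The prefix rule can be sketched as a small helper that builds the dataset argument. The function synergy_dataset_argument is hypothetical, and the command is only assembled and printed, not executed. Further options of the simulate command, such as where to store the results, depend on the installed ASReview version and are omitted here.

```python
def synergy_dataset_argument(dataset_id):
    """Prefix a SYNERGY dataset identifier as described in the text."""
    return f"synergy:{dataset_id}"


# Assemble the command as a list of arguments without running it.
command = ["asreview", "simulate", synergy_dataset_argument("van_de_schoot_2018")]
print(" ".join(command))  # asreview simulate synergy:van_de_schoot_2018
```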