Access data from ASReview file

The API is still under development and can change at any time without warning.

Data generated using ASReview LAB is stored in an ASReview project file (extension .asreview). The ASReview Python API offers two ways to access the data in this file: the project API and the state API. The project API retrieves general project settings, the imported dataset, the feature matrix, and so on. The state API retrieves data related directly to the reviewing process, such as the labels, the time of labeling, and the classifier used.

Example Data

To illustrate the ASReview Python API, the benchmark dataset van_de_Schoot_2017 is used. The project file example.asreview can be obtained by running asreview simulate benchmark:van_de_Schoot_2017 -s example.asreview --seed 101.

The ASReview Python API can be used for project files obtained via the Oracle, Validation, and Simulation modes.

Python Imports

[1]:
import shutil
from pathlib import Path

import pandas as pd
from asreview import open_state
from asreview import ASReviewProject
from asreview import ASReviewData

Project API

The ASReview project file is a zipped folder. To unzip the folder and store its contents in a temporary directory, use the following code:

[2]:
project_path = Path("tmp_data")
project_path.mkdir()
project = ASReviewProject.load("example.asreview", project_path)

The returned project instance is of type ASReviewProject.
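Because the project file is a regular ZIP archive, its contents can also be listed with Python's standard zipfile module without extracting anything. The sketch below builds a small stand-in archive so that it runs on its own; the member names are illustrative assumptions and not the actual layout of an ASReview project. With a real project, pass "example.asreview" to zipfile.ZipFile instead.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a tiny stand-in archive so the sketch runs without a real project.
    # The member names are hypothetical, not the real ASReview layout.
    archive_fp = Path(tmp_dir) / "stand_in.asreview"
    with zipfile.ZipFile(archive_fp, "w") as zf:
        zf.writestr("project.json", '{"id": "example"}')
        zf.writestr("data/dataset.csv", "title,abstract\n")

    # List the members without extracting them to disk.
    with zipfile.ZipFile(archive_fp) as zf:
        member_names = sorted(zf.namelist())

print(member_names)
```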

To inspect the project details, use the following code:

[3]:
project.config

[3]:
{'version': '1.2+6.g41c4257.dirty',
 'id': 'example',
 'mode': 'simulate',
 'name': 'example',
 'description': 'Simulation created via ASReview via command line interface',
 'authors': None,
 'created_at_unix': 1683798551,
 'datetimeCreated': '2023-05-11 11:49:11.327073',
 'reviews': [{'id': 'e611d2cbd89b401aa376fa4eca1c517e',
   'start_time': '2023-05-11 11:49:12.323797',
   'status': 'finished',
   'end_time': '2023-05-11 11:49:32.450593'}],
 'feature_matrices': [{'id': 'tfidf', 'filename': 'tfidf_feature_matrix.npz'}],
 'dataset_path': 'van_de_Schoot_2017.csv'}
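
The config is a plain dictionary, so its values can be processed with standard Python. As a small illustration, the sketch below computes how long the simulation ran from the start_time and end_time of the review entry. The timestamps are copied from the output above rather than read from a project file.

```python
from datetime import datetime

# Values copied from the 'reviews' entry of the config shown above.
review = {
    "start_time": "2023-05-11 11:49:12.323797",
    "end_time": "2023-05-11 11:49:32.450593",
}

start = datetime.fromisoformat(review["start_time"])
end = datetime.fromisoformat(review["end_time"])
duration_seconds = (end - start).total_seconds()
print(f"The simulation took {duration_seconds:.1f} seconds.")
```

With a loaded project, the same dictionary would be available as project.config["reviews"][0].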

The imported dataset is located at tmp_data/{project_id}/data/{dataset_filename}, and can be inspected using the following code:

[4]:
dataset_fp = Path(
    project_path, project.config["id"], "data", project.config["dataset_path"]
)
dataset = ASReviewData.from_file(dataset_fp)
print(f"The dataset contains {len(dataset)} records.")
dataset.to_dataframe().head()

The dataset contains 6189 records.
[4]:
title abstract keywords authors year date doi label_included label_abstract_screening duplicate_record_id
record_id
0 Manual for ASEBA School-Age Forms & Profiles Achenbach, T. M., Rescorla, L. A. 2001.0 2001 NaN 0 0 NaN
1 Queensland Trauma Registry: A summary of paedi... Dallow, N., Lang, J., Bellamy, N. 2007.0 2007 NaN 0 0 NaN
2 Posttraumatic Stress Disorder: Scientific and ... This comprehensive overview of research and cl... Ford, J. D., Grasso, D. J., Elhai, J. D., Cour... 2015.0 NaN NaN 0 0 NaN
3 SOCIAL CLASS AND MENTAL ILLNESS Hollingshead, A. B., Redlich, F. C. 1958.0 NaN NaN 0 0 NaN
4 Computerised test generation for cross-nationa... “‘Computerised Test Generation for Cross-Natio... Irvine, S. H. 2014.0 NaN NaN 0 0 NaN

To inspect the feature matrix, for example its first row, use the following code. Note that the matrix is stored in a sparse format, so only the non-zero entries are printed:

[5]:
feature_extraction_id = project.feature_matrices[0]["id"]
feature_matrix = project.get_feature_matrix(feature_extraction_id)
print(feature_matrix[0])

  (0, 20452)    0.35937211648312967
  (0, 18297)    0.26158369118434677
  (0, 13842)    0.3248271421716685
  (0, 9739)     0.38355660008860293
  (0, 3231)     0.7059309068495663
  (0, 2384)     0.22684547910949254
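
Each printed line reads as (row, column) followed by the weight stored at that position. As a quick sanity check, the sketch below copies the six printed entries into a dictionary and computes the L2 norm of the row. The result is approximately 1, which is consistent with L2-normalised TF-IDF rows (the default of scikit-learn's TfidfVectorizer). That ASReview normalises its rows this way is an inference from these numbers, not something confirmed here.

```python
import math

# (row, column) -> weight, copied from the printed first row above.
first_row = {
    (0, 20452): 0.35937211648312967,
    (0, 18297): 0.26158369118434677,
    (0, 13842): 0.3248271421716685,
    (0, 9739): 0.38355660008860293,
    (0, 3231): 0.7059309068495663,
    (0, 2384): 0.22684547910949254,
}

# Euclidean length of the row, using only its non-zero entries.
l2_norm = math.sqrt(sum(weight * weight for weight in first_row.values()))
print(f"{len(first_row)} non-zero entries, L2 norm = {l2_norm:.6f}")
```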

State API

The data stored during the review process can be accessed as a pandas DataFrame using the following code:

[6]:
with open_state("example.asreview") as state:
    df = state.get_dataset()
    print(f"The state contains {len(df)} records.")

The state contains 561 records.

The returned state instance is of type SQLiteState. Note that the state contains fewer records than the original dataset. This is because, by default, the simulation stops after finding all relevant records, so the remaining records are never labeled.
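
To put the two record counts in perspective, the short calculation below uses the numbers printed above (6189 records in the dataset, 561 in the state) to express the screened share as a percentage.

```python
n_dataset = 6189  # records in the imported dataset
n_state = 561  # records in the state, prior knowledge included

fraction_screened = n_state / n_dataset
print(f"Screened {fraction_screened:.1%} of the dataset.")
```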

[7]:
df.to_csv(project_path / "example_state.csv", index=False)
df.head()

[7]:
   record_id  label  classifier  query_strategy  balance_strategy  feature_extraction  training_set               labeling_time  notes
0       4435      1        None           prior              None                None            -1  2023-05-11 11:49:16.186034   None
1       5560      0        None           prior              None                None            -1  2023-05-11 11:49:16.186034   None
2       4434      1          nb             max            double               tfidf             2  2023-05-11 11:49:16.420695   None
3       3668      0          nb             max            double               tfidf             3  2023-05-11 11:49:16.444989   None
4       3142      0          nb             max            double               tfidf             4  2023-05-11 11:49:16.505603   None

You can merge the information from the state with the original dataset. Records that were never labeled get NaN in the columns that come from the state.

[8]:
df["labeling_order"] = df.index
dataset_with_results = dataset.df.join(df.set_index("record_id"))
# Keep the index when writing: after the join it holds the record_id.
dataset_with_results.to_csv(project_path / "data_and_state_merged.csv")
dataset_with_results

[8]:
title abstract keywords authors year date doi label_included label_abstract_screening duplicate_record_id label classifier query_strategy balance_strategy feature_extraction training_set labeling_time notes labeling_order
record_id
0 Manual for ASEBA School-Age Forms & Profiles Achenbach, T. M., Rescorla, L. A. 2001.0 2001 NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Queensland Trauma Registry: A summary of paedi... Dallow, N., Lang, J., Bellamy, N. 2007.0 2007 NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Posttraumatic Stress Disorder: Scientific and ... This comprehensive overview of research and cl... Ford, J. D., Grasso, D. J., Elhai, J. D., Cour... 2015.0 NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 SOCIAL CLASS AND MENTAL ILLNESS Hollingshead, A. B., Redlich, F. C. 1958.0 NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Computerised test generation for cross-nationa... “‘Computerised Test Generation for Cross-Natio... Irvine, S. H. 2014.0 NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6184 Biological and clinical framework for posttrau... Three decades of posttraumatic stress disorder... Vermetten, E., Lanius, R. A. 2012.0 NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6185 Dividing traffic sub-areas based on a parallel... In order to alleviate the traffic congestion a... GPS trajectories, K-means, MapReduce, Traffic ... Wang, B., Tao, L., Gao, C., Xia, D., Rong, Z.,... 2014.0 NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6186 Quantifying resilience to enhance individualiz... Resilience is the human ability to adapt in th... Adaptation, Autonomic Nervous System, Resilien... Winslow, B., Carroll, M., Jones, D., Hannigan,... 2013.0 NaN NaN 0 0 NaN 0.0 nb max double tfidf 535.0 2023-05-11 11:49:31.593247 None 535.0
6187 A discriminant analysis of variables related t... Frye, James S. 1981.0 NaN NaN 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6188 Developmental trajectories of pain/disability ... Sterling, M., Hendrikz, J., Kenardy, J. 2010.0 NaN NaN 0 1 NaN 0.0 nb max double tfidf 333.0 2023-05-11 11:49:25.857883 None 333.0

6189 rows × 19 columns
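
The merge above is a left join on record_id, which is easier to follow on a toy example. The sketch below uses two small, made-up tables in place of the dataset and the state (all values are hypothetical) and then selects only the labeled records in the order in which they were labeled. It also shows why label appears as 0.0 rather than 0 in the merged table: the NaN values force integer columns to float.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the dataset and the state table.
dataset_df = pd.DataFrame(
    {"title": ["Paper A", "Paper B", "Paper C"]},
    index=pd.Index([0, 1, 2], name="record_id"),
)
state_df = pd.DataFrame({"record_id": [2, 0], "label": [1, 0]})

# Same steps as in the merge: the row position is the labeling order.
state_df["labeling_order"] = state_df.index
merged = dataset_df.join(state_df.set_index("record_id"))

# Left join: record 1 was never labeled, so its state columns are NaN.
labeled = merged.dropna(subset=["label"]).sort_values("labeling_order")
print(labeled)
```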

The state also provides methods that each return a single variable of the data. For example, to plot the labeling times in a graph, use the following code:

[9]:
with open_state("example.asreview") as state:
    labeling_times = state.get_labeling_times()
pd.to_datetime(labeling_times).plot(title="Time of labeling")

[9]:
<Axes: title={'center': 'Time of labeling'}>
[Figure: plot titled "Time of labeling"]

By default, the records that are part of the prior knowledge are included in the results. To obtain the labels without the prior records, pass priors=False:

[10]:
with open_state("example.asreview") as state:
    labels = state.get_labels(priors=False)
labels

[10]:
0      1
1      0
2      0
3      0
4      0
      ..
554    0
555    0
556    0
557    0
558    1
Name: label, Length: 559, dtype: int64

To obtain the data corresponding to a specific record identifier, use the following code:

[11]:
with open_state("example.asreview") as state:
    record_data = state.get_data_by_record_id(5176)
record_data

[11]:
   record_id  label  classifier  query_strategy  balance_strategy  feature_extraction  training_set               labeling_time  notes
0       5176      0          nb             max            double               tfidf            29  2023-05-11 11:49:17.247842   None

To obtain all settings used for the project, run the following code:

[12]:
with open_state("example.asreview") as state:
    settings = state.settings_metadata
settings

[12]:
{'settings': {'model': 'nb',
  'query_strategy': 'max',
  'balance_strategy': 'double',
  'feature_extraction': 'tfidf',
  'n_instances': 1,
  'stop_if': 'min',
  'n_prior_included': 1,
  'n_prior_excluded': 1,
  'model_param': {'alpha': 3.822},
  'query_param': {},
  'feature_param': {'ngram_max': 1,
   'stop_words': 'english',
   'split_ta': 0,
   'use_keywords': 0},
  'balance_param': {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}},
 'state_version': '1',
 'software_version': '1.2+6.g41c4257.dirty',
 'model_has_trained': True}
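
The settings can be cross-checked against the earlier outputs. The state holds 561 records including the prior knowledge and 559 labels without it, so the difference should equal n_prior_included plus n_prior_excluded. The sketch below performs that check with the numbers copied from the outputs above.

```python
# Values copied from the outputs above.
n_state_records = 561  # len(state.get_dataset())
n_labels_without_priors = 559  # len(state.get_labels(priors=False))
prior_settings = {"n_prior_included": 1, "n_prior_excluded": 1}

n_priors = prior_settings["n_prior_included"] + prior_settings["n_prior_excluded"]
counts_match = (n_state_records - n_labels_without_priors) == n_priors
print(f"{n_priors} prior records, counts consistent: {counts_match}")
```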

The state also contains the ranking and the relevance score (if the model uses relevance scores) of the last iteration of the machine learning model. To get these, use the following code:

[13]:
with open_state("example.asreview") as state:
    last_ranking = state.get_last_ranking()
    last_probabilities = state.get_last_probabilities()
print("RANKING:")
print(last_ranking[["record_id", "ranking"]])
print("RELEVANCE SCORES:")
print(last_probabilities)

RANKING:
      record_id  ranking
0          2445        0
1          2446        1
2          2444        2
3           720        3
4           719        4
...         ...      ...
6184       1766     6184
6185         63     6185
6186       4427     6186
6187       2851     6187
6188       4888     6188

[6189 rows x 2 columns]
RELEVANCE SCORES:
0       0.637417
1       0.671088
2       0.707728
3       0.777025
4       0.672183
          ...
6184    0.792209
6185    0.697030
6186    0.828880
6187    0.768638
6188    0.844882
Name: proba, Length: 6189, dtype: float64
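
To illustrate how a ranking can relate to relevance scores, the sketch below sorts a few made-up scores in descending order, so that rank 0 goes to the record with the highest score. This assumes that the ranking follows descending relevance, which is what the 'max' query strategy in the settings suggests. The record identifiers and scores are hypothetical, and the relation is not verified against the project file.

```python
# Hypothetical relevance scores, keyed by record_id.
scores = {10: 0.42, 11: 0.91, 12: 0.07, 13: 0.66}

# Rank 0 is the record with the highest relevance score.
ranked_ids = sorted(scores, key=scores.get, reverse=True)
ranking = {record_id: rank for rank, record_id in enumerate(ranked_ids)}
print(ranking)
```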

Cleanup

The following code removes the temporary folder that was created:

[14]:
shutil.rmtree(project_path)
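
As an alternative to removing the folder by hand, the standard library's tempfile.TemporaryDirectory deletes the folder automatically when its with block ends, even if an error occurs in between. The sketch below writes a placeholder file in place of real project data; the ASReviewProject.load call is shown only as a comment, so the example runs without ASReview installed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    project_path = Path(tmp_dir)
    # With ASReview installed, the project would be loaded here, e.g.:
    # project = ASReviewProject.load("example.asreview", project_path)
    (project_path / "placeholder.txt").write_text("stand-in for project data")
    existed_inside = project_path.exists()

# The directory and everything in it is gone once the block exits.
removed_after = not project_path.exists()
print(existed_inside, removed_after)
```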