Access data from ASReview file
The API is still under development and can change at any time without warning.
Data generated using ASReview LAB is stored in an ASReview project file (extension .asreview). The ASReview Python API offers two ways to access the data in this file: the project API and the state API. The project API retrieves general project settings, the imported dataset, the feature matrix, and so on. The state API retrieves data related directly to the reviewing process, such as the labels, the time of labeling, and the classifier used.
Example Data
To illustrate the ASReview Python API, the benchmark dataset van_de_Schoot_2017 is used. The project file example.asreview can be obtained by running the following command:
asreview simulate benchmark:van_de_Schoot_2017 -s example.asreview --seed 101
The ASReview Python API can be used for project files created in Oracle, Validation, or Simulation mode.
Python Imports
[1]:
import shutil
from pathlib import Path
import pandas as pd
from asreview import open_state
from asreview import ASReviewProject
from asreview import ASReviewData
Project API
The ASReview project file is a zipped folder. To unzip the folder and store its contents in a temporary directory, use the following code:
[2]:
project_path = Path("tmp_data")
project_path.mkdir()
project = ASReviewProject.load("example.asreview", project_path)
The returned project instance is of type ASReviewProject.
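Because the project file is an ordinary ZIP archive, its contents can also be listed without extracting anything. The following is a minimal sketch using only the standard library. The helper name list_project_members is hypothetical, and the small stand-in archive is built only so that the snippet runs without example.asreview; its member names are placeholders, not a description of the actual layout of a project file.

```python
import tempfile
import zipfile
from pathlib import Path


def list_project_members(project_file):
    """Return the sorted member names of a ZIP-based project file."""
    with zipfile.ZipFile(project_file) as archive:
        return sorted(archive.namelist())


# Build a tiny stand-in archive; the member names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.asreview"
    with zipfile.ZipFile(demo_file, "w") as archive:
        archive.writestr("project.json", "{}")
        archive.writestr("data/records.csv", "title\n")
    members = list_project_members(demo_file)

print(members)
```

With a real project, list_project_members("example.asreview") should work the same way, since only read access to the archive is needed.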
To inspect the project details, use the following code:
[3]:
project.config
[3]:
{'version': '1.2+6.g41c4257.dirty',
'id': 'example',
'mode': 'simulate',
'name': 'example',
'description': 'Simulation created via ASReview via command line interface',
'authors': None,
'created_at_unix': 1683798551,
'datetimeCreated': '2023-05-11 11:49:11.327073',
'reviews': [{'id': 'e611d2cbd89b401aa376fa4eca1c517e',
'start_time': '2023-05-11 11:49:12.323797',
'status': 'finished',
'end_time': '2023-05-11 11:49:32.450593'}],
'feature_matrices': [{'id': 'tfidf', 'filename': 'tfidf_feature_matrix.npz'}],
'dataset_path': 'van_de_Schoot_2017.csv'}
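The timestamps in the configuration are stored as a Unix timestamp and as plain strings. As a small sketch of how they could be turned into datetime objects, the snippet below copies the values from the output above into local variables (the variable names are only illustrative) and computes how long the simulation took.

```python
from datetime import datetime, timezone

# Values copied from the project.config output above.
created_at_unix = 1683798551
review = {
    "start_time": "2023-05-11 11:49:12.323797",
    "end_time": "2023-05-11 11:49:32.450593",
}

# A Unix timestamp is timezone independent; render it in UTC.
created_utc = datetime.fromtimestamp(created_at_unix, tz=timezone.utc)

# The review timestamps are ISO-like strings without timezone information.
start = datetime.fromisoformat(review["start_time"])
end = datetime.fromisoformat(review["end_time"])
duration_seconds = (end - start).total_seconds()

print(created_utc.isoformat())
print(f"Review duration: {duration_seconds:.1f} s")
```

With a loaded project, the same values would come from project.config["created_at_unix"] and project.config["reviews"][0].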
The imported dataset is located at tmp_data/{project_id}/data/{dataset_filename} and can be inspected using the following code:
[4]:
dataset_fp = Path(
    project_path, project.config["id"], "data", project.config["dataset_path"]
)
dataset = ASReviewData.from_file(dataset_fp)
print(f"The dataset contains {len(dataset)} records.")
dataset.to_dataframe().head()
The dataset contains 6189 records.
[4]:
record_id | title | abstract | keywords | authors | year | date | doi | label_included | label_abstract_screening | duplicate_record_id |
---|---|---|---|---|---|---|---|---|---|---|
0 | Manual for ASEBA School-Age Forms & Profiles | | | Achenbach, T. M., Rescorla, L. A. | 2001.0 | 2001 | NaN | 0 | 0 | NaN |
1 | Queensland Trauma Registry: A summary of paedi... | | | Dallow, N., Lang, J., Bellamy, N. | 2007.0 | 2007 | NaN | 0 | 0 | NaN |
2 | Posttraumatic Stress Disorder: Scientific and ... | This comprehensive overview of research and cl... | | Ford, J. D., Grasso, D. J., Elhai, J. D., Cour... | 2015.0 | NaN | NaN | 0 | 0 | NaN |
3 | SOCIAL CLASS AND MENTAL ILLNESS | | | Hollingshead, A. B., Redlich, F. C. | 1958.0 | NaN | NaN | 0 | 0 | NaN |
4 | Computerised test generation for cross-nationa... | “‘Computerised Test Generation for Cross-Natio... | | Irvine, S. H. | 2014.0 | NaN | NaN | 0 | 0 | NaN |
To obtain the content of the feature matrix, for example, the first row of the matrix, use the following code (note the matrix is in a sparse matrix format):
[5]:
feature_extraction_id = project.feature_matrices[0]["id"]
feature_matrix = project.get_feature_matrix(feature_extraction_id)
print(feature_matrix[0])
(0, 20452) 0.35937211648312967
(0, 18297) 0.26158369118434677
(0, 13842) 0.3248271421716685
(0, 9739) 0.38355660008860293
(0, 3231) 0.7059309068495663
(0, 2384) 0.22684547910949254
State API
The data stored during the review process can be accessed as a pandas DataFrame using the following code:
[6]:
with open_state("example.asreview") as state:
    df = state.get_dataset()
    print(f"The state contains {len(df)} records.")
The state contains 561 records.
The returned state instance is of type SQLiteState. Note that the state contains fewer records than the original dataset. This is because, by default, the simulation stops after finding all relevant records.
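The difference can be expressed as the share of the dataset that was screened before the simulation stopped. A quick calculation, using the two counts printed above (the variable names are only illustrative):

```python
# Counts copied from the outputs above.
n_records_in_dataset = 6189
n_records_in_state = 561

fraction_screened = n_records_in_state / n_records_in_dataset
print(f"Screened {fraction_screened:.1%} of the dataset.")
```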
[7]:
df.to_csv(project_path / "example_state.csv", index=False)
df.head()
[7]:
 | record_id | label | classifier | query_strategy | balance_strategy | feature_extraction | training_set | labeling_time | notes |
---|---|---|---|---|---|---|---|---|---|
0 | 4435 | 1 | None | prior | None | None | -1 | 2023-05-11 11:49:16.186034 | None |
1 | 5560 | 0 | None | prior | None | None | -1 | 2023-05-11 11:49:16.186034 | None |
2 | 4434 | 1 | nb | max | double | tfidf | 2 | 2023-05-11 11:49:16.420695 | None |
3 | 3668 | 0 | nb | max | double | tfidf | 3 | 2023-05-11 11:49:16.444989 | None |
4 | 3142 | 0 | nb | max | double | tfidf | 4 | 2023-05-11 11:49:16.505603 | None |
You can merge the information from the state file with the original dataset.
[8]:
df["labeling_order"] = df.index
dataset_with_results = dataset.df.join(df.set_index("record_id"))
dataset_with_results.to_csv(project_path / "data_and_state_merged.csv", index=False)
dataset_with_results
[8]:
record_id | title | abstract | keywords | authors | year | date | doi | label_included | label_abstract_screening | duplicate_record_id | label | classifier | query_strategy | balance_strategy | feature_extraction | training_set | labeling_time | notes | labeling_order |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Manual for ASEBA School-Age Forms & Profiles | | | Achenbach, T. M., Rescorla, L. A. | 2001.0 | 2001 | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Queensland Trauma Registry: A summary of paedi... | | | Dallow, N., Lang, J., Bellamy, N. | 2007.0 | 2007 | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Posttraumatic Stress Disorder: Scientific and ... | This comprehensive overview of research and cl... | | Ford, J. D., Grasso, D. J., Elhai, J. D., Cour... | 2015.0 | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | SOCIAL CLASS AND MENTAL ILLNESS | | | Hollingshead, A. B., Redlich, F. C. | 1958.0 | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Computerised test generation for cross-nationa... | “‘Computerised Test Generation for Cross-Natio... | | Irvine, S. H. | 2014.0 | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6184 | Biological and clinical framework for posttrau... | Three decades of posttraumatic stress disorder... | | Vermetten, E., Lanius, R. A. | 2012.0 | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6185 | Dividing traffic sub-areas based on a parallel... | In order to alleviate the traffic congestion a... | GPS trajectories, K-means, MapReduce, Traffic ... | Wang, B., Tao, L., Gao, C., Xia, D., Rong, Z.,... | 2014.0 | NaN | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6186 | Quantifying resilience to enhance individualiz... | Resilience is the human ability to adapt in th... | Adaptation, Autonomic Nervous System, Resilien... | Winslow, B., Carroll, M., Jones, D., Hannigan,... | 2013.0 | NaN | NaN | 0 | 0 | NaN | 0.0 | nb | max | double | tfidf | 535.0 | 2023-05-11 11:49:31.593247 | None | 535.0 |
6187 | A discriminant analysis of variables related t... | | | Frye, James S. | 1981.0 | NaN | NaN | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6188 | Developmental trajectories of pain/disability ... | | | Sterling, M., Hendrikz, J., Kenardy, J. | 2010.0 | NaN | NaN | 0 | 1 | NaN | 0.0 | nb | max | double | tfidf | 333.0 | 2023-05-11 11:49:25.857883 | None | 333.0 |
6189 rows × 19 columns
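To make the behavior of the join easier to follow, here is a minimal sketch on made-up data. The frames toy_dataset and toy_state are hypothetical stand-ins for dataset.df and the state DataFrame; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Toy stand-ins for the dataset (indexed by record_id) and the state table.
toy_dataset = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([0, 1, 2], name="record_id"),
)
toy_state = pd.DataFrame({"record_id": [2, 0], "label": [1, 0]})
toy_state["labeling_order"] = toy_state.index

# Left join on the index: every dataset row is kept, in dataset order.
merged = toy_dataset.join(toy_state.set_index("record_id"))

# Rows that were never labeled have NaN in the state columns, so dropping
# them and sorting by labeling_order recovers the screening sequence.
screened_only = merged.dropna(subset=["label"]).sort_values("labeling_order")
print(screened_only)
```

The NaN values also explain why label, training_set, and labeling_order appear as floats in the merged table: a column that contains NaN cannot keep an integer dtype by default.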
There are also multiple functions to obtain one specific variable in the data. For example, to plot the labeling times in a graph, use the following code:
[9]:
with open_state("example.asreview") as state:
    labeling_times = state.get_labeling_times()
pd.to_datetime(labeling_times).plot(title="Time of labeling")
[9]:
<Axes: title={'center': 'Time of labeling'}>
By default, the records that are part of the prior knowledge are included in the results. To obtain the labels without the prior records, pass priors=False:
[10]:
with open_state("example.asreview") as state:
    labels = state.get_labels(priors=False)
labels
[10]:
0 1
1 0
2 0
3 0
4 0
..
554 0
555 0
556 0
557 0
558 1
Name: label, Length: 559, dtype: int64
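Since the labels are returned in labeling order, a running count of relevant records can be derived from them directly. The sketch below uses made-up labels and a hypothetical helper named cumulative_relevant; it relies only on the standard library.

```python
from itertools import accumulate


def cumulative_relevant(labels):
    """Running count of relevant (label 1) records in labeling order."""
    return list(accumulate(labels))


# Made-up labels in labeling order; with a real project, something like
# labels.tolist() from the cell above could be passed instead.
toy_labels = [1, 0, 0, 1, 0]
found = cumulative_relevant(toy_labels)
print(found)
```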
To obtain the data corresponding to a specific record identifier, use the following code:
[11]:
with open_state("example.asreview") as state:
    record_data = state.get_data_by_record_id(5176)
record_data
[11]:
 | record_id | label | classifier | query_strategy | balance_strategy | feature_extraction | training_set | labeling_time | notes |
---|---|---|---|---|---|---|---|---|---|
0 | 5176 | 0 | nb | max | double | tfidf | 29 | 2023-05-11 11:49:17.247842 | None |
To obtain all settings used for the project, run the following code:
[12]:
with open_state("example.asreview") as state:
    settings = state.settings_metadata
settings
[12]:
{'settings': {'model': 'nb',
'query_strategy': 'max',
'balance_strategy': 'double',
'feature_extraction': 'tfidf',
'n_instances': 1,
'stop_if': 'min',
'n_prior_included': 1,
'n_prior_excluded': 1,
'model_param': {'alpha': 3.822},
'query_param': {},
'feature_param': {'ngram_max': 1,
'stop_words': 'english',
'split_ta': 0,
'use_keywords': 0},
'balance_param': {'a': 2.155, 'alpha': 0.94, 'b': 0.789, 'beta': 1.0}},
'state_version': '1',
'software_version': '1.2+6.g41c4257.dirty',
'model_has_trained': True}
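The settings are returned as a nested dictionary, so individual values are reached by chaining keys. For logging or comparing runs, it can be convenient to flatten the structure first. The helper flatten below is a hypothetical sketch, not part of ASReview, and the dictionary is a subset of the output above.

```python
def flatten(mapping, prefix=""):
    """Flatten nested dictionaries into a single dict with dotted keys."""
    flat = {}
    for key, value in mapping.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty dictionaries.
            flat.update(flatten(value, dotted))
        else:
            # Keep scalars and empty dictionaries as they are.
            flat[dotted] = value
    return flat


# A subset of the settings_metadata output above.
settings = {
    "settings": {
        "model": "nb",
        "query_strategy": "max",
        "model_param": {"alpha": 3.822},
        "query_param": {},
    },
    "state_version": "1",
}

flat_settings = flatten(settings)
print(flat_settings["settings.model_param.alpha"])
```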
The state also contains the ranking and the relevance score (if the model uses relevance scores) of the last iteration of the machine learning model. To get these, use the following code:
[13]:
with open_state("example.asreview") as state:
    last_ranking = state.get_last_ranking()
    last_probabilities = state.get_last_probabilities()
print("RANKING:")
print(last_ranking[["record_id", "ranking"]])
print("RELEVANCE SCORES:")
print(last_probabilities)
RANKING:
record_id ranking
0 2445 0
1 2446 1
2 2444 2
3 720 3
4 719 4
... ... ...
6184 1766 6184
6185 63 6185
6186 4427 6186
6187 2851 6187
6188 4888 6188
[6189 rows x 2 columns]
RELEVANCE SCORES:
0 0.637417
1 0.671088
2 0.707728
3 0.777025
4 0.672183
...
6184 0.792209
6185 0.697030
6186 0.828880
6187 0.768638
6188 0.844882
Name: proba, Length: 6189, dtype: float64
Cleanup
The following code removes the temporary folder that was created:
[14]:
shutil.rmtree(project_path)
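As an alternative to creating and removing the folder by hand, the standard library's tempfile.TemporaryDirectory can manage the lifetime of the directory. This is a sketch of that pattern rather than the approach used in this notebook; note that project_path.mkdir() raises FileExistsError when tmp_data already exists, which a temporary directory avoids.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    project_path = Path(tmp_dir)
    # Work with the extracted project here, for example by passing
    # project_path as the second argument of ASReviewProject.load.
    existed_inside = project_path.exists()

# The directory and everything in it are removed once the block exits.
exists_after = project_path.exists()
print(existed_inside, exists_after)
```

Any files that should outlive the analysis, such as exported CSV files, would need to be written outside the temporary directory.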