API Reference

Low level API

class asreview.review.BaseReview(as_data, model=None, query_model=None, balance_model=None, feature_model=None, n_papers=None, n_instances=1, n_queries=None, start_idx=[], state_file=None, log_file=None)[source]

Base class for Systematic Review.

Parameters:
  • as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
  • model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
  • query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
  • balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. They might either resample or undersample specific papers.
  • feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
  • n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
  • n_instances (int) – Number of papers to query at each step in the active learning process.
  • n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
  • start_idx (numpy.ndarray) – Start the simulation/review with these indices. They are assumed to be already labeled; failing to ensure this might result in bad behaviour.
  • state_file (str) – Path to state file. Replaces log_file argument.
classify(query_idx, inclusions, state, method=None)[source]

Classify new papers and update the training indices.

It automatically updates the state.

Parameters:
  • query_idx (list, numpy.ndarray) – Indices to classify.
  • inclusions (list, numpy.ndarray) – Labels of the query_idx.
  • state (BaseLogger) – Logger to store the classification in.
  • method (str) – If not set to None, all inclusions have this query method.
log_probabilities(state)[source]

Store the modeling probabilities of the training indices and pool indices.

n_pool()[source]

Number of indices left in the pool.

Returns:int – Number of indices left in the pool.
query(n_instances, query_model=None)[source]

Query records from pool.

Parameters:
  • n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
  • query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns:

numpy.ndarray – Indices of records queried.

review(*args, **kwargs)[source]

Do the systematic review, writing the results to the state file.

Parameters:
  • stop_after_class (bool) – When to stop: if True, stop after the classification step; otherwise stop after the training step.
  • instant_save (bool) – If True, save results after each single classification.
settings

Get an ASReview settings object

statistics()[source]

Get statistics on the current state of the review.

Returns:dict – A dictionary with statistics like n_included and last_inclusion.
train()[source]

Train the model.
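
For a custom review loop, the low-level methods above can be combined by hand. A minimal sketch of one active-learning iteration, assuming reviewer is an already initialized BaseReview subclass, state is an open state object, and the oracle label is illustrative:

    # One active-learning iteration with the low-level API.
    query_idx = reviewer.query(n_instances=1)        # select record(s) to label
    inclusions = [1]                                 # label(s) from the oracle (illustrative)
    reviewer.classify(query_idx, inclusions, state)  # store the decision, update training indices
    reviewer.train()                                 # refit the model on the new labels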

class asreview.ReviewSimulate(as_data, *args, n_prior_included=0, n_prior_excluded=0, prior_idx=None, init_seed=None, **kwargs)[source]

ASReview Simulation mode class.

Parameters:
  • as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
  • model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
  • query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
  • balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. They might either resample or undersample specific papers.
  • feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
  • n_prior_included (int) – Sample n prior included papers.
  • n_prior_excluded (int) – Sample n prior excluded papers.
  • prior_idx (list, numpy.ndarray) – Prior indices by row number.
  • n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
  • n_instances (int) – Number of papers to query at each step in the active learning process.
  • n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
  • start_idx (numpy.ndarray) – Start the simulation/review with these indices. They are assumed to be already labeled; failing to ensure this might result in bad behaviour.
  • init_seed (int) – Seed for setting the prior indices if the --prior_idx option is not used. If the option prior_idx is used with one or more indices, this option is ignored.
  • state_file (str) – Path to state file. Replaces log_file argument.
classify(query_idx, inclusions, state, method=None)

Classify new papers and update the training indices.

It automatically updates the state.

Parameters:
  • query_idx (list, numpy.ndarray) – Indices to classify.
  • inclusions (list, numpy.ndarray) – Labels of the query_idx.
  • state (BaseLogger) – Logger to store the classification in.
  • method (str) – If not set to None, all inclusions have this query method.
log_probabilities(state)

Store the modeling probabilities of the training indices and pool indices.

n_pool()

Number of indices left in the pool.

Returns:int – Number of indices left in the pool.
query(n_instances, query_model=None)

Query records from pool.

Parameters:
  • n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
  • query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns:

numpy.ndarray – Indices of records queried.

review(*args, **kwargs)

Do the systematic review, writing the results to the state file.

Parameters:
  • stop_after_class (bool) – When to stop: if True, stop after the classification step; otherwise stop after the training step.
  • instant_save (bool) – If True, save results after each single classification.
settings

Get an ASReview settings object

statistics()

Get statistics on the current state of the review.

Returns:dict – A dictionary with statistics like n_included and last_inclusion.
train()

Train the model.
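
A minimal simulation sketch with this class; the dataset path and model choices below are illustrative, not defaults:

    import asreview
    from asreview.models.classifiers import NaiveBayesClassifier
    from asreview.models.query import MaxQuery
    from asreview.models.balance import DoubleBalance
    from asreview.models.feature_extraction import Tfidf

    # Load a fully labeled dataset (path is illustrative).
    as_data = asreview.ASReviewData.from_file("labeled_dataset.csv")

    # Assemble the simulation from explicit model components.
    reviewer = asreview.ReviewSimulate(
        as_data,
        model=NaiveBayesClassifier(),
        query_model=MaxQuery(),
        balance_model=DoubleBalance(),
        feature_model=Tfidf(),
        n_prior_included=1,
        n_prior_excluded=1,
        state_file="simulation.h5",
    )

    # Run the review; results are written to the state file.
    reviewer.review()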

Classifiers

class asreview.models.classifiers.NaiveBayesClassifier(alpha=3.822)[source]

Naive Bayes classifier

Naive Bayes classifier. Only works in combination with the asreview.models.feature_extraction.Tfidf feature extraction model. Though relatively simplistic, it seems to work quite well on a wide range of datasets.

The naive Bayes classifier is an implementation based on the sklearn multinomial naive Bayes classifier.

Parameters:alpha (float, default=3.822) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (numpy.ndarray) – Feature matrix to fit the model on.
  • y (numpy.ndarray) – Labels to fit the model on (0 for excluded, 1 for included).
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:dict, dict – The hyperparameter space, and the parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dictionary.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (numpy.ndarray) – Feature matrix to predict.
Returns:numpy.ndarray – Array with the probabilities for each class: two columns (class 0 and class 1) and one row per sample.
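
A short sketch of fitting this classifier on TF-IDF features; the example texts and labels are illustrative:

    import numpy as np
    from asreview.models.classifiers import NaiveBayesClassifier
    from asreview.models.feature_extraction import Tfidf

    texts = np.array([
        "active learning for systematic reviews",
        "an unrelated record about geology",
    ])
    y = np.array([1, 0])  # 1 = included, 0 = excluded

    X = Tfidf().fit_transform(texts)         # sparse TF-IDF feature matrix
    clf = NaiveBayesClassifier(alpha=3.822)
    clf.fit(X, y)
    print(clf.predict_proba(X))              # one row per sample, columns for class 0 and 1
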
class asreview.models.classifiers.RandomForestClassifier(n_estimators=100, max_features=10, class_weight=1.0, random_state=None)[source]

Random Forest classifier

The Random Forest classifier is an implementation based on the sklearn Random Forest classifier.

Parameters:
  • n_estimators (int, default=100) – The number of trees in the forest.
  • max_features (int, default=10) – Number of features in the model.
  • class_weight (float, default=1.0) – Class weight of the inclusions.
  • random_state (int or RandomState, default=None) – Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (numpy.ndarray) – Feature matrix to fit the model on.
  • y (numpy.ndarray) – Labels to fit the model on (0 for excluded, 1 for included).
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:dict, dict – The hyperparameter space, and the parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dictionary.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (numpy.ndarray) – Feature matrix to predict.
Returns:numpy.ndarray – Array with the probabilities for each class: two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.SVMClassifier(gamma='auto', class_weight=0.249, C=15.4, kernel='linear', random_state=None)[source]

Support Vector Machine classifier

The Support Vector Machine classifier is an implementation based on the sklearn Support Vector Machine classifier.

Parameters:
  • gamma (str) – Gamma parameter of the SVM model.
  • class_weight (float) – class_weight of the inclusions.
  • C (float) – C parameter of the SVM model.
  • kernel (str) – SVM kernel type.
  • random_state (int, RandomState) – State of the RNG.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (numpy.ndarray) – Feature matrix to fit the model on.
  • y (numpy.ndarray) – Labels to fit the model on (0 for excluded, 1 for included).
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:dict, dict – The hyperparameter space, and the parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dictionary.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (numpy.ndarray) – Feature matrix to predict.
Returns:numpy.ndarray – Array with the probabilities for each class: two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.LogisticClassifier(C=1.0, class_weight=1.0, random_state=None, n_jobs=1)[source]

Logistic regression classifier

The logistic regression classifier is an implementation based on the sklearn logistic regression classifier.

Parameters:
  • C (float) – Parameter inverse to the regularization strength of the model.
  • class_weight (float) – Class weight of the inclusions.
  • random_state (int, RandomState) – Random state for the model.
  • n_jobs (int) – Number of CPU cores used.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (numpy.ndarray) – Feature matrix to fit the model on.
  • y (numpy.ndarray) – Labels to fit the model on (0 for excluded, 1 for included).
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:dict, dict – The hyperparameter space, and the parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dictionary.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (numpy.ndarray) – Feature matrix to predict.
Returns:numpy.ndarray – Array with the probabilities for each class: two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.LSTMBaseClassifier(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, learn_rate=1.0, dense_width=128, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)[source]

LSTM base classifier.

LSTM model that consists of an embedding layer, LSTM layer with one output, dense layer, and a single sigmoid output node. Use the asreview.models.feature_extraction.EmbeddingLSTM feature extraction method. Currently not so well optimized and slow.

Note

This model requires tensorflow to be installed. Use pip install tensorflow or install all optional ASReview dependencies with pip install asreview[all]

Parameters:
  • embedding_matrix (numpy.ndarray) – Embedding matrix to use with LSTM model.
  • backwards (bool) – Whether to have a forward or backward LSTM.
  • dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
  • optimizer (str) – Optimizer to use.
  • lstm_out_width (int) – Output width of the LSTM.
  • learn_rate (float) – Learning rate multiplier of the default learning rate.
  • dense_width (int) – Size of the dense layer of the model.
  • verbose (int) – Verbosity.
  • batch_size (int) – Batch size for the LSTM model.
  • epochs (int) – Number of epochs to train the LSTM model.
  • shuffle (bool) – Whether to shuffle the data before starting to train.
  • class_weight (float) – Class weight for the included papers.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)[source]

Fit the model to the data.

Parameters:
  • X (numpy.ndarray) – Feature matrix to fit the model on.
  • y (numpy.ndarray) – Labels to fit the model on (0 for excluded, 1 for included).
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:dict, dict – The hyperparameter space, and the parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dictionary.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (numpy.ndarray) – Feature matrix to predict.
Returns:numpy.ndarray – Array with the probabilities for each class: two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.LSTMPoolClassifier(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, lstm_pool_size=128, learn_rate=1.0, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)[source]

LSTM pool classifier.

LSTM model that consists of an embedding layer, LSTM layer with many outputs, max pooling layer, and a single sigmoid output node. Use the asreview.models.feature_extraction.EmbeddingLSTM feature extraction method. Currently not so well optimized and slow.

Note

This model requires tensorflow to be installed. Use pip install tensorflow or install all optional ASReview dependencies with pip install asreview[all]

Parameters:
  • embedding_matrix (numpy.ndarray) – Embedding matrix to use with LSTM model.
  • backwards (bool) – Whether to have a forward or backward LSTM.
  • dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
  • optimizer (str) – Optimizer to use.
  • lstm_out_width (int) – Output width of the LSTM.
  • lstm_pool_size (int) – Size of the pool, must be a divisor of max_sequence_length.
  • learn_rate (float) – Learning rate multiplier of the default learning rate.
  • verbose (int) – Verbosity.
  • batch_size (int) – Batch size for the LSTM model.
  • epochs (int) – Number of epochs to train the LSTM model.
  • shuffle (bool) – Whether to shuffle the data before starting to train.
  • class_weight (float) – Class weight for the included papers.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)[source]

Fit the model to the data.

Parameters:
  • X (numpy.ndarray) – Feature matrix to fit the model on.
  • y (numpy.ndarray) – Labels to fit the model on (0 for excluded, 1 for included).
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:dict, dict – The hyperparameter space, and the parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dictionary.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (numpy.ndarray) – Feature matrix to predict.
Returns:numpy.ndarray – Array with the probabilities for each class: two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.NN2LayerClassifier(dense_width=128, optimizer='rmsprop', learn_rate=1.0, regularization=0.01, verbose=0, epochs=35, batch_size=32, shuffle=False, class_weight=30.0)[source]

Dense neural network classifier.

Neural network with two hidden, dense layers of the same size.

Recommended feature extraction model is asreview.models.feature_extraction.Doc2Vec.

Note

This model requires tensorflow to be installed. Use pip install tensorflow or install all optional ASReview dependencies with pip install asreview[all]

Warning

Might crash on some systems with limited memory in combination with asreview.models.feature_extraction.Tfidf.

Parameters:
  • dense_width (int) – Size of the dense layers.
  • optimizer (str) – Name of the Keras optimizer.
  • learn_rate (float) – Learning rate multiplier of the default learning rate.
  • regularization (float) – Strength of the regularization on the weights and biases.
  • verbose (int) – Verbosity of the model mirroring the values for Keras.
  • epochs (int) – Number of epochs to train the neural network.
  • batch_size (int) – Batch size used for the neural network.
  • shuffle (bool) – Whether to shuffle the training data prior to training.
  • class_weight (float) – Class weights for inclusions (1’s).
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)[source]

Fit the model to the data.

Parameters:
  • X (numpy.ndarray) – Feature matrix to fit the model on.
  • y (numpy.ndarray) – Labels to fit the model on (0 for excluded, 1 for included).
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:dict, dict – The hyperparameter space, and the parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dictionary.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)[source]

Get the inclusion probability for each sample.

Parameters:X (numpy.ndarray) – Feature matrix to predict.
Returns:numpy.ndarray – Array with the probabilities for each class: two columns (class 0 and class 1) and one row per sample.
asreview.models.classifiers.list_classifiers()[source]

List available classifiers.

Returns:list – Names of available classifiers in alphabetical order.
asreview.models.classifiers.get_classifier(name, *args, random_state=None, **kwargs)[source]

Get an instance of a model from a string.

Parameters:
  • name (str) – Name of the model.
  • *args – Arguments for the model.
  • **kwargs – Keyword arguments for the model.
Returns:

BaseModel – Initialized instance of the classifier.

asreview.models.classifiers.get_classifier_class(name)[source]

Get class of model from string.

Parameters:name (str) – Name of the model, e.g. ‘svm’, ‘nb’ or ‘lstm-pool’.
Returns:BaseModel – Class corresponding to the name.
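
A quick sketch of the helper functions above; the printed output is indicative only:

    from asreview.models.classifiers import list_classifiers, get_classifier

    print(list_classifiers())      # names such as 'logistic', 'nb', 'rf', 'svm'

    # Instantiate a classifier by name; extra kwargs go to the model.
    nb = get_classifier("nb", alpha=3.822)
    print(nb.param)                # {'alpha': 3.822}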

Query

class asreview.models.query.MaxQuery[source]

Maximum sampling query strategy.

Choose the most likely samples to be included according to the model.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which use class probabilities.

class asreview.models.query.MixedQuery(strategy_1='max', strategy_2='random', mix_ratio=0.95, random_state=None, **kwargs)[source]

Class for mixed query strategy.

Use two different query strategies at the same time, with a ratio of one to the other. For example, mixing max and random sampling with a mix ratio of 0.95 means that at each query 95% of the instances are sampled with the max query strategy, and the remaining 5% with the random query strategy. The resulting strategy is called the max_random query strategy. Every combination of primitive query strategies is possible.

Parameters:
  • strategy_1 (str) – Name of the first query strategy.
  • strategy_2 (str) – Name of the second query strategy.
  • mix_ratio (float) – Portion of queries done by the first strategy. So a mix_ratio of 0.95 means that 95% of the time query strategy 1 is used and 5% of the time query strategy 2.
  • **kwargs (dict) – Keyword arguments for the two strategies. To specify which strategy an argument is for, prepend it with the name of the query strategy and an underscore, e.g. ‘max’ for maximal sampling.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
name

Name of the mixed query strategy: the two strategy names joined by an underscore (e.g. max_random).

param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})[source]

Query new instances.

Parameters:
  • X (numpy.ndarray) – Feature matrix to choose samples from.
  • classifier (SKLearnModel) – Trained classifier to compute probabilities if they are necessary.
  • pool_idx (numpy.ndarray) – Indices of samples that are still in the pool.
  • n_instances (int) – Number of instances to query.
  • shared (dict) – Dictionary for exchange between query strategies and others. It is mainly used to store the current class probabilities, and the source of the queries; which query strategy has produced which index.
class asreview.models.query.UncertaintyQuery[source]

Maximum uncertainty query strategy.

Choose the most uncertain samples according to the model (i.e. closest to 0.5 probability). Doesn’t work very well in the case of LSTMs, since their probabilities are rather arbitrary.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which use class probabilities.

class asreview.models.query.RandomQuery(random_state=None)[source]

Random sampling query strategy.

Randomly select samples with no regard to model assigned probabilities.

Warning

Selecting this option means your review is not going to be accelerated by ASReview.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which do not use class probabilities.

class asreview.models.query.ClusterQuery(cluster_size=350, update_interval=200, random_state=None)[source]

Query strategy using clustering algorithms.

Use clustering after feature extraction on the dataset. Then the highest probabilities within random clusters are sampled.

Parameters:
  • cluster_size (int) – Size of the clusters to be made. If the size of the clusters is smaller than the size of the pool, fall back to max sampling.
  • update_interval (int) – Update the clustering every x instances.
  • random_state (int, RandomState) – State/seed of the RNG.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which use class probabilities.

asreview.models.query.list_query_strategies()[source]

List available query strategies.

This excludes all possible mixed query strategies.

Returns:list – Names of available query strategies in alphabetical order.
asreview.models.query.get_query_model(name, *args, random_state=None, **kwargs)[source]

Get an instance of the query strategy.

Parameters:
  • name (str) – Name of the query strategy.
  • *args – Arguments for the model.
  • **kwargs – Keyword arguments for the model.
Returns:

asreview.query.base.BaseQueryModel – Initialized instance of query strategy.

asreview.models.query.get_query_class(name)[source]

Get class of query strategy from its name.

Parameters:name (str) – Name of the query strategy, e.g. ‘max’, ‘uncertainty’, ‘random’. A special mixed query strategy is also possible. The mix is denoted by an underscore: ‘max_random’ or ‘max_uncertainty’.
Returns:class – Class corresponding to the name.
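
A sketch of building a query strategy by name, including a mixed strategy; the mix_ratio value is illustrative:

    from asreview.models.query import list_query_strategies, get_query_model

    print(list_query_strategies())   # e.g. 'cluster', 'max', 'random', 'uncertainty'

    # Mixed strategies are denoted with an underscore between the two names.
    query_model = get_query_model("max_random", mix_ratio=0.95)
    print(query_model.name)          # 'max_random'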

Balance

class asreview.models.balance.SimpleBalance[source]

No balancing.

Use all training data.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Function that does not resample the training set.

Parameters:
  • X (numpy.ndarray) – Complete matrix of all samples.
  • y (numpy.ndarray) – Classified results of all samples.
  • train_idx (numpy.ndarray) – Training indices, that is all papers that have been reviewed.
  • shared (dict) – Dictionary to share data between balancing models and other models.
Returns:

  • numpy.ndarray – Training samples.
  • numpy.ndarray – Classification of training samples.

class asreview.models.balance.DoubleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, random_state=None)[source]

Dynamic Resampling balance strategy.

Class to get the two-way rebalancing function and arguments. It supersamples the ones depending on the number of zeros and the total number of samples in the training data.

Parameters:
  • a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
  • alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
  • b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
  • beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples more strongly penalize zeros.
  • random_state (int, RandomState) – State/seed of the RNG.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Resample the training data.

Parameters:
  • X (numpy.ndarray) – Complete feature matrix.
  • y (numpy.ndarray) – Labels for all papers.
  • train_idx (numpy.ndarray) – Training indices, that is all papers that have been reviewed.
  • shared (dict) – Dictionary to share data between balancing models and other models.
Returns:

numpy.ndarray, numpy.ndarray – X_train, y_train: the resampled feature matrix and labels.

class asreview.models.balance.TripleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, c=0.835, gamma=2.0, shuffle=True, random_state=None)[source]

Triple balance strategy.

This divides the training data into three sets: included papers, excluded papers found with random sampling and papers found with max sampling. They are balanced according to formulas depending on the percentage of papers read in the dataset, the number of papers with random/max sampling etc. Works best for stochastic training algorithms. Reduces to both full sampling and undersampling with corresponding parameters.

Parameters:
  • a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
  • alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
  • b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
  • beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples more strongly penalize zeros.
  • c (float) – Value between zero and one that governs the weight of samples done with maximal sampling. Higher values mean higher weight.
  • gamma (float) – Governs the scaling of the weight of the max samples as a function of the % of papers read. Higher values mean stronger scaling.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Resample the training data.

Parameters:
  • X (numpy.ndarray) – Complete feature matrix.
  • y (numpy.ndarray) – Labels for all papers.
  • train_idx (numpy.ndarray) – Training indices, that is all papers that have been reviewed.
  • shared (dict) – Dictionary to share data between balancing models and other models.
Returns:

numpy.ndarray, numpy.ndarray – X_train, y_train: the resampled feature matrix and labels.

class asreview.models.balance.UndersampleBalance(ratio=1.0, random_state=None)[source]

Balancing class that undersamples the data with a given ratio.

This undersamples the data, leaving out excluded papers so that the included and excluded papers are in a given ratio (closer to one).

Parameters:
  • ratio (float) – Undersampling ratio of the zeros. For example, a ratio of 0.25 means that only a quarter of the zeros and all of the ones are sampled.
  • random_state (int, RandomState) – State/seed of the RNG.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Resample the training data.

Parameters:
  • X (numpy.ndarray) – Complete feature matrix.
  • y (numpy.ndarray) – Labels for all papers.
  • train_idx (numpy.ndarray) – Training indices, that is all papers that have been reviewed.
  • shared (dict) – Dictionary to share data between balancing models and other models.
Returns:

numpy.ndarray, numpy.ndarray – X_train, y_train: the resampled feature matrix and labels.

asreview.models.balance.list_balance_strategies()[source]

List available balancing strategies.

Returns:list – Names of available balance strategies in alphabetical order.
asreview.models.balance.get_balance_model(name, *args, random_state=None, **kwargs)[source]

Get an instance of a balance model from a string.

Parameters:
  • name (str) – Name of the balance model.
  • *args – Arguments for the balance model.
  • **kwargs – Keyword arguments for the balance model.
Returns:

BaseBalanceModel – Initialized instance of the balance model.

asreview.models.balance.get_balance_class(name)[source]

Get class of balance model from string.

Parameters:name (str) – Name of the model, e.g. ‘simple’, ‘double’ or ‘undersample’.
Returns:BaseBalanceModel – Class corresponding to the name.
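
A sketch of instantiating a balance strategy by name; the seed is illustrative:

    from asreview.models.balance import list_balance_strategies, get_balance_model

    print(list_balance_strategies())   # e.g. 'double', 'simple', 'triple', 'undersample'

    balance_model = get_balance_model("double", random_state=42)
    print(balance_model.param)         # current parameter values, e.g. a, alpha, b, beta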

Feature extraction

class asreview.models.feature_extraction.Tfidf(*args, ngram_max=1, stop_words='english', **kwargs)[source]

Class to apply TF-IDF to texts.

Use the standard TF-IDF (Term Frequency-Inverse Document Frequency) feature extraction from SKLearn. Gives a sparse matrix as output. Works well in combination with asreview.models.classifiers.NaiveBayesClassifier and other fast training models (given that the feature vectors are relatively wide).

Parameters:
  • ngram_max (int) – Use ngrams up to ngram_max. For example, with ngram_max=2 both unigrams and bigrams are used.
  • stop_words (str) – When set to ‘english’, use stopwords. If set to None or ‘none’, do not use stop words.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)[source]

Fit the model to the texts.

It is not always necessary to implement this if no real fitting is being done.

Parameters:texts (numpy.ndarray) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.Doc2Vec(*args, vector_size=40, epochs=33, min_count=1, n_jobs=1, window=7, dm_concat=0, dm=2, dbow_words=0, **kwargs)[source]

Base class for doc2vec feature extraction.

Feature extraction method provided by the gensim package. It takes relatively long to create a feature matrix with this method. However, this only has to be done once per simulation/review. The upside of this method is the dimension reduction that generally takes place, which makes the modelling quicker.

Note

This feature extraction algorithm requires gensim to be installed. Use pip install gensim or install all optional ASReview dependencies with pip install asreview[all]

Parameters:
  • vector_size (int) – Output size of the vector.
  • epochs (int) – Number of epochs to train the doc2vec model.
  • min_count (int) – Minimum number of occurrences for a word in the corpus for it to be included in the model.
  • n_jobs (int) – Number of threads to train the model with.
  • window (int) – Maximum distance over which word vectors influence each other.
  • dm_concat (int) – Whether to concatenate word vectors or not. See paper for more detail.
  • dm (int) – Model to use. 0: Use distributed bag of words (DBOW). 1: Use distributed memory (DM). 2: Use both of the above with half the vector size and concatenate them.
  • dbow_words (int) – Whether to train the word vectors using the skipgram method.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)[source]

Fit the model to the texts.

It is not always necessary to implement this if no real fitting is being done.

Parameters:texts (numpy.ndarray) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.EmbeddingIdf(*args, embedding_fp=None, random_state=None, **kwargs)[source]

Class for Embedding-Idf model.

This model averages the weighted word vectors of all the words in the text, in order to get a single feature vector for each text. The weights are provided by the inverse document frequencies.

Note

This feature extraction algorithm requires tensorflow to be installed. Use pip install tensorflow or install all optional ASReview dependencies with pip install asreview[all]

Parameters:embedding_fp (str) – Path to embedding.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)

Fit the model to the texts.

It is not always necessary to implement this if no real fitting is being done.

Parameters:texts (numpy.ndarray) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.EmbeddingLSTM(*args, loop_sequence=1, num_words=20000, max_sequence_length=1000, padding='post', truncating='post', n_jobs=1, **kwargs)[source]

Class to create embedding matrices for LSTM models.

Feature extraction method for asreview.models.classifiers.LSTMBaseClassifier and asreview.models.classifiers.LSTMPoolClassifier models.

Note

This feature extraction algorithm requires tensorflow to be installed. Use pip install tensorflow or install all optional ASReview dependencies with pip install asreview[all]

Parameters:
  • loop_sequence (bool) – Instead of padding the start/end of the sequence with zeros, loop it.
  • num_words (int) – Maximum number of unique words to be processed.
  • max_sequence_length (int) – Maximum length of the sequence. Longer sequences are truncated; shorter sequences are either padded with zeros or looped.
  • padding (str) – Which side should be padded [pre/post].
  • truncating (str) – Which side should be truncated [pre/post].
  • n_jobs (int) – Number of processors used in reading the embedding matrix.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)

Fit the model to the texts.

It is not always necessary to implement this if no real fitting is being done.

Parameters:texts (numpy.ndarray) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.SBERT(split_ta=0, use_keywords=0)[source]

Sentence BERT class for feature extraction.

Feature extraction method based on Sentence BERT. Implementation based on the sentence_transformers package. It is relatively slow.

Note

This feature extraction algorithm requires sentence_transformers to be installed. Use pip install sentence_transformers or install all optional ASReview dependencies with pip install asreview[all]

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)

Fit the model to the texts.

It is not always necessary to implement this if no real fitting is being done.

Parameters:texts (numpy.ndarray) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:numpy.ndarray – Feature matrix representing the texts.
asreview.models.feature_extraction.list_feature_extraction()[source]

List available feature extraction methods.

Returns:list – Names of available feature extraction methods in alphabetical order.
asreview.models.feature_extraction.get_feature_model(name, *args, random_state=None, **kwargs)[source]

Get an instance of a feature extraction model from a string.

Parameters:
  • name (str) – Name of the feature extraction model.
  • *args – Arguments for the feature extraction model.
  • **kwargs – Keyword arguments for the feature extraction model.
Returns:

BaseFeatureExtraction – Initialized instance of feature extraction algorithm.

asreview.models.feature_extraction.get_feature_class(name)[source]

Get class of feature extraction from string.

Parameters:name (str) – Name of the feature model, e.g. ‘doc2vec’, ‘tfidf’ or ‘embedding-lstm’.
Returns:BaseFeatureExtraction – Class corresponding to the name.
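
A sketch of selecting a feature extraction model by name and transforming a few texts; the texts are illustrative:

    import numpy as np
    from asreview.models.feature_extraction import (
        list_feature_extraction,
        get_feature_model,
    )

    print(list_feature_extraction())   # e.g. 'doc2vec', 'embedding-idf', 'tfidf'

    feature_model = get_feature_model("tfidf", ngram_max=2)
    X = feature_model.fit_transform(
        np.array(["active learning", "systematic review screening"])
    )
    print(X.shape)                     # one row per text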

Data

class asreview.ASReviewData(df=None, data_name='empty', data_type='standard', column_spec=None)[source]

Data object for the dataset with texts, labels, DOIs, etc.

Parameters:
  • df (pandas.DataFrame) – Dataframe containing the data for the ASReview data object.
  • data_name (str) – Give a name to the data object.
  • data_type (str) – What kind of data the dataframe contains.
  • column_spec (dict) – Specification of which column corresponds to which standard specification. The key is the standard specification; the value is the column it is actually in.
append(as_data)[source]

Append another ASReviewData object.

It puts the training data at the end.

Parameters:as_data (ASReviewData) – Dataset to append.
classmethod from_file(fp, read_fn=None, data_name=None, data_type=None)[source]

Create instance from csv/ris/excel file.

It works in two ways: either with manual control, where the conversion function is supplied, or automatically, where it searches the entry points for the right conversion function.

Parameters:
  • fp (str, pathlib.Path) – Read the data from this file.
  • read_fn (callable) – Function to read the file. It should return a standardized dataframe.
  • data_name (str) – Name of the data.
  • data_type (str) – What kind of data it is. Special names: ‘included’, ‘excluded’, ‘prior’.
get(name)[source]

Get column with name.

hash()[source]

Compute a hash from the dataset.

Returns:str – SHA1 hash, computed from the titles/abstracts of the dataframe.
prior_data_idx

Get prior_included, prior_excluded from dataset.

prior_labels(state, by_index=True)[source]

Get the labels that are marked as ‘initial’.

Parameters:
  • state (BaseState) – Open state that contains the label information.
  • by_index (bool) – If True, return internal indexing. If False, return record_ids for indexing.
Returns:numpy.ndarray – Array of indices that have the ‘initial’ property.
record(i, by_index=True)[source]

Create a record from an index.

Parameters:
  • i (int, iterable) – Index of the record, or list of indices.
  • by_index (bool) – If True, take the i-th value as used internally by the review. If False, take the record with record_id==i.
Returns:

PaperRecord – The corresponding record if i was an integer, or a list of records if i was an iterable.

slice(idx, by_index=True)[source]

Create a slice from itself.

Useful if some parts should be kept/thrown away.

Parameters:idx (list, numpy.ndarray) – Record ids that should be kept.
Returns:ASReviewData – Slice of itself.
to_csv(fp, labels=None, ranking=None)[source]

Export to csv.

Parameters:
  • fp (str, NoneType) – Filepath or None for buffer.
  • labels (list, numpy.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pandas.DataFrame – Dataframe of all available record data.

to_dataframe(labels=None, ranking=None)[source]

Create new dataframe with updated label (order).

Parameters:
  • labels (list, numpy.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these record_ids. Default ordering if ranking is None.
Returns:

pandas.DataFrame – Dataframe of all available record data.

to_excel(fp, labels=None, ranking=None)[source]

Export to Excel xlsx file.

Parameters:
  • fp (str, NoneType) – Filepath or None for buffer.
  • labels (list, numpy.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pandas.DataFrame – Dataframe of all available record data.

to_file(fp, labels=None, ranking=None)[source]

Export data object to file.

RIS, CSV and Excel are supported file formats at the moment.

Parameters:
  • fp (str) – Filepath to export to.
  • labels (list, numpy.ndarray) – Labels to be inserted into the dataframe before export.
  • ranking (list, numpy.ndarray) – Optionally, dataframe rows can be reordered.
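
A short sketch of common ASReviewData operations; the file paths are illustrative:

    import asreview

    # The reader is chosen automatically based on the file extension.
    as_data = asreview.ASReviewData.from_file("references.ris")

    print(as_data.hash())                  # SHA1 hash of titles/abstracts
    titles = as_data.get("title")          # access a column by name

    # Keep the first hundred records and export them.
    subset = as_data.slice(list(range(100)))
    subset.to_csv("subset.csv")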

Utils

asreview.load_embedding(fp, word_index=None, n_jobs=None)[source]

Load embedding matrix from file.

The embedding matrix needs to be stored in the FastText format.

Parameters:
  • fp (str) – File path of the trained embedding vectors.
  • word_index (dict) – Sample word embeddings.
  • n_jobs (int) – Number of processes to parse the embedding (+1 process for reading).
Returns:

dict – The embedding weights stored in a dict with the word as key and the weights as values.

asreview.sample_embedding(embedding, word_index)[source]

Sample an embedding matrix.

Parameters:
  • embedding (dict) – A dictionary with the words and embedding vectors.
  • word_index (dict) – A word_index like the output of Keras Tokenizer.word_index.
Returns:

(np.ndarray, list) – The embedding weights stored in a two-dimensional numpy array, and a list with the corresponding words.
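
A sketch of loading and sampling embeddings; the file path and word index are illustrative:

    from asreview import load_embedding, sample_embedding

    # Load FastText-formatted word vectors from disk.
    embedding = load_embedding("fasttext.vec", n_jobs=4)

    # Keep only the words present in a Keras-style word index.
    word_index = {"systematic": 1, "review": 2}
    matrix, words = sample_embedding(embedding, word_index)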

State

asreview.state.open_state(fp, *args, read_only=False, **kwargs)[source]

Open a state from a file.

Parameters:
  • fp (str) – File to open.
  • read_only (bool) – Whether to open the file in read_only mode.
Returns:

BaseState – Depending on the extension, the appropriate state is chosen:
  • [.h5, .hdf5, .he5] -> HDF5State
  • None -> DictState (doesn’t store anything permanently)
  • Anything else -> JSONState
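
A sketch of inspecting an existing state file read-only; the file name is illustrative:

    from asreview.state import open_state

    with open_state("simulation.h5", read_only=True) as state:
        print(state.n_queries())               # number of queries stored
        print(state.get("label_idx", query_i=0))  # indices labeled in the first query
        print(state.settings)                  # ASReview settings object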

class asreview.state.BaseState(state_fp, read_only=False)[source]
add_classification(idx, labels, methods, query_i)[source]

Add training indices and their labels.

Parameters:
  • idx (list, numpy.ndarray) – A list of indices used for training.
  • labels (list) – A list of labels corresponding with the training indices.
  • methods (list) – A list of the query methods used to select each index.
  • query_i (int) – The query number.
add_proba(pool_idx, train_idx, proba, query_i)[source]

Add inverse pool indices and their labels.

Parameters:
  • pool_idx (list, numpy.ndarray) – A list of indices of the unlabeled pool.
  • train_idx (list, numpy.ndarray) – A list of indices of the training data.
  • proba (numpy.ndarray) – Array of prediction probabilities for the unlabeled pool.
  • query_i (int) – The query number.
close()[source]

Close the files opened by the state.

Also sets the end time if not in read-only mode.

delete_last_query()[source]

Delete the last query from the state object.

get(variable, query_i=None, default=None, idx=None)[source]

Get data from the state object.

This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.

Parameters:
  • variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba, train_idx, pool_idx.
  • query_i (int) – Query number, should be between 0 and self.n_queries().
  • idx (int, numpy.ndarray, list) – Indices to get in the returned array.
get_current_queries()[source]

Get the current queries made by the model.

This is useful to get back exactly to the state it was in before shutting down a review.

Returns:dict – The last known queries according to the state file.
get_feature_matrix(data_hash)[source]

Get feature matrix out of the state.

Parameters:data_hash (str) – Hash of as_data object from which the matrix is derived.
Returns:numpy.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.
initialize_structure()[source]

Create an empty internal structure for the state.

is_empty()[source]

Check if state has no results.

Returns:bool – True if empty.
n_queries()[source]

Number of queries saved in the state.

Returns:int – Number of queries.
pred_proba

Get last predicted probabilities.

restore(fp)[source]

Restore or create state from a state file.

If the state file doesn’t exist, creates an empty state that is ready for storage.

Parameters:fp (str) – Path to file to restore/create.
save()[source]

Save state to file.

Parameters:fp (str) – The file path to export the results to.
set_current_queries(current_queries)[source]

Set the current queries made by the model.

Parameters:current_queries (dict) – The last known queries, with {query_idx: query_method}.
set_final_labels(y)[source]

Add/set final labels to state.

If final_labels does not exist yet, add it.

Parameters:y (numpy.ndarray) – One dimensional integer numpy array with final inclusion labels.
set_labels(y)[source]

Add/set labels to the state.

If the labels do not exist yet, add them to the state.

Parameters:y (numpy.ndarray) – One dimensional integer numpy array with inclusion labels.
settings

Get settings from state

startup_vals()[source]

Get variables for reviewer to continue review.

Returns:
  • numpy.ndarray – Current labels of dataset.
  • numpy.ndarray – Current training indices.
  • dict – Dictionary containing the sources of the labels.
  • query_i (int) – Current query number (starting from 0).
to_dict()[source]

Convert state to dictionary.

Returns:dict – Dictionary with all relevant variables.
class asreview.state.HDF5State(state_fp, read_only=False)[source]

Class for storing the review state with HDF5 storage.

add_classification(idx, labels, methods, query_i)[source]

Add training indices and their labels.

Parameters:
  • idx (list, numpy.ndarray) – A list of indices used for training.
  • labels (list) – A list of labels corresponding with the training indices.
  • methods (list) – A list of the query methods used to select each index.
  • query_i (int) – The query number.
add_proba(pool_idx, train_idx, proba, query_i)[source]

Add inverse pool indices and their labels.

Parameters:
  • pool_idx (list, numpy.ndarray) – A list of indices of the unlabeled pool.
  • train_idx (list, numpy.ndarray) – A list of indices of the training data.
  • proba (numpy.ndarray) – Array of prediction probabilities for the unlabeled pool.
  • query_i (int) – The query number.
close()[source]

Close the files opened by the state.

Also sets the end time if not in read-only mode.

delete_last_query()[source]

Delete the last query from the state object.

get(variable, query_i=None, idx=None)[source]

Get data from the state object.

This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.

Parameters:
  • variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba, train_idx, pool_idx.
  • query_i (int) – Query number, should be between 0 and self.n_queries().
  • idx (int, numpy.ndarray, list) – Indices to get in the returned array.
get_current_queries()[source]

Get the current queries made by the model.

This is useful to get back exactly to the state it was in before shutting down a review.

Returns:dict – The last known queries according to the state file.
get_feature_matrix(data_hash)[source]

Get feature matrix out of the state.

Parameters:data_hash (str) – Hash of as_data object from which the matrix is derived.
Returns:numpy.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.
initialize_structure()[source]

Create an empty internal structure for the state.

is_empty()

Check if state has no results.

Returns:bool – True if empty.
n_queries()[source]

Number of queries saved in the state.

Returns:int – Number of queries.
pred_proba

Get last predicted probabilities.

restore(fp)[source]

Restore or create state from a state file.

If the state file doesn’t exist, creates an empty state that is ready for storage.

Parameters:fp (str) – Path to file to restore/create.
save()[source]

Save state to file.

Parameters:fp (str) – The file path to export the results to.
set_current_queries(current_queries)[source]

Set the current queries made by the model.

Parameters:current_queries (dict) – The last known queries, with {query_idx: query_method}.
set_final_labels(y)[source]

Add/set final labels to state.

If final_labels does not exist yet, add it.

Parameters:y (numpy.ndarray) – One dimensional integer numpy array with final inclusion labels.
set_labels(y)[source]

Add/set labels to the state.

If the labels do not exist yet, add them to the state.

Parameters:y (numpy.ndarray) – One dimensional integer numpy array with inclusion labels.
settings

Get settings from state

startup_vals()

Get variables for reviewer to continue review.

Returns:
  • numpy.ndarray – Current labels of dataset.
  • numpy.ndarray – Current training indices.
  • dict – Dictionary containing the sources of the labels.
  • query_i (int) – Current query number (starting from 0).
to_dict()

Convert state to dictionary.

Returns:dict – Dictionary with all relevant variables.
class asreview.state.JSONState(state_fp, read_only=False)[source]

Class for storing the state of a Systematic Review using JSON files.

add_classification(idx, labels, methods, query_i)

Add training indices and their labels.

Parameters:
  • idx (list, numpy.ndarray) – A list of indices used for training.
  • labels (list) – A list of labels corresponding with the training indices.
  • methods (list) – A list of the query methods used to select each index.
  • query_i (int) – The query number.
add_proba(pool_idx, train_idx, proba, query_i)

Add inverse pool indices and their labels.

Parameters:
  • pool_idx (list, numpy.ndarray) – A list of indices of the unlabeled pool.
  • train_idx (list, numpy.ndarray) – A list of indices of the training data.
  • proba (numpy.ndarray) – Array of prediction probabilities for the unlabeled pool.
  • query_i (int) – The query number.
close()

Close the files opened by the state.

Also sets the end time if not in read-only mode.

delete_last_query()

Delete the last query from the state object.

get(variable, query_i=None, idx=None)

Get data from the state object.

This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.

Parameters:
  • variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba, train_idx, pool_idx.
  • query_i (int) – Query number, should be between 0 and self.n_queries().
  • idx (int, numpy.ndarray, list) – Indices to get in the returned array.
get_current_queries()

Get the current queries made by the model.

This is useful to get back exactly to the state it was in before shutting down a review.

Returns:dict – The last known queries according to the state file.
get_feature_matrix(data_hash)

Get feature matrix out of the state.

Parameters:data_hash (str) – Hash of as_data object from which the matrix is derived.
Returns:numpy.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.
initialize_structure()

Create an empty internal structure for the state.

is_empty()

Check if state has no results.

Returns:bool – True if empty.
n_queries()

Number of queries saved in the state.

Returns:int – Number of queries.
pred_proba

Get last predicted probabilities.

restore(fp)[source]

Restore or create state from a state file.

If the state file doesn’t exist, creates an empty state that is ready for storage.

Parameters:fp (str) – Path to file to restore/create.
save()[source]

Save state to file.

Parameters:fp (str) – The file path to export the results to.
set_current_queries(current_queries)

Set the current queries made by the model.

Parameters:current_queries (dict) – The last known queries, with {query_idx: query_method}.
set_final_labels(y)

Add/set final labels to the state.

If final_labels does not exist yet, add it.

Parameters:y (numpy.ndarray) – One-dimensional integer numpy array with final inclusion labels.
set_labels(y)

Add/set labels to the state.

If the labels do not exist yet, add them to the state.

Parameters:y (numpy.ndarray) – One-dimensional integer numpy array with inclusion labels.
settings

Get the settings from the state.

startup_vals()

Get the variables needed for the reviewer to continue the review.

Returns:
  • numpy.ndarray – Current labels of dataset.
  • numpy.ndarray – Current training indices.
  • dict – Dictionary containing the sources of the labels.
  • int – Current query number (starting from 0).
to_dict()

Convert state to dictionary.

Returns:dict – Dictionary with all relevant variables.
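Together, startup_vals and get_current_queries make it possible to pick a review back up exactly where it stopped. A sketch, assuming an existing state file and following the return values documented above:

    from asreview.state import JSONState

    state = JSONState("my_review.json", read_only=True)

    labels, train_idx, label_sources, query_i = state.startup_vals()
    pending = state.get_current_queries()  # {query_idx: query_method}

    print(f"Resuming at query {query_i}: {len(train_idx)} records labeled, "
          f"{len(pending)} queries pending")

    state.close()
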
class asreview.state.DictState(state_fp, *_, **__)[source]

Class for storing the state of a review with no permanent storage.

add_classification(idx, labels, methods, query_i)[source]

Add training indices and their labels.

Parameters:
  • idx (list, numpy.ndarray) – A list of indices used for training.
  • labels (list) – A list of labels corresponding with the training indices.
  • methods (list) – The query method used to obtain each label.
  • query_i (int) – The query number.
add_proba(pool_idx, train_idx, proba, query_i)[source]

Add the prediction probabilities for the pool and training indices.

Parameters:
  • pool_idx (list, numpy.ndarray) – Indices of the unlabeled pool.
  • train_idx (list, numpy.ndarray) – Indices used for training.
  • proba (numpy.ndarray) – Array of prediction probabilities.
  • query_i (int) – The query number.
close()[source]

Close the files opened by the state.

Also sets the end time if not in read-only mode.

delete_last_query()[source]

Delete the last query from the state object.

get(variable, query_i=None, idx=None)[source]

Get data from the state object.

This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole dataset if query_i=None, but this is not currently implemented in any of the States.

Parameters:
  • variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba, train_idx, pool_idx.
  • query_i (int) – Query number, should be between 0 and self.n_queries().
  • idx (int, numpy.ndarray, list) – Indices to get in the returned array.
get_current_queries()[source]

Get the current queries made by the model.

This is useful for restoring the exact state the review was in before it was shut down.

Returns:dict – The last known queries according to the state file.
get_feature_matrix(data_hash)[source]

Get feature matrix out of the state.

Parameters:data_hash (str) – Hash of as_data object from which the matrix is derived.
Returns:np.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.
initialize_structure()[source]

Create an empty internal structure for the state.

is_empty()[source]

Check if state has no results.

Returns:bool – True if empty.
n_queries()[source]

Number of queries saved in the state.

Returns:int – Number of queries.
pred_proba

Get last predicted probabilities.

restore(*_, **__)[source]

Restore or create state from a state file.

If the state file doesn’t exist, creates an empty state that is ready for storage.

Parameters:fp (str) – Path to file to restore/create.
save()[source]

Save the state to the state file given at initialization (state_fp), if any.
set_current_queries(current_queries)[source]

Set the current queries made by the model.

Parameters:current_queries (dict) – The last known queries, with {query_idx: query_method}.
set_final_labels(y)[source]

Add/set final labels to the state.

If final_labels does not exist yet, add it.

Parameters:y (numpy.ndarray) – One-dimensional integer numpy array with final inclusion labels.
set_labels(y)[source]

Add/set labels to the state.

If the labels do not exist yet, add them to the state.

Parameters:y (numpy.ndarray) – One-dimensional integer numpy array with inclusion labels.
settings

Get the settings from the state.

startup_vals()

Get the variables needed for the reviewer to continue the review.

Returns:
  • numpy.ndarray – Current labels of dataset.
  • numpy.ndarray – Current training indices.
  • dict – Dictionary containing the sources of the labels.
  • int – Current query number (starting from 0).
to_dict()

Convert state to dictionary.

Returns:dict – Dictionary with all relevant variables.
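Because DictState keeps everything in memory, it is convenient for quick experiments where no file should be left behind. A sketch:

    import numpy as np
    from asreview.state import DictState

    # The file path argument is accepted for interface compatibility,
    # but nothing is written to disk.
    state = DictState(None)

    state.set_labels(np.array([0, 1, 0]))
    print(state.n_queries())  # number of queries recorded so far

    # The dictionary snapshot is the natural way to inspect the state.
    snapshot = state.to_dict()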

Analysis

class asreview.analysis.Analysis(states, key=None)[source]

Analysis object to do statistical analysis on state files.

avg_time_to_discovery(result_format='number')[source]

Estimate the Time to Discovery (TD) for each paper.

Get the best/last estimate on how long it takes to find a paper.

Parameters:result_format (str) – Desired output format: “number”, “fraction” or “percentage”.
Returns:dict – For each inclusion, key=paper_id, value=avg time.
close()[source]

Close states.

classmethod from_dir(data_dir, prefix='', key=None)[source]

Create an Analysis object from a directory.

Parameters:
  • data_dir (str) – Directory to read the state files from.
  • prefix (str) – Only files starting with this prefix are treated as state files; all other files are ignored.
  • key (str) – Name for the analysis object.
classmethod from_file(data_fp, key=None)[source]

Create an Analysis object from a file.

Parameters:
  • data_fp (str) – Path to state file to analyse.
  • key (str) – Name for analysis object.
classmethod from_path(data_path, prefix='', key=None)[source]

Create an Analysis object from either a file or a directory.
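Any of the three classmethods can serve as an entry point; from_path dispatches to from_file or from_dir depending on the path. A sketch with hypothetical paths:

    from asreview.analysis import Analysis

    # All state files in a directory, restricted to a filename prefix.
    runs = Analysis.from_dir("output/", prefix="sim_", key="my_runs")

    # A single state file works the same way.
    run0 = Analysis.from_file("output/sim_0.json", key="run_0")

    # from_path accepts either a file or a directory.
    either = Analysis.from_path("output/")

    for analysis in (runs, run0, either):
        analysis.close()
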

inclusions_found(result_format='fraction', final_labels=False, **kwargs)[source]

Get the number of inclusions at each point in time.

Results are cached, so repeated calls are not expensive.

Parameters:
  • result_format (str) – Format of the returned values: “fraction”, “percentage” (%), or “number” (#).
  • final_labels (bool) – If true, use the final_labels instead of labels for analysis.
Returns:tuple – Three numpy arrays with x, y, error_bar.
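The three returned arrays map directly onto a recall curve. A sketch using matplotlib (not part of this API):

    import matplotlib.pyplot as plt
    from asreview.analysis import Analysis

    analysis = Analysis.from_dir("output/")
    x, y, err = analysis.inclusions_found(result_format="fraction")

    plt.errorbar(x, y, yerr=err)
    plt.xlabel("Fraction of papers reviewed")
    plt.ylabel("Fraction of inclusions found")
    plt.show()

    analysis.close()
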

limits(prob_allow_miss=[0.1], result_format='percentage')[source]

For each query, compute the number of papers to be read for a given criterion.

A criterion is the average number of papers missed. For example, with 0.1, the criterion is that after reading x papers, there is (about) a 10% chance that one paper is not included. With 2.0, on average 2 papers are missed after reading x papers. The value of x is returned for each query and probability by the function.

Parameters:prob_allow_miss (list, float) – Sets the criterion for how many papers may be missed.
Returns:dict – Dictionary with an “x_range” entry (the numbers of papers read) and a “limits” entry (the results for each probability at each number of papers read).
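For example, to find how many papers must be read before, on average, at most 0.1 or 2.0 papers are missed (a sketch):

    from asreview.analysis import Analysis

    analysis = Analysis.from_dir("output/")
    res = analysis.limits(prob_allow_miss=[0.1, 2.0])

    x_range = res["x_range"]  # numbers of papers read
    limits = res["limits"]    # one list of results per probability

    analysis.close()
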
rrf(val=10, x_format='percentage', **kwargs)[source]

Get the RRF (Relevant References Found).

Parameters:
  • val – Recall value at which to compute the RRF, between 0 and 100.
  • x_format – Format for the position of the RRF value in the graph.
Returns:tuple – Tuple consisting of the RRF value, x_positions, and y_positions of the RRF bar.

wss(val=100, x_format='percentage', **kwargs)[source]

Get the WSS (Work Saved over Sampling) value.

Parameters:
  • val – Recall value at which to compute the WSS, between 0 and 100.
  • x_format – Format for the position of the WSS value in the graph.
Returns:tuple – Tuple consisting of the WSS value, x_positions, and y_positions of the WSS bar.
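Both metrics are retrieved the same way; for instance, WSS@95 and RRF@10 (a sketch):

    from asreview.analysis import Analysis

    analysis = Analysis.from_dir("output/")

    # Work saved at 95% recall, and relevant references found after
    # screening the first 10% of records.
    wss95, wss_x, wss_y = analysis.wss(val=95)
    rrf10, rrf_x, rrf_y = analysis.rrf(val=10)

    print(f"WSS@95 = {wss95}, RRF@10 = {rrf10}")
    analysis.close()
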

Extensions

class asreview.entry_points.BaseEntryPoint[source]

Base class for defining entry points.

classmethod execute(argv)[source]

Perform the functionality of the entry point.

Parameters:argv (list) – Argument list, with the entry point and program removed. For example, if asreview plot X is executed, then argv == ['X'].
format(entry_name='?')[source]

Create a short formatted description of the entry point.

Parameters:entry_name (str) – Name of the entry point. For example, ‘plot’ in asreview plot X.
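A new subcommand is created by subclassing BaseEntryPoint and registering the class in the extension package's metadata. A minimal, hypothetical sketch; the class name, description attribute, and behaviour are illustrative, and execute is written here as an instance method:

    from asreview.entry_points import BaseEntryPoint

    class HelloEntryPoint(BaseEntryPoint):
        description = "Say hello from an ASReview extension."

        def execute(self, argv):
            # argv holds everything after `asreview hello`.
            name = argv[0] if argv else "world"
            print(f"Hello, {name}!")

If the package registers this class under the asreview.entry_points entry-point group (e.g. hello = mypackage:HelloEntryPoint), the command becomes available as asreview hello.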