API Reference

Low-level API

class asreview.review.BaseReview(as_data, model=None, query_model=None, balance_model=None, feature_model=None, n_papers=None, n_instances=1, n_queries=None, start_idx=[], state_file=None, log_file=None)[source]

Base class for Systematic Review.

Parameters:
  • as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
  • model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
  • query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
  • balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. It might either resample or undersample specific papers.
  • feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
  • n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
  • n_instances (int) – Number of papers to query at each step in the active learning process.
  • n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
  • start_idx (numpy.array) – Start the simulation/review with these indices. They are assumed to be already labeled. Passing unlabeled indices might result in bad behaviour.
  • state_file (str) – Path to state file. Replaces log_file argument.
classify(query_idx, inclusions, state, method=None)[source]

Classify new papers and update the training indices.

It automatically updates the state.

Parameters:
  • query_idx (list, np.array) – Indices to classify.
  • inclusions (list, np.array) – Labels of the query_idx.
  • state (BaseLogger) – Logger to store the classification in.
  • method (str) – If not set to None, all inclusions have this query method.
log_probabilities(state)[source]

Store the modeling probabilities of the training indices and pool indices.

n_pool()[source]

Number of indices left in the pool.

Returns:int – Number of indices left in the pool.
query(n_instances, query_model=None)[source]

Query records from pool.

Parameters:
  • n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
  • query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns:

np.array – Indices of records queried.

review(*args, **kwargs)[source]

Do the systematic review, writing the results to the state file.

Parameters:
  • stop_after_class (bool) – When to stop; if True stop after classification step, otherwise stop after training step.
  • instant_save (bool) – If True, save results after each single classification.
settings

Get an ASReview settings object

statistics()[source]

Get statistics on the current state of the review.

Returns:dict – A dictionary with statistics like n_included and last_inclusion.
train()[source]

Train the model.
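
The methods of this class fit together in a query, classify, train loop. The following is a minimal sketch of such a loop in plain Python; it does not call ASReview. The toy_score ranking function and the run_review helper are hypothetical stand-ins for the model and the reviewer, and the stopping rules only mirror the documented meaning of n_instances, n_queries and n_papers.

```python
def toy_score(text, included_texts):
    """Hypothetical 'model': count words shared with already-included texts."""
    words = set(text.split())
    return sum(len(words & set(t.split())) for t in included_texts)


def run_review(texts, labels, start_idx, n_instances=1, n_queries=None, n_papers=None):
    """Query -> classify -> train loop over a fully labeled toy dataset."""
    train_idx = list(start_idx)            # assumed to be labeled already
    pool_idx = [i for i in range(len(texts)) if i not in train_idx]
    n_reviewed = 0                         # excludes the initial priors
    n_done_queries = 0
    while pool_idx:
        if n_queries is not None and n_done_queries >= n_queries:
            break
        if n_papers is not None and n_reviewed >= n_papers:
            break
        # "train": collect the texts labeled as relevant so far
        included = [texts[i] for i in train_idx if labels[i] == 1]
        # "query": take the highest-scoring records from the pool
        ranked = sorted(pool_idx, key=lambda i: toy_score(texts[i], included), reverse=True)
        batch = ranked[:n_instances]
        if n_papers is not None:
            batch = batch[: n_papers - n_reviewed]
        # "classify": in a simulation the known labels act as the reviewer
        for i in batch:
            train_idx.append(i)
            pool_idx.remove(i)
        n_reviewed += len(batch)
        n_done_queries += 1
    return train_idx, pool_idx


texts = ["deep learning review", "cooking pasta", "learning to rank", "garden tools"]
labels = [1, 0, 1, 0]
train_idx, pool_idx = run_review(texts, labels, start_idx=[0, 1], n_instances=1, n_queries=1)
```

With n_queries=1 and n_instances=1 the loop labels exactly one extra record and then stops, leaving the rest in the pool.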

class asreview.ReviewSimulate(as_data, *args, n_prior_included=0, n_prior_excluded=0, prior_idx=None, init_seed=None, **kwargs)[source]

ASReview Simulation mode class.

Parameters:
  • as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
  • model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
  • query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
  • balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. It might either resample or undersample specific papers.
  • feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
  • n_prior_included (int) – Sample n prior included papers.
  • n_prior_excluded (int) – Sample n prior excluded papers.
  • prior_idx (int) – Prior indices by id.
  • n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
  • n_instances (int) – Number of papers to query at each step in the active learning process.
  • n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
  • start_idx (numpy.array) – Start the simulation/review with these indices. They are assumed to be already labeled. Passing unlabeled indices might result in bad behaviour.
  • init_seed (int) – Seed for sampling the prior indices when the --prior_idx option is not used. If prior_idx is given with one or more indices, this seed is ignored.
  • state_file (str) – Path to state file. Replaces log_file argument.
classify(query_idx, inclusions, state, method=None)

Classify new papers and update the training indices.

It automatically updates the state.

Parameters:
  • query_idx (list, np.array) – Indices to classify.
  • inclusions (list, np.array) – Labels of the query_idx.
  • state (BaseLogger) – Logger to store the classification in.
  • method (str) – If not set to None, all inclusions have this query method.
log_probabilities(state)

Store the modeling probabilities of the training indices and pool indices.

n_pool()

Number of indices left in the pool.

Returns:int – Number of indices left in the pool.
query(n_instances, query_model=None)

Query records from pool.

Parameters:
  • n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
  • query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns:

np.array – Indices of records queried.

review(*args, **kwargs)

Do the systematic review, writing the results to the state file.

Parameters:
  • stop_after_class (bool) – When to stop; if True stop after classification step, otherwise stop after training step.
  • instant_save (bool) – If True, save results after each single classification.
settings

Get an ASReview settings object

statistics()

Get statistics on the current state of the review.

Returns:dict – A dictionary with statistics like n_included and last_inclusion.
train()

Train the model.
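
To illustrate how n_prior_included, n_prior_excluded and init_seed relate, here is a small sketch in plain Python. It is not ASReview's implementation: sample_priors is a hypothetical helper, and it simply draws the requested number of included and excluded records from the known labels with a seeded random generator.

```python
import random


def sample_priors(labels, n_prior_included=0, n_prior_excluded=0, init_seed=None):
    """Draw prior indices from known labels (1 = included, 0 = excluded)."""
    rng = random.Random(init_seed)         # a fixed seed makes the draw repeatable
    included = [i for i, y in enumerate(labels) if y == 1]
    excluded = [i for i, y in enumerate(labels) if y == 0]
    # random.sample raises ValueError if more priors are requested than exist
    priors = rng.sample(included, n_prior_included) + rng.sample(excluded, n_prior_excluded)
    return sorted(priors)


labels = [0, 1, 0, 0, 1, 0, 1, 0]
priors_a = sample_priors(labels, n_prior_included=1, n_prior_excluded=2, init_seed=42)
priors_b = sample_priors(labels, n_prior_included=1, n_prior_excluded=2, init_seed=42)
```

Using the same seed twice gives the same prior set, which is what makes simulation runs comparable.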

Models

class asreview.models.NBModel(alpha=3.822)[source]

Naive Bayes classifier

The Naive Bayes classifier is an implementation based on the sklearn multinomial Naive Bayes classifier.

Parameters:alpha (float, default=3.822) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (np.array) – Feature matrix to fit.
  • y (np.array) – Labels for supervised learning.
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:
  • dict – Parameter space.
  • dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (np.array) – Feature matrix to predict.
Returns:np.array – Array of class probabilities, with one row per sample and two columns (class 0 and class 1).
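
The alpha parameter of NBModel is an additive smoothing term. As a rough illustration, the sketch below applies the textbook additive (Laplace/Lidstone) smoothing formula to a few word counts. It is plain Python rather than the sklearn or ASReview code, and smoothed_likelihoods is a hypothetical helper name.

```python
def smoothed_likelihoods(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * n_features)."""
    total = sum(counts)
    n_features = len(counts)
    return [(c + alpha) / (total + alpha * n_features) for c in counts]


counts = [3, 0, 1]                       # word counts observed for one class
no_smoothing = smoothed_likelihoods(counts, alpha=0.0)
default_alpha = smoothed_likelihoods(counts, alpha=3.822)
```

With alpha=0 an unseen word gets probability zero; any positive alpha moves probability mass towards unseen words.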
class asreview.models.RFModel(n_estimators=100, max_features=10, class_weight=1.0, random_state=None)[source]

Random Forest classifier

The Random Forest classifier is an implementation based on the sklearn Random Forest classifier.

Parameters:
  • n_estimators (int, default=100) – The number of trees in the forest.
  • max_features (int, default=10) – Number of features in the model.
  • class_weight (float, default=1.0) – Class weight of the inclusions.
  • random_state (int or RandomState, default=None) – Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (np.array) – Feature matrix to fit.
  • y (np.array) – Labels for supervised learning.
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:
  • dict – Parameter space.
  • dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (np.array) – Feature matrix to predict.
Returns:np.array – Array of class probabilities, with one row per sample and two columns (class 0 and class 1).
class asreview.models.SVMModel(gamma='auto', class_weight=0.249, C=15.4, kernel='linear', random_state=None)[source]

Support Vector Machine classifier

The Support Vector Machine classifier is an implementation based on the sklearn Support Vector Machine classifier.

Parameters:
  • gamma (str) – Gamma parameter of the SVM model.
  • class_weight (float) – class_weight of the inclusions.
  • C (float) – C parameter of the SVM model.
  • kernel (str) – SVM kernel type.
  • random_state (int, RandomState) – State of the RNG.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (np.array) – Feature matrix to fit.
  • y (np.array) – Labels for supervised learning.
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:
  • dict – Parameter space.
  • dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (np.array) – Feature matrix to predict.
Returns:np.array – Array of class probabilities, with one row per sample and two columns (class 0 and class 1).
class asreview.models.LogisticModel(C=1.0, class_weight=1.0, random_state=None, n_jobs=1)[source]

Logistic regression classifier

The Logistic regression classifier is an implementation based on the sklearn Logistic regression classifier.

Parameters:
  • C (float) – Parameter inverse to the regularization strength of the model.
  • class_weight (float) – Class weight of the inclusions.
  • random_state (int, RandomState) – Random state for the model.
  • n_jobs (int) – Number of CPU cores used.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)

Fit the model to the data.

Parameters:
  • X (np.array) – Feature matrix to fit.
  • y (np.array) – Labels for supervised learning.
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:
  • dict – Parameter space.
  • dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (np.array) – Feature matrix to predict.
Returns:np.array – Array of class probabilities, with one row per sample and two columns (class 0 and class 1).
class asreview.models.LSTMBaseModel(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, learn_rate=1.0, dense_width=128, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)[source]

LSTM base classifier.

LSTM model consisting of an embedding layer, one LSTM layer, and one dense layer.

Parameters:
  • embedding_matrix (np.array) – Embedding matrix to use with LSTM model.
  • backwards (bool) – Whether to have a forward or backward LSTM.
  • dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
  • optimizer (str) – Optimizer to use.
  • lstm_out_width (int) – Output width of the LSTM.
  • learn_rate (float) – Learn rate multiplier of default learning rate.
  • dense_width (int) – Size of the dense layer of the model.
  • verbose (int) – Verbosity.
  • batch_size (int) – Batch size for the LSTM model.
  • epochs (int) – Number of epochs to train the LSTM model.
  • shuffle (bool) – Whether to shuffle the data before starting to train.
  • class_weight (float) – Class weight for the included papers.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)[source]

Fit the model to the data.

Parameters:
  • X (np.array) – Feature matrix to fit.
  • y (np.array) – Labels for supervised learning.
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:
  • dict – Parameter space.
  • dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (np.array) – Feature matrix to predict.
Returns:np.array – Array of class probabilities, with one row per sample and two columns (class 0 and class 1).
class asreview.models.LSTMPoolModel(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, lstm_pool_size=128, learn_rate=1.0, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)[source]

LSTM pool classifier.

LSTM model consisting of an embedding layer, one LSTM layer, and one max pooling layer.

Parameters:
  • embedding_matrix (np.array) – Embedding matrix to use with LSTM model.
  • backwards (bool) – Whether to have a forward or backward LSTM.
  • dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
  • optimizer (str) – Optimizer to use.
  • lstm_out_width (int) – Output width of the LSTM.
  • lstm_pool_size (int) – Size of the pool, must be a divisor of max_sequence_length.
  • learn_rate (float) – Learn rate multiplier of default learning rate.
  • verbose (int) – Verbosity.
  • batch_size (int) – Batch size for the LSTM model.
  • epochs (int) – Number of epochs to train the LSTM model.
  • shuffle (bool) – Whether to shuffle the data before starting to train.
  • class_weight (float) – Class weight for the included papers.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)[source]

Fit the model to the data.

Parameters:
  • X (np.array) – Feature matrix to fit.
  • y (np.array) – Labels for supervised learning.
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:
  • dict – Parameter space.
  • dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)

Get the inclusion probability for each sample.

Parameters:X (np.array) – Feature matrix to predict.
Returns:np.array – Array of class probabilities, with one row per sample and two columns (class 0 and class 1).
class asreview.models.NN2LayerModel(dense_width=128, optimizer='rmsprop', learn_rate=1.0, regularization=0.01, verbose=0, epochs=35, batch_size=32, shuffle=False, class_weight=30.0)[source]

Dense neural network classifier.

Neural network with two hidden, dense layers of the same size.

Parameters:
  • dense_width (int) – Size of the dense layers.
  • optimizer (str) – Name of the Keras optimizer.
  • learn_rate (float) – Learning rate multiplier of the default learning rate.
  • regularization (float) – Strength of the regularization on the weights and biases.
  • verbose (int) – Verbosity of the model mirroring the values for Keras.
  • epochs (int) – Number of epochs to train the neural network.
  • batch_size (int) – Batch size used for the neural network.
  • shuffle (bool) – Whether to shuffle the training data prior to training.
  • class_weight (float) – Class weights for inclusions (1’s).
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(X, y)[source]

Fit the model to the data.

Parameters:
  • X (np.array) – Feature matrix to fit.
  • y (np.array) – Labels for supervised learning.
full_hyper_space()[source]

Get a hyperparameter space to use with hyperopt.

Returns:
  • dict – Parameter space.
  • dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
predict_proba(X)[source]

Get the inclusion probability for each sample.

Parameters:X (np.array) – Feature matrix to predict.
Returns:np.array – Array of class probabilities, with one row per sample and two columns (class 0 and class 1).
asreview.models.list_classifiers()[source]

List available classifiers.

Returns:list – Names of available classifiers in alphabetical order.
asreview.models.get_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of a model from a string.

Parameters:
  • method (str) – Name of the model.
  • *args – Arguments for the model.
  • **kwargs – Keyword arguments for the model.
asreview.models.get_model_class(method)[source]

Get class of model from string.

Parameters:method (str) – Name of the model, e.g. ‘svm’, ‘nb’ or ‘lstm-pool’.
Returns:BaseModel – Class corresponding to the method.

Query strategies

class asreview.query_strategies.MaxQuery[source]

Maximum sampling query strategy.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which use class probabilities.
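
As a rough illustration of maximum sampling, the sketch below ranks pool records by their inclusion probability (the class 1 column of a predict_proba-style output) and returns the top n_instances. It is plain Python, not the ASReview implementation, and max_query is a hypothetical function name.

```python
def max_query(proba, pool_idx, n_instances=1):
    """Return the pool indices with the highest inclusion probability."""
    # proba[i] is [P(class 0), P(class 1)] for record i
    ranked = sorted(pool_idx, key=lambda i: proba[i][1], reverse=True)
    return ranked[:n_instances]


proba = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]]
picked = max_query(proba, pool_idx=[0, 2, 3], n_instances=2)
```

Record 1 has the highest probability overall but is not in the pool, so it is never returned.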

class asreview.query_strategies.MixedQuery(strategy_1='max', strategy_2='random', mix_ratio=0.95, random_state=None, **kwargs)[source]

Class for mixed query strategy.

The idea is to use two different query strategies at the same time with a ratio of one to the other.

Parameters:
  • strategy_1 (str) – Name of the first query strategy.
  • strategy_2 (str) – Name of the second query strategy.
  • mix_ratio (float) – Portion of queries done by the first strategy. So a mix_ratio of 0.95 means that 95% of the time query strategy 1 is used and 5% of the time query strategy 2.
  • **kwargs (dict) – Keyword arguments for the two strategies. To specify which strategy an argument is for, prepend the name of the query strategy and an underscore, e.g. the prefix ‘max_’ for maximal sampling.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
name

Name of the query strategy, as a string.

param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})[source]

Query new instances.

Parameters:
  • X (np.array) – Feature matrix to choose samples from.
  • classifier (SKLearnModel) – Trained classifier to compute probabilities if they are necessary.
  • pool_idx (np.array) – Indices of samples that are still in the pool.
  • n_instances (int) – Number of instances to query in total, across both strategies.
  • shared (dict) – Dictionary for exchange between query strategies and others. It is mainly used to store the current class probabilities, and the source of the queries; which query strategy has produced which index.
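
The sketch below shows one way a batch could be divided between two strategies according to mix_ratio, using max sampling as strategy 1 and random sampling as strategy 2. It is plain Python and only an illustration: the rounding rule in split_batch, the helper names, and the layout of the returned source dictionary are assumptions, not ASReview's actual behaviour.

```python
import random


def split_batch(n_instances, mix_ratio):
    """Split a batch into (n for strategy 1, n for strategy 2)."""
    n_first = round(n_instances * mix_ratio)   # assumption: simple rounding
    return n_first, n_instances - n_first


def mixed_query(proba, pool_idx, n_instances, mix_ratio=0.95, seed=None):
    """Combine max sampling (strategy 1) with random sampling (strategy 2)."""
    n_max, n_random = split_batch(n_instances, mix_ratio)
    by_proba = sorted(pool_idx, key=lambda i: proba[i][1], reverse=True)
    from_max = by_proba[:n_max]
    rest = [i for i in pool_idx if i not in from_max]
    rng = random.Random(seed)
    from_random = rng.sample(rest, min(n_random, len(rest)))
    # record which strategy produced which index
    source = {"max": from_max, "random": from_random}
    return from_max + from_random, source


proba = [[1 - p, p] for p in (0.1, 0.9, 0.3, 0.7, 0.5, 0.2)]
query_idx, source = mixed_query(proba, list(range(6)), n_instances=4, mix_ratio=0.75, seed=0)
```

Here three of the four queried records come from max sampling and one from random sampling; the random part never repeats an index already chosen by max sampling.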
class asreview.query_strategies.UncertaintyQuery[source]

Maximum uncertainty query strategy.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which use class probabilities.
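
For a two-class problem, a common way to express maximum uncertainty is to pick the records whose inclusion probability is closest to 0.5. The sketch below does that in plain Python; uncertainty_query is a hypothetical name, and the exact criterion ASReview uses is assumed rather than confirmed here.

```python
def uncertainty_query(proba, pool_idx, n_instances=1):
    """Return the pool indices whose inclusion probability is closest to 0.5."""
    # proba[i] is [P(class 0), P(class 1)] for record i
    ranked = sorted(pool_idx, key=lambda i: abs(proba[i][1] - 0.5))
    return ranked[:n_instances]


proba = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]]
picked = uncertainty_query(proba, pool_idx=[0, 1, 2, 3], n_instances=2)
```

Compared with max sampling, this favours records the classifier cannot decide on rather than records it believes are relevant.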

class asreview.query_strategies.RandomQuery(random_state=None)[source]

Random sampling query strategy.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which do not use class probabilities.

class asreview.query_strategies.ClusterQuery(cluster_size=350, update_interval=200, random_state=None)[source]

Query strategy using clustering algorithms.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
query(X, classifier, pool_idx=None, n_instances=1, shared={})

Query method for strategies which use class probabilities.

asreview.query_strategies.list_query_strategies()[source]

List available query strategies.

This excludes all possible mixed query strategies.

Returns:list – Names of available query strategies in alphabetical order.
asreview.query_strategies.get_query_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of the query strategy.

Parameters:
  • method (str) – Name of the query strategy.
  • *args – Arguments for the model.
  • **kwargs – Keyword arguments for the model.
Returns:

BaseQueryModel – Initialized instance of query strategy.

asreview.query_strategies.get_query_class(method)[source]

Get class of query strategy from its name.

Parameters:method (str) – Name of the query strategy, e.g. ‘max’, ‘uncertainty’, ‘random’. A special mixed query strategy is also possible. The mix is denoted by an underscore: ‘max_random’ or ‘max_uncertainty’.
Returns:BaseQueryModel – Class corresponding to the method name.
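
The underscore convention for mixed strategy names can be illustrated with a small parser. This is a plain Python sketch with a hypothetical helper, split_strategy_name; it only shows how a name such as ‘max_random’ decomposes and makes no claim about how get_query_class resolves names internally.

```python
def split_strategy_name(method):
    """Split 'max_random' into ('max', 'random'); single names give (name, None)."""
    parts = method.split("_")
    if len(parts) == 1:
        return parts[0], None
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"Cannot interpret query strategy name: {method!r}")


single = split_strategy_name("uncertainty")
mixed = split_strategy_name("max_uncertainty")
```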

Balance Strategies

class asreview.balance_strategies.SimpleBalance[source]
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Function that does not resample the training set.

Parameters:
  • X (np.array) – Complete matrix of all samples.
  • y (np.array) – Classified results of all samples.
  • extra_vars (dict) – Extra variables that can be passed around between functions.
Returns:

  • np.array – Training samples.
  • np.array – Classification of training samples.

class asreview.balance_strategies.DoubleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, random_state=None)[source]

Dynamic Resampling balance strategy.

Class to get the two-way rebalancing function and arguments. It oversamples the 1’s depending on the number of 0’s and the total number of samples in the training data.

Parameters:
  • a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
  • alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
  • b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
  • beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Resample the training data.

Parameters:
  • X (np.array) – Complete feature matrix.
  • y (np.array) – Labels for all papers.
  • train_idx (np.array) – Training indices, that is all papers that have been reviewed.
  • shared (dict) – Dictionary to share data between balancing models and other models.
Returns:

np.array, np.array – X_train, y_train: the resampled matrix, labels.

class asreview.balance_strategies.TripleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, c=0.835, gamma=2.0, shuffle=True, random_state=None)[source]

Triple balance strategy.

Class to get the three-way rebalancing function and arguments. It divides the data into three groups: 1’s, 0’s from random sampling, and 0’s from max sampling. Thus it only makes sense to use this class in combination with the rand_max query strategy.

Parameters:
  • a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
  • alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
  • b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
  • beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.
  • c (float) – Value between zero and one that governs the weight of samples done with maximal sampling. Higher values mean higher weight.
  • gamma (float) – Governs the scaling of the weight of the max samples as a function of the % of papers read. Higher values mean stronger scaling.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Resample the training data.

Parameters:
  • X (np.array) – Complete feature matrix.
  • y (np.array) – Labels for all papers.
  • train_idx (np.array) – Training indices, that is all papers that have been reviewed.
  • shared (dict) – Dictionary to share data between balancing models and other models.
Returns:

np.array, np.array – X_train, y_train: the resampled matrix, labels.

class asreview.balance_strategies.UndersampleBalance(ratio=1.0, random_state=None)[source]

Balancing class that undersamples the data with a given ratio.

Parameters:ratio (double) – Undersampling ratio of the zeros. For example, with a ratio of 0.25, only a quarter of the zeros and all of the ones are sampled.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
sample(X, y, train_idx, shared)[source]

Resample the training data.

Parameters:
  • X (np.array) – Complete feature matrix.
  • y (np.array) – Labels for all papers.
  • train_idx (np.array) – Training indices, that is all papers that have been reviewed.
  • shared (dict) – Dictionary to share data between balancing models and other models.
Returns:

np.array, np.array – X_train, y_train: the resampled matrix, labels.
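
The undersampling described for the ratio parameter (keep all ones, keep only a fraction of the zeros) can be sketched in plain Python as follows. The undersample helper, its seed argument, and the rounding of the number of kept zeros are assumptions made for illustration; this is not the ASReview code.

```python
import random


def undersample(y, train_idx, ratio=1.0, seed=None):
    """Keep all 1's and a `ratio` fraction of the 0's among the training indices."""
    ones = [i for i in train_idx if y[i] == 1]
    zeros = [i for i in train_idx if y[i] == 0]
    # assumption: round to the nearest integer, never more than available
    n_keep = min(len(zeros), round(len(zeros) * ratio))
    rng = random.Random(seed)
    kept_zeros = rng.sample(zeros, n_keep)
    return sorted(ones + kept_zeros)


y = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
resampled_idx = undersample(y, train_idx=list(range(10)), ratio=0.25, seed=1)
```

With eight zeros and ratio=0.25, two zeros survive next to the two ones, so the resampled training set is balanced in this toy case.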

asreview.balance_strategies.list_balance_strategies()[source]

List available balancing strategies.

Returns:list – Names of available balance strategies in alphabetical order.
asreview.balance_strategies.get_balance_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of a balance model from a string.

Parameters:
  • method (str) – Name of the balance model.
  • *args – Arguments for the balance model.
  • **kwargs – Keyword arguments for the balance model.
asreview.balance_strategies.get_balance_class(method)[source]

Get class of balance model from string.

Parameters:method (str) – Name of the model, e.g. ‘simple’, ‘double’ or ‘undersample’.
Returns:BaseBalanceModel – Class corresponding to the method.

Feature Extraction

class asreview.feature_extraction.Tfidf(*args, ngram_max=1, **kwargs)[source]

Class to apply SKLearn Tf-idf to texts.

Parameters:ngram_max (int) – Use n-grams up to length ngram_max. For example, with ngram_max=2, unigrams and bigrams can be used.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)[source]

Fit the model to the texts.

It is not always necessary to implement this if no real fitting is being done.

Parameters:texts (np.array) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
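
To make the effect of ngram_max concrete, the sketch below lists the word n-grams that become candidate features for a given setting. It is a plain Python illustration with a hypothetical helper, ngrams_up_to, and a naive whitespace tokenizer; scikit-learn's TfidfVectorizer exposes a comparable setting through its ngram_range parameter, but how Tfidf passes ngram_max on is not shown in this reference.

```python
def ngrams_up_to(text, ngram_max=1):
    """List all word n-grams of length 1..ngram_max, ordered by n."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, ngram_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


unigrams = ngrams_up_to("Active learning works", ngram_max=1)
up_to_bigrams = ngrams_up_to("Active learning works", ngram_max=2)
```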
class asreview.feature_extraction.Doc2Vec(*args, vector_size=40, epochs=33, min_count=1, n_jobs=1, window=7, dm_concat=0, dm=2, dbow_words=0, **kwargs)[source]

Base class for doc2vec feature extraction.

Requires ‘gensim’ installation.

Parameters:
  • vector_size (int) – Output size of the vector.
  • epochs (int) – Number of epochs to train the doc2vec model.
  • min_count (int) – Minimum number of occurrences for a word in the corpus for it to be included in the model.
  • n_jobs (int) – Number of threads to train the model with.
  • window (int) – Maximum distance over which word vectors influence each other.
  • dm_concat (int) – Whether to concatenate word vectors or not. See paper for more detail.
  • dm (int) – Model to use. 0: Use distributed bag of words (DBOW). 1: Use distributed memory (DM). 2: Use both of the above with half the vector size and concatenate them.
  • dbow_words (int) – Whether to train the word vectors using the skipgram method.
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)[source]

Fit the model to the texts.

It is not always necessary to implement this if no real fitting is being done.

Parameters:texts (np.array) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
class asreview.feature_extraction.EmbeddingIdf(*args, embedding_fp=None, random_state=None, **kwargs)[source]

Class for Embedding-Idf model.

This model averages the weighted word vectors of all the words in the text, in order to get a single feature vector for each text. The weights are provided by the inverse document frequencies.

Parameters:embedding_fp (str) – Path to embedding.
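
The weighting described above can be sketched in plain Python. This is only an illustration of an IDF-weighted average of word vectors: the toy embedding, the helper names `idf_weights` and `idf_weighted_average`, and the smoothed IDF formula are assumptions made for this example and are not taken from the ASReview implementation.

```python
import math
from collections import Counter


def idf_weights(tokenized_texts):
    """Smoothed inverse document frequency per token (illustrative formula)."""
    n_docs = len(tokenized_texts)
    doc_freq = Counter(tok for text in tokenized_texts for tok in set(text))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1
            for tok, df in doc_freq.items()}


def idf_weighted_average(tokens, embedding, idf):
    """Average the word vectors of `tokens`, weighting each one by its IDF."""
    dim = len(next(iter(embedding.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embedding or tok not in idf:
            continue  # skip words without a vector or an IDF weight
        w = idf[tok]
        total = [t + w * v for t, v in zip(total, embedding[tok])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector for texts without known words
    return [t / weight_sum for t in total]


# Hypothetical two-dimensional embedding and a tiny tokenized corpus.
embedding = {"active": [1.0, 0.0], "learning": [0.0, 1.0], "review": [1.0, 1.0]}
texts = [["active", "learning"], ["learning", "review"], ["review", "learning"]]
idf = idf_weights(texts)
features = [idf_weighted_average(t, embedding, idf) for t in texts]
```

Each text ends up with a single vector of the embedding dimension, and rarer words such as "active" pull the average more strongly towards their own vector.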
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)

Fit the model to the texts.

Implementing this is not always necessary, for example when no real fitting is being done.

Parameters:texts (np.array) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
class asreview.feature_extraction.EmbeddingLSTM(*args, loop_sequence=1, num_words=20000, max_sequence_length=1000, padding='post', truncating='post', n_jobs=1, **kwargs)[source]

Class to create embedding matrices for LSTM models.

Parameters:
  • loop_sequence (bool) – Loop the sequence instead of padding it with zeros at the start/end.
  • num_words (int) – Maximum number of unique words to be processed.
  • max_sequence_length (int) – Maximum length of the sequence. Longer sequences get truncated. Shorter sequences get either padded with zeros or looped.
  • padding (str) – Which side should be padded [pre/post].
  • truncating (str) – Which side should be truncated [pre/post].
  • n_jobs (int) – Number of processors used in reading the embedding matrix.
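
How padding, truncating and loop_sequence interact can be shown with a small sketch on lists of token ids. The helper `fit_sequence` is hypothetical, and the exact way the library loops a sequence is not documented here, so the looping rule below (repeat the sequence and cut it on the padded side) is an assumption.

```python
def fit_sequence(seq, max_sequence_length, padding="post", truncating="post",
                 loop_sequence=False):
    """Return a copy of `seq` with exactly `max_sequence_length` token ids."""
    if len(seq) >= max_sequence_length:
        # Too long: 'post' drops the end, 'pre' drops the start.
        if truncating == "post":
            return list(seq[:max_sequence_length])
        return list(seq[len(seq) - max_sequence_length:])

    if loop_sequence and seq:
        # Too short, looping requested: repeat the sequence itself.
        repeated = list(seq) * (max_sequence_length // len(seq) + 1)
        if padding == "post":
            return repeated[:max_sequence_length]
        return repeated[-max_sequence_length:]

    # Too short, no looping: fill the gap with zeros on the chosen side.
    zeros = [0] * (max_sequence_length - len(seq))
    return list(seq) + zeros if padding == "post" else zeros + list(seq)


padded = fit_sequence([1, 2, 3], 5)
looped = fit_sequence([1, 2, 3], 5, loop_sequence=True)
truncated = fit_sequence([1, 2, 3, 4, 5, 6], 4)
```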
default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)

Fit the model to the texts.

Implementing this is not always necessary, for example when no real fitting is being done.

Parameters:texts (np.array) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
class asreview.feature_extraction.SBERT(split_ta=0, use_keywords=0)[source]

Sentence BERT class for feature extraction.

default_param

Get the default parameters of the model.

Returns:dict – Dictionary with parameter: default value
fit(texts)

Fit the model to the texts.

Implementing this is not always necessary, for example when no real fitting is being done.

Parameters:texts (np.array) – Texts to be fitted.
fit_transform(texts, titles=None, abstracts=None, keywords=None)

Fit and transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
param

Get the (assigned) parameters of the model.

Returns:dict – Dictionary with parameter: current value.
transform(texts)[source]

Transform a list of texts.

Parameters:texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns:np.array – Feature matrix representing the texts.
asreview.feature_extraction.list_feature_extraction()[source]

List available feature extraction methods.

Returns:list – Names of available feature extraction methods in alphabetical order.
asreview.feature_extraction.get_feature_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of a feature extraction model from a string.

Parameters:
  • method (str) – Name of the feature extraction model.
  • *args – Arguments for the feature extraction model.
  • **kwargs – Keyword arguments for the feature extraction model.
asreview.feature_extraction.get_feature_class(method)[source]

Get class of feature extraction from string.

Parameters:method (str) – Name of the feature model, e.g. ‘doc2vec’, ‘tfidf’ or ‘embedding-lstm’.
Returns:BaseFeatureExtraction – Class corresponding to the method.
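
The three functions above follow a name-to-class lookup pattern. The sketch below imitates that pattern with toy stand-ins so it runs without ASReview installed; the classes `ToyTfidf` and `ToyDoc2Vec` and the helpers `available_methods`, `lookup_class` and `build_model` are hypothetical and do not reflect how the library registers its models internally.

```python
class ToyFeatureExtraction:
    """Stand-in base class, used only in this sketch."""
    name = "base"


class ToyTfidf(ToyFeatureExtraction):
    name = "tfidf"


class ToyDoc2Vec(ToyFeatureExtraction):
    name = "doc2vec"

    def __init__(self, vector_size=40):
        self.vector_size = vector_size


# Map each method name to its class.
_REGISTRY = {cls.name: cls for cls in (ToyTfidf, ToyDoc2Vec)}


def available_methods():
    """Names of the registered methods in alphabetical order."""
    return sorted(_REGISTRY)


def lookup_class(method):
    """Return the class registered under `method`."""
    try:
        return _REGISTRY[method]
    except KeyError:
        raise ValueError(f"Unknown feature extraction method: {method!r}") from None


def build_model(method, *args, **kwargs):
    """Instantiate the class registered under `method`."""
    return lookup_class(method)(*args, **kwargs)


model = build_model("doc2vec", vector_size=20)
```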

Data

class asreview.ASReviewData(df=None, data_name='empty', data_type='standard', column_spec=None)[source]

Data object to the dataset with texts, labels, DOIs etc.

Parameters:
  • df (pd.DataFrame) – Dataframe containing the data for the ASReview data object.
  • data_name (str) – Give a name to the data object.
  • data_type (str) – What kind of data the dataframe contains.
  • column_spec (dict) – Specification for which column corresponds to which standard specification. Key is the standard specification, value is which column it is actually in.
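
To make the role of column_spec concrete, here is a minimal sketch that resolves a standard field name to the column it actually lives in. The rows, the column names and the helper `standard_column` are invented for illustration; the real class works on a pandas dataframe.

```python
def standard_column(rows, column_spec, standard_name):
    """Return the values of a standard field, looked up via `column_spec`."""
    # Fall back to the standard name itself when no mapping is given.
    actual_name = column_spec.get(standard_name, standard_name)
    return [row.get(actual_name) for row in rows]


# Hypothetical dataset whose columns do not use the standard names.
rows = [
    {"Title": "Paper A", "Abstract Note": "First abstract.", "label_included": 1},
    {"Title": "Paper B", "Abstract Note": "Second abstract.", "label_included": 0},
]
column_spec = {
    "title": "Title",
    "abstract": "Abstract Note",
    "included": "label_included",
}
titles = standard_column(rows, column_spec, "title")
```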
append(as_data)[source]

Append another ASReviewData object.

It puts the training data at the end.

Parameters:as_data (ASReviewData) – Dataset to append.
format_record(i, by_index=True, *args, **kwargs)[source]

Format one record for displaying in the CLI.

classmethod from_file(fp, read_fn=None, data_name=None, data_type=None)[source]

Create instance from csv/ris/excel file.

It works in two ways: either manually, where the conversion functions are supplied, or automatically, where it searches the entry points for the right conversion functions.

Parameters:
  • fp (str, Path) – Read the data from this file.
  • read_fn (function) – Function to read the file. It should return a standardized dataframe.
  • data_name (str) – Name of the data.
  • data_type (str) – What kind of data it is. Special names: ‘included’, ‘excluded’, ‘prior’.
fuzzy_find(keywords, threshold=60, max_return=10, exclude=None, by_index=True)[source]

Find a record using keywords.

It looks for keywords in the title/authors/keywords (as far as they are available). Using the difflib package it creates a ranking based on token set matching.

Parameters:
  • keywords (str) – A string of keywords together, can be a combination.
  • threshold (float) – Don’t return records below this threshold.
  • max_return (int) – Maximum number of records to return.
  • exclude (list, np.ndarray) – List of indices that should be excluded in the search. You would put papers that were already labeled here for example.
  • by_index (bool) – If True, use internal indexing. If False, use record ids for indexing.
Returns:

list – Sorted list of indexes that best match the keywords.
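
A simplified version of such a ranking can be written with the standard library alone. The sketch below scores each text against the keywords with `difflib.SequenceMatcher` on the sorted unique tokens, then applies threshold, exclude and max_return. The scoring rule and the helper names are assumptions for illustration; the actual method may score records differently.

```python
import difflib


def token_set_score(query, text):
    """Score in [0, 100] comparing the sorted unique tokens of both strings."""
    q = " ".join(sorted(set(query.lower().split())))
    t = " ".join(sorted(set(text.lower().split())))
    return 100 * difflib.SequenceMatcher(None, q, t).ratio()


def fuzzy_rank(keywords, texts, threshold=60, max_return=10, exclude=None):
    """Return indices of `texts`, best match first."""
    excluded = set(exclude or [])
    scored = [
        (token_set_score(keywords, text), i)
        for i, text in enumerate(texts)
        if i not in excluded
    ]
    # Drop weak matches, then sort by descending score (index breaks ties).
    scored = [(score, i) for score, i in scored if score >= threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:max_return]]


texts = [
    "Active learning for systematic reviews",
    "Quantum chromodynamics",
    "Systematic reviews with active learning",
]
ranked = fuzzy_rank("active learning systematic reviews", texts)
```

Passing the indices of already labeled papers as exclude keeps them out of the result, which mirrors the use case described for that parameter.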

get(name)[source]

Get column with name.

hash()[source]

Compute a hash from the dataset.

Returns:str – SHA1 hash, computed from the titles/abstracts of the dataframe.
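
A hash of this kind can be sketched with `hashlib`. How the titles and abstracts are concatenated before hashing is not specified here, so the joining rule in `dataset_hash` (a hypothetical helper) is an assumption and will not reproduce the library's hash values.

```python
import hashlib


def dataset_hash(titles, abstracts):
    """SHA1 hex digest over all titles followed by all abstracts."""
    joined = " ".join(str(t) for t in titles) + " ".join(str(a) for a in abstracts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


digest = dataset_hash(["A title"], ["An abstract"])
```

Because the digest depends only on the text, it can serve as a key for checking whether stored results, such as a feature matrix, belong to the same dataset.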
preview_record(i, by_index=True, *args, **kwargs)[source]

Return a preview string for record i.

print_record(*args, **kwargs)[source]

Print a record to the CLI.

prior_data_idx

Get prior_included, prior_excluded from dataset.

prior_labels(state, by_index=True)[source]

Get the labels that are marked as ‘initial’.

Parameters:
  • state (BaseState) – Open state that contains the label information.
  • by_index (bool) – If True, return internal indexing. If False, return record_ids for indexing.
Returns:np.array – Array of indices that have the ‘initial’ property.
record(i, by_index=True)[source]

Create a record from an index.

Parameters:
  • i (int, iterable) – Index of the record, or list of indices.
  • by_index (bool) – If True, take the i-th value as used internally by the review. If False, take the record with record_id==i.
Returns:

PaperRecord – The corresponding record if i was an integer, or a list of records if i was an iterable.

slice(idx)[source]

Create a slice from itself.

Useful if some parts should be kept/thrown away.

Parameters:idx (list, np.ndarray) – Record ids that should be kept.
Returns:ASReviewData – Slice of itself.
to_csv(fp, labels=None, ranking=None)[source]

Export to csv.

Parameters:
  • fp (str, NoneType) – Filepath or None for buffer.
  • labels (list, np.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pd.DataFrame – Dataframe of all available record data.

to_dataframe(labels=None, ranking=None)[source]

Create new dataframe with updated label (order).

Parameters:
  • labels (list, np.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pd.DataFrame – Dataframe of all available record data.
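
The effect of labels and ranking can be illustrated on a list of records instead of a dataframe. The helper `relabel_and_rank` and the column name "included" are assumptions made for this sketch.

```python
def relabel_and_rank(records, labels=None, ranking=None, label_key="included"):
    """Return copies of `records` with new labels, reordered by `ranking`."""
    rows = [dict(record) for record in records]  # leave the input untouched
    if labels is not None:
        # Overwrite the current label of every record.
        for row, label in zip(rows, labels):
            row[label_key] = label
    if ranking is not None:
        # `ranking` holds internal indices in the desired output order.
        rows = [rows[i] for i in ranking]
    return rows


records = [
    {"title": "A", "included": 0},
    {"title": "B", "included": 1},
    {"title": "C", "included": None},
]
exported = relabel_and_rank(records, labels=[1, 0, 1], ranking=[2, 0, 1])
```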

to_excel(fp, labels=None, ranking=None)[source]

Export to Excel xlsx file.

Parameters:
  • fp (str, NoneType) – Filepath or None for buffer.
  • labels (list, np.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pd.DataFrame – Dataframe of all available record data.

to_file(fp, labels=None, ranking=None)[source]

Export data object to file.

RIS, CSV and Excel are supported file formats at the moment.

Parameters:
  • fp (str) – Filepath to export to.
  • labels (list, np.array) – Labels to be inserted into the dataframe before export.
  • ranking (list, np.array) – Optionally, dataframe rows can be reordered.

Utils

asreview.load_embedding(fp, word_index=None, n_jobs=None)[source]

Load embedding matrix from file.

The embedding matrix needs to be stored in the FastText format.

Parameters:
  • fp (str) – File path of the trained embedding vectors.
  • word_index (dict) – Sample word embeddings.
  • n_jobs (int) – Number of processes to parse the embedding (+1 process for reading).
  • verbose (int) – The verbosity. Default 1.
Returns:

dict – The embedding weights stored in a dict with the word as key and the weights as values.

asreview.sample_embedding(embedding, word_index)[source]

Sample embedding matrix.

Parameters:
  • embedding (dict) – A dictionary with the words and embedding vectors.
  • word_index (dict) – A word_index like the output of Keras Tokenizer.word_index.
  • verbose (int) – The verbosity. Default 1.
Returns:

(np.ndarray, list) – The embedding weights stored in a two-dimensional numpy array and a list with the corresponding words.
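
Sampling an embedding for a given word_index can be sketched as follows, using nested lists in place of a numpy array. The row layout (row i holds the vector of the word with index i, and row 0 stays empty because Keras word indices start at 1) is an assumption of this example, as is the helper name `sample_rows`.

```python
def sample_rows(embedding, word_index):
    """Build a matrix with one row per index in `word_index`."""
    dim = len(next(iter(embedding.values())))
    n_rows = max(word_index.values()) + 1  # Keras word indices start at 1
    matrix = [[0.0] * dim for _ in range(n_rows)]
    words = [None] * n_rows
    for word, idx in word_index.items():
        words[idx] = word
        if word in embedding:
            matrix[idx] = list(embedding[word])  # unknown words keep zeros
    return matrix, words


embedding = {"review": [0.1, 0.2], "paper": [0.3, 0.4], "unused": [9.0, 9.0]}
word_index = {"paper": 1, "review": 2, "novel": 3}
matrix, words = sample_rows(embedding, word_index)
```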

State

asreview.state.open_state(fp, *args, read_only=False, **kwargs)[source]

Open a state from a file.

Parameters:
  • fp (str) – File to open.
  • read_only (bool) – Whether to open the file in read_only mode.
Returns:

BaseState – The appropriate state is chosen depending on the file extension:
  • .h5, .hdf5, .he5 -> HDF5State.
  • None -> DictState (doesn’t store anything permanently).
  • Anything else -> JSONState.
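
The extension rule above amounts to a small dispatch on the file suffix. The helper `state_class_name` below is hypothetical and only returns the name of the class that would be chosen; treating the suffix case-insensitively is an assumption.

```python
from pathlib import Path

HDF5_SUFFIXES = {".h5", ".hdf5", ".he5"}


def state_class_name(fp):
    """Name of the state class matching the extension of `fp`."""
    if fp is None:
        return "DictState"  # nothing is stored permanently
    if Path(fp).suffix.lower() in HDF5_SUFFIXES:
        return "HDF5State"
    return "JSONState"  # default for every other extension


chosen = state_class_name("results/review.h5")
```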

class asreview.state.BaseState(state_fp, read_only=False)[source]
add_classification(idx, labels, methods, query_i)[source]

Add training indices and their labels.

Parameters:
  • idx (list, np.array) – A list of indices used for training.
  • labels (list) – A list of labels corresponding with the training indices.
  • query_i (int) – The query number.
add_proba(pool_idx, train_idx, proba, query_i)[source]

Add inverse pool indices and their labels.

Parameters:
  • pool_idx (list, np.array) – A list of indices of the unlabeled pool.
  • proba (np.array) – Array of prediction probabilities for the unlabeled pool.
  • query_i (int) – The query number.
close()[source]

Close the files opened by the state.

Also sets the end time if not in read-only mode.

delete_last_query()[source]

Delete the last query from the state object.

get(variable, query_i=None, default=None, idx=None)[source]

Get data from the state object.

This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.

Parameters:
  • variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba, train_idx, pool_idx.
  • query_i (int) – Query number, should be between 0 and self.n_queries().
  • idx (int, np.array, list) – Indices to get in the returned array.
get_current_queries()[source]

Get the current queries made by the model.

This is useful to get back exactly to the state it was in before shutting down a review.

Returns:dict – The last known queries according to the state file.
get_feature_matrix(data_hash)[source]

Get feature matrix out of the state.

Parameters:data_hash (str) – Hash of as_data object from which the matrix is derived.
Returns:np.ndarray or scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.
initialize_structure()[source]

Create empty internal structure for state.

is_empty()[source]

Check if state has no results.

Returns:bool – True if empty.
n_queries()[source]

Number of queries saved in the state.

Returns:int – Number of queries.
pred_proba

Get last predicted probabilities.

restore(fp)[source]

Restore or create state from a state file.

If the state file doesn’t exist, creates an empty state that is ready for storage.

Parameters:fp (str) – Path to file to restore/create.
save()[source]

Save state to file.

Parameters:fp (str) – The file path to export the results to.
set_current_queries(current_queries)[source]

Set the current queries made by the model.

Parameters:current_queries (dict) – The last known queries, with {query_idx: query_method}.
set_final_labels(y)[source]

Add/set final labels to state.

If final_labels does not exist yet, add it.

Parameters:y (np.array) – One dimensional integer numpy array with final inclusion labels.
set_labels(y)[source]

Add/set labels to state.

If the labels do not exist, add them to the state.

Parameters:y (np.array) – One dimensional integer numpy array with inclusion labels.
settings

Get settings from state.

startup_vals()[source]

Get variables for reviewer to continue review.

Returns:
  • np.array – Current labels of dataset.
  • np.array – Current training indices.
  • dict – Dictionary containing the sources of the labels.
  • query_i – Current query number (starting from 0).
to_dict()[source]

Convert state to dictionary.

Returns:dict – Dictionary with all relevant variables.
class asreview.state.HDF5State(state_fp, read_only=False)[source]

Class for storing the review state with HDF5 storage.

class asreview.state.JSONState(state_fp, read_only=False)[source]

Class for storing the state of a Systematic Review using JSON files.

class asreview.state.DictState(state_fp, *_, **__)[source]

Class for storing the state of a review with no permanent storage.

Analysis

class asreview.analysis.Analysis(states, key=None)[source]

Analysis object to do statistical analysis on state files.

avg_time_to_discovery(result_format='number')[source]

Estimate the Time to Discovery (TD) for each paper.

Get the best/last estimate on how long it takes to find a paper.

Parameters:result_format (str) – Desired output format: “number”, “fraction” or “percentage”.
Returns:dict – For each inclusion, key=paper_id, value=avg time.
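
One way to picture this quantity is to average, over several runs, the position at which each included paper was read. The sketch below does that on plain lists. The helper `average_discovery_time`, counting positions from 0, and the scaling used for the "fraction" and "percentage" formats are all assumptions and may differ from the actual estimate.

```python
from collections import defaultdict


def average_discovery_time(reading_orders, inclusions, n_records,
                           result_format="number"):
    """Average reading position of each included paper over all runs."""
    positions = defaultdict(list)
    for order in reading_orders:
        for position, paper_id in enumerate(order):
            if paper_id in inclusions:
                positions[paper_id].append(position)
    # Convert raw positions to the requested output format.
    scale = {
        "number": 1,
        "fraction": 1 / n_records,
        "percentage": 100 / n_records,
    }[result_format]
    return {pid: scale * sum(p) / len(p) for pid, p in positions.items()}


# Two hypothetical runs over four papers; papers 1 and 3 are inclusions.
runs = [[3, 1, 0, 2], [3, 0, 1, 2]]
times = average_discovery_time(runs, inclusions={1, 3}, n_records=4)
```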
close()[source]

Close states.

classmethod from_dir(data_dir, prefix='', key=None)[source]

Create an Analysis object from a directory.

Parameters:
  • data_dir (str) – Directory to read the state files from.
  • prefix (str) – Only assume files starting with this prefix are state files. Ignore all other files.
  • key (str) – Name for the analysis object.
classmethod from_file(data_fp, key=None)[source]

Create an Analysis object from a file.

Parameters:
  • data_fp (str) – Path to state file to analyse.
  • key (str) – Name for analysis object.
classmethod from_path(data_path, prefix='', key=None)[source]

Create an Analysis object from either a file or a directory.

inclusions_found(result_format='fraction', final_labels=False, **kwargs)[source]

Get the number of inclusions at each point in time.

Caching is used to prevent multiple calls being expensive.

Parameters:
  • result_format (str) – The format % or # of the returned values.
  • final_labels (bool) – If true, use the final_labels instead of labels for analysis.
Returns:

tuple – Three numpy arrays with x, y, error_bar.

limits(prob_allow_miss=[0.1], result_format='percentage')[source]

For each query, compute the number of papers for a criterion.

A criterion is the average number of papers missed. For example, with 0.1, the criterion is that after reading x papers, there is (about) a 10% chance that one paper is not included. Another example, with 2.0, there are on average 2 papers missed after reading x papers. The value for x is returned for each query and probability by the function.

Parameters:prob_allow_miss (list, float) – Sets the criterion for how many papers can be missed.
Returns:dict – One entry, “x_range”, with the number of papers read, and one entry, “limits”, with a list of results for each probability at each number of papers read.
rrf(val=10, x_format='percentage', **kwargs)[source]

Get the RRF (Relevant References Found).

Parameters:
  • val – At which recall, between 0 and 100.
  • x_format – Format for position of RRF value in graph.
Returns:

tuple – Tuple consisting of RRF value, x_positions, y_positions of RRF bar.
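
RRF at X% is commonly understood as the share of relevant records found after screening X% of all records. The sketch below computes it from labels listed in reading order. Whether val is interpreted exactly this way by the method is an assumption, and the helper `rrf_at` is hypothetical.

```python
def rrf_at(labels_in_reading_order, val=10):
    """Percentage of relevant records found after reading `val`% of records."""
    n_records = len(labels_in_reading_order)
    n_relevant = sum(labels_in_reading_order)
    if n_relevant == 0:
        raise ValueError("At least one relevant record is required.")
    n_read = int(n_records * val // 100)  # records screened so far
    found = sum(labels_in_reading_order[:n_read])
    return 100 * found / n_relevant


# 1 = relevant, 0 = irrelevant, in the order the records were screened.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
rrf_10 = rrf_at(labels, 10)
```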

wss(val=100, x_format='percentage', **kwargs)[source]

Get the WSS (Work Saved over Sampling) value.

Parameters:
  • val – At which recall, between 0 and 100.
  • x_format – Format for position of WSS value in graph.
Returns:

tuple – Tuple consisting of WSS value, x_positions, y_positions of WSS bar.
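
WSS at a recall level R is commonly defined as the fraction of records that did not have to be read to reach recall R, minus the fraction (1 - R) that random screening would save on average. The sketch below follows that definition on labels listed in reading order; the helper `wss_at` is hypothetical and the method itself may handle edge cases differently.

```python
import math


def wss_at(labels_in_reading_order, val=100):
    """WSS in percent at recall `val` (between 0 and 100)."""
    n_records = len(labels_in_reading_order)
    n_relevant = sum(labels_in_reading_order)
    if n_relevant == 0:
        raise ValueError("At least one relevant record is required.")
    needed = math.ceil(n_relevant * val / 100)  # relevant records to find
    found = 0
    n_read = 0
    for n_read, label in enumerate(labels_in_reading_order, start=1):
        found += label
        if found >= needed:
            break  # recall level reached after n_read records
    work_saved = (n_records - n_read) / n_records
    return 100 * (work_saved - (1 - val / 100))


# 1 = relevant, 0 = irrelevant, in the order the records were screened.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
wss_100 = wss_at(labels, 100)
wss_95 = wss_at(labels, 95)
```

In this toy example all three relevant records are found after reading 4 of 10 records, so 60% of the screening work is saved at full recall.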

Extensions

class asreview.entry_points.BaseEntryPoint[source]

Base class for defining entry points.

classmethod execute(argv)[source]

Perform the functionality of the entry point.

Parameters:argv (list) – Argument list, with the entry point and program removed. For example, if asreview plot X is executed, then argv == [‘X’].
format(entry_name='?')[source]

Create a short formatted description of the entry point.

Parameters:entry_name (str) – Name of the entry point. For example ‘plot’ in asreview plot X