API Reference
Low level API

class
asreview.review.
BaseReview
(as_data, model=None, query_model=None, balance_model=None, feature_model=None, n_papers=None, n_instances=1, n_queries=None, start_idx=[], state_file=None, log_file=None)[source]¶ Base class for Systematic Review.
Parameters:  as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
 model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
 query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
 balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. They might either resample or undersample specific papers.
 feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
 n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
 n_instances (int) – Number of papers to query at each step in the active learning process.
 n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
 start_idx (numpy.array) – Start the simulation/review with these indices. They are assumed to be already labeled; passing indices that are not labeled might result in bad behaviour.
 state_file (str) – Path to state file. Replaces log_file argument.
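How start_idx, n_instances, and n_queries interact can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the ASReview implementation: `run_review`, `n_records`, and `score` are hypothetical names, and no ASReview classes are used.

```python
def run_review(n_records, start_idx, n_instances=1, n_queries=None, score=None):
    """Toy active learning loop over record indices.

    n_records   -- total number of records in the dataset
    start_idx   -- indices that are already labeled before the loop starts
    n_instances -- number of records queried at each step
    n_queries   -- number of steps; None means "until the pool is empty"
    score       -- hypothetical ranking function; higher scores are queried first
    """
    if score is None:
        score = lambda idx: -idx  # placeholder ranking: lowest index first
    labeled = set(start_idx)
    train_idx = list(start_idx)
    pool_idx = [i for i in range(n_records) if i not in labeled]
    step = 0
    while pool_idx and (n_queries is None or step < n_queries):
        # Query: take the n_instances best-ranked records from the pool.
        query_idx = sorted(pool_idx, key=score, reverse=True)[:n_instances]
        # Classify: move the queried records from the pool to the training set.
        train_idx.extend(query_idx)
        pool_idx = [i for i in pool_idx if i not in query_idx]
        step += 1
    return train_idx, pool_idx


train_idx, pool_idx = run_review(8, start_idx=[1, 2], n_instances=2, n_queries=2)
print(train_idx)  # [1, 2, 0, 3, 4, 5]
print(pool_idx)   # [6, 7]
```

With n_queries=None the loop only stops once the pool is empty, which mirrors the "no limit" behaviour described above.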

classify
(query_idx, inclusions, state, method=None)[source]¶ Classify new papers and update the training indices.
It automatically updates the state.
Parameters:

log_probabilities
(state)[source]¶ Store the modeling probabilities of the training indices and pool indices.

n_pool
()[source]¶ Number of indices left in the pool.
Returns: int – Number of indices left in the pool.

query
(n_instances, query_model=None)[source]¶ Query records from pool.
Parameters:  n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
 query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns: np.array – Indices of records queried.

review
(*args, **kwargs)[source]¶ Do the systematic review, writing the results to the state file.
Parameters:

settings
¶ Get an ASReview settings object

class
asreview.
ReviewSimulate
(as_data, *args, n_prior_included=0, n_prior_excluded=0, prior_idx=None, init_seed=None, **kwargs)[source]¶ ASReview Simulation mode class.
Parameters:  as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
 model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
 query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
 balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. They might either resample or undersample specific papers.
 feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
 n_prior_included (int) – Sample n prior included papers.
 n_prior_excluded (int) – Sample n prior excluded papers.
 prior_idx (int) – Prior indices by id.
 n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
 n_instances (int) – Number of papers to query at each step in the active learning process.
 n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
 start_idx (numpy.array) – Start the simulation/review with these indices. They are assumed to be already labeled; passing indices that are not labeled might result in bad behaviour.
 init_seed (int) – Seed for setting the prior indices if the --prior_idx option is not used. If the option prior_idx is used with one or more indices, this option is ignored.
 state_file (str) – Path to state file. Replaces log_file argument.

classify
(query_idx, inclusions, state, method=None)¶ Classify new papers and update the training indices.
It automatically updates the state.
Parameters:

log_probabilities
(state)¶ Store the modeling probabilities of the training indices and pool indices.

n_pool
()¶ Number of indices left in the pool.
Returns: int – Number of indices left in the pool.

query
(n_instances, query_model=None)¶ Query records from pool.
Parameters:  n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
 query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns: np.array – Indices of records queried.

review
(*args, **kwargs)¶ Do the systematic review, writing the results to the state file.
Parameters:

settings
¶ Get an ASReview settings object

statistics
()¶ Get statistics on the current state of the review.
Returns: dict – A dictionary with statistics like n_included and last_inclusion.

train
()¶ Train the model.
Models

class
asreview.models.
NBModel
(alpha=3.822)[source]¶ Naive Bayes classifier
The Naive Bayes classifier is an implementation based on the sklearn multinomial Naive Bayes classifier.
Parameters: alpha (float, default=3.822) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). 
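The effect of the alpha smoothing parameter can be illustrated with the standard additive smoothing formula for multinomial Naive Bayes, (count + alpha) / (total + alpha * n_features). The helper `smoothed_likelihoods` below is a hypothetical stand-alone sketch and does not call ASReview or scikit-learn.

```python
def smoothed_likelihoods(feature_counts, alpha=3.822):
    """Additive (Laplace/Lidstone) smoothing of per-feature counts for one class."""
    total = sum(feature_counts)
    n_features = len(feature_counts)
    return [(count + alpha) / (total + alpha * n_features)
            for count in feature_counts]


counts = [3, 0, 1]  # made-up feature counts for a single class
print(smoothed_likelihoods(counts, alpha=0.0))    # [0.75, 0.0, 0.25]
print(smoothed_likelihoods(counts, alpha=3.822))  # every feature gets a nonzero value
```

With alpha=0 a feature that never occurred in a class keeps a likelihood of zero; any positive alpha moves the estimates toward a uniform distribution.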
default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(X, y)¶ Fit the model to the data.
 X: np.array
 Feature matrix to fit.
 y: np.array
 Labels for supervised learning.

full_hyper_space
()[source]¶ Get a hyperparameter space to use with hyperopt.
Returns:  dict – Parameter space.
 dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba
(X)¶ Get the inclusion probability for each sample.
Parameters: X (np.array) – Feature matrix to predict.
Returns: np.array – Array with the probabilities for each class: one row per sample and two columns (class 0 and class 1).


class
asreview.models.
RFModel
(n_estimators=100, max_features=10, class_weight=1.0, random_state=None)[source]¶ Random Forest classifier
The Random Forest classifier is an implementation based on the sklearn Random Forest classifier.
Parameters:  n_estimators (int, default=100) – The number of trees in the forest.
 max_features (int, default=10) – Number of features in the model.
 class_weight (float, default=1.0) – Class weight of the inclusions.
 random_state (int or RandomState, default=None) – Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(X, y)¶ Fit the model to the data.
 X: np.array
 Feature matrix to fit.
 y: np.array
 Labels for supervised learning.

full_hyper_space
()[source]¶ Get a hyperparameter space to use with hyperopt.
Returns:  dict – Parameter space.
 dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba
(X)¶ Get the inclusion probability for each sample.
Parameters: X (np.array) – Feature matrix to predict.
Returns: np.array – Array with the probabilities for each class: one row per sample and two columns (class 0 and class 1).

class
asreview.models.
SVMModel
(gamma='auto', class_weight=0.249, C=15.4, kernel='linear', random_state=None)[source]¶ Support Vector Machine classifier
The Support Vector Machine classifier is an implementation based on the sklearn Support Vector Machine classifier.
Parameters: 
default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(X, y)¶ Fit the model to the data.
 X: np.array
 Feature matrix to fit.
 y: np.array
 Labels for supervised learning.

full_hyper_space
()[source]¶ Get a hyperparameter space to use with hyperopt.
Returns:  dict – Parameter space.
 dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba
(X)¶ Get the inclusion probability for each sample.
Parameters: X (np.array) – Feature matrix to predict.
Returns: np.array – Array with the probabilities for each class: one row per sample and two columns (class 0 and class 1).


class
asreview.models.
LogisticModel
(C=1.0, class_weight=1.0, random_state=None, n_jobs=1)[source]¶ Logistic regression classifier
The Logistic regression classifier is an implementation based on the sklearn Logistic regression classifier.
Parameters: 
default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(X, y)¶ Fit the model to the data.
 X: np.array
 Feature matrix to fit.
 y: np.array
 Labels for supervised learning.

full_hyper_space
()[source]¶ Get a hyperparameter space to use with hyperopt.
Returns:  dict – Parameter space.
 dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba
(X)¶ Get the inclusion probability for each sample.
Parameters: X (np.array) – Feature matrix to predict.
Returns: np.array – Array with the probabilities for each class: one row per sample and two columns (class 0 and class 1).


class
asreview.models.
LSTMBaseModel
(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, learn_rate=1.0, dense_width=128, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)[source]¶ LSTM base classifier.
LSTM model consisting of an embedding layer, one LSTM layer, and one dense layer.
Parameters:  embedding_matrix (np.array) – Embedding matrix to use with LSTM model.
 backwards (bool) – Whether to have a forward or backward LSTM.
 dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
 optimizer (str) – Optimizer to use.
 lstm_out_width (int) – Output width of the LSTM.
 learn_rate (float) – Learn rate multiplier of default learning rate.
 dense_width (int) – Size of the dense layer of the model.
 verbose (int) – Verbosity.
 batch_size (int) – Batch size for the LSTM model.
 epochs (int) – Number of epochs to train the LSTM model.
 shuffle (bool) – Whether to shuffle the data before starting to train.
 class_weight (float) – Class weight for the included papers.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(X, y)[source]¶ Fit the model to the data.
 X: np.array
 Feature matrix to fit.
 y: np.array
 Labels for supervised learning.

full_hyper_space
()[source]¶ Get a hyperparameter space to use with hyperopt.
Returns:  dict – Parameter space.
 dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba
(X)¶ Get the inclusion probability for each sample.
Parameters: X (np.array) – Feature matrix to predict.
Returns: np.array – Array with the probabilities for each class: one row per sample and two columns (class 0 and class 1).

class
asreview.models.
LSTMPoolModel
(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, lstm_pool_size=128, learn_rate=1.0, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)[source]¶ LSTM pool classifier.
LSTM model consisting of an embedding layer, one LSTM layer, and one max pooling layer.
Parameters:  embedding_matrix (np.array) – Embedding matrix to use with LSTM model.
 backwards (bool) – Whether to have a forward or backward LSTM.
 dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
 optimizer (str) – Optimizer to use.
 lstm_out_width (int) – Output width of the LSTM.
 lstm_pool_size (int) – Size of the pool, must be a divisor of max_sequence_length.
 learn_rate (float) – Learn rate multiplier of default learning rate.
 verbose (int) – Verbosity.
 batch_size (int) – Batch size for the LSTM model.
 epochs (int) – Number of epochs to train the LSTM model.
 shuffle (bool) – Whether to shuffle the data before starting to train.
 class_weight (float) – Class weight for the included papers.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(X, y)[source]¶ Fit the model to the data.
 X: np.array
 Feature matrix to fit.
 y: np.array
 Labels for supervised learning.

full_hyper_space
()[source]¶ Get a hyperparameter space to use with hyperopt.
Returns:  dict – Parameter space.
 dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba
(X)¶ Get the inclusion probability for each sample.
Parameters: X (np.array) – Feature matrix to predict.
Returns: np.array – Array with the probabilities for each class: one row per sample and two columns (class 0 and class 1).

class
asreview.models.
NN2LayerModel
(dense_width=128, optimizer='rmsprop', learn_rate=1.0, regularization=0.01, verbose=0, epochs=35, batch_size=32, shuffle=False, class_weight=30.0)[source]¶ Dense neural network classifier.
Neural network with two hidden, dense layers of the same size.
Parameters:  dense_width (int) – Size of the dense layers.
 optimizer (str) – Name of the Keras optimizer.
 learn_rate (float) – Learning rate multiplier of the default learning rate.
 regularization (float) – Strength of the regularization on the weights and biases.
 verbose (int) – Verbosity of the model mirroring the values for Keras.
 epochs (int) – Number of epochs to train the neural network.
 batch_size (int) – Batch size used for the neural network.
 shuffle (bool) – Whether to shuffle the training data prior to training.
 class_weight (float) – Class weights for inclusions (1’s).

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(X, y)[source]¶ Fit the model to the data.
 X: np.array
 Feature matrix to fit.
 y: np.array
 Labels for supervised learning.

full_hyper_space
()[source]¶ Get a hyperparameter space to use with hyperopt.
Returns:  dict – Parameter space.
 dict – Parameter choices; in case of hyperparameters with a list of choices, store the choices there.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

asreview.models.
list_classifiers
()[source]¶ List available classifiers.
Returns: list – Names of available classifiers in alphabetical order.
Query strategies

class
asreview.query_strategies.
MaxQuery
[source]¶ Maximum sampling query strategy.
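Maximum sampling can be sketched as picking the pool records with the highest predicted inclusion probability. The function `max_query` below is a hypothetical stand-alone illustration that works on plain lists; it is not the MaxQuery implementation.

```python
def max_query(proba_included, pool_idx, n_instances=1):
    """Return the pool indices with the highest inclusion probability."""
    ranked = sorted(pool_idx, key=lambda i: proba_included[i], reverse=True)
    return ranked[:n_instances]


proba = [0.10, 0.80, 0.55, 0.95, 0.30]  # made-up inclusion probabilities
# Record 3 is already labeled, so it is not part of the pool.
print(max_query(proba, pool_idx=[0, 1, 2, 4], n_instances=2))  # [1, 2]
```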

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query
(X, classifier, pool_idx=None, n_instances=1, shared={})¶ Query method for strategies which use class probabilities.


class
asreview.query_strategies.
MixedQuery
(strategy_1='max', strategy_2='random', mix_ratio=0.95, random_state=None, **kwargs)[source]¶ Class for mixed query strategy.
The idea is to use two different query strategies at the same time with a ratio of one to the other.
Parameters:  strategy_1 (str) – Name of the first query strategy.
 strategy_2 (str) – Name of the second query strategy.
 mix_ratio (float) – Portion of queries done by the first strategy. So a mix_ratio of 0.95 means that 95% of the time query strategy 1 is used and 5% of the time query strategy 2.
 **kwargs (dict) – Keyword arguments for the two strategies. To specify which of the strategies an argument is for, prepend the name of the query strategy and an underscore, e.g. ‘max_’ for maximal sampling.
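The role of mix_ratio can be sketched by choosing, for every queried record, strategy 1 with probability mix_ratio and strategy 2 otherwise. This is an assumption about the mechanics made for illustration only; `mixed_query` and its `seed` argument are hypothetical, and the two strategies are hard-coded as maximum and random sampling.

```python
import random


def mixed_query(proba_included, pool_idx, n_instances=1, mix_ratio=0.95, seed=None):
    """Mix maximum sampling (strategy 1) with random sampling (strategy 2)."""
    rng = random.Random(seed)
    remaining = list(pool_idx)
    chosen = []
    source = {}  # which strategy produced which index
    for _ in range(min(n_instances, len(remaining))):
        if rng.random() < mix_ratio:
            # Strategy 1: record with the highest inclusion probability.
            pick = max(remaining, key=lambda i: proba_included[i])
            source[pick] = "max"
        else:
            # Strategy 2: uniformly random record from the pool.
            pick = rng.choice(remaining)
            source[pick] = "random"
        chosen.append(pick)
        remaining.remove(pick)
    return chosen, source


proba = [0.10, 0.80, 0.55, 0.95, 0.30]  # made-up inclusion probabilities
chosen, source = mixed_query(proba, [0, 1, 2, 3, 4], n_instances=3, seed=42)
print(chosen, source)
```

The `source` dictionary plays the same role as the query-source bookkeeping that the shared argument of query is described as carrying.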

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

name
¶ Name of the query strategy, stored as a string.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query
(X, classifier, pool_idx=None, n_instances=1, shared={})[source]¶ Query new instances.
Parameters:  X (np.array) – Feature matrix to choose samples from.
 classifier (SKLearnModel) – Trained classifier to compute probabilities if they are necessary.
 pool_idx (np.array) – Indices of samples that are still in the pool.
 n_instances (int) – Number of instances to query.
 shared (dict) – Dictionary for exchange between query strategies and others. It is mainly used to store the current class probabilities, and the source of the queries; which query strategy has produced which index.

class
asreview.query_strategies.
UncertaintyQuery
[source]¶ Maximum uncertainty query strategy.
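For a binary classifier, maximum uncertainty sampling can be sketched as picking the records whose inclusion probability is closest to 0.5. That uncertainty measure is an assumption for this illustration, and `uncertainty_query` is a hypothetical stand-alone function, not the UncertaintyQuery implementation.

```python
def uncertainty_query(proba_included, pool_idx, n_instances=1):
    """Return the pool indices whose inclusion probability is closest to 0.5."""
    ranked = sorted(pool_idx, key=lambda i: abs(proba_included[i] - 0.5))
    return ranked[:n_instances]


proba = [0.10, 0.80, 0.55, 0.95, 0.30]  # made-up inclusion probabilities
print(uncertainty_query(proba, pool_idx=[0, 1, 2, 3, 4], n_instances=2))  # [2, 4]
```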

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query
(X, classifier, pool_idx=None, n_instances=1, shared={})¶ Query method for strategies which use class probabilities.


class
asreview.query_strategies.
RandomQuery
(random_state=None)[source]¶ Random sampling query strategy.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query
(X, classifier, pool_idx=None, n_instances=1, shared={})¶ Query method for strategies which do not use class probabilities


class
asreview.query_strategies.
ClusterQuery
(cluster_size=350, update_interval=200, random_state=None)[source]¶ Query strategy using clustering algorithms.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query
(X, classifier, pool_idx=None, n_instances=1, shared={})¶ Query method for strategies which use class probabilities.


asreview.query_strategies.
list_query_strategies
()[source]¶ List available query strategies.
This excludes all possible mixed query strategies.
Returns: list – Names of available query strategies in alphabetical order.

asreview.query_strategies.
get_query_model
(method, *args, random_state=None, **kwargs)[source]¶ Get an instance of the query strategy.
Parameters:  method (str) – Name of the query strategy.
 *args – Arguments for the model.
 **kwargs – Keyword arguments for the model.
Returns: BaseQueryModel – Initialized instance of query strategy.

asreview.query_strategies.
get_query_class
(method)[source]¶ Get class of query strategy from its name.
Parameters: method (str) – Name of the query strategy, e.g. ‘max’, ‘uncertainty’, ‘random’. A special mixed query strategy is also possible. The mix is denoted by an underscore: ‘max_random’ or ‘max_uncertainty’.
Returns: BaseQueryModel – Class corresponding to the method name.
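The underscore convention for mixed strategy names can be sketched as a simple split. `split_query_name` and its `known` tuple are hypothetical; the tuple only lists the three names mentioned above, and the real lookup in ASReview may work differently.

```python
def split_query_name(method, known=("max", "random", "uncertainty")):
    """Split a method name such as 'max_random' into its component strategies."""
    parts = method.split("_")
    unknown = [part for part in parts if part not in known]
    if unknown or len(parts) > 2:
        raise ValueError(f"Unknown query strategy: {method!r}")
    return tuple(parts)


print(split_query_name("max"))          # ('max',)
print(split_query_name("max_random"))   # ('max', 'random')
```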
Balance Strategies

class
asreview.balance_strategies.
SimpleBalance
[source]¶ 
default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample
(X, y, train_idx, shared)[source]¶ Function that does not resample the training set.
Parameters:  X (np.array) – Complete matrix of all samples.
 y (np.array) – Classified results of all samples.
 shared (dict) – Extra variables that can be passed around between functions.
Returns:  np.array – Training samples.
 np.array – Classification of training samples.


class
asreview.balance_strategies.
DoubleBalance
(a=2.155, alpha=0.94, b=0.789, beta=1.0, random_state=None)[source]¶ Dynamic Resampling balance strategy.
Class to get the two way rebalancing function and arguments. It supersamples 1’s depending on the number of 0’s and total number of samples in the training data.
Parameters:  a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
 alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
 b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
 beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample
(X, y, train_idx, shared)[source]¶ Resample the training data.
Parameters:  X (np.array) – Complete feature matrix.
 y (np.array) – Labels for all papers.
 train_idx (np.array) – Training indices, that is all papers that have been reviewed.
 shared (dict) – Dictionary to share data between balancing models and other models.
Returns: np.array, np.array – X_train, y_train: the resampled matrix, labels.

class
asreview.balance_strategies.
TripleBalance
(a=2.155, alpha=0.94, b=0.789, beta=1.0, c=0.835, gamma=2.0, shuffle=True, random_state=None)[source]¶ Triple balance strategy.
Class to get the three way rebalancing function and arguments. It divides the data into three groups: 1’s, 0’s from random sampling, and 0’s from max sampling. Thus it only makes sense to use this class in combination with the rand_max query strategy.
Parameters:  a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
 alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
 b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
 beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.
 c (float) – Value between zero and one that governs the weight of samples done with maximal sampling. Higher values mean higher weight.
 gamma (float) – Governs the scaling of the weight of the max samples as a function of the % of papers read. Higher values mean stronger scaling.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample
(X, y, train_idx, shared)[source]¶ Resample the training data.
Parameters:  X (np.array) – Complete feature matrix.
 y (np.array) – Labels for all papers.
 train_idx (np.array) – Training indices, that is all papers that have been reviewed.
 shared (dict) – Dictionary to share data between balancing models and other models.
Returns: np.array, np.array – X_train, y_train: the resampled matrix, labels.

class
asreview.balance_strategies.
UndersampleBalance
(ratio=1.0, random_state=None)[source]¶ Balancing class that undersamples the data with a given ratio.
Parameters: ratio (double) – Undersampling ratio of the zeros. For example, with a ratio of 0.25, only a quarter of the zeros and all the ones are sampled.
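The ratio parameter as described here can be sketched as follows: keep all ones and a random fraction of the zeros. The function `undersample`, its `seed` argument, and the rounding rule are assumptions made for illustration; this is not the UndersampleBalance implementation.

```python
import random


def undersample(y, train_idx, ratio=1.0, seed=None):
    """Keep all 1's and a fraction `ratio` of the 0's among the training indices."""
    rng = random.Random(seed)
    ones = [i for i in train_idx if y[i] == 1]
    zeros = [i for i in train_idx if y[i] == 0]
    # Assumption: round to the nearest integer, never more than available.
    n_zeros = min(len(zeros), round(ratio * len(zeros)))
    kept = ones + rng.sample(zeros, n_zeros)
    rng.shuffle(kept)
    return kept


y = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]  # made-up labels: 2 ones, 8 zeros
kept = undersample(y, train_idx=list(range(10)), ratio=0.25, seed=0)
print(sorted(kept))  # both ones plus 2 of the 8 zeros
```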
default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample
(X, y, train_idx, shared)[source]¶ Resample the training data.
Parameters:  X (np.array) – Complete feature matrix.
 y (np.array) – Labels for all papers.
 train_idx (np.array) – Training indices, that is all papers that have been reviewed.
 shared (dict) – Dictionary to share data between balancing models and other models.
Returns: np.array, np.array – X_train, y_train: the resampled matrix, labels.


asreview.balance_strategies.
list_balance_strategies
()[source]¶ List available balancing strategies.
Returns: list – Names of available balance strategies in alphabetical order.
Feature Extraction

class
asreview.feature_extraction.
Tfidf
(*args, ngram_max=1, **kwargs)[source]¶ Class to apply SKLearn Tfidf to texts.
Parameters: ngram_max (int) – Use n-grams up to length ngram_max. For example, with ngram_max=2, monograms and bigrams can be used.
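What "n-grams up to ngram_max" means can be shown with a small tokenizer. The helper `ngrams_up_to` is hypothetical and uses naive whitespace tokenization; the Tfidf class presumably delegates this to scikit-learn, which is an assumption and not shown here.

```python
def ngrams_up_to(text, ngram_max=1):
    """List all word n-grams of length 1..ngram_max in a text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, ngram_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(ngrams_up_to("Active learning works", ngram_max=2))
# ['active', 'learning', 'works', 'active learning', 'learning works']
```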
default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(texts)[source]¶ Fit the model to the texts.
Implementing this method is not always necessary, for example when no real fitting is done.
Parameters: texts (np.array) – Texts to be fitted.

fit_transform
(texts, titles=None, abstracts=None, keywords=None)¶ Fit and transform a list of texts.
Parameters: texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: np.array – Feature matrix representing the texts.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.


class
asreview.feature_extraction.
Doc2Vec
(*args, vector_size=40, epochs=33, min_count=1, n_jobs=1, window=7, dm_concat=0, dm=2, dbow_words=0, **kwargs)[source]¶ Base class for doc2vec feature extraction.
Requires ‘gensim’ installation.
Parameters:  vector_size (int) – Output size of the vector.
 epochs (int) – Number of epochs to train the doc2vec model.
 min_count (int) – Minimum number of occurrences for a word in the corpus for it to be included in the model.
 n_jobs (int) – Number of threads to train the model with.
 window (int) – Maximum distance over which word vectors influence each other.
 dm_concat (int) – Whether to concatenate word vectors or not. See paper for more detail.
 dm (int) – Model to use. 0: Use distributed bag of words (DBOW). 1: Use distributed memory (DM). 2: Use both of the above with half the vector size and concatenate them.
 dbow_words (int) – Whether to train the word vectors using the skipgram method.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(texts)[source]¶ Fit the model to the texts.
Implementing this method is not always necessary, for example when no real fitting is done.
Parameters: texts (np.array) – Texts to be fitted.

fit_transform
(texts, titles=None, abstracts=None, keywords=None)¶ Fit and transform a list of texts.
Parameters: texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: np.array – Feature matrix representing the texts.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

class
asreview.feature_extraction.
EmbeddingIdf
(*args, embedding_fp=None, random_state=None, **kwargs)[source]¶ Class for EmbeddingIdf model.
This model averages the weighted word vectors of all the words in the text, in order to get a single feature vector for each text. The weights are provided by the inverse document frequencies.
Parameters: embedding_fp (str) – Path to embedding. 
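The IDF-weighted averaging described above can be sketched in plain Python. The helpers `idf_weights` and `embed_text`, the IDF variant log(N / df) + 1, and the toy two-dimensional word vectors are all assumptions for illustration; the EmbeddingIdf class may compute its weights and normalization differently.

```python
import math


def idf_weights(tokenized_texts):
    """Inverse document frequency per word, using log(N / df) + 1."""
    n_docs = len(tokenized_texts)
    doc_freq = {}
    for tokens in tokenized_texts:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}


def embed_text(tokens, word_vectors, idf):
    """IDF-weighted average of the word vectors of one tokenized text."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in tokens:
        if word in word_vectors and word in idf:
            weight = idf[word]
            total = [t + weight * v for t, v in zip(total, word_vectors[word])]
            weight_sum += weight
    # Texts without any known word are mapped to the zero vector.
    return [t / weight_sum for t in total] if weight_sum else total


texts = [["review", "learning"], ["review", "cooking"]]
vectors = {"review": [1.0, 0.0], "learning": [0.0, 1.0]}  # toy embedding
idf = idf_weights(texts)
print(embed_text(texts[0], vectors, idf))
```

Because "review" occurs in both texts, it receives a lower weight than "learning", so the resulting vector leans toward the rarer word.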
default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(texts)¶ Fit the model to the texts.
Implementing this method is not always necessary, for example when no real fitting is done.
Parameters: texts (np.array) – Texts to be fitted.

fit_transform
(texts, titles=None, abstracts=None, keywords=None)¶ Fit and transform a list of texts.
Parameters: texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: np.array – Feature matrix representing the texts.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.


class
asreview.feature_extraction.
EmbeddingLSTM
(*args, loop_sequence=1, num_words=20000, max_sequence_length=1000, padding='post', truncating='post', n_jobs=1, **kwargs)[source]¶ Class to create embedding matrices for LSTM models.
Parameters:  loop_sequence (bool) – Instead of zeros at the start/end of sequence loop it.
 num_words (int) – Maximum number of unique words to be processed.
 max_sequence_length (int) – Maximum length of the sequence. Shorter get struncated. Longer sequences get either padded with zeros or looped.
 padding (str) – Which side should be padded [pre/post].
 truncating – Which side should be truncated [pre/post].
 n_jobs – Number of processors used in reading the embedding matrix.
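How max_sequence_length, padding, truncating, and loop_sequence interact can be sketched for a single token sequence. The function `fit_sequence` is hypothetical; in particular, looping by repeating the sequence from its start is an assumption, and the EmbeddingLSTM class may behave differently in detail.

```python
def fit_sequence(seq, max_sequence_length, padding="post", truncating="post",
                 loop_sequence=False):
    """Bring a list of token ids to exactly max_sequence_length entries."""
    if len(seq) >= max_sequence_length:
        if truncating == "pre":
            # Drop tokens from the start, keep the end.
            return seq[len(seq) - max_sequence_length:]
        # Drop tokens from the end, keep the start.
        return seq[:max_sequence_length]
    if loop_sequence and seq:
        # Repeat the sequence until it is long enough, then cut to size.
        repeats = -(-max_sequence_length // len(seq))  # ceiling division
        return (seq * repeats)[:max_sequence_length]
    fill = [0] * (max_sequence_length - len(seq))
    return fill + seq if padding == "pre" else seq + fill


print(fit_sequence([1, 2, 3, 4, 5], 3))                # [1, 2, 3]
print(fit_sequence([1, 2, 3, 4, 5], 3, truncating="pre"))  # [3, 4, 5]
print(fit_sequence([1, 2], 5))                         # [1, 2, 0, 0, 0]
print(fit_sequence([1, 2], 5, padding="pre"))          # [0, 0, 0, 1, 2]
print(fit_sequence([1, 2], 5, loop_sequence=True))     # [1, 2, 1, 2, 1]
```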

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(texts)¶ Fit the model to the texts.
Implementing this method is not always necessary, for example when no real fitting is done.
Parameters: texts (np.array) – Texts to be fitted.

fit_transform
(texts, titles=None, abstracts=None, keywords=None)¶ Fit and transform a list of texts.
Parameters: texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: np.array – Feature matrix representing the texts.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

class
asreview.feature_extraction.
SBERT
(split_ta=0, use_keywords=0)[source]¶ Sentence BERT class for feature extraction.

default_param
¶ Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit
(texts)¶ Fit the model to the texts.
Implementing this method is not always necessary, for example when no real fitting is done.
Parameters: texts (np.array) – Texts to be fitted.

fit_transform
(texts, titles=None, abstracts=None, keywords=None)¶ Fit and transform a list of texts.
Parameters: texts (np.array) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: np.array – Feature matrix representing the texts.

param
¶ Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.


asreview.feature_extraction.
list_feature_extraction
()[source]¶ List available feature extraction methods.
Returns: list – Names of available feature extraction methods in alphabetical order.

asreview.feature_extraction.
get_feature_model
(method, *args, random_state=None, **kwargs)[source]¶ Get an instance of a feature extraction model from a string.
Parameters:  method (str) – Name of the feature extraction model.
 *args – Arguments for the feature extraction model.
 **kwargs – Keyword arguments for the feature extraction model.
Data¶

class
asreview.
ASReviewData
(df=None, data_name='empty', data_type='standard', column_spec=None)[source]¶ Data object to the dataset with texts, labels, DOIs etc.
Parameters:  df (pd.DataFrame) – Dataframe containing the data for the ASReview data object.
 data_name (str) – Give a name to the data object.
 data_type (str) – What kind of data the dataframe contains.
 column_spec (dict) – Specification for which column corresponds to which standard specification. Key is the standard specification, value is which column it is actually in.

append
(as_data)[source]¶ Append another ASReviewData object.
It puts the training data at the end.
Parameters: as_data (ASReviewData) – Dataset to append.

format_record
(i, by_index=True, *args, **kwargs)[source]¶ Format one record for displaying in the CLI.

classmethod
from_file
(fp, read_fn=None, data_name=None, data_type=None)[source]¶ Create instance from csv/ris/excel file.
It works in two ways: manually, where the conversion functions are supplied, or automatically, where it searches the entry points for the right conversion functions.
Parameters:

fuzzy_find
(keywords, threshold=60, max_return=10, exclude=None, by_index=True)[source]¶ Find a record using keywords.
It looks for keywords in the title/authors/keywords (as far as they are available). Using the difflib package, it creates a ranking based on token set matching.
Parameters:  keywords (str) – A string of keywords together, can be a combination.
 threshold (float) – Don’t return records below this threshold.
 max_return (int) – Maximum number of records to return.
 exclude (list, np.ndarray) – List of indices that should be excluded in the search. You would put papers that were already labeled here for example.
 by_index (bool) – If True, use internal indexing. If False, use record ids for indexing.
Returns: list – Sorted list of indices that best match the keywords.
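The roles of threshold, max_return and exclude can be illustrated with a small ranking function. This is only a sketch under stated assumptions: it scores with SequenceMatcher.ratio() from the standard-library difflib module on a 0 to 100 scale instead of token set matching, it searches a plain list of strings rather than an ASReviewData object, and the name fuzzy_rank is hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_rank(keywords, texts, threshold=60, max_return=10, exclude=None):
    """Rank indices of `texts` by similarity to `keywords` (score 0-100)."""
    excluded = set(exclude or [])
    scored = []
    for idx, text in enumerate(texts):
        if idx in excluded:
            continue
        score = 100 * SequenceMatcher(None, keywords.lower(), text.lower()).ratio()
        # Records scoring below the threshold are never returned.
        if score >= threshold:
            scored.append((score, idx))
    # Highest score first; ties keep the lower index first.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:max_return]]
```

Passing the indices of already labeled papers as exclude keeps them out of the result, as described for the exclude parameter.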

hash
()[source]¶ Compute a hash from the dataset.
Returns: str – SHA1 hash, computed from the titles/abstracts of the dataframe.
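A dataset hash of this kind can be sketched with hashlib. The exact fields, ordering and encoding used by ASReviewData.hash are not stated here, so the UTF-8 encoding, the null-byte separator and the function name dataset_hash are assumptions made for illustration.

```python
import hashlib


def dataset_hash(titles, abstracts):
    """Return a SHA1 hex digest computed from paired titles and abstracts."""
    digest = hashlib.sha1()
    for title, abstract in zip(titles, abstracts):
        # Treat missing values as empty strings; a separator keeps fields apart.
        digest.update((title or "").encode("utf-8"))
        digest.update(b"\x00")
        digest.update((abstract or "").encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()
```

The same titles and abstracts always yield the same 40-character digest, which is what makes such a hash usable as a key for cached data such as a feature matrix.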

prior_data_idx
¶ Get prior_included, prior_excluded from dataset.

prior_labels
(state, by_index=True)[source]¶ Get the labels that are marked as ‘initial’.
Parameters:  state (BaseState) – Open state that contains the label information.
 by_index (bool) – If True, return internal indexing. If False, return record_ids for indexing.
Returns: np.array – Array of indices that have the ‘initial’ property.

record
(i, by_index=True)[source]¶ Create a record from an index.
Parameters: Returns: PaperRecord – The corresponding record if i was an integer, or a list of records if i was an iterable.

slice
(idx)[source]¶ Create a slice from itself.
Useful if some parts should be kept/thrown away.
Parameters: idx (list, np.ndarray) – Record ids that should be kept. Returns: ASReviewData – Slice of itself.

to_csv
(fp, labels=None, ranking=None)[source]¶ Export to csv.
Parameters: Returns: pd.DataFrame – Dataframe of all available record data.

to_dataframe
(labels=None, ranking=None)[source]¶ Create new dataframe with updated label (order).
Parameters: Returns: pd.DataFrame – Dataframe of all available record data.

to_excel
(fp, labels=None, ranking=None)[source]¶ Export to Excel xlsx file.
Parameters: Returns: pd.DataFrame – Dataframe of all available record data.
Utils¶

asreview.
load_embedding
(fp, word_index=None, n_jobs=None)[source]¶ Load embedding matrix from file.
The embedding matrix needs to be stored in the FastText format.
Parameters: Returns: dict – The embedding weights stored in a dict with the word as key and the weights as values.
State¶

asreview.state.
open_state
(fp, *args, read_only=False, **kwargs)[source]¶ Open a state from a file.
Parameters: Returns: BaseState – Depending on the extension the appropriate state is chosen:  [.h5, .hdf5, .he5] > HDF5State.  None > DictState (doesn’t store anything permanently).  Anything else > JSONState.
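The extension rule above can be written out as a small dispatch function. This sketch only returns class names as strings, spelled in the casing of the HDF5State class documented in this reference; it does not open any file, it compares extensions case-sensitively as an assumption, and the name state_class_name is hypothetical.

```python
from pathlib import Path

HDF5_EXTENSIONS = {".h5", ".hdf5", ".he5"}


def state_class_name(fp):
    """Return the name of the state class that the extension rule selects."""
    if fp is None:
        # No file path: an in-memory state that stores nothing permanently.
        return "DictState"
    if Path(fp).suffix in HDF5_EXTENSIONS:
        return "HDF5State"
    # Any other extension falls back to JSON storage.
    return "JSONState"
```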

class
asreview.state.
BaseState
(state_fp, read_only=False)[source]¶ 
add_classification
(idx, labels, methods, query_i)[source]¶ Add training indices and their labels.
Parameters:

add_proba
(pool_idx, train_idx, proba, query_i)[source]¶ Add inverse pool indices and their labels.
Parameters:

close
()[source]¶ Close the files opened by the state.
Also sets the end time if not in read-only mode.

get
(variable, query_i=None, default=None, idx=None)[source]¶ Get data from the state object.
This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.
Parameters:

get_current_queries
()[source]¶ Get the current queries made by the model.
This is useful to get back exactly to the state it was in before shutting down a review.
Returns: dict – The last known queries according to the state file.

get_feature_matrix
(data_hash)[source]¶ Get feature matrix out of the state.
Parameters: data_hash (str) – Hash of as_data object from which the matrix is derived. Returns: np.ndarray or sklearn.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.

pred_proba
¶ Get last predicted probabilities.

restore
(fp)[source]¶ Restore or create state from a state file.
If the state file doesn’t exist, creates an empty state that is ready for storage.
Parameters: fp (str) – Path to file to restore/create.

set_current_queries
(current_queries)[source]¶ Set the current queries made by the model.
Parameters: current_queries (dict) – The last known queries, with {query_idx: query_method}.

set_final_labels
(y)[source]¶ Add/set final labels to state.
If final_labels does not exist yet, add it.
Parameters: y (np.array) – One dimensional integer numpy array with final inclusion labels.

set_labels
(y)[source]¶ Add/set labels to state.
If the labels do not exist, add them to the state.
Parameters: y (np.array) – One dimensional integer numpy array with inclusion labels.

settings
¶ Get settings from state.


class
asreview.state.
HDF5State
(state_fp, read_only=False)[source]¶ Class for storing the review state with HDF5 storage.
Analysis¶

class
asreview.analysis.
Analysis
(states, key=None)[source]¶ Analysis object to do statistical analysis on state files.

avg_time_to_discovery
(result_format='number')[source]¶ Estimate the Time to Discovery (TD) for each paper.
Get the best/last estimate on how long it takes to find a paper.
Parameters: result_format (str) – Desired output format: “number”, “fraction” or “percentage”. Returns: dict – For each inclusion, key=paper_id, value=avg time.
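One way to read the “number” format is as the average position at which each inclusion was labeled across several runs. The sketch below follows that reading; how the library defines time, treats prior knowledge or handles papers that were never found is not stated here, so the input layout and the name mean_discovery_position are assumptions.

```python
def mean_discovery_position(label_orders, inclusions):
    """Average the 1-based position at which each inclusion was labeled.

    label_orders: one list of paper ids per run, in the order they were labeled.
    inclusions: ids of the papers that were included.
    """
    result = {}
    for paper_id in inclusions:
        # Only runs in which the paper was actually labeled contribute.
        positions = [
            order.index(paper_id) + 1 for order in label_orders if paper_id in order
        ]
        if positions:
            result[paper_id] = sum(positions) / len(positions)
    return result
```

For two runs that label papers in the orders [3, 1, 2, 0] and [1, 2, 3, 0], paper 1 is found at positions 2 and 1 and paper 3 at positions 1 and 3.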

classmethod
from_dir
(data_dir, prefix='', key=None)[source]¶ Create an Analysis object from a directory.
Parameters:

classmethod
from_file
(data_fp, key=None)[source]¶ Create an Analysis object from a file.
Parameters:

classmethod
from_path
(data_path, prefix='', key=None)[source]¶ Create an Analysis object from either a file or a directory.

inclusions_found
(result_format='fraction', final_labels=False, **kwargs)[source]¶ Get the number of inclusions at each point in time.
Caching is used so that repeated calls are not expensive.
Parameters: Returns: tuple – Three numpy arrays with x, y, error_bar.

limits
(prob_allow_miss=[0.1], result_format='percentage')[source]¶ For each query, compute the number of papers for a criterion.
A criterion is the average number of papers missed. For example, with 0.1, the criterion is that after reading x papers, there is (about) a 10% chance that one paper is not included. As another example, with 2.0, on average 2 papers are missed after reading x papers. The function returns the value of x for each query and probability.
Parameters: prob_allow_miss (list, float) – Sets the criterion for how many papers can be missed. Returns: dict – One entry, “x_range”, with the number of papers read, and a list, “limits”, of results for each probability at each number of papers read.

Extensions¶

class
asreview.entry_points.
BaseEntryPoint
[source]¶ Base class for defining entry points.