API Reference

Low level API
class asreview.review.BaseReview(as_data, model=None, query_model=None, balance_model=None, feature_model=None, n_papers=None, n_instances=1, n_queries=None, start_idx=[], state_file=None, log_file=None)
Base class for Systematic Review.
Parameters: - as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
- model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
- query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
- balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. They might either resample or undersample specific papers.
- feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
- n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
- n_instances (int) – Number of papers to query at each step in the active learning process.
- n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
- start_idx (numpy.ndarray) – Start the simulation/review with these indices. They are assumed to be already labeled. Failing to do so might result in bad behaviour.
- state_file (str) – Path to state file. Replaces log_file argument.
classify(query_idx, inclusions, state, method=None)
Classify new papers and update the training indices.
It automatically updates the state.
Parameters: - query_idx (list, numpy.ndarray) – Indices to classify.
- inclusions (list, numpy.ndarray) – Labels of the query_idx.
- state (BaseLogger) – Logger to store the classification in.
- method (str) – If not set to None, all inclusions have this query method.
log_probabilities(state)
Store the modeling probabilities of the training indices and pool indices.

n_pool()
Number of indices left in the pool.
Returns: int – Number of indices left in the pool.

query(n_instances, query_model=None)
Query records from pool.
Parameters: - n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
- query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns: numpy.ndarray – Indices of records queried.
review(*args, **kwargs)
Do the systematic review, writing the results to the state file.

settings
Get an ASReview settings object.
class asreview.ReviewSimulate(as_data, *args, n_prior_included=0, n_prior_excluded=0, prior_idx=None, init_seed=None, **kwargs)
ASReview Simulation mode class.
Parameters: - as_data (asreview.ASReviewData) – The data object which contains the text, labels, etc.
- model (BaseModel) – Initialized model to fit the data during active learning. See asreview.models.utils.py for possible models.
- query_model (BaseQueryModel) – Initialized model to query new instances for review, such as random sampling or max sampling. See asreview.query_strategies.utils.py for query models.
- balance_model (BaseBalanceModel) – Initialized model to redistribute the training data during the active learning process. They might either resample or undersample specific papers.
- feature_model (BaseFeatureModel) – Feature extraction model that converts texts and keywords to feature matrices.
- n_prior_included (int) – Sample n prior included papers.
- n_prior_excluded (int) – Sample n prior excluded papers.
- prior_idx (int) – Prior indices by row number.
- n_papers (int) – Number of papers to review during the active learning process, excluding the number of initial priors. To review all papers, set n_papers to None.
- n_instances (int) – Number of papers to query at each step in the active learning process.
- n_queries (int) – Number of steps/queries to perform. Set to None for no limit.
- start_idx (numpy.ndarray) – Start the simulation/review with these indices. They are assumed to be already labeled. Failing to do so might result in bad behaviour.
- init_seed (int) – Seed for setting the prior indices if the --prior_idx option is not used. If the option prior_idx is used with one or more indices, this option is ignored.
- state_file (str) – Path to state file. Replaces log_file argument.
classify(query_idx, inclusions, state, method=None)
Classify new papers and update the training indices.
It automatically updates the state.
Parameters: - query_idx (list, numpy.ndarray) – Indices to classify.
- inclusions (list, numpy.ndarray) – Labels of the query_idx.
- state (BaseLogger) – Logger to store the classification in.
- method (str) – If not set to None, all inclusions have this query method.
log_probabilities(state)
Store the modeling probabilities of the training indices and pool indices.

n_pool()
Number of indices left in the pool.
Returns: int – Number of indices left in the pool.

query(n_instances, query_model=None)
Query records from pool.
Parameters: - n_instances (int) – Batch size of the queries, i.e. number of records to be queried.
- query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns: numpy.ndarray – Indices of records queried.
review(*args, **kwargs)
Do the systematic review, writing the results to the state file.

settings
Get an ASReview settings object.

statistics()
Get statistics on the current state of the review.
Returns: dict – A dictionary with statistics like n_included and last_inclusion.

train()
Train the model.
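A minimal simulation could look like the following sketch. The dataset path and state file name are hypothetical placeholders, and it is assumed that the default models are used when none are passed explicitly.

    import asreview

    # "labeled_dataset.csv" is a hypothetical, fully labeled dataset.
    as_data = asreview.ASReviewData.from_file("labeled_dataset.csv")

    reviewer = asreview.ReviewSimulate(
        as_data,
        n_prior_included=1,   # sample one known inclusion as prior knowledge
        n_prior_excluded=1,   # sample one known exclusion as prior knowledge
        n_instances=1,        # label one record per query
        state_file="simulation.h5",
    )
    reviewer.review()         # results are written to the state file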
Classifiers
class asreview.models.classifiers.NaiveBayesClassifier(alpha=3.822)
Naive Bayes classifier.
Only works in combination with the asreview.models.feature_extraction.Tfidf feature extraction model. Though relatively simplistic, it seems to work quite well on a wide range of datasets. The naive Bayes classifier is an implementation based on the sklearn multinomial naive Bayes classifier.
Parameters: alpha (float, default=3.822) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value
fit(X, y)
Fit the model to the data.
Parameters: - X (numpy.ndarray) – Feature matrix to fit.
- y (numpy.ndarray) – Labels for supervised learning.

full_hyper_space()
Get a hyperparameter space to use with hyperopt.
Returns: dict, dict – Parameter space and parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dict.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba(X)
Get the inclusion probability for each sample.
Parameters: X (numpy.ndarray) – Feature matrix to predict.
Returns: numpy.ndarray – Array with the probabilities for each class, with two columns (class 0 and class 1) and one row per sample.
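Example usage, as a sketch with toy texts and labels:

    import numpy as np
    from asreview.models.classifiers import NaiveBayesClassifier
    from asreview.models.feature_extraction import Tfidf

    texts = np.array([
        "systematic review of machine learning",
        "unrelated clinical case report",
        "active learning for abstract screening",
    ])
    y = np.array([1, 0, 1])

    X = Tfidf().fit_transform(texts)   # sparse TF-IDF feature matrix
    nb = NaiveBayesClassifier(alpha=3.822)
    nb.fit(X, y)
    proba = nb.predict_proba(X)        # columns: (class 0, class 1)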
class asreview.models.classifiers.RandomForestClassifier(n_estimators=100, max_features=10, class_weight=1.0, random_state=None)
Random Forest classifier.
The Random Forest classifier is an implementation based on the sklearn Random Forest classifier.
Parameters: - n_estimators (int, default=100) – The number of trees in the forest.
- max_features (int, default=10) – Number of features in the model.
- class_weight (float, default=1.0) – Class weight of the inclusions.
- random_state (int or RandomState, default=None) – Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(X, y)
Fit the model to the data.
Parameters: - X (numpy.ndarray) – Feature matrix to fit.
- y (numpy.ndarray) – Labels for supervised learning.

full_hyper_space()
Get a hyperparameter space to use with hyperopt.
Returns: dict, dict – Parameter space and parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dict.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba(X)
Get the inclusion probability for each sample.
Parameters: X (numpy.ndarray) – Feature matrix to predict.
Returns: numpy.ndarray – Array with the probabilities for each class, with two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.SVMClassifier(gamma='auto', class_weight=0.249, C=15.4, kernel='linear', random_state=None)
Support Vector Machine classifier.
The Support Vector Machine classifier is an implementation based on the sklearn Support Vector Machine classifier.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(X, y)
Fit the model to the data.
Parameters: - X (numpy.ndarray) – Feature matrix to fit.
- y (numpy.ndarray) – Labels for supervised learning.

full_hyper_space()
Get a hyperparameter space to use with hyperopt.
Returns: dict, dict – Parameter space and parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dict.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba(X)
Get the inclusion probability for each sample.
Parameters: X (numpy.ndarray) – Feature matrix to predict.
Returns: numpy.ndarray – Array with the probabilities for each class, with two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.LogisticClassifier(C=1.0, class_weight=1.0, random_state=None, n_jobs=1)
Logistic regression classifier.
The logistic regression classifier is an implementation based on the sklearn logistic regression classifier.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(X, y)
Fit the model to the data.
Parameters: - X (numpy.ndarray) – Feature matrix to fit.
- y (numpy.ndarray) – Labels for supervised learning.

full_hyper_space()
Get a hyperparameter space to use with hyperopt.
Returns: dict, dict – Parameter space and parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dict.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba(X)
Get the inclusion probability for each sample.
Parameters: X (numpy.ndarray) – Feature matrix to predict.
Returns: numpy.ndarray – Array with the probabilities for each class, with two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.LSTMBaseClassifier(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, learn_rate=1.0, dense_width=128, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)
LSTM base classifier.
LSTM model that consists of an embedding layer, LSTM layer with one output, dense layer, and a single sigmoid output node. Use the asreview.models.feature_extraction.EmbeddingLSTM feature extraction method. Currently not so well optimized and slow.
Note: This model requires tensorflow. Install it with pip install tensorflow, or install all optional ASReview dependencies with pip install asreview[all].
Parameters: - embedding_matrix (numpy.ndarray) – Embedding matrix to use with LSTM model.
- backwards (bool) – Whether to have a forward or backward LSTM.
- dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
- optimizer (str) – Optimizer to use.
- lstm_out_width (int) – Output width of the LSTM.
- learn_rate (float) – Learn rate multiplier of default learning rate.
- dense_width (int) – Size of the dense layer of the model.
- verbose (int) – Verbosity.
- batch_size (int) – Size of the batch size for the LSTM model.
- epochs (int) – Number of epochs to train the LSTM model.
- shuffle (bool) – Whether to shuffle the data before starting to train.
- class_weight (float) – Class weight for the included papers.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(X, y)
Fit the model to the data.
Parameters: - X (numpy.ndarray) – Feature matrix to fit.
- y (numpy.ndarray) – Labels for supervised learning.

full_hyper_space()
Get a hyperparameter space to use with hyperopt.
Returns: dict, dict – Parameter space and parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dict.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba(X)
Get the inclusion probability for each sample.
Parameters: X (numpy.ndarray) – Feature matrix to predict.
Returns: numpy.ndarray – Array with the probabilities for each class, with two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.LSTMPoolClassifier(embedding_matrix=None, backwards=True, dropout=0.4, optimizer='rmsprop', lstm_out_width=20, lstm_pool_size=128, learn_rate=1.0, verbose=0, batch_size=32, epochs=35, shuffle=False, class_weight=30.0)
LSTM pool classifier.
LSTM model that consists of an embedding layer, LSTM layer with many outputs, max pooling layer, and a single sigmoid output node. Use the asreview.models.feature_extraction.EmbeddingLSTM feature extraction method. Currently not so well optimized and slow.
Note: This model requires tensorflow. Install it with pip install tensorflow, or install all optional ASReview dependencies with pip install asreview[all].
Parameters: - embedding_matrix (numpy.ndarray) – Embedding matrix to use with LSTM model.
- backwards (bool) – Whether to have a forward or backward LSTM.
- dropout (float) – Value in [0, 1.0) that gives the dropout and recurrent dropout rate for the LSTM model.
- optimizer (str) – Optimizer to use.
- lstm_out_width (int) – Output width of the LSTM.
- lstm_pool_size (int) – Size of the pool, must be a divisor of max_sequence_length.
- learn_rate (float) – Learn rate multiplier of default learning rate.
- verbose (int) – Verbosity.
- batch_size (int) – Size of the batch size for the LSTM model.
- epochs (int) – Number of epochs to train the LSTM model.
- shuffle (bool) – Whether to shuffle the data before starting to train.
- class_weight (float) – Class weight for the included papers.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(X, y)
Fit the model to the data.
Parameters: - X (numpy.ndarray) – Feature matrix to fit.
- y (numpy.ndarray) – Labels for supervised learning.

full_hyper_space()
Get a hyperparameter space to use with hyperopt.
Returns: dict, dict – Parameter space and parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dict.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba(X)
Get the inclusion probability for each sample.
Parameters: X (numpy.ndarray) – Feature matrix to predict.
Returns: numpy.ndarray – Array with the probabilities for each class, with two columns (class 0 and class 1) and one row per sample.
class asreview.models.classifiers.NN2LayerClassifier(dense_width=128, optimizer='rmsprop', learn_rate=1.0, regularization=0.01, verbose=0, epochs=35, batch_size=32, shuffle=False, class_weight=30.0)
Dense neural network classifier.
Neural network with two hidden, dense layers of the same size. Recommended feature extraction model is asreview.models.feature_extraction.Doc2Vec.
Note: This model requires tensorflow. Install it with pip install tensorflow, or install all optional ASReview dependencies with pip install asreview[all].
Warning: Might crash on some systems with limited memory in combination with asreview.models.feature_extraction.Tfidf.
Parameters: - dense_width (int) – Size of the dense layers.
- optimizer (str) – Name of the Keras optimizer.
- learn_rate (float) – Learning rate multiplier of the default learning rate.
- regularization (float) – Strength of the regularization on the weights and biases.
- verbose (int) – Verbosity of the model mirroring the values for Keras.
- epochs (int) – Number of epochs to train the neural network.
- batch_size (int) – Batch size used for the neural network.
- shuffle (bool) – Whether to shuffle the training data prior to training.
- class_weight (float) – Class weights for inclusions (1’s).
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(X, y)
Fit the model to the data.
Parameters: - X (numpy.ndarray) – Feature matrix to fit.
- y (numpy.ndarray) – Labels for supervised learning.

full_hyper_space()
Get a hyperparameter space to use with hyperopt.
Returns: dict, dict – Parameter space and parameter choices; for hyperparameters with a list of choices, the choices are stored in the second dict.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

predict_proba(X)
Get the inclusion probability for each sample.
Parameters: X (numpy.ndarray) – Feature matrix to predict.
Returns: numpy.ndarray – Array with the probabilities for each class, with two columns (class 0 and class 1) and one row per sample.
asreview.models.classifiers.list_classifiers()
List available classifiers.
Returns: list – Names of available classifiers in alphabetical order.

asreview.models.classifiers.get_classifier(name, *args, random_state=None, **kwargs)
Get an instance of a model from a string.
Parameters: - name (str) – Name of the model.
- *args – Arguments for the model.
- **kwargs – Keyword arguments for the model.
Returns: BaseModel – Initialized instance of a classifier.
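Example, as a sketch; the short name "nb" is assumed to be the registered name of the naive Bayes classifier:

    from asreview.models.classifiers import get_classifier, list_classifiers

    print(list_classifiers())                  # names of all registered classifiers
    model = get_classifier("nb", alpha=3.822)  # "nb" assumed to map to NaiveBayesClassifier
    print(model.param)                         # the assigned parameters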
Query
class asreview.models.query.MaxQuery
Maximum sampling query strategy.
Choose the most likely samples to be included according to the model.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query(X, classifier, pool_idx=None, n_instances=1, shared={})
Query method for strategies which use class probabilities.
class asreview.models.query.MixedQuery(strategy_1='max', strategy_2='random', mix_ratio=0.95, random_state=None, **kwargs)
Class for the mixed query strategy.
Use two different query strategies at the same time, mixed in a given ratio. For example, mixing max and random sampling with a mix ratio of 0.95 means that at each query 95% of the instances are sampled with the max query strategy, after which the remaining 5% are sampled with the random query strategy. This combination is called the max_random query strategy. Every combination of primitive query strategies is possible.
Parameters: - strategy_1 (str) – Name of the first query strategy.
- strategy_2 (str) – Name of the second query strategy.
- mix_ratio (float) – Portion of queries done by the first strategy. So a mix_ratio of 0.95 means that 95% of the time query strategy 1 is used and 5% of the time query strategy 2.
- **kwargs (dict) – Keyword arguments for the two strategies. To specify which of the strategies the argument is for, prepend it with the name of the query strategy and an underscore, e.g. 'max' for maximal sampling.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

name
Name of the mixed query strategy, e.g. 'max_random'.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query(X, classifier, pool_idx=None, n_instances=1, shared={})
Query new instances.
Parameters: - X (numpy.ndarray) – Feature matrix to choose samples from.
- classifier (SKLearnModel) – Trained classifier to compute probabilities if they are necessary.
- pool_idx (numpy.ndarray) – Indices of samples that are still in the pool.
- n_instances (int) – Number of instances to query.
- shared (dict) – Dictionary for exchange between query strategies and others. It is mainly used to store the current class probabilities, and the source of the queries; which query strategy has produced which index.
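Example, as a sketch; it assumes keyword arguments such as mix_ratio are forwarded to the mixed strategy by get_query_model:

    from asreview.models.query import get_query_model

    # 95% max sampling, 5% random sampling.
    querier = get_query_model("max_random", mix_ratio=0.95)
    print(querier.name)   # "max_random"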
class asreview.models.query.UncertaintyQuery
Maximum uncertainty query strategy.
Choose the most uncertain samples according to the model (i.e. closest to 0.5 probability). Doesn't work very well in the case of LSTMs, since their probabilities are rather arbitrary.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query(X, classifier, pool_idx=None, n_instances=1, shared={})
Query method for strategies which use class probabilities.
class asreview.models.query.RandomQuery(random_state=None)
Random sampling query strategy.
Randomly select samples with no regard to model assigned probabilities.
Warning: Selecting this option means your review is not going to be accelerated by ASReview.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query(X, classifier, pool_idx=None, n_instances=1, shared={})
Query method for strategies which do not use class probabilities.
class asreview.models.query.ClusterQuery(cluster_size=350, update_interval=200, random_state=None)
Query strategy using clustering algorithms.
Use clustering after feature extraction on the dataset. Then the highest probabilities within random clusters are sampled.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

query(X, classifier, pool_idx=None, n_instances=1, shared={})
Query method for strategies which use class probabilities.
asreview.models.query.list_query_strategies()
List available query strategies.
This excludes all possible mixed query strategies.
Returns: list – Names of available query strategies in alphabetical order.

asreview.models.query.get_query_model(name, *args, random_state=None, **kwargs)
Get an instance of the query strategy.
Parameters: - name (str) – Name of the query strategy.
- *args – Arguments for the model.
- **kwargs – Keyword arguments for the model.
Returns: asreview.query.base.BaseQueryModel – Initialized instance of query strategy.
asreview.models.query.get_query_class(name)
Get the class of a query strategy from its name.
Parameters: name (str) – Name of the query strategy, e.g. 'max', 'uncertainty', 'random'. A special mixed query strategy is also possible. The mix is denoted by an underscore: 'max_random' or 'max_uncertainty'.
Returns: class – Class corresponding to the given name.
Balance
class asreview.models.balance.SimpleBalance
No balancing.
Use all training data.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample(X, y, train_idx, shared)
Function that does not resample the training set.
Parameters: - X (numpy.ndarray) – Complete matrix of all samples.
- y (numpy.ndarray) – Classified results of all samples.
- shared (dict) – Extra variables that can be passed around between functions.
Returns: - numpy.ndarray – Training samples.
- numpy.ndarray – Classification of training samples.
class asreview.models.balance.DoubleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, random_state=None)
Dynamic Resampling balance strategy.
Class to get the two-way rebalancing function and arguments. It supersamples ones depending on the number of 0's and the total number of samples in the training data.
Parameters: - a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
- alpha (float) – Governs the scaling the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
- b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
- beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample(X, y, train_idx, shared)
Resample the training data.
Parameters: - X (numpy.ndarray) – Complete feature matrix.
- y (numpy.ndarray) – Labels for all papers.
- train_idx (numpy.ndarray) – Training indices, that is all papers that have been reviewed.
- shared (dict) – Dictionary to share data between balancing models and other models.
Returns: numpy.ndarray, numpy.ndarray – X_train, y_train: the resampled matrix and labels.
class asreview.models.balance.TripleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, c=0.835, gamma=2.0, shuffle=True, random_state=None)
Triple balance strategy.
This divides the training data into three sets: included papers, excluded papers found with random sampling and papers found with max sampling. They are balanced according to formulas depending on the percentage of papers read in the dataset, the number of papers with random/max sampling etc. Works best for stochastic training algorithms. Reduces to both full sampling and undersampling with corresponding parameters.
Parameters: - a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
- alpha (float) – Governs the scaling the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
- b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
- beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.
- c (float) – Value between one and zero that governs the weight of samples done with maximal sampling. Higher values mean higher weight.
- gamma (float) – Governs the scaling of the weight of the max samples as a function of the % of papers read. Higher values mean stronger scaling.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample(X, y, train_idx, shared)
Resample the training data.
Parameters: - X (numpy.ndarray) – Complete feature matrix.
- y (numpy.ndarray) – Labels for all papers.
- train_idx (numpy.ndarray) – Training indices, that is all papers that have been reviewed.
- shared (dict) – Dictionary to share data between balancing models and other models.
Returns: numpy.ndarray, numpy.ndarray – X_train, y_train: the resampled matrix and labels.
class asreview.models.balance.UndersampleBalance(ratio=1.0, random_state=None)
Balancing class that undersamples the data with a given ratio.
This undersamples the data, leaving out excluded papers so that the included and excluded papers are in some particular ratio (closer to one).
Parameters: ratio (double) – Undersampling ratio of the zeros. If, for example, we set a ratio of 0.25, we would sample only a quarter of the zeros and all of the ones.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

sample(X, y, train_idx, shared)
Resample the training data.
Parameters: - X (numpy.ndarray) – Complete feature matrix.
- y (numpy.ndarray) – Labels for all papers.
- train_idx (numpy.ndarray) – Training indices, that is all papers that have been reviewed.
- shared (dict) – Dictionary to share data between balancing models and other models.
Returns: numpy.ndarray, numpy.ndarray – X_train, y_train: the resampled matrix and labels.
asreview.models.balance.list_balance_strategies()
List available balancing strategies.
Returns: list – Names of available balance strategies in alphabetical order.

asreview.models.balance.get_balance_model(name, *args, random_state=None, **kwargs)
Get an instance of a balance model from a string.
Parameters: - name (str) – Name of the balance model.
- *args – Arguments for the balance model.
- **kwargs – Keyword arguments for the balance model.
Returns: BaseBalanceModel – Initialized instance of a balance model.
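Example, as a sketch; "double" is assumed to be the registered name of DoubleBalance:

    from asreview.models.balance import get_balance_model, list_balance_strategies

    print(list_balance_strategies())   # registered balance strategy names
    balance_model = get_balance_model("double", a=2.155, alpha=0.94)
    print(balance_model.param)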
Feature extraction
class asreview.models.feature_extraction.Tfidf(*args, ngram_max=1, stop_words='english', **kwargs)
Class to apply TF-IDF to texts.
Use the standard TF-IDF (Term Frequency-Inverse Document Frequency) feature extraction from SKLearn. Gives a sparse matrix as output. Works well in combination with asreview.models.classifiers.NaiveBayesClassifier and other fast-training models (given that the feature vectors are relatively wide).

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value
fit(texts)
Fit the model to the texts.
It is not always necessary to implement this if there is no real fitting being done.
Parameters: texts (numpy.ndarray) – Texts to be fitted.

fit_transform(texts, titles=None, abstracts=None, keywords=None)
Fit and transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

transform(texts)
Transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.Doc2Vec(*args, vector_size=40, epochs=33, min_count=1, n_jobs=1, window=7, dm_concat=0, dm=2, dbow_words=0, **kwargs)
Base class for doc2vec feature extraction.
Feature extraction method provided by the gensim package. It takes relatively long to create a feature matrix with this method. However, this only has to be done once per simulation/review. The upside of this method is the dimension reduction that generally takes place, which makes the modelling quicker.
Note: This feature extraction algorithm requires gensim. Install it with pip install gensim, or install all optional ASReview dependencies with pip install asreview[all].
Parameters: - vector_size (int) – Output size of the vector.
- epochs (int) – Number of epochs to train the doc2vec model.
- min_count (int) – Minimum number of occurrences for a word in the corpus for it to be included in the model.
- n_jobs (int) – Number of threads to train the model with.
- window (int) – Maximum distance over which word vectors influence each other.
- dm_concat (int) – Whether to concatenate word vectors or not. See paper for more detail.
- dm (int) – Model to use. 0: Use distributed bag of words (DBOW). 1: Use distributed memory (DM). 2: Use both of the above with half the vector size and concatenate them.
- dbow_words (int) – Whether to train the word vectors using the skipgram method.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(texts)
Fit the model to the texts.
It is not always necessary to implement this if there is no real fitting being done.
Parameters: texts (numpy.ndarray) – Texts to be fitted.

fit_transform(texts, titles=None, abstracts=None, keywords=None)
Fit and transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

transform(texts)
Transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.EmbeddingIdf(*args, embedding_fp=None, random_state=None, **kwargs)
Class for the Embedding-Idf model.
This model averages the weighted word vectors of all the words in the text, in order to get a single feature vector for each text. The weights are provided by the inverse document frequencies.
Note: This feature extraction algorithm requires tensorflow. Install it with pip install tensorflow, or install all optional ASReview dependencies with pip install asreview[all].
Parameters: embedding_fp (str) – Path to embedding.

default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value
fit(texts)
Fit the model to the texts.
It is not always necessary to implement this if there is no real fitting being done.
Parameters: texts (numpy.ndarray) – Texts to be fitted.

fit_transform(texts, titles=None, abstracts=None, keywords=None)
Fit and transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

transform(texts)
Transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.EmbeddingLSTM(*args, loop_sequence=1, num_words=20000, max_sequence_length=1000, padding='post', truncating='post', n_jobs=1, **kwargs)
Class to create embedding matrices for LSTM models.
Feature extraction method for the asreview.models.classifiers.LSTMBaseClassifier and asreview.models.classifiers.LSTMPoolClassifier models.
Note: This feature extraction algorithm requires tensorflow. Install it with pip install tensorflow, or install all optional ASReview dependencies with pip install asreview[all].
Parameters: - loop_sequence (bool) – Instead of padding the start/end of the sequence with zeros, loop it.
- num_words (int) – Maximum number of unique words to be processed.
- max_sequence_length (int) – Maximum length of the sequence. Longer sequences get truncated; shorter sequences get either padded with zeros or looped.
- padding (str) – Which side should be padded [pre/post].
- truncating – Which side should be truncated [pre/post].
- n_jobs – Number of processors used in reading the embedding matrix.
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(texts)
Fit the model to the texts.
It is not always necessary to implement this if there is no real fitting being done.
Parameters: texts (numpy.ndarray) – Texts to be fitted.

fit_transform(texts, titles=None, abstracts=None, keywords=None)
Fit and transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

transform(texts)
Transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.
class asreview.models.feature_extraction.SBERT(split_ta=0, use_keywords=0)
Sentence BERT class for feature extraction.
Feature extraction method based on Sentence BERT. Implementation based on the sentence_transformers package. It is relatively slow.
Note: This feature extraction algorithm requires sentence_transformers. Install it with pip install sentence_transformers, or install all optional ASReview dependencies with pip install asreview[all].
default_param
Get the default parameters of the model.
Returns: dict – Dictionary with parameter: default value

fit(texts)
Fit the model to the texts.
It is not always necessary to implement this if there is no real fitting being done.
Parameters: texts (numpy.ndarray) – Texts to be fitted.

fit_transform(texts, titles=None, abstracts=None, keywords=None)
Fit and transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.

param
Get the (assigned) parameters of the model.
Returns: dict – Dictionary with parameter: current value.

transform(texts)
Transform a list of texts.
Parameters: texts (numpy.ndarray) – A sequence of texts to be transformed. They are not yet tokenized.
Returns: numpy.ndarray – Feature matrix representing the texts.
asreview.models.feature_extraction.list_feature_extraction()
List available feature extraction methods.
Returns: list – Names of available feature extraction methods in alphabetical order.

asreview.models.feature_extraction.get_feature_model(name, *args, random_state=None, **kwargs)
Get an instance of a feature extraction model from a string.
Parameters: - name (str) – Name of the feature extraction model.
- *args – Arguments for the feature extraction model.
- **kwargs – Keyword arguments for the feature extraction model.
Returns: BaseFeatureExtraction – Initialized instance of feature extraction algorithm.
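Example, as a sketch; "tfidf" is assumed to be the registered name of the Tfidf model, and the texts are toy placeholders:

    import numpy as np
    from asreview.models.feature_extraction import get_feature_model

    texts = np.array(["first toy abstract", "second toy abstract"])
    feature_model = get_feature_model("tfidf", ngram_max=2)
    X = feature_model.fit_transform(texts)   # feature matrix for the texts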
Data
class asreview.ASReviewData(df=None, data_name='empty', data_type='standard', column_spec=None)
Data object for the dataset with texts, labels, DOIs, etc.
Parameters: - df (pandas.DataFrame) – Dataframe containing the data for the ASReview data object.
- data_name (str) – Give a name to the data object.
- data_type (str) – What kind of data the dataframe contains.
- column_spec (dict) – Specification of which column corresponds to which standard specification. The key is the standard specification; the value is the column it is actually in.
append(as_data)
Append another ASReviewData object.
It puts the training data at the end.
Parameters: as_data (ASReviewData) – Dataset to append.
classmethod from_file(fp, read_fn=None, data_name=None, data_type=None)
Create instance from csv/ris/excel file.
It works in two ways; either manual control where the conversion functions are supplied or automatic, where it searches in the entry points for the right conversion functions.
Parameters: - fp (str, pathlib.Path) – Read the data from this file.
- read_fn (callable) – Function to read the file. It should return a standardized dataframe.
- data_name (str) – Name of the data.
- data_type (str) – What kind of data it is. Special names: ‘included’, ‘excluded’, ‘prior’.
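Example, as a sketch; the file paths are hypothetical placeholders:

    from asreview import ASReviewData

    as_data = ASReviewData.from_file("references.ris")
    as_data.to_csv("references.csv")   # re-export to another supported format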
hash()
Compute a hash from the dataset.
Returns: str – SHA1 hash, computed from the titles/abstracts of the dataframe.

prior_data_idx
Get prior_included, prior_excluded from dataset.
prior_labels(state, by_index=True)
Get the labels that are marked as 'initial'.
Parameters: - state (BaseState) – Open state that contains the label information.
- by_index (bool) – If True, return internal indexing. If False, return record_ids for indexing.
Returns: numpy.ndarray – Array of indices that have the ‘initial’ property.
record(i, by_index=True)
Create a record from an index.
Returns: PaperRecord – The corresponding record if i was an integer, or a list of records if i was an iterable.
slice(idx, by_index=True)
Create a slice from itself.
Useful if some parts should be kept/thrown away.
Parameters: idx (list, numpy.ndarray) – Record ids that should be kept. Returns: ASReviewData – Slice of itself.
to_csv(fp, labels=None, ranking=None)
Export to csv.
Parameters: - fp (str, NoneType) – Filepath or None for buffer.
- labels (list, numpy.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
- ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns: pandas.DataFrame – Dataframe of all available record data.
to_dataframe(labels=None, ranking=None)
Create new dataframe with updated label (order).
Parameters: - labels (list, numpy.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
- ranking (list) – Reorder the dataframe according to these record_ids. Default ordering if ranking is None.
Returns: pandas.DataFrame – Dataframe of all available record data.
to_excel(fp, labels=None, ranking=None)
Export to Excel xlsx file.
Parameters: - fp (str, NoneType) – Filepath or None for buffer.
- labels (list, numpy.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
- ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns: pandas.DataFrame – Dataframe of all available record data.
to_file(fp, labels=None, ranking=None)
Export data object to file.
RIS, CSV and Excel are supported file formats at the moment.
Parameters: - fp (str) – Filepath to export to.
- labels (list, numpy.ndarray) – Labels to be inserted into the dataframe before export.
- ranking (list, numpy.ndarray) – Optionally, dataframe rows can be reordered.
Utils

asreview.load_embedding(fp, word_index=None, n_jobs=None)
Load embedding matrix from file.
The embedding matrix needs to be stored in the FastText format.
Returns: dict – The embedding weights stored in a dict with the word as key and the weights as values.
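Example, as a sketch; the embedding file path is a hypothetical placeholder, and the lookup assumes the word occurs in the vocabulary:

    from asreview import load_embedding

    embedding = load_embedding("wiki.en.vec", n_jobs=4)
    vector = embedding["systematic"]   # word -> embedding weights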
State

asreview.state.open_state(fp, *args, read_only=False, **kwargs)
Open a state from a file.
Returns: BaseState – Depending on the extension, the appropriate state class is chosen:
- [.h5, .hdf5, .he5] -> HDF5State
- None -> DictState (doesn't store anything permanently)
- anything else -> JSONState
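Example, as a sketch; it assumes a state file from an earlier run exists and that open_state is used as a context manager:

    from asreview.state import open_state

    with open_state("simulation.h5", read_only=True) as state:
        print(state.settings)      # settings stored in the state
        print(state.pred_proba)    # last predicted probabilities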
class asreview.state.BaseState(state_fp, read_only=False)

add_classification(idx, labels, methods, query_i)
Add training indices and their labels.
Parameters: - idx (list, numpy.ndarray) – A list of indices used for training.
- labels (list) – A list of labels corresponding with the training indices.
- query_i (int) – The query number.

add_proba(pool_idx, train_idx, proba, query_i)
Add inverse pool indices and their labels.
Parameters: - pool_idx (list, numpy.ndarray) – A list of indices used for the unlabeled pool.
- proba (numpy.ndarray) – Array of prediction probabilities for the unlabeled pool.
- query_i (int) – The query number.

close()
Close the files opened by the state.
Also sets the end time if not in read-only mode.
get(variable, query_i=None, default=None, idx=None)
Get data from the state object.
This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.
Parameters: - variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba , train_idx, pool_idx.
- query_i (int) – Query number, should be between 0 and self.n_queries().
- idx (int, numpy.ndarray,list) – Indices to get in the returned array.
get_current_queries()
Get the current queries made by the model.
This is useful to get back exactly to the state it was in before shutting down a review.
Returns: dict – The last known queries according to the state file.

get_feature_matrix(data_hash)
Get feature matrix out of the state.
Parameters: data_hash (str) – Hash of the as_data object from which the matrix is derived.
Returns: numpy.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.

pred_proba
Get last predicted probabilities.
restore(fp)
Restore or create state from a state file.
If the state file doesn't exist, creates an empty state that is ready for storage.
Parameters: fp (str) – Path to file to restore/create.
set_current_queries(current_queries)
Set the current queries made by the model.
Parameters: current_queries (dict) – The last known queries, with {query_idx: query_method}.

set_final_labels(y)
Add/set final labels to state.
If final_labels does not exist yet, add it.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with final inclusion labels.

set_labels(y)
Add/set labels to state.
If the labels do not exist, add them to the state.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with inclusion labels.

settings
Get settings from state.
class asreview.state.HDF5State(state_fp, read_only=False)
Class for storing the review state with HDF5 storage.
add_classification(idx, labels, methods, query_i)
Add training indices and their labels.
Parameters: - idx (list, numpy.ndarray) – A list of indices used for training.
- labels (list) – A list of labels corresponding with the training indices.
- query_i (int) – The query number.

add_proba(pool_idx, train_idx, proba, query_i)
Add inverse pool indices and their labels.
Parameters: - pool_idx (list, numpy.ndarray) – A list of indices used for the unlabeled pool.
- proba (numpy.ndarray) – Array of prediction probabilities for the unlabeled pool.
- query_i (int) – The query number.

close()
Close the files opened by the state.
Also sets the end time if not in read-only mode.
get(variable, query_i=None, idx=None)
Get data from the state object.
This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.
Parameters: - variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba , train_idx, pool_idx.
- query_i (int) – Query number, should be between 0 and self.n_queries().
- idx (int, numpy.ndarray,list) – Indices to get in the returned array.
get_current_queries()
Get the current queries made by the model.
This is useful to get back exactly to the state it was in before shutting down a review.
Returns: dict – The last known queries according to the state file.

get_feature_matrix(data_hash)
Get feature matrix out of the state.
Parameters: data_hash (str) – Hash of the as_data object from which the matrix is derived.
Returns: numpy.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.

is_empty()
Check if state has no results.
Returns: bool – True if empty.

pred_proba
Get last predicted probabilities.
restore(fp)
Restore or create state from a state file.
If the state file doesn't exist, creates an empty state that is ready for storage.
Parameters: fp (str) – Path to file to restore/create.
set_current_queries(current_queries)
Set the current queries made by the model.
Parameters: current_queries (dict) – The last known queries, with {query_idx: query_method}.

set_final_labels(y)
Add/set final labels to state.
If final_labels does not exist yet, add it.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with final inclusion labels.

set_labels(y)
Add/set labels to state.
If the labels do not exist, add them to the state.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with inclusion labels.

settings
Get settings from state.
startup_vals()
Get variables for reviewer to continue review.
Returns: - numpy.ndarray – Current labels of dataset.
- numpy.ndarray – Current training indices.
- dict – Dictionary containing the sources of the labels.
- query_i – Current query number (starting from 0).

to_dict()
Convert state to dictionary.
Returns: dict – Dictionary with all relevant variables.
class asreview.state.JSONState(state_fp, read_only=False)
Class for storing the state of a Systematic Review using JSON files.
add_classification(idx, labels, methods, query_i)
Add training indices and their labels.
Parameters: - idx (list, numpy.ndarray) – A list of indices used for training.
- labels (list) – A list of labels corresponding with the training indices.
- query_i (int) – The query number.

add_proba(pool_idx, train_idx, proba, query_i)
Add inverse pool indices and their labels.
Parameters: - pool_idx (list, numpy.ndarray) – A list of indices used for the unlabeled pool.
- proba (numpy.ndarray) – Array of prediction probabilities for the unlabeled pool.
- query_i (int) – The query number.

close()
Close the files opened by the state.
Also sets the end time if not in read-only mode.

delete_last_query()
Delete the last query from the state object.
get(variable, query_i=None, idx=None)
Get data from the state object.
This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.
Parameters: - variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba , train_idx, pool_idx.
- query_i (int) – Query number, should be between 0 and self.n_queries().
- idx (int, numpy.ndarray,list) – Indices to get in the returned array.
get_current_queries()
Get the current queries made by the model.
This is useful to get back exactly to the state it was in before shutting down a review.
Returns: dict – The last known queries according to the state file.

get_feature_matrix(data_hash)
Get feature matrix out of the state.
Parameters: data_hash (str) – Hash of the as_data object from which the matrix is derived.
Returns: numpy.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.

initialize_structure()
Create empty internal structure for state.

is_empty()
Check if state has no results.
Returns: bool – True if empty.

n_queries()
Number of queries saved in the state.
Returns: int – Number of queries.

pred_proba
Get last predicted probabilities.
restore(fp)
Restore or create state from a state file.
If the state file doesn't exist, creates an empty state that is ready for storage.
Parameters: fp (str) – Path to file to restore/create.
set_current_queries(current_queries)
Set the current queries made by the model.
Parameters: current_queries (dict) – The last known queries, with {query_idx: query_method}.

set_final_labels(y)
Add/set final labels to state.
If final_labels does not exist yet, add it.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with final inclusion labels.

set_labels(y)
Add/set labels to state.
If the labels do not exist, add them to the state.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with inclusion labels.

settings
Get settings from state.
startup_vals()
Get variables for reviewer to continue review.
Returns: - numpy.ndarray – Current labels of dataset.
- numpy.ndarray – Current training indices.
- dict – Dictionary containing the sources of the labels.
- query_i – Current query number (starting from 0).

to_dict()
Convert state to dictionary.
Returns: dict – Dictionary with all relevant variables.
class asreview.state.DictState(state_fp, *_, **__)
Class for storing the state of a review with no permanent storage.
add_classification(idx, labels, methods, query_i)
Add training indices and their labels.
Parameters: - idx (list, numpy.ndarray) – A list of indices used for training.
- labels (list) – A list of labels corresponding with the training indices.
- query_i (int) – The query number.

add_proba(pool_idx, train_idx, proba, query_i)
Add inverse pool indices and their labels.
Parameters: - pool_idx (list, numpy.ndarray) – A list of indices used for the unlabeled pool.
- proba (numpy.ndarray) – Array of prediction probabilities for the unlabeled pool.
- query_i (int) – The query number.

close()
Close the files opened by the state.
Also sets the end time if not in read-only mode.
get(variable, query_i=None, idx=None)
Get data from the state object.
This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.
Parameters: - variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba , train_idx, pool_idx.
- query_i (int) – Query number, should be between 0 and self.n_queries().
- idx (int, numpy.ndarray,list) – Indices to get in the returned array.
get_current_queries()
Get the current queries made by the model.
This is useful to get back exactly to the state it was in before shutting down a review.
Returns: dict – The last known queries according to the state file.

get_feature_matrix(data_hash)
Get feature matrix out of the state.
Parameters: data_hash (str) – Hash of the as_data object from which the matrix is derived.
Returns: numpy.ndarray, scipy.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.

pred_proba
Get last predicted probabilities.
restore(*_, **__)
Restore or create state from a state file.
If the state file doesn't exist, creates an empty state that is ready for storage.
Parameters: fp (str) – Path to file to restore/create.
set_current_queries(current_queries)
Set the current queries made by the model.
Parameters: current_queries (dict) – The last known queries, with {query_idx: query_method}.

set_final_labels(y)
Add/set final labels to state.
If final_labels does not exist yet, add it.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with final inclusion labels.

set_labels(y)
Add/set labels to state.
If the labels do not exist, add them to the state.
Parameters: y (numpy.ndarray) – One-dimensional integer numpy array with inclusion labels.

settings
Get settings from state.
startup_vals()
Get variables for reviewer to continue review.
Returns: - numpy.ndarray – Current labels of dataset.
- numpy.ndarray – Current training indices.
- dict – Dictionary containing the sources of the labels.
- query_i – Current query number (starting from 0).

to_dict()
Convert state to dictionary.
Returns: dict – Dictionary with all relevant variables.
Analysis

class asreview.analysis.Analysis(states, key=None)
Analysis object to do statistical analysis on state files.
avg_time_to_discovery(result_format='number')
Estimate the Time to Discovery (TD) for each paper.
Get the best/last estimate of how long it takes to find a paper.
Parameters: result_format (str) – Desired output format: "number", "fraction" or "percentage".
Returns: dict – For each inclusion, key=paper_id, value=avg time.
classmethod from_dir(data_dir, prefix='', key=None)
Create an Analysis object from a directory.

classmethod from_file(data_fp, key=None)
Create an Analysis object from a file.

classmethod from_path(data_path, prefix='', key=None)
Create an Analysis object from either a file or a directory.
inclusions_found(result_format='fraction', final_labels=False, **kwargs)
Get the number of inclusions at each point in time.
Caching is used to prevent multiple calls being expensive.
Returns: tuple – Three numpy arrays with x, y, error_bar.
limits(prob_allow_miss=[0.1], result_format='percentage')
For each query, compute the number of papers for a criterion.
A criterion is the average number of papers missed. For example, with 0.1, the criterion is that after reading x papers, there is (about) a 10% chance that one paper is not included. Another example: with 2.0, there are on average 2 papers missed after reading x papers. The value of x is returned for each query and probability by the function.
Parameters: prob_allow_miss (list, float) – Sets the criterion for how many papers can be missed.
Returns: dict – One entry, "x_range", with the number of papers read, and a list, "limits", with the results for each probability at each number of papers read.
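Example, as a sketch; the state file path is a hypothetical placeholder:

    from asreview.analysis import Analysis

    analysis = Analysis.from_path("simulation.h5")
    x, y, err = analysis.inclusions_found(result_format="fraction")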
Extensions

class asreview.entry_points.BaseEntryPoint
Base class for defining entry points.
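A custom extension subclasses BaseEntryPoint. The following sketch assumes the conventional description attribute and execute(argv) method, since the class body is not documented above:

    from asreview.entry_points import BaseEntryPoint

    class ExampleEntryPoint(BaseEntryPoint):
        description = "Example extension for ASReview."

        def execute(self, argv):
            # argv holds the command-line arguments after the subcommand
            print("example extension called with:", argv)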