Reference

Low level API

asreview.review.get_reviewer(dataset, mode='simulate', model='nb', query_strategy='max_random', balance_strategy='triple', feature_extraction='tfidf', n_instances=1, n_papers=None, n_queries=None, embedding_fp=None, verbose=0, prior_idx=None, n_prior_included=1, n_prior_excluded=1, config_file=None, state_file=None, model_param=None, query_param=None, balance_param=None, feature_param=None, seed=None, abstract_only=False, included_dataset=[], excluded_dataset=[], prior_dataset=[], new=False, **kwargs)[source]

Get a review object from arguments.

See __main__.py for a description of the arguments.

class asreview.review.BaseReview(as_data, model=None, query_model=None, balance_model=None, feature_model=None, n_papers=None, n_instances=1, n_queries=None, start_idx=[], state_file=None, log_file=None, verbose=1, data_fp=None)[source]

Base class for Systematic Review

classify(query_idx, inclusions, state, method=None)[source]

Classify new papers and update the training indices.

It automatically updates the state.

Parameters:
  • query_idx (list, np.array) – Indices to classify.
  • inclusions (list, np.array) – Labels of the query_idx.
  • state (BaseLogger) – Logger to store the classification in.
log_probabilities(state)[source]

Store the modeling probabilities of the training indices and pool indices.

query(n_instances, query_model=None)[source]

Query new results.

Parameters:
  • n_instances (int) – Batch size of the queries, i.e. number of papers to be queried.
  • query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns:

np.array – Indices of papers queried.

review(*args, **kwargs)[source]

Do the systematic review, writing the results to the state file.

Parameters:
  • stop_after_class (bool) – When to stop; if True stop after classification step, otherwise stop after training step.
  • instant_save (bool) – If True, save results after each single classification.
statistics()[source]

Get a number of statistics about the current state of the review.

train()[source]

Train the model.
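
The train, query and classify methods above together form an active learning loop. The following is a minimal toy sketch of that loop in plain Python. It does not use asreview itself: the function name review_loop, the word-overlap "model" and the simulated labelling are all assumptions made for illustration.

```python
def review_loop(texts, true_labels, prior_idx, n_instances=1, n_queries=3):
    """Toy train/query/classify loop (illustrative, not the asreview code)."""
    # Labels known so far, starting from the prior knowledge.
    labels = {i: true_labels[i] for i in prior_idx}
    for _ in range(n_queries):
        pool = [i for i in range(len(texts)) if i not in labels]
        if not pool:
            break
        # "train": collect the words that occur in included papers.
        relevant_words = set()
        for i, label in labels.items():
            if label == 1:
                relevant_words.update(texts[i].lower().split())
        # "query": rank the pool by word overlap, highest first.
        scores = {
            i: len(relevant_words & set(texts[i].lower().split()))
            for i in pool
        }
        query_idx = sorted(pool, key=lambda i: (-scores[i], i))[:n_instances]
        # "classify": in a simulation the known label is simply looked up.
        for i in query_idx:
            labels[i] = true_labels[i]
    return labels
```

With two prior papers and one query of batch size 1, the loop labels exactly one extra paper: the one in the pool that shares the most words with the included prior.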

class asreview.ReviewSimulate(as_data, *args, n_prior_included=0, n_prior_excluded=0, prior_idx=None, init_seed=None, **kwargs)[source]

Automated Systematic Review in simulation mode.

Models

class asreview.models.NBModel(alpha=3.822)[source]

Naive Bayes SKLearn model.

class asreview.models.RFModel(n_estimators=100, max_features=10, class_weight=1.0, random_state=None)[source]

Random Forest SKLearn model.

class asreview.models.SVMModel(gamma='auto', class_weight=0.249, C=15.4, kernel='linear', random_state=None)[source]

Support Vector Machine SKLearn model.

class asreview.models.LogisticModel(C=1.0, class_weight=1.0, random_state=None, n_jobs=1)[source]

Logistic Regression SKLearn model.

asreview.models.get_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of a model from a string.

Parameters:
  • method (str) – Name of the model.
  • *args – Arguments for the model.
  • **kwargs – Keyword arguments for the model.
asreview.models.get_model_class(method)[source]

Get class of model from string.

Parameters:method (str) – Name of the model, e.g. ‘svm’, ‘nb’ or ‘lstm-pool’.
Returns:BaseModel – Class corresponding to the method.

Query strategies

class asreview.query_strategies.MaxQuery[source]

Maximum sampling query strategy.

class asreview.query_strategies.MixedQuery(strategy_1='max', strategy_2='random', mix_ratio=0.95, random_state=None, **kwargs)[source]

Class for mixed query strategy.

The idea is to use two different query strategies at the same time with a ratio of one to the other.
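
As a rough illustration of mixing two strategies with a ratio, the sketch below takes a share of the batch from a "max" ranking and fills the rest with random picks. The function name mixed_query and the rounding rule are assumptions for this example; the way MixedQuery actually splits a batch may differ.

```python
import random


def mixed_query(proba, pool_idx, n_instances, mix_ratio=0.95, seed=None):
    """Pick n_instances indices: a mix_ratio share by max, the rest at random."""
    rng = random.Random(seed)
    # Assumption: the share for the first strategy is simply rounded.
    n_max = min(n_instances, round(n_instances * mix_ratio))
    # First strategy: highest predicted probability of inclusion.
    ranked = sorted(pool_idx, key=lambda i: proba[i], reverse=True)
    max_part = ranked[:n_max]
    # Second strategy: random sample from what is left in the pool.
    remaining = ranked[n_max:]
    n_rand = min(n_instances - n_max, len(remaining))
    rand_part = rng.sample(remaining, n_rand)
    return max_part + rand_part
```

With mix_ratio=0.5 and n_instances=2, one index comes from the top of the ranking and one is drawn at random from the remaining pool.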

class asreview.query_strategies.UncertaintyQuery[source]

Maximum uncertainty query strategy.

class asreview.query_strategies.RandomQuery(random_state=None)[source]

Random sampling query strategy.

class asreview.query_strategies.ClusterQuery(cluster_size=350, update_interval=200, random_state=None, **kwargs)[source]

Query strategy using clustering algorithms.

asreview.query_strategies.get_query_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of the query strategy.

Parameters:
  • method (str) – Name of the query strategy.
  • *args – Arguments for the model.
  • **kwargs – Keyword arguments for the model.
Returns:

BaseQueryModel – Initialized instance of query strategy.

asreview.query_strategies.get_query_class(method)[source]

Get class of query strategy from its name.

Parameters:method (str) – Name of the query strategy, e.g. ‘max’, ‘uncertainty’ or ‘random’. A special mixed query strategy is also possible. The mix is denoted by an underscore: ‘max_random’ or ‘max_uncertainty’.
Returns:BaseQueryModel – Class corresponding to the method name.
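
The underscore convention for mixed strategies can be illustrated with a small parsing sketch. The helper split_query_name is hypothetical and only shows how a name such as ‘max_random’ decomposes; it is not how get_query_class is necessarily implemented.

```python
def split_query_name(method):
    """Split a query strategy name into its component strategies."""
    parts = method.split("_")
    if len(parts) == 1:
        # A plain strategy such as "max" or "random".
        return (parts[0],)
    if len(parts) == 2:
        # A mixed strategy such as "max_random".
        return (parts[0], parts[1])
    raise ValueError(f"Cannot interpret query strategy name: {method!r}")
```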

Balance Strategies

class asreview.balance_strategies.SimpleBalance[source]
class asreview.balance_strategies.DoubleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, random_state=None)[source]

Class for the double balance strategy.

Class to get the two-way rebalancing function and arguments. It oversamples the 1’s depending on the number of 0’s and the total number of samples in the training data.

Parameters:
  • a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
  • alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
  • b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
  • beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.
class asreview.balance_strategies.TripleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, c=0.835, gamma=2.0, shuffle=True, random_state=None)[source]

Class to get the three-way rebalancing function and arguments. It divides the data into three groups: 1’s, 0’s from random sampling, and 0’s from max sampling. Thus it only makes sense to use this class in combination with the ‘max_random’ query strategy.

class asreview.balance_strategies.UndersampleBalance(ratio=1.0, random_state=None)[source]

Balancing class that undersamples the data with a given ratio.
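
A minimal sketch of undersampling is shown below. It assumes that ratio is the target number of 1’s per 0 in the returned sample; this interpretation, and the function name undersample, are assumptions for illustration rather than a description of UndersampleBalance itself.

```python
import math
import random


def undersample(train_idx, y, ratio=1.0, seed=None):
    """Keep all 1's and drop 0's until roughly `ratio` ones per zero remain."""
    rng = random.Random(seed)
    ones = [i for i in train_idx if y[i] == 1]
    zeros = [i for i in train_idx if y[i] == 0]
    if not ones or not zeros or len(ones) / len(zeros) >= ratio:
        # No excess of zeros (or nothing to balance): keep everything.
        selected = ones + zeros
    else:
        # Keep only as many zeros as the ratio allows.
        n_zero_keep = min(len(zeros), math.ceil(len(ones) / ratio))
        selected = ones + rng.sample(zeros, n_zero_keep)
    rng.shuffle(selected)
    return selected
```

With two 1’s, six 0’s and ratio=1.0, the result contains both 1’s and two randomly chosen 0’s.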

asreview.balance_strategies.get_balance_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of a balance model from a string.

Parameters:
  • method (str) – Name of the balance model.
  • *args – Arguments for the balance model.
  • **kwargs – Keyword arguments for the balance model.
asreview.balance_strategies.get_balance_class(method)[source]

Get class of balance model from string.

Parameters:method (str) – Name of the model, e.g. ‘simple’, ‘double’ or ‘undersample’.
Returns:BaseBalanceModel – Class corresponding to the method.

Feature Extraction

class asreview.feature_extraction.Tfidf(*args, ngram_max=1, **kwargs)[source]

Class to apply SKLearn Tf-idf to texts.

asreview.feature_extraction.get_feature_model(method, *args, random_state=None, **kwargs)[source]

Get an instance of a feature extraction model from a string.

Parameters:
  • method (str) – Name of the feature extraction model.
  • *args – Arguments for the feature extraction model.
  • **kwargs – Keyword arguments for the feature extraction model.
asreview.feature_extraction.get_feature_class(method)[source]

Get class of feature extraction from string.

Parameters:method (str) – Name of the feature model, e.g. ‘doc2vec’, ‘tfidf’ or ‘embedding-lstm’.
Returns:BaseFeatureExtraction – Class corresponding to the method.

Data

class asreview.ASReviewData(df=None, data_name='empty', data_type='standard', column_spec=None)[source]

Data object to store csv/ris file.

Extracts relevant properties of papers.

Parameters:
  • df (pd.DataFrame) – Dataframe containing the data for the ASReview data object.
  • data_name (str) – Give a name to the data object.
  • data_type (str) – What kind of data the dataframe contains.
append(as_data)[source]

Append another ASReviewData object.

It puts the training data at the end.

Parameters:as_data (ASReviewData) – Dataset to append.
format_record(i, by_index=True, *args, **kwargs)[source]

Format one record for displaying in the CLI.

classmethod from_file(fp, read_fn=None, data_name=None, data_type=None)[source]

Create instance from csv/ris/excel file.

It works in two ways; either manual control where the conversion functions are supplied or automatic, where it searches in the entry points for the right conversion functions.

Parameters:
  • fp (str, Path) – Read the data from this file.
  • read_fn (function) – Function to read the file. It should return a standardized dataframe.
  • data_name (str) – Name of the data.
  • data_type (str) – What kind of data it is. Special names: ‘included’, ‘excluded’, ‘prior’.
fuzzy_find(keywords, threshold=60, max_return=10, exclude=None, by_index=True)[source]

Find a record using keywords.

It looks for keywords in the title/authors/keywords (as far as they are available). Using the difflib package it creates a ranking based on token set matching.

Parameters:
  • keywords (str) – A string of keywords together, can be a combination.
  • threshold (float) – Don’t return records below this threshold.
  • max_return (int) – Maximum number of records to return.
  • exclude (list, np.ndarray) – List of indices that should be excluded in the search. You would put papers that were already labeled here for example.
  • by_index (bool) – If True, use internal indexing. If False, use record ids for indexing.
Returns:

list – Sorted list of indices that best match the keywords.
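
The ranking idea can be sketched with difflib from the standard library. The scoring below (comparing sorted, de-duplicated tokens with SequenceMatcher) is a crude stand-in for token set matching, and the function name fuzzy_rank is hypothetical; the scores that fuzzy_find computes may differ.

```python
import difflib


def fuzzy_rank(keywords, texts, threshold=60, max_return=10, exclude=None):
    """Rank texts by similarity to the keywords on a 0-100 scale."""
    exclude = set(exclude or [])
    # Normalise: lower-case, unique tokens, sorted, so word order is ignored.
    kw = " ".join(sorted(set(keywords.lower().split())))
    scored = []
    for i, text in enumerate(texts):
        if i in exclude:
            continue
        tx = " ".join(sorted(set(text.lower().split())))
        score = 100 * difflib.SequenceMatcher(None, kw, tx).ratio()
        if score >= threshold:
            scored.append((score, i))
    # Best score first; ties are broken by the lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:max_return]]
```

Records listed in exclude, for example papers that were already labeled, are skipped before scoring.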

get(name)[source]

Get column with name.

hash()[source]

Compute a hash from the dataset.

Returns:str – SHA1 hash, computed from the titles/abstracts of the dataframe.
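
A SHA1 hash over titles and abstracts can be computed with hashlib, as in the sketch below. How the texts are joined before hashing is an assumption here, so the resulting value is not expected to match the output of ASReviewData.hash().

```python
import hashlib


def dataset_hash(titles, abstracts):
    """Return a SHA1 hex digest computed from titles and abstracts."""
    # Assumption: titles and abstracts are concatenated with single spaces.
    joined = " ".join(titles) + " " + " ".join(abstracts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()
```

Such a hash identifies a dataset, for instance when looking up a stored feature matrix that was derived from it.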
preview_record(i, by_index=True, *args, **kwargs)[source]

Return a preview string for record i.

print_record(*args, **kwargs)[source]

Print a record to the CLI.

prior_data_idx

Get prior_included, prior_excluded from dataset.

prior_labels(state, by_index=True)[source]

Get the labels that are marked as ‘initial’.

Parameters:
  • state (BaseState) – Open state that contains the label information.
  • by_index (bool) – If True, return internal indexing. If False, return record_ids for indexing.
Returns:np.array – Array of indices that have the ‘initial’ property.
record(i, by_index=True)[source]

Create a record from an index.

Parameters:
  • i (int, iterable) – Index of the record, or list of indices.
  • by_index (bool) – If True, take the i-th value as used internally by the review. If False, take the record with record_id==i.
Returns:

PaperRecord – The corresponding record if i was an integer, or a list of records if i was an iterable.

slice(idx)[source]

Create a slice from itself.

Useful if some parts should be kept/thrown away.

Parameters:idx (list, np.ndarray) – Record ids that should be kept.
Returns:ASReviewData – Slice of itself.
to_csv(fp, labels=None, ranking=None)[source]

Export to csv.

Parameters:
  • fp (str, NoneType) – Filepath or None for buffer.
  • labels (list, np.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pd.DataFrame – Dataframe of all available record data.

to_dataframe(labels=None, ranking=None)[source]

Create new dataframe with updated label (order).

Parameters:
  • labels (list, np.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pd.DataFrame – Dataframe of all available record data.

to_excel(fp, labels=None, ranking=None)[source]

Export to Excel xlsx file.

Parameters:
  • fp (str, NoneType) – Filepath or None for buffer.
  • labels (list, np.ndarray) – Current labels will be overwritten by these labels (including unlabelled). No effect if labels is None.
  • ranking (list) – Reorder the dataframe according to these (internal) indices. Default ordering if ranking is None.
Returns:

pd.DataFrame – Dataframe of all available record data.

to_file(fp, labels=None, ranking=None)[source]

Export data object to file.

RIS, CSV and Excel are supported file formats at the moment.

Parameters:
  • fp (str) – Filepath to export to.
  • labels (list, np.array) – Labels to be inserted into the dataframe before export.
  • ranking (list, np.array) – Optionally, dataframe rows can be reordered.

Utils

asreview.load_embedding(fp, word_index=None, n_jobs=None)[source]

Load embedding matrix from file.

The embedding matrix needs to be stored in the FastText format.

Parameters:
  • fp (str) – File path of the trained embedding vectors.
  • word_index (dict) – Sample word embeddings.
  • n_jobs (int) – Number of processes to parse the embedding (+1 process for reading).
  • verbose (int) – The verbosity. Default 1.
Returns:

dict – The embedding weights stored in a dict with the word as key and the weights as values.

asreview.sample_embedding(embedding, word_index)[source]

Sample embedding matrix

Parameters:
  • embedding (dict) – A dictionary with the words and embedding vectors.
  • word_index (dict) – A word_index like the output of Keras Tokenizer.word_index.
  • verbose (int) – The verbosity. Default 1.
Returns:

(np.ndarray, list) – The embedding weights stored in a two dimensional numpy array and a list with the corresponding words.

State

asreview.state.open_state(fp, *args, read_only=False, **kwargs)[source]

Open a state from a file.

Parameters:
  • fp (str) – File to open.
  • read_only (bool) – Whether to open the file in read_only mode.
Returns:

BaseState – Depending on the extension the appropriate state is chosen:
  • [.h5, .hdf5, .he5] -> HDF5State.
  • None -> DictState (doesn’t store anything permanently).
  • Anything else -> JSONState.
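
The extension-based choice described above can be sketched as a small dispatch function. The helper choose_state_class is hypothetical and returns class names as strings so that the example runs without asreview installed.

```python
from pathlib import Path


def choose_state_class(fp):
    """Return the name of the state class that matches the file path."""
    if fp is None:
        # No file: keep the state in memory only.
        return "DictState"
    if Path(fp).suffix in (".h5", ".hdf5", ".he5"):
        return "HDF5State"
    # Any other extension falls back to JSON storage.
    return "JSONState"
```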

class asreview.state.BaseState(state_fp, read_only=False)[source]
add_classification(idx, labels, methods, query_i)[source]

Add training indices and their labels.

Parameters:
  • idx (list, np.array) – A list of indices used for training.
  • labels (list) – A list of labels corresponding with the training indices.
  • query_i (int) – The query number.
add_proba(pool_idx, train_idx, proba, query_i)[source]

Add inverse pool indices and their labels.

Parameters:
  • pool_idx (list, np.array) – A list of indices used for unlabeled pool.
  • proba (np.array) – Array of prediction probabilities for unlabeled pool.
  • query_i (int) – The query number.
close()[source]

Close the files opened by the state.

Also sets the end time if not in read-only mode.

delete_last_query()[source]

Delete the last query from the state object.

get(variable, query_i=None, default=None, idx=None)[source]

Get data from the state object.

This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.

Parameters:
  • variable (str) – Name of the variable/data to get. Options are: label_idx, inclusions, label_methods, labels, final_labels, proba, train_idx, pool_idx.
  • query_i (int) – Query number, should be between 0 and self.n_queries().
  • idx (int, np.array, list) – Indices to get in the returned array.
get_current_queries()[source]

Get the current queries made by the model.

This is useful to get back exactly to the state it was in before shutting down a review.

Returns:dict – The last known queries according to the state file.
get_feature_matrix(data_hash)[source]

Get feature matrix out of the state.

Parameters:data_hash (str) – Hash of as_data object from which the matrix is derived.
Returns:np.ndarray or sklearn.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.
initialize_structure()[source]

Create empty internal structure for state

is_empty()[source]

Check if state has no results.

Returns:bool – True if empty.
n_queries()[source]

Number of queries saved in the state.

Returns:int – Number of queries.
restore(fp)[source]

Restore or create state from a state file.

If the state file doesn’t exist, creates an empty state that is ready for storage.

Parameters:fp (str) – Path to file to restore/create.
save()[source]

Save state to file.

Parameters:fp (str) – The file path to export the results to.
set_current_queries(current_queries)[source]

Set the current queries made by the model.

Parameters:current_queries (dict) – The last known queries, with {query_idx: query_method}.
set_final_labels(y)[source]

Add/set final labels to state.

If final_labels does not exist yet, add it.

Parameters:y (np.array) – One dimensional integer numpy array with final inclusion labels.
set_labels(y)[source]

Add/set labels to state

If the labels do not exist, add it to the state.

Parameters:y (np.array) – One dimensional integer numpy array with inclusion labels.
settings

Get settings from state

startup_vals()[source]

Get variables for reviewer to continue review.

Returns:
  • np.array – Current labels of dataset.
  • np.array – Current training indices.
  • dict – Dictionary containing the sources of the labels.
  • int – Current query number, query_i (starting from 0).
to_dict()[source]

Convert state to dictionary.

Returns:dict – Dictionary with all relevant variables.
class asreview.state.HDF5State(state_fp, read_only=False)[source]

Class for storing the review state with HDF5 storage.

class asreview.state.JSONState(state_fp, read_only=False)[source]

Class for storing the state of a Systematic Review using JSON files.

class asreview.state.DictState(state_fp, *_, **__)[source]

Class for storing the state of a review with no permanent storage.

Analysis

class asreview.analysis.Analysis(states, key=None)[source]

Analysis object to do statistical analysis on state files.

avg_time_to_discovery(result_format='number')[source]

Get the best/last estimate on how long it takes to find a paper.

Returns:dict – For each inclusion, key=paper_id, value=avg time.
close()[source]

Close states.

classmethod from_dir(data_dir, prefix='result', key=None)[source]

Create an Analysis object from a directory.

classmethod from_file(data_fp, key=None)[source]

Create an Analysis object from a file.

classmethod from_path(data_path, prefix='result', key=None)[source]

Create an Analysis object from either a file or a directory.

inclusions_found(result_format='fraction', final_labels=False, **kwargs)[source]

Get the number of inclusions at each point in time.

Caching is used to prevent multiple calls being expensive.

Parameters:
  • result_format (str) – The format % or # of the returned values.
  • final_labels (bool) – If true, use the final_labels instead of labels for analysis.
Returns:

tuple – Three numpy arrays with x, y, error_bar.

limits(prob_allow_miss=[0.1], result_format='percentage')[source]

For each query, compute the number of papers for a criterion.

A criterion is the average number of papers missed. For example, with 0.1, the criterion is that after reading x papers, there is (about) a 10% chance that one paper is not included. Another example, with 2.0, there are on average 2 papers missed after reading x papers. The value for x is returned for each query and probability by the function.

Parameters:prob_allow_miss (list, float) – Sets the criterion for how many papers can be missed.
Returns:dict – Dictionary with an entry “x_range” holding the number of papers read, and an entry “limits” holding a list of results for each probability at each number of papers read.
rrf(val=10, x_format='percentage', **kwargs)[source]

Get the RRF (Relevant References Found).

Parameters:
  • val – At which recall, between 0 and 100.
  • x_format – Format for position of RRF value in graph.
Returns:

tuple – Tuple consisting of RRF value, x_positions, y_positions of RRF bar.
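
As a rough sketch of the metric itself, RRF can be computed from the labels in the order in which papers were screened. The example assumes that val is the percentage of papers screened and returns a fraction; the function name rrf_sketch and these conventions are assumptions, and the Analysis.rrf method may differ in format and interpolation.

```python
def rrf_sketch(labels_in_order, val=10):
    """Fraction of relevant papers found after screening val% of all papers."""
    n_papers = len(labels_in_order)
    n_relevant = sum(labels_in_order)
    if n_papers == 0 or n_relevant == 0:
        raise ValueError("Need at least one paper and one relevant paper.")
    # Number of papers screened at val percent (rounded down).
    n_screened = int(n_papers * val / 100)
    return sum(labels_in_order[:n_screened]) / n_relevant
```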

wss(val=100, x_format='percentage', **kwargs)[source]

Get the WSS (Work Saved over Sampling) value.

Parameters:
  • val – At which recall, between 0 and 100.
  • x_format – Format for position of WSS value in graph.
Returns:

tuple – Tuple consisting of WSS value, x_positions, y_positions of WSS bar.
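
The WSS metric can be sketched from the labels in screening order, using the common definition: the recall level minus the fraction of papers that had to be screened to reach it. The function name wss_sketch and the return value as a fraction are assumptions for this example; the Analysis.wss method may use a different format or interpolation.

```python
def wss_sketch(labels_in_order, val=95):
    """Work saved over random screening at val% recall, as a fraction."""
    n_papers = len(labels_in_order)
    n_relevant = sum(labels_in_order)
    if n_papers == 0 or n_relevant == 0:
        raise ValueError("Need at least one paper and one relevant paper.")
    recall = val / 100
    found = 0
    for n_screened, label in enumerate(labels_in_order, start=1):
        found += label
        if found / n_relevant >= recall:
            # Random screening needs about recall * n_papers papers on
            # average; the difference is the work saved.
            return recall - n_screened / n_papers
    raise ValueError("Requested recall is never reached.")
```

For example, if 75% recall is reached after screening 4 out of 10 papers, the sketch returns 0.75 - 0.4 = 0.35.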

Extensions

class asreview.entry_points.BaseEntryPoint[source]

Base class for defining entry points.

classmethod execute(argv)[source]

Perform the functionality of the entry point.

Parameters:argv (list) – Argument list, with the entry point and program removed. For example, if asreview plot X is executed, then argv == [‘X’].
format(entry_name='?')[source]

Create a short formatted description of the entry point.

Parameters:entry_name (str) – Name of the entry point. For example ‘plot’ in asreview plot X
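
The sketch below illustrates how an extension might implement execute and format. To keep it runnable without asreview, it defines a local stand-in base class; EntryPointSketch, PlotEntryPointSketch, the description attribute and the output of format are all hypothetical and are not taken from the library.

```python
import argparse


class EntryPointSketch:
    """Local stand-in for a base entry point class (hypothetical)."""

    description = "Base entry point."

    @classmethod
    def execute(cls, argv):
        raise NotImplementedError

    def format(self, entry_name="?"):
        # Short one-line description, e.g. for a help listing.
        return f"{entry_name}: {self.description}"


class PlotEntryPointSketch(EntryPointSketch):
    """Hypothetical entry point for a command like `asreview plot X`."""

    description = "Plot results (hypothetical)."

    @classmethod
    def execute(cls, argv):
        # argv holds only the remaining arguments, e.g. ["X"].
        parser = argparse.ArgumentParser(prog="asreview plot")
        parser.add_argument("paths", nargs="+", help="State files to plot.")
        args = parser.parse_args(argv)
        return args.paths
```

Here execute receives ["X"] for the command asreview plot X and parses it with argparse, while format("plot") yields a one-line summary.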