API Reference
Low level API

asreview.review.get_reviewer(dataset, mode='simulate', model='nb', query_strategy='max_random', balance_strategy='triple', feature_extraction='tfidf', n_instances=1, n_papers=None, n_queries=None, embedding_fp=None, verbose=0, prior_idx=None, n_prior_included=1, n_prior_excluded=1, config_file=None, state_file=None, model_param=None, query_param=None, balance_param=None, feature_param=None, seed=None, abstract_only=False, included_dataset=[], excluded_dataset=[], prior_dataset=[], new=False, **kwargs)
Get a review object from arguments.
See __main__.py for a description of the arguments.

class asreview.review.BaseReview(as_data, model=None, query_model=None, balance_model=None, feature_model=None, n_papers=None, n_instances=1, n_queries=None, start_idx=[], state_file=None, log_file=None)
Base class for Systematic Review.

classify(query_idx, inclusions, state, method=None)
Classify new papers and update the training indices.
It automatically updates the state.
Parameters:

log_probabilities(state)
Store the modeling probabilities of the training indices and pool indices.

query(n_instances, query_model=None)
Query new results.
Parameters:  n_instances (int) – Batch size of the queries, i.e. number of papers to be queried.
 query_model (BaseQueryModel) – Query strategy model to use. If None, the query model of the reviewer is used.
Returns: np.array – Indices of papers queried.

review(*args, **kwargs)
Do the systematic review, writing the results to the state file.
Parameters:

settings
Get an ASReview settings object.

Models

class asreview.models.RFModel(n_estimators=100, max_features=10, class_weight=1.0, random_state=None)
Random Forest SKLearn model.

class asreview.models.SVMModel(gamma='auto', class_weight=0.249, C=15.4, kernel='linear', random_state=None)
Support Vector Machine SKLearn model.

class asreview.models.LogisticModel(C=1.0, class_weight=1.0, random_state=None, n_jobs=1)
Logistic Regression SKLearn model.
Query strategies

class asreview.query_strategies.MixedQuery(strategy_1='max', strategy_2='random', mix_ratio=0.95, random_state=None, **kwargs)
Class for mixed query strategy.
The idea is to use two different query strategies at the same time with a ratio of one to the other.
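
The mixing idea can be sketched in plain Python. This is only an illustration, not the MixedQuery implementation: the helper name mixed_query is hypothetical, and it assumes that mix_ratio is the probability of using the first strategy ('max') for each queried paper, with the second strategy ('random') used otherwise.

```python
import random


def mixed_query(scores, n_instances, mix_ratio=0.95, seed=None):
    """Pick up to n_instances indices and record which strategy chose each.

    Hypothetical helper: assumes mix_ratio is the chance that a slot is
    filled by the 'max' strategy rather than the 'random' strategy.
    """
    rng = random.Random(seed)
    pool = set(range(len(scores)))
    chosen = {}
    for _ in range(min(n_instances, len(scores))):
        if rng.random() < mix_ratio:
            # 'max': take the remaining paper with the highest score.
            idx = max(pool, key=lambda i: scores[i])
            method = "max"
        else:
            # 'random': take any remaining paper uniformly at random.
            idx = rng.choice(sorted(pool))
            method = "random"
        pool.remove(idx)
        chosen[idx] = method
    return chosen
```

With mix_ratio=1.0 every slot goes to the 'max' strategy, and with mix_ratio=0.0 every slot is drawn at random.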

class asreview.query_strategies.RandomQuery(random_state=None)
Random sampling query strategy.

class asreview.query_strategies.ClusterQuery(cluster_size=350, update_interval=200, random_state=None)
Query strategy using clustering algorithms.

asreview.query_strategies.get_query_model(method, *args, random_state=None, **kwargs)
Get an instance of the query strategy.
Parameters:  method (str) – Name of the query strategy.
 *args – Arguments for the model.
 **kwargs – Keyword arguments for the model.
Returns: BaseQueryModel – Initialized instance of query strategy.

asreview.query_strategies.get_query_class(method)
Get class of query strategy from its name.
Parameters: method (str) – Name of the query strategy, e.g. ‘max’, ‘uncertainty’, ‘random’. A special mixed query strategy is also possible. The mix is denoted by an underscore: ‘max_random’ or ‘max_uncertainty’.
Returns: BaseQueryModel – Class corresponding to the method name.
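
The underscore convention for mixed strategy names can be illustrated with a small parser. The function split_query_method below is a hypothetical helper written for this example and is not part of asreview; it only shows how a name such as ‘max_random’ decomposes into its two parts.

```python
def split_query_method(method):
    """Split a query strategy name into its component strategy names.

    Hypothetical helper: 'max' gives ('max',), 'max_random' gives
    ('max', 'random'). Names with more than one underscore are rejected.
    """
    parts = method.split("_")
    if len(parts) > 2 or any(part == "" for part in parts):
        raise ValueError(f"Cannot interpret query strategy name: {method!r}")
    return tuple(parts)
```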
Balance Strategies

class asreview.balance_strategies.DoubleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, random_state=None)
Class for the double balance strategy.
Class to get the two way rebalancing function and arguments. It supersamples the 1’s depending on the number of 0’s and the total number of samples in the training data.
Parameters:  a (float) – Governs the weight of the 1’s. Higher values mean linearly more 1’s in your training sample.
 alpha (float) – Governs the scaling of the weight of the 1’s, as a function of the ratio of ones to zeros. A positive value means that the lower the ratio of zeros to ones, the higher the weight of the ones.
 b (float) – Governs how strongly we want to sample depending on the total number of samples. A value of 1 means no dependence on the total number of samples, while lower values mean increasingly stronger dependence on the number of samples.
 beta (float) – Governs the scaling of the weight of the zeros depending on the number of samples. Higher values mean that larger samples penalize zeros more strongly.

class asreview.balance_strategies.TripleBalance(a=2.155, alpha=0.94, b=0.789, beta=1.0, c=0.835, gamma=2.0, shuffle=True, random_state=None)
Class to get the three way rebalancing function and arguments. It divides the data into three groups: 1’s, 0’s from random sampling, and 0’s from max sampling. Thus it only makes sense to use this class in combination with the max_random query strategy.
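
The three-way division can be sketched as follows. This is an illustration under stated assumptions, not the TripleBalance code: the helper split_three_ways is hypothetical, and it assumes that each labeled paper carries the name of the query method that selected it, with every 0 that was not sampled at random counted as coming from max sampling.

```python
def split_three_ways(labels, methods):
    """Divide training indices into 1's, random-sampled 0's, and max-sampled 0's.

    Hypothetical helper: labels[i] is 0 or 1, and methods[i] is the name of
    the query method that selected paper i (assumed to be 'random' or 'max').
    """
    ones, zeros_random, zeros_max = [], [], []
    for idx, (label, method) in enumerate(zip(labels, methods)):
        if label == 1:
            ones.append(idx)
        elif method == "random":
            zeros_random.append(idx)
        else:
            # Assumption: any 0 not sampled at random came from max sampling.
            zeros_max.append(idx)
    return ones, zeros_random, zeros_max
```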

class asreview.balance_strategies.UndersampleBalance(ratio=1.0, random_state=None)
Balancing class that undersamples the data with a given ratio.
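
A minimal sketch of undersampling is given below. The reference above does not define the meaning of ratio, so this example assumes it is the target ratio of 1’s to 0’s after undersampling; the helper undersample is hypothetical and only illustrates the general technique of keeping all 1’s and a random subset of the 0’s.

```python
import math
import random


def undersample(labels, ratio=1.0, seed=None):
    """Return sorted indices of all 1's plus a random subset of the 0's.

    Hypothetical helper. Assumption: ratio is the desired number of 1's
    per 0, so at most ceil(n_ones / ratio) zeros are kept.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    n_zeros = min(len(zeros), math.ceil(len(ones) / ratio))
    return sorted(ones + rng.sample(zeros, n_zeros))
```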
Feature Extraction

class asreview.feature_extraction.Tfidf(*args, ngram_max=1, **kwargs)
Class to apply SKLearn Tfidf to texts.

asreview.feature_extraction.get_feature_model(method, *args, random_state=None, **kwargs)
Get an instance of a feature extraction model from a string.
Parameters:  method (str) – Name of the feature extraction model.
 *args – Arguments for the feature extraction model.
 **kwargs – Keyword arguments for the feature extraction model.
Data

class asreview.ASReviewData(df=None, data_name='empty', data_type='standard', column_spec=None)
Data object to the dataset with texts, labels, DOIs etc.
Parameters:  df (pd.DataFrame) – Dataframe containing the data for the ASReview data object.
 data_name (str) – Give a name to the data object.
 data_type (str) – What kind of data the dataframe contains.
 column_spec (dict) – Specification for which column corresponds to which standard specification. Key is the standard specification, value is which column it is actually in.

append(as_data)
Append another ASReviewData object.
It puts the training data at the end.
Parameters: as_data (ASReviewData) – Dataset to append.

format_record(i, by_index=True, *args, **kwargs)
Format one record for displaying in the CLI.

classmethod from_file(fp, read_fn=None, data_name=None, data_type=None)
Create instance from csv/ris/excel file.
It works in two ways: manually, where the conversion functions are supplied, or automatically, where it searches the entry points for the right conversion functions.
Parameters:

fuzzy_find(keywords, threshold=60, max_return=10, exclude=None, by_index=True)
Find a record using keywords.
It looks for keywords in the title/authors/keywords (as far as they are available). Using the difflib package, it creates a ranking based on token set matching.
Parameters:  keywords (str) – A string of keywords together, can be a combination.
 threshold (float) – Don’t return records below this threshold.
 max_return (int) – Maximum number of records to return.
 exclude (list, np.ndarray) – List of indices that should be excluded in the search. You would put papers that were already labeled here for example.
 by_index (bool) – If True, use internal indexing. If False, use record ids for indexing.
Returns: list – Sorted list of indexes that best match the keywords.
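
The ranking described above can be approximated with the standard-library difflib module. The sketch below is not the fuzzy_find implementation: the helper fuzzy_rank is hypothetical, it takes a plain list of strings instead of a dataset, and its "token set" score (comparing the sorted unique tokens of both strings) is an assumption made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_rank(keywords, texts, threshold=60, max_return=10, exclude=None):
    """Rank texts by similarity to keywords on a 0-100 scale.

    Hypothetical helper: texts is a list of strings, and the score compares
    the sorted, de-duplicated lowercase tokens of both strings.
    """
    excluded = set(exclude or [])
    query = " ".join(sorted(set(keywords.lower().split())))
    scored = []
    for idx, text in enumerate(texts):
        if idx in excluded:
            continue
        candidate = " ".join(sorted(set(text.lower().split())))
        score = 100 * SequenceMatcher(None, query, candidate).ratio()
        if score >= threshold:
            scored.append((score, idx))
    # Best score first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:max_return]]
```

Passing the indices of already labeled papers as exclude keeps them out of the results, which mirrors the purpose of the exclude parameter described above.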

hash()
Compute a hash from the dataset.
Returns: str – SHA1 hash, computed from the titles/abstracts of the dataframe.
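
Computing a SHA1 hash over text fields can be done with hashlib. The sketch below is an assumption-laden illustration: the helper dataset_hash is hypothetical, and the way titles and abstracts are joined before hashing is chosen for the example and may differ from what hash() actually does.

```python
import hashlib


def dataset_hash(titles, abstracts):
    """Return a SHA1 hex digest computed from titles and abstracts.

    Hypothetical helper: the joining scheme (all titles, then all abstracts,
    separated by single spaces) is an assumption made for this example.
    """
    joined = " ".join(list(titles) + list(abstracts))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()
```

The digest is a 40-character hexadecimal string, and any change to a title or abstract produces a different value.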

prior_data_idx
Get prior_included, prior_excluded from dataset.

prior_labels(state, by_index=True)
Get the labels that are marked as ‘initial’.
Parameters:  state (BaseState) – Open state that contains the label information.
 by_index (bool) – If True, return internal indexing. If False, return record_ids for indexing.
Returns: np.array – Array of indices that have the ‘initial’ property.

record(i, by_index=True)
Create a record from an index.
Parameters:
Returns: PaperRecord – The corresponding record if i was an integer, or a list of records if i was an iterable.

slice(idx)
Create a slice from itself.
Useful if some parts should be kept/thrown away.
Parameters: idx (list, np.ndarray) – Record ids that should be kept.
Returns: ASReviewData – Slice of itself.

to_csv(fp, labels=None, ranking=None)
Export to csv.
Parameters:
Returns: pd.DataFrame – Dataframe of all available record data.

to_dataframe(labels=None, ranking=None)
Create new dataframe with updated label (order).
Parameters:
Returns: pd.DataFrame – Dataframe of all available record data.

to_excel(fp, labels=None, ranking=None)
Export to Excel xlsx file.
Parameters:
Returns: pd.DataFrame – Dataframe of all available record data.
Utils

asreview.load_embedding(fp, word_index=None, n_jobs=None)
Load embedding matrix from file.
The embedding matrix needs to be stored in the FastText format.
Parameters:
Returns: dict – The embedding weights stored in a dict with the word as key and the weights as values.
State

asreview.state.open_state(fp, *args, read_only=False, **kwargs)
Open a state from a file.
Parameters:
Returns: BaseState – Depending on the extension the appropriate state is chosen:
 [.h5, .hdf5, .he5] > HDF5State.
 None > DictState (doesn’t store anything permanently).
 Anything else > JSONState.
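
The extension rule above can be written out as a small dispatch function. The helper state_class_name is hypothetical and returns class names as strings rather than state objects; it only restates the mapping described for open_state.

```python
import os


def state_class_name(fp):
    """Return the name of the state class that matches a file path.

    Hypothetical helper that mirrors the extension rule described above.
    """
    if fp is None:
        # No file: in-memory state that stores nothing permanently.
        return "DictState"
    extension = os.path.splitext(fp)[1]
    if extension in (".h5", ".hdf5", ".he5"):
        return "HDF5State"
    # Any other extension falls back to JSON storage.
    return "JSONState"
```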

class asreview.state.BaseState(state_fp, read_only=False)

add_classification(idx, labels, methods, query_i)
Add training indices and their labels.
Parameters:

add_proba(pool_idx, train_idx, proba, query_i)
Add the predicted probabilities for the pool and training indices.
Parameters:

close()
Close the files opened by the state.
Also sets the end time if not in read-only mode.

get(variable, query_i=None, default=None, idx=None)
Get data from the state object.
This is the universal accessor method of the State classes. It can be used to get a variable from one specific query. In theory, it should get the whole data set if query_i=None, but this is not currently implemented in any of the States.
Parameters:

get_current_queries()
Get the current queries made by the model.
This is useful to get back exactly to the state it was in before shutting down a review.
Returns: dict – The last known queries according to the state file.

get_feature_matrix(data_hash)
Get feature matrix out of the state.
Parameters: data_hash (str) – Hash of as_data object from which the matrix is derived.
Returns: np.ndarray or sklearn.sparse.csr_matrix – Feature matrix as computed by the feature extraction model.

pred_proba
Get last predicted probabilities.

restore(fp)
Restore or create state from a state file.
If the state file doesn’t exist, creates an empty state that is ready for storage.
Parameters: fp (str) – Path to file to restore/create.

set_current_queries(current_queries)
Set the current queries made by the model.
Parameters: current_queries (dict) – The last known queries, with {query_idx: query_method}.

set_final_labels(y)
Add/set final labels to state.
If final_labels does not exist yet, add it.
Parameters: y (np.array) – One dimensional integer numpy array with final inclusion labels.

set_labels(y)
Add/set labels to state.
If the labels do not exist, add them to the state.
Parameters: y (np.array) – One dimensional integer numpy array with inclusion labels.

settings
Get settings from state.


class asreview.state.HDF5State(state_fp, read_only=False)
Class for storing the review state with HDF5 storage.
Analysis

class asreview.analysis.Analysis(states, key=None)
Analysis object to do statistical analysis on state files.

avg_time_to_discovery(result_format='number')
Get the best/last estimate of how long it takes to find a paper.
Returns: dict – For each inclusion, key=paper_id, value=avg time.

classmethod from_dir(data_dir, prefix='', key=None)
Create an Analysis object from a directory.
Parameters:

classmethod from_file(data_fp, key=None)
Create an Analysis object from a file.
Parameters:

classmethod from_path(data_path, prefix='', key=None)
Create an Analysis object from either a file or a directory.

inclusions_found(result_format='fraction', final_labels=False, **kwargs)
Get the number of inclusions at each point in time.
Caching is used to prevent multiple calls being expensive.
Parameters:
Returns: tuple – Three numpy arrays with x, y, error_bar.

limits(prob_allow_miss=[0.1], result_format='percentage')
For each query, compute the number of papers for a criterion.
A criterion is the average number of papers missed. For example, with 0.1, the criterion is that after reading x papers, there is (about) a 10% chance that one paper is not included. Another example, with 2.0, there are on average 2 papers missed after reading x papers. The value for x is returned for each query and probability by the function.
Parameters: prob_allow_miss (list, float) – Sets the criterion for how many papers can be missed.
Returns: dict – One entry, “x_range”, with the number of papers read, and a list, “limits”, with the results for each probability at each number of papers read.

Extensions

class asreview.entry_points.BaseEntryPoint
Base class for defining entry points.