asreview.data.base.BaseReader

class asreview.data.base.BaseReader

Bases: ABC

Base class for data readers.

Reading data from a file happens in three steps: reading the raw data, cleaning it, and turning it into Record instances. These steps are implemented by read_data, clean_data and to_records, respectively. Any subclass of BaseReader must implement read_data. clean_data and to_records have default implementations, which assume that read_data returns a pandas DataFrame. The default cleaning behavior can be customized in several ways; see the comments next to the class attributes.
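The three steps can be illustrated with a small stand-in that uses only the standard library. This is a sketch of the pattern, not asreview's implementation: SketchReader, SketchCSVReader and SketchRecord are hypothetical names, rows are plain dicts instead of a pandas DataFrame, and chaining the three steps inside read_records is an assumption, since this page does not document that method.

```python
import csv
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchRecord:
    """Hypothetical stand-in for asreview's Record class."""

    dataset_id: str
    title: str
    abstract: str


class SketchReader(ABC):
    """Hypothetical stand-in that mirrors the three-step layout."""

    @classmethod
    @abstractmethod
    def read_data(cls, fp):
        """Step 1: return the raw data without cleaning it."""

    @classmethod
    def clean_data(cls, rows):
        """Step 2: normalize column names and replace missing values."""
        return [
            {key.strip().lower(): (value or "").strip() for key, value in row.items()}
            for row in rows
        ]

    @classmethod
    def to_records(cls, rows, dataset_id=None):
        """Step 3: turn the cleaned rows into record objects."""
        return [
            SketchRecord(dataset_id, row.get("title", ""), row.get("abstract", ""))
            for row in rows
        ]

    @classmethod
    def read_records(cls, fp, dataset_id):
        # Assumption: read_records simply chains the three steps.
        raw = cls.read_data(fp)
        return cls.to_records(cls.clean_data(raw), dataset_id=dataset_id)


class SketchCSVReader(SketchReader):
    """Only read_data has to be provided by the subclass."""

    @classmethod
    def read_data(cls, fp):
        with open(fp, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    fp = Path(tmp) / "demo.csv"
    fp.write_text(" Title ,Abstract\nA study,\n", encoding="utf-8")
    records = SketchCSVReader.read_records(fp, dataset_id="demo")
```

The subclass only knows how to parse its file format; cleaning and record construction are inherited, which is the division of labor the class description implies.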

Methods

__init__()

clean_data(df)

Clean the raw data.

read_data(fp, *args, **kwargs)

Read the raw data from a file.

read_records(fp, dataset_id[, record_cls])

standardize_column_names(df)

Standardize column names of input data.

to_records(df[, dataset_id, record_cls])

Turn the cleaned data into records.

classmethod clean_data(df)

Clean the raw data.

Parameters:

df (pd.DataFrame) – Data to clean. This should be of the same type as the output of read_data.

Returns:

pd.DataFrame – Cleaned data. By default it standardizes the column names, some data types and missing values.

abstract classmethod read_data(fp, *args, **kwargs)

Read the raw data from a file.

The output type must match the input type expected by clean_data. Typically this is a pandas DataFrame, but a custom reader class can use a different data type.

This method should not perform any cleaning of the data. That way data writers can add columns to a dataset without changing the original data: use reader.read_data to get the data, add the column, then write the data to a file.

Parameters:

fp (Path) – Filepath of the file to read.

Returns:

pd.DataFrame – A dataframe of user input data that has not been cleaned yet.
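The read, add a column, write workflow described for read_data can be sketched as follows. This is a standard-library illustration under stated assumptions: read_raw is a hypothetical stand-in for a reader's read_data, write_with_column is a hypothetical helper, and CSV rows are used in place of a pandas DataFrame.

```python
import csv
import tempfile
from pathlib import Path


def read_raw(fp):
    """Hypothetical stand-in for read_data: parse the file, clean nothing."""
    with open(fp, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return reader.fieldnames, list(reader)


def write_with_column(fp_in, fp_out, name, values):
    """Add one column to the raw data and write the result to a new file."""
    fieldnames, rows = read_raw(fp_in)
    for row, value in zip(rows, values):
        row[name] = value
    with open(fp_out, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[*fieldnames, name])
        writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw.csv"
    dst = Path(tmp) / "with_column.csv"
    src.write_text(" Title ,Abstract\nA study,\n", encoding="utf-8")
    write_with_column(src, dst, "included", [1])
    output_lines = dst.read_text(encoding="utf-8").splitlines()
    source_text = src.read_text(encoding="utf-8")
```

Because nothing is cleaned on the way in, the untidy header " Title " is written back exactly as the user supplied it; only the new column is added.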

classmethod read_records(fp, dataset_id, record_cls=<class 'asreview.data.record.Record'>, *args, **kwargs)

classmethod standardize_column_names(df)

Standardize column names of input data.

The reader can accept multiple names for a specific type of data, for example both ‘title’ and ‘primary_title’ could refer to the column containing the title data. This function makes sure the correct columns are used. See also the attribute __alternative_column_names__ for customizing this behavior.

Parameters:

df (pd.DataFrame) – Dataframe containing raw data.

Returns:

pd.DataFrame – Dataframe with column names lowercased and stripped of whitespace. In addition, for each column in __alternative_column_names__, the first alternative column name found in the data is used as the source of that column's values.
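A minimal sketch of this renaming logic, operating on a list of column names rather than a DataFrame. The ALTERNATIVES mapping is hypothetical: the actual format of __alternative_column_names__ is not shown on this page. The sketch also assumes that "first" means the first entry of the alternatives list that occurs in the data.

```python
# Hypothetical mapping from a standard column name to accepted alternatives.
ALTERNATIVES = {"title": ["title", "primary_title"]}


def standardize(columns):
    """Map each original column name to its standardized name."""
    # Lowercase the names and strip surrounding whitespace.
    normalized = {col: col.strip().lower() for col in columns}
    rename = dict(normalized)
    present = set(normalized.values())
    for target, alternatives in ALTERNATIVES.items():
        # Pick the first listed alternative that occurs in the data.
        chosen = next((alt for alt in alternatives if alt in present), None)
        if chosen is None:
            continue
        for original, name in normalized.items():
            if name == chosen:
                rename[original] = target
    return rename


mapping = standardize([" Primary_Title ", "Abstract", "Year"])
```

With pandas, a mapping like this could be applied with df.rename(columns=mapping).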

classmethod to_records(df, dataset_id=None, record_cls=<class 'asreview.data.record.Record'>)

Turn the cleaned data into records.

Parameters:
  • df (pd.DataFrame) – Cleaned data.

  • dataset_id (str, optional) – Identifier of the dataset, by default None

  • record_cls (asreview.data.record.Base, optional) – Record class to use, by default Record

Returns:

list[Record] – List of records.