There are several balance strategies that rebalance and reorder the training data. This is sometimes necessary because the data is often very imbalanced: there are many more papers that should be excluded than included (otherwise, automation cannot help much anyway).
Parameters in the config file should be under the section
We have currently implemented the following balance strategies:
This strategy undersamples the data, leaving out some excluded papers so that included and excluded papers appear in a chosen ratio (closer to one than in the raw data). Configuration options are as follows:
# Set the ratio of included/excluded to 1
ratio=1.0
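The undersampling idea can be illustrated with a minimal sketch. This is not the project's implementation: the function name `undersample`, the `seed` argument, and the 0/1 label encoding are assumptions made for illustration. Only the meaning of `ratio` (included divided by excluded) comes from the option above.

```python
import random


def undersample(labels, ratio=1.0, seed=0):
    """Return shuffled indices of a subset with included/excluded close to `ratio`.

    Hypothetical helper: `labels` is a sequence of 1 (included) and 0 (excluded).
    """
    rng = random.Random(seed)
    included = [i for i, y in enumerate(labels) if y == 1]
    excluded = [i for i, y in enumerate(labels) if y == 0]

    # Nothing to balance if one of the classes is empty.
    if not included or not excluded:
        return list(range(len(labels)))

    # Keep enough excluded papers that included / excluded is about `ratio`,
    # but never more than are available.
    n_keep = min(len(excluded), max(1, round(len(included) / ratio)))
    kept_excluded = rng.sample(excluded, n_keep)

    # All included papers are kept; the result is reordered at random.
    indices = included + kept_excluded
    rng.shuffle(indices)
    return indices


labels = [1] * 3 + [0] * 10
train_idx = undersample(labels, ratio=1.0)
print(len(train_idx))
```

With three included and ten excluded papers, `ratio=1.0` keeps all three included papers and three randomly chosen excluded ones. A smaller ratio keeps more excluded papers, up to the full set.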
This strategy divides the training data into three sets: included papers, excluded papers found with random sampling, and papers found with max sampling. The sets are balanced using formulas that depend on the percentage of papers in the dataset that have been read, the number of papers found with random and max sampling, and similar quantities. It works best with stochastic training algorithms. With corresponding parameter choices, it reduces to either full sampling or undersampling.
a=2.155
alpha=0.94
b=0.789
beta=1.0
max_c=0.835
max_gamma=2.0
shuffle=True