Feature extraction is the process of converting a list of texts into a feature matrix, with one row per text.
Parameters in the config file should be under the section
We have currently implemented the following feature extraction methods:
Uses the standard TF-IDF (Term Frequency-Inverse Document Frequency) feature extraction from scikit-learn.
It produces a sparse matrix as output and works well in combination with Naive Bayes and other fast-training models, given that the feature vectors are relatively wide.
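As a rough illustration of what TF-IDF feature extraction produces, the sketch below calls scikit-learn's TfidfVectorizer directly with its default settings. The example texts are made up, and the settings used internally by this package may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example texts; in practice these would be titles/abstracts.
texts = [
    "Deep learning for systematic reviews",
    "Active learning reduces screening workload",
    "Systematic reviews of screening tools",
]

# Default settings: lowercasing, word tokens of 2+ characters, no stop words.
vectorizer = TfidfVectorizer()

# One row per text, one column per vocabulary term, stored as a sparse matrix.
X = vectorizer.fit_transform(texts)

print(type(X).__name__, X.shape)
print("stored non-zero values:", X.nnz)
```

Because each text only contains a handful of the vocabulary terms, most entries are zero, which is why the output is kept sparse.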
Feature extraction method provided by the gensim package. To use it, please install the gensim package manually:
pip install gensim
Creating a feature matrix with this method takes relatively long. However, this only has to be done once per simulation/review. The upside of this method is the dimensionality reduction that generally takes place, which makes modelling quicker.
Feature extraction method that takes the average of the word embeddings of the words in a text, with each embedding multiplied by the inverse document frequency of that word.
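The idea can be sketched with NumPy as follows. This is a minimal illustration, not the package's implementation: the three-dimensional embeddings are hypothetical toy vectors, the texts are assumed to be tokenized already, and the IDF is assumed to be log(N / document frequency), which may differ from the variant actually used.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical toy embeddings; real ones would come from a pretrained model.
embeddings = {
    "active": np.array([1.0, 0.0, 0.0]),
    "learning": np.array([0.0, 1.0, 0.0]),
    "review": np.array([0.0, 0.0, 1.0]),
}
dim = 3

# Pre-tokenized example texts (assumption: tokenization happens elsewhere).
texts = [["active", "learning"], ["learning", "review"], ["review"]]

# Document frequency: in how many texts does each word occur?
n_docs = len(texts)
doc_freq = Counter(word for tokens in texts for word in set(tokens))

# Assumed IDF variant: log(N / df).
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def embed(tokens):
    """Average the IDF-weighted embeddings of the known words in one text."""
    vectors = [idf[w] * embeddings[w] for w in tokens if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Dense feature matrix: one row per text, one column per embedding dimension.
X = np.vstack([embed(tokens) for tokens in texts])
print(X.shape)
```

Unlike TF-IDF, the result is a dense matrix whose width equals the embedding dimension rather than the vocabulary size.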
Feature extraction method for LSTM/RNN models.