Term frequency (TF) and inverse document frequency (IDF) are defined as
$$TF=\frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad IDF=\log\left(\frac{D}{DF(t)}\right)$$
where \(n_{i,j}\) is the count of term \(i\) in document \(j\), \(D\) is the total number of documents, and \(DF(t)\) is the number of documents containing term \(t\).
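As a minimal sketch of these formulas (the toy corpus below is a made-up example, not part of the project), TF-IDF can be computed directly in Python:

```python
import math
from collections import Counter

# Hypothetical toy corpus: a list of tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(term, doc):
    # TF = n_{i,j} / sum_k n_{k,j}: the term's count over all counts in the document
    counts = Counter(doc)
    return counts[term] / len(doc)

def idf(term, corpus):
    # IDF = log(D / DF(t)): D documents, DF(t) documents containing the term
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

print(tf("cat", docs[0]) * idf("cat", docs))  # TF-IDF weight of "cat" in doc 0
```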
VectorProcessor.py
The tokens are vectorized with a shallow neural network. This can be expressed compactly as $$S^{dict}_{out}=W^{wv}_{out,in}V^{word}_{in}$$ where \(in\) is the dimension of each word's vector, \(out\) is the size of the dictionary, and \(S^{dict}_{out}\) is the softmax output of a Word2Vec model.
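As a rough numerical sketch of this expression (the dimensions and random weights below are placeholders, not a trained model):

```python
import numpy as np

# Placeholder sizes: embedding dimension (in) and dictionary size (out).
in_dim, out_dim = 50, 10_000
V_word = np.random.randn(in_dim)        # the word's vector, V^{word}_{in}
W = np.random.randn(out_dim, in_dim)    # projection weights, W^{wv}_{out,in}

logits = W @ V_word
logits -= logits.max()                  # shift for a numerically stable softmax
S_dict = np.exp(logits) / np.exp(logits).sum()  # softmax output S^{dict}_{out}
print(S_dict.shape, S_dict.sum())       # (10000,) 1.0
```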
Word2Vec has two modes: continuous bag of words (CBOW) and skip-gram (SG). When a sufficiently large corpus is unavailable, we can load a pretrained model via the Gensim library and fine-tune it to meet specific needs, as sketched below.
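A minimal sketch with Gensim, assuming a toy corpus and the `glove-wiki-gigaword-50` vectors from gensim-data as stand-in examples:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Train from scratch on a toy corpus (placeholder); sg=1 is skip-gram, sg=0 is CBOW.
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["cat"].shape)  # (50,)

# Or load pretrained vectors from gensim-data (downloads on first use);
# "word2vec-google-news-300" is the classic Word2Vec release but is ~1.6 GB.
vectors = api.load("glove-wiki-gigaword-50")
print(vectors.most_similar("king", topn=3))
```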
DynamicProcessor.py
In this part, you will see some basic untrained models, purely for studying their structures and principles. In practice, under limited data and computational resources, we directly use pretrained models from Hugging Face via the transformers library.
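For instance, a minimal sketch of loading a pretrained encoder with transformers (the `bert-base-uncased` checkpoint is just one common choice, and PyTorch is assumed as the backend):

```python
from transformers import AutoModel, AutoTokenizer

# Load a pretrained tokenizer and encoder (downloads on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The tokens are vectorized.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```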