hojichar.filters
Filters for text preprocessing.
Each filter module can be imported directly from `hojichar`, i.e., you can write `from hojichar import document_filters`.
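As a minimal sketch of the import shortcut described above, the snippet below loads `document_filters` from the top-level package. It assumes hojichar may or may not be installed; the helper name `load_document_filters` is hypothetical and exists only to guard against a missing installation.

```python
import importlib.util


def load_document_filters():
    """Return the document_filters module, or None if hojichar is not installed.

    Hypothetical helper for illustration; it is not part of hojichar.
    """
    if importlib.util.find_spec("hojichar") is None:
        return None
    # Short form described above. The long form would be
    # `from hojichar.filters import document_filters`.
    from hojichar import document_filters

    return document_filters


module = load_document_filters()
print(module.__name__ if module else "hojichar is not installed")
```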
The filters are classified as follows:
- `hojichar.filters.document_filters` -- General text cleaners.
- `hojichar.filters.deduplication` -- Approximate deduplication processor, inspired by NEARDUP from https://arxiv.org/abs/2107.06499
- `hojichar.filters.token_filters` -- Per-token filters, for example to process a specific part of speech.
- `hojichar.filters.tokenization` -- Tokenizers, which split text into tokens. Here, a token is an arbitrary unit into which a sentence is split so that it can be processed with `hojichar.filters.token_filters`.
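The relationship between tokenization and per-token filtering can be sketched without the library. The example below is a conceptual illustration only: `whitespace_tokenize`, `drop_tokens`, and `is_url` are hypothetical stand-ins, not hojichar classes or functions, and whitespace splitting is just one possible choice of "arbitrary unit".

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: split text into whitespace-delimited tokens."""
    return text.split()


def drop_tokens(tokens: List[str], should_drop: Callable[[str], bool]) -> List[str]:
    """Hypothetical per-token filter: keep tokens the predicate does not reject."""
    return [token for token in tokens if not should_drop(token)]


def is_url(token: str) -> bool:
    """Predicate applied to one token at a time."""
    return token.startswith(("http://", "https://"))


tokens = whitespace_tokenize("See https://example.com for details")
cleaned = " ".join(drop_tokens(tokens, is_url))
print(cleaned)  # See for details
```

The point of the split is that the filter only ever sees one token, so the choice of tokenizer decides what granularity the filter operates on.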
# flake8: noqa
"""
Filters for text preprocess.

Each filter module can directly import from `hojichar`. i.e., you can import as `from hojichar import document_filters`.

How we classify each filter is below:
- `hojichar.filters.document_filters`-- General text cleaners.
- `hojichar.filters.deduplication`-- Approximate deduplicate processor, inspired by NEARDUP from https://arxiv.org/abs/2107.06499
- `hojichar.filters.token_filters`-- A per-token filter. For example, to process a specific part of speech.
- `hojichar.filters.tokenization`-- Tokenizer, which splits texts into tokens. Here, a token is an arbitrary unit for splitting a sentence and processing it with `hojichar.filters.token_filters`.
"""