hojichar.filters

Filters for text preprocessing.

Each filter module can be imported directly from hojichar; for example, from hojichar import document_filters.

The filters are classified as follows:

# flake8: noqa
"""
Filters for text preprocessing.

Each filter module can be imported directly from `hojichar`; for example, `from hojichar import document_filters`.

The filters are classified as follows:
- `hojichar.filters.document_filters` -- General text cleaners.
- `hojichar.filters.deduplication` -- Approximate deduplication processor, inspired by NEARDUP from https://arxiv.org/abs/2107.06499
- `hojichar.filters.token_filters` -- Per-token filters, for example to process tokens of a specific part of speech.
- `hojichar.filters.tokenization` -- Tokenizers, which split text into tokens. Here, a token is an arbitrary unit for splitting a sentence so that it can be processed with `hojichar.filters.token_filters`.
"""
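
The import shortcut described in the docstring can be checked with a small sketch. It assumes the `hojichar` package may or may not be installed, so it returns `None` when the package is missing. The helper name `load_document_filters` is hypothetical and not part of the library.

```python
import importlib.util


def load_document_filters():
    """Import document_filters by both paths (hypothetical helper).

    Returns a (shortcut, full_path) tuple of module objects,
    or None when hojichar is not installed.
    """
    if importlib.util.find_spec("hojichar") is None:
        return None  # hojichar is not available in this environment

    # Shortcut import, as described in the docstring.
    from hojichar import document_filters as shortcut

    # Fully qualified import of the same filter module.
    from hojichar.filters import document_filters as full_path

    return shortcut, full_path


modules = load_document_filters()
if modules is None:
    print("hojichar is not installed")
else:
    shortcut, full_path = modules
    # Both names are expected to refer to the same module object.
    print(shortcut is full_path)
```

If the two imports resolve to the same module object, the shortcut is only a convenience and either form can be used interchangeably.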