kyoto_reader.reader module
- class kyoto_reader.reader.ArchiveHandler(path: pathlib.Path)
Bases: object
- class kyoto_reader.reader.ArchiveType
Bases: enum.Enum
Enum for file collection types.
- TAR_GZ = '.tar.gz'
- ZIP = '.zip'
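The member values are file-name endings, so one natural use is to test a path against them. The following is a minimal sketch under that assumption: the enum is redeclared locally as a stand-in that mirrors the documented values, and `detect_archive_type` is a hypothetical helper, not a function of kyoto_reader.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class ArchiveType(Enum):
    """Local stand-in mirroring the documented member values."""
    TAR_GZ = ".tar.gz"
    ZIP = ".zip"


def detect_archive_type(path: Path) -> Optional[ArchiveType]:
    """Return the matching ArchiveType, or None if `path` is not an archive.

    Hypothetical helper: compares the end of the file name with each
    member value in turn.
    """
    for archive_type in ArchiveType:
        if path.name.endswith(archive_type.value):
            return archive_type
    return None
```

Comparing against the whole file name rather than `Path.suffix` matters here, because `Path("corpus.tar.gz").suffix` is only `.gz`.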
- class kyoto_reader.reader.FileHandler(path: pathlib.Path)
Bases: object
- __init__(path: pathlib.Path) → None
Initialize self. See help(type(self)) for accurate signature.
- content_basename
- class kyoto_reader.reader.FileType
Bases: enum.Enum
Enum for file types.
- GZ = '.gz'
- UNCOMPRESSED = ''
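As an illustration of what the two members distinguish, the sketch below reads a text file and decompresses it first when its name ends in `.gz`. The enum is a local stand-in with the documented values, and `read_text` is a hypothetical helper; how the library itself dispatches on the file type is not described on this page.

```python
import gzip
from enum import Enum
from pathlib import Path


class FileType(Enum):
    """Local stand-in mirroring the documented member values."""
    GZ = ".gz"
    UNCOMPRESSED = ""


def read_text(path: Path) -> str:
    """Read a UTF-8 text file, transparently handling gzip compression."""
    # Path.suffix is the last extension only, e.g. ".gz" for "doc.knp.gz".
    if path.suffix == FileType.GZ.value:
        file_type = FileType.GZ
    else:
        file_type = FileType.UNCOMPRESSED

    if file_type is FileType.GZ:
        # "rt" opens the gzip stream in text mode.
        with gzip.open(path, mode="rt", encoding="utf-8") as f:
            return f.read()
    return path.read_text(encoding="utf-8")
```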
- class kyoto_reader.reader.KyotoReader(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True)
Bases: object
A class to manage a set of corpus documents. Compressed files are supported, but nested compression (e.g. a .knp.gz file inside a zip archive) is not.
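Parameters: - source (Union[Path, str]) – Path to the target documents. If a directory is given, all files in it are processed.
- target_cases (Optional[Collection[str]]) – Cases to extract (default: all cases).
- target_corefs (Optional[Collection[str]]) – Coreference relations to extract, such as = (default: all relations).
- extract_nes (bool) – Whether to extract named entities from the corpus (default: True).
- relax_cases (bool) – Whether to treat cases such as ガ≒ as the ガ case (default: False).
- knp_ext (str) – File extension of KWDLC or KC files (default: '.knp').
- pickle_ext (str) – File extension used when reading Documents in pickle format (default: '.pkl').
- use_pas_tag (bool) – Whether to read PAS from <述語項構造:> tags instead of <rel> tags (default: False).
- n_jobs (int) – Number of parallel processes used to read documents. 0: no parallelism, -1: number of CPU cores (default: -1).
- did_from_sid (bool) – Whether to determine the document ID from the S-IDs in the document (default: True).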
Note
Supported input paths (i.e. the source argument):
- a single file (.knp, .knp.gz, .pkl, .pkl.gz)
- a directory containing single files
- an archive file (.tar.gz, .zip) containing single uncompressed files
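A minimal usage sketch, assuming the kyoto_reader package is installed and importable from `kyoto_reader.reader` as named on this page. The wrapper `load_corpus` is hypothetical, and the case and coreference values passed to the constructor are illustrative choices, not recommended settings.

```python
from pathlib import Path
from typing import List, Optional


def load_corpus(source: str, n_jobs: int = 0) -> List[Optional[object]]:
    """Read every document found under `source` with KyotoReader.

    `source` may be a single file, a directory, or a .tar.gz/.zip archive.
    """
    # Imported lazily so that this sketch can be imported without the package.
    from kyoto_reader.reader import KyotoReader

    reader = KyotoReader(
        Path(source),
        target_cases=["ガ", "ヲ", "ニ"],  # illustrative subset of cases
        target_corefs=["="],              # illustrative coreference relation
        extract_nes=True,
        n_jobs=n_jobs,                    # 0: no parallelism, -1: all cores
    )
    # Each element is a Document, or None, as the return type indicates.
    return reader.process_all_documents()
```

Passing `n_jobs=0` keeps everything in one process, which tends to make errors easier to trace while experimenting.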
- __init__(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True) → None
Initialize self. See help(type(self)) for accurate signature.
- process_all_documents(n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]]
Process all documents that KyotoReader has loaded.
Parameters: n_jobs (Optional[int]) – Number of processes to spawn for this task (default: inherited from the reader's own n_jobs setting).
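The n_jobs conventions described on this page (None inherits the reader's value, 0 disables parallelism, -1 means the number of cores) can be sketched as a small resolution function. `resolve_n_jobs` is a hypothetical helper written for illustration; the library's internal handling may differ.

```python
import os
from typing import Optional


def resolve_n_jobs(n_jobs: Optional[int], reader_default: int) -> int:
    """Return the number of worker processes to use (0 means run serially)."""
    if n_jobs is None:
        # Not given for this call: fall back to the reader's own setting.
        n_jobs = reader_default
    if n_jobs == -1:
        # -1 stands for "one process per CPU core".
        return os.cpu_count() or 1
    return n_jobs
```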
- process_document(doc_id: str, archive: Union[tarfile.TarFile, zipfile.ZipFile, None] = None) → Optional[kyoto_reader.document.Document]
Process the document with the given document ID.
Parameters: - doc_id (str) – ID of the document to process.
- archive (Optional[ArchiveFile]) – An archive to read the document from.
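The archive argument accepts a `tarfile.TarFile` or a `zipfile.ZipFile`. The sketch below shows one way to obtain such an object from a path using only the standard library. `open_archive` and the `ArchiveFile` alias are hypothetical names introduced here, and whether a caller is expected to open the archive this way is an assumption.

```python
import tarfile
import zipfile
from pathlib import Path
from typing import Union

# Alias matching the two archive classes named in the signature.
ArchiveFile = Union[tarfile.TarFile, zipfile.ZipFile]


def open_archive(path: Path) -> ArchiveFile:
    """Open a .tar.gz or .zip file for reading; the caller must close it."""
    if path.name.endswith(".tar.gz"):
        return tarfile.open(path, mode="r:gz")
    if path.name.endswith(".zip"):
        return zipfile.ZipFile(path, mode="r")
    raise ValueError(f"unsupported archive type: {path.name}")
```

Both return types are context managers, so the result can be used in a `with` block and handed to `process_document` as its archive argument inside that block.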
- process_documents(doc_ids: Iterable[str], n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]]
Process the documents with the given document IDs.
Parameters: - doc_ids (Iterable[str]) – IDs of the documents to process.
- n_jobs (Optional[int]) – Number of processes to spawn for this task (default: inherited from the reader's own n_jobs setting).
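Because the returned list may contain None entries, callers will usually want to separate usable documents from missing ones. The sketch below does this with a hypothetical helper, `split_results`. It assumes that results come back in the same order as the requested IDs and that None marks a document that could not be produced; neither point is stated on this page.

```python
from typing import Dict, List, Optional, Sequence, Tuple, TypeVar

D = TypeVar("D")  # stands in for kyoto_reader.document.Document


def split_results(
    doc_ids: Sequence[str],
    results: Sequence[Optional[D]],
) -> Tuple[Dict[str, D], List[str]]:
    """Pair each ID with its result and collect the IDs whose result is None."""
    loaded: Dict[str, D] = {}
    failed: List[str] = []
    # Assumption: results[i] corresponds to doc_ids[i].
    for doc_id, doc in zip(doc_ids, results):
        if doc is None:
            failed.append(doc_id)
        else:
            loaded[doc_id] = doc
    return loaded, failed
```

Note that `doc_ids` must be a sequence here rather than a one-shot iterator, since it is needed again after being passed to `process_documents`.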