kyoto_reader.reader module

class kyoto_reader.reader.ArchiveHandler(path: pathlib.Path)

Bases: object

__init__(path: pathlib.Path) → None

classmethod is_supported_path(path: pathlib.Path) → bool

open() → Union[tarfile.TarFile, zipfile.ZipFile]

open_member(archive: Union[tarfile.TarFile, zipfile.ZipFile], member: str) → BinaryIO
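The signatures above suggest a two-step pattern: open the archive once, then open individual members as binary streams. The following is a minimal standard-library sketch of that pattern, not the library's implementation; the free functions `open_archive` and `open_member` and the alias `ArchiveFile` are written here only for illustration.

```python
import tarfile
import zipfile
from pathlib import Path
from typing import BinaryIO, Union

# Illustrative alias for the two archive objects named in the signatures.
ArchiveFile = Union[tarfile.TarFile, zipfile.ZipFile]


def open_archive(path: Path) -> ArchiveFile:
    """Open a .tar.gz or .zip archive for reading (hypothetical helper)."""
    if path.name.endswith(".tar.gz"):
        return tarfile.open(path, mode="r:gz")
    if path.name.endswith(".zip"):
        return zipfile.ZipFile(path, mode="r")
    raise ValueError(f"unsupported archive: {path}")


def open_member(archive: ArchiveFile, member: str) -> BinaryIO:
    """Return a binary file object for one member of an open archive."""
    if isinstance(archive, tarfile.TarFile):
        extracted = archive.extractfile(member)
        if extracted is None:  # directories and special members carry no data
            raise ValueError(f"{member} is not a regular file")
        return extracted
    return archive.open(member, mode="r")
```

Both archive objects are context managers, so a caller can wrap `open_archive(path)` in a `with` block and read several members before the archive is closed.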

class kyoto_reader.reader.ArchiveType

Bases: enum.Enum

Enum for file collection types.

TAR_GZ = '.tar.gz'

ZIP = '.zip'
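Because the enum values are file name suffixes, they can be used to recognize an archive from its path. The sketch below redefines a local `ArchiveType` with the two documented members so that it runs on its own; `detect_archive_type` is a hypothetical helper, and the library's `is_supported_path` may work differently.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class ArchiveType(Enum):
    """Local copy of the two documented members; values are name suffixes."""
    TAR_GZ = ".tar.gz"
    ZIP = ".zip"


def detect_archive_type(path: Path) -> Optional[ArchiveType]:
    """Return the matching ArchiveType, or None if the path is not an archive."""
    for archive_type in ArchiveType:
        # Compare against the full name: Path.suffix would only see ".gz"
        # for "corpus.tar.gz" and could not tell it apart from "doc.knp.gz".
        if path.name.endswith(archive_type.value):
            return archive_type
    return None
```

Matching on the full file name rather than `Path.suffix` keeps a gzip-compressed single file such as `doc.knp.gz` from being mistaken for a tar archive.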

class kyoto_reader.reader.FileHandler(path: pathlib.Path)

Bases: object

__init__(path: pathlib.Path) → None

content_basename

open(*args, **kwargs) → TextIO

class kyoto_reader.reader.FileType

Bases: enum.Enum

Enum for file types.

GZ = '.gz'

UNCOMPRESSED = ''
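The two file types map naturally onto two ways of obtaining a text stream. The sketch below shows one way to open either kind as text with the standard library; `open_text` is a hypothetical helper, and the UTF-8 default is an assumption rather than something this page states.

```python
import gzip
from pathlib import Path
from typing import TextIO


def open_text(path: Path, encoding: str = "utf-8") -> TextIO:
    """Open a plain or gzip-compressed file as a text stream."""
    if path.suffix == ".gz":
        # "rt" makes gzip decompress and decode, so callers get str, not bytes.
        return gzip.open(path, mode="rt", encoding=encoding)
    return path.open(mode="rt", encoding=encoding)
```

Callers can then treat `doc.knp` and `doc.knp.gz` identically, for example `with open_text(path) as f: text = f.read()`.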

class kyoto_reader.reader.KyotoReader(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True)

Bases: object

A class to manage a set of corpus documents. Compressed files are supported, but nested compression (e.g. a .knp.gz file inside a zip archive) is not.

Parameters:

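source (Union[Path, str]) – Path to the target document(s). If a directory is given, all files in it are targeted.
target_cases (Optional[Collection[str]]) – Cases to extract. (default: all cases)
target_corefs (Optional[Collection[str]]) – Coreference relations to extract (e.g. =). (default: all relations)
extract_nes (bool) – Whether to extract named entities from the corpus. (default: True)
relax_cases (bool) – Whether to treat cases such as ガ≒ as the ガ case. (default: False)
knp_ext (str) – File extension of KWDLC or KC files. (default: '.knp')
pickle_ext (str) – File extension used when reading Documents in pickle format. (default: '.pkl')
use_pas_tag (bool) – Whether to read PAS from <述語項構造:> tags instead of <rel> tags. (default: False)
n_jobs (int) – Number of parallel processes used to load documents. 0: no parallelism, -1: the number of cores. (default: -1)
did_from_sid (bool) – Whether to determine the document ID from the S-IDs in the document. (default: True)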
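The n_jobs convention in the parameter list (0 for no parallelism, -1 for the number of cores) can be sketched as a small resolver. `resolve_n_jobs` is a hypothetical helper written for illustration; how KyotoReader interprets other negative values is not documented here, so this sketch simply rejects them.

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Translate the documented n_jobs convention into a worker count.

    Returns 0 to mean "run serially", otherwise a positive worker count.
    """
    if n_jobs == -1:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    if n_jobs < 0:
        # Assumption: values below -1 are treated as invalid in this sketch.
        raise ValueError(f"invalid n_jobs: {n_jobs}")
    return n_jobs
```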

Note

Supported input paths (i.e. the source argument):
- a single file (.knp, .knp.gz, .pkl, .pkl.gz)
- a directory containing single files
- an archive file (.tar.gz, .zip) containing single uncompressed files
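The three supported input kinds can be told apart from the path alone. The following sketch classifies a source path along those lines; `classify_source` and its string labels are hypothetical and only mirror the note, they are not part of kyoto_reader.

```python
from pathlib import Path

ARCHIVE_SUFFIXES = (".tar.gz", ".zip")


def classify_source(source: Path, knp_ext: str = ".knp", pickle_ext: str = ".pkl") -> str:
    """Label a source path as 'directory', 'archive', 'file', or 'unsupported'."""
    if source.is_dir():
        return "directory"
    name = source.name
    # Check archives first: ".tar.gz" also ends with ".gz".
    if name.endswith(ARCHIVE_SUFFIXES):
        return "archive"
    # Strip an optional gzip suffix, then look at the document extension.
    base = name[: -len(".gz")] if name.endswith(".gz") else name
    if base.endswith((knp_ext, pickle_ext)):
        return "file"
    return "unsupported"
```

The archive check deliberately comes before the `.gz` handling, since a `.tar.gz` archive would otherwise be read as a gzip-compressed single file.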

__init__(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True) → None

get_knp(did: str) → str

process_all_documents(n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]]

Process all documents that KyotoReader has loaded.

Parameters:

n_jobs (Optional[int]) – The number of processes to spawn for this task. (default: inherited from self)

process_document(doc_id: str, archive: Union[tarfile.TarFile, zipfile.ZipFile, None] = None) → Optional[kyoto_reader.document.Document]

Process the document with the given document ID.

Parameters:

doc_id (str) – The ID of the document to process.
archive (Optional[ArchiveFile]) – An archive to read the document from.

process_documents(doc_ids: Iterable[str], n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]]

Process the documents with the given document IDs.

Parameters:

doc_ids (Iterable[str]) – IDs of the documents to process.
n_jobs (Optional[int]) – The number of processes to spawn for this task. (default: inherited from self)
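Two details of this signature are worth illustrating: n_jobs falls back to the instance setting when omitted, and the returned list may contain None entries. The sketch below uses a hypothetical stand-in class, `MiniReader`, which runs serially and returns strings instead of Document objects; when the real reader returns None is not stated on this page, so the failing loader here is purely an assumption.

```python
from typing import Callable, Iterable, List, Optional


class MiniReader:
    """Hypothetical stand-in for illustration; not kyoto_reader.KyotoReader."""

    def __init__(self, loader: Callable[[str], Optional[str]], n_jobs: int = 0) -> None:
        self.loader = loader
        self.n_jobs = n_jobs
        self.last_n_jobs: Optional[int] = None  # recorded only for inspection

    def process_documents(
        self, doc_ids: Iterable[str], n_jobs: Optional[int] = None
    ) -> List[Optional[str]]:
        # Compare with None explicitly: "n_jobs or self.n_jobs" would wrongly
        # discard an explicit 0, which is a valid "no parallelism" request.
        effective = self.n_jobs if n_jobs is None else n_jobs
        self.last_n_jobs = effective
        # A real reader would size a process pool from `effective`;
        # this sketch stays serial to remain self-contained.
        return [self.loader(doc_id) for doc_id in doc_ids]


def drop_missing(results: List[Optional[str]]) -> List[str]:
    """Keep only the entries that were actually produced."""
    return [result for result in results if result is not None]
```

Since the return type is a list of optional values, callers will usually want a filtering step like `drop_missing` before iterating over the documents.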