kyoto_reader.reader module

class kyoto_reader.reader.ArchiveHandler(path: pathlib.Path)[source]

Bases: object

__init__(path: pathlib.Path) → None[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod is_supported_path(path: pathlib.Path) → bool[source]

open() → Union[tarfile.TarFile, zipfile.ZipFile][source]

open_member(archive: Union[tarfile.TarFile, zipfile.ZipFile], member: str) → BinaryIO[source]

class kyoto_reader.reader.ArchiveType[source]

Bases: enum.Enum

Enum for file collection types.

TAR_GZ = '.tar.gz'
ZIP = '.zip'
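
As a rough illustration of how the two archive types above can be told apart and read with the standard library, the sketch below detects the type from the file suffix and returns the bytes of one member. The names `SketchArchiveType`, `detect_archive_type`, `read_member`, and `demo` are hypothetical and are not part of kyoto_reader; the actual `ArchiveHandler` implementation may differ.

```python
import enum
import tarfile
import tempfile
import zipfile
from pathlib import Path
from typing import Optional


class SketchArchiveType(enum.Enum):
    """Hypothetical mirror of the enum values documented above."""
    TAR_GZ = ".tar.gz"
    ZIP = ".zip"


def detect_archive_type(path: Path) -> Optional[SketchArchiveType]:
    """Return the archive type whose suffix matches the file name, if any."""
    for archive_type in SketchArchiveType:
        if path.name.endswith(archive_type.value):
            return archive_type
    return None


def read_member(path: Path, member: str) -> bytes:
    """Read one member of a .tar.gz or .zip archive as raw bytes."""
    archive_type = detect_archive_type(path)
    if archive_type is SketchArchiveType.TAR_GZ:
        with tarfile.open(path, mode="r:gz") as archive:
            extracted = archive.extractfile(member)
            if extracted is None:  # directories and special files carry no data
                raise ValueError(f"{member} is not a regular file")
            return extracted.read()
    if archive_type is SketchArchiveType.ZIP:
        with zipfile.ZipFile(path) as archive:
            with archive.open(member) as f:
                return f.read()
    raise ValueError(f"unsupported archive: {path}")


def demo() -> bytes:
    """Build a tiny zip archive in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "corpus.zip"
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("doc1.knp", "# S-ID:doc1-1\nEOS\n")
        return read_member(archive_path, "doc1.knp")
```

Matching on the full file name rather than `Path.suffix` is deliberate: `Path("a.tar.gz").suffix` is only `'.gz'`, which would not distinguish a tarball from a single gzip-compressed file.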
class kyoto_reader.reader.FileHandler(path: pathlib.Path)[source]

Bases: object

__init__(path: pathlib.Path) → None[source]

Initialize self. See help(type(self)) for accurate signature.

content_basename

open(*args, **kwargs) → TextIO[source]

class kyoto_reader.reader.FileType[source]

Bases: enum.Enum

Enum for file types.

GZ = '.gz'
UNCOMPRESSED = ''
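
The two file types above suggest that a single file is opened either through gzip or directly. The following is a minimal sketch of that idea using only the standard library. `open_text`, `strip_compression_suffix`, and `demo` are hypothetical helpers, the UTF-8 encoding is an assumption, and stripping the `.gz` suffix is only one plausible reading of what `content_basename` returns.

```python
import gzip
import tempfile
from pathlib import Path
from typing import TextIO


def open_text(path: Path) -> TextIO:
    """Open a plain or gzip-compressed file in text mode (UTF-8 assumed)."""
    if path.suffix == ".gz":
        return gzip.open(path, mode="rt", encoding="utf-8")
    return path.open(mode="rt", encoding="utf-8")


def strip_compression_suffix(path: Path) -> str:
    """Drop a trailing '.gz' so 'doc1.knp.gz' and 'doc1.knp' give the same name."""
    if path.suffix == ".gz":
        return path.name[: -len(".gz")]
    return path.name


def demo() -> str:
    """Write a small gzip-compressed file and read it back as text."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "doc1.knp.gz"
        with gzip.open(path, mode="wt", encoding="utf-8") as f:
            f.write("EOS\n")
        with open_text(path) as f:
            return f.read()
```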
class kyoto_reader.reader.KyotoReader(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True)[source]

Bases: object

A class to manage a set of corpus documents. Compressed files are supported. However, nested compression (e.g. a .knp.gz file inside a zip archive) is not supported.

Parameters:
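  • source (Union[Path, str]) – Path to the target documents. If a directory is given, all files in it are targeted.
  • target_cases (Optional[Collection[str]]) – Cases to extract. (default: all cases)
  • target_corefs (Optional[Collection[str]]) – Coreference relations to extract (e.g. =). (default: all relations)
  • extract_nes (bool) – Whether to extract named entities from the corpus (default: True)
  • relax_cases (bool) – Whether to treat relaxed cases such as ガ≒ as the ガ case (default: False)
  • knp_ext (str) – File extension of KWDLC or KC files (default: '.knp')
  • pickle_ext (str) – File extension used when reading a Document in pickle format (default: '.pkl')
  • use_pas_tag (bool) – Whether to read PAS from <述語項構造:> tags instead of <rel> tags (default: False)
  • n_jobs (int) – Number of parallel processes used to load documents. 0: no parallel processing; -1: number of cores (default: -1)
  • did_from_sid (bool) – Whether to determine the document ID from the S-ID in the document (default: True)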

Note

Supported input paths (i.e. the source argument):
  • a single file (.knp, .knp.gz, .pkl, .pkl.gz)
  • a directory containing single files
  • an archive file (.tar.gz, .zip) containing single uncompressed files
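
A usage sketch based only on the signatures documented on this page. It assumes that the kyoto_reader package is installed and that `corpus_dir` (a hypothetical path) points to one of the supported inputs. The values in `target_cases` are illustrative, and `reader_options` and `load_corpus` are helper names introduced here, not part of the library.

```python
from pathlib import Path
from typing import Any, Dict


def reader_options() -> Dict[str, Any]:
    """Keyword arguments taken from the documented constructor signature."""
    return {
        "target_cases": ["ガ", "ヲ", "ニ"],  # illustrative; None means all cases
        "target_corefs": ["="],  # coreference relations to extract
        "extract_nes": True,
        "relax_cases": False,
        "knp_ext": ".knp",
        "n_jobs": 0,  # 0 disables parallel loading
    }


def load_corpus(corpus_dir: str):
    """Load every document found under corpus_dir (a hypothetical path)."""
    # Imported lazily so this sketch can be imported without kyoto_reader.
    from kyoto_reader.reader import KyotoReader

    reader = KyotoReader(Path(corpus_dir), **reader_options())
    # process_all_documents() returns a list whose items may be None.
    return [doc for doc in reader.process_all_documents() if doc is not None]
```

Setting `n_jobs` to 0 keeps loading in a single process, which tends to make errors easier to trace while experimenting.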

__init__(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True) → None[source]

Initialize self. See help(type(self)) for accurate signature.

get_knp(did: str) → str[source]

process_all_documents(n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]][source]

Process all documents that KyotoReader has loaded.

Parameters:
  • n_jobs (Optional[int]) – The number of processes spawned to finish this task. (default: None, which inherits the value from self)

process_document(doc_id: str, archive: Union[tarfile.TarFile, zipfile.ZipFile, None] = None) → Optional[kyoto_reader.document.Document][source]

Process the document with the given document ID.

Parameters:
  • doc_id (str) – The ID of the document to process.
  • archive (Optional[ArchiveFile]) – An archive to read the document from.

process_documents(doc_ids: Iterable[str], n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]][source]

Process the documents with the given document IDs.

Parameters:
  • doc_ids (Iterable[str]) – IDs of the documents to process.
  • n_jobs (Optional[int]) – The number of processes spawned to finish this task. (default: None, which inherits the value from self)
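
The n_jobs conventions described on this page (0 for no parallel processing, -1 for the number of cores, and None to fall back to the reader's own setting) can be summarized in a small helper. `resolve_n_jobs` is a hypothetical function written for illustration, not part of kyoto_reader, and rejecting other negative values is an assumption of this sketch.

```python
import os
from typing import Optional


def resolve_n_jobs(n_jobs: Optional[int], default: int) -> int:
    """Return the number of worker processes implied by n_jobs.

    None falls back to `default` (the value stored on the reader),
    -1 means one worker per CPU core, and 0 means no parallel processing.
    """
    if n_jobs is None:
        n_jobs = default
    if n_jobs == -1:
        # os.cpu_count() may return None on unusual platforms.
        return os.cpu_count() or 1
    if n_jobs < 0:
        raise ValueError("n_jobs must be -1, 0, or a positive integer")
    return n_jobs
```

For example, `resolve_n_jobs(None, 0)` yields 0 (sequential processing), while `resolve_n_jobs(4, -1)` yields 4 because an explicit argument takes precedence over the inherited default.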