kyoto-reader: A processor for KWDLC, KyotoCorpus, and AnnotatedFKCCorpus

About

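kyoto-reader parses corpora released by Kyoto University that are annotated with predicate-argument structures and coreference relations, and provides an interface for handling them from Python. Because the tool is a wrapper around pyknp, morphological information and dependency relations are also available.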

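Available corpora
Name                                                        Domain                             Size
Kyoto University Web Document Leads Corpus (KWDLC)          Web text                           16,038 sentences
Kyoto University Text Corpus (KyotoCorpus)                  Newspaper articles and editorials  15,872 sentences
Complaint Survey Dataset Annotated Corpus (AnnotatedFKCCorpus)  Posts about complaints         1,282 sentences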

Requirements

Install kyoto-reader

$ pip install kyoto-reader

or

$ git clone https://github.com/ku-nlp/kyoto-reader
$ cd kyoto-reader
$ python setup.py install [--prefix=path]

A Brief Explanation of KWDLC and other corpora

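KWDLC, KyotoCorpus, and AnnotatedFKCCorpus are all corpora of Japanese documents that have been manually annotated with morphological and syntactic information as well as predicate-argument structures and coreference relations.
KWDLC annotates about 5,000 documents, each consisting of three sentences extracted from the web.
KyotoCorpus covers Mainichi Shimbun newspaper articles: 40,000 sentences are annotated with morphological and syntactic information, and about 10,000 of those are also annotated with predicate-argument structures and coreference relations.
AnnotatedFKCCorpus annotates about 1,300 sentences of complaint texts collected from the general public.
Predicate-argument structures and coreference relations are annotated with <rel> tags.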

An example from KWDLC:

# S-ID:w201106-0000060050-1 JUMAN:6.1-20101108 KNP:3.1-20101107 DATE:2011/06/21 SCORE:-44.94406 MOD:2017/10/15 MEMO:
* 2D
+ 1D
コイン こいん コイン 名詞 6 普通名詞 1 * 0 * 0
+ 3D <rel type="ガ" target="不特定:人"/><rel type="ヲ" target="コイン" sid="w201106-0000060050-1" id="0"/>
トス とす トス 名詞 6 サ変名詞 2 * 0 * 0
を を を 助詞 9 格助詞 1 * 0 * 0
* 2D
+ 3D
3 さん 3 名詞 6 数詞 7 * 0 * 0
回 かい 回 接尾辞 14 名詞性名詞助数辞 3 * 0 * 0
* -1D
+ -1D <rel type="ガ" target="不特定:人"/><rel type="ガ" mode="?" target="読者"/><rel type="ガ" mode="?" target="著者"/><rel type="ヲ" target="トス" sid="w201106-0000060050-1" id="1"/>
行う おこなう 行う 動詞 2 * 0 子音動詞ワ行 12 基本形 2
。 。 。 特殊 1 句点 1 * 0 * 0
EOS
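
The <rel> annotations in the example above can be inspected with a few lines of standard-library Python. The following is a minimal sketch for illustration only: the helper name parse_rels and both regular expressions are assumptions, and kyoto-reader (a wrapper around pyknp) does not necessarily read the tags this way.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration; not taken from kyoto-reader.
REL_PATTERN = re.compile(r'<rel ([^>]*?)/>')
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_rels(line: str) -> List[Dict[str, str]]:
    """Return one attribute dict per <rel .../> tag found in a KNP line."""
    return [dict(ATTR_PATTERN.findall(body)) for body in REL_PATTERN.findall(line)]


# The base-phrase line for "トス" from the KWDLC example.
line = ('+ 3D <rel type="ガ" target="不特定:人"/>'
        '<rel type="ヲ" target="コイン" sid="w201106-0000060050-1" id="0"/>')
rels = parse_rels(line)
for rel in rels:
    print(rel['type'], '->', rel['target'])
```

Each dict carries the case in type and the argument in target. In this example the tag for the exophoric argument (不特定:人) has no sid or id attribute, while the tag for the in-text argument (コイン) has both.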

Usage

To read the file w201106-0000060050.knp, which contains the example data above:

from kyoto_reader import KyotoReader, Document

# An object that handles a collection of documents
reader = KyotoReader('w201106-0000060050.knp',  # path to a file or a directory
                     target_cases=['ガ', 'ヲ', 'ニ'],  # target only the ガ, ヲ, and ニ cases
                     target_corefs=['=', '=構', '=≒', '=構≒'],  # relations to treat as coreference
                     extract_nes=True  # also extract named entities from the corpus
                     )
print('Loaded documents:')
for doc_id in reader.doc_ids:
    print(f'  Document ID: {doc_id}')

print('\n--- Predicate-argument structures ---')
document: Document = reader.process_document('w201106-0000060050')
for predicate in document.get_predicates():
    print(f'Predicate: {predicate.core}')
    for case, arguments in document.get_arguments(predicate).items():
        print(f'  {case} case: ', end='')
        print(', '.join(str(argument) for argument in arguments))

print('\n--- Tree format ---')
document.draw_tree(sid='w201106-0000060050-1', coreference=False)

Program output

Loaded documents:
  Document ID: w201106-0000060050

--- Predicate-argument structures ---
Predicate: トス
  ガ case: 不特定:人
  ヲ case: コイン
  ニ case:
Predicate: 行う
  ガ case: 不特定:人, 読者, 著者
  ヲ case: トス
  ニ case:

--- Tree format ---
コイン┐
  トスを─┐  ガ:不特定:人 ヲ:コイン
      3回┤
      行う。  ガ:読者,不特定:人,著者 ヲ:トス

CLI Interfaces

The kyoto command can display the contents of a corpus and process a corpus.

Browsing files

  • kyoto show: display the contents of a KNP file in a tree format (if a directory is given, all files in it are displayed)
$ kyoto show /path/to/knp/file.knp
  • kyoto list: list the document IDs contained in the specified directory
$ kyoto list /path/to/knp/directory

Processing Corpus

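Analyze the corpus and add extra features (requires KNP and JumanDIC).

  • kyoto configure: generate a Makefile in the corpus directory for adding features
    • Running make splits the corpus into one file per document and writes the feature-annotated files to the knp/ directory.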
$ kyoto configure --corpus-dir /path/to/downloaded/knp/directory --data-dir /path/to/output/directory --juman-dic-dir /path/to/JumanDIC/directory
created Makefile at /path/to/output/directory
$ cd /path/to/output/directory
$ make -j <num-parallel>
  • kyoto idsplit: split the corpus into train/dev/test files
$ kyoto idsplit --corpus-dir /path/to/knp/dir --output-dir /path/to/output/dir --train /path/to/train/id/file --dev /path/to/dev/id/file --test /path/to/test/id/file
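
The idea behind an ID-based split can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the implementation of kyoto idsplit: it assumes each ID file lists one document ID per line and that each document is stored as <doc_id>.knp. The function name split_by_ids and the document IDs are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def split_by_ids(corpus_dir: Path, output_dir: Path, id_files: Dict[str, Path]) -> None:
    """Copy <doc_id>.knp files into output_dir/<split>/ according to ID files."""
    for split, id_file in id_files.items():
        split_dir = output_dir / split
        split_dir.mkdir(parents=True, exist_ok=True)
        for doc_id in id_file.read_text().split():
            source = corpus_dir / f'{doc_id}.knp'
            if source.exists():  # IDs without a matching file are skipped
                shutil.copy(source, split_dir / source.name)


# Demonstration on a throwaway directory with hypothetical document IDs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    corpus = root / 'knp'
    corpus.mkdir()
    for doc_id in ('doc-a', 'doc-b', 'doc-c'):
        (corpus / f'{doc_id}.knp').write_text('EOS\n')

    id_files: Dict[str, Path] = {}
    for split, ids in (('train', ['doc-a']), ('dev', ['doc-b']), ('test', ['doc-c'])):
        id_files[split] = root / f'{split}.id'
        id_files[split].write_text('\n'.join(ids) + '\n')

    split_by_ids(corpus, root / 'out', id_files)
    copied = sorted(p.relative_to(root / 'out').as_posix()
                    for p in (root / 'out').rglob('*.knp'))
    print(copied)
```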

Zsh Completions

Completion for the kyoto command can be enabled by adding <virtualenv-path>/share/zsh/site-functions to FPATH (zsh only).

$ echo 'export FPATH=<virtualenv-path>/share/zsh/site-functions:$FPATH' >> ~/.zshrc

Documents

kyoto_reader package

Submodules

kyoto_reader.base_phrase module
class kyoto_reader.base_phrase.BasePhrase(tag: pyknp.knp.tag.Tag, dmid_offset: int, dtid: int, sid: str, doc_id: str, parent: Optional[BasePhrase] = None, children: Optional[List[BasePhrase]] = None)[source]

Bases: object

A class representing a base phrase that appears in a sentence.

tag

Tag object in pyknp.

Type:Tag
sid

Sentence ID.

Type:str
dtid

Document-wide tag ID.

Type:int
content_dmid

Document-wide morpheme ID of the content word in the base phrase.

Type:int
parent

Dependency parent.

Type:Optional[BasePhrase]
children

Dependency children.

Type:List[BasePhrase]
__init__(tag: pyknp.knp.tag.Tag, dmid_offset: int, dtid: int, sid: str, doc_id: str, parent: Optional[BasePhrase] = None, children: Optional[List[BasePhrase]] = None)[source]
Parameters:
  • tag (Tag) – Tag object in pyknp.
  • dmid_offset (int) – Document-wide morpheme ID of the previous morpheme.
  • dtid (int) – Document-wide tag ID.
  • sid (str) – Sentence ID.
  • doc_id (str) – Document ID.
  • parent (Optional[BasePhrase]) – Dependency parent.
  • children (List[BasePhrase]) – Dependency children.
core

A core expression without ancillary words.

dmid

Document-wide morpheme ID.

dmids

A list of document-wide morpheme IDs.

mrph2dmid

A mapping from morpheme to its document-wide ID.

mrph_list() → List[pyknp.juman.morpheme.Morpheme][source]

A list of morphemes

mrphs

A list of morphemes.

surf

A surface expression.

tid

Tag ID in pyknp.

kyoto_reader.cli module
kyoto_reader.cli.configure(args: argparse.Namespace)[source]

Create Makefile to preprocess corpus documents.

kyoto_reader.cli.idsplit(args: argparse.Namespace)[source]

Copy the files in a corpus to train, valid (dev), and test directories according to ID files.

kyoto_reader.cli.list_(args: argparse.Namespace)[source]

List the document IDs contained in the specified path.

kyoto_reader.cli.main()[source]

Entry point of CLI commands.

kyoto_reader.cli.show(args: argparse.Namespace)[source]

Show the specified document in a tree format.

kyoto_reader.constants module
kyoto_reader.coreference module
class kyoto_reader.coreference.Entity(eid: int, exophor: Optional[str] = None)[source]

Bases: object

A class to represent an entity in coreference. This class manages entity IDs of mentions that refer to this entity.

Parameters:
  • eid (int) – An Entity ID.
  • exophor (str, optional) – The kind of exophor if this entity corresponds to some exophor. Otherwise, None.
eid

An Entity ID.

Type:int
exophor

A string to represent exophor, such as “著者”, “読者”, and “不特定:人”.

Type:str, optional
mentions

A set of mentions that refer to this entity.

Type:Set[Mention]
mentions_unc

Mentions that have an uncertain relation with this entity.

Type:Set[Mention]
taigen

Whether this entity is 体言 or not.

Type:bool, optional
yougen

Whether this entity is 用言 or not.

Type:bool, optional
__init__(eid: int, exophor: Optional[str] = None)[source]

Initialize self. See help(type(self)) for accurate signature.

add_mention(mention: kyoto_reader.coreference.Mention, uncertain: bool) → None[source]

Add a mention that refers to this entity.

When a non-uncertain mention is added and the mention has already been registered as an uncertain mention, it will be overwritten as non-uncertain.

Parameters:
  • mention (Mention) – A mention
  • uncertain (bool) – Whether the mention is uncertain (i.e., annotated with “≒”).
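
The overwrite rule described for add_mention can be sketched as follows. This is a simplified stand-in, not the Entity class itself: mentions are plain strings, the class name EntitySketch is hypothetical, and the behaviour when an uncertain mention is added after a certain one (the certain registration is kept) is an assumption that the text above does not specify.

```python
from typing import Set


class EntitySketch:
    """Simplified stand-in for an entity that tracks certain and uncertain mentions."""

    def __init__(self, eid: int) -> None:
        self.eid = eid
        self.mentions: Set[str] = set()      # certain mentions
        self.mentions_unc: Set[str] = set()  # uncertain mentions (annotated with "≒")

    def add_mention(self, mention: str, uncertain: bool) -> None:
        if uncertain:
            # Assumption: never downgrade a mention that is already certain.
            if mention not in self.mentions:
                self.mentions_unc.add(mention)
        else:
            # A certain registration overwrites an earlier uncertain one.
            self.mentions_unc.discard(mention)
            self.mentions.add(mention)

    @property
    def all_mentions(self) -> Set[str]:
        return self.mentions | self.mentions_unc


entity = EntitySketch(eid=0)
entity.add_mention('コイン', uncertain=True)
entity.add_mention('コイン', uncertain=False)
print(entity.mentions, entity.mentions_unc)
```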
all_mentions

All mentions that refer to this entity, including uncertain ones.

is_special

Whether this entity corresponds to a special entity, such as an exophor.

remove_mention(mention: kyoto_reader.coreference.Mention) → None[source]

Remove a mention that is managed by this entity.

class kyoto_reader.coreference.Mention(bp: kyoto_reader.base_phrase.BasePhrase)[source]

Bases: kyoto_reader.base_phrase.BasePhrase

A class to represent a mention in coreference.

Parameters:bp (BasePhrase) – A base phrase object that corresponds to this mention.
eids

Entity IDs.

Type:set
eids_unc

Uncertain entity IDs. “Uncertain” means the mention is annotated with “≒”.

Type:set
__init__(bp: kyoto_reader.base_phrase.BasePhrase)[source]

Parameters:bp (BasePhrase) – A base phrase object that corresponds to this mention.

all_eids

All entity IDs this mention refers to.

is_uncertain_to(entity: kyoto_reader.coreference.Entity) → bool[source]

Whether this mention has an uncertain relation with the specified entity.

kyoto_reader.document module
class kyoto_reader.document.Document(knp_string: str, doc_id: str, cases: Collection[str], corefs: Collection[str], relax_cases: bool, extract_nes: bool, use_pas_tag: bool)[source]

Bases: object

A class to represent a document of KWDLC, KyotoCorpus, or AnnotatedFKCCorpus.

Parameters:
  • knp_string (str) – KNP format string of the document.
  • doc_id (str) – A document ID.
  • cases (Collection[str]) – Cases to extract.
  • corefs (Collection[str]) – Coreference relations to extract.
  • relax_cases (bool) – Whether to consider relations with “≒” as those without “≒” (e.g. ガ≒格 -> ガ格).
  • extract_nes (bool) – Whether to extract named entities.
  • use_pas_tag (bool) – Whether to read predicate-argument structures from <述語項構造: > tags, not <rel> tags.
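
The effect of relax_cases combined with a list of target cases can be sketched as below. This is an illustration under assumptions, not the Document implementation: the helper name normalize_case is hypothetical, and it assumes that a case outside the target list is simply dropped.

```python
from typing import Collection, Optional


def normalize_case(case: str, target_cases: Collection[str], relax_cases: bool) -> Optional[str]:
    """Return the case name to keep, or None if the case is not a target."""
    if relax_cases and case.endswith('≒'):
        case = case[:-1]  # e.g. 'ガ≒' -> 'ガ'
    return case if case in target_cases else None


targets = ['ガ', 'ヲ', 'ニ']
print(normalize_case('ガ≒', targets, relax_cases=True))   # ガ
print(normalize_case('ガ≒', targets, relax_cases=False))  # None
print(normalize_case('ヲ', targets, relax_cases=False))    # ヲ
```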
knp_string

KNP format string of the document.

Type:str
doc_id

A document ID.

Type:str
cases

Cases to extract.

Type:Collection[str]
corefs

Coreference relations to extract.

Type:Collection[str]
extract_nes

Whether to extract named entities.

Type:bool
sid2sentence

A mapping from a sentence ID to the corresponding sentence.

Type:Dict[str, Sentence]
mentions

A mapping from a document-wide tag ID to the corresponding mention.

Type:Dict[int, Mention]
entities

A mapping from an entity ID to the corresponding entity.

Type:Dict[int, Entity]
named_entities

Extracted named entities.

Type:List[NamedEntity]
__init__(knp_string: str, doc_id: str, cases: Collection[str], corefs: Collection[str], relax_cases: bool, extract_nes: bool, use_pas_tag: bool) → None[source]

Initialize self. See help(type(self)) for accurate signature.

bnst_list() → List[pyknp.knp.bunsetsu.Bunsetsu][source]

Return a list of pyknp Bunsetsu objects.

bp_list() → List[kyoto_reader.base_phrase.BasePhrase][source]

Return list of base phrases.

draw_tree(sid: Optional[str] = None, coreference: bool = True, fh: Optional[TextIO] = None) → None[source]

Write out the PAS and coreference relations in the specified sentence in a tree format.

If sid is not specified, write out trees in all the sentences in this document.

Parameters:
  • sid (str, optional) – A sentence ID of the target sentence.
  • coreference (bool) – If True, write out coreference relations as well.
  • fh (TextIO, optional) – The output stream.
get_arguments(predicate: kyoto_reader.base_phrase.BasePhrase, relax: bool = False, include_optional: bool = False) → Dict[str, List[kyoto_reader.pas.BaseArgument]][source]

Return all the arguments that the given predicate has.

Parameters:
  • predicate (Predicate) – A predicate.
  • relax (bool) – If True, return arguments that have a coreference relation with the arguments the predicate has.
  • include_optional (bool) – If True, return adverbial arguments such as “すぐに” as well.
Returns:

A mapping from a case to arguments.

Return type:

Dict[str, List[BaseArgument]]

get_entities(bp: kyoto_reader.base_phrase.BasePhrase, include_uncertain: bool = False) → List[kyoto_reader.coreference.Entity][source]

Return list of entities that the specified mention refers to. The mention is given as a type of BasePhrase.

Parameters:
  • bp (BasePhrase) – A base phrase that corresponds to the mention.
  • include_uncertain (bool) – Whether to return entities that have an uncertain relation with the mention.
get_predicates() → List[kyoto_reader.base_phrase.BasePhrase][source]

Return list of predicates.

get_siblings(mention: kyoto_reader.coreference.Mention, relax: bool = False) → Set[kyoto_reader.coreference.Mention][source]

Return all the mentions that have coreference chains with the specified mention.

Parameters:
  • mention (Mention) – A mention.
  • relax (bool) – If True, return coreferent mentions as well.
Returns:

A set of mentions.

Return type:

Set[Mention]
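
One way to picture sibling lookup is as a search for other mentions that share at least one entity with the given mention. The sketch below is an assumption about the idea, not the get_siblings implementation: mentions are plain strings, the function name siblings_of is hypothetical, the mention それ and all entity IDs are made up, and the relax option is not modelled.

```python
from typing import Dict, Set

# Hypothetical data: mention -> IDs of the entities it refers to.
mention2eids: Dict[str, Set[int]] = {
    'コイン': {0},
    'それ': {0},
    'トス': {1},
}


def siblings_of(mention: str, mention2eids: Dict[str, Set[int]]) -> Set[str]:
    """Return the other mentions that share at least one entity with `mention`."""
    eids = mention2eids[mention]
    return {other for other, other_eids in mention2eids.items()
            if other != mention and eids & other_eids}


print(siblings_of('コイン', mention2eids))
print(siblings_of('トス', mention2eids))
```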

mrph2dmid

A mapping from morpheme to its document-wide ID.

mrph_list() → List[pyknp.juman.morpheme.Morpheme][source]

Return a list of pyknp Morpheme objects.

pas_list() → List[kyoto_reader.pas.Pas][source]

Return list of predicate-argument structures.

sentences

List of sentences in this document.

Returns:List[Sentence]
stat() → dict[source]

Calculate various kinds of statistics of this document.

surf

A surface expression of this document.

tag_list() → List[pyknp.knp.tag.Tag][source]

Return a list of pyknp Tag objects.

kyoto_reader.ne module
class kyoto_reader.ne.NamedEntity(category: str, name: str, sentence: kyoto_reader.sentence.Sentence, mid_range: range, mrph2dmid: Dict[pyknp.juman.morpheme.Morpheme, int])[source]

Bases: object

A class to represent a named entity (NE).

Parameters:
  • category (str) – A category of a NE.
  • name (str) – A name of a NE.
  • sentence (Sentence) – A sentence that contains a NE.
  • mid_range (range) – A range of IDs of morphemes that constitute a NE.
  • mrph2dmid (dict) – A mapping from morpheme to its document-wide ID.
category

A category of a NE.

Type:str
name

A name of a NE.

Type:str
sid

A sentence ID of a sentence that contains a NE.

Type:str
mid_range

A range of IDs of morphemes that constitute a NE.

Type:range
dmid_range

A range of document-wide IDs of morphemes that constitute a NE.

Type:range
__init__(category: str, name: str, sentence: kyoto_reader.sentence.Sentence, mid_range: range, mrph2dmid: Dict[pyknp.juman.morpheme.Morpheme, int])[source]

Initialize self. See help(type(self)) for accurate signature.

kyoto_reader.pas module
class kyoto_reader.pas.Argument(bp: kyoto_reader.base_phrase.BasePhrase, dep_type: str, mode: str)[source]

Bases: kyoto_reader.base_phrase.BasePhrase, kyoto_reader.pas.BaseArgument

An object representing an argument that appears in the text (i.e., not an exophor).

Parameters:
  • bp (BasePhrase) – The base phrase.
  • dep_type (str) – The dependency type (“overt”, “dep”, “intra”, “inter”, “exo”).
  • mode (str) – The mode.
__init__(bp: kyoto_reader.base_phrase.BasePhrase, dep_type: str, mode: str) → None[source]

Parameters:
  • bp (BasePhrase) – The base phrase.
  • dep_type (str) – The dependency type (“overt”, “dep”, “intra”, “inter”, “exo”).
  • mode (str) – The mode.

class kyoto_reader.pas.BaseArgument(dep_type: str, mode: str)[source]

Bases: object

A base class for all kinds of arguments

__init__(dep_type: str, mode: str)[source]

Initialize self. See help(type(self)) for accurate signature.

is_special
class kyoto_reader.pas.Pas(pred_bp: kyoto_reader.base_phrase.BasePhrase)[source]

Bases: object

A class to represent a predicate-argument structure (PAS).

Parameters:pred_bp (BasePhrase) – The base phrase that serves as the predicate.
predicate

The predicate.

Type:Predicate
arguments

The cases and their arguments.

Type:Dict[str, List[BaseArgument]]
__init__(pred_bp: kyoto_reader.base_phrase.BasePhrase)[source]

Initialize self. See help(type(self)) for accurate signature.

add_argument(case: str, bp: kyoto_reader.base_phrase.BasePhrase, mode: str)[source]
add_special_argument(case: str, exophor: str, eid: int, mode: str) → None[source]
dmid

The document-wide morpheme ID of the content-word morpheme in the predicate.

dtid

A document-wide tag ID.

set_arguments_optional(case: str) → None[source]
sid

A sentence ID

class kyoto_reader.pas.SpecialArgument(exophor: str, eid: int, mode: str)[source]

Bases: kyoto_reader.pas.BaseArgument

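An object representing an argument that refers to an exophor (an entity outside the text).

Parameters:
  • exophor (str) – The exophor (e.g., 不特定:人).
  • eid (int) – The entity ID of the exophor.
  • mode (str) – The mode.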
__init__(exophor: str, eid: int, mode: str)[source]

Initialize self. See help(type(self)) for accurate signature.

kyoto_reader.reader module
class kyoto_reader.reader.ArchiveHandler(path: pathlib.Path)[source]

Bases: object

__init__(path: pathlib.Path) → None[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod is_supported_path(path: pathlib.Path) → bool[source]
open() → Union[tarfile.TarFile, zipfile.ZipFile][source]
open_member(archive: Union[tarfile.TarFile, zipfile.ZipFile], member: str) → BinaryIO[source]
class kyoto_reader.reader.ArchiveType[source]

Bases: enum.Enum

Enum for file collection types.

TAR_GZ = '.tar.gz'
ZIP = '.zip'
class kyoto_reader.reader.FileHandler(path: pathlib.Path)[source]

Bases: object

__init__(path: pathlib.Path) → None[source]

Initialize self. See help(type(self)) for accurate signature.

content_basename
open(*args, **kwargs) → TextIO[source]
class kyoto_reader.reader.FileType[source]

Bases: enum.Enum

Enum for file types.

GZ = '.gz'
UNCOMPRESSED = ''
class kyoto_reader.reader.KyotoReader(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True)[source]

Bases: object

A class to manage a set of corpus documents. Compressed file is supported. However, nested compression (e.g. .knp.gz in zip file) is not supported.

Parameters:
  • source (Union[Path, str]) – Path to the target documents. If a directory is given, all files in it are targeted.
  • target_cases (Optional[Collection[str]]) – Cases to extract. (default: all cases)
  • target_corefs (Optional[Collection[str]]) – Coreference relations to extract (e.g., =). (default: all relations)
  • extract_nes (bool) – Whether to extract named entities from the corpus. (default: True)
  • relax_cases (bool) – Whether to treat cases such as ガ≒ as ガ. (default: False)
  • knp_ext (str) – Extension of KWDLC or KC files. (default: .knp)
  • pickle_ext (str) – Extension used when reading a Document in pickle format. (default: .pkl)
  • use_pas_tag (bool) – Whether to read PAS from <述語項構造:> tags rather than <rel> tags. (default: False)
  • n_jobs (int) – Number of parallel processes used to load documents. 0: no parallel processing; -1: the number of cores. (default: -1)
  • did_from_sid (bool) – Whether to determine the document ID from the S-ID in the document. (default: True)

Note

Supported input paths (i.e., the source argument):
  • A single file (.knp, .knp.gz, .pkl, .pkl.gz)
  • A directory containing single files
  • An archive file (.tar.gz, .zip) containing single uncompressed files
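
How a source path might be classified by its suffix can be sketched as follows. The two enums are re-declared here from the values listed for ArchiveType and FileType so that the sketch is self-contained. The function classify is hypothetical and is not how KyotoReader necessarily dispatches; directories are left out because detecting them requires touching the file system.

```python
from enum import Enum
from pathlib import Path


class ArchiveType(Enum):
    TAR_GZ = '.tar.gz'
    ZIP = '.zip'


class FileType(Enum):
    GZ = '.gz'
    UNCOMPRESSED = ''


def classify(path: Path) -> str:
    """Hypothetical helper: classify a path by its file-name suffix only."""
    name = path.name
    # Check archives first, because '.tar.gz' also ends with '.gz'.
    if any(name.endswith(archive_type.value) for archive_type in ArchiveType):
        return 'archive'
    if name.endswith(FileType.GZ.value):
        return 'gzip file'
    return 'uncompressed file'


for name in ('corpus.tar.gz', 'corpus.zip', 'w201106-0000060050.knp.gz', 'w201106-0000060050.knp'):
    print(name, '->', classify(Path(name)))
```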

__init__(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True) → None[source]

Initialize self. See help(type(self)) for accurate signature.

get_knp(did: str) → str[source]
process_all_documents(n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]][source]

Process all documents that KyotoReader has loaded.

Parameters:n_jobs (int) – The number of processes spawned to finish this task. (default: inherit from self)
process_document(doc_id: str, archive: Union[tarfile.TarFile, zipfile.ZipFile, None] = None) → Optional[kyoto_reader.document.Document][source]

Process one document following the given document ID.

Parameters:
  • doc_id (str) – An ID of a document to process.
  • archive (Optional[ArchiveFile]) – An archive to read the document from.
process_documents(doc_ids: Iterable[str], n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]][source]

Process multiple documents following the given document IDs.

Parameters:
  • doc_ids (List[str]) – IDs of documents to process.
  • n_jobs (int) – The number of processes spawned to finish this task. (default: inherit from self)
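
The documented n_jobs conventions (None inherits the reader's setting, 0 disables parallel processing, -1 uses the number of cores) can be summarised in a small helper. The function resolve_n_jobs is hypothetical and only illustrates those conventions; it is not part of kyoto-reader.

```python
import os
from typing import Optional


def resolve_n_jobs(n_jobs: Optional[int], default: int) -> int:
    """Hypothetical helper: map the documented n_jobs values to a worker count."""
    if n_jobs is None:
        n_jobs = default  # "inherit from self"
    if n_jobs == -1:
        return os.cpu_count() or 1  # -1 means "number of cores"
    return n_jobs  # 0 means "no parallel processing"


print(resolve_n_jobs(None, default=0))
print(resolve_n_jobs(4, default=-1))
```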
kyoto_reader.sentence module
class kyoto_reader.sentence.Sentence(knp_string: str, dtid_offset: int, dmid_offset: int, doc_id: str)[source]

Bases: object

A class to represent a single sentence.

blist

BList object of pyknp.

Type:BList
doc_id

The document ID of this sentence.

Type:str
bps

Base phrases in this sentence.

Type:List[BasePhrase]
__init__(knp_string: str, dtid_offset: int, dmid_offset: int, doc_id: str) → None[source]
Parameters:
  • knp_string (str) – KNP format string of this sentence.
  • dtid_offset (int) – The document-wide tag ID of the previous base phrase.
  • dmid_offset (int) – The document-wide morpheme ID of the previous morpheme.
  • doc_id (str) – The document ID of this sentence.
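
Document-wide IDs can be thought of as sentence-local indices shifted by a running offset. The sketch below illustrates this for morphemes, assuming 0-based IDs and an offset equal to the number of morphemes in the preceding sentences; the exact convention used by kyoto-reader may differ. The function name assign_dmids and the second sentence are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_dmids(sentences: List[List[str]]) -> Dict[Tuple[int, int], int]:
    """Map (sentence index, local morpheme index) to a document-wide morpheme ID."""
    dmids: Dict[Tuple[int, int], int] = {}
    offset = 0  # assumption: number of morphemes seen in earlier sentences
    for sent_idx, morphemes in enumerate(sentences):
        for local_idx in range(len(morphemes)):
            dmids[(sent_idx, local_idx)] = offset + local_idx
        offset += len(morphemes)
    return dmids


sentences = [
    ['コイン', 'トス', 'を', '3', '回', '行う', '。'],  # morphemes of the KWDLC example sentence
    ['表', 'が', '出た', '。'],                          # hypothetical second sentence
]
dmids = assign_dmids(sentences)
print(dmids[(0, 6)], dmids[(1, 0)])
```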
bnst_list()[source]

Return a list of pyknp Bunsetsu objects.

dtids

A document-wide tag ID.

mrph2dmid

A mapping from morpheme to its document-wide ID.

mrph_list()[source]

Return a list of pyknp Morpheme objects.

sid

A sentence ID.

surf

A surface expression

tag_list()[source]

Return a list of pyknp Tag objects.

Module contents

Author/Contact

Kurohashi-Chu-Murawaki Lab, Kyoto University (contact at nlp.ist.i.kyoto-u.ac.jp)

  • Nobuhiro Ueda
