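kyoto-reader: A processor for KWDLC, KyotoCorpus, and AnnotatedFKCCorpus
About
kyoto-reader parses the corpora published by Kyoto University that are annotated with predicate-argument structures and coreference relations, and provides an interface for working with them from Python. Because the tool is a wrapper around pyknp, morphological information and dependency relations are also accessible.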
Name | Domain | Size |
---|---|---|
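Kyoto University Web Document Leads Corpus (KWDLC) | Web text | 16,038 sentences |
Kyoto University Text Corpus (KyotoCorpus) | Newspaper articles and editorials | 15,872 sentences |
Annotated Corpus of the Complaint Survey Dataset (AnnotatedFKCCorpus) | Posts about complaints | 1,282 sentences |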
Requirements
- Python
  - Verified Versions: 3.7, 3.8, 3.9, 3.10
- pyknp 0.4.6+
- KNP (optional)
- JumanDIC (optional)
Install kyoto-reader
$ pip install kyoto-reader
or
$ git clone https://github.com/ku-nlp/kyoto-reader
$ cd kyoto-reader
$ python setup.py install [--prefix=path]
A Brief Explanation of KWDLC and other corpora
In these corpora, predicate-argument structure and coreference annotations are given by <rel> tags. An example from KWDLC:
# S-ID:w201106-0000060050-1 JUMAN:6.1-20101108 KNP:3.1-20101107 DATE:2011/06/21 SCORE:-44.94406 MOD:2017/10/15 MEMO:
* 2D
+ 1D
コイン こいん コイン 名詞 6 普通名詞 1 * 0 * 0
+ 3D <rel type="ガ" target="不特定:人"/><rel type="ヲ" target="コイン" sid="w201106-0000060050-1" id="0"/>
トス とす トス 名詞 6 サ変名詞 2 * 0 * 0
を を を 助詞 9 格助詞 1 * 0 * 0
* 2D
+ 3D
3 さん 3 名詞 6 数詞 7 * 0 * 0
回 かい 回 接尾辞 14 名詞性名詞助数辞 3 * 0 * 0
* -1D
+ -1D <rel type="ガ" target="不特定:人"/><rel type="ガ" mode="?" target="読者"/><rel type="ガ" mode="?" target="著者"/><rel type="ヲ" target="トス" sid="w201106-0000060050-1" id="1"/>
行う おこなう 行う 動詞 2 * 0 子音動詞ワ行 12 基本形 2
。 。 。 特殊 1 句点 1 * 0 * 0
EOS
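As a rough illustration of how the <rel> tags in the example above can be read, the following sketch pulls the attributes out of one base-phrase line with regular expressions. It uses only the standard library; parse_rels is a hypothetical helper written for this illustration and is not part of kyoto-reader, which relies on pyknp for real parsing.

```python
import re

# Matches one <rel .../> tag and captures its attribute string.
REL_TAG = re.compile(r'<rel ([^>]*?)/>')
# Matches key="value" pairs inside a tag.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_rels(line: str) -> list:
    """Return one attribute dict per <rel> tag found in a KNP line (hypothetical helper)."""
    return [dict(ATTR.findall(attrs)) for attrs in REL_TAG.findall(line)]


# The base-phrase line for "トスを" from the example above.
line = ('+ 3D <rel type="ガ" target="不特定:人"/>'
        '<rel type="ヲ" target="コイン" sid="w201106-0000060050-1" id="0"/>')
rels = parse_rels(line)
for rel in rels:
    # sid and id are absent for exophoric targets such as 不特定:人.
    print(rel['type'], rel['target'], rel.get('sid'), rel.get('id'))
```

In this line, the ガ argument has no sid or id because it points outside the text, while the ヲ argument points at base phrase 0 of the same sentence.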
Usage
To load the file w201106-0000060050.knp, which contains the example data above:
from kyoto_reader import KyotoReader, Document

# Object that manages a collection of documents
reader = KyotoReader('w201106-0000060050.knp',  # path to a file or a directory
                     target_cases=['ガ', 'ヲ', 'ニ'],  # extract only the ガ, ヲ, and ニ cases
                     target_corefs=['=', '=構', '=≒', '=構≒'],  # relations treated as coreference
                     extract_nes=True  # also extract named entities from the corpus
                     )
print('Loaded documents:')
for doc_id in reader.doc_ids:
    print(f'  Document ID: {doc_id}')
print('\n--- Predicate-argument structures ---')
document: Document = reader.process_document('w201106-0000060050')
for predicate in document.get_predicates():
    print(f'Predicate: {predicate.core}')
    for case, arguments in document.get_arguments(predicate).items():
        print(f'  {case} case: ', end='')
        print(', '.join(str(argument) for argument in arguments))
print('\n--- Tree format ---')
document.draw_tree(sid='w201106-0000060050-1', coreference=False)
Program output
Loaded documents:
  Document ID: w201106-0000060050

--- Predicate-argument structures ---
Predicate: トス
  ガ case: 不特定:人
  ヲ case: コイン
  ニ case:
Predicate: 行う
  ガ case: 不特定:人, 読者, 著者
  ヲ case: トス
  ニ case:

--- Tree format ---
コイン┐
トスを─┐ ガ:不特定:人 ヲ:コイン
3回┤
行う。 ガ:読者,不特定:人,著者 ヲ:トス
CLI Interfaces
The kyoto command can display the contents of a corpus and process it.
Browsing files
kyoto show
: Display the contents of a KNP file in tree format (if a directory is given, every file it contains is displayed)
$ kyoto show /path/to/knp/file.knp
kyoto list
: List the document IDs contained in the specified directory
$ kyoto list /path/to/knp/directory
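Processing Corpus
Parse the corpus and add extra features (requires KNP and JumanDIC).
kyoto configure
: Generate a Makefile for adding features to the corpus (in the example below it is created in the directory passed as --data-dir). Running make splits the corpus into one file per document and writes the feature-annotated files to the knp/ directory.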
$ kyoto configure --corpus-dir /path/to/downloaded/knp/directory --data-dir /path/to/output/directory --juman-dic-dir /path/to/JumanDIC/directory
created Makefile at /path/to/output/directory
$ cd /path/to/output/directory
$ make -j <num-parallel>
kyoto idsplit
: Split the corpus into train/dev/test files
$ kyoto idsplit --corpus-dir /path/to/knp/dir --output-dir /path/to/output/dir --train /path/to/train/id/file --dev /path/to/dev/id/file --test /path/to/test/id/file
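The idea behind an ID-based split can be sketched in a few lines of standard-library Python. This is not the implementation of kyoto idsplit: split_by_ids is a hypothetical helper, and the ID file format (one document ID per line), the .knp extension, and the output layout (one subdirectory per split) are assumptions made for the illustration.

```python
import shutil
import tempfile
from pathlib import Path


def split_by_ids(corpus_dir: Path, output_dir: Path, id_files: dict, ext: str = '.knp') -> dict:
    """Copy <doc_id><ext> files into output_dir/<split>/ for every ID listed (hypothetical helper)."""
    copied = {}
    for split, id_file in id_files.items():
        dest = output_dir / split
        dest.mkdir(parents=True, exist_ok=True)
        # Assumption: the ID file lists one document ID per line.
        doc_ids = [line.strip() for line in id_file.read_text().splitlines() if line.strip()]
        copied[split] = []
        for doc_id in doc_ids:
            src = corpus_dir / f'{doc_id}{ext}'
            if src.exists():  # IDs without a matching file are skipped
                shutil.copy(src, dest / src.name)
                copied[split].append(doc_id)
    return copied


# Demo on a throwaway directory with three dummy documents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    corpus = root / 'knp'
    corpus.mkdir()
    for doc_id in ('doc-a', 'doc-b', 'doc-c'):
        (corpus / f'{doc_id}.knp').write_text('EOS\n')
    ids = {}
    for split, members in {'train': ['doc-a'], 'dev': ['doc-b'], 'test': ['doc-c']}.items():
        ids[split] = root / f'{split}.id'
        ids[split].write_text('\n'.join(members) + '\n')
    result = split_by_ids(corpus, root / 'out', ids)
    train_files = sorted(p.name for p in (root / 'out' / 'train').iterdir())
    print(result, train_files)
```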
Zsh Completions
Adding <virtualenv-path>/share/zsh/site-functions to FPATH enables completion for the kyoto command (zsh only).
$ echo 'export FPATH=<virtualenv-path>/share/zsh/site-functions:$FPATH' >> ~/.zshrc
Documents
kyoto_reader package
Submodules
kyoto_reader.base_phrase module
-
class
kyoto_reader.base_phrase.
BasePhrase
(tag: pyknp.knp.tag.Tag, dmid_offset: int, dtid: int, sid: str, doc_id: str, parent: Optional[BasePhrase] = None, children: Optional[List[BasePhrase]] = None)[source]¶ Bases:
object
A class representing a base phrase that appears in a sentence.
-
tag
¶ Tag object in pyknp.
Type: Tag
-
sid
¶ Sentence ID.
Type: str
-
dtid
¶ Document-wide tag ID.
Type: int
-
content_dmid
¶ Document-wide morpheme ID of the content word in the base phrase.
Type: int
-
parent
¶ Dependency parent.
Type: Optional[BasePhrase]
-
children
¶ Dependency children.
Type: List[BasePhrase]
-
__init__
(tag: pyknp.knp.tag.Tag, dmid_offset: int, dtid: int, sid: str, doc_id: str, parent: Optional[BasePhrase] = None, children: Optional[List[BasePhrase]] = None)[source]¶ Parameters: - tag (Tag) – Tag object in pyknp.
- dmid_offset (int) – Document-wide morpheme ID of the previous morpheme.
- dtid (int) – Document-wide tag ID.
- sid (str) – Sentence ID.
- doc_id (str) – Document ID.
- parent (Optional[BasePhrase]) – Dependency parent.
- children (List[BasePhrase]) – Dependency children.
-
core
¶ A core expression without ancillary words.
-
dmid
¶ Document-wide morpheme ID.
-
dmids
¶ A list of document-wide morpheme IDs.
-
mrph2dmid
¶ A mapping from morpheme to its document-wide ID.
-
mrphs
¶ A list of morphemes.
-
surf
¶ A surface expression.
-
tid
¶ Tag ID in pyknp.
-
kyoto_reader.cli module
-
kyoto_reader.cli.
configure
(args: argparse.Namespace)[source]¶ Create Makefile to preprocess corpus documents.
-
kyoto_reader.cli.
idsplit
(args: argparse.Namespace)[source]¶ Copy files in a corpus to train, valid (dev), and test directories according to ID files.
kyoto_reader.constants module
kyoto_reader.coreference module
-
class
kyoto_reader.coreference.
Entity
(eid: int, exophor: Optional[str] = None)[source]¶ Bases:
object
A class to represent an entity in coreference. This class manages entity IDs of mentions that refer to this entity.
Parameters: - eid (int) – An Entity ID.
- exophor (str, optional) – The kind of exophor if this entity corresponds to some exophor. Otherwise, None.
-
eid
¶ An Entity ID.
Type: int
-
exophor
¶ A string to represent exophor, such as “著者”, “読者”, and “不特定:人”.
Type: str, optional
-
taigen
¶ Whether this entity is 体言 or not.
Type: bool, optional
-
yougen
¶ Whether this entity is 用言 or not.
Type: bool, optional
-
__init__
(eid: int, exophor: Optional[str] = None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
add_mention
(mention: kyoto_reader.coreference.Mention, uncertain: bool) → None[source]¶ Add a mention that refers to this entity.
When a non-uncertain mention is added and the mention has already been registered as an uncertain mention, it will be overwritten as non-uncertain.
Parameters: - mention (Mention) – A mention
- uncertain (bool) – Whether the mention is uncertain (i.e., annotated with “≒”).
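The overwrite rule described for add_mention can be sketched with a toy class. ToyEntity below is a hypothetical stand-in that tracks mentions as plain strings; it is not the library's Entity, and the choice to ignore an uncertain registration for a mention that is already certain is an assumption of this sketch.

```python
class ToyEntity:
    """Toy stand-in for an entity that tracks certain and uncertain mentions."""

    def __init__(self, eid: int) -> None:
        self.eid = eid
        self.mentions = set()      # mentions linked with certainty
        self.mentions_unc = set()  # mentions linked with "≒" (uncertain)

    def add_mention(self, mention: str, uncertain: bool) -> None:
        if uncertain:
            # Assumption of this sketch: never downgrade an already certain mention.
            if mention not in self.mentions:
                self.mentions_unc.add(mention)
        else:
            # A certain registration replaces an earlier uncertain one.
            self.mentions_unc.discard(mention)
            self.mentions.add(mention)

    @property
    def all_mentions(self) -> set:
        """All mentions, including uncertain ones."""
        return self.mentions | self.mentions_unc


entity = ToyEntity(eid=0)
entity.add_mention('コイン', uncertain=True)
entity.add_mention('トス', uncertain=True)
entity.add_mention('コイン', uncertain=False)  # overwrites the uncertain registration
print(entity.mentions, entity.mentions_unc)
```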
-
all_mentions
¶ All mentions that refer to this entity, including uncertain ones.
-
is_special
¶ Whether this entity corresponds to a special entity, such as an exophor.
-
class
kyoto_reader.coreference.
Mention
(bp: kyoto_reader.base_phrase.BasePhrase)[source]¶ Bases:
kyoto_reader.base_phrase.BasePhrase
A class to represent a mention in coreference.
Parameters: bp (BasePhrase) – A base phrase object that corresponds to this mention. -
eids
¶ Entity IDs.
Type: set
-
eids_unc
¶ Uncertain entity IDs. “Uncertain” means the mention is annotated with “≒”.
Type: set
-
__init__
(bp: kyoto_reader.base_phrase.BasePhrase)[source]¶ Parameters: bp (BasePhrase) – A base phrase object that corresponds to this mention.
-
all_eids
¶ All entity IDs this mention refers to.
-
kyoto_reader.document module
-
class
kyoto_reader.document.
Document
(knp_string: str, doc_id: str, cases: Collection[str], corefs: Collection[str], relax_cases: bool, extract_nes: bool, use_pas_tag: bool)[source]¶ Bases:
object
A class to represent a document of KWDLC, KyotoCorpus, or AnnotatedFKCCorpus.
Parameters: - knp_string (str) – KNP format string of the document.
- doc_id (str) – A document ID.
- cases (Collection[str]) – Cases to extract.
- corefs (Collection[str]) – Coreference relations to extract.
- relax_cases (bool) – Whether to consider relations with “≒” as those without “≒” (e.g. ガ≒格 -> ガ格).
- extract_nes (bool) – Whether to extract named entities.
- use_pas_tag (bool) – Whether to read predicate-argument structures from <述語項構造: > tags, not <rel> tags.
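To make the effect of relax_cases concrete, here is a small sketch of case filtering with and without relaxation. The functions and the annotation list are hypothetical and only illustrate the ガ≒ -> ガ mapping mentioned above; they do not reproduce how Document processes its input.

```python
def relax_case(case: str) -> str:
    """Drop the uncertainty marker "≒" from a case label, e.g. ガ≒ -> ガ."""
    return case.replace('≒', '')


def filter_rels(rels, target_cases, relax_cases):
    """Keep (case, target) pairs whose case is in target_cases (hypothetical helper)."""
    kept = []
    for case, target in rels:
        label = relax_case(case) if relax_cases else case
        if label in target_cases:
            kept.append((label, target))
    return kept


# Made-up annotations for illustration only.
rels = [('ガ', '不特定:人'), ('ガ≒', '読者'), ('ヲ', 'トス'), ('ニ', '机')]
strict = filter_rels(rels, {'ガ', 'ヲ'}, relax_cases=False)
relaxed = filter_rels(rels, {'ガ', 'ヲ'}, relax_cases=True)
print(strict)
print(relaxed)
```

With relaxation off, the ガ≒ relation is dropped because it does not match any target case; with relaxation on, it is kept and reported as ガ.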
-
knp_string
¶ KNP format string of the document.
Type: str
-
doc_id
¶ A document ID.
Type: str
-
cases
¶ Cases to extract.
Type: Collection[str]
-
corefs
¶ Coreference relations to extract.
Type: Collection[str]
-
extract_nes
¶ Whether to extract named entities.
Type: bool
-
mentions
¶ A mapping from a document-wide tag ID to the corresponding mention.
Type: Dict[int, Mention]
-
named_entities
¶ Extracted named entities.
Type: List[NamedEntity]
-
__init__
(knp_string: str, doc_id: str, cases: Collection[str], corefs: Collection[str], relax_cases: bool, extract_nes: bool, use_pas_tag: bool) → None[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
draw_tree
(sid: Optional[str] = None, coreference: bool = True, fh: Optional[TextIO] = None) → None[source]¶ Write out the PAS and coreference relations in the specified sentence in a tree format.
If sid is not specified, write out trees in all the sentences in this document.
Parameters: - sid (str, optional) – A sentence ID of the target sentence.
- coreference (bool) – If True, write out coreference relations as well.
- fh (TextIO, optional) – The output stream.
-
get_arguments
(predicate: kyoto_reader.base_phrase.BasePhrase, relax: bool = False, include_optional: bool = False) → Dict[str, List[kyoto_reader.pas.BaseArgument]][source]¶ Return all the arguments that the given predicate has.
Parameters: - predicate (Predicate) – A predicate.
- relax (bool) – If True, return arguments that have a coreference relation with the arguments the predicate has.
- include_optional (bool) – If True, return adverbial arguments such as “すぐに” as well.
Returns: A mapping from a case to arguments.
Return type: Dict[str, List[BaseArgument]]
-
get_entities
(bp: kyoto_reader.base_phrase.BasePhrase, include_uncertain: bool = False) → List[kyoto_reader.coreference.Entity][source]¶ Return a list of entities that the specified mention refers to. The mention is given as a BasePhrase.
Parameters: - bp (BasePhrase) – A base phrase that corresponds to the mention.
- include_uncertain (bool) – Whether to return entities that have an uncertain relation with the mention.
-
get_siblings
(mention: kyoto_reader.coreference.Mention, relax: bool = False) → Set[kyoto_reader.coreference.Mention][source]¶ Return all the mentions that have coreference chains with the specified mention.
Parameters: - mention (Mention) – A mention.
- relax (bool) – If True, return coreferent mentions as well.
Returns: A set of mentions.
Return type: Set[Mention]
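One way to picture sibling lookup is as a search for mentions that share an entity ID. The sketch below uses made-up mentions and plain dictionaries rather than Mention objects, and it assumes that relax means also following uncertain ("≒") links, which the description above does not confirm.

```python
from typing import Dict, Set

# Made-up data: mention -> entity IDs linked with certainty / with uncertainty ("≒").
EIDS: Dict[str, Set[int]] = {'太郎': {0}, '彼': {0}, '学生': set(), '花子': {1}}
EIDS_UNC: Dict[str, Set[int]] = {'太郎': set(), '彼': set(), '学生': {0}, '花子': set()}


def get_siblings(mention: str, relax: bool = False) -> Set[str]:
    """Return the other mentions that share at least one entity with `mention`."""
    def entity_ids(name: str) -> Set[int]:
        # Assumption: relax=True also follows uncertain links.
        return (EIDS[name] | EIDS_UNC[name]) if relax else EIDS[name]

    own = entity_ids(mention)
    return {other for other in EIDS if other != mention and entity_ids(other) & own}


print(get_siblings('太郎'))
print(get_siblings('太郎', relax=True))
```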
-
mrph2dmid
¶ A mapping from morpheme to its document-wide ID.
-
sentences
¶ List of sentences in this document.
Returns: List[Sentence]
-
surf
¶ A surface expression of this document.
kyoto_reader.ne module
-
class
kyoto_reader.ne.
NamedEntity
(category: str, name: str, sentence: kyoto_reader.sentence.Sentence, mid_range: range, mrph2dmid: Dict[pyknp.juman.morpheme.Morpheme, int])[source]¶ Bases:
object
A class to represent a named entity (NE).
Parameters: - category (str) – A category of a NE.
- name (str) – A name of a NE.
- sentence (Sentence) – A sentence that contains a NE.
- mid_range (range) – A range of IDs of morphemes that constitute a NE.
- mrph2dmid (dict) – A mapping from morpheme to its document-wide ID.
-
category
¶ A category of a NE.
Type: str
-
name
¶ A name of a NE.
Type: str
-
sid
¶ A sentence ID of a sentence that contains a NE.
Type: str
-
mid_range
¶ A range of IDs of morphemes that constitute a NE.
Type: range
-
dmid_range
¶ A range of document-wide IDs of morphemes that constitute a NE.
Type: range
kyoto_reader.pas module
-
class
kyoto_reader.pas.
Argument
(bp: kyoto_reader.base_phrase.BasePhrase, dep_type: str, mode: str)[source]¶ Bases:
kyoto_reader.base_phrase.BasePhrase
,kyoto_reader.pas.BaseArgument
An object representing an argument that appears in the text (i.e., not an exophoric one).
Parameters: - bp (BasePhrase) – Base phrase.
- dep_type (str) – Dependency type (“overt”, “dep”, “intra”, “inter”, “exo”).
- mode (str) – Mode.
-
__init__
(bp: kyoto_reader.base_phrase.BasePhrase, dep_type: str, mode: str) → None[source]¶ Parameters: - bp (BasePhrase) – Base phrase. - dep_type (str) – Dependency type (“overt”, “dep”, “intra”, “inter”, “exo”). - mode (str) – Mode.
-
class
kyoto_reader.pas.
BaseArgument
(dep_type: str, mode: str)[source]¶ Bases:
object
A base class for all kinds of arguments
-
__init__
(dep_type: str, mode: str)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
is_special
¶
-
-
class
kyoto_reader.pas.
Pas
(pred_bp: kyoto_reader.base_phrase.BasePhrase)[source]¶ Bases:
object
A class to represent a predicate-argument structure (PAS).
Parameters: pred_bp (BasePhrase) – The base phrase that serves as the predicate. -
predicate
¶ The predicate.
Type: Predicate
-
arguments
¶ Cases and their arguments.
Type: Dict[str, List[BaseArgument]]
-
__init__
(pred_bp: kyoto_reader.base_phrase.BasePhrase)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
dmid
¶ Document-wide morpheme ID of the content-word morpheme in the predicate.
-
dtid
¶ A document-wide tag ID.
-
sid
¶ A sentence ID
-
-
class
kyoto_reader.pas.
SpecialArgument
(exophor: str, eid: int, mode: str)[source]¶ Bases:
kyoto_reader.pas.BaseArgument
An object representing an argument that refers to something outside the text (an exophor).
Parameters: - exophor (str) – Exophor (e.g. 不特定:人).
- eid (int) – Entity ID of the exophor.
- mode (str) – Mode.
kyoto_reader.reader module
-
class
kyoto_reader.reader.
ArchiveHandler
(path: pathlib.Path)[source]¶ Bases:
object
-
class
kyoto_reader.reader.
ArchiveType
[source]¶ Bases:
enum.Enum
Enum for file collection types.
-
TAR_GZ
= '.tar.gz'¶
-
ZIP
= '.zip'¶
-
-
class
kyoto_reader.reader.
FileHandler
(path: pathlib.Path)[source]¶ Bases:
object
-
__init__
(path: pathlib.Path) → None[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
content_basename
¶
-
-
class
kyoto_reader.reader.
FileType
[source]¶ Bases:
enum.Enum
Enum for file types.
-
GZ
= '.gz'¶
-
UNCOMPRESSED
= ''¶
-
-
class
kyoto_reader.reader.
KyotoReader
(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True)[source]¶ Bases:
object
A class to manage a set of corpus documents. Compressed files are supported. However, nested compression (e.g. a .knp.gz file inside a zip file) is not supported.
Parameters: - source (Union[Path, str]) – Path to the target document(s). If a directory is given, all files in it are processed.
- target_cases (Optional[Collection[str]]) – Cases to extract. (default: all cases)
- target_corefs (Optional[Collection[str]]) – Coreference relations to extract (e.g. =). (default: all relations)
- extract_nes (bool) – Whether to extract named entities from the corpus. (default: True)
- relax_cases (bool) – Whether to treat cases such as ガ≒ as ガ. (default: False)
- knp_ext (str) – Extension of KWDLC or KC files. (default: .knp)
- pickle_ext (str) – Extension used when reading Documents in pickle format. (default: .pkl)
- use_pas_tag (bool) – Whether to read PAS from <述語項構造:> tags instead of <rel> tags. (default: False)
- n_jobs (int) – Number of parallel processes used to load documents. 0: no parallelism, -1: number of cores. (default: -1)
- did_from_sid (bool) – Whether to determine the document ID from the S-ID in the document. (default: True)
Note
Supported input paths (i.e. the source argument):
- A single file (.knp, .knp.gz, .pkl, .pkl.gz)
- A directory containing single files
- An archive file (.tar.gz, .zip) containing single uncompressed files
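The supported input kinds can be told apart by file suffix alone, as the following sketch shows. The two enums copy the values listed for ArchiveType and FileType in this module, but classify is a hypothetical helper and its return strings are invented for the illustration; directory handling is left out because it needs a real filesystem.

```python
from enum import Enum
from pathlib import Path


class ArchiveType(Enum):
    """Archive suffixes, mirroring the values documented for this module."""
    TAR_GZ = '.tar.gz'
    ZIP = '.zip'


class FileType(Enum):
    """Single-file compression suffixes, mirroring the documented values."""
    GZ = '.gz'
    UNCOMPRESSED = ''


def classify(path: str, knp_ext: str = '.knp', pickle_ext: str = '.pkl') -> str:
    """Classify a source path by its suffix (hypothetical helper)."""
    name = Path(path).name
    # Check archives first so that .tar.gz is not mistaken for a gzipped single file.
    for archive in ArchiveType:
        if name.endswith(archive.value):
            return f'archive ({archive.name})'
    for ext in (knp_ext, pickle_ext):
        if name.endswith(ext + FileType.GZ.value):
            return f'file ({FileType.GZ.name})'
        if name.endswith(ext):
            return f'file ({FileType.UNCOMPRESSED.name})'
    return 'unsupported'


for example in ('corpus.tar.gz', 'corpus.zip', 'w201106-0000060050.knp', 'doc.pkl.gz', 'notes.txt'):
    print(example, '->', classify(example))
```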
-
__init__
(source: Union[pathlib.Path, str], target_cases: Optional[Collection[str]] = None, target_corefs: Optional[Collection[str]] = None, extract_nes: bool = True, relax_cases: bool = False, use_pas_tag: bool = False, knp_ext: str = '.knp', pickle_ext: str = '.pkl', n_jobs: int = -1, did_from_sid: bool = True) → None[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
process_all_documents
(n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]][source]¶ Process all documents that KyotoReader has loaded.
Parameters: n_jobs (int) – The number of processes spawned to finish this task. (default: inherit from self)
-
process_document
(doc_id: str, archive: Union[tarfile.TarFile, zipfile.ZipFile, None] = None) → Optional[kyoto_reader.document.Document][source]¶ Process one document following the given document ID.
Parameters: - doc_id (str) – An ID of a document to process.
- archive (Optional[ArchiveFile]) – An archive to read the document from.
-
process_documents
(doc_ids: Iterable[str], n_jobs: Optional[int] = None) → List[Optional[kyoto_reader.document.Document]][source]¶ Process multiple documents following the given document IDs.
Parameters: - doc_ids (List[str]) – IDs of documents to process.
- n_jobs (int) – The number of processes spawned to finish this task. (default: inherit from self)
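The n_jobs convention (0 for no parallelism, -1 for one worker per core, None to inherit the reader-level setting) could be resolved along the following lines. resolve_n_jobs is a hypothetical helper, and rejecting other negative values is an assumption of this sketch rather than documented behavior.

```python
import os
from typing import Optional


def resolve_n_jobs(n_jobs: Optional[int], default: int = -1) -> int:
    """Return the number of worker processes to use; 0 means run serially."""
    if n_jobs is None:  # inherit the reader-level setting
        n_jobs = default
    if n_jobs == -1:    # one worker per available core
        return os.cpu_count() or 1
    if n_jobs < 0:      # assumption: other negative values are invalid
        raise ValueError(f'invalid n_jobs: {n_jobs}')
    return n_jobs


print(resolve_n_jobs(0), resolve_n_jobs(4), resolve_n_jobs(None, default=2))
```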
kyoto_reader.sentence module
-
class
kyoto_reader.sentence.
Sentence
(knp_string: str, dtid_offset: int, dmid_offset: int, doc_id: str)[source]¶ Bases:
object
A class to represent a single sentence.
-
blist
¶ BList object of pyknp.
Type: BList
-
doc_id
¶ The document ID of this sentence.
Type: str
-
bps
¶ Base phrases in this sentence.
Type: List[BasePhrase]
-
__init__
(knp_string: str, dtid_offset: int, dmid_offset: int, doc_id: str) → None[source]¶ Parameters: - knp_string (str) – KNP format string of this sentence.
- dtid_offset (int) – The document-wide tag ID of the previous base phrase.
- dmid_offset (int) – The document-wide morpheme ID of the previous morpheme.
- doc_id (str) – The document ID of this sentence.
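The two offsets exist so that tag and morpheme IDs can be made unique across a whole document. The sketch below shows one possible numbering scheme, assuming zero-based IDs and offsets equal to the number of base phrases and morphemes in the preceding sentences; the library's exact convention is not spelled out here, and assign_document_wide_ids is a hypothetical helper.

```python
from typing import List, Tuple


def assign_document_wide_ids(sentences: List[List[List[str]]]) -> List[Tuple[int, int, str]]:
    """Return (dtid, dmid, surface) for every morpheme in the document.

    Each sentence is a list of base phrases; each base phrase is a list of morpheme surfaces.
    """
    rows = []
    dtid_offset = 0  # base phrases seen in earlier sentences
    dmid_offset = 0  # morphemes seen in earlier sentences
    for base_phrases in sentences:
        local_mid = 0
        for tid, morphemes in enumerate(base_phrases):
            for surf in morphemes:
                rows.append((dtid_offset + tid, dmid_offset + local_mid, surf))
                local_mid += 1
        dtid_offset += len(base_phrases)
        dmid_offset += local_mid
    return rows


document = [
    [['コイン'], ['トス', 'を'], ['3', '回'], ['行う', '。']],  # the KWDLC example sentence
    [['表', 'が'], ['出た', '。']],                            # made-up second sentence
]
rows = assign_document_wide_ids(document)
for row in rows:
    print(row)
```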
-
dtids
¶ Document-wide tag IDs.
-
mrph2dmid
¶ A mapping from morpheme to its document-wide ID.
-
sid
¶ A sentence ID.
-
surf
¶ A surface expression
-