The CODA Corpus (Version 1.0)
The CODA corpus is provided under a Creative Commons Attribution-Non-Commercial-Share Alike 2.0 UK: England & Wales Licence. When referring to the corpus, please cite:
Stoyanchev, Svetlana and Piwek, Paul (2010). Constructing the CODA corpus: A parallel corpus of monologues and expository dialogues. In: The seventh international conference on Language Resources and Evaluation (LREC), 18 - 21 May 2010, Malta.
The creation of the CODA corpus was supported by the UK's Engineering and Physical Sciences Research Council under grant EP/G/020981/1.
The corpus is provided "as is", and no guarantee or warranty is given that the corpus is fit for any particular purpose. Users use the corpus at their own risk and liability.
The CODA corpus (version 1.0) is based on the following resources:
- "What is man? and other essays" by Mark Twain. eBook available through Project Gutenberg. The Project Gutenberg license states: "This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org."
- "Three Dialogues Between Hylas and Philonous" by George Berkeley. eBook available through Project Gutenberg. The Project Gutenberg license states: "This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org."
Content of this folder:
CODA_AnnotationManual_v1.1.pdf (description of the CODA corpus annotation process and file formats)
SINGLE_FILE_ALIGNED/
- AlignedMonologueDialogue_CODA_RELEASE_1.0.xml contains the mapping for all data in this release. See the File Formats section of CODA_AnnotationManual_v1.1.pdf for a description of the aligned monologue-dialogue format.
- AlignedMonologueDialogue_CODA_RELEASE_1.0.txt contains the mapping for all data in this release in plain-text format.
BY_SECTION_ALL_VIEWS/
The data set is split into sections which were annotated separately. The release contains a subdirectory NAME/ for each annotated section. Each subdirectory contains:
- an annotated dialogue file: NAME/NAME.Dialogue.xml
- a monologue translation: NAME/NAME.Monologue.xml
- an RST-parsed monologue: NAME/NAME.RSTParsedMonologue.rs3
- a mapping between the dialogue sequence and the RST monologue structure (derived automatically from the above files): NAME/NAME.AlignedMonologueDialogue.xml
(all formats are described in CODA_AnnotationManual_v1.1.pdf)
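The per-section layout described above can be turned into a small path helper. The following is a minimal sketch in Python, not part of the release: the function names `section_files` and `missing_files` and the root directory passed in are hypothetical, and only the directory and file-name patterns come from this README. It does not parse the files, since their formats are defined in the annotation manual.

```python
from pathlib import Path

# File-name suffixes listed in this README for each section directory NAME/.
SECTION_SUFFIXES = {
    "dialogue": ".Dialogue.xml",
    "monologue": ".Monologue.xml",
    "rst": ".RSTParsedMonologue.rs3",
    "aligned": ".AlignedMonologueDialogue.xml",
}


def section_files(root, name):
    """Return the four expected paths for section `name` (hypothetical helper)."""
    section_dir = Path(root) / "BY_SECTION_ALL_VIEWS" / name
    return {
        kind: section_dir / (name + suffix)
        for kind, suffix in SECTION_SUFFIXES.items()
    }


def missing_files(root, name):
    """List the kinds of files that are not present on disk for a section."""
    return [
        kind
        for kind, path in section_files(root, name).items()
        if not path.is_file()
    ]
```

For example, `section_files("corpus", "Twain-part1_1")["dialogue"]` points at `corpus/BY_SECTION_ALL_VIEWS/Twain-part1_1/Twain-part1_1.Dialogue.xml`, where `corpus` stands for wherever the release was unpacked.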
Detailed content
Mark Twain, "What is Man?": Total number of turns 520
- Twain-part1_1 (51 turns)
- Twain-part1_2 (57)
- Twain-part2-SecuringAproval-1 (47)
- Twain-part2-SecuringAproval-2 (50)
- Twain-part4-admonition (28)
- Twain-part4-admonition2 (39)
- Twain-part5-more-about-machine (74)
- Twain-part6-DifficultQuestion (38)
- Twain-part6-FreeWill (51)
- Twain-part6-InstinctAndThought (85)
Berkeley, "Three Dialogues between Hylas and Philonous, in Opposition to Sceptics and Atheists": Total number of turns 172
- Berkeley_Phil_Hyl_dialog1_part1 (41 turns)
- Berkeley_Phil_Hyl_dialog1_part2 (37)
- Berkeley_Phil_Hyl_dialog1_part3 (49)
- Berkeley_Phil_Hyl_dialog1_part4 (45)
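The per-section turn counts listed above can be checked against the stated totals. This is a small illustrative sketch in Python: the dictionary and function names are hypothetical, and the numbers are copied from the listing in this README rather than computed from the corpus files.

```python
# Turn counts per section, copied from the "Detailed content" listing.
TWAIN_TURNS = {
    "Twain-part1_1": 51,
    "Twain-part1_2": 57,
    "Twain-part2-SecuringAproval-1": 47,
    "Twain-part2-SecuringAproval-2": 50,
    "Twain-part4-admonition": 28,
    "Twain-part4-admonition2": 39,
    "Twain-part5-more-about-machine": 74,
    "Twain-part6-DifficultQuestion": 38,
    "Twain-part6-FreeWill": 51,
    "Twain-part6-InstinctAndThought": 85,
}

BERKELEY_TURNS = {
    "Berkeley_Phil_Hyl_dialog1_part1": 41,
    "Berkeley_Phil_Hyl_dialog1_part2": 37,
    "Berkeley_Phil_Hyl_dialog1_part3": 49,
    "Berkeley_Phil_Hyl_dialog1_part4": 45,
}


def total_turns(sections):
    """Sum the turn counts of a mapping from section name to turn count."""
    return sum(sections.values())


# The sums agree with the totals stated in the listing.
assert total_turns(TWAIN_TURNS) == 520
assert total_turns(BERKELEY_TURNS) == 172
```

Together the two sources account for 692 turns across 14 annotated sections.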