Moreover, even at a given level there may be different labeling schemes or even disagreement amongst annotators, such that we want to represent multiple versions. A second property of TIMIT is its balance across multiple dimensions of variation, for coverage of dialect regions and diphones. The same holds true of text corpora, in the sense that the original text usually has an external source, and is considered to be an immutable artifact. Any transformations of that artifact which involve human judgment (even something as simple as tokenization) are subject to later revision; thus it is important to retain the source material in a form that is as close to the original as possible.

Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus. Therefore, many of the computational methods described in this book are applicable. Moreover, notice that all of the data types included in the TIMIT corpus fall into the two basic categories of lexicon and text, which we will discuss below. Even the speaker demographics data is just another instance of the lexicon data type.

As we saw in 2., most lexical resources can be represented using a record structure. A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated.
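A record structure of this kind can be sketched in Python as a dictionary keyed by headword, where each record holds one or more named fields. This is only a minimal illustration: the entries, the field names (`pos`, `gloss`), and the helper functions `lookup` and `headwords_with` are invented for the example and are not taken from any particular lexicon.

```python
# A toy lexicon: each record has a key (the headword) and one or more fields.
# The entries and field names are illustrative assumptions.
lexicon = {
    "walk": {"pos": "V", "gloss": "move on foot"},
    "dog": {"pos": "N", "gloss": "domestic canine"},
    "quick": {"pos": "ADJ", "gloss": "fast"},
}


def lookup(headword, field):
    """Return one field of a record, or None if the record or field is missing."""
    record = lexicon.get(headword)
    if record is None:
        return None
    return record.get(field)


def headwords_with(field, value):
    """Return the sorted headwords whose record has the given field value."""
    return sorted(h for h, rec in lexicon.items() if rec.get(field) == value)
```

For instance, `lookup("dog", "pos")` retrieves a single field, while `headwords_with("pos", "V")` scans all records for a matching value. A comparative wordlist could use the same shape, with one field per language.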

First, the corpus contains two layers of annotation, at the phonetic and orthographic levels.

In general, a text or speech corpus may be annotated at many different linguistic levels, including morphological, syntactic, and discourse levels.
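One way to picture several annotation layers over the same signal is as parallel lists of time-aligned spans. The sketch below assumes each layer is a list of `(start, end, label)` tuples; the offsets and labels are invented for illustration rather than copied from TIMIT, and `labels_within` is a hypothetical helper.

```python
# Two annotation layers over one utterance, each a list of
# (start, end, label) spans. Offsets and labels are made-up examples.
words = [(0, 300, "she"), (300, 700, "had")]
phones = [
    (0, 150, "sh"), (150, 300, "iy"),
    (300, 420, "hv"), (420, 600, "ae"), (600, 700, "d"),
]


def labels_within(span, layer):
    """Return labels from `layer` whose spans lie entirely inside `span`."""
    start, end, _ = span
    return [label for (s, e, label) in layer if s >= start and e <= end]


# Pair each word with the phones whose spans it contains.
alignment = {w[2]: labels_within(w, phones) for w in words}
```

Keeping each layer as a separate list means that a further level, or a competing labeling scheme at the same level, can be added without modifying the layers that already exist.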

Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them.

The goal of this chapter is to answer the following questions: Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the lifecycle of a corpus.

The inclusion of speaker demographics brings in many more independent variables that may help to account for variation in the data, and that facilitate later uses of the corpus for purposes that were not envisaged when the corpus was created, such as sociolinguistics.
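To illustrate how demographic fields can serve as independent variables, the sketch below stores speaker records as a dictionary keyed by speaker id and groups speakers by a chosen field. The speaker ids, field names, and values are hypothetical and do not reproduce the actual TIMIT speaker table; `group_by` is a helper written for this example.

```python
from collections import defaultdict

# Hypothetical speaker records: a key (speaker id) plus demographic fields.
speakers = {
    "spk01": {"sex": "F", "dialect_region": "dr1", "education": "BS"},
    "spk02": {"sex": "M", "dialect_region": "dr1", "education": "HS"},
    "spk03": {"sex": "F", "dialect_region": "dr2", "education": "MS"},
}


def group_by(field):
    """Map each value of `field` to the sorted list of speaker ids having it."""
    groups = defaultdict(list)
    for speaker_id, record in sorted(speakers.items()):
        groups[record[field]].append(speaker_id)
    return dict(groups)
```

Grouping by `"dialect_region"` or `"sex"` partitions the speakers along one variable, which is the first step in checking whether some observed variation in the data correlates with it.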

