Thursday, January 22, 2009

The Corpus as Performance Art

Keywords, notes and quotes from:

Corpus Linguistics
Tony McEnery and Andrew Wilson
Edinburgh University Press
1996
206 pages

2 - Corpus linguistics is a methodology, rather than an aspect of language
2 - Corpus linguistics applied to syntax, semantics and pragmatics
3 - longitudinal studies
4 - "basic corpus methodology was widespread in linguistics for a long period of time."
4 - Chomsky (1957 & 1965) - "invalidated the corpus as a source of evidence in linguistic inquiry." {... and apparently single-handedly led a migration away from empiricism towards rationalism in linguistics.}
5 - Chomsky - language competence vs. performance (I & E languages)
5 - performance a poor model of competence {corpus as performance art?}
6 - observation vs. introspection
6 - explicandum
8 - language is non-enumerable
8 - corpora are skewed
10 - early limitation - data processing
12 - frequency
15 - concordance program
18 - corpus and intuition complementary
19 - footnote ten - "One is irresistibly drawn to remembering Galton's words:
'until the phenomena of any branch of knowledge have been submitted to measurement and number, it cannot assume the dignity of a science'" {bold lentil's italics}
19 - Leech (92 and 91) references
21 - Corpus, any body of text with 1. sampling and representativeness, 2. finite size, 3. machine readable form, and 4. a standard reference.
22 - accurate picture of variety
22 - Monitor corpus
22 - Synchronic snapshot
24 - annotation
26 - CoCoA references
27 - Text Encoding Initiative - TEI
33 - Orthography, character encoding, 35
36 - parts-of-speech annotation, grammatical tagging, morphosyntactic annotation
39 - programs to tag parts of speech - 95% success rates (such as CLAWS)
40 - Ditto tags - used for more than one word in a phraseological unit...
42 - lemmatization - reducing words to their respective lexemes, or the head word form that would be looked up in a dictionary
42 - lemma
43 - parsing
43 - treebanks
46 - constraint grammar
46 - context-free phrase structure grammar
49 - generally auto-parsing software has lower success rates than automatic parts-of-speech tagging
49 - semantics
50 - content analysis
51 - 31241100 - colour - semantic tag? Wilson - forthcoming? from Schmidt (1993)
52 - discourse tags
52 - anaphoric annotation - marking of pronoun reference
cohesion - 'the vehicle by which elements in texts are interconnected through use of pronouns, repetition, substitution, etc' {as in this post lacks cohesion}
53 - bootstrapping problem
54 - prosodic annotation - suprasegmental features, stress, intonations, pauses - significant differences in transcribers
57 - multilingual corpora
54 - parallel corpora
58 - aligned corpora
61 - corpus - maximally representative finite sample
64 - representativeness - sample, population and sampling frame
65 - demographic sampling, stratificational sampling
66 - "rarer features on the other hand show more variations in their distributions and consequently require larger samples..." de Haan (1992)
67 - Language and Computers and Statistics for Corpus Linguistics
67 - tokens and types - frequency counts
68 - proportions - useful for comparing frequencies across corpora
69 - statistical significance testing - chi-squared test - widely used in corpus linguistics: + more sensitive than the t-test, + does not assume normal data, + in 2 x 2 tables is easy to compute - but is unreliable with small frequencies
71 - collocations - characteristic co-occurrence patterns of words
71 - mutual information and z-score - the more strongly connected two items are, the higher the mutual information
73 - multi-variate analysis: factor analysis, principal component analysis, correspondence analysis, multi-dimensional scaling, and cluster analysis
74 - Cross tabulation, intercorrelation matrix, factor analysis - "summarize" the similarities in variables with factors.
75 - correspondence analysis - summarize similarities by a smaller number of "best fit" axes
76 - Cluster analysis: single vs. average linkage, hierarchical - dendrogram, and non-hierarchical items can be in more than one cluster
79 - Hayashi's quantification Method Type III
82 - log-linear analysis
83 - Variable rule analysis - VARBRUL
83 - Probabilistic language modeling
87 - "Although many researchers will refer to their data as a corpus, frequently these data do not fit the definition of a corpus in the sense that we have tended to use that term in this book - as many other corpus linguists have - for a body of text which is carefully sampled to be maximally representative of a language or language variety."
89 - Intonation group boundaries
90 - Lexicography - study of words
94 - parser
95 - Probabilistically ordered choices...
95 - Register
97 - Fuzzy categories and gradience
98 - Pragmatics - "meaning in context"
99 - Elicited vs. naturalistic data
99 - Dialectology
107 - Diachronic - something happening over time, as opposed to synchronic - which is a "snapshot" at a specific time.
108 - Mystery of vanishing reliability
109 - "Common core" vs. 'unique Englishes'
112 - Studies of linguistic impairment
117 - "new, and alternative goals [for computational linguistics] have been added; such as the creation of working systems based on empirical data which sacrifice some or all cognitive plausibility for improved performance."
118 - Cognitively plausible vs. cognitively implausible
118 - 'Quantitative approaches to the solution of problems in artificial intelligence have long had a difficult hurdle to leap, so clearly expressed by McCarthy:' "Where do all the numbers come from?" 'The answer for NLP for now at least is clear - a corpus.' {not to pick nits but shouldn't it be - corpora?}
118 - "but where cognitive plausibility is sacrificed to brute force mathematical modeling, corpora are the sine qua non of such an approach. Corpora provide the necessary raw data for approaches to computational linguistics based upon abstract numerical modeling."
118 - Knowledge base
119 - Disambiguation
119 - Part-of-speech annotation
120 - "so the corpus, unlike the chicken or the egg, definitely comes first!"
119 - "any sacrifice of cognitive plausibility is most often one of degree, and rarely an absolute."
120 - Part-of-speech taggers
120 - Lexicon - a corpus can be used to build one, typically of hundreds of thousands of words.
122 - morphological analysis
122 - Syntactic idiom
123 - disambiguation based on a matrix of probabilities
123 - 'human grammar is essentially a probabilistic grammar' Halliday
124 - "disambiguation remains the predominant use of quantitative data in NLP."
128 - "postulate semantic affinities between words by measuring the frequency with which words co-occur in close proximity to one another."
125 - tagset - part-of-speech distinctions made by a machine learning system
128 - Association ratio - how often do words co-occur? Also note 12, relative to the frequency
129 - Parsing - 130 - has 30-40% accuracy
131 - 'Knowledge acquisition bottle-neck'
132 - Atheoretic account of language, radical statistical grammars
133 - Independence from hand-crafted linguistic rules
133 - Magerman (1994) Stanford thesis.
137 - "are models of cognition relevant to human beings the best models on which to base computer models?"
139 - "without parallel aligned corpora, there would be no example-based machine translation and no statistical machine translation."
149 - "In short, a corpus should be an exceptionally good tool for identifying and describing a sublanguage, because they both have one thing in common - a finite nature."
137 - Statistical translation - eschews cognitive plausibility
138 - trigram - three words together in a sequence
138 - example-based machine translation
144 & 206 - Zernik - exploiting online resources to build a lexicon
147 - using corpora to examine a linguistic hypothesis
147 - Sublanguages - constrained variety of a language. may be naturally occurring. key feature - lacks typical productivity of a language.
147 - Chapter six as a case study
148 - closure - feature of a language variety in which the language is tending towards being finite.
150 - IBM manuals as a possible sublanguage, or at least showing a higher degree of closure than the other two corpora under consideration.
154 - corpus size - and the frequency of the feature to be observed
154 - open-class part-of-speech
165 - "Each new sentence means a new sentence type." {Likewise, each new incomplete sentence a new incomplete sentence type}
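
{A rough sketch of a few of the quantitative ideas noted above: tokens and types (p. 67), the chi-squared test for a 2 x 2 table (p. 69), mutual information for collocations (p. 71) and trigrams (p. 138). The notes do not record the book's own formulas, so this assumes the standard textbook definitions: Pearson chi-squared without a continuity correction, and pointwise mutual information in bits. The function names and the whitespace tokenizer are made up for illustration.}

```python
import math
from collections import Counter


def type_token_counts(text):
    """Return (token count, type count, frequency table) for a text.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    tokens = text.lower().split()
    freqs = Counter(tokens)
    return len(tokens), len(freqs), freqs


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2 x 2 table [[a, b], [c, d]].

    For example: a = hits for a word in corpus A, b = all other tokens in A,
    c = hits in corpus B, d = all other tokens in B.
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


def mutual_information(pair_count, x_count, y_count, n):
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y))).

    Counts are taken over n observations; higher values mean the two
    items co-occur more often than chance would predict.
    """
    return math.log2((pair_count * n) / (x_count * y_count))


def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


if __name__ == "__main__":
    n_tokens, n_types, freqs = type_token_counts("the cat sat on the mat")
    print(n_tokens, n_types, freqs["the"])
    print(chi_squared_2x2(10, 20, 30, 40))
    print(mutual_information(10, 20, 20, 1000))
    print(trigrams("a b c d".split()))
```

{The small-frequency caveat from p. 69 applies here too: with very low cell counts the chi-squared value is not to be trusted.}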