| PoS Tagged Corpus |
The C7 tagset consists of 150 tags, 14 of which are punctuation tags, the remaining 136 being part of speech tags.
| Tag | Definition |
|---|---|
| * | punctuation tag - asterisk |
| ! | punctuation tag - exclamation mark |
| " | punctuation tag - quotation marks |
| $ | germanic genitive marker - (' or 's) |
| ( | punctuation tag - left bracket |
| ) | punctuation tag - right bracket |
| , | punctuation tag - comma |
| - | punctuation tag - dash |
| ----- | new sentence marker |
| . | punctuation tag - full-stop |
| ... | punctuation tag - ellipsis |
| : | punctuation tag - colon |
| ; | punctuation tag - semi-colon |
| ? | punctuation tag - question-mark |
| APPGE | possessive pronoun, pre-nominal (my, your, our etc.) |
| AT | article (the, no) |
| AT1 | singular article (a, an, every) |
| BCL | before-clause marker (e.g. in order (that)) |
| CC | coordinating conjunction (and, or) |
| CCB | coordinating conjunction (but) |
| CS | subordinating conjunction (if, because, unless) |
| CSA | 'as' as a conjunction |
| CSN | 'than' as a conjunction |
| CST | 'that' as a conjunction |
| CSW | 'whether' as a conjunction |
| DA | after-determiner (capable of pronominal function)(such, former, same) |
| DA1 | singular after-determiner (little, much) |
| DA2 | plural after-determiner (few, several, many) |
| DAR | comparative after-determiner (more, less) |
| DAT | superlative after-determiner (most, least) |
| DB | before-determiner (capable of pronominal function) (all, half) |
| DB2 | plural before-determiner (capable of pronominal function) (eg. both) |
| DD | determiner (capable of pronominal function) (any, some) |
| DD1 | singular determiner (this, that, another) |
| DD2 | plural determiner (these, those) |
| DDQ | wh-determiner (which, what) |
| DDQGE | wh-determiner, genitive (whose) |
| DDQV | wh-ever determiner (whichever, whatever) |
| EX | existential 'there' |
| FO | formula |
| FU | unclassified word |
| FW | foreign word |
| GE | germanic genitive marker - (' or 's) |
| IF | 'for' as a preposition |
| II | preposition |
| IO | 'of' as a preposition |
| IW | 'with'; 'without' as preposition |
| JJ | general adjective |
| JJR | general comparative adjective (older, better, bigger) |
| JJT | general superlative adjective (oldest, best, biggest) |
| JK | adjective catenative ('able' in 'be able to';'willing' in 'be willing to') |
| LE | leading co-ordinator ('both' in 'both...and...';'either' in 'either... or...') |
| MC | cardinal number neutral for number (two, three...) |
| MCGE | genitive cardinal number, neutral for number (10's) |
| MC-MC | hyphenated number (40-50, 1770-1827) |
| MC1 | singular cardinal number (one) |
| MC2 | plural cardinal number (tens, twenties) |
| MD | ordinal number (first, 2nd, next, last) |
| MF | fraction, neutral for number (quarters, two-thirds) |
| NC2 | plural cited word ('ifs', in 'two ifs and a but') |
| ND1 | singular noun of direction (north, southeast) |
| NN | common noun, neutral for number (sheep, cod) |
| NN1 | singular common noun (book, girl) |
| NN2 | plural common noun (books, girls) |
| NNL1 | singular locative noun (street, Bay) |
| NNL2 | plural locative noun (islands, roads) |
| NNO | numeral noun, neutral for number (dozen, thousand) |
| NNO2 | plural numeral noun (hundreds, thousands) |
| NNA | following noun of style or title, abbreviatory (M.A.) |
| NNB | preceding sing. noun of style or title, abbr. (Prof.) |
| NNT1 | singular temporal noun (day, week, year) |
| NNT2 | plural temporal noun (days, weeks, years) |
| NNU | unit of measurement, neutral for number (in., cc.) |
| NNU1 | singular unit of measurement (inch, centimetre) |
| NNU2 | plural unit of measurement (inches, centimetres) |
| NP | proper noun, neutral for number (Indies, Andes) |
| NP1 | singular proper noun (London, Jane, Frederick) |
| NP2 | plural proper noun (Browns, Reagans, Koreas) |
| NPD1 | singular weekday noun (Sunday) |
| NPD2 | plural weekday noun (Sundays) |
| NPM1 | singular month noun (October) |
| NPM2 | plural month noun (Octobers) |
| PN | indefinite pronoun, neutral for number ("none") |
| PN1 | singular indefinite pronoun (one, everything, nobody) |
| PNQO | whom |
| PNQS | who |
| PNQV | whoever |
| PNX1 | reflexive indefinite pronoun (oneself) |
| PPGE | nominal possessive personal pronoun (mine, yours) |
| PPH1 | it |
| PPHO1 | him, her |
| PPHO2 | them |
| PPHS1 | he, she |
| PPHS2 | they |
| PPIO1 | me |
| PPIO2 | us |
| PPIS1 | I |
| PPIS2 | we |
| PPX1 | singular reflexive personal pronoun (yourself, itself) |
| PPX2 | plural reflexive personal pronoun (yourselves, ourselves) |
| PPY | you |
| RA | adverb, after nominal head (else, galore) |
| REX | adverb introducing appositional constructions (namely, viz, eg.) |
| RG | degree adverb (very, so, too) |
| RGQ | wh- degree adverb (how) |
| RGQV | wh-ever degree adverb (however) |
| RGR | comparative degree adverb (more, less) |
| RGT | superlative degree adverb (most, least) |
| RL | locative adverb (alongside, forward) |
| RP | prep. adverb; particle (in, up, about) |
| RPK | prep. adv., catenative ('about' in 'be about to') |
| RR | general adverb |
| RRQ | wh- general adverb (where, when, why, how) |
| RRQV | wh-ever general adverb (wherever, whenever) |
| RRR | comparative general adverb (better, longer) |
| RRT | superlative general adverb (best, longest) |
| RT | nominal adverb of time (now, tomorrow) |
| TO | infinitive marker (to) |
| UH | interjection (oh, yes, um) |
| VB0 | be |
| VBDR | were |
| VBDZ | was |
| VBG | being |
| VBM | am |
| VBN | been |
| VBR | are |
| VBZ | is |
| VD0 | do |
| VDD | did |
| VDG | doing |
| VDN | done |
| VDZ | does |
| VH0 | have |
| VHD | had (past tense) |
| VHG | having |
| VHN | had (past participle) |
| VHZ | has |
| VM | modal auxiliary (can, will, would etc.) |
| VMK | modal catenative (ought, used) |
| VV0 | base form of lexical verb (give, work etc.) |
| VVD | past tense form of lexical verb (gave, worked etc.) |
| VVG | -ing form of lexical verb (giving, working etc.) |
| VVN | past participle form of lexical verb (given, worked etc.) |
| VVZ | -s form of lexical verb (gives, works etc.) |
| VVGK | -ing form in a catenative verb ('going' in 'be going to') |
| VVNK | past part. in a catenative verb ('bound' in 'be bound to') |
| VVI | infinitive (e.g. to give) |
| XX | not, n't |
| ZZ1 | singular letter of the alphabet:'A', 'a', 'B', etc. |
| ZZ2 | plural letter of the alphabet: 'As', b's, etc. |
Any of the tags listed above may in theory be modified by the addition of a pair of numbers to it: eg. DD21, DD22. This signifies that the tag occurs as part of a sequence of similar tags, representing a sequence of words which for grammatical purposes are treated as a single unit. For example the expression in terms of is treated as a single preposition, receiving the tags:
in<CTAG>II31</CTAG> terms<CTAG>II32</CTAG> of<CTAG>II33</CTAG>
The first of the two digits indicates the number of words/tags in the sequence, and the second digit the position of each word within that sequence.
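The numbering scheme can be illustrated with a short Python sketch. The function names `make_ditto_tags` and `split_ditto_tag` are hypothetical and are not part of CLAWS or IDIOMTAG; the sketch also assumes sequences of two to nine words, so that each of the two digits is a single character.

```python
import re

def make_ditto_tags(base_tag, words):
    """Pair each word with base_tag plus two digits:
    the sequence length, then the word's position in the sequence."""
    n = len(words)
    if not 2 <= n <= 9:
        raise ValueError("this sketch assumes sequences of 2 to 9 words")
    return [(word, f"{base_tag}{n}{i}") for i, word in enumerate(words, start=1)]

# Assumption: a ditto tag is a base tag followed by exactly two digits.
_DITTO = re.compile(r"^(?P<base>.+?)(?P<length>[2-9])(?P<pos>[1-9])$")

def split_ditto_tag(tag):
    """Return (base_tag, length, position), or None if tag is not a ditto tag."""
    m = _DITTO.match(tag)
    if m is None:
        return None
    length, pos = int(m["length"]), int(m["pos"])
    if pos > length:
        return None  # position cannot exceed the sequence length
    return m["base"], length, pos

print(make_ditto_tags("II", ["in", "terms", "of"]))
print(split_ditto_tag("DD21"))
```

Ordinary tags that end in a single digit, such as NN1, do not match the two-digit pattern and are returned as `None`.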
Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG which looks for a range of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is taken into account, and also that, depending on the context, ditto-tags may or may not be required for a particular word sequence:
CLAWS, the part-of-speech tagger, had an accuracy of 96-97% when last tested, depending on the type of text. In any large sample of text it is extremely difficult to achieve 100% accuracy, even after humans have post-edited the text, because there can never be enough guidelines to cover every case in the text: the guidelines can deal with most cases of ambiguity, but there will always be new cases.
In ambiguous cases - cases where I was unsure whether to assign NN1 or NP1 for example, I would try to maintain consistency in my decision - thus once I had decided that a word should be given a tag, I would endeavour to stick with this tag throughout the corpus.
Each word is given one part-of-speech tag; numbers are usually given the tag MC. In English, large numbers tend to be written as 1000000 or 1,000,000. In French, large numbers tend to be written as 100 000, and CLAWS would normally tag this as two words. Wherever possible I have joined examples like this in order to make a single number. This is not the case in tables of numbers (which CLAWS represents as long lists of numbers), as it is not easy to be certain which numbers belong together and which should be left alone. However, this occurs rarely in the text.
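The joining of space-separated numbers can be sketched in Python as follows. This is only an illustration of one possible rule, not the procedure actually used, since the joining in the corpus was done by hand. The function name `join_spaced_numbers` and the rule itself (a group of one to three digits followed by groups of exactly three digits, all tagged MC) are assumptions.

```python
import re

def join_spaced_numbers(tokens):
    """tokens is a list of (word, tag) pairs.
    Merge runs such as ('100', 'MC'), ('000', 'MC') into ('100000', 'MC')."""
    out = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag == "MC" and re.fullmatch(r"\d{1,3}", word):
            j = i + 1
            # Absorb following MC tokens that are exactly three digits long.
            while (j < len(tokens) and tokens[j][1] == "MC"
                   and re.fullmatch(r"\d{3}", tokens[j][0])):
                word += tokens[j][0]
                j += 1
            out.append((word, "MC"))
            i = j
        else:
            out.append((word, tag))
            i += 1
    return out

print(join_spaced_numbers([("cost", "NN1"), ("100", "MC"), ("000", "MC"), ("francs", "NN2")]))
```

A rule like this would wrongly merge neighbouring cells in a table of numbers, which is the reason such tables were left untouched.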
APOSTROPHES
Words with apostrophes are split, receiving two tags. These include possessives such as
<TOK><ORTH>the</ORTH><CTAG>AT</CTAG></TOK>
<TOK><ORTH>dog</ORTH><CTAG>NN1</CTAG></TOK>
<TOK><ORTH>'s </ORTH><CTAG>GE</CTAG></TOK>
<TOK><ORTH>ball</ORTH><CTAG>NN1</CTAG></TOK>
and truncated words:
<TOK><ORTH>I</ORTH><CTAG>PPIS1</CTAG></TOK>
<TOK><ORTH>can</ORTH><CTAG>VM</CTAG></TOK>
<TOK><ORTH>'t</ORTH><CTAG>XX</CTAG></TOK>
<TOK><ORTH>go</ORTH><CTAG>VVI</CTAG></TOK>
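Token lines in this format can be read with a small regular expression. The following Python sketch is a minimal illustration; the function name `read_tokens` is hypothetical, and the sketch assumes one well-formed `<TOK>` element per token with no nested markup inside `<ORTH>`.

```python
import re

# One token: <TOK><ORTH>word</ORTH><CTAG>tag</CTAG></TOK>
TOK_RE = re.compile(r"<TOK><ORTH>(.*?)</ORTH><CTAG>(.*?)</CTAG></TOK>")

def read_tokens(text):
    """Return a list of (word, tag) pairs, stripping stray whitespace in the word."""
    return [(orth.strip(), ctag) for orth, ctag in TOK_RE.findall(text)]

sample = """<TOK><ORTH>the</ORTH><CTAG>AT</CTAG></TOK>
<TOK><ORTH>dog</ORTH><CTAG>NN1</CTAG></TOK>
<TOK><ORTH>'s </ORTH><CTAG>GE</CTAG></TOK>
<TOK><ORTH>ball</ORTH><CTAG>NN1</CTAG></TOK>"""

for word, tag in read_tokens(sample):
    print(word, tag)
```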
FOREIGN WORDS.
If a sentence is in French I will usually give all the words in it the tag "FW" (foreign word) unless any of those words are recognisable proper nouns - in which case the proper noun tag "NP1" takes precedence. For names of foreign newspapers/journals/magazines, for the most part I did not recognise these as NP1 so they were given the "FW" tag.
FORMULAIC WORDS.
The "FO" tag is used for letter-number combinations and chemical formulae. EEC directive numbers are tagged "FO."
THE "FU" TAG.
I have used this for words that I am unsure about, or words that I think might be spelling errors where I am unable to decide what the correct word should be. (See part 5)
JJT
This has been used for combination words such as "least-favoured" and "largest-ever" as well as for words such as "biggest" and "smallest".
MCMC
Used for dates such as 1986-7, 1986/7 and 1986—87
DAR-RRR
more and less can be assigned either of these tags.
The difference between them is that DAR is for noun-phrase-like (and determiner) uses of the word in question, whereas RRR is for adverbial uses. The two can be difficult to distinguish, particularly after a verb: eg:
RRR: You should relax more
DAR: You should spend more
II-RP II-RL
Compare:
(a) "He ran down<CTAG>II</CTAG> the hill"
and
(b) "He ran down<CTAG>RP</CTAG> his friends"
In (a), down is a preposition because:
1. You could insert an adverb before it:
"He ran quickly down the hill" But not: "He ran viciously down his friends"
2. You can move it to the front of a relative clause or question:
"This is the hill down which he ran" "Down which hills do you like running?"
In (b), down is an adverbial particle because:
1. You can place it before or after the noun phrase:
"He ran his friends down<CTAG>RP</CTAG>"
But not:
"*He ran the hill down"
2. If you replace the noun phrase with a pronoun, you HAVE TO place the pronoun in front of the particle:
"He ran them down"
But not:
"*He ran down them"
However, tagging errors may occur with stranded prepositions which are denuded of their noun phrase because it has been fronted or ellipted (eg. in relative clauses, passives, questions etc.):
This is the hill (which) she ran down<CTAG>II</CTAG>
(ie. This is the hill down<CTAG>II</CTAG> which she ran)
On Shrove Tuesday, this hill will be run down<CTAG>II</CTAG> by housewives
(ie. Housewives will run down<CTAG>II</CTAG> it)
Which car did you arrive in<CTAG>II</CTAG>?
(ie. In<CTAG>II</CTAG> which car did you arrive?)
The same tests apply to words which are tagged either as prepositions or as locative adverbs RL eg. across, past, behind etc.
JJ/NN1
Words ending in -ing, when they premodify a noun, may be tagged either NN1 or JJ, eg:
New<CTAG>JJ</CTAG> spending<CTAG>NN1</CTAG> reductions<CTAG>NN2</CTAG>
her<CTAG>APPGE</CTAG> acting<CTAG>NN1</CTAG> ability<CTAG>NN1</CTAG>
a<CTAG>AT1</CTAG> working<CTAG>JJ</CTAG> mother<CTAG>NN1</CTAG>
JJ/VVN
The tagging of words like "surprised" in "John was surprised", or "lasting" in "the effect was lasting" can be a problem. In both cases, the word can be a JJ. One test is to see whether you can insert an adverb like "very" in front of the word. eg. in "John was very surprised", "surprised" is a JJ.
Another test, having the opposite effect, is to see whether there is an agent "by"-phrase following an "ed/en" word. If so, it is a VVN. eg. in "John was surprised by the pirates", "surprised" is a VVN. Even where it is not present, the possibility of adding a "by"-phrase, without changing the meaning of the word, is evidence in favour of a VVN. (However, this criterion can clash with the preceding one, since it occasionally happens that an "ed"-word is preceded by an adverb like "very" AND followed by a "by"-phrase: eg. "John was very offended by her remarks". Fortunately, such cases are rare. When they do occur, however, give preference to JJ).
A third test is negative: to see whether the word in question can be placed before a noun. eg:
The effect is lasting: a lasting effect
This shows that "lasting" can be (but need not be) a JJ. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is not a JJ, but a VVG or a VVN.
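The first two tests, and the preference for JJ when they clash, can be summarised in a rough Python sketch. It is only a surface heuristic under stated assumptions: the function name `jj_or_vvn` is hypothetical, "very" is the only degree adverb checked, and any immediately following "by" is treated as the start of an agent phrase, which will not always be true.

```python
DEGREE_ADVERBS = {"very"}  # assumption: only the adverb named in the text

def jj_or_vvn(words, i):
    """Suggest 'JJ' or 'VVN' for the -ed/-en word at index i, or None if
    neither surface test applies and a human decision is needed."""
    has_very = i > 0 and words[i - 1].lower() in DEGREE_ADVERBS
    has_by = i + 1 < len(words) and words[i + 1].lower() == "by"
    if has_very:
        return "JJ"   # also covers the clash case, where JJ is preferred
    if has_by:
        return "VVN"
    return None

print(jj_or_vvn("John was very surprised".split(), 3))
print(jj_or_vvn("John was surprised by the pirates".split(), 2))
```

The third test, and the possibility of adding a "by"-phrase that is not actually present, depend on meaning and cannot be checked this way.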
Even though an "-ing" word is normally a VVG after the verb "be" it is generally treated as a JJ before a noun:
The man was dying<CTAG>VVG</CTAG>
But:
The dying<CTAG>JJ</CTAG> man
When the -ing or -en/ed word forms part of a phrase premodifying the noun, as in the following examples, the VVG/VVN tag is preferred:
interest<CTAG>NN1</CTAG> earning<CTAG>VVG</CTAG> account<CTAG>NN1</CTAG>
a hypothesis<CTAG>NN1</CTAG> driven<CTAG>VVN</CTAG> approach<CTAG>NN1</CTAG>
In these examples, the NN1/ VVG sequence is similar in function to a compound pre-modifying adjective. In hyphenated form they would be given a JJ tag. The same applies when the phrase is a noun-like compound. eg:
a [ carol<CTAG>NN1</CTAG> singing<CTAG>VVG</CTAG> ] contest<CTAG>NN1</CTAG>
If the verb be can be replaced by another verb such as seem or become, without changing the meaning of the following JJ/VVN word, this is a strong indication that the construction is not properly a passive, and that the word is a JJ. eg:
The building was infested<CTAG>JJ</CTAG> with cockroaches
(The building became/seemed infested...)
I could see he was favourably disposed<CTAG>JJ</CTAG> to the idea
(He seemed favourably disposed...)
A further distinction which can be used as a test with 'event' verbs is that the JJ refers to a 'resultant state', whereas the VVN refers to an event. eg:
Bill was married<CTAG>JJ</CTAG> (as opposed to single)
Bill was married<CTAG>VVN</CTAG> to Sarah on May 14th (the actual event)
Some further examples:
Three people were injured<CTAG>VVN</CTAG> in the accident
I could see he was (seemed) injured<CTAG>JJ</CTAG>
He lay injured<CTAG>JJ</CTAG> on the road
We have three injured<CTAG>JJ</CTAG> players in the side
Our players are not worried<CTAG>JJ</CTAG>
She is not worried<CTAG>VVN</CTAG> by that sort of threat
RG/RR
RG is restricted to adverbs of degree (also called intensifiers, etc.) which precede the word or expression they modify. Clear cases of RG are very, and so and as in comparatives (see section on as below).
Adverbs which have a range of functions, including adverb of degree, are not normally tagged RG, but are given the more general RR tag instead.
She<CTAG>PPHS1</CTAG> was<CTAG>VBDZ</CTAG> scantily<CTAG>RR</CTAG> clad<CTAG>JJ</CTAG>
Here 'scantily' is an RR rather than an RG because it could also occur after a verb:
She<CTAG>PPHS1</CTAG> dressed<CTAG>VVD</CTAG> scantily<CTAG>RR</CTAG>
This is another case of the general principle of avoiding general-specific ambiguities within a word class. RG is usually only for words which do not have a more general range of adverbial uses.
There are exceptions to this, however. (See Section 2: Adverbs. See also Section 4: so). The words which may be tagged RG or RR are:
Examples:
She is so<CTAG>RG</CTAG> attractive
I would think so<CTAG>RR</CTAG>
This is too<CTAG>RG</CTAG> heavy
Can I come too<CTAG>RR</CTAG>?
That's rather<CTAG>RG</CTAG> nice
I would rather<CTAG>RR</CTAG> go out
He's quite<CTAG>RG</CTAG> talkative
Quite<CTAG>RR</CTAG>, I agree
Note that about may be an RP or an RG. However, this does not violate the principle mentioned above, since both RP and RG are sub-categories of RR:
as can be tagged RG, II or CSA.
It is an RG when it occurs before an adjective, adverb or determiner (and sometimes other words) in phrases such as:
I don't think that one is as<CTAG>RG</CTAG> good
I go there as<CTAG>RG</CTAG> often (as...)
There are not as<CTAG>RG</CTAG> many (as...)
In the 2nd and 3rd examples above, the second as is always a CSA because it introduces a comparative construction (an equal comparison, as contrasted with an unequal comparison introduced by than). Thus, in the following, the second as is tagged CSA:
She's not as<CTAG>RG</CTAG> (or so<CTAG>RG</CTAG>) pretty as<CTAG>CSA</CTAG> I thought
An ostrich can run as<CTAG>RG</CTAG> quickly as<CTAG>CSA</CTAG> a zebra
He has as<CTAG>RG</CTAG> many as<CTAG>CSA</CTAG> six children
Notice that as in this comparative use is tagged CSA whether or not it introduces a clause, as normally understood. In the second case above, as precedes a noun phrase. In the following, it precedes an adjective:
Please come as<CTAG>RG</CTAG> quickly as<CTAG>CSA</CTAG> possible
CSA is also the tag used when as introduces other clauses (eg. clauses of time or clauses of reason). eg:
As<CTAG>CSA</CTAG> I arrived, he was leaving
I'll lend you the money, as<CTAG>CSA</CTAG> you're my friend
II is the tag for as as an undoubted preposition - it usually has an equative meaning, as in:
They regard him as<CTAG>II</CTAG> a friend
As<CTAG>II</CTAG> governor of the province, I have to take action
The guideline restricts II to cases of as followed by a noun-phrase-type structure - which may be a pronoun. If as is followed by an adjective, a past participle etc., it is tagged CSA, even though it has the same equative type of meaning as as<CTAG>II</CTAG>. eg:
The novel as<CTAG>CSA</CTAG> originally written
Many people regard his paintings as<CTAG>CSA</CTAG> hideous
NAMES OF PROJECTS/PROGRAMMES/FUNDS/TREATIES
In most cases these will be tagged NP1, even when the tag NN1 would have been valid (e.g. Force). Acronyms that represent company names, or names of groups are usually given the "NP1" tag too. An example of an acronym that could receive "NN1" is "SOS".
CONCERNED
This will almost always be "JJ" unless used in a sentence such as "I concerned myself with the information."
ONE
MC1 where one precedes a noun or noun phrase, as in:
one<CTAG>MC1</CTAG> book
one<CTAG>MC1</CTAG> bag of spuds
and where it is the head of a noun phrase with a dependent prepositional phrase:
one<CTAG>MC1</CTAG> of the books
and when referring to 'one' as a number entity:
this is the number one<CTAG>MC1</CTAG>
one<CTAG>MC1</CTAG> is an integer
type a one<CTAG>MC1</CTAG> at the prompt
PN1 where it is a personal pronoun such as:
one<CTAG>PN1</CTAG> ought to be careful
one<CTAG>PN1</CTAG> doesn't like to make a fuss
and when functioning as a substitute form:
the prettiest one<CTAG>PN1</CTAG> is called Flo
the one<CTAG>PN1</CTAG> you are holding is a bomb
his idea is not one<CTAG>PN1</CTAG> that holds much water
SINCE
When used to mean "because" or "because of" since is tagged as CS. When used in phrases such as "ten years since" it is tagged RR, and when used in phrases such as "Since September..." it is tagged II.
SO
The CS tag is used when so is equivalent to the expression so that. It has a purposive function:
We hid it so<CTAG>CS</CTAG> no one would notice
He only said it so<CTAG>CS</CTAG> he could impress us
It is an RR when it occurs, usually after a punctuation mark or at the beginning of a sentence, with a meaning approximating to therefore:
It is raining, so<CTAG>RR</CTAG> I am staying at home
So<CTAG>RR</CTAG> we gave up the struggle, you see
He swore at me, so<CTAG>RR</CTAG> I hit him
It is likewise an RR if preceded by a conjunction in examples like those directly above:
He swore at me, and<CTAG>CC</CTAG> so<CTAG>RR</CTAG> I hit him
In expressions where so is used as a substitute form, and in cases where its use is clearly adverbial (= like that), it is tagged RR:
substitute:
so<CTAG>RR</CTAG> I believe
I might feel that, but I would never say so<CTAG>RR</CTAG>
So<CTAG>RR</CTAG> did John
I'm afraid so<CTAG>RR</CTAG>
adverbial clause:
Don't take on so<CTAG>RR</CTAG>!
It is tagged RG when used in positions where very could occur:
She is so<CTAG>RG</CTAG> friendly
I have never been so<CTAG>RG</CTAG> angry
Thank you so<CTAG>RG</CTAG> much
and when it corresponds to the first as in 'as...as...' comparisons:
They're not doing so<CTAG>RG</CTAG> well<CTAG>RR</CTAG> as<CTAG>CSA</CTAG> before
UNTIL
Until is tagged II when it is used to mean "in" and CS when it is used to mean "when".
WHEN
When may be tagged RRQ or CS. When can introduce three types of clause:
When it introduces an adverbial clause or a non-restrictive relative clause, it is a CS. When it introduces either a noun clause or a restrictive relative clause, it is an RRQ. Examples:
- adverbial clause:
When<CTAG>CS</CTAG> I arrived, John left
John left when<CTAG>CS</CTAG> I arrived (at the time at which)
I smoke when<CTAG>CS</CTAG> I'm tense (whenever)
- noun clause:
I cannot remember when<CTAG>RRQ</CTAG> I was christened
I don't know when<CTAG>RRQ</CTAG> the next bus is due
(the date/point in time at which)
- relative clause:
In the year when<CTAG>RRQ</CTAG> I was born (in which)
The moment when<CTAG>RRQ</CTAG> he arrived (at which)
Note that when can often be omitted in a relative clause.
There are also non-restrictive relative clauses introduced by when, which are now to be tagged as CS. Previously they were tagged RRQ. It is no longer necessary to distinguish these from adverbial clauses introduced by when. Here are some examples of non-restrictive relative clauses:
In 1968, when<CTAG>CS</CTAG> the students were revolting in Paris...
Here, when could best be paraphrased as at the time when.
Another example:
School finished at 4 o'clock precisely, when<CTAG>CS</CTAG> a loud bell sounded
Non-restrictive relative clauses do not define or restrict the meaning of the antecedent. If the antecedent is a precise temporal expression (such as "4 o'clock", "1990", "yesterday"), when is usually a non-restrictive relative. These are different from restrictive relatives, such as:
In the year when<CTAG>RRQ</CTAG> I was born
Here the year is defined by the relative clause. Typically restrictive relatives are not preceded by a comma, and the when can normally be omitted. Another use of when<CTAG>RRQ</CTAG> is in direct questions:
When<CTAG>RRQ</CTAG> did you find out?
In abbreviated adverbial clauses, where when is followed by an adjective, a preposition phrase, a non-finite clause etc., when is a CS:
when<CTAG>CS</CTAG> ready
when<CTAG>CS</CTAG> in doubt
when<CTAG>CS</CTAG> arriving late
but before an infinitive, when is an RRQ:
I don't know when<CTAG>RRQ</CTAG> to apply
Note that the infinitive clause may be implied:
Tell me when<CTAG>RRQ</CTAG> (to start)
and that a noun clause may be abbreviated simply to the word when:
It was Guy Fawkes, but I can't remember when<CTAG>RRQ</CTAG>
WHERE
The tagging of where is consistent with when.
File 006
What legal proceedings will now by initiated against the offenders
To what extent can UNICE be said do be representing the employers'
As the Honorable Member States, the synthetic fibers industry has experienced... (is this a verb? should it be lowercase?)
seminars on community polices of particular interest (policies?)
Experience over the last 25 year with this... (years?)
25 MW, tghe output
the provision of these service is the responsibility (services?)
to the answer which is gave to Written Question...
N provision is made (No?)
of a lega system
and that used for calculating pension for women
N other states
status of black rhino's
In this context they Stated (remove captialisation)
greenhouse gase emissions
Question N 2513/91
had Stated its readiness
In EPC aware that the...
oral question N H-0544/92
has itself Stated that it..
Unectef v. <S> Heylens
File 016
which contradicts this assessments
Question N 2620/88
a unilateral decisions
consulting with workers representatives (add apostrophe?)
elected workers representatives
offices of the Holy See outside that State
of the Holy See
Rgion wallone
bevore President Najibullah's
The proceed were used to
provide en absolute guarantee of
File 032
Dr </S> **15;1324;S
fatal accidcents
Commission v. </S>
are the figure for the participation
The conflicts with at study
througout the area
N preliminary
Vol. <S/> advisory </S> </PAR> **52;46071;PAR **17;46124;S committees
the eoncouragement
views correct represented
Bleis v. </S>
which where the first to
as regard the
File 040
Werner v. </S>
Cocentrate intake
'Tokayj' wines
Italy v. </S>
During the 1980's
authorities, commitment to (should be apostrophe)
File 047
Latein American
Develpment
cf. </S>
have effectd
one of thom have
groupes of experts
chlidren's right (children's rights?)
N information
N details
at international closely level
Neverthelesss, special
Co. </S>
Co. </S>
N specific
N specific
and elsehwere
leakage or radioactive wastes (of?)
N community
N new arrangements
in wich
File 051
VRA's have been possible
N extension of such VRA's
N such understanding
macroeconmic
customers information (apostrophe needed?)
export of cosmetic containing (plural?)
e.g. </S>
N project
meting within
question N 672/92
done so do ratify (to?)
File 058
transport bypipeline
N one would contest
been earmarked or this (for?)
N follow-up
Prof. </S>
Prof. </S>
Prof. </S>
Prof. </S>
instal systems
question N
The blody events
File 061
An agrrement
N solution
plasma derivates
Doc. </S>
N longer applicable
but would provided no information
As regard the duration (plural)
the extend of (extent?)
NGO's
File 065
N project
section of 'Europe 2000 (where is the closing quotation?)
regarding lthe problems
Written Qeestions
N posts
increasing awarenes
establishments producting
a privat consulting
may favour large corporation
how will they by linked
rights issued involved
File 081
Rgime
that rgime
less than three month
Doc. </S>
Where an erroneous tag was found, the correct tag was inserted before it, separated by a dash (-), e.g.:
1.2.1.1.65.2.3.1.3.1\1 TOK La DETRFS
1.2.1.1.65.2.3.1.3.1\4 TOK Commission SUBSFS
1.2.1.1.65.2.3.1.3.1\14 PUNCT , YAAA
1.2.1.1.65.2.3.1.3.1\16 TOK qui PRELFS-PRELMS
1.2.1.1.65.2.3.1.3.1\20 TOK avait AUXA3-VERB3
1.2.1.1.65.2.3.1.3.1\26 TOK deja ADVE
1.2.1.1.65.2.3.1.3.1\31 TOK annoncé PPASMS
1.2.1.1.65.2.3.1.3.1\39 TOK cela PDETMS
Here the tag PRELMS was automatically assigned for "qui", and this was corrected to PRELFS. Likewise AUXA3 was inserted for "avait". For this example diacritic characters and the candidate tags have been removed but in the output files none of the text has been removed. The only changes made to the files are the addition of the correct tags, and the insertion of question marks on tags where the text is clearly wrong (usually a typographical error), or has been incorrectly segmented by the tagging program.
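The correction convention can be read back mechanically. The Python sketch below is a minimal illustration, assuming each record has exactly four whitespace-separated fields (reference, type, word, tag) and that no tag in the French tagset itself contains a dash; the function name `parse_record` and the dictionary keys are hypothetical.

```python
def parse_record(line):
    """Split one record into its fields. A tag written as CORRECT-ERRONEOUS
    is separated into the corrected tag and the tag it replaced."""
    ref, kind, word, tag = line.split()
    correct, dash, erroneous = tag.partition("-")
    return {
        "ref": ref,
        "kind": kind,
        "word": word,
        "tag": correct,
        "replaced": erroneous if dash else None,  # None when no correction was made
    }

record = parse_record(r"1.2.1.1.65.2.3.1.3.1\16 TOK qui PRELFS-PRELMS")
print(record["word"], record["tag"], record["replaced"])
```

Collecting the records whose `replaced` value is not `None` gives a list of every correction made to the automatic tagging.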
| Tag | Definition |
|---|---|
| ADJEFP | adjective feminine plural |
| ADJEFS | adjective feminine singular |
| ADJEMP | adjective masculine plural |
| ADJEMS | adjective masculine singular |
| ADJIFP | indefinite adjective feminine plural |
| ADJIFS | indefinite adjective feminine singular |
| ADJIMP | indefinite adjective masculine plural |
| ADJIMS | indefinite adjective masculine singular |
| AUXA | auxiliary "avoir" infinitive |
| AUXA1 | auxiliary "avoir" 1st person singular |
| AUXA2 | auxiliary "avoir" 2nd person singular |
| AUXA3 | auxiliary "avoir" 3rd person singular |
| AUXA4 | auxiliary "avoir" 1st person plural |
| AUXA5 | auxiliary "avoir" 2nd person plural |
| AUXA6 | auxiliary "avoir" 3rd person plural |
| AUXE | auxiliary "être" infinitive |
| AUXE1 | auxiliary "être" 1st person singular |
| AUXE2 | auxiliary "être" 2nd person singular |
| AUXE3 | auxiliary "être" 3rd person singular |
| AUXE4 | auxiliary "être" 1st person plural |
| AUXE5 | auxiliary "être" 2nd person plural |
| AUXE6 | auxiliary "être" 3rd person plural |
| VERB1 | main verb 1st person singular |
| VERB2 | main verb 2nd person singular |
| VERB3 | main verb 3rd person singular |
| VERB4 | main verb 1st person plural |
| VERB5 | main verb 2nd person plural |
| VERB6 | main verb 3rd person plural |
| VINF | main verb infinitive |
| PPASFP | past participle feminine plural |
| PPASFS | past participle feminine singular |
| PPASMP | past participle masculine plural |
| PPASMS | past participle masculine singular |
| PPRE | present participle |
| CCOO | coordinative conjunction |
| CSUB | subordinative conjunction |
| CHIF | numerals |
| DETRFP | determiner feminine plural |
| DETRFS | determiner feminine singular |
| DETRMP | determiner masculine plural |
| DETRMS | determiner masculine singular |
| DINTFP | indefinite determiner feminine plural |
| DINTFS | indefinite determiner feminine singular |
| DINTMP | indefinite determiner masculine plural |
| DINTMS | indefinite determiner masculine singular |
| ADVE | adverb |
| NE | negative adverb : particle "ne" |
| PAS | negative adverb : particle "pas,jamais,point" |
| PREP | adposition |
| PAU | adposition "au" |
| PAUX | adposition "aux" |
| PDEA | adposition "de,à" |
| PDES | adposition "des" |
| PREPMS | adposition "de" |
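| PDETFP | demonstrative pronoun feminine plural |
| PDETFS | demonstrative pronoun feminine singular |
| PDETMP | demonstrative pronoun masculine plural |
| PDETMS | demonstrative pronoun masculine singular |
| PINDFP | indefinite pronoun feminine plural |
| PINDFS | indefinite pronoun feminine singular |
| PINDMP | indefinite pronoun masculine plural |
| PINDMS | indefinite pronoun masculine singular |
| PINTFP | interrogative pronoun feminine plural |
| PINTFS | interrogative pronoun feminine singular |
| PINTMP | interrogative pronoun masculine plural |
| PINTMS | interrogative pronoun masculine singular |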
| PPER1 | personal pronoun 1st person singular |
| PPER2 | personal pronoun 2nd person singular |
| PPER3F | personal pronoun 3rd person feminine singular |
| PPER3M | personal pronoun 3rd person masculine singular |
| PPER4 | personal pronoun 1st person plural |
| PPER5 | personal pronoun 2nd person plural |
| PPER6F | personal pronoun 3rd person feminine plural |
| PPER6M | personal pronoun 3rd person masculine plural |
| PPOBFP | personal pronoun feminine plural (object) |
| PPOBFS | personal pronoun feminine singular (object) |
| PPOBMP | personal pronoun masculine plural (object) |
| PPOBMS | personal pronoun masculine singular (object) |
| PSFP | possessive pronoun feminine plural |
| PSFS | possessive pronoun feminine singular |
| PSMP | possessive pronoun masculine plural |
| PSMS | possessive pronoun masculine singular |
| PREFFP | reflexive pronoun feminine plural |
| PREFFS | reflexive pronoun feminine singular |
| PREFMP | reflexive pronoun masculine plural |
| PREFMS | reflexive pronoun masculine singular |
| PRELFP | relative pronoun feminine plural |
| PRELFS | relative pronoun feminine singular |
| PRELMP | relative pronoun masculine plural |
| PRELMS | relative pronoun masculine singular |
| NPRO | Proper nouns |
| SUBSFP | substantive feminine plural |
| SUBSFS | substantive feminine singular |
| SUBSMP | substantive masculine plural |
| SUBSMS | substantive masculine singular |
| X | unique membership class |
| AAAA | strong punctuation |
| YAAA | weak punctuation |
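Many of the tags above end in a gender letter (F or M) followed by a number letter (P or S). The Python sketch below shows how that suffix could be decoded. It is an illustration only: the function name `gender_number` is hypothetical, and the list of prefixes is an assumption read off the table, restricted so that tags such as PREPMS, whose ending does not mark gender and number, are left alone.

```python
GENDER = {"F": "feminine", "M": "masculine"}
NUMBER = {"P": "plural", "S": "singular"}

# Assumption: prefixes from the table whose tags end in gender + number letters.
INFLECTED_PREFIXES = {
    "ADJE", "ADJI", "PPAS", "DETR", "DINT", "PDET", "PIND",
    "PINT", "PPOB", "PS", "PREF", "PREL", "SUBS",
}

def gender_number(tag):
    """Return (gender, number) for an inflected tag, or None for any other tag."""
    prefix, suffix = tag[:-2], tag[-2:]
    if (prefix in INFLECTED_PREFIXES and len(suffix) == 2
            and suffix[0] in GENDER and suffix[1] in NUMBER):
        return GENDER[suffix[0]], NUMBER[suffix[1]]
    return None

print(gender_number("SUBSFS"))
print(gender_number("PREPMS"))
```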
The tag set consists of 160 tags, 2 of which are punctuation tags; the others are part of speech tags.
| Tag | Definition |
|---|---|
| A | Adjective, predic. |
| AAAA | weak Punctuation mark |
| AFPD | Adjective fem. plur. dative |
| AFPG | Adjective fem. plur. genitive |
| AFSA | Adjective fem. sing. accusative |
| AFSD | Adjective fem. sing. dative |
| AFSG | Adjective fem. sing. genitive |
| AFSN | Adjective fem. sing. nominative |
| AMPA | Adjective masc. plur. accusative |
| AMPD | Adjective masc. plur. dative |
| AMPG | Adjective masc. plur. genitive |
| AMPN | Adjective masc. plur. nominative |
| AMSD | Adjective masc. sing. dative |
| AMSN | Adjective masc. sing. nominative |
| CA | Conjunction Part I |
| CC | Coordinating conjunction |
| CHIF | Numbers |
| CI | Subord. conjunctions introd. an infinit. clause |
| CS | Subordinating conjunction |
| CV | Comparative Conjunction |
| CZ | Conjunction Part II |
| DD | Dem. determiner |
| DFSA | Dem. determiner, fem. sing. accusative |
| DFSD | Dem. determiner, fem. sing. dative |
| DFSG | Dem. determiner, fem. sing. genitive |
| DFSN | Dem. determiner, fem. sing. nominative |
| DI | Indef. determiner |
| DMSA | Dem. determiner, masc. sing. accusative |
| DMSD | Dem. determiner, masc. sing. dative |
| DMSG | Dem. determiner, masc. sing. genitive |
| DMSN | Dem. determiner, masc., sing., nominative |
| DNSA | Dem. determiner, neut. sing. accusative |
| DNSD | Dem. determiner, neut. sing. dative |
| DNSG | Dem. determiner, neut. sing. genitive |
| DNSN | Dem. determiner, neut. sing. nominative |
| DT | Interrog. determiner |
| I | Interjection |
| NCFPA | Common noun, fem. plur., accusative |
| NCFPD | Common noun, fem. plur., dative |
| NCFPG | Common noun, fem. plur., genitive |
| NCFPN | Common noun, fem. plur., nominative |
| NCFSA | Common noun, fem. sing., accusative |
| NCFSD | Common noun, fem. sing., dative |
| NCFSG | Common noun, fem. sing., genitive |
| NCFSN | Common noun, fem. sing., nominative |
| NCMPA | Common noun, masc. plur., accusative |
| NCMPD | Common noun, masc. plur., dative |
| NCMPG | Common noun, masc. plur., genitive |
| NCMPN | Common noun, masc. plur., nominative |
| NCMSA | Common noun, masc. sing., accusative |
| NCMSD | Common noun, masc. sing., dative |
| NCMSG | Common noun, masc. sing., genitive |
| NCMSN | Common noun, masc. sing., nominative |
| NCNPA | Common noun, neut. plur., accusative |
| NCNPD | Common noun, neut. plur., dative |
| NCNPG | Common noun, neut. plur., genitive |
| NCNPN | Common noun, neut. plur., nominative |
| NCNSA | Common noun, neut. sing., accusative |
| NCNSD | Common noun, neut. sing., dative |
| NCNSG | Common noun, neut. sing., genitive |
| NCNSN | Common noun, neut. sing., nominative |
| NPRO | Proper noun |
| PD | Dem. pronoun |
| PI | Indef. pronoun |
| PR | Rel. pronoun |
| PT | Interrogative pronoun |
| PX | Refl. pronoun |
| QI | infinitive particle |
| QS | superlative particle |
| QV | verbal prefix |
| RG | General adverb |
| RI | interrogative adverb |
| RP | pronominal adverb |
| SPC | pre-position, clitic |
| SPS | pre-position, simple |
| STS | post-position, simple |
| VAII1P | Aux. verb, 1st pers. pl. ind. imp. |
| VAII1S | Aux. verb, 1st pers. sg. ind. imp. |
| VAII2P | Aux. verb, 2nd pers. pl. ind. imp. |
| VAII2S | Aux. verb, 2nd pers. sg. ind. imp. |
| VAII3P | Aux. verb, 3rd pers. pl. ind. imp. |
| VAII3S | Aux. verb, 3rd pers. sg. ind. imp. |
| VAIP1P | Aux. verb, 1st pers. pl. ind. pres. |
| VAIP1S | Aux. verb, 1st pers. sg. ind. pres. |
| VAIP2P | Aux. verb, 2nd pers. pl. ind. pres. |
| VAIP2S | Aux. verb, 2nd pers. sg. ind. pres. |
| VAIP3P | Aux. verb, 3rd pers. pl. ind. pres. |
| VAM2P | Aux. verb, 2nd pers. pl. imperative |
| VAPP | Aux. verb, pres. participle |
| VAPS | Aux. verb, past part. |
| VASI1P | Aux. verb, 1st pers. pl. subj. imp. |
| VASI1S | Aux. verb, 1st pers. sg. subj. imp. |
| VASI2P | Aux. verb, 2nd pers. pl. subj. imp. |
| VASI2S | Aux. verb, 2nd pers. sg. subj. imp. |
| VASI3P | Aux. verb, 3rd pers. pl. subj. imp. |
| VASI3S | Aux. verb, 3rd pers. sg. subj. imp. |
| VASP1P | Aux. verb, 1st pers. pl. subj. pres. |
| VASP1S | Aux. verb, 1st pers. sg. subj. pres. |
| VASP2P | Aux. verb, 2nd pers. pl. subj. pres. |
| VASP2S | Aux. verb, 2nd pers. sg. subj. pres. |
| VASP3P | Aux. verb, 3rd pers. pl. subj. pres. |
| VASP3S | Aux. verb, 3rd pers. sg. subj. pres. |
| VMII1P | Main verb, 1st pers. pl. ind. imp. |
| VMII1S | Main verb, 1st pers. sg. ind. imp. |
| VMII2P | Main verb, 2nd pers. pl. ind. imp. |
| VMII2S | Main verb, 2nd pers. sg. ind. imp. |
| VMII3P | Main verb, 3rd pers. pl. ind. imp. |
| VMII3S | Main verb, 3rd pers. sg. ind. imp. |
| VMIP1P | Main verb, 1st pers. pl. ind. pres. |
| VMIP1S | Main verb, 1st pers. sg. ind. pres. |
| VMIP2P | Main verb, 2nd pers. pl. ind. pres. |
| VMIP2S | Main verb, 2nd pers. sg. ind. pres. |
| VMIP3P | Main verb, 3rd pers. pl. ind. pres. |
| VMIP3S | Main verb, 3rd pers. sg. ind. pres. |
| VMM2 | Main verb, 2nd pers. pl. imperative |
| VMM2P | Main verb, 2nd pers. pl. imperative |
| VMPP | Main verb, pres. participle |
| VMPS | Main verb, past part. |
| VMSI1P | Main verb, 1st pers. pl. subj. imp. |
| VMSI1S | Main verb, 1st pers. sg. subj. imp. |
| VMSI2P | Main verb, 2nd pers. pl. subj. imp. |
| VMSI2S | Main verb, 2nd pers. sg. subj. imp. |
| VMSI3P | Main verb, 3rd pers. pl. subj. imp. |
| VMSI3S | Main verb, 3rd pers. sg. subj. imp. |
| VMSP1P | Main verb, 1st pers. pl. subj. pres. |
| VMSP1S | Main verb, 1st pers. sg. subj. pres. |
| VMSP2P | Main verb, 2nd pers. pl. subj. pres. |
| VMSP2S | Main verb, 2nd pers. sg. subj. pres. |
| VMSP3P | Main verb, 3rd pers. pl. subj. pres. |
| VMSP3S | Main verb, 3rd pers. sg. subj. pres. |
| VMU | Main verb, infinitive with incorp. particle |
| VOII1P | Mod. verb, 1st pers. pl. ind. imp. |
| VOII1S | Mod. verb, 1st pers. sg. ind. imp. |
| VOII2P | Mod. verb, 2nd pers. pl. ind. imp. |
| VOII2S | Mod. verb, 2nd pers. sg. ind. imp. |
| VOII3P | Mod. verb, 3rd pers. pl. ind. imp. |
| VOII3S | Mod. verb, 3rd pers. sg. ind. imp. |
| VOIP1P | Mod. verb, 1st pers. pl. ind. pres. |
| VOIP1S | Mod. verb, 1st pers. sg. ind. pres. |
| VOIP2P | Mod. verb, 2nd pers. pl. ind. pres. |
| VOIP2S | Mod. verb, 2nd pers. sg. ind. pres. |
| VOIP3P | Mod. verb, 3rd pers. pl. ind. pres. |
| VOIP3S | Mod. verb, 3rd pers. sg. ind. pres. |
| VOPP | Mod. verb, pres. participle |
| VOPS | Mod. verb, past part. |
| VOSI1P | Mod. verb, 1st pers. pl. subj. imp. |
| VOSI1S | Mod. verb, 1st pers. sg. subj. imp. |
| VOSI2P | Mod. verb, 2nd pers. pl. subj. imp. |
| VOSI2S | Mod. verb, 2nd pers. sg. subj. imp. |
| VOSI3P | Mod. verb, 3rd pers. pl. subj. imp. |
| VOSI3S | Mod. verb, 3rd pers. sg. subj. imp. |
| VOSP1P | Mod. verb, 1st pers. pl. subj. pres. |
| VOSP1S | Mod. verb, 1st pers. sg. subj. pres. |
| VOSP2P | Mod. verb, 2nd pers. pl. subj. pres. |
| VOSP2S | Mod. verb, 2nd pers. sg. subj. pres. |
| VOSP3P | Mod. verb, 3rd pers. pl. subj. pres. |
| VOSP3S | Mod. verb, 3rd pers. sg. subj. pres. |
| X | Formulae (5x + 3y), Symbols (%, ' etc.) |
| YAAA | strong Punctuation mark |
To train the tagger, we used the MULTEXT corpus, based on the 'joc' files. Statistics for these files are as follows:
| File | Words | Punctuations | Sentences | Tags | Unknown | Ambiguous |
|---|---|---|---|---|---|---|
| JOC006.DE | 21419 (88.2 %) | 2858 (11.8 %) | 1019 | 3900 | 1686 (7.9 %) | 14665 (68.5 %) |
| JOC016.DE | 17560 (87.9 %) | 2417 (12.1 %) | 824 | 3056 | 1431 (8.1 %) | 11642 (66.3 %) |
| JOC032.DE | 25166 (88.6 %) | 3248 (11.4 %) | 1078 | 4056 | 1950 (7.7 %) | 17113 (68.0 %) |
| JOC040.DE | 26493 (88.3 %) | 3509 (11.7 %) | 1231 | 4566 | 2188 (8.3 %) | 17725 (66.9 %) |
| JOC047.DE | 17468 (88.5 %) | 2276 (11.5 %) | 808 | 3076 | 1360 (7.8 %) | 12108 (69.3 %) |
| JOC051.DE | 16101 (87.8 %) | 2239 (12.2 %) | 751 | 2816 | 1370 (8.5 %) | 10709 (66.5 %) |
| JOC058.DE | 21003 (88.0 %) | 2857 (12.0 %) | 921 | 3470 | 1835 (8.7 %) | 14078 (67.0 %) |
| JOC061.DE | 20538 (88.7 %) | 2628 (11.3 %) | 963 | 3700 | 1625 (7.9 %) | 14005 (68.2 %) |
| JOC065.DE | 22443 (87.9 %) | 3078 (12.1 %) | 1046 | 3940 | 1833 (8.2 %) | 15213 (67.8 %) |
| JOC081.DE | 13352 (89.5 %) | 1572 (10.5 %) | 549 | 2124 | 1060 (7.9 %) | 9279 (69.5 %) |
Where:
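The percentages in these statistics appear consistent with the following reading, which is our inference from the figures rather than a definition given in the source: the Words and Punctuations shares are taken over all tokens (words plus punctuation marks), while the Unknown and Ambiguous shares are taken over words only. A minimal Python sketch, using the JOC006.DE figures and a hypothetical helper name `shares`:

```python
def shares(words, punct, unknown, ambiguous):
    """Recompute the percentage columns under the assumed denominators."""
    tokens = words + punct  # assumption: tokens = words + punctuation marks
    return {
        "words_pct": round(100 * words / tokens, 1),
        "punct_pct": round(100 * punct / tokens, 1),
        "unknown_pct": round(100 * unknown / words, 1),      # share of words
        "ambiguous_pct": round(100 * ambiguous / words, 1),  # share of words
    }

# Figures for JOC006.DE, copied from the statistics above.
joc006 = shares(words=21419, punct=2858, unknown=1686, ambiguous=14665)
print(joc006)
```

Under this reading the sketch reproduces the four percentages reported for JOC006.DE (88.2, 11.8, 7.9 and 68.5).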
We created an equiprobable matrix (also called a 'zero matrix'); Baum-Welch parameter re-estimation was then applied to the texts after lexical lookup, with 10 iterations on each file.
RESULTS:
| Corpus | Words | Errors | Rate |
|---|---|---|---|
| joc06+16 | 38879 | 15458 | 39.76% |
An initial matrix was created from a hand-tagged corpus:
Words 13114 (88.0 %)
Punctuations 1788 (12.0 %)
Sentences 651
Tags 2490
Baum-Welch parameter re-estimation was then applied to the texts after lexical lookup, with 10 iterations on each file.
RESULTS:
| Corpus | Words | Errors | Rate |
|---|---|---|---|
| joc06+16 | 38879 | 10672 | 27.38% |
An initial matrix was created from a hand-tagged corpus:
Words 13114 (88.0 %)
Punctuations 1788 (12.0 %)
Sentences 651
Tags 2490
Baum-Welch parameter re-estimation was then applied to the texts after lexical lookup, together with some corrections described in section 3.3.a, with 10 iterations on each file.
We have numbered some errors in the lexicon:
RESULTS:
| Corpus | Words | Errors | Rate |
|---|---|---|---|
| joc06+16 | 38879 | 2791 | 7.18% |
Because the real output format does not fit on a single line, each entry is split over two lines here:
FIRST LINE
first column reference (1.2.1.1.2.2.3.1.1.1\1)
second column segmenter Classes (CHUNK, TOK etc.)
third column textual element
fourth column tag (bold characters)
SECOND LINE
ambiguity classes separated by pipes (|) between [] :
[betreffen\Vaip3s\VAIP3S|betreffen\Vmip3s\VMIP3S|betreffen\Voip3s\VOIP3S]
Note
This format is the raw output from the tagger, before its conversion into CESANA conformant format.
[CHUNK <DIV_Q FROM='1.2.1.1.2.2.3.1'>
(PAR <HEAD FROM='1.2.1.1.2.2.3.1.1'>
(SENT <S>
1.2.1.1.2.2.3.1.1.1\1 TOK Betrifft VMIP3S
[betreffen\Vaip3s\VAIP3S|betreffen\Vmip3s\VMIP3S|betreffen\Voip3s\VOIP3S]
1.2.1.1.2.2.3.1.1.1\9 PUNCT : YAAA
[:\YAAA\YAAA]
1.2.1.1.2.2.3.1.1.1\11 TOK Personalsituation NPRO
[Personalsituation\Np----\NPRO]
1.2.1.1.2.2.3.1.1.1\29 TOK in P
[in\Pov\P]
1.2.1.1.2.2.3.1.1.1\32 TOK der DFSD
[die\D.fpg\DFPG|die\D.fsd\DFSD|die\D.fsg\DFSG|der\D.mpg\DMPG|das\D.mpg\DMPG|der\D.msn\DMSN|die\Ncfpg\NCFPG|die\Ncfsd\NCFSD|die\Ncfsg\NCFSG|das\Ncmpg\NCMPG]
1.2.1.1.2.2.3.1.1.1\36 TOK Kommission NCFSD
[Kommission\Ncfsa\NCFSA|Kommission\Ncfsd\NCFSD|Kommission\Ncfsg\NCFSG|Kommission\Ncfsn\NCFSN]
)SENT </S>
)PAR </HEAD>
(PAR <P FROM='1.2.1.1.2.2.3.1.2'>
(SENT <S>
1.2.1.1.2.2.3.1.2.1\1 TOK Kann VOIP3S
[können\Vaip1s\VAIP1S|können\Vaip3s\VAIP3S|können\Vmip3s\VMIP3S|können\Voip3s\VOIP3S]
1.2.1.1.2.2.3.1.2.1\6 TOK die DFPA
[die\D.fpa\DFPA|die\D.fpn\DFPN|die\D.fsa\DFSA|die\D.fsn\DFSN|der\D.mpd\DMPD|der\D.mpn\DMPN|das\D.npa\DNPA|das\D.npd\DNPD|das\D.npn\DNPN|die\Ncfpa\NCFPA|die\Ncfpn\NCFPN|die\Ncfsa\NCFSA|die\Ncfsn\NCFSN|der\Ncmpd\NCMPD|der\Ncmpn\NCMPN|das\Ncnpa\NCNPA|das\Ncnpd\NCNPD|das\Ncnpn\NCNPN]
1.2.1.1.2.2.3.1.2.1\10 TOK Kommission NCFSN
[Kommission\Ncfsa\NCFSA|Kommission\Ncfsd\NCFSD|Kommission\Ncfsg\NCFSG|Kommission\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.2.1\21 TOK folgendes ANSA
[folgend\A..nsa\ANSA|folgend\A..nsn\ANSN]
1.2.1.1.2.2.3.1.2.1\31 TOK mitteilen VAIP1P
[mitteilen\Vaip1p\VAIP1P|mitteilen\Vaip3p\VAIP3P|mitteilen\Vasp1p\VASP1P|mitteilen\Vasp3p\VASP3P|mitteilen\Vmip1p\VMIP1P|mitteilen\Vmip3p\VMIP3P|mitteilen\Vmsp1p\VMSP1P|mitteilen\Vmsp3p\VMSP3P|mitteilen\Voip1p\VOIP1P|mitteilen\Voip3p\VOIP3P|mitteilen\Vosp1p\VOSP1P|mitteilen\Vosp3p\VOSP3P]
1.2.1.1.2.2.3.1.2.1\40 PUNCT : YAAA
[:\YAAA\YAAA]
)SENT </S>
)PAR </P>
(PAR <P FROM='1.2.1.1.2.2.3.1.3'>
(SENT <S>
1.2.1.1.2.2.3.1.3.1\1 ENUM 1. CHIF
[1.\M----\CHIF]
1.2.1.1.2.2.3.1.3.1\5 TOK die DFSN
[die\D.fpa\DFPA|die\D.fpn\DFPN|die\D.fsa\DFSA|die\D.fsn\DFSN|der\D.mpd\DMPD|der\D.mpn\DMPN|das\D.npa\DNPA|das\D.npd\DNPD|das\D.npn\DNPN|die\Ncfpa\NCFPA|die\Ncfpn\NCFPN|die\Ncfsa\NCFSA|die\Ncfsn\NCFSN|der\Ncmpd\NCMPD|der\Ncmpn\NCMPN|das\Ncnpa\NCNPA|das\Ncnpd\NCNPD|das\Ncnpn\NCNPN]
1.2.1.1.2.2.3.1.3.1\9 TOK Zahl NCFSN
[Zahl\Ncfsa\NCFSA|Zahl\Ncfsd\NCFSD|Zahl\Ncfsg\NCFSG|Zahl\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.3.1\14 TOK der DFPG
[die\D.fpg\DFPG|die\D.fsd\DFSD|die\D.fsg\DFSG|der\D.mpg\DMPG|das\D.mpg\DMPG|der\D.msn\DMSN|die\Ncfpg\NCFPG|die\Ncfsd\NCFSD|die\Ncfsg\NCFSG|das\Ncmpg\NCMPG]
1.2.1.1.2.2.3.1.3.1\18 TOK bei SPS
[bei\Qv\QV|bei\Sps\SPS]
1.2.1.1.2.2.3.1.3.1\22 TOK der DFSD
[die\D.fpg\DFPG|die\D.fsd\DFSD|die\D.fsg\DFSG|der\D.mpg\DMPG|das\D.mpg\DMPG|der\D.msn\DMSN|die\Ncfpg\NCFPG|die\Ncfsd\NCFSD|die\Ncfsg\NCFSG|das\Ncmpg\NCMPG]
1.2.1.1.2.2.3.1.3.1\26 TOK Kommission NCFSD
[Kommission\Ncfsa\NCFSA|Kommission\Ncfsd\NCFSD|Kommission\Ncfsg\NCFSG|Kommission\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.3.1\37 TOK tätigen AG
[??\??\??]
1.2.1.1.2.2.3.1.3.1\45 TOK Bediensteten NCNPG
[Bedienstete\Ncfpg\NCFPG|Bedienstete\Ncmpg\NCMPG|Bedienstete\Ncmsn\NCMSN|Bedienstete\Ncnpg\NCNPG]
1.2.1.1.2.2.3.1.3.1\58 TOK auf SPS
[auf\Qv\QV|auf\Rg.\RG|auf\Sas\SA|auf\Sps\SPS]
1.2.1.1.2.2.3.1.3.1\62 TOK Zeit NCFSA
[Zeit\Ncfsa\NCFSA|Zeit\Ncfsd\NCFSD|Zeit\Ncfsg\NCFSG|Zeit\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.3.1\66 PUNCT ; AAAA
[;\AAAA\AAAA]
)SENT </S>
)PAR </P>
)SENT </S>
)PAR </P>
]CHUNK </DIV_Q>
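The two-line layout shown above can be read back programmatically. The following Python sketch is only an illustration of the format as displayed in this document, not a tool from the MULTEXT distribution. It assumes that tokens contain no whitespace, that an analysis never contains a pipe character, and that structural lines (CHUNK, PAR, SENT) can simply be skipped. The function name `parse_tagger_output` and the dictionary keys are our own.

```python
import re

# First line of an entry: reference, segmenter class, textual element, tag.
TOKEN_LINE = re.compile(r"^(\S+\\\d+)\s+(\S+)\s+(\S+)\s+(\S+)$")

def parse_tagger_output(text):
    """Return one dict per token, with its chosen tag and its ambiguity class."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        m = TOKEN_LINE.match(line)
        if m:
            ref, seg_class, form, tag = m.groups()
            entries.append({"ref": ref, "class": seg_class, "form": form,
                            "tag": tag, "analyses": []})
        elif entries and not entries[-1]["analyses"] and "\\" in line:
            # Second line: lemma\lexical-description\TAG alternatives, pipe-separated.
            for alt in line.strip("[]").split("|"):
                parts = alt.split("\\")
                if len(parts) == 3:
                    entries[-1]["analyses"].append(tuple(parts))
        # Any other line (CHUNK, PAR, SENT markers) is structural and ignored.
    return entries

sample = r"""
(SENT <S>
1.2.1.1.2.2.3.1.1.1\1 TOK Betrifft VMIP3S
[betreffen\Vaip3s\VAIP3S|betreffen\Vmip3s\VMIP3S|betreffen\Voip3s\VOIP3S]
1.2.1.1.2.2.3.1.1.1\9 PUNCT : YAAA
[:\YAAA\YAAA]
)SENT </S>
"""
entries = parse_tagger_output(sample)
for e in entries:
    print(e["form"], e["tag"], len(e["analyses"]))
```

Keeping the full list of analyses next to the chosen tag makes it easy to count ambiguous tokens or to check that the selected tag belongs to the ambiguity class.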
This part addresses the work done to adapt the Multext Tools for Part-of-Speech Tagging (developed by ISSCO, Armstrong et al., 1995) to Spanish, in order to tag the Spanish corpus from the Multext/MLCC corpus. It gives a brief introduction to the model implemented by the HMM Multext tagger and describes the experiments with different parametrizations carried out before the final version of the tool. Initial decisions, such as the tagset, the lexicon and the training procedure, are also discussed. Finally, results are presented and assessed with respect to their benefits in comparison with other documented attempts.
In its Technical Annex, Multext defined the role of a PoS disambiguator as a tool based on a Markov model that selects the most plausible analysis on the basis of the local context, using statistical generalizations. It is acknowledged in the literature that the best way to obtain these statistical generalizations is to derive them automatically from a previously tagged training corpus. Multext was nevertheless forced to use a more complex method, as it was recognised that tagged corpora were not available for most of the Multext languages. This was the case for Spanish. The Technical Annex stated the goal of this task as building a tool based on this more complex method, which had yielded good results for English, and experimenting with languages of different morphological characteristics. The quality of the results would determine whether training on ambiguous corpus alone could be considered sufficient, or whether a corpus had to be hand-corrected and used as a bootstrapping training corpus. In the latter case, larger annotated corpora and better disambiguators would be created cyclically through tagging, manual correction and retraining.
From this starting point, the goal of the experiment can be summed up as checking the performance of, and the differences between, the two kinds of training. The conclusion after testing is that training with a disambiguated corpus is the best strategy. However, when no hand-tagged corpus is available, the tools proposed by MULTEXT and the suggested methodology are a good way to obtain a tagged corpus. The experiments show that good results can be achieved, comparable to those of other tools for Spanish.
The Multext part-of-speech tagger is a program that takes as input a sequence of words, each annotated with one or more tags, and returns the most likely tag for each word in the text. It is based on a Hidden Markov Model, and the process is performed in two steps: a training phase, which estimates the parameters of the model, and a tagging phase, which selects the most probable tags according to this model using the Viterbi algorithm.
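The tagging phase can be illustrated with a small Viterbi decoder. This is a generic textbook sketch in Python, not the Multext implementation. The tag names, the probabilities and the emission function `in_class` (1 if the tag belongs to the observed ambiguity class, 0 otherwise) are assumptions made for the example.

```python
def viterbi(observed, tags, pi, A, B):
    """Return the most probable tag sequence for a list of ambiguity classes.

    pi[t]: probability that a sentence starts with tag t
    A[p][t]: probability of tag t following tag p
    B(t, c): probability that tag t is displayed as ambiguity class c
    """
    # Each column maps a tag to (best score so far, best previous tag).
    columns = [{t: (pi[t] * B(t, observed[0]), None) for t in tags}]
    for c in observed[1:]:
        col = {}
        for t in tags:
            prev, score = max(((p, columns[-1][p][0] * A[p][t]) for p in tags),
                              key=lambda pair: pair[1])
            col[t] = (score * B(t, c), prev)
        columns.append(col)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for col in reversed(columns[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

def in_class(tag, cls):
    # Assumed emission model: a tag can only be observed through classes containing it.
    return 1.0 if tag in cls else 0.0

# Hypothetical three-tag model, for illustration only.
tags = ["DET", "NOUN", "VERB"]
pi = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
A = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
observed = [{"DET"}, {"NOUN", "VERB"}, {"NOUN", "VERB"}]
best_path = viterbi(observed, tags, pi, A, in_class)
print(best_path)  # ['DET', 'NOUN', 'VERB']
```

The sketch multiplies raw probabilities, which is adequate for short sentences; a practical decoder would work with log probabilities to avoid underflow.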
The Multext PoS disambiguator is based on a Hidden Markov Model. This model is extensively documented in the literature (Rabiner, 1989; Cutting et al., 1992). We summarize here the aspects that bear on applying this methodology to adapt the tool to a given language, in this case Spanish.
Formally, an HMM is a 5-tuple <S, C, A, B, Pi>, where S is a set of states, C a set of observation symbols, A the state transition matrix, B the observation (emission) matrix, and Pi the initial state distribution.
For the PoS case, the set of states S consists of the tags (labels) denoting the grammatical category assigned to words. The observation symbols C are the ambiguity classes: the combinations of plausible tags that a given word can be assigned lexically (according to the tagset). The A matrix is the language model, a probabilistic model of tag sequences. The B matrix is the communication model (which, in the Multext implementation, is a simplification of the standard HMM) and gives, for a given tag, the probability of generating each of the ambiguity classes. The Pi vector records the probability of a given tag occurring at the beginning of a sentence.
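To make the notion of ambiguity class concrete, the sketch below maps words to the set of tags a lexicon allows for them. The mini-lexicon is hypothetical: its entries are simplified and its tags are borrowed from this document purely for illustration. The `UNKNOWN` class for out-of-lexicon words is likewise an assumption, not the behaviour of the Multext lookup.

```python
def ambiguity_classes(words, lexicon, unknown_tag="UNKNOWN"):
    """Map each word to its ambiguity class: the frozenset of tags the lexicon allows."""
    return [frozenset(lexicon.get(w.lower(), {unknown_tag})) for w in words]

# Hypothetical, simplified lexicon entries (not taken from the project lexicon).
lexicon = {
    "una": {"TIFS"},
    "casa": {"NCFS", "VMIP3S"},
    "para": {"SP", "VMIP3S"},
}

observed = ambiguity_classes(["Una", "casa", "para", "Ana"], lexicon)
ambiguous = sum(1 for c in observed if len(c) > 1)
print(observed)
print("ambiguous tokens:", ambiguous)
```

The tagger never sees the words themselves, only this sequence of classes, which is why two words with the same set of possible tags are indistinguishable to the model.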
The Multext training module takes as input sequences of ambiguity classes (that is, it can use ambiguous text) and uses the Baum-Welch algorithm for parameter re-estimation to produce a trained Hidden Markov Model. The tool allows several iterations to be performed in order to optimize the model. To readjust the model during the training phase, the tool can take into account user-defined "transition biases", which specify values that increase or decrease the probability of a transition between two tags.
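One Baum-Welch iteration over ambiguous text can be sketched as follows. This is a generic forward-backward re-estimation of the A matrix in Python, written under stated assumptions and not a reproduction of the Multext code. The B matrix is held fixed and uniform (each tag spreads its probability evenly over the ambiguity classes that contain it), no transition biases are applied, and no scaling is done, so the sketch is only suitable for short sentences. The names `emission` and `baum_welch_step` and the three-tag toy data are hypothetical.

```python
def emission(tag, cls, classes):
    """Assumed uniform B: a tag is equally likely to show up as any class containing it."""
    if tag not in cls:
        return 0.0
    return 1.0 / sum(1 for c in classes if tag in c)

def baum_welch_step(tags, pi, A, sentences):
    """Re-estimate A from sentences given as lists of ambiguity classes (frozensets)."""
    classes = {c for sent in sentences for c in sent}
    counts = {r: {s: 0.0 for s in tags} for r in tags}
    for sent in sentences:
        T = len(sent)
        b = [{s: emission(s, c, classes) for s in tags} for c in sent]
        # Forward pass.
        alpha = [{s: pi[s] * b[0][s] for s in tags}]
        for t in range(1, T):
            alpha.append({s: b[t][s] * sum(alpha[t - 1][r] * A[r][s] for r in tags)
                          for s in tags})
        # Backward pass.
        beta = [dict.fromkeys(tags, 1.0) for _ in range(T)]
        for t in range(T - 2, -1, -1):
            beta[t] = {r: sum(A[r][s] * b[t + 1][s] * beta[t + 1][s] for s in tags)
                       for r in tags}
        prob = sum(alpha[T - 1].values())
        if prob == 0.0:
            continue  # sentence impossible under the current model
        # Expected transition counts.
        for t in range(T - 1):
            for r in tags:
                for s in tags:
                    counts[r][s] += (alpha[t][r] * A[r][s] * b[t + 1][s]
                                     * beta[t + 1][s] / prob)
    new_A = {}
    for r in tags:
        total = sum(counts[r].values())
        # Keep the old row when a tag was never seen as a transition source.
        new_A[r] = {s: counts[r][s] / total for s in tags} if total > 0 else dict(A[r])
    return new_A

# Toy data: D = determiner, N = noun, V = verb (hypothetical tags).
tags = ["D", "N", "V"]
uniform = 1.0 / len(tags)
pi = {t: uniform for t in tags}
A0 = {r: {s: uniform for s in tags} for r in tags}  # equiprobable starting matrix
sentences = [
    [frozenset({"D"}), frozenset({"N", "V"}), frozenset({"V"})],
    [frozenset({"D"}), frozenset({"N"}), frozenset({"N", "V"})],
]
A1 = baum_welch_step(tags, pi, A0, sentences)
print(A1["D"])
```

Starting from the equiprobable matrix, the single unambiguous D-N bigram in the second sentence is enough to raise the D to N transition from 1/3 to 0.75 after one step, which illustrates why unambiguous bigrams carry most of the information when training on ambiguous text.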
The tool also allows an initial model to be provided. This model can be estimated from some amount of hand-tagged corpus using relative frequency analysis tools. It should be noted, though, that the Multext tool only provides for the A matrix of the model to be estimated; it offers the user no means of estimating or readjusting the B matrix. This means that no preference can be expressed for a tag to be displayed as one particular ambiguity class rather than another.
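Estimating an initial model by relative frequency amounts to counting tag bigrams in the hand-tagged text. A minimal Python sketch follows. The function name `estimate_initial_model`, the toy corpus and its tags are hypothetical, and no smoothing is applied, so transitions that are never observed simply receive no entry.

```python
from collections import Counter, defaultdict

def estimate_initial_model(tagged_sentences):
    """Estimate Pi and A by relative frequency from sentences of (word, tag) pairs."""
    start_counts = Counter()
    pair_counts = defaultdict(Counter)
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        if not tags:
            continue
        start_counts[tags[0]] += 1
        for prev, nxt in zip(tags, tags[1:]):
            pair_counts[prev][nxt] += 1
    n_sentences = sum(start_counts.values())
    pi = {t: c / n_sentences for t, c in start_counts.items()}
    A = {prev: {t: c / sum(row.values()) for t, c in row.items()}
         for prev, row in pair_counts.items()}
    return pi, A

# Hypothetical hand-tagged sentences (tags chosen for illustration only).
corpus = [
    [("una", "TIFS"), ("casa", "NCFS")],
    [("una", "TIFS"), ("gran", "AFS"), ("casa", "NCFS")],
]
pi_hat, A_hat = estimate_initial_model(corpus)
print(pi_hat)
print(A_hat)
```

Only Pi and A are estimated here, in line with the limitation noted above that the B matrix cannot be supplied by the user.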
All these user defined estimations are part of the tuning process to improve the model, as we will explain later.
The methodology followed during experimentation was mainly driven by the general approach of the project. As already foreseen, we had no hand-tagged corpus available to start training. Thus, several rounds of parameter estimation and hand validation of the results, which were then used again for training, took place. During this process, all the means offered by the tool to readjust and improve the model were used (except the precision flags, which we could not run properly). The tagset was also taken into account, in order to test whether the number and quality of the labels used had an impact on the results.
The project also included the development of a lexicon containing morphosyntactic features based on the EAGLES (Expert Advisory Group on Language Engineering Standards) model, and a corresponding tagset to be used by the PoS tagger. The lexicon is a full-form word list covering about 15,000 lemmas. The tagset for each language had to be defined in the course of the project, with decisions based on step-wise refinement.
The initial Spanish tagset was a simplification of the lexical information, based on EAGLES standards, used to describe lexical items. Major morphosyntactic categories were taken into account, as well as detailed information carried by lexical items (e.g. tense distinctions, type of determiners or pronouns, etc.). This first version was later reviewed in the light of the idea that the tagset was to be used only for disambiguation purposes. Since lexical information can be recovered after disambiguation, it was advisable to reduce the number of tags. Thus, a first set of 259 tags was used, which was reduced to 107 tags in a second round of experimentation.
Reducing the number of tags is also part of tuning the language model, since the Baum-Welch algorithm calculates the language model matrix (the A matrix) on the basis of the non-ambiguous bigrams found in the text. When training with ambiguous text, and given the large number of ambiguities Spanish words take part in, there is very little information and very few non-ambiguous tags can be taken into account. Reducing the ambiguity classes, so that occurrences of formerly ambiguous words are added to those of non-ambiguous ones, can influence the calculation of the matrix.
On the other hand, reducing the tags also influences the B matrix, as the tool relies on the hypothesis that, in the communication model, a given tag displays each of its ambiguity classes with the same probability. That is, if our understanding is correct, when a given tag can generate different ambiguity classes, the maximum likelihood is taken for granted as the point of departure. As the final calculation takes into account the product of the probabilities given in the A and B matrices, it was sometimes difficult to inspect the results, since they do not correspond to what we would call a "language model".
LEXICON
Inspecting the dictionary we can find the following ambiguity cases which will be considered when evaluating the results.
Ambiguities:
Special items:
Some of these ambiguities are also reported by others. The literature shows that the most problematic are: article-pronoun, the word que, noun-verb, and adjective-participle (cf. Sanchez Leon and Nieto, 1995; Màrquez and Rodriguez, 1995; Moreno-Torres, 1994). Besides, given that the Multext tools allow no preferences on lexical probability (that is, no symbol biasing), these special items, which are ambiguous but occur far more frequently in one of their categories than in the others, were considered a crucial point in the experimentation. For instance, the word para is far more frequent as a preposition than as a verb. To illustrate the results on these special items, we present the following figures:
Test corpus: 040
Number of words: 31327
ya: 23 occurrences; 6 errors
para: 207 occurrences; 0 errors
uno: 10 occurrences; 3 errors
TRAINING AND BIASING
Due to the general approach of the project, a-priori biasing was a must in the first experiments, when only ambiguous corpus was available. Some of the initial decisions were based on pure linguistic intuitions, such as co-occurrence restrictions (a preposition never appears before a tensed verb) or grammatical information (such as agreement between determiners, adjectives and nouns). This information turned out to be useful on a statistical basis as well. The first extract below is one of the biases used. The second is an extract of the A matrix for the same cases from the experiment with ambiguous corpus plus biasing. The third is the A matrix from the experiment with non-ambiguous corpus. Note that the same kinds of assumptions are made: transitions to nouns or adjectives in agreement with the article are reinforced (this fact led us to keep different tags stating gender and number in the reduced tagset).
TIFS NCFS +8
NCF +8 NCS +8 NPFS +8 NPS +8 A +5 AFS +5 AS +5 !OTHER -8
%% ambiguous text + biasing
%% more probable transitions are marked
State: TIFS
A 8.4e-02* AFP 3.4e-04 AFS 8.4e-02 * AMP 3.4e-04 AMS 3.4e-04 AP 3.4e-04 AS 8.4e-02 * CC 3.4e-04 CS 3.4e-04 DDFP 3.4e-04 DDFS 3.4e-04 ../.. M 3.4e-04 MFP 3.4e-04 MFS 3.4e-04 MMP 3.4e-04 MMS 3.4e-04 MP 3.4e-04 NCF 1.3e-01 * NCFP 3.4e-04 NCFS 1.3e-01 * NCM 3.4e-04 NCMP 3.4e-04 NCMS 3.4e-04 NCP 3.4e-04 NCS 1.3e-01 * NPFP 3.4e-04 NPFS 1.3e-01 * NPM 3.4e-04 ../..
%% Unambiguous corpus
%% more probable transitions are marked
State: TIFS
AFS 8.4e-02 * AS 3.7e-02 * CC 4.9e-03 DIFS 7.4e-03 MFS 9.8e-03 NCF 7.4e-03 NCFP 2.5e-03 NCFS 8.0e-01 * NCMS 2.5e-03 NCS 4.9e-03 NPFS 4.9e-03 SP 2.5e-02 VMIP3S 4.9e-03 WPUNCT( 2.5e-03
Biasing turned out to be a complex exercise, mainly for two reasons. First, even with our reduced tagset, the large number of ambiguity classes and transitions made it difficult to inspect the data. Second, it took us a while to understand how the B matrix influenced transition biasing.
For testing purposes we used the hand validated corpus to train a new matrix. It turned out rather soon that results were better than those obtained via subsequent rounds of supervised training.
Experimentation was done with the following corpus:
The following table summarizes the statistics of the corpora used:
| Set | Corpus | Tokens | Sents | Tags | Unknowns | Ambi. tokens | Tags/Words |
|---|---|---|---|---|---|---|---|
| Train | 006+misc | 41900 | 1576 | 53450 | 72 (1.13%) | 10215 (24.37%) | 1.28 |
| Test | 016 | 18019 | 621 | 23110 | 216 (1.20%) | 4530 (25.13%) | 1.28 |
| Test | 081 | 14840 | 420 | 19157 | 98 (0.66%) | 3789 (25.36%) | 1.28 |
| Test | misc2 | 1483 | 44 | 1889 | 77 (5.19%) | 362 (24.42%) | 1.27 |
Experimentation was based on the error rate obtained in the different setups. The general idea was to try different initial models, using biasing to complement them in a step-wise manner. When a round gave no better results than the previous one, no further trials were pursued in that scenario. It should also be noted that experimentation was done on a reduced corpus, mainly because of the methodology followed in the project. As already said, Spanish was one of the cases where no tagged corpus was available at the beginning, and an incremental methodology was used to increase the size of the resources. Nevertheless, as the tables show, the last round of tagging, which was done with a model derived from the larger hand-validated corpus, does not show a remarkable improvement.
In section 5 we present the tables of results of the experimentation. They reflect the different rounds with different setups. Each was tested with the 107-tag and the 259-tag tagsets. The biases devised were applied separately, so that their impact could be compared. The labels of the "Train" setups correspond to the following scenarios.
In order to have a reference starting point, an equiprobable matrix was used as the initial model.
No initial matrices were supplied, but Baum-Welch parameter re-estimation was used. It is interesting to see how iterations negatively affect the behaviour of the tagger. Up to three iterations were done in the first experiment. As the results did not show any impact, only one iteration was done in later experiments, so as to give comparable results. As the tables show, even when complemented with biases, results can be even worse than with the equiprobable matrix. Hence it does not seem advisable to use this parameter estimation alone when working with ambiguous text.
It is also noticeable that a smaller tagset seems to work better in these conditions. This is mainly due to the increase in unambiguous bigrams when the tagset is less fine-grained. However, this difference shrinks when biasing is used.
This scenario was the setup envisaged for the project. In the absence of unambiguous corpus, the text after lexical lookup is used to create an initial A matrix. Baum-Welch is then used, over several iterations, to re-estimate the parameters. Initial biasing of the A matrix is also recommended in order to tune the model.
As in the previous setup, the larger tagset creates some difficulties. However, these problems are minimized when biases are used to tune the matrix; comparable results are then obtained in both cases. As for parameter re-estimation, the larger tagset seems to improve over the iterations, while the smaller one just loses predictive power. In any case, the best results, in comparable absolute figures, are achieved when transition adjustments are used.
In this scenario, an initial model was created using all the available corpus: the ambiguous texts used in the previous experiments, and the same texts after correction by hand. The point was to see the impact of unambiguous transitions.
As the table shows, the results are rather clear. The success rate increases considerably, and the impact of the tagset size and of the transition biases is reduced.
After the results of the previous scenario, a new one was prepared using non-ambiguous corpus and adding the facility for frequency estimation. This turned out to be the best strategy, as already documented in the literature on PoS tagging with HMMs. Again, neither the size of the tagset nor the use of a-priori biasing had any real impact. Iterations also proved counterproductive.
Testing corpus: 081
Training corpus words: 41,900
Testing scenario: initial model: ambiguous corpus + biasing
Error rate: 2.93%
Testing scenario: initial model: disambiguated corpus
Error rate: 1.13%
Testing corpus: 081
Training corpus words: 41,900
Testing scenario: initial model: ambiguous corpus + biasing + 259 tags
Error rate: 2.67%
Testing scenario: initial model: ambiguous corpus + biasing + 107 tags
Error rate: 2.93%
Testing scenario: initial model: disambiguated corpus + 259 tags
Error rate: 1.69%
Testing scenario: initial model: disambiguated corpus + 107 tags
Error rate: 1.54%
The second most common error has to do with disambiguating between noun and adjective. This can be considered a special feature of Spanish, where almost all adjectives can be nominalised with an article and, furthermore, many words have two different readings (adjectival and nominal). This is the case for público ('public': audience and status), útil ('tool' vs. 'useful'), etc. In the tested file (051), this error accounted for about 17.85% of the errors detected.
Comparison with other PoS taggers for Spanish is not possible, because concrete figures are given in only one of the three papers we have taken into account. Màrquez and Rodriguez (1995) report an 89.67% success rate in que disambiguation. It has to be borne in mind that they worked with a decision tree formalism that takes a broader context into consideration. Sanchez Leon and Nieto (1995) report a 96.8% overall success rate, but they avoided disambiguating the different possibilities for the word que by creating a special tag for it. Moreno-Torres (1994) reports a failure in trying to disambiguate this case, but no concrete figures are supplied.
Another rather frequent error involves the item ya, which can be either a conjunction or an adverb. Both occur in texts in similar contexts, so disambiguation could only be achieved if symbol biasing were possible, because its occurrence as an adverb is much more frequent than as a conjunction.
In accordance with contractual obligations, all the files delivered (200K words) have been manually corrected and validated.
Armstrong, S., Bouillon, P., and Robert, G. (1992). Tools for Part-of-Speech Tagging. Draft Version - Work in Progress, ISSCO, Geneva. MULTEXT Project.
Church, K.W. and Mercer, R.L. (1993). Introduction to the Special Issue on Computational Linguistics Using Large Corpora. Computational Linguistics, 19(1):1-24.
Cutting, D., Kupiec, J., Pedersen, J. and Sibun, P. (1992). A Practical Part-of-Speech Tagger. In Third Conference on Applied Natural Language Processing.
Elworthy, D. (1994). Does Baum-Welch Re-estimation Help Taggers? In Proceedings of the 4th Conference on Applied Natural Language Processing (ACL 1994), pages 53-58, Stuttgart.
Gale, W. and Church, K. (1994). What is wrong with adding one? In Corpus-Based Research into Language. N. Oostdijk and P. de Haan, Rodopi, Amsterdam.
Magerman, D. (1995). Everything You Always Wanted to Know about Probability Theory, but Were Afraid to Ask. Corpus-Based Models of Language Processing, R. Bod and R. Scha, ESSLLI'95 Reader.
Màrquez, L. and Rodriguez, H. (1995). Towards Learning a Constraint Grammar from Annotated Corpora Using Decision Trees. Technical report, Universitat Politècnica de Catalunya, Spain.
Moreno-Torres, I. (1994). A Morphological Disambiguation Tool (MSD): Application to Spanish. Technical Report, Depto de Lenguajes y CC, Facultad de Informatica, Universidad de Malaga, Spain. ACQUILEX II, Working Paper 24.
Rabiner, L.R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Proceedings of the IEEE, volume 77(2), pages 257-286.
Sanchez, F. (1994). Spanish tagset for the CRATER project. Technical report. CRATER, Internal Document.
Sanchez Leon, F. and Nieto, A.F. (1995). Desarrollo de un etiquetador morfosintáctico para el español. Procesamiento del Lenguaje Natural, 17.
No documentation available yet.
You are invited to send comments and feedback to multext@lpl.univ-aix.fr.