Multext - Document MUL4. Corpora. Version 0.1. Last modified 20 December 1996.





PoS Tagged Corpus





1. Part of Speech Tagging for English

1.1. The tagset

The C7 tagset consists of 150 tags, 14 of which are punctuation tags, the remaining 136 being part of speech tags.

Tag     Definition
*       punctuation tag - asterisk
!       punctuation tag - exclamation mark
"       punctuation tag - quotation marks
$       germanic genitive marker - (' or 's)
(       punctuation tag - left bracket
)       punctuation tag - right bracket
,       punctuation tag - comma
-       punctuation tag - dash
-----   new sentence marker
.       punctuation tag - full-stop
...     punctuation tag - ellipsis
:       punctuation tag - colon
;       punctuation tag - semi-colon
?       punctuation tag - question-mark
APPGE   possessive pronoun, pre-nominal (my, your, our etc.)
AT      article (the, no)
AT1     singular article (a, an, every)
BCL     before-clause marker (e.g. in order (that))
CC      coordinating conjunction (and, or)
CCB     coordinating conjunction (but)
CS      subordinating conjunction (if, because, unless)
CSA     'as' as a conjunction
CSN     'than' as a conjunction
CST     'that' as a conjunction
CSW     'whether' as a conjunction
DA      after-determiner (capable of pronominal function) (such, former, same)
DA1     singular after-determiner (little, much)
DA2     plural after-determiner (few, several, many)
DAR     comparative after-determiner (more, less)
DAT     superlative after-determiner (most, least)
DB      before-determiner (capable of pronominal function) (all, half)
DB2     plural before-determiner (capable of pronominal function) (eg. both)
DD      determiner (capable of pronominal function) (any, some)
DD1     singular determiner (this, that, another)
DD2     plural determiner (these, those)
DDQ     wh-determiner (which, what)
DDQGE   wh-determiner, genitive (whose)
DDQV    wh-ever determiner (whichever, whatever)
EX      existential 'there'
FO      formula
FU      unclassified word
FW      foreign word
GE      germanic genitive marker - (' or 's)
IF      'for' as a preposition
II      preposition
IO      'of' as a preposition
IW      'with'; 'without' as preposition
JJ      general adjective
JJR     general comparative adjective (older, better, bigger)
JJT     general superlative adjective (oldest, best, biggest)
JK      adjective catenative ('able' in 'be able to'; 'willing' in 'be willing to')
LE      leading co-ordinator ('both' in 'both...and...'; 'either' in 'either... or...')
MC      cardinal number neutral for number (two, three...)
MCGE    genitive cardinal number, neutral for number (10's)
MC-MC   hyphenated number (40-50, 1770-1827)
MC1     singular cardinal number (one)
MC2     plural cardinal number (tens, twenties)
MD      ordinal number (first, 2nd, next, last)
MF      fraction, neutral for number (quarters, two-thirds)
NC2     plural cited word ('ifs', in 'two ifs and a but')
ND1     singular noun of direction (north, southeast)
NN      common noun, neutral for number (sheep, cod)
NN1     singular common noun (book, girl)
NN2     plural common noun (books, girls)
NNL1    singular locative noun (street, Bay)
NNL2    plural locative noun (islands, roads)
NNO     numeral noun, neutral for number (dozen, thousand)
NNO2    plural numeral noun (hundreds, thousands)
NNA     following noun of style or title, abbreviatory (M.A.)
NNB     preceding sing. noun of style or title, abbr. (Prof.)
NNT1    singular temporal noun (day, week, year)
NNT2    plural temporal noun (days, weeks, years)
NNU     unit of measurement, neutral for number (in., cc.)
NNU1    singular unit of measurement (inch, centimetre)
NNU2    plural unit of measurement (inches, centimetres)
NP      proper noun, neutral for number (Indies, Andes)
NP1     singular proper noun (London, Jane, Frederick)
NP2     plural proper noun (Browns, Reagans, Koreas)
NPD1    singular weekday noun (Sunday)
NPD2    plural weekday noun (Sundays)
NPM1    singular month noun (October)
NPM2    plural month noun (Octobers)
PN      indefinite pronoun, neutral for number ("none")
PN1     singular indefinite pronoun (one, everything, nobody)
PNQO    whom
PNQS    who
PNQV    whoever
PNX1    reflexive indefinite pronoun (oneself)
PPGE    nominal possessive personal pronoun (mine, yours)
PPH1    it
PPHO1   him, her
PPHO2   them
PPHS1   he, she
PPHS2   they
PPIO1   me
PPIO2   us
PPIS1   I
PPIS2   we
PPX1    singular reflexive personal pronoun (yourself, itself)
PPX2    plural reflexive personal pronoun (yourselves, ourselves)
PPY     you
RA      adverb, after nominal head (else, galore)
REX     adverb introducing appositional constructions (namely, viz, eg.)
RG      degree adverb (very, so, too)
RGQ     wh- degree adverb (how)
RGQV    wh-ever degree adverb (however)
RGR     comparative degree adverb (more, less)
RGT     superlative degree adverb (most, least)
RL      locative adverb (alongside, forward)
RP      prep. adverb; particle (in, up, about)
RPK     prep. adv., catenative ('about' in 'be about to')
RR      general adverb
RRQ     wh- general adverb (where, when, why, how)
RRQV    wh-ever general adverb (wherever, whenever)
RRR     comparative general adverb (better, longer)
RRT     superlative general adverb (best, longest)
RT      nominal adverb of time (now, tomorrow)
TO      infinitive marker (to)
UH      interjection (oh, yes, um)
VB0     be
VBDR    were
VBDZ    was
VBG     being
VBM     am
VBN     been
VBR     are
VBZ     is
VD0     do
VDD     did
VDG     doing
VDN     done
VDZ     does
VH0     have
VHD     had (past tense)
VHG     having
VHN     had (past participle)
VHZ     has
VM      modal auxiliary (can, will, would etc.)
VMK     modal catenative (ought, used)
VV0     base form of lexical verb (give, work etc.)
VVD     past tense form of lexical verb (gave, worked etc.)
VVG     -ing form of lexical verb (giving, working etc.)
VVN     past participle form of lexical verb (given, worked etc.)
VVZ     -s form of lexical verb (gives, works etc.)
VVGK    -ing form in a catenative verb ('going' in 'be going to')
VVNK    past part. in a catenative verb ('bound' in 'be bound to')
VVI     infinitive (e.g. to give)
XX      not, n't
ZZ1     singular letter of the alphabet: 'A', 'a', 'B', etc.
ZZ2     plural letter of the alphabet: 'As', b's, etc.

1.2. Ditto tags

Any of the tags listed above may in theory be modified by the addition of a pair of digits: eg. DD21, DD22. This signifies that the tag occurs as part of a sequence of similar tags, representing a sequence of words which for grammatical purposes are treated as a single unit. For example, the expression "in terms of" is treated as a single preposition, receiving the tags:

in<CTAG>II31</CTAG> terms<CTAG>II32</CTAG> of<CTAG>II33</CTAG>

The first of the two digits indicates the number of words/tags in the sequence, and the second digit the position of each word within that sequence.
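
The digit convention just described can be decoded mechanically. The following is only a sketch, not part of CLAWS: the names DittoTag and parse_ditto are hypothetical, and it assumes that the sequence length is a single digit from 2 to 9 and that no plain tag in the tagset ends in two digits.

```python
import re
from typing import NamedTuple, Optional


class DittoTag(NamedTuple):
    base: str      # underlying tag, e.g. "II"
    length: int    # number of words in the sequence
    position: int  # 1-based position of this word in the sequence


# Assumption: base tag, then one digit for the length, then one for the position.
_DITTO = re.compile(r"^(?P<base>.+?)(?P<length>[2-9])(?P<position>[1-9])$")


def parse_ditto(tag: str) -> Optional[DittoTag]:
    """Split a ditto tag such as "II31"; return None for an ordinary tag."""
    m = _DITTO.match(tag)
    if m is None:
        return None
    length, position = int(m["length"]), int(m["position"])
    if position > length:
        # A word cannot be the 4th member of a 3-word sequence.
        return None
    return DittoTag(m["base"], length, position)


if __name__ == "__main__":
    for tag in ("II31", "II32", "II33", "NN1", "NN121"):
        print(tag, parse_ditto(tag))
```

Ordinary tags that end in a single digit, such as NN1, are left alone because the pattern requires two trailing digits.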

Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG which looks for a range of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is taken into account, and also that, depending on the context, ditto-tags may or may not be required for a particular word sequence:

at<CTAG>RR21</CTAG> length<CTAG>RR22</CTAG>
a<CTAG>DD21/RR21</CTAG> lot<CTAG>DD22/RR22</CTAG>
in<CTAG>CS21/II</CTAG> that<CTAG>CS22/DD1</CTAG>
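
To illustrate the general idea of assigning ditto tags from a list of multi-word sequences, here is a minimal sketch. It is not the IDIOMTAG program: the IDIOMS dictionary and the function apply_ditto_tags are hypothetical, only unambiguous entries are handled, and the context-dependent alternatives shown above (such as DD21/RR21, or CS21/II for "in that") are deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical idiom list: word sequence -> base tag shared by the whole unit.
IDIOMS: Dict[Tuple[str, ...], str] = {
    ("in", "terms", "of"): "II",
    ("at", "length"): "RR",
}


def apply_ditto_tags(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Overwrite the tags of any listed word sequence with ditto tags."""
    out = list(tagged)
    i = 0
    while i < len(out):
        matched = False
        for words, base in IDIOMS.items():
            n = len(words)
            window = tuple(w.lower() for w, _ in out[i:i + n])
            if window == words:
                for k in range(n):
                    # First digit: sequence length; second digit: position.
                    out[i + k] = (out[i + k][0], f"{base}{n}{k + 1}")
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return out


if __name__ == "__main__":
    sentence = [("in", "II"), ("terms", "NN2"), ("of", "IO"), ("cost", "NN1")]
    print(apply_ditto_tags(sentence))
```

A fuller version would need the surrounding context to decide whether a sequence such as "in that" should receive ditto tags at all.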

1.3. Discrimination by tag

CLAWS, the part-of-speech tagger, had an accuracy of 96-97% when last tested, depending on the type of text. In any large sample of text it is extremely difficult to achieve 100% accuracy, even after humans have post-edited the text: the guidelines can deal with most cases of ambiguity, but there can never be enough of them to cover every case, and there will always be new cases.

In ambiguous cases (where I was unsure whether to assign NN1 or NP1, for example) I tried to be consistent: once I had decided that a word should be given a particular tag, I endeavoured to stick with that tag throughout the corpus.

Each word is given one part-of-speech tag; numbers are usually given the tag MC. In English, large numbers tend to be written as 1000000 or 1,000,000. In French, large numbers tend to be written as 100 000, and CLAWS would normally tag this as two words. Wherever possible I have joined examples like this to make a single number. This has not been done in tables of numbers (which CLAWS represents as long lists of numbers), as it is not easy to be certain which numbers belong together and which should be left alone. However, this occurs rarely in the text.
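
The joining of space-separated numbers was done by hand, but the rule can be sketched in code. The function below is an illustration only, with a hypothetical name, and it assumes that a number written in the French style is a group of one to three digits followed by groups of exactly three digits, each tagged MC. It makes no attempt to detect tables of numbers, which is exactly the case where joining is unsafe.

```python
import re
from typing import List, Tuple

_HEAD = re.compile(r"^\d{1,3}$")  # leading group: one to three digits
_GROUP = re.compile(r"^\d{3}$")   # following groups: exactly three digits


def join_spaced_numbers(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge runs such as ("100", MC) ("000", MC) into ("100000", MC)."""
    out: List[Tuple[str, str]] = []
    joinable = False  # True while the last output token may absorb another group
    for word, tag in tokens:
        if joinable and tag == "MC" and _GROUP.match(word):
            out[-1] = (out[-1][0] + word, "MC")
        else:
            out.append((word, tag))
            joinable = tag == "MC" and _HEAD.match(word) is not None
    return out


if __name__ == "__main__":
    print(join_spaced_numbers([("100", "MC"), ("000", "MC"), ("francs", "NN2")]))
```

A four-digit token such as a year is never treated as the start of a spaced number, so "1986" followed by "000" is left as two tokens.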

APOSTROPHES

Words with apostrophes are split, receiving two tags. These include possessives such as

<TOK><ORTH>the</ORTH><CTAG>AT</CTAG></TOK>
<TOK><ORTH>dog</ORTH><CTAG>NN1</CTAG></TOK>
<TOK><ORTH>'s </ORTH><CTAG>GE</CTAG></TOK>
<TOK><ORTH>ball</ORTH><CTAG>NN1</CTAG></TOK>

and truncated words:


<TOK><ORTH>I</ORTH><CTAG>PPIS1</CTAG></TOK>
<TOK><ORTH>can</ORTH><CTAG>VM</CTAG></TOK>
<TOK><ORTH>'t</ORTH><CTAG>XX</CTAG></TOK>
<TOK><ORTH>go</ORTH><CTAG>VVI</CTAG></TOK>
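
The token lines above are regular enough to be read with a short script. This is a sketch under the assumption that every token sits in a well-formed TOK element with one ORTH and one CTAG child, as in the two examples; the function names read_tokens and rejoin are hypothetical.

```python
import re
from typing import List, Tuple

_TOK = re.compile(r"<TOK><ORTH>(.*?)</ORTH><CTAG>(.*?)</CTAG></TOK>")


def read_tokens(text: str) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs, trimming stray spaces inside the elements."""
    return [(orth.strip(), ctag.strip()) for orth, ctag in _TOK.findall(text)]


def rejoin(tokens: List[Tuple[str, str]]) -> str:
    """Reattach tokens that begin with an apostrophe to the preceding word."""
    words: List[str] = []
    for orth, _ in tokens:
        if words and orth.startswith("'"):
            words[-1] += orth
        else:
            words.append(orth)
    return " ".join(words)


SAMPLE = """<TOK><ORTH>I</ORTH><CTAG>PPIS1</CTAG></TOK>
<TOK><ORTH>can</ORTH><CTAG>VM</CTAG></TOK>
<TOK><ORTH>'t</ORTH><CTAG>XX</CTAG></TOK>
<TOK><ORTH>go</ORTH><CTAG>VVI</CTAG></TOK>"""

if __name__ == "__main__":
    tokens = read_tokens(SAMPLE)
    print(tokens)
    print(rejoin(tokens))
```

The rejoin step would also attach a free-standing opening apostrophe to the word before it, so it is only suitable for text where such tokens do not occur.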

FOREIGN WORDS.

If a sentence is in French I will usually give all the words in it the tag "FW" (foreign word) unless any of those words are recognisable proper nouns - in which case the proper noun tag "NP1" takes precedence. For names of foreign newspapers/journals/magazines, for the most part I did not recognise these as NP1 so they were given the "FW" tag.

FORMULAIC WORDS.

The "FO" tag is used for letter-number combinations and chemical formulae. EEC directive numbers are tagged "FO."

THE "FU" TAG.

I have used this for words that I am unsure about, or words that I think might be spelling errors where I am unable to decide what the correct word should be. (See section 1.5.)

JJT

This has been used for combination words such as "least-favoured" and "largest-ever" as well as for words such as "biggest" and "smallest".

MC-MC

Used for dates such as 1986-7, 1986/7 and 1986—87

DAR-RRR

"More" and "less" can be assigned either of these tags.

The difference between them is that DAR is for noun-phrase-like (and determiner) uses of the word in question, whereas RRR is for adverbial uses. The two can be difficult to distinguish, particularly after a verb, eg:

RRR: You should relax more
DAR: You should spend more

II-RP II-RL

Compare:

(a) "He ran down<CTAG>II</CTAG> the hill"
and
(b) "He ran down<CTAG>RP</CTAG> his friends"

In (a), down is a preposition because:

1. You could insert an adverb before it:
"He ran quickly down the hill" But not: "He ran viciously down his friends"

2. You can move it to the front of a relative clause or question:
"This is the hill down which he ran" "Down which hills do you like running?"

In (b), down is an adverbial particle because:

1. You can place it before or after the noun phrase:
"He ran his friends down<CTAG>RP</CTAG>"

But not:
"*He ran the hill down"

2. If you replace the noun phrase with a pronoun, you HAVE TO place the pronoun in front of the particle:
"He ran them down"

But not:
"*He ran down them"

However, tagging errors may occur with stranded prepositions which are denuded of their noun phrase because it has been fronted or ellipted (eg. in relative clauses, passives, questions etc.):

This is the hill (which) she ran down<CTAG>II</CTAG>
(ie. This is the hill down<CTAG>II</CTAG> which she ran)

On Shrove Tuesday, this hill will be run down<CTAG>II</CTAG> by housewives
(ie. Housewives will run down<CTAG>II</CTAG> it)

Which car did you arrive in<CTAG>II</CTAG>?
(ie. In<CTAG>II</CTAG> which car did you arrive?)

The same tests apply to words which are tagged either as prepositions or as locative adverbs RL eg. across, past, behind etc.

JJ/NN1

Words ending in -ing, when they premodify a noun, may be tagged either NN1 or JJ, eg:

New<CTAG>JJ</CTAG> spending<CTAG>NN1</CTAG> reductions<CTAG>NN2</CTAG>
her<CTAG>APPGE</CTAG> acting<CTAG>NN1</CTAG> ability<CTAG>NN1</CTAG>
a<CTAG>AT1</CTAG> working<CTAG>JJ</CTAG> mother<CTAG>NN1</CTAG>

JJ/VVN

The tagging of words like "surprised" in "John was surprised", or "lasting" in "the effect was lasting" can be a problem. In both cases, the word can be a JJ. One test is to see whether you can insert an adverb like "very" in front of the word. eg. in "John was very surprised", "surprised" is a JJ.

Another test, having the opposite effect, is to see whether there is an agent "by"-phrase following an "ed/en" word. If so, it is a VVN. eg. in "John was surprised by the pirates", "surprised" is a VVN. Even where it is not present, the possibility of adding a "by"-phrase, without changing the meaning of the word, is evidence in favour of a VVN. (However, this criterion can clash with the preceding one, since it occasionally happens that an "ed" word is preceded by an adverb like "very" AND followed by a "by"-phrase: eg. "John was very offended by her remarks". Fortunately, such cases are rare. When they do occur, however, give preference to JJ).
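
The first two tests, and the rule for resolving a clash between them, can be written down as a small decision procedure. This is a sketch for illustration, not a description of how CLAWS works: the function name jj_or_vvn is hypothetical, only the adverb "very" is checked, and any following "by" is naively treated as an agent phrase, which a human annotator would still need to confirm.

```python
from typing import List

# Assumption: "very" stands in for the whole class of degree adverbs.
DEGREE_ADVERBS = {"very"}


def jj_or_vvn(words: List[str], index: int) -> str:
    """Return "JJ", "VVN" or "?" for the -ed/-en word at words[index]."""
    before = words[index - 1].lower() if index > 0 else ""
    after = words[index + 1].lower() if index + 1 < len(words) else ""
    if before in DEGREE_ADVERBS:
        # The "very" test; JJ is also preferred when a "by"-phrase follows.
        return "JJ"
    if after == "by":
        # The agent "by"-phrase test.
        return "VVN"
    # Neither surface test applies; the decision needs human judgement.
    return "?"


if __name__ == "__main__":
    print(jj_or_vvn("John was very surprised".split(), 3))
    print(jj_or_vvn("John was surprised by the pirates".split(), 2))
    print(jj_or_vvn("John was very offended by her remarks".split(), 3))
```

The "?" result covers sentences such as "John was surprised", where only the possibility of adding a "by"-phrase can be considered and no surface cue is present.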

A third test is negative: to see whether the word in question can be placed before a noun. eg:

The effect is lasting: a lasting effect

This shows that "lasting" can be (but need not be) a JJ. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is not a JJ, but a VVG or a VVN.

Even though an "-ing" word is normally a VVG after the verb "be" it is generally treated as a JJ before a noun:

The man was dying<CTAG>VVG</CTAG>

But:

The dying<CTAG>JJ</CTAG> man

When the -ing or -en/ed word forms part of a phrase premodifying the noun, as in the following examples, the VVG/VVN tag is preferred:

interest<CTAG>NN1</CTAG> earning<CTAG>VVG</CTAG> account<CTAG>NN1</CTAG>
a hypothesis<CTAG>NN1</CTAG> driven<CTAG>VVN</CTAG> approach<CTAG>NN1</CTAG>

In these examples, the NN1/VVG (or NN1/VVN) sequence is similar in function to a compound pre-modifying adjective. In hyphenated form they would be given a JJ tag. The same applies when the phrase is a noun-like compound. eg:

a [ carol<CTAG>NN1</CTAG> singing<CTAG>VVG</CTAG> ] contest<CTAG>NN1</CTAG>

If the verb be can be replaced by another verb such as seem or become, without changing the meaning of the following JJ/VVN word, this is a strong indication that the construction is not properly a passive, and that the word is a JJ. eg:

The building was infested<CTAG>JJ</CTAG> with cockroaches
(The building became/seemed infested...)

I could see he was favourably disposed<CTAG>JJ</CTAG> to the idea
(He seemed favourably disposed...)

A further distinction which can be used as a test with 'event' verbs is that the JJ refers to a 'resultant state', whereas the VVN refers to an event. eg:

Bill was married<CTAG>JJ</CTAG> (as opposed to single)

Bill was married<CTAG>VVN</CTAG> to Sarah on May 14th (the actual event)

Some further examples:

Three people were injured<CTAG>VVN</CTAG> in the accident
I could see he was (seemed) injured<CTAG>JJ</CTAG>
He lay injured<CTAG>JJ</CTAG> on the road
We have three injured<CTAG>JJ</CTAG> players in the side
Our players are not worried<CTAG>JJ</CTAG>
She is not worried<CTAG>VVN</CTAG> by that sort of threat

RG/RR

RG is restricted to adverbs of degree (also called intensifiers, etc.) which precede the word or expression they modify. Clear cases of RG are "very", and "so" and "as" in comparatives (see the discussion of "as" below).

Adverbs which have a range of functions, including adverb of degree, are not normally tagged RG, but are given the more general RR tag instead.

She<CTAG>PPHS1</CTAG> was<CTAG>VBDZ</CTAG> scantily<CTAG>RR</CTAG> clad<CTAG>JJ</CTAG>

Here 'scantily' is an RR rather than an RG because it could also occur after a verb:

She<CTAG>PPHS1</CTAG> dressed<CTAG>VVD</CTAG> scantily<CTAG>RR</CTAG>

This is another case of the general principle of avoiding general-specific ambiguities within a word class. RG is usually only for words which do not have a more general range of adverbial uses.

There are exceptions to this, however. (See Section 2: Adverbs. See also Section 4: so). Some words may be tagged either RG or RR, as in the following examples:

She is so<CTAG>RG</CTAG> attractive
I would think so<CTAG>RR</CTAG>
This is too<CTAG>RG</CTAG> heavy
Can I come too<CTAG>RR</CTAG>?
That's rather<CTAG>RG</CTAG> nice
I would rather<CTAG>RR</CTAG> go out
He's quite<CTAG>RG</CTAG> talkative
Quite<CTAG>RR</CTAG>, I agree

Note that "about" may be an RP or an RG. However, this does not violate the principle mentioned above, since both RP and RG are sub-categories of RR.

"As" can be tagged RG, II or CSA. It is an RG when it occurs before an adjective, adverb or determiner (and sometimes other words) in phrases such as:

I don't think that one is as<CTAG>RG</CTAG> good
I go there as<CTAG>RG</CTAG> often (as...)
There are not as<CTAG>RG</CTAG> many (as...)

In the 2nd and 3rd examples above, the second as is always a CSA because it introduces a comparative construction (an equal comparison, as contrasted with an unequal comparison introduced by than). Thus, in the following, the second as is tagged CSA:

She's not as<CTAG>RG</CTAG> (or so<CTAG>RG</CTAG>) pretty as<CTAG>CSA</CTAG> I thought
An ostrich can run as<CTAG>RG</CTAG> quickly as<CTAG>CSA</CTAG> a zebra
He has as<CTAG>RG</CTAG> many as<CTAG>CSA</CTAG> six children

Notice that as in this comparative use is tagged CSA whether or not it introduces a clause, as normally understood. In the second case above, as precedes a noun phrase. In the following, it precedes an adjective:

Please come as<CTAG>RG</CTAG> quickly as<CTAG>CSA</CTAG> possible

CSA is also the tag used when as introduces other clauses (eg. clauses of time or clauses of reason). eg:

As<CTAG>CSA</CTAG> I arrived, he was leaving
I'll lend you the money, as<CTAG>CSA</CTAG> you're my friend

II is the tag for "as" when it is an undoubted preposition. It usually has an equative meaning, as in:

They regard him as<CTAG>II</CTAG> a friend
As<CTAG>II</CTAG> governor of the province, I have to take action

The guideline restricts II to cases of as followed by a noun-phrase-type structure - which may be a pronoun. If as is followed by an adjective, a past participle etc., it is tagged CSA, even though it has the same equative type of meaning as as<CTAG>II</CTAG>. eg:

The novel as<CTAG>CSA</CTAG> originally written
Many people regard his paintings as<CTAG>CSA</CTAG> hideous




1.4. Discrimination by Word

NAMES OF PROJECTS/PROGRAMMES/FUNDS/TREATIES

In most cases these will be tagged NP1, even when the tag NN1 would have been valid (e.g. Force). Acronyms that represent company names, or names of groups are usually given the "NP1" tag too. An example of an acronym that could receive "NN1" is "SOS".

CONCERNED

This will almost always be "JJ" unless used in a sentence such as "I concerned myself with the information."

ONE

MC1 where one precedes a noun or noun phrase, as in:

one<CTAG>MC1</CTAG> book
one<CTAG>MC1</CTAG> bag of spuds

and where it is the head of a noun phrase with a dependent prepositional phrase:

one<CTAG>MC1</CTAG> of the books

and when referring to 'one' as a number entity:

this is the number one<CTAG>MC1</CTAG>
one<CTAG>MC1</CTAG> is an integer
type a one<CTAG>MC1</CTAG> at the prompt

PN1 where it is a personal pronoun such as:

one<CTAG>PN1</CTAG> ought to be careful
one<CTAG>PN1</CTAG> doesn't like to make a fuss

and when functioning as a substitute form:

the prettiest one<CTAG>PN1</CTAG> is called Flo
the one<CTAG>PN1</CTAG> you are holding is a bomb
his idea is not one<CTAG>PN1</CTAG> that holds much water

SINCE

When used to mean "because" or "because of" since is tagged as CS. When used in phrases such as "ten years since" it is tagged RR, and when used in phrases such as "Since September..." it is tagged II.

SO

The CS tag is used when so is equivalent to the expression so that. It has a purposive function:

We hid it so<CTAG>CS</CTAG> no one would notice
He only said it so<CTAG>CS</CTAG> he could impress us

It is an RR when it occurs, usually after a punctuation mark or at the beginning of a sentence, with a meaning approximating to therefore:

It is raining, so<CTAG>RR</CTAG> I am staying at home
So<CTAG>RR</CTAG> we gave up the struggle, you see
He swore at me, so<CTAG>RR</CTAG> I hit him

It is likewise an RR if preceded by a conjunction in examples like those directly above:

He swore at me, and<CTAG>CC</CTAG> so<CTAG>RR</CTAG> I hit him

In expressions where so is used as a substitute form, and in cases where its use is clearly adverbial (= like that), it is tagged RR:

substitute:

so<CTAG>RR</CTAG> I believe
I might feel that, but I would never say so<CTAG>RR</CTAG>
So<CTAG>RR</CTAG> did John
I'm afraid so<CTAG>RR</CTAG>

adverbial clause:

Don't take on so<CTAG>RR</CTAG>!

It is tagged RG when used in positions where very could occur:

She is so<CTAG>RG</CTAG> friendly
I have never been so<CTAG>RG</CTAG> angry
Thank you so<CTAG>RG</CTAG> much

and when it corresponds to the first as in 'as...as...' comparisons:

They're not doing so<CTAG>RG</CTAG> well<CTAG>RR</CTAG> as<CTAG>CSA</CTAG> before

UNTIL

Until is tagged II when it is used to mean "in" and CS when it is used to mean "when".

WHEN

"When" may be tagged RRQ or CS. It can introduce three types of clause: adverbial clauses, noun clauses and relative clauses.

When it introduces an adverbial clause or a non-restrictive relative clause, it is a CS. When it introduces either a noun clause or a restrictive relative clause, it is an RRQ. Examples:

- adverbial clause:

When<CTAG>CS</CTAG> I arrived, John left
John left when<CTAG>CS</CTAG> I arrived (at the time at which)
I smoke when<CTAG>CS</CTAG> I'm tense (whenever)

- noun clause:

I cannot remember when<CTAG>RRQ</CTAG> I was christened
I don't know when<CTAG>RRQ</CTAG> the next bus is due (the date/point in time at which)

- relative clause:

In the year when<CTAG>RRQ</CTAG> I was born (in which)
The moment when<CTAG>RRQ</CTAG> he arrived (at which)

Note that when can often be omitted in a relative clause.

There are also non-restrictive relative clauses introduced by when, which are now to be tagged as CS. Previously they were tagged RRQ. It is no longer necessary to distinguish these from adverbial clauses introduced by when. Here are some examples of non-restrictive relative clauses:

In 1968, when<CTAG>CS</CTAG> the students were revolting in Paris...

Here, when could best be paraphrased as at the time when.

Another example:
School finished at 4 o'clock precisely, when<CTAG>CS</CTAG> a loud bell sounded

Non-restrictive relative clauses do not define or restrict the meaning of the antecedent. If the antecedent is a precise temporal expression (such as "4 o'clock", "1990", "yesterday"), when is usually a non-restrictive relative. These are different from restrictive relatives, such as:

In the year when<CTAG>RRQ</CTAG> I was born

Here the year is defined by the relative clause. Typically restrictive relatives are not preceded by a comma, and the when can normally be omitted. Another use of when<CTAG>RRQ</CTAG> is in direct questions:

When<CTAG>RRQ</CTAG> did you find out?

In abbreviated adverbial clauses, where when is followed by an adjective, a preposition phrase, a non-finite clause etc., when is a CS:

when<CTAG>CS</CTAG> ready
when<CTAG>CS</CTAG> in doubt
when<CTAG>CS</CTAG> arriving late

but before an infinitive, when is an RRQ:

I don't know when<CTAG>RRQ</CTAG> to apply

Note that the infinitive clause may be implied:

Tell me when<CTAG>RRQ</CTAG> (to start)

and that a noun clause may be abbreviated simply to the word when:

It was Guy Fawkes, but I can't remember when<CTAG>RRQ</CTAG>

WHERE

The tagging of where is consistent with when.

1.5. Possible errors in the text

File 006
What legal proceedings will now by initiated against the offenders
To what extent can UNICE be said do be representing the employers'
As the Honorable Member States, the synthetic fibers industry has experienced... (is this a verb? should it be lowercase?)
seminars on community polices of particular interest (policies?)
Experience over the last 25 year with this... (years?)
25 MW, tghe output
the provision of these service is the responsibility (services?)
to the answer which is gave to Written Question...
N provision is made (No?)
of a lega system
and that used for calculating pension for women
N other states
status of black rhino's
In this context they Stated (remove capitalisation)
greenhouse gase emissions
Question N 2513/91
had Stated its readiness
In EPC aware that the...
oral question N H-0544/92
has itself Stated that it..
Unectef v. <S> Heylens

File 016
which contradicts this assessments
Question N 2620/88
a unilateral decisions
consulting with workers representatives (add apostrophe?)
elected workers representatives
offices of the Holy See outside that State
of the Holy See
Rgion wallone
bevore President Najibullah's
The proceed were used to
provide en absolute guarantee of

File 032
Dr </S> **15;1324;S
fatal accidcents
Commission v. </S>
are the figure for the participation
The conflicts with at study
througout the area
N preliminary
Vol. <S/> advisory </S> </PAR> **52;46071;PAR **17;46124;S committees
the eoncouragement
views correct represented
Bleis v. </S>
which where the first to
as regard the

File 040
Werner v. </S>
Cocentrate intake
'Tokayj' wines
Italy v. </S>
During the 1980's
authorities, commitment to (should be apostrophe)

File 047
Latein American
Develpment
cf. </S>
have effectd
one of thom have
groupes of experts
chlidren's right (children's rights?)
N information
N details
at international closely level
Neverthelesss, special
Co. </S>
Co. </S>
N specific
N specific
and elsehwere
leakage or radioactive wastes (of?)
N community
N new arrangements
in wich

File 051
VRA's have been possible
N extension of such VRA's
N such understanding
macroeconmic
customers information (apostrophe needed?)
export of cosmetic containing (plural?)
e.g. </S>
N project
meting within
question N 672/92
done so do ratify (to?)

File 058
transport bypipeline
N one would contest
been earmarked or this (for?)
N follow-up
Prof. </S>
Prof. </S>
Prof. </S>
Prof. </S>
instal systems
question N
The blody events

File 061
An agrrement
N solution
plasma derivates
Doc. </S>
N longer applicable
but would provided no information
As regard the duration (plural)
the extend of (extent?)
NGO's

File 065
N project
section of 'Europe 2000 (where is the closing quotation?)
regarding lthe problems
Written Qeestions
N posts
increasing awarenes
establishments producting
a privat consulting
may favour large corporation
how will they by linked
rights issued involved

File 081
Rgime
that rgime
less than three month
Doc. </S>


2. Part of Speech Tagging for French

2.1. Procedure of validation

Where an erroneous tag was found, the correct tag was inserted before it, separated by a dash (-), e.g.:

1.2.1.1.65.2.3.1.3.1\1        TOK       La            DETRFS
1.2.1.1.65.2.3.1.3.1\4        TOK       Commission    SUBSFS
1.2.1.1.65.2.3.1.3.1\14       PUNCT     ,             YAAA
1.2.1.1.65.2.3.1.3.1\16       TOK       qui           PRELFS-PRELMS
1.2.1.1.65.2.3.1.3.1\20       TOK       avait         AUXA3-VERB3
1.2.1.1.65.2.3.1.3.1\26       TOK       deja          ADVE
1.2.1.1.65.2.3.1.3.1\31       TOK       annoncé       PPASMS
1.2.1.1.65.2.3.1.3.1\39       TOK       cela          PDETMS

Here the tag PRELMS was automatically assigned for "qui", and this was corrected to PRELFS. Likewise AUXA3 was inserted for "avait". For this example diacritic characters and the candidate tags have been removed but in the output files none of the text has been removed. The only changes made to the files are the addition of the correct tags, and the insertion of question marks on tags where the text is clearly wrong (usually a typographical error), or has been incorrectly segmented by the tagging program.
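
The correction convention can be read back with a few lines of code. The sketch below assumes the simplified four-column layout of the example (location, element type, word, tag field) and takes the part before the dash as the corrected tag. The real output files also carry candidate tags and question marks on doubtful tokens, which are not handled here, and the names ValidatedToken, parse_line and share_unchanged are hypothetical.

```python
from typing import List, NamedTuple, Optional


class ValidatedToken(NamedTuple):
    location: str
    kind: str
    word: str
    tag: str                 # the tag to trust after validation
    replaced: Optional[str]  # automatically assigned tag that was overridden


def parse_line(line: str) -> ValidatedToken:
    """Parse one line of the simplified four-column example format."""
    location, kind, word, tags = line.split()
    correct, sep, automatic = tags.partition("-")
    return ValidatedToken(location, kind, word, correct, automatic if sep else None)


def share_unchanged(tokens: List[ValidatedToken]) -> float:
    """Fraction of tokens whose automatic tag was left uncorrected."""
    if not tokens:
        return 0.0
    return sum(t.replaced is None for t in tokens) / len(tokens)


SAMPLE = r"""1.2.1.1.65.2.3.1.3.1\1        TOK       La            DETRFS
1.2.1.1.65.2.3.1.3.1\4        TOK       Commission    SUBSFS
1.2.1.1.65.2.3.1.3.1\14       PUNCT     ,             YAAA
1.2.1.1.65.2.3.1.3.1\16       TOK       qui           PRELFS-PRELMS
1.2.1.1.65.2.3.1.3.1\20       TOK       avait         AUXA3-VERB3
1.2.1.1.65.2.3.1.3.1\26       TOK       deja          ADVE
1.2.1.1.65.2.3.1.3.1\31       TOK       annoncé       PPASMS
1.2.1.1.65.2.3.1.3.1\39       TOK       cela          PDETMS"""

if __name__ == "__main__":
    tokens = [parse_line(line) for line in SAMPLE.splitlines()]
    for token in tokens:
        print(token.word, token.tag, token.replaced)
    print(share_unchanged(tokens))
```

In the sample, two of the eight tokens carry a correction, so the share of unchanged tags is 0.75.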

2.2. The tagset

Tag     Definition
ADJEFP  adjective feminine plural
ADJEFS  adjective feminine singular
ADJEMP  adjective masculine plural
ADJEMS  adjective masculine singular
ADJIFP  indefinite adjective feminine plural
ADJIFS  indefinite adjective feminine singular
ADJIMP  indefinite adjective masculine plural
ADJIMS  indefinite adjective masculine singular
AUXA    auxiliary "avoir" infinitive
AUXA1   auxiliary "avoir" 1st person singular
AUXA2   auxiliary "avoir" 2nd person singular
AUXA3   auxiliary "avoir" 3rd person singular
AUXA4   auxiliary "avoir" 1st person plural
AUXA5   auxiliary "avoir" 2nd person plural
AUXA6   auxiliary "avoir" 3rd person plural
AUXE    auxiliary "être" infinitive
AUXE1   auxiliary "être" 1st person singular
AUXE2   auxiliary "être" 2nd person singular
AUXE3   auxiliary "être" 3rd person singular
AUXE4   auxiliary "être" 1st person plural
AUXE5   auxiliary "être" 2nd person plural
AUXE6   auxiliary "être" 3rd person plural
VERB1   main verb 1st person singular
VERB2   main verb 2nd person singular
VERB3   main verb 3rd person singular
VERB4   main verb 1st person plural
VERB5   main verb 2nd person plural
VERB6   main verb 3rd person plural
VINF    main verb infinitive
PPASFP  past participle feminine plural
PPASFS  past participle feminine singular
PPASMP  past participle masculine plural
PPASMS  past participle masculine singular
PPRE    present participle
CCOO    coordinative conjunction
CSUB    subordinative conjunction
CHIF    numerals
DETRFP  determiner feminine plural
DETRFS  determiner feminine singular
DETRMP  determiner masculine plural
DETRMS  determiner masculine singular
DINTFP  indefinite determiner feminine plural
DINTFS  indefinite determiner feminine singular
DINTMP  indefinite determiner masculine plural
DINTMS  indefinite determiner masculine singular
ADVE    adverb
NE      negative adverb : particle "ne"
PAS     negative adverb : particle "pas,jamais,point"
PREP    adposition
PAU     adposition "au"
PAUX    adposition "aux"
PDEA    adposition "de,à"
PDES    adposition "des"
PREPMS  adposition "de"
PDETFP  demonstrative pronoun
PDETFS  demonstrative pronoun
PDETMP  demonstrative pronoun
PDETMS  demonstrative pronoun
PINDFP  indefinite pronoun
PINDFS  indefinite pronoun
PINDMP  indefinite pronoun
PINDMS  indefinite pronoun
PINTFP  interrogative pronoun
PINTFS  interrogative pronoun
PINTMP  interrogative pronoun
PINTMS  interrogative pronoun
PPER1   personal pronoun 1st person singular
PPER2   personal pronoun 2nd person singular
PPER3F  personal pronoun 3rd person feminine singular
PPER3M  personal pronoun 3rd person masculine singular
PPER4   personal pronoun 1st person plural
PPER5   personal pronoun 2nd person plural
PPER6F  personal pronoun 3rd person feminine plural
PPER6M  personal pronoun 3rd person masculine plural
PPOBFP  personal pronoun feminine plural (object)
PPOBFS  personal pronoun feminine singular (object)
PPOBMP  personal pronoun masculine plural (object)
PPOBMS  personal pronoun masculine singular (object)
PSFP    possessive pronoun feminine plural
PSFS    possessive pronoun feminine singular
PSMP    possessive pronoun masculine plural
PSMS    possessive pronoun masculine singular
PREFFP  reflexive pronoun feminine plural
PREFFS  reflexive pronoun feminine singular
PREFMP  reflexive pronoun masculine plural
PREFMS  reflexive pronoun masculine singular
PRELFP  relative pronoun feminine plural
PRELFS  relative pronoun feminine singular
PRELMP  relative pronoun masculine plural
PRELMS  relative pronoun masculine singular
NPRO    Proper nouns
SUBSFP  substantive feminine plural
SUBSFS  substantive feminine singular
SUBSMP  substantive masculine plural
SUBSMS  substantive masculine singular
X       unique membership class
AAAA    strong punctuation
YAAA    weak punctuation
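
Many of the French tags above end in a two-letter suffix for gender (F, M) and number (S, P). The sketch below decodes that suffix. It rests on two assumptions that the table does not state outright: that the suffix has the same meaning for the pronoun tags whose definitions omit gender and number (PDET, PIND, PINT), and that PREPMS, defined simply as the adposition "de", is not an inflected tag. The names INFLECTED and gender_number are hypothetical.

```python
from typing import Optional, Tuple

GENDER = {"F": "feminine", "M": "masculine"}
NUMBER = {"S": "singular", "P": "plural"}

# Tag prefixes from the table that take a gender/number suffix.
# PREP is left out on purpose: PREPMS is listed as the adposition "de".
INFLECTED = {
    "ADJE", "ADJI", "PPAS", "DETR", "DINT", "PDET", "PIND",
    "PINT", "PPOB", "PS", "PREF", "PREL", "SUBS",
}


def gender_number(tag: str) -> Optional[Tuple[str, str]]:
    """Return (gender, number) for an inflected tag, otherwise None."""
    prefix, suffix = tag[:-2], tag[-2:]
    if (prefix in INFLECTED and len(suffix) == 2
            and suffix[0] in GENDER and suffix[1] in NUMBER):
        return GENDER[suffix[0]], NUMBER[suffix[1]]
    return None


if __name__ == "__main__":
    for tag in ("SUBSFS", "PRELMS", "PSFP", "PREPMS", "ADVE", "X"):
        print(tag, gender_number(tag))
```

Restricting the decoding to a known set of prefixes keeps tags such as ADVE or AUXA3 from being misread.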

3. Part of Speech Tagging for German

3.1. The Tag set

The tag set consists of 160 tags, 2 of which are punctuation tags; the others are part-of-speech tags.

TagDefinition
AAdjective, predic.
AAAAweak Punctuation mark
AFPDAdjective fem. plur. dative
AFPGAdjective fem. plur. dative
AFSAAdjective fem. sing. accusative
AFSDAdjective fem. sing. dative
AFSGAdjective fem. sing. genitive
AFSNAdjective fem. sing. nominative
AMPAAdjective masc. plur. accusative
AMPFAdjective masc. plur. dative
AMPGAdjective masc. plur. genitive
AMPNAdjective masc. plur. nominative
AMSDAdjective masc. sing. dative
AMSNAdjective masc. sing. nominative
CAConjunction Part I
CCCooordinative cunjunction
CHIFNumbers
CISubord. conjunctions introd. an infinit. clause
CSSubordinative cunjunction
CVComparative Conjunction
CZConjunction Part II
DDDem. determiner
DFSADem. determiner, fem. sing. accusative
DFSDDem. determiner, fem. sing. dative
DFSGDem. determiner, fem. sing. genitive
DFSNDem. determiner, fem. sing. nominative
DIIndef. determiner
DMSADem. determiner, masc. sing. accusative
DMSDDem. determiner, masc. sing. dative
DMSGDem. determiner, masc. sing. genitive
DMSNDem. determiner, masc., sing., nominative
DNSADem. determiner, neut. sing. accusative
DNSDDem. determiner, neut. sing. dative
DNSGDem. determiner, neut. sing. genitive
DNSNDem. determiner, neut. sing. nominative
DTInterrog. determiner
IInterjection
NCFPACommon noun, fem. plur., accusative
NCFPDCommon noun, fem. plur., dative
NCFPGCommon noun, fem. plur., genitive
NCFPNCommon noun, fem. plur., nominative
NCFSACommon noun, fem. sing., accusative
NCFSDCommon noun, fem. sing., dative
NCFSGCommon noun, fem. sing., genitive
NCFSNCommon noun, fem. sing., nominative
NCMPACommon noun, masc. plur., accusative
NCMPDCommon noun, masc. plur., dative
NCMPGCommon noun, masc. plur., genitive
NCMPNCommon noun, masc. plur., nominative
NCMSACommon noun, masc. sing., accusative
NCMSDCommon noun, masc. sing., dative
NCMSGCommon noun, masc. sing., genitive
NCMSNCommon noun, masc. sing., nominative
NCNPACommon noun, neut. plur., accusative
NCNPDCommon noun, neut. plur., dative
NCNPGCommon noun, neut. plur., genitive
NCNPNCommon noun, neut. plur., nominative
NCNSACommon noun, neut. sing., accusative
NCNSDCommon noun, neut. sing., dative
NCNSGCommon noun, neut. sing., genitive
NCNSNCommon noun, neut. sing., nominative
NPROProper noun
PDDem. pronoun
PIIndef. pronoun
PRRel. pronoun
PTInterrogative pronoun
PXRefl. pronoun
QIinfinitive particle
QSsuperlative particle
QVverbal prefix
RGGeneral adverb
RIinterrogative adverb
RPpronominal adverb
SPCpre-position, clitic
SPSpre-position, simple
STSpost-position, simple
VAII1PAux. verb, 1st pers. pl. ind. imp.
VAII1SAux. verb, 1st pers. sg. ind. imp.
VAII2PAux. verb, 2nd pers. pl. ind. imp.
VAII2SAux. verb, 2nd pers. sg. ind. imp.
VAII3PAux. verb, 3rd pers. pl. ind. imp.
VAII3SAux. verb, 3rd pers. sg. ind. imp.
VAIP1PAux. verb, 1st pers. pl. ind. pres.
VAIP1SAux. verb, 1st pers. sg. ind. pres.
VAIP2PAux. verb, 2nd pers. pl. ind. pres.
VAIP2SAux. verb, 2nd pers. sg. ind. pres.
VAIP3PAux. verb, 3rd pers. pl. ind. pres.
VAM2PAux. verb, 2nd pers. pl. imperative
VAPPAux. verb, pres. participle
VAPSAux. verb, past part.
VASI1PAux. verb, 1st pers. pl. subj. imp.
VASI1SAux. verb, 1st pers. sg. subj. imp.
VASI2PAux. verb, 2nd pers. pl. subj. imp.
VASI2SAux. verb, 2nd pers. sg. subj. imp.
VASI3PAux. verb, 3rd pers. pl. subj. imp.
VASI3SAux. verb, 3rd pers. sg. subj. imp.
VASP1PAux. verb, 1st pers. pl. subj. pres.
VASP1SAux. verb, 1st pers. sg. subj. pres.
VASP2PAux. verb, 2nd pers. pl. subj. pres.
VASP2SAux. verb, 2nd pers. sg. subj. pres.
VASP3PAux. verb, 3rd pers. pl. subj. pres.
VASP3SAux. verb, 3rd pers. sg. subj. pres.
VMII1PMain verb, 1st pers. pl. ind. imp.
VMII1SMain verb, 1st pers. sg. ind. imp.
VMII2PMain verb, 2nd pers. pl. ind. imp.
VMII2SMain verb, 2nd pers. sg. ind. imp.
VMII3PMain verb, 3rd pers. pl. ind. imp.
VMII3SMain verb, 3rd pers. sg. ind. imp.
VMIP1PMain verb, 1st pers. pl. ind. pres.
VMIP1SMain verb, 1st pers. sg. ind. pres.
VMIP2PMain verb, 2nd pers. pl. ind. pres.
VMIP2SMain verb, 2nd pers. sg. ind. pres.
VMIP3PMain verb, 3rd pers. pl. ind. pres.
VMIP3SMain verb, 3rd pers. sg. ind. pres.
VMM2Main verb, 2nd pers. pl. imperative
VMM2PMain verb, 2nd pers. pl. imperative
VMPPMain verb, pres. participle
VMPSMain verb, past part.
VMSI1PMain verb, 1st pers. pl. subj. imp.
VMSI1SMain verb, 1st pers. sg. subj. imp.
VMSI2PMain verb, 2nd pers. pl. subj. imp.
VMSI2SMain verb, 2nd pers. sg. subj. imp.
VMSI3PMain verb, 3rd pers. pl. subj. imp.
VMSI3SMain verb, 3rd pers. sg. subj. imp.
VMSP1PMain verb, 1st pers. pl. subj. pres.
VMSP1SMain verb, 1st pers. sg. subj. pres.
VMSP2PMain verb, 2nd pers. pl. subj. pres.
VMSP2SMain verb, 2nd pers. sg. subj. pres.
VMSP3PMain verb, 3rd pers. pl. subj. pres.
VMSP3SMain verb, 3rd pers. sg. subj. pres.
VMUMain verb, infinitive with incorp. particle
VOII1PMod. verb, 1st pers. pl. ind. imp.
VOII1SMod. verb, 1st pers. sg. ind. imp.
VOII2PMod. verb, 2nd pers. pl. ind. imp.
VOII2SMod. verb, 2nd pers. sg. ind. imp.
VOII3PMod. verb, 3rd pers. pl. ind. imp.
VOII3SMod. verb, 3rd pers. sg. ind. imp.
VOIP1PMod. verb, 1st pers. pl. ind. pres.
VOIP1SMod. verb, 1st pers. sg. ind. pres.
VOIP2PMod. verb, 2nd pers. pl. ind. pres.
VOIP2SMod. verb, 2nd pers. sg. ind. pres.
VOIP3PMod. verb, 3rd pers. pl. ind. pres.
VOIP3SMod. verb, 3rd pers. sg. ind. pres.
VOPPMod. verb, pres. participle
VOPSMod. verb, past part.
VOSI1PMod. verb, 1st pers. pl. subj. imp.
VOSI1SMod. verb, 1st pers. sg. subj. imp.
VOSI2PMod. verb, 2nd pers. pl. subj. imp.
VOSI2SMod. verb, 2nd pers. sg. subj. imp.
VOSI3PMod. verb, 3rd pers. pl. subj. imp.
VOSI3SMod. verb, 3rd pers. sg. subj. imp.
VOSP1PMod. verb, 1st pers. pl. subj. pres.
VOSP1SMod. verb, 1st pers. sg. subj. pres.
VOSP2PMod. verb, 2nd pers. pl. subj. pres.
VOSP2SMod. verb, 2nd pers. sg. subj. pres.
VOSP3PMod. verb, 3rd pers. pl. subj. pres.
VOSP3SMod. verb, 3rd pers. sg. subj. pres.
XFormulae (5x + 3y), Symbols (%, ' etc.)
YAAAstrong Punctuation mark

3.2. Training Corpus

For training the tagger, we used the MULTEXT corpus based on the 'joc' files. Statistics on these files are as follows:

JOC006.DE


Words 21419 (88.2 %)
Punctuations 2858 (11.8 %)
Sentences 1019
Tags 3900
Unknown 1686 (7.9 %)
Ambiguous 14665 (68.5 %)

JOC016.DE


Words 17560 (87.9 %)
Punctuations 2417 (12.1 %)
Sentences 824
Tags 3056
Unknown 1431 (8.1 %)
Ambiguous 11642 (66.3 %)

JOC032.DE


Words 25166 (88.6 %)
Punctuations 3248 (11.4 %)
Sentences 1078
Tags 4056
Unknown 1950 (7.7 %)
Ambiguous 17113 (68.0 %)

JOC040.DE


Words 26493 (88.3 %)
Punctuations 3509 (11.7 %)
Sentences 1231
Tags 4566
Unknown 2188 (8.3 %)
Ambiguous 17725 (66.9 %)

JOC047.DE


Words 17468 (88.5 %)
Punctuations 2276 (11.5 %)
Sentences 808
Tags 3076
Unknown 1360 (7.8 %)
Ambiguous 12108 (69.3 %)

JOC051.DE


Words 16101 (87.8 %)
Punctuations 2239 (12.2 %)
Sentences 751
Tags 2816
Unknown 1370 (8.5 %)
Ambiguous 10709 (66.5 %)

JOC058.DE


Words 21003 (88.0 %)
Punctuations 2857 (12.0 %)
Sentences 921
Tags 3470
Unknown 1835 (8.7 %)
Ambiguous 14078 (67.0 %)

JOC061.DE


Words 20538 (88.7 %)
Punctuations 2628 (11.3 %)
Sentences 963
Tags 3700
Unknown 1625 (7.9 %)
Ambiguous 14005 (68.2 %)

JOC065.DE


Words 22443 (87.9 %)
Punctuations 3078 (12.1 %)
Sentences 1046
Tags 3940
Unknown 1833 (8.2 %)
Ambiguous 15213 (67.8 %)

JOC081.DE


Words 13352 (89.5 %)
Punctuations 1572 (10.5 %)
Sentences 549
Tags 2124
Unknown 1060 (7.9 %)
Ambiguous 9279 (69.5 %)

Where:

3.3. Training process

3.3.1. Training a 'zero matrix' with ambiguous corpus.

We created an equiprobable matrix (also called a 'zero matrix'), then applied Baum-Welch parameter re-estimation to the texts after lexical look up. Training was set to 10 iterations on each file.

RESULTS:

Corpus      Words    Errors    Rate
joc06+16    38879    15458     39.76%
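
The equiprobable starting point described in this subsection can be sketched as follows. This is a minimal illustration, not the tagger's own code: the function name `equiprobable_matrix` and the three-tag subset are assumptions made for the example.

```python
def equiprobable_matrix(tags):
    """Return a transition matrix in which every tag-to-tag transition is equally likely."""
    p = 1.0 / len(tags)
    return {prev: {nxt: p for nxt in tags} for prev in tags}


# Hypothetical three-tag subset of the German tagset, for illustration only.
tags = ["NCFSN", "VMIP3S", "SPS"]
zero_matrix = equiprobable_matrix(tags)
```

Every row of such a matrix sums to 1, so it carries no prior preference for any tag sequence; all the information has to come from re-estimation.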

3.3.2. Training an 'Initial matrix' with ambiguous corpus.

An initial matrix was created using a hand-tagged corpus:


Words 13114 (88.0 %)
Punctuations 1788 (12.0 %)
Sentences 651
Tags 2490

Then Baum-Welch parameter re-estimation was applied to the texts after lexical look up. Training was set to 10 iterations on each file.

RESULTS:

Corpus      Words    Errors    Rate
joc06+16    38879    10672     27.38%

3.3.3. Training an 'Initial matrix' with ambiguous corpus and lexical look up revision.

An initial matrix was created using a hand-tagged corpus:


Words 13114 (88.0 %)
Punctuations 1788 (12.0 %)
Sentences 651
Tags 2490

Then Baum-Welch parameter re-estimation was applied to the texts after lexical look up, with the corrections described in section 3.3.4. Training was set to 10 iterations on each file.

3.3.4. Lexical look up revision

We identified a number of errors in the lexicon:

RESULTS:

Corpus      Words    Errors    Rate
joc06+16    38879    2791      7.18%

3.3.5. Output and result example

As the real output format does not fit on a single line, we have split each entry into two lines:

FIRST LINE


first column reference (1.2.1.1.2.2.3.1.1.1\1)
second column segmenter Classes (CHUNK, TOK etc.)
third column textual element
fourth column tag (bold characters)

SECOND LINE
ambiguity classes separated by pipes (|) between [] :
[betreffen\Vaip3s\VAIP3S|betreffen\Vmip3s\VMIP3S|betreffen\Voip3s\VOIP3S]

Note
This format is the raw output from the tagger, before its conversion into CESANA conformant format.




                         [CHUNK	<DIV_Q FROM='1.2.1.1.2.2.3.1'>
                         (PAR	<HEAD FROM='1.2.1.1.2.2.3.1.1'>
                         (SENT	<S>	
1.2.1.1.2.2.3.1.1.1\1	TOK	Betrifft	VMIP3S
[betreffen\Vaip3s\VAIP3S|betreffen\Vmip3s\VMIP3S|betreffen\Voip3s\VOIP3S]
1.2.1.1.2.2.3.1.1.1\9	PUNCT	:	YAAA	
[:\YAAA\YAAA]
1.2.1.1.2.2.3.1.1.1\11	TOK	Personalsituation	NPRO
[Personalsituation\Np----\NPRO]
1.2.1.1.2.2.3.1.1.1\29	TOK	in	P
[in\Pov\P]
1.2.1.1.2.2.3.1.1.1\32	TOK	der	DFSD
[die\D.fpg\DFPG|die\D.fsd\DFSD|die\D.fsg\DFSG|der\D.mpg\DMPG|das\D.mpg\DMPG|der\D.msn\DMSN|die\Ncfpg\NCFPG|die\Ncfsd\NCFSD|die\Ncfsg\NCFSG|das\Ncmpg\NCMPG]
1.2.1.1.2.2.3.1.1.1\36	TOK	Kommission	NCFSD
[Kommission\Ncfsa\NCFSA|Kommission\Ncfsd\NCFSD|Kommission\Ncfsg\NCFSG|Kommission\Ncfsn\NCFSN]
                         )SENT	</S>
                         )PAR	</HEAD>
                         (PAR	<P FROM='1.2.1.1.2.2.3.1.2'>	
                         (SENT	<S>	
1.2.1.1.2.2.3.1.2.1\1	TOK	Kann	VOIP3S
[können\Vaip1s\VAIP1S|können\Vaip3s\VAIP3S|können\Vmip3s\VMIP3S|können\Voip3s\VOIP3S]
1.2.1.1.2.2.3.1.2.1\6	TOK	die	DFPA
[die\D.fpa\DFPA|die\D.fpn\DFPN|die\D.fsa\DFSA|die\D.fsn\DFSN|der\D.mpd\DMPD|der\D.mpn\DMPN|das\D.npa\DNPA|das\D.npd\DNPD|das\D.npn\DNPN|die\Ncfpa\NCFPA|die\Ncfpn\NCFPN|die\Ncfsa\NCFSA|die\Ncfsn\NCFSN|der\Ncmpd\NCMPD|der\Ncmpn\NCMPN|das\Ncnpa\NCNPA|das\Ncnpd\NCNPD|das\Ncnpn\NCNPN]
1.2.1.1.2.2.3.1.2.1\10	TOK	Kommission	NCFSN
[Kommission\Ncfsa\NCFSA|Kommission\Ncfsd\NCFSD|Kommission\Ncfsg\NCFSG|Kommission\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.2.1\21	TOK	folgendes	ANSA
[folgend\A..nsa\ANSA|folgend\A..nsn\ANSN]
1.2.1.1.2.2.3.1.2.1\31	TOK	mitteilen	VAIP1P
[mitteilen\Vaip1p\VAIP1P|mitteilen\Vaip3p\VAIP3P|mitteilen\Vasp1p\VASP1P|mitteilen\Vasp3p\VASP3P|mitteilen\Vmip1p\VMIP1P|mitteilen\Vmip3p\VMIP3P|mitteilen\Vmsp1p\VMSP1P|mitteilen\Vmsp3p\VMSP3P|mitteilen\Voip1p\VOIP1P|mitteilen\Voip3p\VOIP3P|mitteilen\Vosp1p\VOSP1P|mitteilen\Vosp3p\VOSP3P]
1.2.1.1.2.2.3.1.2.1\40	PUNCT	:	YAAA
[:\YAAA\YAAA]
                         )SENT	</S>	
                         )PAR	</P>	
                         (PAR	<P FROM='1.2.1.1.2.2.3.1.3'>	
                         (SENT	<S>	
1.2.1.1.2.2.3.1.3.1\1	ENUM	1.	CHIF
[1.\M----\CHIF]
1.2.1.1.2.2.3.1.3.1\5	TOK	die	DFSN
[die\D.fpa\DFPA|die\D.fpn\DFPN|die\D.fsa\DFSA|die\D.fsn\DFSN|der\D.mpd\DMPD|der\D.mpn\DMPN|das\D.npa\DNPA|das\D.npd\DNPD|das\D.npn\DNPN|die\Ncfpa\NCFPA|die\Ncfpn\NCFPN|die\Ncfsa\NCFSA|die\Ncfsn\NCFSN|der\Ncmpd\NCMPD|der\Ncmpn\NCMPN|das\Ncnpa\NCNPA|das\Ncnpd\NCNPD|das\Ncnpn\NCNPN]
1.2.1.1.2.2.3.1.3.1\9	TOK	Zahl	NCFSN
[Zahl\Ncfsa\NCFSA|Zahl\Ncfsd\NCFSD|Zahl\Ncfsg\NCFSG|Zahl\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.3.1\14	TOK	der	DFPG
[die\D.fpg\DFPG|die\D.fsd\DFSD|die\D.fsg\DFSG|der\D.mpg\DMPG|das\D.mpg\DMPG|der\D.msn\DMSN|die\Ncfpg\NCFPG|die\Ncfsd\NCFSD|die\Ncfsg\NCFSG|das\Ncmpg\NCMPG]
1.2.1.1.2.2.3.1.3.1\18	TOK	bei	SPS
[bei\Qv\QV|bei\Sps\SPS]
1.2.1.1.2.2.3.1.3.1\22	TOK	der	DFSD
[die\D.fpg\DFPG|die\D.fsd\DFSD|die\D.fsg\DFSG|der\D.mpg\DMPG|das\D.mpg\DMPG|der\D.msn\DMSN|die\Ncfpg\NCFPG|die\Ncfsd\NCFSD|die\Ncfsg\NCFSG|das\Ncmpg\NCMPG]
1.2.1.1.2.2.3.1.3.1\26	TOK	Kommission	NCFSD
[Kommission\Ncfsa\NCFSA|Kommission\Ncfsd\NCFSD|Kommission\Ncfsg\NCFSG|Kommission\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.3.1\37	TOK	tätigen	AG
[??\??\??]
1.2.1.1.2.2.3.1.3.1\45	TOK	Bediensteten	NCNPG
[Bedienstete\Ncfpg\NCFPG|Bedienstete\Ncmpg\NCMPG|Bedienstete\Ncmsn\NCMSN|Bedienstete\Ncnpg\NCNPG]
1.2.1.1.2.2.3.1.3.1\58	TOK	auf	SPS
[auf\Qv\QV|auf\Rg.\RG|auf\Sas\SA|auf\Sps\SPS]
1.2.1.1.2.2.3.1.3.1\62	TOK	Zeit	NCFSA
[Zeit\Ncfsa\NCFSA|Zeit\Ncfsd\NCFSD|Zeit\Ncfsg\NCFSG|Zeit\Ncfsn\NCFSN]
1.2.1.1.2.2.3.1.3.1\66	PUNCT	;	AAAA
[;\AAAA\AAAA]
                         )SENT	</S>	
                     	)PAR	</P>	
                         ]CHUNK	</DIV_Q>	
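
The two-line records shown above can be read back with a few lines of Python. This is a minimal sketch that assumes the columns of the first line are tab-separated, as they appear in the sample; the function names `parse_ambiguity_class` and `parse_record` are hypothetical and are not part of the Multext tools.

```python
def parse_ambiguity_class(line):
    """Split one bracketed line of lemma/MSD/tag readings into (lemma, msd, tag) triples."""
    body = line.strip()
    # Be tolerant of a missing bracket at either end of the line.
    if body.startswith("["):
        body = body[1:]
    if body.endswith("]"):
        body = body[:-1]
    return [tuple(reading.split("\\")) for reading in body.split("|")]


def parse_record(first_line, second_line):
    """Combine the token line and its ambiguity-class line into one dictionary."""
    ref, seg_class, text, tag = first_line.rstrip("\t\n").split("\t")[:4]
    return {
        "ref": ref,            # reference, e.g. 1.2.1.1.2.2.3.1.1.1\1
        "class": seg_class,    # segmenter class (TOK, PUNCT, ENUM ...)
        "text": text,          # textual element
        "tag": tag,            # tag chosen by the tagger
        "readings": parse_ambiguity_class(second_line),
    }


record = parse_record(
    "1.2.1.1.2.2.3.1.1.1\\1\tTOK\tBetrifft\tVMIP3S",
    "[betreffen\\Vaip3s\\VAIP3S|betreffen\\Vmip3s\\VMIP3S|betreffen\\Voip3s\\VOIP3S]",
)
```

A simple sanity check on such records is that the chosen tag is one of the tags listed in the ambiguity class.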

4. Part of Speech Tagging for Spanish

4.1. Introduction

This part addresses the work done to adapt the Multext Tools for Part-of-Speech Tagging (developed by ISSCO, Armstrong et al., 1995) to Spanish, in order to tag the Spanish part of the Multext/MLCC corpus. It gives a brief introduction to the model implemented by the Multext HMM tagger and describes the experiments with different parametrizations that preceded the final version of the tool. Initial decisions, such as the tagset, the lexicon and the training procedure, are also discussed. Finally, results are presented and compared with other documented attempts.

In its Technical Annex, Multext defined the role of a PoS disambiguator as a tool based on a Markov model that selects the most plausible analysis on the basis of the local context, using statistical generalizations. It is acknowledged in the literature that the best way to obtain these statistical generalizations is to derive them automatically from a previously tagged training corpus. Multext was nevertheless forced to use a more complex method, since tagged corpora were not available for most of the Multext languages. This was the case for Spanish. The Technical Annex stated that the goal of this task was to build a tool based on this more complex method, which had yielded good results for English, and to experiment with languages of different morphological characteristics. The quality of the results would determine whether using only ambiguous corpus could be considered sufficient, or whether a corpus had to be hand corrected and used as a bootstrapping training corpus. In the latter case, larger annotated corpora and better disambiguators would be created cyclically through tagging, manual correction and retraining.

From this starting point, the goal of the experiment can be summed up as checking the performance of, and the differences between, the two kinds of training. The conclusion after testing is that training with disambiguated corpus is the best strategy. However, when no hand-tagged corpus is available, the tools proposed by MULTEXT and the suggested methodology are a good way to obtain a tagged corpus. The experiments show that good results can be achieved, comparable to those of other tools for Spanish.

4.2. Tools for Part-of-Speech Tagging in Multext

The tagger for Part-of-Speech tagging for Multext is a program that takes as input a sequence of words annotated with one or more tags and returns the most likely tag for each word in the text. It is based on a Hidden Markov Model and the process is performed in two steps: a training phase to estimate the parameters of the model and a tagging phase to select the most probable tags according to this model employing the Viterbi algorithm.

The underlying Hidden Markov Model is extensively documented in the literature (Rabiner, 1989; Cutting et al., 1992). We sum up here the aspects that bear on adapting the tool to a given language, in this case Spanish.

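Formally, an HMM is a 5-tuple <S, C, A, B, Pi> where S is the set of states, C is the set of observation symbols, A is the matrix of state transition probabilities, B is the matrix of symbol emission probabilities, and Pi is the vector of initial state probabilities.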

For the PoS case, the "set of states" S consists of the tags (labels) referring to the grammatical category assigned to words. The "observation symbols" C are the ambiguity classes: the combinations of plausible tags that a given word can be assigned lexically (according to the tagset). The A matrix is the Language Model, a probabilistic model of tag sequences. The B matrix is the Communication Model (which in the Multext implementation is a simplification of the initial HMM); it states the probability that a given tag generates each of the ambiguity classes. The Pi vector records the probability that a given tag occurs at the beginning of a sentence.
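
To make the roles of A, B and Pi concrete, here is a minimal Viterbi sketch over ambiguity classes. It is a textbook formulation written for illustration, not the Multext implementation; the tags `DET`, `N`, `V` and all probabilities are invented for the example, and plain products are used instead of log probabilities, which is only adequate for short sentences.

```python
def viterbi(classes, A, B, Pi):
    """Return the most probable tag sequence for a sentence given as ambiguity classes.

    classes: list of tuples of candidate tags (the observation symbols C)
    A[t1][t2]: probability of moving from tag t1 to tag t2
    B[t][c]: probability that tag t generates ambiguity class c
    Pi[t]: probability that tag t starts a sentence
    """
    # best[t] = (probability of the best path ending in t, that path)
    best = {t: (Pi[t] * B[t][classes[0]], [t]) for t in classes[0]}
    for c in classes[1:]:
        step = {}
        for t in c:  # only tags allowed by the ambiguity class are considered
            prob, path = max(
                ((p * A[prev][t] * B[t][c], prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[t] = (prob, path + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy model: an unambiguous determiner followed by a noun/verb ambiguity.
det, nv = ("DET",), ("N", "V")
A = {"DET": {"N": 0.9, "V": 0.1}}
B = {"DET": {det: 1.0}, "N": {nv: 1.0}, "V": {nv: 1.0}}
Pi = {"DET": 0.8, "N": 0.1, "V": 0.1}
tags = viterbi([det, nv], A, B, Pi)
```

In this toy case the transition probabilities alone decide the outcome, since both candidate tags generate the observed class with the same probability.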

4.3. Training

The Multext training module takes as input sequences of ambiguity classes (that is, it can use ambiguous text) and uses the Baum-Welch algorithm for parameter re-estimation to produce a trained Hidden Markov Model. The tool allows several iterations to be performed in order to optimize the model. To readjust the model during the training phase, the tool can take into account user-defined "transition biases". These specify values that increase or decrease the probability of a transition between two tags.
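
The following sketch shows one Baum-Welch re-estimation step on ambiguous input, restricted to the A matrix. It is the textbook forward-backward formulation, given here only as an illustration and not as the algorithm actually coded in the Multext tool; all names (`forward_backward`, `reestimate_A`) and the toy data are assumptions.

```python
def forward_backward(classes, A, B, Pi):
    """Return the forward (alpha) and backward (beta) tables for one sentence."""
    n = len(classes)
    alpha = [{} for _ in range(n)]
    beta = [{} for _ in range(n)]
    for t in classes[0]:
        alpha[0][t] = Pi[t] * B[t][classes[0]]
    for i in range(1, n):
        for t in classes[i]:
            alpha[i][t] = B[t][classes[i]] * sum(
                alpha[i - 1][p] * A[p][t] for p in classes[i - 1])
    for t in classes[-1]:
        beta[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for t in classes[i]:
            beta[i][t] = sum(
                A[t][q] * B[q][classes[i + 1]] * beta[i + 1][q]
                for q in classes[i + 1])
    return alpha, beta


def reestimate_A(sentences, A, B, Pi, tags):
    """One Baum-Welch step: expected transition counts, normalised per row."""
    counts = {p: {q: 0.0 for q in tags} for p in tags}
    for classes in sentences:
        alpha, beta = forward_backward(classes, A, B, Pi)
        total = sum(alpha[-1].values())  # probability of the whole sentence
        if total == 0.0:
            continue
        for i in range(len(classes) - 1):
            for p in classes[i]:
                for q in classes[i + 1]:
                    counts[p][q] += (alpha[i][p] * A[p][q]
                                     * B[q][classes[i + 1]] * beta[i + 1][q]) / total
    new_A = {}
    for p in tags:
        row_sum = sum(counts[p].values())
        # Keep the old row when a tag was never seen as a left context.
        new_A[p] = ({q: counts[p][q] / row_sum for q in tags}
                    if row_sum else dict(A[p]))
    return new_A


# Hypothetical toy data: one unambiguous and one ambiguous two-word sentence.
tags = ["DET", "N", "V"]
det, n, nv = ("DET",), ("N",), ("N", "V")
A = {p: {q: 1.0 / 3 for q in tags} for p in tags}       # equiprobable start
B = {"DET": {det: 1.0}, "N": {n: 0.5, nv: 0.5}, "V": {nv: 1.0}}
Pi = {"DET": 1.0, "N": 0.0, "V": 0.0}
new_A = reestimate_A([[det, n], [det, nv]], A, B, Pi, tags)
```

Repeating the call with `new_A` in place of `A` corresponds to performing further iterations.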

The tool also allows an initial model to be provided. This model can be estimated from some amount of hand-tagged corpus using relative frequency analysis tools. Note, though, that the Multext tool only allows the A matrix of the model to be estimated; it offers the user no means of estimating or readjusting the B matrix. This means that no preference can be expressed for the probability of a tag displaying one particular class rather than another.
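
Relative frequency estimation of an initial A matrix from hand-tagged text can be sketched as below. The function name `estimate_A` and the three short tag sequences are invented for the example; only the tag names are taken from the Spanish listings in this document.

```python
from collections import Counter, defaultdict


def estimate_A(tagged_sentences):
    """Estimate transition probabilities as relative frequencies of tag bigrams."""
    bigrams = defaultdict(Counter)
    for tags in tagged_sentences:
        for prev, nxt in zip(tags, tags[1:]):
            bigrams[prev][nxt] += 1
    return {prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
            for prev, row in bigrams.items()}


# Hypothetical hand-tagged sentences, given as tag sequences only.
corpus = [["TIFS", "NCFS", "VMIP3S"], ["TIFS", "AFS", "NCFS"], ["TIFS", "NCFS"]]
A = estimate_A(corpus)
```

Transitions that never occur in the hand-tagged text simply receive no entry here; a real model would need some form of smoothing for them.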

All these user defined estimations are part of the tuning process to improve the model, as we will explain later.

4.3.1. Tuning the model

The methodology during experimentation was mainly driven by the general approach of the project. As foreseen, we had no hand-tagged corpus available to start training. Several rounds of parameter estimation and hand validation of the results, which were then reused for training, therefore took place. During this process all the means offered by the tool to readjust and improve the model were used (except the precision flags, which we could not run properly). The tagset was also taken into account, in order to test whether the number and quality of the labels used had an impact on the results.

The project also included the development of a lexicon containing morphosyntactic features based on the EAGLES (Expert Advisory Group on Language Engineering Standards) model, and a corresponding tagset to be used by the PoS tagger. The lexicon is a full-form word list built from about 15,000 lemmas. The tagset for each language had to be defined during the project, through step-wise refinement.

The initial Spanish tagset was a simplification of the lexical information, based on EAGLES standards, used to describe lexical items. Major morphosyntactic categories were taken into account, as well as detailed information carried by lexical items (e.g. tense distinctions, type of determiners or pronouns, etc.). This first version was later reviewed in the light of the idea that the tagset was to be used only for disambiguation purposes. Since lexical information can be recovered after disambiguation, it was advisable to reduce the number of tags. Thus a first set of 259 tags was used, which was reduced to 107 tags in a second round of experimentation.

Reducing the tagset is also part of tuning the language model, since the Baum-Welch algorithm calculates the language model matrix (the A matrix) on the basis of the non-ambiguous bigrams found in the text. When training with ambiguous text, the large number of ambiguities that Spanish words enter into leaves very little information: very few non-ambiguous tags can be taken into account. Reducing the ambiguity classes, so that occurrences of formerly ambiguous words are added to those of non-ambiguous ones, can influence the calculation of the matrix.
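
The effect of tagset reduction on the number of non-ambiguous bigrams can be illustrated with a small sketch. The tag `VMIS3S`, the reduced tag `VM3S` and the mapping function are hypothetical; they are not the actual 259-tag or 107-tag sets.

```python
def count_unambiguous_bigrams(sentences, reduce_tag=lambda tag: tag):
    """Count adjacent word pairs whose ambiguity classes each contain a single tag."""
    count = 0
    for sentence in sentences:
        reduced = [{reduce_tag(t) for t in cls} for cls in sentence]
        for left, right in zip(reduced, reduced[1:]):
            if len(left) == 1 and len(right) == 1:
                count += 1
    return count


def drop_tense(tag):
    """Hypothetical reduction: collapse main-verb tags that differ only in tense."""
    return "VM3S" if tag.startswith("VM") else tag


# One sentence of four words; the third word is ambiguous between two verb tags.
sentence = [{"TIFS"}, {"NCFS"}, {"VMIP3S", "VMIS3S"}, {"SP"}]
before = count_unambiguous_bigrams([sentence])
after = count_unambiguous_bigrams([sentence], drop_tense)
```

With the full tags only the first bigram is unambiguous; once the two verb readings collapse into one reduced tag, all three bigrams become usable evidence.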

On the other hand, reducing the tagset also influences the B matrix, as this tool relies on the hypothesis that, in the communication model, each ambiguity class has the same probability of being displayed by a given tag. That is, if our understanding is correct, when a given tag can generate different ambiguity classes, the maximum likelihood is taken for granted as the point of departure. As the final calculation takes into account the product of the probabilities given in the A and B matrices, it was sometimes difficult to inspect the results, as they do not correspond to what we would call a "language model".
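
One possible reading of this hypothesis is that each tag spreads its probability evenly over the ambiguity classes that contain it. The sketch below encodes that reading only; it is an interpretation made for illustration, and the function name `uniform_B` and the four classes are invented.

```python
def uniform_B(ambiguity_classes):
    """Give each tag the same emission probability for every class that contains it."""
    classes_of = {}
    for cls in ambiguity_classes:
        for tag in cls:
            classes_of.setdefault(tag, []).append(cls)
    return {tag: {cls: 1.0 / len(classes) for cls in classes}
            for tag, classes in classes_of.items()}


# Hypothetical inventory of ambiguity classes observed in a lexicon.
classes = [("NCFS",), ("NCFS", "VMIP3S"), ("SP", "VMIP3S"), ("SP",)]
B = uniform_B(classes)
```

Under this reading, merging tags changes how many classes each tag belongs to, and therefore changes every affected row of B.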

LEXICON

Inspecting the dictionary we can find the following ambiguity cases which will be considered when evaluating the results.

Ambiguities:

Special items:

Some of these ambiguities are also reported by others. The literature shows that the most problematic are: article-pronoun, the word que, noun-verb, and adjective-participle (cf. Sanchez Leon and Nieto, 1995; Marquèz and Rodriguez, 1995; Moreno-Torres, 1994). Besides, given that the Multext tools allow no preferences on lexical probability (that is, no symbol biasing), these special items, which are ambiguous but occur far more frequently in one category than in the others, were considered a crucial point in the experimentation. For instance, the word para is far more frequent as a preposition than as a verb. To illustrate the results on these special items, we present the following figures:


Test corpus: 040
Number of words: 31327
ya: 23 occurrences; 6 errors
para: 207 occurrences; 0 errors
uno: 10 occurrences; 3 errors

TRAINING AND BIASING

Due to the general approach of the project, a-priori biasing was a must in the first experiments, when only ambiguous corpus was available. Some of the initial decisions were based on pure linguistic intuitions, such as co-occurrence restrictions (a preposition will never appear before a tensed verb) or grammatical information (such as agreement between determiners, adjectives and nouns). This information turned out to be useful on a statistical basis as well. The first listing below is one of the biases used. The second is an extract of the A matrix for the same cases, from the experiment with ambiguous corpus + bias. The third is the A matrix from the experiment with non-ambiguous corpus. Note that the same kind of assumptions are made: transitions to nouns or adjectives in agreement with the article are reinforced (this fact led us to keep different tags stating gender and number in the reduced tagset).

TIFS NCFS +8

NCF       +8
NCS       +8
NPFS      +8
NPS       +8
A         +5
AFS       +5
AS        +5
!OTHER    -8

%% ambiguous text + biasing
%% more probable transitions are marked

State: TIFS

A          8.4e-02 *
AFP        3.4e-04
AFS        8.4e-02 *  
AMP        3.4e-04  
AMS        3.4e-04  
AP         3.4e-04  
AS         8.4e-02 *   
CC         3.4e-04  
CS         3.4e-04  
DDFP       3.4e-04  
DDFS       3.4e-04  
../..
M          3.4e-04  
MFP        3.4e-04  
MFS        3.4e-04  
MMP        3.4e-04  
MMS        3.4e-04  
MP         3.4e-04  
NCF        1.3e-01 *   
NCFP       3.4e-04  
NCFS       1.3e-01 *  
NCM        3.4e-04  
NCMP       3.4e-04  
NCMS       3.4e-04  
NCP        3.4e-04  
NCS        1.3e-01 *  
NPFP       3.4e-04  
NPFS       1.3e-01 *  
NPM        3.4e-04  
../..

%% Unambiguous corpus
%% more probable transitions are marked

State: TIFS

AFS        8.4e-02 *  
AS         3.7e-02 *  
CC         4.9e-03  
DIFS       7.4e-03  
MFS        9.8e-03  
NCF        7.4e-03  
NCFP       2.5e-03  
NCFS       8.0e-01 *  
NCMS       2.5e-03  
NCS        4.9e-03  
NPFS       4.9e-03  
SP         2.5e-02  
VMIP3S     4.9e-03  
WPUNCT(    2.5e-03  

Biasing turned out to be a complex exercise, mainly for two reasons. First, even with our reduced tagset, the large number of ambiguity classes and transitions made it difficult to inspect the data. Second, it took us a while to understand how the B matrix influenced transition biasing.
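
To give an idea of how a bias listing such as the one for TIFS could act on a row of the A matrix, here is a minimal sketch. The scaling rule is an assumption: the source does not state how the tool converts a bias value into a probability change, so the exponential factor `base ** bias`, the function name `apply_biases` and the four-tag row are all hypothetical.

```python
def apply_biases(row, biases, other=0, base=2.0):
    """Scale each transition by base**bias and renormalise the row.

    ASSUMPTION: exponential scaling is chosen for this sketch only; it is not
    documented behaviour of the Multext tagger.
    """
    scaled = {tag: p * base ** biases.get(tag, other) for tag, p in row.items()}
    total = sum(scaled.values())
    return {tag: p / total for tag, p in scaled.items()}


# Hypothetical equiprobable row for the state TIFS over four successor tags.
row = {tag: 0.25 for tag in ["NCFS", "AFS", "CC", "VMIP3S"]}
# Biases in the style of the listing: +8 for NCFS, +5 for AFS, -8 for all others.
biased = apply_biases(row, {"NCFS": +8, "AFS": +5}, other=-8)
```

Whatever the exact rule, the row has to be renormalised after biasing so that it remains a probability distribution.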

4.3.2. Training with non-ambiguous texts

For testing purposes we used the hand-validated corpus to train a new matrix. It soon turned out that the results were better than those obtained via successive rounds of supervised training.

4.4. Experimentation

Experimentation was done with the following corpora:

  • Extracts from the Journal of the European Commission, Written Questions (1993). This data is provided for the use of the MULTEXT project and is COPYRIGHT to the Office of Publications of the European Community (OPOCE). These fragments are named 006, 016 and 081 in the experimentation tables.
  • The issues of the Spanish newspaper "El Sur" (Malaga, Spain) corresponding to one week (from April and September 1991). This corpus has been distributed by the Multilingual Corpora for Cooperation Centre in Edinburgh from the European Corpus Initiative (ECI) corpus. It is named misc2 in the following tables.
  • Several issues of the Sunday magazines published by the Spanish newspapers "La Vanguardia", "El Pais" and "ABC" during 1994. This corpus has been provided by UAB and is named misc1 in the tables below.

    The next table summarizes the statistics of the corpora used:

    Set     Corpus      Tokens   Sents   Tags    Unknowns       Ambi. tokens      Tags/Words
    Train   006+misc    41900    1576    53450   72 (1.13%)     10215 (24.37%)    1.28
    Test    016         18019    621     23110   216 (1.20%)    4530 (25.13%)     1.28
    Test    081         14840    420     19157   98 (0.66%)     3789 (25.36%)     1.28
    Test    misc2       1483     44      1889    77 (5.19%)     362 (24.42%)      1.27

    Experimentation was based on the error rate obtained in the different setups. The general idea was to try different initial models and to complement them with biasing in a step-wise manner. When a round gave no better results than the previous one, no more trials were pursued in that scenario. It should also be noted that experimentation was done on a reduced corpus, mainly because of the methodology followed in the project. As already said, Spanish was one of the cases where no tagged corpus was available at the beginning, and an incremental methodology was used to increase the size of the resources. Nevertheless, as the tables show, the last round of tagging, done with a model derived from the larger hand-validated corpus, does not show a remarkable improvement.

    In section 5 we present the tables of results of the experimentation. They reflect the different rounds with different setups. Each was tested with the 107-tag tagset and the 259-tag tagset. The biases devised were applied separately so that their impact could be compared. The labels of the "Train" setup correspond to the following scenarios.

    4.4.1. Scenario eq: Equiprobable matrix

    In order to have a reference starting point, an equiprobable matrix was used as the initial model.

    4.4.2. Scenario bw: No initial model

    No initial matrices were supplied, but Baum-Welch parameter re-estimation was used. It is interesting to see how iteration negatively affects the behaviour of the tagger. Up to three iterations were done in the first experiment. As the results showed no impact, only one iteration was done in later experiments, so that results remain comparable. As the tables show, even when complemented with biases, results can be worse than with the equiprobable matrix. Hence it does not seem advisable to use this parameter estimation alone when working with ambiguous text.

    It is also noticeable that a smaller tagset seems to work better in these conditions. This is mainly due to the increase in unambiguous bigrams when verbosity is reduced. However, this difference is reduced when biasing is used.

    4.4.3. Scenario ini: Training with ambiguous input

    This scenario was the setup envisaged for the project. In the absence of unambiguous corpus, text after lexical lookup is used to create an initial A matrix. Baum-Welch is then used, through several iterations, to re-estimate the parameters. Initial biasing of the A matrix is also recommended in order to tune the model.

    As in the previous setup, the larger tagset creates some difficulties. However, these problems are minimized when biases are used to tune the matrix; comparable results are then obtained in both cases. As for parameter re-estimation, the larger tagset seems to improve over the iterations, while the smaller one just loses predictive power. In any case, the best results, in comparable absolute figures, are achieved when transition adjustments are used.

    4.4.4. Scenario initag: Training with ambiguous and unambiguous input

    In this scenario, an initial model was created using all the available corpus: the ambiguous texts used in the previous experiments, and the same texts after hand correction. The point was to see the impact of unambiguous transitions.

    As the table shows, the results are rather clear. The success rate increases considerably, and the impact of the tagset size and of the transition biases is reduced.

    4.4.5. Scenario frq: Training with unambiguous input and frequency estimation

    After the results of the previous scenario, a new one was prepared using non-ambiguous corpus and adding the facility for frequency estimation. This turned out to be the best strategy, as already documented in the literature on PoS tagging with HMMs. Again, neither the size of the tagset nor the use of a-priori biasing had any real impact. Iterations also proved counterproductive.

    4.5. Conclusions

  • Training with disambiguated corpus is the best strategy. Results can be summarized with the following figures:

    Testing corpus: 081
    Training corpus words: 41,900
    Testing scenario: initial model: ambiguous corpus + biasing
    Error rate: 2.93%
    Testing scenario: initial model: disambiguated corpus
    Error rate: 1.13%

  • The size of the tagset does not seem to be a crucial factor, as the following figures show:

    Testing corpus: 081
    Training corpus words: 41,900
    Testing scenario: initial model: ambiguous corpus + biasing + 259 tags
    Error rate: 2.67%
    Testing scenario: initial model: ambiguous corpus + biasing + 107 tags
    Error rate: 2.93%
    Testing scenario: initial model: disambiguated corpus + 259 tags
    Error rate: 1.69%
    Testing scenario: initial model: disambiguated corpus + 107 tags
    Error rate: 1.54%

  • The error rate can vary substantially depending on the text tested. The best result achieved with text never seen before by the tool, although belonging to the same text type, is an error rate of 0.77% (text 051; words: 10865; errors: 84).
  • The analysis of the most frequent mistakes in this best-scoring text is as follows. The word 'que', with two possible tags (relative pronoun and subordinating conjunction), is the most frequent error (37.5% of the errors detected; 30 cases out of 200 occurrences, that is, a 15% error rate in disambiguating 'que'). The most common context is after an indirect object of a completive verb. It seems that the occurrence of 'que' after a noun triggers the mistake. Hence no solution is envisaged within this framework, as a larger context or some information on the subcategorization frame would be needed.

    The second most common error has to do with disambiguating between noun and adjective. This can be considered a special feature of Spanish, where almost all adjectives can be nominalised with an article and, furthermore, many words have two different readings (adjectival and nominal). This is the case for público ('public': audience and status), útil ('tool' vs. 'useful'), etc. In the tested file (051), this error accounted for about 17.85% of the errors detected. Comparison with other PoS taggers for Spanish is not possible, because no concrete figures are given except in one of the three papers we have taken into account. Marquèz and Rodriguez (1995) report 89.67% success in disambiguating que. It has to be taken into account that they worked with a decision-tree formalism that takes a broader context into consideration. Sanchez Leon and Nieto (1995) report 96.8% overall success, but they avoided disambiguating the different readings of the word que by creating a special tag for it. Moreno-Torres (1994) reports a failure in trying to disambiguate this case, but supplies no concrete figures.

    Another rather frequent error involves the item ya, which can be either a conjunction or an adverb. Both occur in similar contexts, so disambiguation could only be achieved if symbol biasing were possible, because ya occurs much more frequently as an adverb than as a conjunction.

    Following contract duties, all the files delivered (200K words) have been manually corrected and validated.
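
The percentages quoted for text 051 in the conclusions follow directly from the raw counts. The short sketch below only restates that arithmetic; the function name `error_rate` is an assumption.

```python
def error_rate(errors, words):
    """Return the error rate as a percentage of the items tested."""
    return 100.0 * errors / words


overall = error_rate(84, 10865)  # text 051: 84 errors in 10865 words
que = error_rate(30, 200)        # 'que': 30 errors in 200 occurrences
```

Rounded to two decimals, `overall` gives the 0.77% quoted for text 051, and `que` gives the 15% quoted for the word 'que'.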

    4.6. References

    Armstrong, S., Bouillon, P., and Robert, G. (1992). Tools for Part-of-Speech Tagging. Draft Version - Work in Progress, ISSCO, Geneva. MULTEXT PROJECT

    Church, K.W. and Mercer, R.L. (1993). Introduction to the Special Issue on Computational Linguistics Using Large Corpora. Computational Linguistics, 19(1):1-24.

    Cutting, D., Kupiec, J., Pedersen, J. and Sibun, P. (1992). A Practical Part-of-Speech Tagger. In Third Conference on Applied Natural Language Processing.

    Elworthy, D. (1994). Does Baum-Welch Re-estimation Help Taggers? In Proceedings of the 4th Conference on Applied Natural Language Processing (ACL 1994), pages 53-58, Stuttgart.

    Gale, W. and Church, K. (1994). What is wrong with adding one? In Corpus-Based Research into Language. N. Oostdijk and P. de Haan, Rodopi, Amsterdam.

    Magerman, D. (1995). Everything You Always Wanted to Know about Probability Theory, but Were Afraid to Ask. Corpus-Based Models of Language Processing, R. Bod and R. Scha, ESSLLI'95 Reader.

    Màrquez, L. and Rodríguez, H. (1995). Towards Learning a Constraint Grammar from Annotated Corpora Using Decision Trees. Technical report, Universitat Politècnica de Catalunya, Spain.

    Moreno-Torres, I. (1994). A Morphological Disambiguation Tool (MSD): Application to Spanish. Technical Report, Depto. de Lenguajes y CC, Facultad de Informática, Universidad de Málaga, Spain. ACQUILEX II, Working Paper 24.

    Rabiner, L.R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286.

    Sanchez, F. (1994). Spanish tagset for the CRATER project. Technical report. CRATER, Internal Document.

    Sanchez Leon, F. and Nieto, A.F. (1995). Desarrollo de un etiquetador morfosintáctico para el español. Procesamiento del Lenguaje Natural, 17.


    5. Part of Speech Tagging for Italian

    No documentation available yet.


    You are invited to send comments and feedback to multext@lpl.univ-aix.fr.


    Copyright © Centre National de la Recherche Scientifique, 1996.
    This page will undergo frequent modification. Therefore, please do not mirror this page.