Language Modeling Department
Director: Joseph Olive
Lucent Technologies - Bell Labs Innovations
 

Phrase extractors for tagging aimed at meaning extraction


Corpus after |parts |np : wsj9_00*

Corpus after tagger : wsj_000*.mrg

Tags and frequency
suffix

Semantic

Rules pattern

The programs are in /u/s_gael/program,
the corpus is in /u/s_gael/corpus,
and the results are in /u/s_gael/result.
 
 

Corpus after Parts-of-Speech and noun phrase extraction

Before running |parts |np, all tags must be stripped from the corpus: perl clean.pl
The result is a set of articles as plain text files, which are piped through |parts |np to produce articles ready for extraction.
Programs for NP extraction:
perl simplenp.pl : returns noun phrases stripped of adjectives and/or articles, in five forms:
- Entire NP
- Without articles
- Without adjectives
- With neither articles nor adjectives
- Nouns only
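
The five outputs listed above can be pictured with a minimal Python sketch. This is not the original simplenp.pl: it assumes a noun phrase is already available as a list of (word, tag) pairs with Penn Treebank part-of-speech tags, and the name np_variants and the three tag sets are hypothetical.

```python
# Hypothetical sketch, not the original simplenp.pl.
ARTICLE_TAGS = {"DT"}                      # assumption: articles are tagged DT
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}      # assumption: adjective tags
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}   # assumption: noun tags

def np_variants(np):
    """Return the five views of one noun phrase given as (word, tag) pairs."""
    def keep(predicate):
        return " ".join(word for word, tag in np if predicate(tag))
    return {
        "entire": keep(lambda t: True),
        "no_articles": keep(lambda t: t not in ARTICLE_TAGS),
        "no_adjectives": keep(lambda t: t not in ADJECTIVE_TAGS),
        "no_articles_no_adjectives": keep(
            lambda t: t not in ARTICLE_TAGS | ADJECTIVE_TAGS),
        "nouns_only": keep(lambda t: t in NOUN_TAGS),
    }

example = [("this", "DT"), ("British", "JJ"),
           ("industrial", "JJ"), ("conglomerate", "NN")]
variants = np_variants(example)
```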
 

perl complexnp.pl : returns complex noun phrases built by combining simple noun phrases linked by a preposition, a conjunction, a verb, attribution (be), or subordination:
- preposition
- conjunction
- preposition and conjunction
- verbs
- be
- subordination
- all of the previous
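
The combination step can be sketched as follows, for the preposition and conjunction cases only. This is an illustration under assumptions, not the original complexnp.pl: the input is taken to be a flat list of chunks, each either ("NP", text) for a simple noun phrase or (pos_tag, word) for any other token, and complex_nps and the link tag sets are hypothetical names.

```python
# Hypothetical sketch, not the original complexnp.pl.
PREPOSITION_TAGS = {"IN", "TO"}   # assumption: tags that count as prepositions
CONJUNCTION_TAGS = {"CC"}         # assumption: tags that count as conjunctions

def complex_nps(chunks, link_tags):
    """Join runs of simple NPs connected by a word whose tag is in link_tags."""
    results, current, pending = [], [], None

    def flush():
        if len(current) > 1:                  # at least NP + link + NP
            results.append(" ".join(current))
        current.clear()

    for tag, text in chunks:
        if tag == "NP" and current and pending is not None:
            current.extend([pending, text])   # extend the running combination
            pending = None
        elif tag == "NP":
            flush()
            current.append(text)              # start a new combination
            pending = None
        elif tag in link_tags and current and pending is None:
            pending = text                    # remember the linking word
        else:
            flush()                           # anything else breaks the chain
            pending = None
    flush()
    return results

chunks = [("NP", "the chairman"), ("IN", "of"), ("NP", "the board"),
          ("VBD", "said"), ("NP", "profits"), ("CC", "and"),
          ("NP", "sales"), ("VBD", "rose")]
by_preposition = complex_nps(chunks, PREPOSITION_TAGS)
by_both = complex_nps(chunks, PREPOSITION_TAGS | CONJUNCTION_TAGS)
```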

perl occ.pl : returns statistics on those results. For example, on the output with neither articles nor adjectives, it gives for each noun its occurrences over the whole database: the total on its own, the total as part of a compound noun, and the articles and sentences in which it appears. Result in: occurrences

perl occ2.pl : returns only the total number of occurrences in the entire database. It is faster, and it can help find relevant semantic tags from words, not only from syntactic tags. Result in: occ2
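
The counts described above (a noun on its own, a noun inside a compound, and the overall total) can be sketched in Python. The input format is an assumption: a flat list of extracted noun phrases as strings. The function name noun_occurrences is hypothetical, and the sketch leaves out the article and sentence locations that occ.pl also reports.

```python
from collections import Counter

def noun_occurrences(noun_phrases):
    """Count each word alone, inside a longer phrase, and in total."""
    alone, in_compound = Counter(), Counter()
    for phrase in noun_phrases:
        words = phrase.lower().split()
        target = alone if len(words) == 1 else in_compound
        target.update(words)
    return alone, in_compound, alone + in_compound

alone, in_compound, total = noun_occurrences(
    ["market", "stock market", "market share", "share"])
```

The third counter corresponds to what occ2.pl is described as returning: totals only.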
 
 

Corpus after tagger

The corpus is already tagged with complex tags such as NP-SBJ (noun phrase subject), which carry syntactic information.
To see the text, use parsetotext.pl.
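
Recovering the text from a bracketed parse can be sketched as below. This is not the original parsetotext.pl: it assumes every word sits in a leaf of the form (TAG word) and that empty elements are tagged -NONE-, as in the Penn Treebank .mrg files; parse_to_text and the sample tree are illustrative.

```python
import re

# A leaf is "(TAG word)" with no nested parenthesis inside it.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def parse_to_text(tree):
    """Return the words of a bracketed parse, skipping -NONE- elements."""
    return " ".join(word for tag, word in LEAF.findall(tree)
                    if tag != "-NONE-")

tree = ("(S (NP-SBJ (DT the) (NNP Dutch) (VBG publishing) (NN group)) "
        "(VP (VBD rose)) (. .))")
text = parse_to_text(tree)
```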

headtag.pl : returns the first 2 lines of each item. It gives a simple view of 1- or 2-item matches and of the number of occurrences, whether far apart or close together in the text. Result in: headtag

searchtag.pl : returns, from wsj_000*.mrg, the structure and content of the complex tags in the first x sentences of each article.
The NP-SBJ of the first sentence of each article generally gives information about the source of the article and the person(s) or institution(s) involved.
The VP of the first sentence of each article generally gives information about the event developed further in the article.
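
Pulling the NP-SBJ and VP constituents out of a sentence can be sketched with a small parenthesis matcher. This is an illustration, not the original searchtag.pl: it assumes balanced brackets, and the function name constituents and the sample tree are hypothetical.

```python
def constituents(tree, label):
    """Return the bracketed substrings of tree whose label is exactly label."""
    found, stack = [], []
    for i, ch in enumerate(tree):
        if ch == "(":
            stack.append(i)                  # remember where a node opens
        elif ch == ")":
            start = stack.pop()              # assumes balanced brackets
            node = tree[start:i + 1]
            node_label = node[1:].split(None, 1)[0].rstrip(")")
            if node_label == label:
                found.append(node)           # inner nodes are found first
    return found

tree = ("(S (NP-SBJ (NNP Pierre) (NNP Vinken)) "
        "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .))")
subjects = constituents(tree, "NP-SBJ")
verb_phrases = constituents(tree, "VP")
```

Nested constituents are returned innermost first, so the last VP in the list is the outermost one.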

searchtag2.pl : returns, from the results of searchtag.pl, the internal elements we can use to build rules about the content of this information, for example who is concerned, by searching for NNP. Result in: searchtag2

To search for tags co-occurring in the same area of text:
autour.pl > autour
autour2.pl > autour2
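
One possible reading of "the same area of text" is a fixed window of neighbouring tags. The sketch below follows that assumption; it is not the original autour.pl, and cooccurring_tags and its window parameter are hypothetical.

```python
import re
from collections import Counter

def cooccurring_tags(tree, target, window=3):
    """Count the tags within `window` tags on either side of each target tag."""
    tags = re.findall(r"\(([^\s()]+)", tree)   # every opening tag, in order
    counts = Counter()
    for i, tag in enumerate(tags):
        if tag == target:
            before = tags[max(0, i - window):i]
            after = tags[i + 1:i + 1 + window]
            counts.update(before + after)
    return counts

tree = "(S (NP-SBJ (DT the) (NN group)) (VP (VBD rose)))"
around_nn = cooccurring_tags(tree, "NN", window=1)
```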

For statistics about an item (occurrences in the whole database, per article, and per sentence): perl stat.pl
Result in: stat
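
The three levels of counting can be sketched as follows. The data layout is an assumption (a list of articles, each a list of sentences, each a list of tokens), and item_stats is a hypothetical name; the original stat.pl may organise its output differently.

```python
def item_stats(articles, item):
    """Count an item per sentence, per article, and over the whole database."""
    per_sentence = [[sentence.count(item) for sentence in article]
                    for article in articles]
    per_article = [sum(counts) for counts in per_sentence]
    return {
        "total": sum(per_article),
        "per_article": per_article,
        "per_sentence": per_sentence,
    }

articles = [
    [["the", "group", "said"], ["the", "group"]],   # article 1, two sentences
    [["a", "group"]],                               # article 2, one sentence
]
stats = item_stats(articles, "group")
```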

To quickly view the syntactic shape of a text, use parsetotag.pl.
It returns, in parsetotag, the structure stripped of its words.
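
Stripping the words while keeping the bracket structure can be sketched with one substitution. As before, this assumes leaves of the form (TAG word) and is not the original parsetotag.pl; parse_to_tags is a hypothetical name.

```python
import re

def parse_to_tags(tree):
    """Drop the words from a bracketed parse, keeping only the tag structure."""
    return re.sub(r"\(([^\s()]+) [^\s()]+\)", r"(\1)", tree)

tree = "(S (NP-SBJ (DT the) (NN group)) (VP (VBD rose)))"
shape = parse_to_tags(tree)
```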

The program searchstring.pl returns, in searchstring, the lines containing the string, with its occurrences as a whole word, prefix, suffix, or infix.
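
The four kinds of match can be told apart with a few string tests. This sketch assumes the text has already been split into words; match_kind is a hypothetical name, and "infix" here simply means that the string occurs strictly inside the word.

```python
def match_kind(word, string):
    """Classify how string occurs in word: word, prefix, suffix, infix or None."""
    if string not in word:
        return None
    if word == string:
        return "word"
    if word.startswith(string):
        return "prefix"
    if word.endswith(string):
        return "suffix"
    return "infix"

kinds = [match_kind(w, "publish")
         for w in ["publish", "publishing", "republish", "republished", "group"]]
```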
 
 

Tags frequency



To use a tag in a rule, a tag of average (mid-range) frequency is generally more relevant. The frequency with which such tags combine will be a criterion for building the rules.
Tags and their frequency over 100 articles:
Tag count:     33032
Tag one occurrence:   627
Count   Tag                 Description
4647    NP                  Noun phrase
2970    VP                  Verb phrase
2530    NN                  Common noun, singular or mass
1937    IN                  Preposition or subordinating conjunction
1743    DT                  Determiner
1734    S                   Sentence
1542    NP-SBJ              Noun phrase, subject
1447    NNP                 Proper noun, singular
1248    NNS                 Common noun, plural
1116    JJ                  Adjective
1057    PP                  Prepositional phrase
614     VB                  Verb, base form
551     RB                  Adverb
477     TO                  To
460     CC                  Coordinating conjunction
444     VBD                 Verb, past tense (-ed)
430     VBZ                 Verb, present tense, 3rd person singular (-s)
428     CD                  Cardinal number
428     VBN                 Past participle
413     SBAR                Subordinate clause
406     NONE                Ellipsis
378     PRP                 Personal pronoun
297     VBG                 Present participle, gerund
281     VBP                 Verb, present tense, other persons
226     ADVP                Adverbial phrase
217     PP-CLR              Circumstantial complement
212     MD                  Modal (all verbs without -s in the 3rd person)
206     PP-LOC              Complement of place
193     NONE-*-1
172     NP-SBJ-1            Subject different from that of the main clause
164     PRP$                Possessive pronoun
162     NONE-*T*-1
141     PP-TMP              Complement of time
135     ADJP-PRD            Adjectival phrase, subject complement
126     QP                  Quantity phrase or complement
121     ADVP-TMP            Adverbial phrase of time
119     ADJP                Adjectival phrase
117     POSS
110     NP-PRD              Noun phrase, subject complement
94      SBAR-ADV            Adverbial subordinate clause
92      JJR                 Comparative adjective: -er, more, less
92      WDT                 Relative pronoun: which, that (wh-determiner)
88      NONE-*U
86      NN%
81      $$
78      S-NOM               Verb-based noun phrase embedded in a prepositional phrase
65      NNPS                Proper noun, plural
65      NONE-*T*-2
64      WP                  Relative pronoun: who, what, whom (wh-pronoun)
63      WHNP-1
62      NONE-*-2
61      RBT
60      PRN
59      ADVP-MNR
53      NP-SBJ-2
53      PP-DIR
46      PRT
45      S-TPC-1
44      NP-LGS
40      SINV
39      WHNP-2
37      NP-TMP
37      RP                  Verbal particle
37      S-ADV
35      WHADVP-1
34      SBAR-TMP
34      WRB                 Interrogative adverb: how, where, why (wh-adverb)
28      JJS                 Superlative adjective
26      NONE-*-3
26      RRB--RRB
25      NP-SBJ-3
24      NONE-*T*-3
23      LRB--LRB
22      PP-MNR
21      RBR                 Comparative adverb: -er, more, less, later + ADJ
20      &&
20      ADVP-DIR            Adverbial phrase of direction
20      NX
19      NP-ADV
19      PP-PRD
18      SBAR-PRP
17      S-1
16      ADVP-LOC
16      CD000
16      EX                  Existential "there" ("there is")
16      S-PRP
16      SBAR-NOM
16      SQ
16      VBZS
15      NP-1
14      NONE-*ICH*-1
14      PP-PRP
12      NP-LOC
12      S-CLR               Circumstantial clause
12      WHNP
12      WHNP-3
11      NNP&P
11      S-PRD
11      SBARQ
10      S-TPC-2
10      WHADVP-2
9       NP-2
9       POS                 Possessive ending: 's
8       CC&
8       CD\/2
8       NONE-*ICH*-2
8       NONE-*RNR*-1
8       PP-LOC-CLR
8       PP-PUT
8       RBS                 Superlative adverb: most
8       S-2
8       S-HLN
8       SBAR-PRD
8       VBPRE
7       CD\/4
7       NNP-PACIFIC
7       NONE-*T*-4
7       NP-TMP-CLR
7       PP-DTV
6       PDT                 Predeterminer (e.g. all the world)
6       S-TPC-3
6       UCP
6       VBPVE
5       CD3
5       CD\/8
5       FRAG
5       INTJ
5       NNPD
5       NONE-*EXP*-1
5       S-NOM-SBJ
5       WHPP-1
4       ADVP-PRD
4       NNPSA
4       NP-EXT
4       PP-TMP-CLR
4       SBAR-1
4       SBAR-CLR
3       ADVP-CLR
3       ADVP-PRP
3       CD55
3       LS                  List item numbering
3       LST                 Numbered list
3       NAC
3       NNPC
3       NNPK
3       NONE-*-52
3       NP-SBJ-4
3       PP-1
3       VBPM
3       WHADVP-3
3       WHADVP-4
3       WHNP-4
3       WP$                 Possessive wh-pronoun
2       ADJP-ADV
2       CD07
2       CD25
2       CD4
2       CD64
2       CD95
2       CONJP
2       FW                  Foreign word
2       LRB--LCB
2       NAC-LOC
2       NNP-AMERICAN
2       NNP-MELLON
2       NNPBRIEN
2       NP-CLR
2       NP-HLN
2       NP-VOC
2       PP-DIR-2
2       PP-DIR-CLR
2       PP-EXT
2       RRB--RCB
2       S-3
2       S-MNR
2       S-PRP-CLR
2       SBAR-NOM-PRD
2       SBAR-NOM-SBJ
2       SBARQ-NOM
2       UCP-PRD
2       UH                  Exclamatory interjection
2       VP-1
2       WHPP
0       SYM                 Symbols
1       ADJP-2
1       ADJP-CLR
1       ADJP-TPC-1
1       ADVP-LOC-CLR
1       ADVP-PUT
1       ADVP|PRT
1       FRAG-ADV
1       FRAG-TTL-SBJ-1
1       JJ-BUSH
1       JJ-SPEAKER
1       JJ000
1       MDD
1       MDLL
1       NAC-TMP
1       NNP-BACHE
1       NNP-BUICK
1       NNP-CONTRA
1       NNP-DEFICIENCY
1       NNP-SCOTT-RODINO
1       NNP-SENATE
1       NNP-TOTE
1       NNP-TRACK
1       NNPA
1       NNPI
1       NNPJ
1       NNPY
1       NNP\/DEL
1       NNP\/FAWCETT
1       NNSDS
1       NP-3
1       NP-MNR
1       NP-SBJ-9
1       NP-TMP-HLN
1       NP-TTL
1       PP-2
1       PP-BNF
1       PP-DIR=2
1       PP-LOC-1
1       PP-LOC-CLR-TPC-1
1       PP-LOC-PRD
1       PP-LOC=1
1       PP-TMP-PRD
1       PP-TPC-1
1       RRC
1       S-NOM-PRD
1       S-SBJ
1       S-TPC-4
1       SBAR-2
1       SBAR-4
1       SBAR-ADV-3
1       SBAR-LOC
1       SBAR-MNR
1       SBAR-NOM-1
1       SINV-2
1       SINV-TPC-1
1       UCP-MNR
1       UCP-PRP
1       VP-TPC-1
1       WHADVP-5
1       WHPP-3
1       X-HLN
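
A frequency list of this kind can be produced by counting every opening tag in the bracketed parses. The sketch below is only an illustration: the script actually used to build the list above is not shown in these notes, and tag_frequencies is a hypothetical name.

```python
import re
from collections import Counter

def tag_frequencies(trees):
    """Count every opening tag (constituent or part of speech) in the parses."""
    counts = Counter()
    for tree in trees:
        counts.update(re.findall(r"\(([^\s()]+)", tree))
    return counts

trees = ["(S (NP-SBJ (DT the) (NN group)) "
         "(VP (VBD rose) (NP (DT a) (NN point))))"]
frequencies = tag_frequencies(trees)
```

Calling frequencies.most_common() then gives the tags sorted by decreasing count, in the same layout as the list above.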
 

SUFFIX
 
-PRD Subject complement (predicative)
-SBJ Subject
-TMP Temporal
-CLR Circumstantial
-ADV Adverbial
-LOC Locative (place)
-NOM Nominal
-DIR Directional
-MNR Manner
-2 Occurrence of a tag within a sentence, or type of tag
-HLN Headline


2141 : NP, suffixed

7914 : NP, suffixed and non-suffixed

2327 : VB, suffixed or not

549 : VB, non-suffixed

1778 : VB, suffixed

31623 tags

4165 punctuation marks
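
Suffixed and non-suffixed totals like the ones above can be derived by splitting each tag into its base, its function suffixes, and its numeric index. The sketch below is an assumption about how this could be done, not the method used to obtain the figures above; split_tag and suffixed_total are hypothetical names, and the sample counts are copied from the frequency list.

```python
def split_tag(tag):
    """Split 'NP-SBJ-1' into ('NP', ['SBJ'], '1')."""
    if tag.startswith("-"):              # e.g. -NONE- or -LRB-: leave untouched
        return tag, [], None
    base, *rest = tag.split("-")
    index = rest.pop() if rest and rest[-1].isdigit() else None
    return base, rest, index

def suffixed_total(frequencies, base):
    """Sum the counts of tags with this base that carry a suffix or an index."""
    total = 0
    for tag, count in frequencies.items():
        tag_base, suffixes, index = split_tag(tag)
        if tag_base == base and (suffixes or index is not None):
            total += count
    return total

frequencies = {"NP": 4647, "NP-SBJ": 1542, "NP-SBJ-1": 172, "VP": 2970}
np_suffixed = suffixed_total(frequencies, "NP")
```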
 

TAGS :

thema
event
statement
description
attribut
object
result
action
detail

sub-tags :
org-pers : organization or person (usually NNP, or DT + NNP (or JJ with a capital letter) + ...)
 

   (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))

   (NP (DT this) (JJ British) (JJ industrial) (NN conglomerate))
 

fact
cause
field
consequence
comments
nuance
affirmation
aim
definition
comparison
conclusion
exposure
event
frequency
quality
characteristic
actor
sources of information [info source]
declaration
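
The org-pers pattern given in the sub-tag list (an optional DT, then an NNP or a capitalized JJ, then anything) can be checked with a short Python sketch. It assumes the noun phrase is available as (word, tag) pairs; is_org_pers is a hypothetical name and the test is deliberately loose.

```python
def is_org_pers(np):
    """True if the NP starts with an optional DT, then NNP or a capitalized JJ."""
    items = list(np)
    if items and items[0][1] == "DT":
        items = items[1:]                 # skip the leading determiner
    if not items:
        return False
    word, tag = items[0]
    return tag == "NNP" or (tag == "JJ" and word[:1].isupper())

dutch = [("the", "DT"), ("Dutch", "NNP"), ("publishing", "VBG"), ("group", "NN")]
british = [("this", "DT"), ("British", "JJ"), ("industrial", "JJ"),
           ("conglomerate", "NN")]
plain = [("the", "DT"), ("large", "JJ"), ("group", "NN")]
```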
 

Index of tagged items that can be used to find recursive semantic structures:
 

A

Adjectives
Adverbs
  manner
  time : already → past participle : + [consequence] || [result] / [fact]
  place
  frequency, quantity : a few, some : [comments] [interpretation] [info source] [frequency]
  degree

Interrogative adverbs

Any (anybody, anywhere, anything)

Articles

AS + Comparative + AS : + [comparison]
To show no difference: AS + MUCH + AS , AS + MANY + AS : + [comparison]
Auxiliary Verbs

B

Be : As an ordinary verb [definition] [description] [characteristic]
Be: As an auxiliary : + [characteristic]

C

Can
Could
Classes of adverbs
Comparative + than
Comparison of Adjectives
Comparing adverbs
Comparison of quantity (adjectives)
Compound Nouns
Countable and Uncountable nouns

D

Definite article: the : new [thema ]

Demonstratives : [cause] | [comments] / [thema]
Determiners
Distributives : either, or, neither, nor, each, every
Do : as an ordinary verb
Do : As an auxiliary

E

Either
Each
Every
Enough + noun
Enough -(adverb section)
Exclamatives: such and what

F

Form of adverbs

G

H

Have, Have got, have got to : As ordinary verbs [characteristic] [quality]
Have : As an auxiliary [consequence] [result] [fact]

I

Indefinite articles: a, an : more information expected? [description], → definite article
Indefinite Pronouns
Interrogative adverbs
Interrogative and Negative of Ordinary Verbs

J

K

L

M

May
Might
Must
Much, many
Modal auxiliary verbs
MORE, LESS, FEWER + THAN : To show difference

N

Nationalities
Need
Nouns
Not as...as
Numbers

O

One / Ones : Pronouns
Ought to

P

Personal and Possessive Pronouns
Personal Pronouns
Plural of Nouns
Possessives : my, your, his, her, its, our, their
Possessive Pronouns
Possessive with 's and '

Preposition : VBP + (PP (TO to : [object/thema]
Pronouns
Proper Nouns : [info source] [thema] [object]

Q

Quantifiers: a few, a little, much, many, a lot of, most, any, some, enough, etc.

R

Reflexive Pronouns
Relative Pronouns : Who, Whom, That, Which

S

Shall
Should
Some
Still as an adverb of time
Such

T

That : Relative Pronoun

the + Superlative
This, that, these, those

U

used to

V

VERBS

     Intransitive
     Transitive
     General Notes
     Verb Forms
     Regular verbs in the Simple Present
     The Interrogative and Negative of Ordinary Verbs
     To do : as an ordinary verb
     To Have, Have got, have got to : As ordinary verbs
     To Be : As an ordinary verb
     The Modal auxiliary verbs:
     will, shall, may, might, can, could, must, ought to, should, would, used to, need.

W

What (as exclamative)
Will
Who, Whom, Which : Relative Pronouns
would

X

Y

Yet as an adverb of time

Z
 
 

RULES :

The major rules need to be built on a succession of lexical tags that indicate the main ideas, their relationships, and an order of importance.

Some salient tags can indicate the relationship between main parts, such as the noun phrase subject or the attribute.
 

Examples of rules observed in the first sentences (NP-SBJ extraction, VP extraction):
thema
    n*NNP [ quality ] [event] || [description]
    n*NNP [ quality ] (CC and) n*NNP [ quality ] [event] || [description]
event
    [org-pers]1 [decision to do] [have effect on] [org-pers]2 [illustration]
    MD will [ object ]
illustration
    [data and amount]
description
    [ thema ] VBZ is [ quality ]
    , ADJP
object
    [ thema ] [ event ] NP
quality
    ADJ
    NP
    , ADJP
org-pers
    n*NNP [ quality ]

tagging keys :

scale of priority :
event+++ or statement++ or description+

links between tags : X about Y = X/Y = X → Y
attribute of a thema : attribut/thema

thema with no common words :
thema1, thema2
with a common noun or adjective :
thema1.1, thema1.2

scheme after tagging :
n* : one or more items of the same kind
[ ] : optional
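
The two notation keys above (n* for one or more, [ ] for optional) map directly onto regular-expression quantifiers. The sketch below turns a rule such as "n*NNP [ quality ]" into a regex over a space-separated sequence of labels. It is one possible reading of the notation, not an implementation taken from these notes: rule_to_regex and matches are hypothetical names, and only single-word labels are handled.

```python
import re

def rule_to_regex(rule):
    """Translate 'n*NNP [ quality ]' into a regex over space-separated labels."""
    rule = re.sub(r"\[\s*([^\]\s]+)\s*\]", r"[\1]", rule)   # "[ x ]" -> "[x]"
    parts = []
    for token in rule.split():
        if token.startswith("n*"):                           # one or more
            parts.append(r"(?:%s )+" % re.escape(token[2:]))
        elif token.startswith("[") and token.endswith("]"):  # optional
            parts.append(r"(?:%s )?" % re.escape(token[1:-1]))
        else:                                                # exactly one
            parts.append(r"%s " % re.escape(token))
    return re.compile("".join(parts))

def matches(rule, labels):
    """True if the whole label sequence fits the rule."""
    text = " ".join(labels) + " "        # every label is followed by a space
    return rule_to_regex(rule).fullmatch(text) is not None

org_pers_rule = "n*NNP [ quality ]"
```

Each label is matched together with its trailing space, so NNP cannot accidentally match the beginning of NNPS.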