2013-04-11

NLTK 2.3 - Working with Wordnet

I'm a little behind schedule on implementing the NLTK examples in Lisp, with no posts on the topic in March. That doesn't mean that work on CL-NLP has stopped: I had an unexpected vacation and also worked on the parts related to writing programs for the excellent Natural Language Processing course by Michael Collins on Coursera.
Today we'll start looking at Chapter 2, but we'll do it from the end, first exploring the topic of Wordnet.

Wordnet and Lisp

Wordnet is a database of word meanings (senses) and word semantic relations (synonymy, hyponymy and various other ymy's). It is a really important freely available resource, so having easy access to it is very valuable. At the same time, getting Wordnet to work is somewhat involved technically. That's why I've decided to implement its support in the cl-nlp-contrib system, so that the basic cl-nlp functionality can be loaded without the additional hassle of having to ensure that Wordnet (and other similar stuff) is loaded.
The implementation of the Wordnet interface in cl-nlp is incomplete in the sense that it doesn't provide matching entities for all Wordnet tables and so doesn't support all the possible interactions with it out of the box. Yet, it is sufficient to run all of NLTK's examples, and it is trivial to implement the rest as the need arises (I plan to do it later this year).
There are several other Wordnet interfaces for CL developed in previous years, the earliest dating back to the early nineties.
None of them uses my desired storage format (sqlite), and their overall support ranges from nonexistent to minimal. Besides, one of the principles behind cl-nlp is to prefer a uniform interface and ease of extensibility over performance and efficiency, with the premise that a specialized efficient version can be easily implemented by restricting the more general one, while the opposite is often much harder. Adapting the existing packages to the desired interface didn't seem like a straightforward task, so I decided to implement Wordnet support from scratch.

Wordnet installation

Wordnet isn't a corpus, although NLTK categorizes it as such by placing it in the nltk.corpus module. It is a word database distributed in a custom format, as well as in the binary formats of several databases. I have picked the sqlite3-based distribution as the easiest to work with: it's a single file of roughly 64 MB, which can be downloaded from the WNSQL project page. The default place where cl-nlp will search for Wordnet is the data/ directory. If you call (download :wordnet), it will fetch a copy and place it there.
There are several options for working with sqlite databases in Lisp, and I have chosen CLSQL as the most mature and comprehensive system: it's the go-to library for doing simple SQL stuff in CL. Besides the low-level interface, it provides basic ORM functionality and some other goodies.
Also, to work with sqlite from Lisp a C client library is needed. It can be obtained in various ways depending on your OS and distribution (on most Linuxes with apt-get or similar package managers). Yet there's another issue of making Lisp find the installed library. So for Linux I decided to ship the appropriate binaries in the project's lib/ dir. If you are not on Linux and CLSQL can't find sqlite native library, you need to manually call the following command before you load cl-nlp-contrib with quicklisp or ASDF:
(clsql:push-library-path "path to sqlite3 or libsqlite3 directory")

Supporting Wordnet interaction

Using CLSQL we can define classes for Wordnet entities, such as: Synset, Sense or Lexlink.
(clsql:def-view-class synset ()
  ((synsetid :initarg :synsetid :reader synset-id
             :type integer :db-kind :key)
   (pos :initarg :pos :reader synset-pos
        :type char)
   (lexdomainid :initarg :lexdomain
                :type smallint :db-constraints (:not-null))
   (definition :initarg :definition :reader synset-def
               :type string)
   (word :initarg :word :db-kind :virtual)
   (sensenum :initarg :sensenum :db-kind :virtual))
  (:base-table "synsets"))
Notice the presence of the virtual slots word and sensenum, which are used to cache data from other tables in the synset, so that it can be displayed the way NLTK does it. They are populated lazily with a slot-unbound method, just as we saw in Chapter 1.
All the DB entities are cached in a global cache as well, so that identical requests don't produce different objects and Wordnet objects can be compared with eql.
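The two caching ideas above can be sketched in plain CLOS without a database. This is only an illustration under my own assumptions: toy-synset, *toy-cache*, find-toy-synset and lookup-word are made-up names, not the ones used in cl-nlp, and the "DB lookup" is simulated.

```lisp
;; Hypothetical stand-ins for illustration; none of these names come from cl-nlp.
(defvar *toy-cache* (make-hash-table :test 'equal)
  "Maps (class-name . id) to the single object created for that DB row.")

(defvar *lookup-count* 0
  "Counts simulated DB hits, to show that the lazy slot is filled only once.")

(defclass toy-synset ()
  ((id :initarg :id :reader toy-synset-id)
   ;; WORD is left unbound at creation time and filled on first read.
   (word :initarg :word :reader toy-synset-word)))

(defun lookup-word (id)
  "Pretend to query another table for the first word of synset ID."
  (incf *lookup-count*)
  (format nil "word-~A" id))

(defmethod slot-unbound (class (obj toy-synset) (slot (eql 'word)))
  (declare (ignore class))
  ;; Store the value in the slot, so later reads never reach this method.
  (setf (slot-value obj 'word) (lookup-word (toy-synset-id obj))))

(defun find-toy-synset (id)
  "Return the cached object for ID, creating it on the first request."
  (let ((key (cons 'toy-synset id)))
    (or (gethash key *toy-cache*)
        (setf (gethash key *toy-cache*)
              (make-instance 'toy-synset :id id)))))
```

Two calls to (find-toy-synset 42) return the very same object, so eql holds, and reading toy-synset-word twice triggers only one simulated lookup.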
There are some inconsistencies in Wordnet lexicon which have also migrated to NLTK interface (and then NLTK introduced some more). I've done a cleanup on that so some classes and slot accessors don't fully mimic the names of their DB counterparts or the NLTK's interface. For instance, in our dictionary word refers to a raw string and lemma — to its representation as a DB entity.
CLSQL also has special syntactic support for writing SQL queries in Lisp:
(clsql:select [item_id] [as [count [*]] 'num]
              :from [table]
              :where [is [null [sale_date]]]
              :group-by [item_id]
              :order-by '((num :desc))
              :limit 1)
Yet I've chosen to use another very simple custom solution for that, which I've kept in the back of my mind ever since I started working with the CLSQL literal SQL syntax. Here's how we get all lemma objects for a specific synset (they are connected through senses):
(query (select 'lemma
               `(:where wordid
                 :in ,(select '(wordid :from senses)
                              `(:where synsetid := ,(synset-id synset))))))
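To make the shape of such list-based queries more tangible, here is a toy renderer that turns them into SQL text. It is a guess at the general idea only, not the actual cl-nlp implementation: toy-select, render-token and *toy-tables* are hypothetical names, and the lemma-to-words table mapping is taken from the remark later in this post that lemmas are the entities of the words table.

```lisp
;; Hypothetical sketch; the real select/query functions in cl-nlp may differ.
(defvar *toy-tables* '((lemma . "words"))
  "Assumed mapping from entity class names to table names.")

(defun render-token (x)
  "Render one element of a query list as SQL text."
  (etypecase x
    ;; In this toy every string is treated as an already rendered subquery.
    (string (format nil "(~A)" x))
    ;; :where -> WHERE, :group-by -> GROUP BY, := -> =
    (keyword (substitute #\Space #\- (symbol-name x)))
    ;; Any other symbol is a column or table name.
    (symbol (string-downcase (symbol-name x)))
    (integer (princ-to-string x))))

(defun toy-select (what &optional clauses)
  "Build a SELECT statement string from WHAT and a flat list of CLAUSES."
  (format nil "SELECT ~A~{ ~A~}"
          (if (symbolp what)
              ;; A bare symbol names an entity class: select all its columns.
              (format nil "* FROM ~A"
                      (or (cdr (assoc what *toy-tables*))
                          (string-downcase what)))
              (format nil "~{~A~^ ~}" (mapcar #'render-token what)))
          (mapcar #'render-token clauses)))

(toy-select 'lemma
            (list :where 'wordid
                  :in (toy-select '(wordid :from senses)
                                  '(:where synsetid := 102958343))))
;; => "SELECT * FROM words WHERE wordid IN (SELECT wordid FROM senses WHERE synsetid = 102958343)"
```

The appeal of this style is that a query is an ordinary list, so nesting and splicing work with plain backquote and no special reader syntax.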
For the sake of efficiency we don't open a new DB connection on every such request, but use a combination of the following:
  • there's a CLSQL special variable *default-database* that points to the current DB connection; it is used by all Wordnet interfacing functions
  • the connection can be established with connect-wordnet, which is given an instance of the sql-wordnet3 class (it has the usual default instance <wordnet>, but you can use any other instance, which may connect to some other Wordnet SQL DB like MySQL if needed)
  • the ensure-db function checks whether the connection is present and opens it otherwise
  • it is assumed that all functions will perform Wordnet interaction inside the with-wordnet macro, which implements the standard Lisp resource-management call-with-* pattern for the Wordnet connection
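The interplay of these pieces can be sketched without CLSQL. Everything below is a stand-in under my own assumptions: *toy-db* plays the role of the special variable, and open-toy-db, ensure-toy-db, call-with-toy-db and with-toy-db are made-up names, not the cl-nlp ones.

```lisp
;; Hypothetical stand-ins for illustration; not the cl-nlp definitions.
(defvar *toy-db* nil
  "Plays the role of the current-connection special variable.")

(defvar *opened* 0
  "How many times a connection was actually opened.")

(defun open-toy-db ()
  "Pretend to open a connection; a real version would call into CLSQL."
  (incf *opened*)
  (list :connection *opened*))

(defun ensure-toy-db ()
  "Return the current connection, opening one only if none is present."
  (or *toy-db*
      (setf *toy-db* (open-toy-db))))

(defun call-with-toy-db (fn)
  "Call FN with a live connection bound to *TOY-DB*."
  (let ((*toy-db* (ensure-toy-db)))
    (funcall fn)))

(defmacro with-toy-db (&body body)
  "Thin syntactic wrapper over CALL-WITH-TOY-DB."
  `(call-with-toy-db (lambda () ,@body)))
```

Keeping the logic in a function and the macro as a one-line wrapper is the usual call-with-* design: the function can be redefined and traced, while the macro only saves the lambda. Nested uses such as (with-toy-db (with-toy-db *toy-db*)) reuse the one connection that was opened first.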

Running NLTK's examples

Now, let's run the examples from the book to see how they work in our interface.
First, connect to Wordnet:
WORDNET> (connect-wordnet <wordnet>)
#<CLSQL-SQLITE3:SQLITE3-DATABASE /cl-nlp/data/wordnet30.sqlite OPEN {1011D1FAF3}>
Now we can run the functions with the existing connection passed implicitly through the special variable. wn is a special convenience package which implements the logic of implicitly relying on the default <wordnet>. There are functions with the same names in wordnet package that take a wordnet instance as first argument, similar to how other parts of cl-nlp are organized.
WORDNET> (wn:synsets "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
WORDNET> (synsets <wordnet> "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
The printing of synsets and other Wordnet objects is performed with custom print-object methods. Here's one such method, for the sample entity, which uses the built-in print-unreadable-object macro:
(defmethod print-object ((sample sample) stream)
  (print-unreadable-object (sample stream :type t :identity t)
    (with-slots (sample sampleid) sample
      (format stream "~A ~A" sample sampleid))))
Let's proceed further.
WORDNET> (wn:words (wn:synset "car.n.1"))
("auto" "automobile" "car" "machine" "motorcar")
This snippet is equivalent to the following NLTK code:
wn.synset('car.n.01').lemma_names
Definitions and examples:
WORDNET> (synset-def (wn:synset "car.n.1"))
"a motor vehicle with four wheels; usually propelled by an internal combustion engine"
WORDNET> (wn:examples (wn:synset "car.n.1"))
("he needs a car to get to work")
Next are lemmas.
WORDNET> (wn:lemmas (wn:synset "car.n.1"))
(#<LEMMA auto 9953 {100386BCC3}> #<LEMMA automobile 10063 {100386BCF3}>
 #<LEMMA car 20722 {100386BC03}> #<LEMMA machine 80312 {100386BD23}>
 #<LEMMA motorcar 86898 {1003250543}>)
As I've written earlier, there's some confusion in Wordnet between words and lemmas, and, IMHO, it is propagated in NLTK. The term lemma only appears once in Wordnet DB as a column in words table. And synsets are not directly related to words — they are linked through senses table. NLTK calls a synset to word pairing a lemma, which only adds to the confusion. I decided to call entities of words table lemmas. Now, how do you implement the equivalent of
wn.lemma('car.n.01.automobile')
We can do it like this:
WORDNET> (remove-if-not #`(string= "automobile" (lemma-word %))
                        (wn:lemmas (wn:synset "car.n.1")))
(#<LEMMA automobile 10063 {100386BCF3}>)
But what we actually need here is not a raw lemma object but a sense object, because a sense is exactly the mapping of a word to its meaning (defined by a synset), and it's the proper Wordnet entity for this:
WORDNET> (wn:sense "car~automobile.n.1")
#<SENSE car~auto.n.1 28261 {1011F226C3}>
You can also get at the raw lemma by its DB id:
WORDNET> (wn:synset (wn:lemma 10063))
#<SYNSET auto.n.1 102958343 {100386BB33}>
Notice that the synset here is named auto. This is different from NLTK's 'car', and the reason is that it's unclear from the Wordnet DB what the "primary" lemma for a synset is, so I just use the word which appears first in the DB. Probably, NLTK does the same but with a different ordering. Compare the next output with the book's one:
WORDNET> (dolist (synset (wn:synsets "car"))
           (print (wn:words synset)))
("cable car" "car")
("auto" "automobile" "car" "machine" "motorcar")
("car" "railcar" "railroad car" "railway car")
("car" "elevator car")
("car" "gondola")
Also I've chosen a different naming scheme for senses — word~synset — as it better reflects the semantic meaning of this concept.

Wordnet relations

There are two types of Wordnet relations: semantic ones between synsets and lexical ones between senses. All of the relations, or links, can be found in the *link-types* parameter. There are 28 of them in Wordnet 3.0.
To get at any relation there's a generic function related:
WORDNET> (defvar *motorcar* (wn:synset "car.n.1"))
WORDNET> (defvar *types-of-motorcar* (wn:related *motorcar* :hyponym))
WORDNET> (nth 0 *types-of-motorcar*)
#<SYNSET ambulance.n.1 102701002 {1011E35063}>
In our case "ambulance" is the first motorcar type.
WORDNET> (sort (mapcan #'wn:words *types-of-motorcar*) 'string<)
("ambulance" "beach waggon" "beach wagon" "bus" "cab" "compact" "compact car"
 "convertible" "coupe" "cruiser" "electric" "electric automobile"
 "electric car" "estate car" "gas guzzler" "hack" "hardtop" "hatchback" "heap"
 "horseless carriage" "hot rod" "hot-rod" "jalopy" "jeep" "landrover" "limo"
 "limousine" "loaner" "minicar" "minivan" "model t" "pace car" "patrol car"
 "phaeton" "police car" "police cruiser" "prowl car" "race car" "racer"
 "racing car" "roadster" "runabout" "s.u.v." "saloon" "secondhand car" "sedan"
 "sport car" "sport utility" "sport utility vehicle" "sports car" "squad car"
 "stanley steamer" "station waggon" "station wagon" "stock car" "subcompact"
 "subcompact car" "suv" "taxi" "taxicab" "tourer" "touring car" "two-seater"
 "used-car" "waggon" "wagon")
Hypernyms are more general entities in synset hierarchy:
WORDNET> (wn:related *motorcar* :hypernym)
(#<SYNSET automotive vehicle.n.1 103791235 {10125702C3}>)
Let's trace them up to root entity:
WORDNET> (defvar *paths* (wn:hypernym-paths *motorcar*))
WORDNET> (length *paths*)
2
WORDNET> (mapcar #'synset-name (nth 0 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "conveyance.n.3" "vehicle.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
WORDNET> (mapcar #'synset-name (nth 1 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "container.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
And here are just the root hypernyms:
WORDNET> (remove-duplicates (mapcar #'car *paths*))
(#<SYNSET entity.n.1 100001740 {101D016453}>)
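Enumerating such paths is a short recursion over the hypernym links. Here is a minimal sketch on a hand-made fragment of the hierarchy: the *toy-hypernyms* table is my own simplification of the output above, and toy-hypernym-paths is a hypothetical name, not the cl-nlp function.

```lisp
;; Toy data: each node maps to its direct hypernyms (an assumed, shortened
;; version of the real hierarchy). A node with no entry is a root.
(defvar *toy-hypernyms*
  '((auto                   . (automotive-vehicle))
    (automotive-vehicle     . (self-propelled-vehicle))
    (self-propelled-vehicle . (wheeled-vehicle))
    (wheeled-vehicle        . (vehicle container))
    (vehicle                . (conveyance))
    (conveyance             . (instrumentality))
    (container              . (instrumentality))
    (instrumentality        . (entity))))

(defun toy-hypernym-paths (node)
  "Return every path from a root down to NODE, root first."
  (let ((parents (cdr (assoc node *toy-hypernyms*))))
    (if (null parents)
        (list (list node))              ; NODE is a root: one trivial path
        (loop for parent in parents
              append (mapcar (lambda (path) (append path (list node)))
                             (toy-hypernym-paths parent))))))

(length (toy-hypernym-paths 'auto))
;; => 2
(remove-duplicates (mapcar #'car (toy-hypernym-paths 'auto)))
;; => (ENTITY)
```

The two paths appear because wheeled-vehicle has two hypernyms in the toy table, just as the real synset branches into conveyance and container above.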
Now, if we look at part-meronym, substance-meronym and member-holonym relations and try to get them with related, we'll get an empty set.
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym)
NIL
That's because the relation is actually one-way: a burl is part of a tree, but not vice versa. For this case there's a :reverse key to related:
WORDNET> (wn:related (wn:synset "tree.n.1") :part-meronym :reverse t)
(#<SYNSET stump.n.1 113111504 {10114220B3}>
 #<SYNSET crown.n.7 113128003 {1011423B73}>
 #<SYNSET limb.n.2 113163803 {1011425633}>
 #<SYNSET bole.n.2 113165815 {10114270F3}>
 #<SYNSET burl.n.2 113166044 {1011428BE3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym :reverse t)
(#<SYNSET sapwood.n.1 113097536 {10113E8BE3}>
 #<SYNSET duramen.n.1 113097752 {10113EA6A3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :member-holonym :reverse t)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
The tree itself, in turn, is a member-meronym of forest (NLTK names this relation the other way round):
WORDNET> (wn:related (wn:synset "tree.n.1") :member-meronym)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
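The effect of the :reverse key is easy to see on a tiny link table. This is a sketch under my own assumptions: the triples in *toy-links* are hand-picked to mirror the outputs above (which side of a link is stored as the source is assumed), and toy-related is a hypothetical name.

```lisp
;; Toy link table of (source link-type target) triples; not real Wordnet data.
(defvar *toy-links*
  '((burl  :part-meronym   tree)
    (stump :part-meronym   tree)
    (tree  :member-meronym forest)))

(defun toy-related (node link &key reverse)
  "Follow LINK from NODE, or against its direction when REVERSE is true."
  (loop for (source type target) in *toy-links*
        when (and (eq type link)
                  (eq node (if reverse target source)))
          collect (if reverse source target)))

(toy-related 'tree :part-meronym)             ; => NIL
(toy-related 'tree :part-meronym :reverse t)  ; => (BURL STUMP)
(toy-related 'tree :member-meronym)           ; => (FOREST)
```

A link is stored only once, in one direction, so asking for it from the wrong end returns nothing unless the lookup is reversed.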
And here's the mint example:
WORDNET> (dolist (s (wn:synsets "mint" :pos #\n))
           (format t "~A: ~A~%" (synset-name s) (synset-def s)))
mint.n.6: a plant where money is coined by authority of the government
mint.n.5: a candy that is flavored with a mint oil
mint.n.4: the leaves of a mint plant used fresh or candied
mint.n.3: any member of the mint family of plants
mint.n.2: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
batch.n.2: (often followed by `of') a large number or amount or extent
WORDNET> (wn:related (wn:synset "mint.n.4") :part-holonym :reverse t)
(#<SYNSET mint.n.2 112855042 {1011CF3CF3}>)
WORDNET> (wn:related (wn:synset "mint.n.4") :substance-holonym :reverse t)
(#<SYNSET mint.n.5 107606278 {1011CEECA3}>)
Verbs:
WORDNET> (wn:related (wn:synset "walk.v.1") :entail)
(#<SYNSET step.v.1 201928838 {1004139FF3}>)
WORDNET> (wn:related (wn:synset "eat.v.1") :entail)
(#<SYNSET chew.v.1 201201089 {10041BDC33}>
 #<SYNSET get down.v.4 201201856 {10041BF6F3}>)
WORDNET> (wn:related (wn:synset "tease.v.3") :entail)
(#<SYNSET arouse.v.7 201762283 {10042495E3}>
 #<SYNSET disappoint.v.1 201798936 {100424B0A3}>)
And, finally, here's antonymy, which is a lexical, not semantic, relationship, and it takes place between senses:
WORDNET> (wn:related (wn:sense "supply~supply.n.2") :antonym)
(#<SENSE demand~demand.n.2 48880 {1003637C03}>)
WORDNET> (wn:related (wn:sense "rush~rush.v.1") :antonym)
(#<SENSE linger~dawdle.v.4 107873 {1010780DE3}>)
WORDNET> (wn:related (wn:sense "horizontal~horizontal.a.1") :antonym)
(#<SENSE inclined~inclined.a.2 94496 {1010A5E653}>)
WORDNET> (wn:related (wn:sense "staccato~staccato.r.1") :antonym)
(#<SENSE legato~legato.r.1 105844 {1010C18AF3}>)

Similarity measures

Let's now use the lowest-common-hypernyms function abbreviated to lch to see the semantic grouping of marine animals:
WORDNET> (defvar *right* (wn:synset "right whale.n.1"))
WORDNET> (defvar *orca* (wn:synset "orca.n.1"))
WORDNET> (defvar *minke* (wn:synset "minke whale.n.1"))
WORDNET> (defvar *tortoise* (wn:synset "tortoise.n.1"))
WORDNET> (defvar *novel* (wn:synset "novel.n.1"))
WORDNET> (wn:lch *right* *minke*)
(#<SYNSET baleen whale.n.1 102063224 {102A2E0003}>)
2
1
WORDNET> (wn:lch *right* *orca*)
(#<SYNSET whale.n.2 102062744 {102A373323}>)
3
2
WORDNET> (wn:lch *right* *tortoise*)
(#<SYNSET craniate.n.1 101471682 {102A3734A3}>)
7
5
WORDNET> (wn:lch *right* *novel*)
(#<SYNSET entity.n.1 100001740 {102A373653}>)
15
7
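One way to picture such a computation is a walk up from both synsets until their ancestor chains meet. The sketch below does that on a hand-made, single-parent fragment of the whale hierarchy. The *toy-parent* table, toy-ancestors and toy-lch are my own illustration names, and the order and exact meaning of the two numbers returned here are this sketch's choice, which may differ from what wn:lch reports.

```lisp
;; Toy data: child -> single parent. A simplification, since real synsets
;; can have several hypernyms.
(defvar *toy-parent*
  '((right-whale   . baleen-whale)
    (minke-whale   . rorqual)
    (rorqual       . baleen-whale)
    (baleen-whale  . whale)
    (orca          . dolphin)
    (dolphin       . toothed-whale)
    (toothed-whale . whale)))

(defun toy-ancestors (node)
  "List NODE followed by its ancestors, nearest first."
  (loop for n = node then (cdr (assoc n *toy-parent*))
        while n
        collect n))

(defun toy-lch (a b)
  "Return the nearest shared ancestor of A and B and its distance from each."
  (let ((up-b (toy-ancestors b)))
    (loop for n in (toy-ancestors a)
          for da from 0
          for db = (position n up-b)
          when db
            return (values n da db))))

(toy-lch 'right-whale 'minke-whale)  ; => BALEEN-WHALE, 1, 2
(toy-lch 'right-whale 'orca)         ; => WHALE, 2, 3
```

Returning the distances as secondary values costs nothing for callers that only want the synset, yet keeps them available to similarity measures.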
The second and third return values of lch come in handy here, as they show the depth of the paths to the common ancestor and give some immediate data for estimating semantic relatedness, which we're going to explore further now.
WORDNET> (wn:min-depth (wn:synset "baleen whale.n.1"))
14
WORDNET> (wn:min-depth (wn:synset "whale.n.2"))
13
WORDNET> (wn:min-depth (wn:synset "vertebrate.n.1"))
8
WORDNET> (wn:min-depth (wn:synset "entity.n.1"))
0
By the way, there's another whale that is much closer to the root:
WORDNET> (wn:min-depth (wn:synset "whale.n.1"))
5
Guess what it is?
WORDNET> (synset-def (wn:synset "whale.n.1"))
"a very large person; impressive in size or qualities"
Now, let's calculate different similarity measures:
WORDNET> (wn:path-similarity *right* *minke*)
1/4
WORDNET> (wn:path-similarity *right* *orca*)
1/6
WORDNET> (wn:path-similarity *right* *tortoise*)
1/13
WORDNET> (wn:path-similarity *right* *novel*)
1/23
The algorithm here is very simple — it just reuses the secondary return values of lch:
(defmethod path-similarity ((wordnet sql-wordnet3)
                            (synset1 synset) (synset2 synset))
  (mv-bind (_ l1 l2) (lowest-common-hypernyms wordnet synset1 synset2)
    (declare (ignore _))
    (when l1
      (/ 1 (+ l1 l2 1)))))
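As a quick check of the arithmetic, the same formula can be applied directly to the pairs of numbers that lch returned earlier. The helper name toy-path-similarity is mine and it takes the two distances as plain arguments.

```lisp
(defun toy-path-similarity (l1 l2)
  "Score two synsets from their distances L1 and L2 to the common hypernym."
  (/ 1 (+ l1 l2 1)))

;; Distance pairs copied from the lch calls above:
;; right/minke, right/orca, right/tortoise, right/novel.
(mapcar (lambda (pair) (apply #'toy-path-similarity pair))
        '((2 1) (3 2) (7 5) (15 7)))
;; => (1/4 1/6 1/13 1/23)
```

Since Lisp division of integers yields exact rationals, the scores come out as fractions rather than rounded floats.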
There are many more Wordnet similarity measures in NLTK, and they are also implemented in cl-nlp. For example, there's lch-similarity (LCH here stands for Leacock-Chodorow), whose name happens to coincide with that of the lch function:
WORDNET> (wn:lch-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
Calculating taxonomy depth for #<SYNSET entity.n.1 100001740 {102A9F7A83}>
2.0281482
This measure depends on an expensive computation of the taxonomy depth, which is performed lazily in the max-depth function:
(defmethod max-depth ((wordnet sql-wordnet3) (synset synset))
  (reduce #'max (mapcar #`(or (get# % *taxonomy-depths*)
                              (progn
                                (format *debug-io*
                                        "Calculating taxonomy depth for ~A~%" %)
                                (set# % *taxonomy-depths*
                                      (calc-max-depth wordnet %))))
                        (mapcar #'car (hypernym-paths wordnet synset)))))
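For orientation, here is the usual Leacock-Chodorow formula written out as a function of two plain numbers. It is a sketch, not the cl-nlp code: toy-lch-similarity is a made-up name, and the inputs in the call (a shortest path of 4 edges between dog.n.1 and cat.n.1, and a noun taxonomy depth of 19) are assumptions chosen because they reproduce the score printed above.

```lisp
;; Assumed definition: -log((path-length + 1) / (2 * taxonomy-depth)).
(defun toy-lch-similarity (path-length taxonomy-depth)
  "Leacock-Chodorow score for PATH-LENGTH edges in a taxonomy of TAXONOMY-DEPTH."
  (- (log (/ (+ path-length 1)
             (* 2.0 taxonomy-depth)))))

;; Assumed inputs for the dog/cat pair; the result is roughly 2.028.
(toy-lch-similarity 4 19)
```

The taxonomy depth is the expensive part, since it requires looking at the whole hierarchy, which is why caching it per root pays off.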
Other similarity measures include wup-similarity, res-similarity, lin-similarity and others. You can read about them in the NLTK Wordnet manual. Most of them depend on the words' information content database which is calculated for different corpora (e.g. BNC and Penn Treebank) and is not part of Wordnet. There's a WordNet::Similarity project that distributes pre-calculated databases for some popular corpora.
Lin similarity uses a formula similar to that of LCH similarity for calculating the score, but its arguments are information contents instead of taxonomy depths:
WORDNET> (wn:lin-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
0.87035906
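The standard Lin formula is twice the information content of the common hypernym divided by the sum of the information contents of the two synsets, where information content is the negative log of a concept's relative frequency. Below is a sketch of that definition with invented counts, just to show the shape of the calculation. The function names and all the numbers are hypothetical and have nothing to do with the real SemCor data.

```lisp
(defun toy-information-content (count total)
  "Negative log probability of a concept seen COUNT times out of TOTAL."
  (- (log (/ count total))))

(defun toy-lin-similarity (count-lch count-1 count-2 total)
  "Lin score from raw counts of the common hypernym and the two synsets."
  (/ (* 2 (toy-information-content count-lch total))
     (+ (toy-information-content count-1 total)
        (toy-information-content count-2 total))))

;; Invented counts: the common hypernym is ten times more frequent than
;; either synset. The result is roughly 0.667.
(toy-lin-similarity 1000 100 100 100000)
```

The rarer the common hypernym is compared to the two synsets, the closer the score gets to 1. A very frequent, generic hypernym pulls it towards 0.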
To calculate Lin similarity we first need to fetch the information content corpus with (download :wordnet-ic) and load the data for SemCor:
WORDNET> (setf *lc* (load-ic :filename-pattern "semcor"))
#<HASH-TABLE :TEST EQUAL :COUNT 213482 {100D7B80D3}>
The default :filename-pattern will load the combined information content scores for the following 5 corpora: BNC, Brown, SemCor and its raw version, all of Shakespeare's works and Penn Treebank. We'll talk more about various corpora in the next part of the series...
Well, this is all as far as Wordnet is concerned for now. There's another useful piece of Wordnet functionality, the word morpher, but we'll return to it when we talk about word formation, stemming and lemmatization. Meanwhile, here's a good schema showing the full potential of Wordnet:
Wordnet 3.0 UML diagram
