
THESAURUS CONSTRUCTION

Padmini Srinivasan
University of Iowa

Abstract

Thesauri are valuable structures for Information Retrieval systems. A thesaurus provides a precise and controlled vocabulary which serves to coordinate document indexing and document retrieval. In both indexing and retrieval, a thesaurus may be used to select the most appropriate terms. Additionally, the thesaurus can assist the searcher in reformulating search strategies if required. This chapter first examines the important features of thesauri. This should allow the reader to differentiate between thesauri. Next, a brief overview of the manual thesaurus construction process is given. Two major approaches for automatic thesaurus construction have been selected for detailed examination. The first is on thesaurus construction from collections of documents, and the second, on thesaurus construction by merging existing thesauri. These two methods were selected since they rely on statistical techniques alone and are also significantly different from each other. Programs written in C language accompany the discussion of these approaches.

9.1 INTRODUCTION
Roget’s thesaurus is very different from the average thesaurus associated with Information Retrieval (IR) systems accessing machine readable document databases. The former is designed to assist the writer in creatively selecting vocabulary. In IR systems, used to retrieve potentially relevant documents from large collections, the thesaurus serves to coordinate the basic processes of indexing and document retrieval. (The term document is used here generically and may refer to books, articles, magazines, letters, memoranda, and also software.) In indexing, a succinct representation of the document is derived, while retrieval refers to the search process by which relevant items are identified. The IR thesaurus typically contains a list of terms, where a term is either a single word or a phrase, along with the relationships between them. It provides a common, precise, and controlled vocabulary which assists in coordinating indexing and retrieval. Given this objective, it is clear that thesauri are designed for specific subject areas and are therefore domain dependent.

Figure 9.1 displays a short extract from an alphabetical listing of thesaurus terms and their relationships in the INSPEC thesaurus. This thesaurus is designed for the INSPEC domain, which covers physics, electrical engineering, electronics, as well as computers and control. The thesaurus is logically organized as a set of hierarchies. In addition to this hierarchical arrangement, the printed INSPEC thesaurus also includes an alphabetical listing of thesaural terms. Each hierarchy is built from a root term representing a high-level concept in the domain. In Figure 9.1, “computer-aided instruction” is part of the hierarchy whose root node or top term (TT) is “computer applications.” The “see also” link leads to cross-referenced thesaural terms. NT suggests a more specific thesaural term; similarly, BT provides a more general thesaural term. RT signifies a related term and this relationship includes a variety of links such as part-whole and object-property. UF is utilized to indicate the chosen form from a set of alternatives. In the above example, “computer-aided instruction” is used for the alternative “teaching machines.” The converse of UF is USE, which takes the user from a form that is not part of the vocabulary to a valid alternative. The example shows that “caesium” is the preferred form over “cesium.” CC and FC are from the INSPEC classification scheme and indicate the subject area in which the term is used.

cesium
    USE caesium

computer-aided instruction
    see also education
    UF teaching machines
    BT educational computing
    TT computer applications
    RT education
       teaching
    CC C7810C
    FC 7810Cf

Figure 9.1: A short extract from the 1979 INSPEC thesaurus
A carefully designed thesaurus can be of great value. The indexer is typically instructed to select the most appropriate thesaural entries for representing the document. In searching, the user can employ the thesaurus to design the most appropriate search strategy. If the search does not retrieve enough documents, the thesaurus can be used to expand the query by following the various links between terms. Similarly, if the search retrieves too many items, the thesaurus can suggest more specific search vocabulary. In this way the thesaurus can be valuable for reformulating search strategies. It is quite common to provide on-line thesauri, which simplifies the query reformulation process. In addition, such query modifications can also be more system initiated than user initiated. However, this obviously requires rather complex algorithms since the system has to know not only how to reformulate the query but also when to do so.

There exists a vast literature on the principles, methodologies, and problems involved in thesaurus construction. However, only a small portion is devoted to the “automatic” construction of thesauri. This mirrors the current state of the art, marked by an abundance of manually generated thesauri. In fact, there is much skepticism over the possibility of fully automating this procedure, because manual thesauri are highly complex structures which exhibit a variety of relationships, including hierarchical, nonhierarchical, equivalence, and associative ones. The automatic detection of such relationships continues to be challenging. But the reader can be assured that some automatic methodologies have emerged over the last few decades. As in most other subfields of IR, these methods are strongly motivated by statistics. In contrast, manual thesaurus construction is a highly conceptual and knowledge-intensive task and therefore also extremely labor intensive. Consequently, the search for automatic alternatives is definitely of value.

The rest of this chapter is organized into six sections. The next, section 9.2, describes features of thesauri. Section 9.3 introduces the different approaches for automatic thesaurus construction following a brief description of manual thesaurus construction. Sections 9.4 and 9.5 focus on two major approaches for automatic thesaurus construction. The three C programs included with this chapter are described briefly in section 9.6. The last section presents the conclusions.

9.2 FEATURES OF THESAURI
Some important features of thesauri will be highlighted here. For a more detailed discussion, please consult Aitchison and Gilchrist (1972). The objective is for the reader to be able to compare thesauri. In general, the discussion applies to both manually and automatically generated thesauri. However, differences between the two are also identified where appropriate.

9.2.1 Coordination Level
Coordination refers to the construction of phrases from individual terms. Two distinct coordination options are recognized in thesauri: precoordination and post-coordination. A precoordinated thesaurus is one that can contain phrases. Consequently, phrases are available for indexing and retrieval. A postcoordinated thesaurus does not allow phrases. Instead, phrases are constructed while searching. The choice between the two options is difficult. The advantage in precoordination is that the vocabulary is very precise, thus reducing ambiguity in indexing and in searching. Also, commonly accepted phrases become part of the vocabulary. However, the disadvantage is that the searcher has to be aware of the phrase construction rules employed. Thesauri can adopt an intermediate level of coordination by allowing both phrases and single words. This is typical of manually constructed thesauri. However, even within this group there is significant variability in terms of coordination level. Some thesauri may emphasize two or three word phrases, while others may emphasize even larger sized phrases. Therefore, it is insufficient to state that two thesauri are similar simply because they follow precoordination. The level of coordination is important as well. It should be recognized that the higher the level of coordination, the greater the precision of the vocabulary but the larger the vocabulary size. It also implies an increase in the number of relationships to be encoded. Therefore, the thesaurus becomes more complex. The advantage in postcoordination is that the user need not worry about the exact ordering of the words in a phrase. Phrase combinations can be created as and when appropriate during searching. The disadvantage is that search precision may fall, as illustrated by the following well known example, from Salton and McGill (1983): the distinction between phrases such as “Venetian blind” and “blind Venetian” may be lost. A more likely example is “library school” and “school library.” The problem is that unless search strategies are designed carefully, irrelevant items may also be retrieved. Precoordination is more common in manually constructed thesauri. Automatic phrase construction is still quite difficult and therefore automatic thesaurus construction usually implies post-coordination. Section 9.4 includes a procedure for automatic phrase construction.

9.2.2 Term Relationships
Term relationships are the most important aspect of thesauri since the vocabulary connections they provide are most valuable for retrieval. Many kinds of relationships are expressed in a manual thesaurus. These are semantic in nature and reflect the underlying conceptual interactions between terms. We do not provide an exhaustive discussion or listing of term relationships here. Instead, we only try to illustrate the variety of relationships that exist. Aitchison and Gilchrist (1972) specify three categories of term relationships: (1) equivalence relationships, (2) hierarchical relationships, and (3) nonhierarchical relationships. The example in Figure 9.1 illustrates all three categories. Equivalence relations include both synonymy and quasi-synonymy. Synonyms can arise because of trade names, popular and local usage, superseded terms, and the like. Quasi-synonyms are terms which for the purpose of retrieval can be regarded as synonymous, for example, “genetics” and “heredity,” which have significant overlap in meaning, or “harshness” and “tenderness,” which represent different viewpoints of the same property continuum. A typical example of a hierarchical relation is genus-species, such as “dog” and “german shepherd.” Nonhierarchical relationships also identify conceptually related terms. There are many examples, including thing–part, such as “bus” and “seat,” and thing–attribute, such as “rose” and “fragrance.”

Wang, Vandendorpe, and Evens (1985) provide an alternative classification of term relationships consisting of: (1) parts–wholes, (2) collocation relations, (3) paradigmatic relations, (4) taxonomy and synonymy, and (5) antonymy relations. Parts and wholes include examples such as set–element; count–mass. Collocation relates words that frequently co-occur in the same phrase or sentence. Paradigmatic relations relate words that have the same semantic core like “moon” and “lunar” and are somewhat similar to Aitchison and Gilchrist’s quasi-synonymy relationship. Taxonomy and synonymy are self-explanatory and refer to the classical relations between terms. It should be noted that the relative value of these relationships for retrieval is not clear. However, some work has been done in this direction as in Fox (1981) and Fox et al. (1988). Moreover, at least some of these semantic relationships are commonly included in manual thesauri. Identifying these relationships requires knowledge of the domain for which the thesaurus is being designed. Most if not all of these semantic relationships are difficult to identify by automatic methods, especially by algorithms that exploit only the statistical relationships between terms as exhibited in document collections. However, it should be clear that these statistical associations are only as good as their ability to reflect the more interesting and important semantic associations between terms.

9.2.3 Number of Entries for Each Term
It is in general preferable to have a single entry for each thesaurus term. However, this is seldom achieved due to the presence of homographs–words with multiple meanings. Also, the semantics of each instance of a homograph can only be contextually deciphered. Therefore, it is more realistic to have a unique representation or entry for each meaning of a homograph. This also allows each homograph entry to be associated with its own set of relations. The problem is that multiple term entries add a degree of complexity in using the thesaurus–especially if it is to be used automatically. Typically the user has to select between alternative meanings. In a manually constructed thesaurus such as INSPEC, this problem is resolved by the use of parenthetical qualifiers, as in the pair of homographs, bonds (chemical) and bonds (adhesive). However, this is hard to achieve automatically.

9.2.4 Specificity of Vocabulary
Specificity of the thesaurus vocabulary is a function of the precision associated with the component terms. A highly specific vocabulary is able to express the subject in great depth and detail. This promotes precision in retrieval. The concomitant disadvantage is that the size of the vocabulary grows since a large number of terms are required to cover the concepts in the domain. Also, specific terms tend to change (i.e., evolve) more rapidly than general terms. Therefore, such vocabularies tend to require more regular maintenance. Further, as discussed previously, high specificity implies a high coordination level which in turn implies that the user has to be more concerned with the rules for phrase construction.

9.2.5 Control on Term Frequency of Class Members
This has relevance mainly for statistical thesaurus construction methods which work by partitioning the vocabulary into a set of classes where each class contains a collection of equivalent terms. Salton and McGill (1983, 77-78) have stated that in order to maintain a good match between documents and queries, it is necessary to ensure that terms included in the same thesaurus class have roughly equal frequencies. Further, the total frequency in each class should also be roughly similar. (Appropriate frequency counts for this include: the number of term occurrences in the document collection; the number of documents in the collection in which the term appears at least once). These constraints are imposed to ensure that the probability of a match between a query and a document is the same across classes. In other words, terms within the same class should be equally specific, and the specificity across classes should also be the same.

9.2.6 Normalization of Vocabulary
Normalization of vocabulary terms is given considerable emphasis in manual thesauri. There are extensive rules which guide the form of the thesaural entries. A simple rule is that terms should be in noun form. A second rule is that noun phrases should avoid prepositions unless they are commonly known. Also, a limited number of adjectives should be used. There are other rules to direct issues such as the singularity of terms, the ordering of terms within phrases, spelling, capitalization, transliteration, abbreviations, initials, acronyms, and punctuation. In other words, manual thesauri are designed using a significant set of constraints and rules regarding the structure of individual terms. The advantage in normalizing the vocabulary is that variant forms are mapped into base expressions, thereby bringing consistency to the vocabulary. As a result, the user does not have to worry about variant forms of a term. The obvious disadvantage is that, in order to be used effectively, the user has to be well aware of the normalization rules used. This is certainly nontrivial and often viewed as a major hurdle during searching (Frost 1987). In contrast, normalization rules in automatic thesaurus construction are simpler, seldom involving more than stoplist filters and stemming. (These are discussed in more detail later.) However, this feature can be regarded both as a weakness and a strength.

These different features of thesauri have been presented so that the reader is aware of some of the current differences between manual and automatic thesaurus construction methods. Not all features are equally important, and they should be weighed according to the application for which the thesaurus is being designed. This section also gives an idea of where further research is required. Given the growing abundance of large document databases, the gaps between manual and automatic thesauri remain an important challenge.

9.3 THESAURUS CONSTRUCTION

9.3.1 Manual Thesaurus Construction
The process of manually constructing a thesaurus is both an art and a science. We present here only a brief overview of this complex process. First, one has to define the boundaries of the subject area. (In automatic construction, this step is simple, since the boundaries are taken to be those defined by the area covered by the document database.) Boundary definition includes identifying central subject areas and peripheral ones since it is unlikely that all topics included are of equal importance. Once this is completed, the domain is generally partitioned into divisions or subareas. Once the domain, with its subareas, has been sufficiently defined, the desired characteristics of the thesaurus have to be identified. Since manual thesauri are more complex structurally than automatic ones, as the previous section has shown, there are more decisions to be made.

Now, the collection of terms for each subarea may begin. A variety of sources may be used for this including indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, as well as any existing and relevant thesauri or vocabulary systems. Subject experts and potential users of the thesaurus should also be included in this step. After the initial vocabulary has been identified, each term is analyzed for its related vocabulary including synonyms, broader and narrower terms, and sometimes also definitions and scope notes. These terms and their relationships are then organized into structures such as hierarchies, possibly within each subarea. The process of organizing the vocabulary may reveal gaps which can lead to the addition of terms; identify the need for new levels in the hierarchies; bring together synonyms that were not previously recognized; suggest new relationships between terms; and reduce the vocabulary size. Once the initial organization has been completed, the entire thesaurus will have to be reviewed (and refined) to check for consistency such as in phrase form and word form. Special problems arise in incorporating terms from existing thesauri which may for instance have different formats and construction rules. At this stage the hierarchically structured thesaurus has to be “inverted” to produce an alphabetical arrangement of entries–a more effective arrangement for use. Typically both the alphabetical and hierarchical arrangements are provided in a thesaurus. Following this, the manually generated thesaurus is ready to be tested by subject experts and edited to incorporate their suggestions.

The above informal description is very sketchy. It is a long process that involves a group of individuals and a variety of resources. Once the thesaurus has been designed and implemented for use within a retrieval system, the next problem is that it needs to be maintained in order to ensure its continued viability and effectiveness. That is, the thesaurus should reflect any changes in the terminology of the area. The problem is that since the older documents are still within the system, the updated thesaurus must also retain the older information. Updates are typically slow and again involve several individuals who regularly review and suggest new and modified vocabulary terms as well as relationships. Therefore, typically a thesaurus evolves with time and slowly responds to changes in the terminology of the subject.

9.3.2 Automatic Thesaurus Construction
In selecting automatic thesaurus construction approaches for discussion here, the criteria used are that they should be quite different from each other in addition to being interesting. Also, they should use purely statistical techniques. (The alternative is to use linguistic methods.) Consequently, the two major approaches selected here have not necessarily received equal attention in the literature. The first approach, on designing thesauri from document collections, is a standard one. The second, on merging existing thesauri, is better known using manual methods. Programs included with this chapter are based on these two major approaches. We also discuss a third automatic approach which is quite novel and interesting, although it is based on tools from expert systems and does not use statistical methods. In this approach, thesauri are built using information obtained from users. We first present a brief overview of each approach and then provide detailed descriptions.

9.3.3 From a Collection of Document Items
Here the idea is to use a collection of documents as the source for thesaurus construction. This assumes that a representative body of text is available. The idea is to apply statistical procedures to identify important terms as well as their significant relationships. It is reiterated here that the central thesis in applying statistical methods is to use computationally simpler methods to identify the more important semantic knowledge for thesauri. It is semantic knowledge that is used by both indexer and searcher. Until more direct methods are known, statistical methods will continue to be used. Work by Soergel (1974) is relevant to this point since it includes an interesting discussion on the various semantic interpretations of significant statistical associations between words. The first two programs, select.c and hierarchy.c, included with this chapter are based on this approach of designing a thesaurus from a collection of documents.

9.3.4 By Merging Existing Thesauri
This second approach is appropriate when two or more thesauri for a given subject exist and need to be merged into a single unit. If a new database can indeed be served by merging two or more existing thesauri, then a merger is likely to be more efficient than producing the thesaurus from scratch. This approach has been discussed at some length in Forsyth and Rada (1986). The challenge is that the merger should not violate the integrity of any component thesaurus. Rada has experimented with augmenting the MeSH thesaurus with selected terms from SNOMED (Forsyth and Rada 1986, 216). MeSH stands for Medical Subject Headings and is the thesaurus used in MEDLINE, a medical document retrieval system constructed and maintained by the National Library of Medicine. It provides a sophisticated controlled vocabulary for indexing and accessing medical documents. SNOMED, which stands for Systematized Nomenclature of Medicine, is a detailed thesaurus developed by the College of American Pathologists for use in hospital records. MeSH terms are used to describe documents, while SNOMED terms are used to describe patients. Ideally, a patient can be completely described by choosing one or more terms from each of several categories in SNOMED. Both MeSH and SNOMED follow a hierarchical structure. Rada’s focus in his experiments has been on developing suitable algorithms for merging related but separate thesauri such as MeSH and SNOMED and also on evaluating the end products. The third program (merge.c) included here implements two different merging algorithms adapted from Rada’s work.

9.3.5 User Generated Thesaurus
In this third alternative, the idea is that users of IR systems are aware of and use many term relationships in their search strategies long before these find their way into thesauri. The objective is to capture this knowledge from the user’s search. This is the basis of TEGEN–the thesaurus generating system designed by Guntzer et al. (1988). They propose TEGEN as a viable alternative technique for automatic thesaurus construction. The procedure involves examining the types of Boolean operators used between search terms, the type of query modification performed by the user, and so on. User feedback is included to resolve any ambiguities and uncertainties. Feedback is also required to select the specific type of relationship between two terms once it has been decided that the terms are indeed related. Therefore, their approach requires considerable interaction with the user population. TEGEN is designed using production rules which perform a detailed analysis of the user’s search pattern. Given that this approach utilizes mainly expert system methodologies, no representative program is included here.

9.4 THESAURUS CONSTRUCTION FROM TEXTS
The overall process may be divided into three stages:

1. Construction of vocabulary: This involves normalization and selection of terms. It also includes phrase construction, depending on the coordination level desired.

2. Similarity computation between terms: This step identifies the significant statistical associations between terms.

3. Organization of vocabulary: Here the selected vocabulary is organized, generally into a hierarchy, on the basis of the associations computed in step 2.

The first program, select.c, implements procedures for stages 1 and 2. Program hierarchy.c implements one method for organizing the vocabulary in the third stage.

9.4.1 Construction of Vocabulary
The objective here is to identify the most informative terms (words and phrases) for the thesaurus vocabulary from document collections. The first step is to identify an appropriate document collection. The only loosely stated criterion is that the collection should be sizable and representative of the subject area. The next step is to determine the required specificity for the thesaurus. If high specificity is needed, then the emphasis will be on identifying precise phrases; otherwise, more general terms can be sought. Terms can be selected from titles, abstracts, or even the full text of the documents if available. This initial set of vocabulary terms is now ready for normalization. The simplest and most common normalization procedure is to eliminate very trivial words such as prepositions and conjunctions. For this, an appropriate stoplist of trivial words needs to be constructed. The article by Fox (1989-1990) on the construction of stoplists may be useful here. The next standard normalization procedure is to stem the vocabulary. Stemming reduces each word to its root form. For example, the terms “information,” “informing,” and “informed” could all be stemmed to the same root, “inform.” The chapter in this book on stemming algorithms may be consulted for this. The resulting pool of stems is now ready to be analyzed statistically with two objectives: first, to select the most interesting stems, as discussed in the following section, and second, to create interesting phrases for a higher coordination level, as discussed in the section on phrase construction.
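To make the normalization step concrete, here is a minimal C sketch of stoplist filtering: it lowercases each incoming word and discards it if it appears on a small built-in stoplist. The stoplist contents and the is_stopword() helper are illustrative assumptions and are not part of select.c; stemming would follow as a separate step.

/*
 * Minimal sketch of vocabulary normalization: lowercase each candidate
 * term and drop it if it appears on a small built-in stoplist.
 * The stoplist contents and is_stopword() are illustrative assumptions.
 */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *stoplist[] = { "a", "an", "and", "in", "of", "the", "to" };
static const int stoplist_size = sizeof(stoplist) / sizeof(stoplist[0]);

/* Return 1 if the (already lowercased) word is on the stoplist. */
static int is_stopword(const char *word)
{
    int i;
    for (i = 0; i < stoplist_size; i++)
        if (strcmp(word, stoplist[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    char word[64];

    /* Read whitespace-delimited words and print the ones that survive. */
    while (scanf("%63s", word) == 1) {
        for (char *p = word; *p; p++)
            *p = (char)tolower((unsigned char)*p);
        if (!is_stopword(word))
            printf("%s\n", word);   /* candidate for the thesaurus vocabulary */
    }
    return 0;
}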

Stem evaluation and selection
There are a number of methods for statistically evaluating the worth of a term. The ones we discuss here are: (1) selection of terms based on frequency of occurrence, (2) selection of terms based on Discrimination Value, and (3) selection of terms based on the Poisson model. Program select.c includes routines for all three methods, which are described briefly below.

Selection by Frequency of Occurrence: This is one of the oldest methods and is based on Luhn’s work, which has been extensively discussed in the literature; see, for example, Salton and McGill (1983). The basic idea is that each term may be placed in one of three frequency categories with respect to a collection of documents: high, medium, and low frequency. Terms in the mid-frequency range are the best for indexing and searching. Terms in the low-frequency range have minimal impact on retrieval, while high-frequency terms are too general and negatively impact search precision. Salton recommends creating term classes for the low-frequency terms; however, it is not evident how to do this automatically. High-frequency terms are generally coordinated into phrases to make them more specific (see the later section on phrase construction for this). Threshold frequencies are generally not fixed and are therefore user specified. Program select.c includes a routine for this selection method. Threshold frequencies are specified by the user via the global variables LOW_THRESHOLD and HIGH_THRESHOLD.
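The following is a small, self-contained C sketch of this frequency-based partitioning; the term_t structure, the sample terms, and the threshold values are illustrative assumptions rather than the actual data structures of select.c.

/*
 * Sketch of Luhn-style selection: partition terms into LOW, MEDIUM and
 * HIGH frequency classes using two user-chosen thresholds.
 */
#include <stdio.h>

#define LOW_THRESHOLD   2     /* collection frequency <= this is "low"  */
#define HIGH_THRESHOLD  50    /* collection frequency >= this is "high" */

typedef struct {
    const char *term;
    int         freq;   /* frequency of the term in the whole collection */
} term_t;

static const char *freq_class(int freq)
{
    if (freq <= LOW_THRESHOLD)  return "LOW";
    if (freq >= HIGH_THRESHOLD) return "HIGH";
    return "MEDIUM";            /* mid-frequency terms go into the vocabulary */
}

int main(void)
{
    term_t terms[] = { {"the", 1200}, {"indexing", 35}, {"osmosis", 1} };
    int i, n = sizeof(terms) / sizeof(terms[0]);

    for (i = 0; i < n; i++)
        printf("%-10s %5d  %s\n", terms[i].term, terms[i].freq,
               freq_class(terms[i].freq));
    return 0;
}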

Selection by Discrimination Value (DV): DV measures the degree to which a term is able to discriminate or distinguish between the documents of the collection as described by Salton and Yang (1973). The more discriminating a term, the higher its value as an index term. The overall procedure is to compute the average interdocument similarity in the collection, using some appropriate similarity function. Next, the term k being evaluated is removed from the indexing vocabulary and the same average similarity is recomputed. The discrimination value (DV) for the term is then computed as:

DV(k) = (Average similarity without k) – (Average similarity with k)

Good discriminators are those that decrease the average similarity by their presence, that is, those for which DV is positive. Poor discriminators have negative DV, while neutral discriminators have no effect on average similarity. Terms that are positive discriminators can be included in the vocabulary and the rest rejected.

Program select.c includes a routine called dv-all which computes DV for all terms with respect to a collection. In the program, average similarity with all terms intact is referred to as the baseline. The algorithm used is a straightforward one using the method of centroids. In this method the average interdocument similarity is computed by first calculating the average document vector or centroid vector for the collection. This is performed by a routine called centroid. Next, the similarity between every document and this centroid is calculated. This generates the total similarity in the collection, which can then be used to calculate the average similarity. For large collections of documents, a more efficient algorithm such as that based on the cover coefficient concept may be tried as suggested by Can and Ozkarahan (1987). However, this is not done here.
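As a rough illustration of the method of centroids (not the actual dv-all routine), the following C sketch computes DV for every term of a tiny, hypothetical term-by-document weight matrix; the matrix, its size, and the cosine helper are assumptions made for the example.

/*
 * Sketch of the centroid method for Discrimination Value. Documents are
 * rows of a small term-by-document weight matrix (invented data).
 */
#include <stdio.h>
#include <math.h>

#define NDOCS  4
#define NTERMS 5

static double weight[NDOCS][NTERMS] = {
    {2, 0, 1, 0, 0},
    {1, 1, 0, 0, 2},
    {0, 2, 0, 1, 0},
    {0, 0, 1, 2, 1}
};

/* Cosine similarity between two vectors, ignoring term `skip` (-1 = none). */
static double cosine(const double *a, const double *b, int skip)
{
    double dot = 0, na = 0, nb = 0;
    int t;
    for (t = 0; t < NTERMS; t++) {
        if (t == skip) continue;
        dot += a[t] * b[t];
        na  += a[t] * a[t];
        nb  += b[t] * b[t];
    }
    if (na == 0 || nb == 0) return 0;
    return dot / sqrt(na * nb);
}

/* Average document-to-centroid similarity, optionally ignoring one term. */
static double avg_similarity(int skip)
{
    double centroid[NTERMS] = {0};
    double total = 0;
    int d, t;

    for (d = 0; d < NDOCS; d++)
        for (t = 0; t < NTERMS; t++)
            centroid[t] += weight[d][t] / NDOCS;

    for (d = 0; d < NDOCS; d++)
        total += cosine(weight[d], centroid, skip);
    return total / NDOCS;
}

int main(void)
{
    double baseline = avg_similarity(-1);   /* all terms intact */
    int k;

    for (k = 0; k < NTERMS; k++) {
        double dv = avg_similarity(k) - baseline;   /* DV(k) */
        printf("term %d: DV = %+.4f (%s discriminator)\n", k, dv,
               dv > 0 ? "good" : dv < 0 ? "poor" : "neutral");
    }
    return 0;
}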

Selection by the Poisson Method: This is based on the work of Bookstein and Swanson (1974), Harter (1975), and Srinivasan (1990) on the family of Poisson models. The Poisson distribution is a discrete probability distribution that can be used to model a variety of random phenomena, including the number of typographical errors on a page of writing and the number of red cars on a highway per hour. Across the research that has been performed on the family of Poisson models, the one significant result is that trivial words follow a single Poisson distribution, while the distribution of nontrivial words deviates significantly from a Poisson distribution. This result is used here to select nontrivial words as thesaurus terms.

The get-Poisson-dist routine in the select.c program prints out the distributions in a collection for all terms in the inverted file. Two distributions are produced for each term: the actual distribution and the one expected under the assumption of a single Poisson model. These distributions will have to be compared using the chi-square test to identify any significant differences. The reader is referred to Harter (1975) for information on these chi-square comparisons. Terms whose distributions deviate significantly are selected to be in the vocabulary. The rest may be discarded.
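The following C sketch illustrates, under invented data, what such a pair of distributions looks like for one term: it tabulates the observed number of documents containing k occurrences of the term against the count expected under a single Poisson model with lambda equal to the mean number of occurrences per document. The observed counts and collection size are assumptions; the chi-square comparison itself is left to the reader, as noted above.

/*
 * Sketch of the single-Poisson comparison for one term: observed counts
 * of documents with k occurrences versus counts expected under a single
 * Poisson with lambda = mean occurrences per document (invented data).
 */
#include <stdio.h>
#include <math.h>

#define MAXK  5         /* track within-document frequencies 0..MAXK */
#define NDOCS 100.0     /* size of the hypothetical collection       */

int main(void)
{
    /* observed[k] = number of documents in which the term occurs k times */
    double observed[MAXK + 1] = {60, 20, 10, 6, 3, 1};
    double total_occurrences = 0, lambda, p;
    int k;

    for (k = 0; k <= MAXK; k++)
        total_occurrences += k * observed[k];
    lambda = total_occurrences / NDOCS;     /* mean occurrences per document */

    printf("lambda = %.3f\n  k  observed  expected\n", lambda);
    p = exp(-lambda);                       /* P(X = 0) */
    for (k = 0; k <= MAXK; k++) {
        printf("%3d  %8.1f  %8.2f\n", k, observed[k], p * NDOCS);
        p *= lambda / (k + 1);              /* P(X = k+1) from P(X = k) */
    }
    /* A chi-square test on these two columns (Harter 1975) decides whether
       the term deviates enough from the single Poisson to be retained. */
    return 0;
}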

Phrase construction
This step may be used to build phrases if desired. As mentioned before, this decision is influenced by the coordination level selected. Also, phrase construction can be performed to decrease the frequency of high-frequency terms and thereby increase their value for retrieval. Two methods are described below. Program select.c includes routines for the first method. Since insufficient details are available for the second method, it is not implemented here. However, since it is an interesting approach, we include a sketch of its algorithm.

Salton and McGill Procedure: This is the standard procedure proposed in Salton and McGill (1983, 133-34) and adapted by Rada in Forsyth and Rada (1986, 200). It is a statistical alternative to syntactic and/or semantic methods for identifying and constructing phrases. Basically, two general criteria are used. First, the component words of a phrase should occur frequently in a common context, such as the same sentence. More stringent contextual criteria are possible, such as requiring that the words also appear within some specified distance of each other. The second general requirement is that the component words should represent broad concepts and that their frequency of occurrence should be sufficiently high. These criteria motivate the algorithm described below.

1. Compute pairwise co-occurrence for high-frequency words. (Any suitable contextual constraint such as the ones above may be applied in selecting pairs of terms.)

2. If this co-occurrence is lower than a threshold, then do not consider the pair any further.

3. For pairs that qualify, compute the cohesion value. Two formulas for computing cohesion are given below. Both ti and tj represent terms, and size-factor is related to the size of the thesaurus vocabulary.

COHESION(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))
                                                              (Rada, page 200)

COHESION(ti, tj) = size-factor * co-occurrence-frequency /
                   (total-frequency(ti) * total-frequency(tj))
                                                              (Salton and McGill, 85)
4. If cohesion is above a second threshold, retain the phrase as a valid vocabulary phrase.

We do not include a program for this complete algorithm. However, the routine cohesion in select.c is an implementation of the Rada formula.
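For concreteness, a minimal C sketch of the Rada cohesion formula follows; the frequencies and the acceptance threshold used are illustrative assumptions, not values produced by select.c.

/*
 * Minimal sketch of the Rada cohesion formula used for phrase
 * construction (step 3 above); the numbers are invented.
 */
#include <stdio.h>
#include <math.h>

/* COHESION(ti, tj) = co-occurrence / sqrt(freq(ti) * freq(tj)) */
static double cohesion(double cooc, double freq_i, double freq_j)
{
    return cooc / sqrt(freq_i * freq_j);
}

int main(void)
{
    double freq_info = 120, freq_retrieval = 80, cooc = 45;
    double c = cohesion(cooc, freq_info, freq_retrieval);

    /* Retain "information retrieval" as a phrase if c exceeds a threshold. */
    printf("cohesion = %.3f -> %s\n", c, c > 0.3 ? "keep phrase" : "reject");
    return 0;
}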

Choueka Procedure: The second phrase construction method is based on the work of Choueka (1988). He proposes a rather interesting and novel approach for identifying collocational expressions, by which he refers to phrases whose meaning cannot be derived in a simple way from that of the component words, for example, “artificial intelligence.” The algorithm proposed is statistical and combinatorial and requires a large collection (at least a million items) of documents to be effective. The author has been quite successful in identifying meaningful phrases and is apparently extending the algorithm. Unfortunately, some critical details are missing from the paper. Therefore, we do not include an implementation of this approach. However, a sketch of the procedure is included since the overall process looks rather interesting.

1. Select the range of length allowed for each collocational expression. Example: two to six words.

2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency (again, a preset value).

3. Delete sequences that begin or end with a trivial word. The trivial words include prepositions, pronouns, articles, conjunctions, and so on.

4. Delete expressions that contain high-frequency nontrivial words.

5. Given an expression such as a b c d, evaluate any potential subexpressions such as a b c and b c d for relevance. Discard any that are not sufficiently relevant. (It is not clear from the paper how relevance is decided, but perhaps it is also based on frequency.)

6. Try to merge smaller expressions into larger and more meaningful ones. For example, a b c and b c d may merge to form a b c d. (Again, the exact criteria for allowing a merger are not given.)

The main difference between this procedure and the previous one is that this one considers phrases that have more than two words. It is of course possible to extend the previous procedure to include phrases with more than two words. However, computing cohesion is likely to be more challenging than simply applying the formula recursively over larger sized phrases. Choueka’s procedure also allows phrases to be substituted by longer and more general phrases as in step 6. However, the criteria used for allowing this should be carefully formulated.

9.4.2 Similarity Computation
Once the appropriate thesaurus vocabulary has been identified, and phrases have been designed if necessary, the next step is to determine the statistical similarity between pairs of terms. There are a number of similarity measures available in the literature. An extensive study has been done on the comparison of different similarity measures (McGill et al. 1979). Select.c includes two similarity routines:

1. Cosine: which computes the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second.

2. Dice: which computes the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.

Either measure can be used to assess the similarity or association between terms in a document collection. The algorithms implemented in the program select.c can be made more accurate by including in each numerator only those instances wherein the two terms co-occur in the same sentence and within some specified distance (that is, in a unit smaller than the entire document), as suggested in Salton and McGill (1983) and Soergel (1974). This makes the criterion for similarity more stringent.
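The two measures can be written compactly in terms of document counts, as in the following C sketch; df_i and df_j denote the number of documents associated with each term, df_ij the number associated with both, and the sample values are assumptions. The Dice routine follows the description given above; note that the classical Dice coefficient additionally scales the ratio by 2.

/*
 * Sketch of the two term-term similarity measures described above,
 * expressed with document counts (invented values).
 */
#include <stdio.h>
#include <math.h>

static double cosine_sim(double df_ij, double df_i, double df_j)
{
    return df_ij / sqrt(df_i * df_j);
}

static double dice_sim(double df_ij, double df_i, double df_j)
{
    /* As described in the text; the classical Dice coefficient uses 2*df_ij. */
    return df_ij / (df_i + df_j);
}

int main(void)
{
    double df_i = 40, df_j = 25, df_ij = 10;

    printf("cosine = %.3f\n", cosine_sim(df_ij, df_i, df_j));
    printf("dice   = %.3f\n", dice_sim(df_ij, df_i, df_j));
    return 0;
}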

9.4.3 Vocabulary Organization
Once the statistical term similarities have been computed, the last step is to impose some structure on the vocabulary which usually means a hierarchical arrangement of term classes. For this, any appropriate clustering program can be used. A standard clustering algorithm generally accepts all pairwise similarity values corresponding to a collection of objects and uses these similarity values to partition the objects into clusters or classes such that objects within a cluster are more similar than objects in different clusters. Some clustering algorithms can also generate hierarchies. It should be noted there are major differences between available clustering algorithms, and the selection should be made after carefully studying their characteristics.

Given the chapter in this book on clustering, such programs are not included here. Instead, we include the implementation of an alternative, simple procedure for organizing the vocabulary, which is described in Forsyth and Rada (1986, 200-01). This algorithm, implemented in hierarchy.c, is quite different from the standard clustering algorithms and is based on the following assumptions: (1) high-frequency words have broad meaning, while low-frequency words have narrow meaning; and (2) if the density functions of two terms p and q (of varying frequencies) have the same shape, then the two words have similar meaning. As a consequence of these two assumptions, if p is the term with the higher frequency, then q becomes a child of p. These two assumptions motivate the entire procedure, as outlined below. Besides illustrating the procedure, hierarchy.c also includes appropriate data structures for storing thesauri organized as hierarchies.

1. Identify a set of frequency ranges.

2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1. There will be one term class for each frequency range.

3. The highest frequency class is assigned level 0, the next, level 1 and so on.

4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute the similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies, then each becomes a parent of t. In other words, a term is allowed to have multiple parents. (A small sketch of this linking step is given after the procedure.)

5. After all terms in level i have been linked to level i-1 terms, check the level i-1 terms and identify those that have no children. Propagate each such term to level i by creating an identical “dummy” term as its child.

6. Perform steps 4 and 5 for each level starting with level 1.
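The following minimal C sketch illustrates step 4, the parent-child linking between two adjacent levels. The terms, the level assignments, and the precomputed similarity matrix are illustrative assumptions; hierarchy.c derives similarities from the inverted file instead.

/*
 * Sketch of step 4: attach each level-i term to the most similar term
 * one level up; ties produce multiple parents (invented data).
 */
#include <stdio.h>

#define NPARENTS  2   /* terms at level i-1 */
#define NCHILDREN 3   /* terms at level i   */

static const char *parents[NPARENTS]   = { "computing", "education" };
static const char *children[NCHILDREN] = { "educational computing",
                                           "teaching", "software" };

/* sim[c][p]: precomputed similarity between child c and parent p. */
static double sim[NCHILDREN][NPARENTS] = {
    {0.6, 0.5},
    {0.1, 0.7},
    {0.4, 0.0}
};

int main(void)
{
    int c, p;

    for (c = 0; c < NCHILDREN; c++) {
        double best = -1.0;
        for (p = 0; p < NPARENTS; p++)          /* find the best score   */
            if (sim[c][p] > best)
                best = sim[c][p];
        for (p = 0; p < NPARENTS; p++)          /* every tie is a parent */
            if (sim[c][p] == best)
                printf("\"%s\" -> child of \"%s\"\n", children[c], parents[p]);
    }
    return 0;
}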

9.5 MERGING EXISTING THESAURI
The third program, merge.c, is designed to merge different hierarchies. This is useful when different thesauri (perhaps with different perspectives) are available for the same subject, or when a new subject is being synthesized from existing ones. The algorithm for the program has been adapted from Chapter 14 of Forsyth and Rada (1986), in which experiments in augmenting MeSH and SNOMED are described. Two different merging algorithms have been implemented. The first, called simple-merge, links hierarchies wherever they have terms in common. The second, called complex-merge, adopts a more interesting criterion: it links terms from different hierarchies if they are similar enough. Here, similarity is computed as a function of the number of parent and child terms the two have in common, and sufficiency is decided based on a preset, user-specified threshold.
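A minimal C sketch of the complex-merge criterion follows; the term records, the similarity score (here a simple count of shared parents and children), and the THRESHOLD value are illustrative assumptions rather than the actual data structures of merge.c.

/*
 * Sketch of the complex-merge criterion: two terms from different
 * hierarchies are linked when they share enough parents and children.
 */
#include <stdio.h>
#include <string.h>

#define THRESHOLD 2   /* minimum number of shared relatives for a link */

/* Count names appearing in both lists. */
static int common(const char *a[], int na, const char *b[], int nb)
{
    int i, j, n = 0;
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            if (strcmp(a[i], b[j]) == 0)
                n++;
    return n;
}

int main(void)
{
    /* "hypertension" as it might appear in two different hierarchies. */
    const char *parents1[]  = { "vascular diseases" };
    const char *children1[] = { "renal hypertension", "ocular hypertension" };
    const char *parents2[]  = { "vascular diseases" };
    const char *children2[] = { "renal hypertension", "portal hypertension" };

    int score = common(parents1, 1, parents2, 1)
              + common(children1, 2, children2, 2);

    printf("similarity = %d -> %s\n", score,
           score >= THRESHOLD ? "link the terms" : "keep separate");
    return 0;
}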

9.6 BRIEF DESCRIPTION OF C PROGRAMS INCLUDED
Three programs are included at the end of this chapter. The first can be used to select terms and to construct phrases; the second generates (or reads) and stores hierarchies. The third program merges different hierarchies.

9.6.1 Program Select.c
This program contains a variety of routines for the various selection criteria used in designing the thesaurus vocabulary (see Appendix 9.A). It requires two input files: a direct file and an inverted file. Both contain the same information, but arranged differently. The direct file is a listing of document numbers corresponding to the database of interest. Each document number is associated with a set of term and term-weight pairs. A document may be associated with all its component terms or perhaps only a select few. The term-weight represents the strength of association between the document and the term. This term-weight may be assigned manually or, for example, be some function of the term’s frequency of occurrence in the document. The file is arranged such that the rows pertaining to the same document are grouped together. The inverted file is a listing of terms. Here each term is linked to its associated document numbers and term-weights. The interpretation of term-weights should be the same in both input files; in fact, the two files should contain identical information. The inverted file is arranged such that the rows corresponding to a term are grouped together. In both files, document numbers are represented by integers, terms are character strings, and weights are decimal numbers. One or more spaces separate the three fields. Figure 9.2 below shows a brief extract of both files.

Direct file extract        Inverted file extract

2 math 2.0                 mellitus 1 1.0
2 logic 1.0                logic 2 1.0
1 diabetes 2.0             diabetes 1 2.0
1 mellitus 1.0             math 2 2.0
3 math 1.0                 math 3 1.0

Figure 9.2: Short extracts from both input files
Besides these two files, an output file must also be specified. The user will have to specify four global variables. The first two are MAXWORD, specifying the maximum number of characters in a term, and MAXWDF, specifying the expected maximum frequency of occurrence for a term within a document. The other two parameters are LOW_THRESHOLD and HIGH_THRESHOLD, which are used when partitioning the terms by frequency of occurrence in the collection into HIGH, MEDIUM, and LOW frequency classes.
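As a rough illustration of the file format (not code taken from select.c), the following C sketch reads an inverted file laid out as in Figure 9.2, one term, document number, and weight per row; the file name and buffer size are assumptions.

/*
 * Sketch of reading the inverted file format of Figure 9.2: each row is
 * a term, a document number and a weight separated by spaces.
 */
#include <stdio.h>

#define MAXWORD 80   /* maximum number of characters in a term (assumed) */

int main(void)
{
    char term[MAXWORD + 1];
    int docno;
    float weight;
    FILE *fp = fopen("inverted.txt", "r");   /* hypothetical file name */

    if (fp == NULL) {
        perror("inverted.txt");
        return 1;
    }
    /* Rows for the same term are grouped together in the file. */
    while (fscanf(fp, "%80s %d %f", term, &docno, &weight) == 3)
        printf("term=%s doc=%d weight=%.1f\n", term, docno, weight);
    fclose(fp);
    return 0;
}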

9.6.2 Program Hierarchy.c
This program can perform two major and separate functions (see Appendix 9.B). First, if given the hierarchical relationships between a set of terms, it records these relationships in its internal inverted file structure, which can then be used for other purposes. Second, it is also capable of generating the hierarchical structure automatically using Rada’s algorithm. For this, the input required is the inverted file, which has the same structure as in Figure 9.2. The second input file is a link file, which is a sequence of rows representing link information. A row consists of a parent term followed by any number of spaces and then a child term. This file is used if the link information is simply provided to the program. The user will have to set two parameters: MAXWORD, which specifies the maximum size of a term, and NUMBER-OF-LEVELS, which constrains the size of the generated hierarchy.

9.6.3 Program merge.c
This program contains routines to perform the two types of merging functions described (see Appendix 9.C). Four input files are required here: an inverted file and a link file for each thesaurus hierarchy. Their formats are as described before. Two parameters will have to be set: MAXWORD described before and THRESHOLD which specifies the minimum similarity for use in the complex merge routine.

9.7 CONCLUSION
This chapter began with an introduction to thesauri and a general description of thesaural features. Two major automatic thesaurus construction methods have been detailed. A few related issues pertinent to thesauri have not been considered here: evaluation of thesauri, maintenance of thesauri, and how to automate the usage of thesauri. The focus has been on the central issue, which is the construction of thesauri. However, these secondary issues will certainly be important in any realistic situation.

REFERENCES
AITCHISON, J., and A. GILCHRIST. 1972. Thesaurus Construction — A Practical Manual. London: ASLIB.

BOOKSTEIN, A., and D. R. SWANSON. 1974. “Probabilistic Models for Automatic Indexing.” J. American Society for Information Science, 25(5), 312-18.

CAN, F., and E. OZKARAHAN. 1985. Concepts of the Cover-Coefficient-Based Clustering Methodology. Paper presented at the Eighth International Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 204-11.

CHOUEKA, Y. 1988. Looking for Needles in a Haystack OR Locating Interesting Collocational Expressions in Large Textual Databases. Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass. 609-23.

FORSYTH, R., and R. RADA. 1986. Machine Learning — Applications in Expert Systems and Information Retrieval. West Sussex, England: Ellis Horwood Series in Artificial Intelligence.

FOX, E. A. 1981. “Lexical Relations: Enhancing Effectiveness of Information Retrieval Systems.” SIGIR Newsletter, 15(3).

FOX, E. A., J. T. NUTTER, T. AHLSWEDE, M. EVENS, and J. MARKOWITZ. 1988. Building A Large Thesaurus for Information Retrieval. Paper presented at the Second Conference on Applied Natural Language Processing. Association for Computational Linguistics, 101-08.

FOX, C. Fall 1989/Winter 1990. “A Stop List for General Text.” SIGIR Forum, 21(1-2), 19-35.

FROST, C. O. 1987. “Subject Searching in an Online Catalog.” Information Technology and Libraries, 6, 60-63.

GUNTZER, U., G. JUTTNER, G. SEEGMULLER, and F. SARRE. 1988. Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions. Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass., 588-96.

HARTER, S. P. 1975. “A Probabilistic Approach to Automatic Keyword Indexing. Parts I and II.” J. American Society for Information Science, 26, 197-206 and 280-89.

MCGILL, M. et al. 1979. An Evaluation of Factors Affecting Document Ranking by Information Retrieval Systems. Project report. Syracuse, New York: Syracuse University School of Information Studies.

SALTON, G., and M. MCGILL. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

SALTON, G., and C. S. YANG. 1973. “On the Specification of Term Values in Automatic Indexing.” Journal of Documentation, 29(4), 351-72.

SOERGEL, D. 1974. “Automatic and Semi-Automatic Methods as an Aid in the Construction of Indexing Languages and Thesauri.” Intern. Classif. 1(1), 34-39.

SRINIVASAN, P. 1990. “A Comparison of Two-Poisson, Inverse Document Frequency and Discrimination Value Models of Document Representation.” Information Processing and Management, 26(2) 269-78.

WANG, Y-C., J. VANDENDORPE, and M. EVENS. 1985. “Relationship Thesauri in Information Retrieval.” J. American Society for Information Science, 15-27.
