While many articles propose extraction methods, few relate to the context of software engineering. As such, the first part of this section presents recent methods for extracting concepts or keywords from text with no specific usage in mind. The second part reviews keyword extraction processes which target the software domain. Each method is described in terms of the type of document used as the extraction source, the scope of usage of the keywords, the published performance of the method and the language of application.
Generic keywords extraction methods
Relevant term detection usually uses external resources to give hints to the different tools used in the extraction process about what constitutes an expression or a term worth retaining. Applications working in domains traditionally covered by information extraction (IE), like news, finance, health and biology, may use some of the numerous datasets, ontologies and lexical and semantic resources available for these specific domains. Lexico-semantic repositories like WordNet (Fellbaum, 1998) and EuroWordNet (Vossen, 1998), which contain polysemous terms with their respective contextual examples, may be used to detect, disambiguate and link textual manifestations of relevant concepts, while upper and mid-level ontologies like OpenCyc, SUMO and BFO (Basic Formal Ontology) can be used for term categorization, semantic linking and context identification.
This opportunity is not a given in more specialized multi-domain organizations for which no adapted resources are available or adaptable. They have to rely on linguistic and statistical methods to differentiate between high-value knowledge and "filling". One such application is Textractor (Ittoo et al., 2010), which uses a pipeline of algorithms and techniques for language detection, noise reduction, candidate concept identification and extraction. Its source documents were irregular text descriptions of customer complaints from help desks and of repair actions by service engineers. The authors evaluated the resulting concept list with domain experts and achieved a precision score of 91.7%.
Other systems, like the International Enterprise Intelligence prototype application (Maynard et al., 2007) developed for the European Musing project, rely on the repetition of predicted link manifestations in text as their basis for robust and high-confidence results. They use pattern-based extractors which target specific knowledge items like company contact information, managerial structures, activity domain, imported and exported products, shareholders and so on. This type of extraction method expects a precise type of knowledge and structure, so any piece of information which does not fall into a predefined pattern is not recovered. They use domain ontologies to mine business documents for numerical data from various sources, as well as a table extraction process to obtain this information from semi-structured documents. This is done to support business intelligence, which needs to assess financial risk and operational risk factors, follow trends or perform credit risk management. The extraction pipeline achieves an F-measure of 84% on twelve types of specific knowledge units. The precision ranges from 50.0% to 100.0%, while the recall fluctuates between 66.7% and 100.0%, depending on the type.
The single-document extraction system RAKE (Rose et al., 2010) is described as an unsupervised, domain-independent and language-independent method for extracting keywords from individual documents, which makes use of stop words to delimit either single-word or multiword concepts. It uses stop lists based on term frequency and keyword adjacency to boost the extraction power of the process. Applied to a set of abstracts from scientific articles, it achieves a domain keyword extraction precision of 33.7% with a recall of 41.5%, which amounts to an F-measure of 37.2%. Despite the claim of language independence, languages like French, Spanish or Italian make extensive use of stop words inside multiword concepts, which limits the keyword extraction potential of this system to single-word concepts and truncated multiword concepts. Some parallel research has been carried out specifically on multiword expression extraction in French because of the inherent difficulty of detecting such expressions (Green et al., 2010).
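The stop-word delimiting idea described above can be sketched as follows. This is a minimal illustration, not the RAKE implementation: the tiny stop list, the function names and the tokenization rules are assumptions made for the example, and the degree-to-frequency word score is the scoring commonly associated with RAKE rather than something detailed in the text above.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption; real stop lists are much larger).
STOP_WORDS = {"a", "an", "and", "are", "as", "for", "from", "in", "is",
              "of", "on", "the", "to", "which", "with"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text at punctuation and stop words; keep the remaining word runs."""
    phrases = []
    for fragment in re.split(r'[.,;:!?()\[\]"]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9'-]*", fragment):
            if word in stop_words:
                # A stop word closes the phrase being built, if any.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def score_phrases(phrases):
    """Score each phrase as the sum of degree/frequency over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Co-occurrences within the phrase, counting the word itself.
            degree[word] += len(phrase)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p)
            for p in set(phrases)}


text = "Extraction of keywords from the text of software requirements"
scores = score_phrases(candidate_phrases(text))
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(phrase, score)

# With a French stop list containing "de", a multiword concept is truncated.
print(candidate_phrases("pomme de terre", stop_words={"de"}))
```

In the last call, the French expression "pomme de terre" is split into two single-word candidates because its internal "de" is a stop word, which is the limitation discussed above for languages that embed stop words in multiword concepts.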
Software engineering centered extraction systems
While these systems and methods are relevant for knowledge extraction from unstructured and semi-structured document sources, little research applies them to software engineering. Nonetheless, some systems target natural language documents from software projects to extract domain knowledge and help generate software artefacts or models. The document sources are, for most of this research, software requirements documents which have been created beforehand by a human agent. Compared to internal business documents, these documents feature limited and less ambiguous domain terminology, hand-picked domain concepts directly relevant to the software project and usually a more structured and hierarchical logical flow.
One of these systems has been designed to take software requirements written in natural language and generate UML diagrams (Deeptimahanti and Sanyal, 2011), covering the semi-automatic generation of class diagrams, collaboration diagrams and use-case diagrams. The goal of this contribution is to limit the need for expert interaction and knowledge in the modeling process. WordNet, anaphora resolution and sentence rewriting were applied to simplify the sentences in software requirements documents and ease the downstream parsing. The system then uses sentence structure patterns to extract actors, classes, attributes, methods and associations; for collaboration diagrams, sentence subjects are considered sender objects, sentence objects are the receivers, and so on. While the system seems complete from the diagram generation point of view, no evaluation of its output against a gold standard was reported.
Another approach, the Dowser tool of Popescu et al. (2008), applies knowledge extraction methods to reduce ambiguities in requirements specifications in order to help the review process of software requirements specification documents. It is designed to help the software expert detect inconsistencies and contradictions by automatically producing an object-oriented model from the natural language specifications. The system extracts the domain-specific terms by noun phrase chunking, performed after a part-of-speech tagging step. The authors consider that having a human agent in the loop for the model evaluation removes the need to obtain high recall and precision from the extraction process. Nonetheless, they evaluated their approach on the intro man page of the Cygwin environment for domain-specific term extraction and achieved a recall score of 78.8% on the exact document and of 88.46% after rewriting some of the sentences, with a precision of 86.79%, which translates into a combined F-measure of 87.61%. Two other use cases were used to test the system, but no evaluation scores were published.
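The noun phrase chunking step mentioned above can be illustrated with a minimal sketch. It is not Dowser's chunker, whose rules are not described here: the function name is hypothetical, the tokens are assumed to be already tagged with Penn Treebank style tags (supplied by hand in the example rather than produced by a tagger), and a chunk is simply taken to be a maximal run of adjectives and nouns that ends in a noun.

```python
def noun_phrase_chunks(tagged_tokens):
    """Return maximal adjective/noun runs that end in a noun.

    tagged_tokens is a list of (word, tag) pairs using Penn Treebank
    style tags (JJ* for adjectives, NN* for nouns); this tag set is an
    assumption of the sketch.
    """
    chunks = []
    current = []
    # A sentinel token flushes the last run at the end of the input.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives so it ends in a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


sentence = [("The", "DT"), ("steam", "NN"), ("boiler", "NN"),
            ("controller", "NN"), ("monitors", "VBZ"), ("the", "DT"),
            ("current", "JJ"), ("water", "NN"), ("level", "NN"), (".", ".")]
print(noun_phrase_chunks(sentence))
```

Each chunk returned this way is only a candidate domain term; in the approach described above, a human reviewer remains in the loop to judge the resulting model.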
Finally, in Kof (2010), different term extraction heuristics are compared as part of the process of creating message sequence charts in UML. The author explores the influence of named entity recognition, in isolation and combined with sentence structure extraction, on two case studies detailing the specifications of an instrument cluster from a car dashboard and of a steam boiler controller. A precision and recall score of 95% was attained on the extraction of 20 concepts from the steam boiler specifications using named entity recognition, with one wrong concept produced and only one concept missed.
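The scores quoted in this section follow the usual definitions of precision, recall and balanced F-measure. The short sketch below shows the arithmetic; the function names and the placeholder concept labels are assumptions made for illustration, and the 20-concept example only mirrors the counts reported above (one wrong concept, one missed concept), not the actual concepts of the case study.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(extracted, gold):
    """Compare a set of extracted concepts against a gold standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Placeholder labels: 20 gold concepts, 19 of them found, plus 1 wrong one.
gold = [f"concept_{i}" for i in range(20)]
extracted = gold[:19] + ["wrong_concept"]
p, r, f = precision_recall_f(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")

# Reproducing the F-measure reported for RAKE from its precision and recall.
print(round(f_measure(33.7, 41.5), 1))
```

With 19 correct concepts out of 20 extracted and 20 expected, both precision and recall are 19/20, which matches the 95% figure; the second call gives the 37.2% F-measure reported for RAKE.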
These combined research efforts present the following limitations:
• Limitation 1: The methods are applied to specialized documents.
• Limitation 2: The methods make use of many advanced linguistic tools, resources or human intervention.
• Limitation 3: There is no annotated corpus publicly available to be used as a gold standard, so methods cannot be compared on the basis of their performance when applied on the same resource.
Our research differs from these previous systems in several respects. First, the concept extraction process is applied to business documents written as part of the software project, which considerably increases the difficulty of the extraction task. The process is also evaluated using a gold standard specifically targeted at knowledge extraction for software engineering. Finally, the evaluation is done using the combined views of multiple experts instead of a single-expert approach.
Table of contents
INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 Concept extraction systems and methods
1.1.1 Generic keywords extraction methods
1.1.2 Software engineering centered extraction systems
1.2 Multiword expression extraction
1.3 Acronym detection for candidate
CHAPTER 2 ARTICLE I: CONCEPT EXTRACTION FROM BUSINESS DOCUMENTS FOR SOFTWARE ENGINEERING PROJECTS
2.1 Introduction
2.1.1 Motivation
2.1.2 Context of application
2.2 Automated concept extraction
2.3 Proposed approach
2.3.1 Gold standard definition
2.3.2 Evaluation metrics
2.4 Detailed extraction process
2.4.1 Implementation overview
2.4.2 Candidate extraction
2.4.3 Inconsistency filter
2.4.4 Stop list exclusion
2.4.5 Acronym resolution
2.4.6 Complex multiword expressions detection
2.4.7 Dictionary validation
2.4.8 Web validation
2.4.9 Structure detection and analysis
2.4.10 Relevance ordering algorithm
2.5 Evaluation
2.5.1 Experimental setup
2.5.2 Baseline comparison
2.6 Results
2.6.1 Candidate extraction
2.6.2 Candidates ordering
2.7 Conclusion
CHAPTER 3 ARTICLE II: HYBRID EXTRACTION METHOD FOR FRENCH COMPLEX NOMINAL MULTIWORD EXPRESSIONS
3.1 Introduction
3.2 Overview
3.2.1 Context
3.2.2 Examples
3.2.3 Challenges and issues
3.2.4 Uses and proposed method
3.3 Details of the proposed method
3.3.1 Acquisition of ngrams
3.3.2 Semilattice construction
3.3.3 Reduction of semilattices
3.4 Evaluation
3.4.1 Corpora
3.4.2 Baseline
3.4.3 Results
3.5 Conclusion
CHAPTER 4 ARTICLE III: CLASSIFIER-BASED ACRONYM EXTRACTION FOR BUSINESS DOCUMENTS
4.1 Introduction
4.2 Background
4.2.1 Definition
4.2.2 Morphology
4.2.3 Context of use
4.3 Specific properties of business documents
4.3.1 Main differences
4.3.2 Extraction challenges
4.3.3 Special cases
4.4 Methodology
4.4.1 Overview
4.4.2 Preparation step
4.4.3 Generic acronym repository
4.4.4 Short-form identification
4.4.5 SF-LF candidate extraction
4.5 Candidate SF-LF evaluation
4.5.1 Structural features
4.5.2 Similarity features
4.5.3 Cleanup step
4.6 Experiment
4.6.1 Parameter and score template definition
4.6.2 Corpora
4.6.3 Classifier training
4.7 Results
4.8 Future work
4.9 Conclusion
CHAPTER 5 GENERAL DISCUSSION
5.1 Concept extraction from business documents for software engineering projects
5.1.1 Performance and usage
5.1.2 Extraction issues
5.1.3 Evaluation method
5.1.4 Gold standard
5.2 Hybrid extraction method for French complex nominal multiword expressions
5.2.1 Performance
5.2.2 Sources of errors
5.3 Classifier-based acronym extraction for business documents
5.3.1 Performance and features
5.3.2 Limitations
GENERAL CONCLUSION