Approaches to Automatic Text Structuring

by Nicolai Erbs
Abstract:
Structured text helps readers to better understand the content of documents. In classic newspaper texts or books, some structure already exists. In the Web 2.0, the amount of textual data, especially user-generated data, has increased dramatically. As a result, there exists a large amount of textual data which lacks structure, thus making it more difficult to understand. In this thesis, we will explore techniques for automatic text structuring to help readers to fulfill their information needs. Useful techniques for automatic text structuring are keyphrase identification, table-of-contents generation, and link identification. We improve state-of-the-art results for approaches to text structuring on several benchmark datasets. In addition, we present new representative datasets for users’ everyday tasks. We evaluate the quality of text structuring approaches with regard to these scenarios and discover that the quality of approaches highly depends on the dataset on which they are applied.
In the first chapter of this thesis, we establish the theoretical foundations regarding text structuring. We describe our findings from a user survey regarding web usage from which we derive three typical scenarios of Internet users. We then proceed to the three main contributions of this thesis.
We evaluate approaches to keyphrase identification both by extracting and assigning keyphrases for English and German datasets. We find that unsupervised keyphrase extraction yields stable results, but for datasets with predefined keyphrases, additional filtering of keyphrases and assignment approaches yields even higher results. We present a decompounding extension, which further improves results for datasets with shorter texts.
We construct hierarchical tables of contents of documents for three English datasets and discover that the results for hierarchy identification are sufficient for an automatic system, but for segment title generation, user interaction based on suggestions is required.
We investigate approaches to link identification, including the subtasks of identifying the mention (anchor) of the link and linking the mention to an entity (target). Approaches that make use of the Wikipedia link structure perform best, as long as there is sufficient training data available. For identifying links to sense inventories other than Wikipedia, approaches that do not make use of the link structure outperform the approaches using existing links. We further analyze the effect of senses on computing similarities. In contrast to entity linking, where most entities can be discriminated by their name, we consider cases where multiple entities with the same name exist. We discover that similarity depends on the selected sense inventory.
To foster future evaluation of natural language processing components for text structuring, we present two prototypes of text structuring systems, which integrate techniques for automatic text structuring in a wiki setting and in an e-learning setting with eBooks.
Reference:
Approaches to Automatic Text Structuring (Nicolai Erbs), PhD thesis, Technische Universität Darmstadt, 2015.
Bibtex Entry:
@phdthesis{TUD-CS-2015228,
	Abstract = {Structured text helps readers to better understand the content of
documents. In classic newspaper texts or books, some structure already
exists. In the Web 2.0, the amount of textual data, especially
user-generated data, has increased dramatically. As a result, there exists
a large amount of textual data which lacks structure, thus making it more
difficult to understand. In this thesis, we will explore techniques for
automatic text structuring to help readers to fulfill their information
needs. Useful techniques for automatic text structuring are keyphrase
identification, table-of-contents generation, and link identification. We
improve state-of-the-art results for approaches to text structuring on
several benchmark datasets. In addition, we present new representative
datasets for users' everyday tasks. We evaluate the quality of text
structuring approaches with regard to these scenarios and discover that the
quality of approaches highly depends on the dataset on which they are
applied.
 In the first chapter of this thesis, we establish the theoretical
foundations regarding text structuring. We describe our findings from a
user survey regarding web usage from which we derive three typical
scenarios of Internet users. We then proceed to the three main
contributions of this thesis.
 We evaluate approaches to keyphrase identification both by extracting and
assigning keyphrases for English and German datasets. We find that
unsupervised keyphrase extraction yields stable results, but for datasets
with predefined keyphrases, additional filtering of keyphrases and
assignment approaches yields even higher results. We present a
decompounding extension, which further improves results for datasets with
shorter texts.
 We construct hierarchical tables of contents of documents for three English
datasets and discover that the results for hierarchy identification are
sufficient for an automatic system, but for segment title generation, user
interaction based on suggestions is required.
 We investigate approaches to link identification, including the subtasks
of identifying the mention (anchor) of the link and linking the mention to
an entity (target). Approaches that make use of the Wikipedia link
structure perform best, as long as there is sufficient training data
available. For identifying links to sense inventories other than Wikipedia,
approaches that do not make use of the link structure outperform the
approaches using existing links. We further analyze the effect of senses on
computing similarities. In contrast to entity linking, where most entities
can be discriminated by their name, we consider cases where multiple
entities with the same name exist. We discover that similarity depends on
the selected sense inventory.
 To foster future evaluation of natural language processing components for
text structuring, we present two prototypes of text structuring systems,
which integrate techniques for automatic text structuring in a wiki setting
and in an e-learning setting with eBooks.},
	Address = {Darmstadt},
	Author = {Nicolai Erbs},
	Date-Added = {2016-07-12 14:57:09 +0000},
	Date-Modified = {2016-07-12 15:03:24 +0000},
	Month = sep,
	Pubkey = {TUD-CS-2015-1228},
	Research_Area = {Ubiquitous Knowledge Processing},
	Research_Sub_Area = {UKP_s_JWPL, UKP_s_DKPro_Similarity, UKP_s_DKPro_Core, UKP_p_WIKULU, UKP_p_WIWEB, UKP_p_openwindow, UKP_p_DKPro, UKP_a_NLP4Wikis, UKP_a_ENLP},
	School = {Technische Universit{\"a}t Darmstadt},
	Title = {Approaches to Automatic Text Structuring},
	Type = {Dissertation},
	Website = {http://tuprints.ulb.tu-darmstadt.de/4959/},
	Year = {2015}}
