Model-driven Web Page Segmentation for Non Visual Access
Abstract
Web page segmentation aims to break a large page into smaller blocks in which contents with coherent semantics are kept together. Within this context, a great many approaches have been proposed without any specific end task in mind. In this paper, we study different segmentation strategies for the task of non-visual skimming. For that purpose, we propose to segment web pages into visually coherent zones so that each zone can be represented by a set of relevant keywords that can be further synthesized into concurrent speech. As a consequence, we consider web page segmentation as a clustering problem over visual elements, where (1) a fixed number of clusters must be discovered, (2) the elements of a cluster should be visually connected, and (3) all visual elements must be clustered. Therefore, we study variations of three existing algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. In particular, we evaluate different reading strategies for positioning the K initial seeds, as well as a pre-clustering methodology for the Guided Expansion algorithm, whose goal is to (1) speed up the clustering process and (2) reduce the imbalance between clusters. The evaluation shows that the Guided Expansion algorithm achieves statistically significant improvements over the two other algorithms across the variations of the reading strategies. Nevertheless, further improvements are still needed to increase separateness.
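To make the clustering formulation concrete, here is a minimal sketch of the simplest of the three settings: plain K-means over the centre points of visual elements, with the K seeds placed along a reading path. It is an illustration under stated assumptions, not the method evaluated in the paper. The function names `diagonal_seeds` and `kmeans_segment`, the diagonal placement rule, and the use of Euclidean distance between centre points are all hypothetical choices made for this example. Plain K-means assigns every element to a zone, but by itself it does not enforce visual connectivity, and a cluster can end up empty.

```python
from math import hypot


def diagonal_seeds(width, height, k):
    """Place k seeds evenly along the top-left to bottom-right diagonal.

    Hypothetical stand-in for a "reading strategy"; the strategies
    studied in the paper are not specified in the abstract.
    """
    return [((i + 0.5) * width / k, (i + 0.5) * height / k) for i in range(k)]


def kmeans_segment(centers, seeds, iterations=20):
    """Assign each element centre (x, y) to one of len(seeds) zones.

    Standard K-means: alternate between assigning every element to its
    nearest centroid and moving each centroid to the mean of its members.
    """
    centroids = list(seeds)
    labels = [0] * len(centers)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(
                range(len(centroids)),
                key=lambda j: hypot(x - centroids[j][0], y - centroids[j][1]),
            )
            for x, y in centers
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [c for c, lab in zip(centers, labels) if lab == j]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels
```

Calling `kmeans_segment(centers, diagonal_seeds(page_width, page_height, k))` returns one zone index per element. Swapping `diagonal_seeds` for another placement function is the point at which a different reading strategy would plug in.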
Main file
PACLING_2019_Model-Driven-Web-Page-Segmentation.pdf (385 KB)
Origin: Files produced by the author(s)