ispras/news-page-dataset

Dataset For Information Extraction From News Web Pages

A multilingual dataset of labeled news web pages for the information extraction task.

Dataset Description

The dataset contains news websites in six languages: Russian, English, German, Chinese, Korean, and Arabic. We labeled the news pages with attributes from the following sets:

  • For Russian: title, subtitle, publication date, modification date, text, authors, sources, categories, tags
  • For other languages: title, publication date, text, authors, tags

| Language | Sites / Pages | Statistic | Title | Text | Date | Author | Tag |
|---|---|---|---|---|---|---|---|
| ru | 112 / 722 | Sites with attribute | 110 | 112 | 110 | 54 | 49 |
| | | Pages with attribute | 712 | 716 | 708 | 262 | 332 |
| | | Nodes with attribute | 714 | 5918 | 724 | 272 | 1190 |
| en | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 4 | 2 |
| | | Pages with attribute | 500 | 499 | 499 | 147 | 98 |
| | | Nodes with attribute | 500 | 22200 | 499 | 147 | 258 |
| de | 9 / 450 | Sites with attribute | 9 | 9 | 9 | 9 | 2 |
| | | Pages with attribute | 450 | 449 | 450 | 270 | 100 |
| | | Nodes with attribute | 454 | 6847 | 600 | 308 | 336 |
| zh | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 6 | 0 |
| | | Pages with attribute | 500 | 500 | 500 | 227 | 0 |
| | | Nodes with attribute | 501 | 5872 | 500 | 277 | 0 |
| ko | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 8 | 1 |
| | | Pages with attribute | 500 | 500 | 500 | 358 | 41 |
| | | Nodes with attribute | 500 | 6898 | 550 | 409 | 155 |
| ar | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 10 | 4 |
| | | Pages with attribute | 500 | 500 | 500 | 180 | 184 |
| | | Nodes with attribute | 500 | 5752 | 550 | 274 | 648 |

Data Collection

The creation of the Russian-language part of the dataset is described in our paper. Annotators marked up the web pages in Label Studio according to the guideline.

For the other languages, we marked up nodes on pages using sitemaps created in Web Scraper.

Dataset Format

For the Russian-language part, we provide a JSON file with the following structure (Label Studio JSON-MIN format):

[
  {
    'id':
    'url':
    'html':
    'html_en':
    'agency':
    'site':
    'title':
    'annotator':
    'annotation_id':
    'created_at':
    'updated_at':
    'lead_time':
    'labels': [
      {
        'text':
        'hypertextlabels':
        'start':
        'end':
        'endOffset':
        'startOffset':
        'globalOffsets':
      },
      ...]
  },
...]

We additionally added the html_en field, which contains the page HTML translated into English.
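
As a rough illustration of how the Russian-language records might be consumed, the sketch below groups the labeled text fragments of one page by attribute name. It is not the authors' code. It assumes that `hypertextlabels` holds a list of label names (as is usual in Label Studio exports) and also tolerates a bare string; the helper names `load_records` and `labels_by_attribute`, the sample record, and the label names `title` and `text` are made up for the example.

```python
import json
from collections import defaultdict


def load_records(path):
    """Read the Russian-language JSON file: a list of page records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def labels_by_attribute(record):
    """Group the labeled text fragments of one page record by attribute name."""
    grouped = defaultdict(list)
    for label in record.get("labels", []):
        names = label.get("hypertextlabels", [])
        # Assumption: a list of label names; accept a single string as well.
        if isinstance(names, str):
            names = [names]
        for name in names:
            grouped[name].append(label.get("text", ""))
    return dict(grouped)


# Made-up record for illustration only (not taken from the dataset).
sample = {
    "id": 1,
    "url": "https://example.com/news/1",
    "labels": [
        {"text": "Example headline", "hypertextlabels": ["title"]},
        {"text": "First paragraph.", "hypertextlabels": ["text"]},
        {"text": "Second paragraph.", "hypertextlabels": ["text"]},
    ],
}

print(labels_by_attribute(sample))
```

With the real file, the same function could be applied to each element returned by `load_records`; the actual label names should be checked against the data before relying on them.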

JSON structure for the other languages:

{'site': [
  {
    'uuid':
    'url':
    'html':
    'annotations': [
      {
        'xpath':
        'text':
        'label':
      },
      ...]
  },
  ...],
...}
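
The structure above can be walked with plain dictionaries and lists. The following sketch counts, for each label, the number of sites, pages, and nodes that carry it, in the spirit of the "Sites / Pages / Nodes with attribute" statistics in the dataset description. It is an assumption-laden example rather than the script used to produce those numbers: it assumes the top-level keys are site identifiers mapping to lists of pages, and the function name `attribute_stats` and the sample data are hypothetical.

```python
from collections import defaultdict


def attribute_stats(data):
    """Return {label: {"sites": n, "pages": n, "nodes": n}} for one language file."""
    sites = defaultdict(set)   # label -> sites with at least one such node
    pages = defaultdict(int)   # label -> pages with at least one such node
    nodes = defaultdict(int)   # label -> total annotated nodes
    for site, site_pages in data.items():
        for page in site_pages:
            seen = set()
            for ann in page.get("annotations", []):
                nodes[ann["label"]] += 1
                seen.add(ann["label"])
            # Count each page (and its site) once per label it contains.
            for label in seen:
                pages[label] += 1
                sites[label].add(site)
    return {
        label: {"sites": len(sites[label]), "pages": pages[label], "nodes": nodes[label]}
        for label in nodes
    }


# Made-up data for illustration only (not taken from the dataset).
sample_data = {
    "site-a.example": [
        {"uuid": "p1", "url": "https://site-a.example/1", "html": "<html></html>",
         "annotations": [
             {"xpath": "/html/body/h1", "text": "Headline", "label": "title"},
             {"xpath": "/html/body/p[1]", "text": "Paragraph one", "label": "text"},
             {"xpath": "/html/body/p[2]", "text": "Paragraph two", "label": "text"},
         ]},
        {"uuid": "p2", "url": "https://site-a.example/2", "html": "<html></html>",
         "annotations": [
             {"xpath": "/html/body/h1", "text": "Another headline", "label": "title"},
         ]},
    ],
    "site-b.example": [
        {"uuid": "p3", "url": "https://site-b.example/1", "html": "<html></html>",
         "annotations": [
             {"xpath": "/html/body/p", "text": "Only text", "label": "text"},
         ]},
    ],
}

stats = attribute_stats(sample_data)
for label, counts in sorted(stats.items()):
    print(label, counts)
```

Counting a page once per label, regardless of how many nodes carry that label, is what separates the page count from the node count; a label spread over several paragraphs (typically the article text) therefore has far more nodes than pages.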

Download

Citation

More details about the Russian-language part of the dataset are available in our paper. Please cite us if you use or discuss this dataset in your work:

@INPROCEEDINGS{10076872,
  author={Varlamov, Maksim and Galanin, Denis and Bedrin, Pavel and Duda, Sergey and Lazarev, Vladimir and Yatskov, Alexander},
  booktitle={2022 Ivannikov Ispras Open Conference (ISPRAS)}, 
  title={A Dataset for Information Extraction from News Web Pages}, 
  year={2022},
  volume={},
  number={},
  pages={100-106},
  keywords={Annotations;Neural networks;Web pages;Data aggregation;Information retrieval;Data mining;Electronic commerce;web data extraction;information extraction;news;webpage dataset;neural networks},
  doi={10.1109/ISPRAS57371.2022.10076872}}
