refactoring & renaming
jakopako committed Mar 5, 2022
1 parent 1603006 commit 5214234
Showing 7 changed files with 623 additions and 596 deletions.
34 changes: 17 additions & 17 deletions README.md
@@ -1,9 +1,9 @@
-# Go crawler
+# goskyr

-[![Release](https://img.shields.io/github/release/jakopako/go-crawler.svg?style=flat-square)](https://github.com/jakopako/go-crawler/releases)
-[![Go Report Card](https://goreportcard.com/badge/github.com/jakopako/event-crawler)](https://goreportcard.com/report/github.com/jakopako/event-crawler)
+[![Release](https://img.shields.io/github/release/jakopako/goskyr.svg?style=flat-square)](https://github.com/jakopako/goskyr/releases)
+[![Go Report Card](https://goreportcard.com/badge/github.com/jakopako/goskyr)](https://goreportcard.com/report/github.com/jakopako/goskyr)

-This project's goal is to make it easier to extract structured data from web pages. Initially, the main use case was to extract event data from
+This project's goal is to make it easier to scrape structured data from web pages. Initially, the main use case was to extract event data from
different venue websites. However, the code has been rewritten to handle a more general use case of extracting a list of items from any website.
This could be a list of books from an online book store, a list of plays in a public theater, a list of newspaper articles, etc. Currently, information can only be extracted from static websites.

@@ -18,10 +18,10 @@ Similar projects:

## Installation

-Download the go-crawler by running
+Download goskyr by running

```bash
-go install github.com/jakopako/go-crawler@latest
+go install github.com/jakopako/goskyr@latest
```

Or clone the repository and then run with `go run main.go ...` or build it yourself.
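For the clone route, a minimal sketch using standard git and Go commands (the config file name is just a placeholder):

```bash
git clone https://github.com/jakopako/goskyr.git
cd goskyr
go run main.go -config your-config.yml   # run directly from source
# or build a binary and run that instead
go build
./goskyr -config your-config.yml
```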
@@ -31,7 +31,7 @@ Or clone the repository and then run with `go run main.go ...` or build it yours
A very simple configuration would look something like this:

```yml
-crawlers:
+scrapers:
- name: LifeQuotes # The name is only for logging and does not appear in the json output.
url: "https://www.goodreads.com/quotes/tag/life"
item: ".quote"
@@ -45,7 +45,7 @@ crawlers:
selector: ".authorOrTitle"
```
-Save this to a file, e.g. `quotes-config.yml` and run `go-crawler -config quotes-config.yml` (or `go run main.go -config quotes-config.yml`) to retrieve the scraped quotes as a JSON string. The result should look something like this:
+Save this to a file, e.g. `quotes-config.yml` and run `goskyr -config quotes-config.yml` (or `go run main.go -config quotes-config.yml`) to retrieve the scraped quotes as a JSON string. The result should look something like this:

```json
[
@@ -64,7 +64,7 @@ Save this to a file, e.g. `quotes-config.yml` and run `go-crawler -config quotes
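The collapsed lines above hide the rest of the output, but each item is a flat JSON object with one key per configured field. A rough sketch of the shape (the field names `author` and `quote` are assumed from the config, values elided):

```json
[
  {
    "author": "...",
    "quote": "..."
  }
]
```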
A more complex configuration might look like this:

```yml
-crawlers:
+scrapers:
- name: Kaufleuten
url: "https://kaufleuten.ch/events/kultur/konzerte/"
item: ".event"
@@ -138,11 +138,11 @@ The result should look something like this:
]
```

-Basically, a config file contains a list of crawlers that each may have static and/or dynamic fields. Additionally, items can be filtered based on regular expressions and pagination is also supported. The resulting array of items is returned to stdout as a JSON string. TODO: support writing other outputs, e.g. mongodb.
+Basically, a config file contains a list of scrapers that each may have static and/or dynamic fields. Additionally, items can be filtered based on regular expressions and pagination is also supported. The resulting array of items is returned to stdout as a JSON string. TODO: support writing other outputs, e.g. mongodb.

### Static fields

-Each crawler can define a number of static fields. Those fields are the same over all returned items. For the event crawling use case this might be the location name as shown in the example above. For a static field only a name and a value need to be defined:
+Each scraper can define a number of static fields. Those fields are the same over all returned items. For the event crawling use case this might be the location name as shown in the example above. For a static field only a name and a value need to be defined:

```yml
fields:
@@ -153,7 +153,7 @@ Each crawler can define a number of static fields. Those fields are the same ove
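The remainder of this block is collapsed above. A minimal sketch of a static field definition — the nesting under a `static` key is an assumption, and the values are only illustrative:

```yml
fields:
  static:
    - name: "location"      # key under which the value appears in every item
      value: "Kaufleuten"   # same value for all items of this scraper
```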

### Dynamic fields

-Dynamic fields are a little more complex as their values are extracted from the webpage and can have different types. In the most trivial case it suffices to define a field name and a selector so the crawler knows where to look for the corresponding value. The quotes crawler is a good example of that:
+Dynamic fields are a little more complex as their values are extracted from the webpage and can have different types. In the most trivial case it suffices to define a field name and a selector so the scraper knows where to look for the corresponding value. The quotes scraper is a good example of that:

```yml
fields:
@@ -165,7 +165,7 @@ Dynamic fields are a little more complex as their values are extracted from the
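Again the block is collapsed above. A minimal sketch of the two dynamic fields of the quotes scraper — the `dynamic` key and the `.quoteText` selector are assumptions, while `.authorOrTitle` is taken from the config shown earlier:

```yml
fields:
  dynamic:
    - name: "quote"
      location:
        selector: ".quoteText"    # assumed selector for the quote text
    - name: "author"
      location:
        selector: ".authorOrTitle"
```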

**Key: `location`**

-However, it might be a bit more complex to extract the desired information. Take for instance the concert crawler configuration shown above, more specifically the config snippet for the `title` field.
+However, it might be a bit more complex to extract the desired information. Take for instance the concert scraper configuration shown above, more specifically the config snippet for the `title` field.

```yml
fields:
@@ -178,7 +178,7 @@ However, it might be a bit more complex to extract the desired information. Take
index: 0
```

-This field is implicitly of type `text`. Other types, such as `url` or `date`, would have to be configured with the keyword `type`. The `location` tells the crawler where to look for the field value and how to extract it. In this case the selector on its own would not be enough to extract the desired value as we would get something like this: `Bastian Baker • Konzert`. That's why there is an extra option to define a regular expression to extract a substring. Note that in this example our extracted string would still contain a trailing space which is automatically removed by the crawler. Let's have a look at two more examples to get a better understanding of the location configuration. Let's say we want to extract "Tonhalle-Orchester Zürich" from the following html snippet.
+This field is implicitly of type `text`. Other types, such as `url` or `date`, would have to be configured with the keyword `type`. The `location` tells the scraper where to look for the field value and how to extract it. In this case the selector on its own would not be enough to extract the desired value as we would get something like this: `Bastian Baker • Konzert`. That's why there is an extra option to define a regular expression to extract a substring. Note that in this example our extracted string would still contain a trailing space which is automatically removed by the scraper. Let's have a look at two more examples to get a better understanding of the location configuration. Let's say we want to extract "Tonhalle-Orchester Zürich" from the following html snippet.

```html
<div class="member">
@@ -236,7 +236,7 @@ location:

Here, the selector is not enough to extract the desired string and we can't go further down the tree by using different selectors. With the `child_index` we can point to the exact string we want. A `child_index` of 0 would point to the first `<strong>` node, a `child_index` of 1 would point to the string containing "19h00", a `child_index` of 2 would point to the second `<strong>` node and finally a `child_index` of 3 points to the correct string. If `child_index` is set to -1 the first child that results in a regex match will be used. This can be useful if the `child_index` varies across different items. In the current example, however, the `child_index` is always the same but the string still contains more than we need, which is why we use a regular expression to extract the desired substring.
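Putting `child_index` and the regular expression together, such a location block could look roughly like this. Only `location`, `selector`, `child_index` and `index` appear verbatim in the snippets above; the `regex_extract` and `exp` key names, the selector and the concrete values are illustrative assumptions:

```yml
location:
  selector: ".event .details"   # CSS selector pointing at the parent node (assumed)
  child_index: 3                # take the text of the 4th child node; -1 = first child with a regex match
  regex_extract:                # key name assumed
    exp: "[^•]+"                # regular expression applied to the extracted text (assumed)
    index: 0                    # which regex match to keep
```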

-To get an even better feeling for the location configuration check out the numerous examples in the `concerts-crawler.yml` file.
+To get an even better feeling for the location configuration check out the numerous examples in the `concerts-config.yml` file.

**Key: `can_be_empty`**

@@ -265,7 +265,7 @@ A dynamic field has a field type that can either be `text`, `url` or `date`. The

* `url`

-A url has one additional boolean option: `relative`. This option determines whether this crawler's base url will be prepended to the string that has been extracted with the given `location`.
+A url has one additional boolean option: `relative`. This option determines whether this scraper's base url will be prepended to the string that has been extracted with the given `location`.
* `date`

A date field is different from a text field in that the result is a complete, valid date. Internally, this is a `time.Time` object but in the json output it is represented by a string. To handle the many cases where date information is spread across different locations, formatted in different ways, or written in different languages, a date field has a list of components where each component looks like this:
@@ -283,4 +283,4 @@ A dynamic field has a field type that can either be `text`, `url` or `date`. The
layout: "<layout>"
```

-As can be seen, a component has to define which part of the date it covers (at least one part has to be covered). Next, the location of this component has to be defined. This is done the same way as we defined the location for a text field string. Finally, we need to define the layout which is done the 'go way' as this crawler is written in Go. For more details check out [this](https://yourbasic.org/golang/format-parse-string-time-date-example/) link or have a look at the numerous examples in the `concerts-config.yml` file.
+As can be seen, a component has to define which part of the date it covers (at least one part has to be covered). Next, the location of this component has to be defined. This is done the same way as we defined the location for a text field string. Finally, we need to define the layout which is done the 'go way' as this scraper is written in Go. For more details check out [this](https://yourbasic.org/golang/format-parse-string-time-date-example/) link or have a look at the numerous examples in the `concerts-config.yml` file.
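To tie the field types together, here is a hedged sketch of a `url` field using `relative` and a `date` field built from two components. Only `type`, `relative`, `location`, `selector` and `layout` are named in the text or snippets above; the `components` and `covers` key names, the selectors and the layout values are assumptions for illustration. The layout strings follow Go's reference time (`Mon Jan 2 15:04:05 MST 2006`), so a layout like `15h04` would parse a time such as `19h00`.

```yml
fields:
  dynamic:
    - name: "url"
      type: "url"
      relative: true              # prepend this scraper's base url to the extracted string
      location:
        selector: ".event a"      # assumed selector
    - name: "date"
      type: "date"
      components:                 # key name assumed
        - covers:                 # which parts of the date this component contains (assumed)
            day: true
            month: true
          location:
            selector: ".event .date"
          layout: "2. January"    # Go-style layout, parses e.g. "5. March"
        - covers:
            time: true
          location:
            selector: ".event .time"
          layout: "15h04"         # parses e.g. "19h00"
```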
4 changes: 2 additions & 2 deletions concerts-config.yml
@@ -1,6 +1,6 @@
-# Crawler configs are grouped by city and cities are sorted alphabetically.
+# Scraper configs are grouped by city and cities are sorted alphabetically.

-crawlers:
+scrapers:
##########
# Bologna
##########