improved readme

jakopako committed Mar 4, 2022
1 parent 1b9dc13 commit 9f5aead

Showing 7 changed files with 251 additions and 152 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -27,7 +27,7 @@ jobs:

      # Runs a single command using the runners shell
      - name: Run the crawler
-       run: go run main.go -store
+       run: go run main.go -config concerts-config.yml -store
        env:
          API_USER: ${{ secrets.API_USER }}
          API_PASSWORD: ${{ secrets.API_PASSWORD }}
238 changes: 234 additions & 4 deletions REAME.md
@@ -13,17 +13,247 @@ Similar projects:
* [slotix/dataflowkit](https://github.com/slotix/dataflowkit)
* [andrewstuart/goq](https://github.com/andrewstuart/goq)

## Configuration & Usage

Check out the `example-config.yml` for details about how to configure the crawler. Basically, an extracted item can have static fields that are the same for each item and dynamic fields whose values are extracted from the respective website based on the given configuration.

A very simple configuration would look something like this:
```yaml
crawlers:
  - name: LifeQuotes
    url: "https://www.goodreads.com/quotes/tag/life"
    item: ".quote"
    fields:
      dynamic:
        - name: "quote"
          location:
            selector: ".quoteText"
        - name: "author"
          location:
            selector: ".authorOrTitle"
```

Save this to a file, e.g. `quotes-config.yml`, and run `go run main.go -config quotes-config.yml` (or `./crawler -config quotes-config.yml`) to retrieve the scraped quotes as a JSON string. The result should look something like this:

```json
[
  {
    "author": "Marilyn Monroe",
    "quote": "“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”"
  },
  {
    "author": "William W. Purkey",
    "quote": "“You've gotta dance like there's nobody watching,"
  },
  ...
]
```
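
For reference, building a binary first (as the `./crawler` invocation above suggests) might look like this; the binary name is arbitrary:

```bash
go build -o crawler .
./crawler -config quotes-config.yml
```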

A more complex configuration might look like this:

```yaml
crawlers:
  - name: Kaufleuten
    url: "https://kaufleuten.ch/events/kultur/konzerte/"
    item: ".event"
    fields:
      static:
        - name: "location"
          value: "Kaufleuten"
        - name: "city"
          value: "Zurich"
        - name: "type"
          value: "concert"
      dynamic:
        - name: "title"
          location:
            selector: "h3"
            regex_extract:
              exp: "[^•]*"
              index: 0
        - name: "comment"
          can_be_empty: true
          location:
            selector: ".subtitle strong"
        - name: "url"
          type: "url"
          location:
            selector: ".event-link"
        - name: "date"
          type: "date"
          on_subpage: "url"
          components:
            - covers:
                day: true
                month: true
                year: true
                time: true
              location:
                selector: ".event-meta time"
                attr: "datetime"
              layout: "2006-01-02T15:04:05-07:00"
          date_location: "Europe/Berlin"
    filters:
      - field: "title"
        regex_ignore: "Verschoben.*"
      - field: "title"
        regex_ignore: "Abgesagt.*"
```

The result should look something like this:

```json
[
  {
    "city": "Zurich",
    "comment": "Der Schweizer Singer-Songwriter, mit Gitarre und bekannten sowie neuen Songs",
    "date": "2022-03-09T19:00:00+01:00",
    "location": "Kaufleuten",
    "title": "Bastian Baker",
    "type": "concert",
    "url": "https://kaufleuten.ch/event/bastian-baker/"
  },
  {
    "city": "Zurich",
    "comment": "Der kanadische Elektro-Star meldet sich mit neuem Album zurück",
    "date": "2022-03-13T19:00:00+01:00",
    "location": "Kaufleuten",
    "title": "Caribou",
    "type": "concert",
    "url": "https://kaufleuten.ch/event/caribou/"
  },
  ...
]
```

Basically, a config file contains a list of crawlers that each may have static and / or dynamic fields. Additionally, items can be filtered based on regular expressions, and pagination is also supported. The resulting array of items is returned to stdout as a JSON string. TODO: support writing other outputs, e.g. mongodb.
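
Schematically, a config file therefore has the following shape. This skeleton only uses keys that appear in the examples in this README, the values are placeholders, and the nesting of `filters` is inferred from the Kaufleuten example:

```yaml
crawlers:
  - name: "SomeCrawler"            # one entry per crawler
    url: "https://example.com"     # the page to crawl
    item: ".some-item-selector"    # selector matching one item
    fields:
      static:                      # identical key/value pairs on every item
        - name: "some-static-field"
          value: "some-value"
      dynamic:                     # values extracted from the page per item
        - name: "some-dynamic-field"
          location:
            selector: ".some-selector"
    filters:                       # optional: drop items whose field value matches
      - field: "some-dynamic-field"
        regex_ignore: "some-regex"
```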

### Static fields

Each crawler can define a number of static fields. Those fields are the same across all returned items. For the event crawling use case this might be the location name, as shown in the example above. For a static field, only a name and a value need to be defined:

```yaml
fields:
  static:
    - name: "location"
      value: "Kaufleuten"
```
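
For illustration, every item returned by the Kaufleuten crawler therefore carries the same static key/value pairs next to the dynamically extracted ones (compare the output example above):

```json
{
  "city": "Zurich",
  "location": "Kaufleuten",
  "type": "concert",
  "title": "Bastian Baker",
  ...
}
```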

### Dynamic fields

#### Field types
Dynamic fields are a little more complex as their values are extracted from the webpage and can have different types. In the most trivial case it suffices to define a field name and a selector so that the crawler knows where to look for the corresponding value. The quotes crawler is a good example of that:

```yaml
fields:
  dynamic:
    - name: "quote"
      location:
        selector: ".quoteText"
```

**Key: `location`**

However, it might be a bit more complex to extract the desired information. Take for instance the concert crawler configuration shown above, more specifically the config snippet for the `title` field.

```yaml
fields:
  dynamic:
    - name: "title"
      location:
        selector: "h3"
        regex_extract:
          exp: "[^•]*"
          index: 0
```

This field is implicitly of type `text`. Other types, such as `url` or `date`, would have to be configured with the keyword `type`. The `location` tells the crawler where to look for the field value and how to extract it. In this case the selector on its own would not be enough to extract the desired value, as we would get something like this: `Bastian Baker • Konzert`. That's why there is an extra option to define a regular expression to extract a substring. Note that in this example the extracted string would still contain a trailing space, which is automatically removed by the crawler. Let's look at two more examples to get a better understanding of the location configuration. Say we want to extract "Tonhalle-Orchester Zürich" from the following html snippet:

```html
<div class="member">
  <span class="member-name"></span>
  <span class="member-name"> Tonhalle-Orchester Zürich</span><span class="member-function">, </span>
  <span class="member-name"> Yi-Chen Lin</span><span class="member-function"> Leitung und Konzept,</span>
  <span class="composer">
    Der Feuervogel
  </span>
  <span class="veranstalter">
    Organizer: Tonhalle-Gesellschaft Zürich AG
  </span>
</div>
```

We can do this by configuring the location like this:

```yaml
location:
  selector: ".member .member-name"
  node_index: 1 # This indicates that we want the second node (indexing starts at 0)
```
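
Conceptually, `node_index` just picks the n-th match of the selector. The following standalone Go sketch illustrates the idea using the [goquery](https://github.com/PuerkitoBio/goquery) library; this is only an illustration of the concept, not the crawler's actual implementation, and the use of goquery here is an assumption:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// The html snippet from above, trimmed to the relevant part.
	html := `<div class="member">
	  <span class="member-name"></span>
	  <span class="member-name"> Tonhalle-Orchester Zürich</span>
	  <span class="member-name"> Yi-Chen Lin</span>
	</div>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}

	// node_index: 1 means "take the second node that the selector matches"
	// (indexing starts at 0).
	name := doc.Find(".member .member-name").Eq(1).Text()
	fmt.Println(strings.TrimSpace(name)) // Output: Tonhalle-Orchester Zürich
}
```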

Last but not least, let's say we want to extract the time "20h00" from the following html snippet:

```html
<div class="col-sm-8 col-xs-12">
  <h3>
    Freitag, 25. Feb 2022
  </h3>

  <h2><a href="/events/924"><strong>Jacob Lee (AUS) - Verschoben</strong>
    <!--(USA)-->
  </a></h2>
  <q>Singer & Songwriter</q>

  <p><strong>+ Support</strong></p>
  <i><strong>Doors</strong> : 19h00
    /
    <strong>Show</strong>
    : 20h00
  </i>
</div>
```

This can be achieved with the following configuration:

```yaml
location:
  selector: ".col-sm-8 i"
  child_index: 3
  regex_extract:
    exp: "[0-9]{2}h[0-9]{2}"
```

Here, the selector alone is not enough to extract the desired string and we can't go further down the tree by using different selectors. With the `child_index` we can point to the exact string we want. A `child_index` of 0 would point to the first `<strong>` node, a `child_index` of 1 would point to the string containing "19h00", a `child_index` of 2 would point to the second `<strong>` node, and finally a `child_index` of 3 points to the correct string. However, the string still contains more than we need, which is why we additionally use a regular expression to extract the desired substring.
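
To make the interplay of `child_index` and `regex_extract` a bit more tangible, here is a small standalone Go sketch (again only an illustration, under the assumption of a goquery-style DOM API, not the crawler's actual code) that picks the fourth child node of the `<i>` element and then applies the regular expression:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// The html snippet from above, trimmed to the relevant part.
	html := `<div class="col-sm-8 col-xs-12">
	  <i><strong>Doors</strong> : 19h00
	  /
	  <strong>Show</strong>
	  : 20h00
	  </i>
	</div>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}

	// child_index: 3 points to the fourth child node of the selected element,
	// i.e. the text node containing ": 20h00" (child nodes include text nodes).
	raw := doc.Find(".col-sm-8 i").Contents().Eq(3).Text()

	// regex_extract then pulls the desired substring out of that text.
	re := regexp.MustCompile(`[0-9]{2}h[0-9]{2}`)
	fmt.Println(re.FindString(raw)) // Output: 20h00
}
```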

To get an even better feeling for the location configuration, check out the numerous examples in the `concerts-crawler.yml` file.

**Key: `can_be_empty`**

This key only applies to dynamic fields of type `text`. As the name suggests, if set to `true`, there won't be an error message if the value is empty.

**Key: `on_subpage`**

This key indicates that the corresponding field value should be extracted from a subpage defined in another dynamic field of type `url`. In the following example the comment field will be extracted from the subpage whose URL is the value of the dynamic field named "url".

```yaml
dynamic:
  - name: "comment"
    location:
      selector: ".qt-the-content div"
    can_be_empty: true
    on_subpage: "url"
  - name: "url"
    type: "url"
    location:
      selector: ".qt-text-shadow"
```

**Key: `type`**

A dynamic field has a type that can either be `text`, `url` or `date`. The default is `text`. In that case the string defined by the `location` is extracted and used as is as the value of the respective field. The other types are:

* `url`

  TODO

* `date`

  TODO
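
The `date` type is not documented yet, but the Kaufleuten example above already gives an idea: `layout` uses Go's reference time (`Mon Jan 2 15:04:05 MST 2006`) written in the target format, and `date_location` names an IANA time zone. As a rough, standalone Go sketch of how such a layout and location combine when parsing the `datetime` attribute value (an illustration only, not the crawler's actual code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Layout and location as in the Kaufleuten example above.
	layout := "2006-01-02T15:04:05-07:00"
	loc, err := time.LoadLocation("Europe/Berlin")
	if err != nil {
		panic(err)
	}

	// Value as found in the "datetime" attribute of ".event-meta time".
	raw := "2022-03-09T19:00:00+01:00"

	t, err := time.ParseInLocation(layout, raw, loc)
	if err != nil {
		panic(err)
	}
	fmt.Println(t) // 2022-03-09 19:00:00 +0100 CET
}
```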