Allow the option to archive with a headless browser #14

Open
hellodword opened this issue Feb 24, 2022 · 16 comments
Labels
enhancement New feature or request

Comments

@hellodword
Contributor

hellodword commented Feb 24, 2022

Similar to ArchiveBox. I think ArchiveBox is very nice, but it has two issues:

  1. it is slow, which is not a big deal;
  2. custom automation for special pages (lazy loading, for example) is missing; there is an issue working on it.

I also found a great Go library, rod. How about adding a mode that uses headless (or headful, depending on the case) Chromium?

@waybackarchiver
Collaborator

That is a fantastic idea. Given the original requirement, we implemented similar features in screenshot, but it is still not what you expected.

Perhaps we can take things further and develop a piecemeal approach here.

@hellodword
Contributor Author

The biggest challenge for me is developing or choosing a script and its interpreter.

I have no prior experience with this, but rod has a good API. 😄

I will try to implement this. I really prefer this mode; handling every element by hand is too hard.

@fmartingr
Member

If this happens, I'd like the feature to be optional if it requires a lot of complex external dependencies. I'm trying to keep shiori as simple as possible: we just got rid of CGO by switching the SQLite driver, and since wasm is obsolete I have to start replacing it with Obelisk. The more seamless the experience can be, the better :)

@hellodword
Contributor Author

If this happens, I'd like the feature to be optional if it requires a lot of complex external dependencies. I'm trying to keep shiori as simple as possible: we just got rid of CGO by switching the SQLite driver, and since wasm is obsolete I have to start replacing it with Obelisk. The more seamless the experience can be, the better :)

I'm also a big fan of CGo-free builds and fewer dependencies. chromedp and rod are both based on the Chrome DevTools Protocol and need neither CGo nor tons of dependencies. 😃

@waybackarchiver
Collaborator

@fmartingr Please don't be worried about complex external dependencies. Perhaps we can look forward to the given works.

Anyway, PRs are welcome.

@hellodword
Contributor Author

Hey, I created a simple demo.

https://github.com/hellodword/web-archiving-with-headless-chromium-demo

env rod=show,bin=/path/to/chrome go run .

It's very simple, but it supports a custom post.js for hooking and a pre.js for scrolling, clicking, and so on.

It uses SingleFile for saving.
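
For readers who want to see the shape of the idea, here is a minimal Go sketch of running user-supplied scripts against a page before capturing it. It is not the demo's code: the `Page` interface, `Archive`, and `fakePage` are hypothetical names introduced for illustration, and the assumption that all scripts run before the capture is mine. A real implementation would back `Page` with a CDP client such as rod or chromedp.

```go
package main

import (
	"errors"
	"fmt"
)

// Page is a hypothetical abstraction over a browser tab. A real
// implementation would wrap a CDP client such as rod or chromedp.
type Page interface {
	Eval(js string) error  // run a script inside the page
	HTML() (string, error) // serialize the current DOM
}

// Archive runs every user script in order (assumed to happen before
// capture), then returns the serialized page.
func Archive(p Page, scripts []string) (string, error) {
	for i, js := range scripts {
		if err := p.Eval(js); err != nil {
			return "", fmt.Errorf("script %d: %w", i, err)
		}
	}
	return p.HTML()
}

// fakePage records the scripts it receives so the flow can be shown
// without launching a browser.
type fakePage struct{ ran []string }

func (f *fakePage) Eval(js string) error {
	if js == "" {
		return errors.New("empty script")
	}
	f.ran = append(f.ran, js)
	return nil
}

func (f *fakePage) HTML() (string, error) {
	return fmt.Sprintf("<html><!-- %d scripts ran --></html>", len(f.ran)), nil
}

func main() {
	p := &fakePage{}
	// An illustrative scroll script of the kind a pre.js might contain.
	out, err := Archive(p, []string{"window.scrollTo(0, document.body.scrollHeight)"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Keeping the browser behind a small interface like this would also let the scripting logic be tested without Chromium installed.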

@waybackarchiver
Collaborator

This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk is no longer required. If we stick to this plan, a new project might be a better alternative.

@fmartingr What do you think?

@hellodword
Contributor Author

hellodword commented Feb 25, 2022

it is heavily dependent on SingleFile

Right, and it's buggy in this demo. 😂

But it's optional. ArchiveBox, for example, has multiple saving modes, and SingleFile is only one of them.

What I want to show is the ability to run custom scripts, plus a highly recommended CDP library for Go, which I think is much better than chromedp.

@waybackarchiver
Collaborator

Appreciate the time and effort. Personally, I prefer the option of trying to inject the script in headless over the one implemented in the screenshot project.

Making it an option seems reasonable, so if SingleFile is added as a browser extension, I would prefer to put the git submodule in the .github/thridparty directory.

An example of archiving results using screenshot:

[image: example archiving result produced by screenshot]

@fmartingr
Member

This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk is no longer required. If we stick to this plan, a new project might be a better alternative.

@fmartingr What do you think?

I still haven't started migrating to obelisk just yet... it will be an interesting amount of work and I do not have much time to spare these weeks (and most of it is invested in replying to issues and PRs, yay FOSS! 😂).

My comment was more about the current state of shiori and some comments by our packagers regarding external dependencies or ecosystems. For me, the ideal solution is to import obelisk without much trouble and without losing the ability to cross-compile or requiring external software for the archive to work. If you want to add that to obelisk, I'd ask that it be optional for users (you can either build it with --tags XXXX or require anything else).

That said, I don't want my comments/vision to halt obelisk's progress! I'm just expressing my fears from a user perspective, not imposing anything. I haven't used any library like this in a while (and not in the Go world, anyway), so I just wanted to make sure I don't create future problems for shiori. You folks are the experts here :)

@hellodword
Contributor Author

hellodword commented Feb 26, 2022

Right, it was a demo, so I used SingleFile directly as an embedded dependency; it could, or should, act as a plugin.

I think archiving tools nowadays do not necessarily need Chromium, but they do need scripting extensibility. One reason is that there is too much anti-bot stuff (captcha, WAF, and so on) on the internet.

@waybackarchiver
Collaborator

there're too much anti-bot stuff (captcha, WAF, and so on) on the internet.

I'm interested in this somehow, so let's do it.

Related to wabarc/wayback#92

@github-actions

github-actions bot commented Jul 2, 2022

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

@Katarn

Katarn commented Aug 26, 2022

It seems to me that the ideal solution would be the ability to prepare the page for saving on the client rather than on the server, and then send it to the server.

Many sites now use dynamic image loading and captcha checks. They load comments only if you scroll the page down to them (and comments are sometimes more interesting than the article itself), and they don’t load all comments (discussion threads stay hidden until you force them open). There is a lot of dynamic behavior. Therefore, it is better to save the page after checking with your own eyes that everything needed is loaded and displayed. There is no universal solution here, so it is preferable to inspect the page yourself.

I just looked through my Pocket archive and it made me very sad: many domains are already parked and the sites are gone. And the pages themselves (even on the premium plan) are far from completely saved; sometimes they don’t even have text. So now I'm looking for a solution to this problem. I have started saving pages through SingleFile, but if you tie it to shiori, it will be just the perfect bookmark manager.

At the same time, I would like shiori not to save the text to its database (except perhaps for quick search), but to always retrieve it again from the saved page. Text content recognition algorithms will keep improving, so content stored in the database may have been recognized incorrectly and may be outdated once a new version of the application is released.

@waybackarchiver
Collaborator

@Katarn Thank you for the suggestion, it's a fantastic idea. As intended, obelisk should support both headless and non-headless modes for archiving webpages.

if you tie it to shiori, it will be just the perfect bookmark manager.

Making shiori work with obelisk is related to go-shiori/shiori#353

@github-actions

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions bot added the Stale label Dec 31, 2022
github-actions bot closed this as completed Jan 7, 2023
fmartingr changed the title from "How do you think about archiving with a real browser?" to "Allow the option to archive with a headless browser" Jan 21, 2023
fmartingr added the enhancement (New feature or request) label and removed the Stale label Jan 21, 2023
fmartingr reopened this Jan 21, 2023
4 participants