Training word2vec model fails on Fedora Linux #12

Open
steffen-stell opened this issue Feb 22, 2023 · 3 comments

@steffen-stell

Training any word2vec() model fails on Fedora 37 with the binary from the iucar/cran COPR repository. I first reported the problem there, but the maintainer made clear that it is a bug in the word2vec package. He has posted some initial insights in that issue.

@jwijffels
Contributor

jwijffels commented Feb 23, 2023

Hello, thanks for the report.

  • So these Fedora builds enable run-time bounds checks for C++ strings and containers, which is interesting in itself.
  • I didn't even know that you could pass a quanteda corpus to the function.

Questions:
I don't have a machine with Fedora. Is this reproducible on rhub, or how can I reproduce this? I only have Windows and Ubuntu.
Does this out-of-bounds error happen when you do the standard data processing provided in the word2vec package (txt_clean_word2vec)?

@steffen-stell
Author

steffen-stell commented Feb 24, 2023

Thanks for the quick response.

> I don't have a machine with Fedora. Is this reproducible on rhub or how can I reproduce this - I only have Windows and Ubuntu.

I have no experience with r-hub, so I can't tell you whether you can reproduce it there. It should be fairly easy to reproduce in a Docker container or a VM, since R on Fedora is not difficult to set up. A Dockerfile to build an image that reproduces this would look like this:

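# Fedora 37 base image, the release on which the failure was reported
FROM fedora:37
RUN dnf update -y
# Enable the iucar/cran COPR repository that ships the R package binaries
RUN dnf install 'dnf-command(copr)' -y
RUN dnf copr enable iucar/cran -y
# Install the COPR manager together with the quanteda and word2vec binaries
RUN dnf install R-CoprManager R-CRAN-quanteda R-CRAN-word2vec -y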

> I didn't even know that you could pass a quanteda corpus to the function.

The corpus class objects from quanteda are S3 objects of type character. So to any function that is not an S3 generic with a method for corpus objects, a corpus is just a character vector with a bunch of attributes. I originally encountered this problem with another corpus that was not a quanteda corpus object; I was just looking for a built-in dataset to make a quick reproducible example.

> Does this out of bounds happen when you do the standard data processing provided in the word2vec package (txt_clean_word2vec)

I've tried to do this with txt_clean_word2vec(), but it still fails.

@Draic

Draic commented Jul 29, 2024

Trying to use the word2vec package on Arch Linux also does not work. I don't know whether it fails for the same reasons as reported above.
[Screenshot: linux-crash]

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas.so.0.3;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] word2vec_0.4.0 udpipe_0.8.11 

loaded via a namespace (and not attached):
 [1] Matrix_1.7-0      miniUI_0.1.1.1    compiler_4.4.1    promises_1.3.0    Rcpp_1.0.13       stringr_1.5.1    
 [7] later_1.3.2       yaml_2.3.10       fastmap_1.2.0     lattice_0.22-6    mime_0.12         R6_2.5.1         
[13] knitr_1.48        htmlwidgets_1.6.4 profvis_0.3.8     shiny_1.8.1.1     rlang_1.1.4       cachem_1.1.0     
[19] stringi_1.8.4     httpuv_1.6.15     xfun_0.46         fs_1.6.4          pkgload_1.4.0     memoise_2.0.1    
[25] cli_3.6.3         magrittr_2.0.3    grid_4.4.1        digest_0.6.36     rstudioapi_0.16.0 xtable_1.8-4     
[31] remotes_2.5.0     devtools_2.4.5    lifecycle_1.0.4   vctrs_0.6.5       data.table_1.15.4 evaluate_0.24.0  
[37] glue_1.7.0        urlchecker_1.0.1  sessioninfo_1.2.2 pkgbuild_1.4.4    rmarkdown_2.27    purrr_1.0.2      
[43] tools_4.4.1       usethis_3.0.0     ellipsis_0.3.2    htmltools_0.5.8.1
