Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy the inflammation dataset #5

Merged
merged 24 commits into from
Jul 4, 2018
Merged

Tidy the inflammation dataset #5

merged 24 commits into from
Jul 4, 2018

Conversation

katrinleinweber
Copy link

@katrinleinweber katrinleinweber commented Jun 20, 2018

See #3. I feel between a rock and a hard place here, about which way to go for the 12. > afternoon "Community-standard data formats, tidy data, data packages":

  1. R-eco…dplyr or r-social…tidyr are mature & highlight the important aspects :-)

  2. The PANGAEA remix is close to our hearts, but touches only a few aspects. Community-standard variable names are already aligned, so only a few lines of gather() & use_data() could be appended there.

  3. The inflammation dataset can easily be tidied on the technical level, but how does one find community-standard variable names? patient_ID & inflammation_score seem logical, but are they in any ontology?

Ideas for this episode

  • convert reshape2::melt to tidyr::gather()
  • include raw data as files & result data as .rda
  • include all 12 files, but then how to label patients consecutively (1-60, 61-120, etc.)?

@katrinleinweber katrinleinweber added the enhancement New feature or request label Jun 20, 2018
@katrinleinweber katrinleinweber self-assigned this Jun 20, 2018
Copy link

@lwjohnst86 lwjohnst86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the "WIP" tag, I won't comment on the completeness. Seems good so far. Maybe I'm not understanding, but what exactly are you unsure about doing? I think covering wide to long etc conversion is a very powerful tool, so I agree with covering these.

teaching: 30
exercises: 10
questions:
- "Which different forms can one dataset have?"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"What are possible forms that a dataset can have?"

exercises: 10
questions:
- "Which different forms can one dataset have?"
- "Which advantages and disadvantages do these forms have?"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which -> what

- "Which features make a dataset more or less reusable?"
- "How can we add datasets to R packages?"
objectives:
- "Use `tidyr::gather()` to convert wide data to its long form."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This objective doesn't relate to item 4 of questions. Maybe add to it saying "and adding to package"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More objectives will be added. I have a task list in my staging area ;-)

objectives:
- "Use `tidyr::gather()` to convert wide data to its long form."
keypoints:
- "Excel & co. incentivise the wide data format, which may spread variables across columns."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Spreadsheets incentivise ..."


```{r inflammation-wide}
dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
colnames(dat); rownames(dat)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this ; been introduced? I tend to prefer one line per command for such short lines of code

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I thought it was necessary to get both outputs into the knit-ed version. But it turns out it isn't, so we'll go with your preference :-)


```{r inflammation-label}
patient_ID <- paste("patient", sep = "_", seq(60))
dat_labelled <- cbind(dat, patient_ID)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we use dplyr/tidyr, why not also use bind_cols instead of cbind?

Copy link
Author

@katrinleinweber katrinleinweber Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about that, thanks for the hint! In this case, though: Error in cbind_all(x) : Argument 2 must have names. So, I think we can either stick to base::cbind or upgrade the example with names() or something like that? What do you think?

Please feel free to add it. I tried names(patient_ID) <- "patient_ID" and then dplyr::bind_cols(dat, patient_ID) but still get the error.

@@ -0,0 +1,73 @@
---
title: "Tidying & packaging datasets"
teaching: 30

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think it will take 30 min to cover this? Seems a bit much for so little code. Maybe also include spread? And using the gather(..., -variable_name) minus ability to exclude a column from the gathering.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP ;-) I didn't adjust the number after copying from a template.

But yes, spread and more about gather is a good idea!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't include more about this after all, because we can re-use so many other resources if people would like to learn about this. For the context of this inflammation dataset, I mostly want to get the column naming and tidying across.

Also, I would include git commiting the steps in between or in the end, so I think 30min is realistic after all.

@katrinleinweber
Copy link
Author

Thanks for your comments above, @lwjohnst86 :-)

what exactly are you unsure about doing?

About which of the 3 options could be the most effective learning experience. 3. is nearly complete, but that doesn't mean we have to use.

@katrinleinweber katrinleinweber changed the title WIP: Tidy the inflammation dataset Tidy the inflammation dataset Jun 29, 2018
katrinleinweber pushed a commit that referenced this pull request Jul 1, 2018
Also move it to where the tidy data episode (see #5) will need it to display the same code as learners need for packaging the dataset.
Copy link

@lwjohnst86 lwjohnst86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!


Just as a good function documentation makes your code more accessible and reusable
for others and your future self, datasets benefit from a documentation as well.
Please read through [r-pkgs.had.co.nz/data.html#documenting-data](http://r-pkgs.had.co.nz/data.html#documenting-data)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They'll do this on their own I assume?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 5min break for us.

> > ~~~
> > #' Inflammation In Patients During Study...
> > #'
> > #' @source Pre-registration: \url{http://wwww.alltrials.net/study...}.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use markdown syntax in the Roxygen documentation? Use can use it by running usethis::use_roxygen_md().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in responding! GitHub didn't notify me about your comments :-(

> > #' \item{patient_ID}{A factor prepresenting the different patients}
> > #' \item{day}{Number of days after start of the study}
> > #' \item{inflammatory_response}{Measured daily as described in the methods section of ...}
> > #' }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a \describe alternative in markdown, so this would stay if you did use markdown

Copy link
Author

@katrinleinweber katrinleinweber Jul 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this, and the mention of "LaTeX-like syntax" in an earlier episode, I'd keep it like this first, but I added a note below this: 5b9deca.

>
> > ## Solutions
> >
> > 1. Run `roxygen2::roxygenise()`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or devtools::document()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's not mentioned anywere else in the lesson anymore, I'd refer learners to ??roxygenise in case of problems. There, they'll find a note about when to use devtools::document().

@katrinleinweber
Copy link
Author

I'm merging this as it is now, because we have to freeze the material at some point I'm going to do it now, and leave #6 to a second iteration of this workshop.

@katrinleinweber katrinleinweber merged commit be6b96b into gh-pages Jul 4, 2018
@katrinleinweber katrinleinweber deleted the 3-tidy-data branch July 4, 2018 14:31
katrinleinweber added a commit that referenced this pull request Jul 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants