Open and Reproducible Research Cloud Workflows: A Firsthand Experience for Librarians: All in One View

Content from Cloud Computing and Open/Reproducible Research

Last updated on 2025-05-08

Overview

Questions

What is cloud computing?
What are the reasons researchers might need to use a cloud computer?
What cloud computing services might be available to researchers at an academic institution?

Objectives

Explain the concept of cloud computing
Describe three benefits of cloud computing in open and reproducible research
Identify cloud computing providers within and beyond your institution
List at least three parameters that one may need to choose when creating an instance of a cloud computer
Identify resources typically needed to utilize cloud computing
Identify advantages and disadvantages of cloud computing versus local computing when presented with various researcher scenarios

Introduction

Outside of a few specialized fields, most researchers do the bulk of their work on a laptop. When they need to work with a data set, they typically download it to that laptop and analyze it there with data analysis software.

Discussion

With the person next to you, discuss:

If a researcher came to you for advice on the following questions, what advice would you give?

“I need to analyze a data set that is too large to fit on my laptop. What should I do?”
“My analysis needs to run for many hours and I can’t keep my laptop on for that whole time.”
“My analysis would take many days or weeks to run on my laptop.”
“The tools I need to use for my analysis need to run in a Linux environment, but I only have a Windows or Mac laptop.”

Show me the solution

Some common answers might be:

“I need to analyze a data set that is too large to fit on my laptop. What should I do?”
- Break the data up and analyze it in parts
- Use a cloud computing provider (e.g. AWS, Google Cloud, Azure)
“My analysis needs to run for many hours and I can’t keep my laptop on for that whole time.”
- Use a cloud computing provider (e.g. AWS, Google Cloud, Azure)
“My analysis would take many days or weeks to run on my laptop.”
- Use a cloud computing provider (e.g. AWS, Google Cloud, Azure)
“The tools I need to use for my analysis need to run in a Linux environment, but I only have a Windows or Mac laptop.”
- Try using a virtualized environment on your laptop
- Use a cloud computing provider (e.g. AWS, Google Cloud, Azure)
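The first suggestion above, breaking the data up and analyzing it in parts, can be sketched in the shell. This is only a minimal illustration, assuming a CSV file with a header row. The file name measurements.csv, its columns, and the chunk size of 4 rows are all invented for the example; a real data set would use much larger pieces.

```shell
set -eu

# Work in a scratch directory so nothing else is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical stand-in for a large data set: a header plus 10 rows.
{
  echo "id,value"
  i=1
  while [ "$i" -le 10 ]; do
    echo "$i,$((i * 2))"
    i=$((i + 1))
  done
} > measurements.csv

# Drop the header, then split the rows into pieces of 4 lines each.
# split names the pieces chunk_aa, chunk_ab, chunk_ac, ...
tail -n +2 measurements.csv | split -l 4 - chunk_

# Analyze one piece at a time (here: sum the "value" column),
# then combine the partial results.
total=0
for piece in chunk_*; do
  partial="$(awk -F, '{ sum += $2 } END { print sum }' "$piece")"
  total=$((total + partial))
done

pieces="$(ls chunk_* | wc -l | tr -d ' ')"
echo "pieces: $pieces"
echo "total: $total"
```

With the ten sample rows, this reports three pieces and a total of 110. Only one piece is read at a time, which is the point of the approach when the full file does not fit in memory.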

What is cloud computing?

Cloud computing means using computing resources, such as processing power, storage, and software, that run on remote machines and are reached over the internet rather than on a local computer. For academic librarians who assist researchers, understanding cloud computing matters because it supports work with large data sets, complex computations, and collaborative projects. Researchers can use cloud services to run high-performance computing tasks, simulations, and analyses of large data sets without maintaining substantial local infrastructure. Because the same environment can be reached from anywhere, cloud computing also makes it easier to collaborate across institutions and geographical boundaries. Used well, it can speed up research, reduce costs, and improve the reproducibility and scalability of studies.

Options for cloud computing

Institutional infrastructure (for example, institutional cluster computing)
Public/national research infrastructure (TODO: Examples)
Commercial services: Amazon Web Services, Microsoft Azure, Google Cloud
Discipline-specific commercial services (TODO: Examples)

Discussion

Does your institution have its own cloud and/or cluster computing infrastructure?
Does your institution provide researchers with access to commercial cloud computing resources?
How do specific research projects pay for cloud services?
As a librarian, is there a cloud/cluster computing environment that you can access?

Cloud computing benefits

How does cloud computing make scientific research easier or faster?

TODO

How can cloud computing reduce costs for research?

Cloud providers typically let you choose the storage size, processing power, processor type, network bandwidth, and so on, so you can avoid paying for capacity you will not use.
If a researcher has a big computation to run, they can control costs by creating a cloud computer instance, running the analysis, and then shutting down the instance.
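As a rough illustration of why shutting an instance down matters, the arithmetic can be worked through in the shell. The hourly rate and run time below are made-up numbers chosen for easy arithmetic, not any provider's actual pricing, and the sketch ignores storage and data transfer charges.

```shell
set -eu

# Hypothetical figures, for illustration only.
rate_cents_per_hour=50     # assumed price of one instance-hour, in cents
analysis_hours=12          # assumed length of the analysis
hours_in_month=$((24 * 30))

# Cost if the instance exists only while the analysis runs.
on_demand_cents=$((rate_cents_per_hour * analysis_hours))

# Cost if the same instance is left running for a 30-day month.
always_on_cents=$((rate_cents_per_hour * hours_in_month))

# Print dollars and cents using integer arithmetic only.
printf 'run then shut down: $%d.%02d\n' $((on_demand_cents / 100)) $((on_demand_cents % 100))
printf 'left on all month:  $%d.%02d\n' $((always_on_cents / 100)) $((always_on_cents % 100))
```

Under these assumed numbers, the 12-hour run costs $6.00, while leaving the same instance on for 30 days costs $360.00.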

How can cloud computing lead to more reproducible research?

Review/define this concept here. Use Turing Way definitions.

Scalability?

TODO

Key Points

Cloud computing can make computational research easier, faster, less expensive, and/or more reproducible.
Researchers may turn to cloud computing when they need more storage space and/or processing power than their own computer provides; cloud computing can also enhance reproducibility.
Many institutions either have their own cloud infrastructure or have arrangements with commercial providers. An institutional cloud infrastructure costs money to maintain, and commercial services cost money to use.

Content from Interacting with a virtual computer

Last updated on 2025-05-08

Overview

Questions

How does a virtual computer get created?
What are different ways researchers might interact with a virtual computer?
What are some basic skills researchers use on virtual computers?

Objectives

Log in to a remote computer using ssh
Execute basic shell commands for navigating and managing files, such as cd, ls, cp, and rm
Describe a command shell and its uses
Log in to RStudio server running on a remote computer, via web browser

How do I “make” a virtual computer?

First of all, a researcher would determine whether they need their own separate virtual computer, or whether they can share space on an existing virtual computer (for example, one shared by a research group).

Assuming they decide to create their own, they might request one from their institution, or they might use an account with a commercial provider, where they have full control to set up a virtual computer that suits their needs.

How would I log on to a virtual computer?

Although cloud computers can be set up with other operating systems, such as Windows or macOS, most researchers use a Linux operating system, which comes in several distributions (for example, Ubuntu).
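One common way to connect to a Linux cloud computer is ssh, which the objectives for this episode mention. The sketch below only assembles and prints the command, so it is safe to run anywhere. The user name, host name, and key file are placeholders, not real accounts; the actual values would come from your institution or cloud provider.

```shell
set -eu

# Placeholders: replace with the details you were given.
remote_user="researcher"            # hypothetical account name
remote_host="cloud.example.org"     # hypothetical host name
key_file='~/.ssh/id_ed25519'        # assumed location of a private key

# -i selects the private key used to authenticate.
ssh_command="ssh -i $key_file $remote_user@$remote_host"

# Dry run: show the command instead of executing it.
echo "$ssh_command"
```

To connect for real, type the printed command at a terminal prompt; the shell expands the ~ to your home directory at that point.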

TODO: How to connect
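Once connected, most work happens in a command shell, a text interface where you type commands rather than click. The commands named in the objectives (cd, ls, cp, rm) behave the same way in a local Linux or macOS terminal, so the sketch below can be tried without a cloud computer. It works in a throwaway directory, and the file and folder names are invented for the example.

```shell
set -eu

# Create a scratch directory and move into it.
scratch="$(mktemp -d)"
cd "$scratch"

# Make a folder and a small file to practise on.
mkdir project
echo "sample notes" > notes.txt

# cp copies a file; the original stays where it was.
cp notes.txt project/notes_copy.txt

# cd changes the current directory; ls lists what is in it.
cd project
listing="$(ls)"
echo "project contains: $listing"

# rm deletes a file permanently (there is no recycle bin).
rm notes_copy.txt
remaining="$(ls | wc -l | tr -d ' ')"
echo "files left in project: $remaining"
```

Because rm cannot be undone, practising in a scratch directory like this one is a low-risk way to get comfortable before working with real data.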

Key Points

TODO

Content from Cloud computing data and analysis workflows

Last updated on 2025-05-08

Overview

Questions

What steps would a researcher go through to obtain public data and “put” it on a cloud computer?
What might a researcher do to work with their data that is on a cloud computer?
What might a researcher do to export their results from their cloud computer?

Objectives

Download open data from a public URL using {R code or wget/curl - TBD}
Use scp to transfer a set of files from a local computer to a remote computer
Clone a public GitHub repository onto a remote computer
Run R code to analyze data, within a remote RStudio server session

Moving data around

TODO
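Two of the transfers named in the objectives, copying local files to a remote computer with scp and cloning a public GitHub repository, might look roughly like the sketch below. The commands are only printed, not executed, and the login, folder paths, and repository URL are placeholders invented for the example.

```shell
set -eu

# Placeholders: none of these refer to real accounts or repositories.
remote="researcher@cloud.example.org"        # hypothetical login
local_dir="survey_data"                      # hypothetical local folder
remote_dir="/home/researcher/data"           # hypothetical remote folder
repo_url="https://github.com/example-org/example-analysis.git"  # hypothetical repository

# scp -r copies a folder and everything in it to the remote computer.
# This one is run on your own laptop.
scp_command="scp -r $local_dir $remote:$remote_dir"

# git clone downloads a full copy of a repository.
# This one is run on the remote computer, after logging in.
clone_command="git clone $repo_url"

# Dry run: show the commands instead of executing them.
echo "$scp_command"
echo "$clone_command"
```

Note the direction of each step: scp pushes files from the laptop to the cloud computer, while git clone is run on the cloud computer itself and pulls the repository from GitHub.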

Challenge

Downloading data

If you wanted to download data sets for 1999-2025 from X open data site, what are your options for doing so?

Solutions

Manual download
API
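A third option, between downloading by hand and using an API, is a scripted download. It only works if the site publishes one file per year at a predictable address. The sketch below assumes a made-up URL pattern (https://data.example.org/annual/YEAR.csv) standing in for the unnamed site X, and it prints the curl commands rather than running them.

```shell
set -eu

base_url="https://data.example.org/annual"   # hypothetical; site X is not named
first_year=1999
last_year=2025

count=0
year="$first_year"
while [ "$year" -le "$last_year" ]; do
  # -f fails on HTTP errors, -L follows redirects, -O keeps the remote file name.
  echo "curl -fLO $base_url/$year.csv"
  count=$((count + 1))
  year=$((year + 1))
done

echo "files to download: $count"
```

The range 1999-2025 covers 27 yearly files, which is tedious to fetch by hand but trivial for a loop. Checking the site's terms of use before scripting downloads is a sensible first step.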

TODO

Key Points

TODO

Content from Growing further in open and reproducible research skills

Last updated on 2025-05-08

Overview

Questions

FIXME
How do you write a lesson using Markdown and sandpaper?

Objectives

Describe essential skills for librarians to help researchers plan reproducible research
Identify opportunities for librarians to personally gain further experience with and employ skills for open and reproducible research

Introduction

Discussion

With the person next to you, discuss:

What are some of the skills that researchers need in order to use open and reproducible research workflows?

What skills would librarians need in order to advise researchers?

Show me the solution

Some common answers might be:

Blah

This is a lesson created via The Carpentries Workbench. It is written in Pandoc-flavored Markdown for static files and R Markdown for dynamic files that can render code into output. Please refer to the Introduction to The Carpentries Workbench for full documentation.

What you need to know is that there are three sections required for a valid Carpentries lesson:

questions are displayed at the beginning of the episode to prime the learner for the content.
objectives are the learning objectives for an episode displayed with the questions.
keypoints are displayed at the end of the episode to reinforce the objectives.

Challenge 1: Can you do it?

What is the output of this command?

R

paste("This", "new", "lesson", "looks", "good")

Output

OUTPUT

[1] "This new lesson looks good"

Challenge 2: how do you nest solutions within challenge blocks?

Show me the solution

You can add a line with at least three colons and a solution tag.

Figures

You can use standard markdown for static figures with the following syntax:

![optional caption that appears below the figure](figure url){alt='alt text for accessibility purposes'}

You belong in The Carpentries!

Callout

Callout sections can highlight information.

They are sometimes used to emphasise particularly important points but are also used in some lessons to present “asides”: content that is not central to the narrative of the lesson, e.g. by providing the answer to a commonly-asked question.

Math

One of our episodes contains $\LaTeX$ equations when describing how to create dynamic reports with {knitr}, so we now use MathJax to render them:

$\alpha = \dfrac{1}{(1 - \beta)^2}$ becomes: $\alpha = \dfrac{1}{(1 - \beta)^2}$

Cool, right?

Key Points

Use .md files for episodes when you want static content
Use .Rmd files for episodes when you need to generate output
Run sandpaper::check_lesson() to identify any issues with your lesson
Run sandpaper::build_lesson() to preview your lesson locally