Loops
Anything Goes: Loops
Examples from our ‘anything goes’ loop session on 24th of Feb 2020
Python
Wytamma presented an algorithm that uses a for loop to calculate the hamming distance of two sequences.
def hamming_distance(s1, s2):
"""
Return the Hamming distance between equal-length sequences
https://en.wikipedia.org/wiki/Hamming_distance#Algorithm_example
"""
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
return sum(el1 != el2 for el1, el2 in zip(s1, s2))
The comprehension on the final line can be written as a multiline loop as follows:
l = []
for el1, el2 in zip(s1, s2):
l.append(el1 != el2)
sum(l)
Here’s a link to a gist with more information.
Bash
Chloë presented a loop where she uses trimmomatic to trim her read adapters from her multiple fasta files.
for i in *_R1.fastq.gz #any file that ends with
do
filename=`echo $i|cut -c 1-3` #for output use a substring to just cut first 3 characters
echo $filename
trimmomatic SE -phred33 -threads 10 \
$i \
${filename}_trimmed.fastq.gz \
ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 #settings from example TruSeq, should adapt for your specific run
done
echo "Trimming reads like a baller" #because
Bash uses ‘do’ and ‘done’ like R uses parentheses {} and python uses white space :) If you want to try a bash loop there’s a tutorial here.
R
Legana suggested Chapter 21: Iterations from R for Data Science book by Garrett Grolemund and Hadley Wickham. This open ebook contains very good information about the use of R from a tidyverse philosophy.
César presented some dummy examples using for loops
in R. The aim from this first dummy example was to demonstrate R
loops syntax. In this case, given a vector x
we wanted to multiply each element by 5, and replace each element with the result from this operation.
x <- seq(1, 10, 1)
x
## Multiply each elemnt in a vector
x*5
for (i in seq_along(x)){
x[i] <- x[i]*5
print(x[i])
}
We also discussed for loops
styles. Thus, options one and two from example below are preferred over option three. You can explore different outputs in each option when using a null vector (e.g . x <- NULL
)
x <- seq(1, 10, 1)
# Option one
for (i in x){
print(x[i]*5)
}
# Option two
for (i in seq_along(x)){
print(x[i]*5)
}
# Option three
for (i in 1:length(x)){
print(x[i]*5)
}
We created a simple control flow statement using if
, else if
, and else
statements inside a for loop
. Thus, this control flow statement tell us for each element in x
if its modulo is zero when divided by 2 or 3.
x <- seq(1, 10, 1)
for (i in seq_along(x)){
y <- x[i] * 5
if (y %% 2 == 0){
print(paste(y, " is an even number because modulo is zero when divided by 2"))
}
else if (y %% 3 == 0){
print(paste(y, " modulo is zero when divided by 3, "))
}
else {
print(paste(y, " is an odd number becasue modulo is not zero"))
}
}
We discussed removing dummy variables created by for loops
(e.g. i
, x
and y
below), and using the garbage collectors gc
to emptied memory after large calculations.
x <- seq(1, 10, 1)
x
for (i in seq_along(x)){
y <- x[i] * 5
z <- sqrt(y) + 1000/3
x[i] <- z
rm(list = c('y', 'z', 'i'))
gc()
}
x
César presented an example to list all files inside a directory, and open them in the R environment (step 1). Then, these files can be merged in a list of data frames (step 2). This list of data frames can be combined into one big data frame using for loop
or inbuilt R package functions (step 3). Thus, we discussed the use of for loops
and inbuilt vectorized operations in R packages.
# Step 1
library(here)
here::here()
## Gather all file names contained inside folder 'data_raw' in a list
file.names <- list.files('dat_raw', pattern = "*.csv", full.names = TRUE)
file.names
## Take each file name, read it and save it in the environment.
### gc = garbage collection takes place. Useful when reading large objects and returning memory to OS
for (i in 1:length(file.names)){
# Split and atomize 'file.names' list, returning only each file name with extension at each iteration
start.stripped.i <- unlist(strsplit(x = file.names[i], split = 'dat_raw/'))[2]
# Return only file name without file extension
obj.name.i <- unlist(strsplit(x = start.stripped.i, split = '\\.'))[1] # escape character before. so it's not treated as a wildcard
# Read each file into environment
X.i <- read.csv(file.names[i], skip=4, header = TRUE, colClasses = c("numeric","character", "numeric", 'character',
'numeric', 'numeric', 'numeric', 'numeric',
"character", "character", "character", "numeric",
"numeric", "numeric", "character"))
# Assign the new object in the environment to its file name
assign(x = obj.name.i, value = X.i)
rm(list = c('start.stripped.i', 'obj.name.i', 'X.i', 'i'))
gc()
}
# Step 2
## Create empty list and add all files contained in folder 'tracks'
my_list <- list()
for (i in 1:length(file.names)){
start.stripped.i <- unlist(strsplit(x = file.names[i], split = 'dat_raw/'))[2]
# print(start.stripped.i)
obj.name.i <- unlist(strsplit(x = start.stripped.i, split = '\\.'))[1]
print(obj.name.i)
my_list[[obj.name.i]] <- get(obj.name.i)
#obj.name.i[,2] <- gsub('[()]', '', obj.name.i[,2])
#obj.name.i <- separate(obj.name.i, Crab_Position, into = c('x_coord', 'y_coord'), sep = ',')
rm(list = c('start.stripped.i', 'obj.name.i', 'i', obj.name.i))
gc()
}
length(my_list)
# Step 3
library(plyr)
# Create a dataframe from list
tracks <- ldply(my_list, data.frame)
We also went through César example of using a for loop
to produce several plots using ggplot2
.
# Example: Plotting using for loops
library(here)
here::here()
load(file = here::here("dat_clean", "dat.insect_w_scores.Rdata"))
load(file = here::here("dat_clean", "dat.sp_insect_scores.Rdata"))
load(file = here::here("dat_clean", "insects.Rdata"))
library(ggplot2)
insects # a vector containing all insects names in other data sets
## Iterate by each element in insects and plot
for (i in insects){
gnmds <- ggplot() +
geom_point(data = data.withscores, aes_string(x="NMDS1", y="NMDS2", color = "Species", size=i)) +
# geom_text(data = data.withscores, aes(x=NMDS1, y=NMDS2, label = Seasons), hjust =0, nudge_x = 0.05, size=2.5) +
theme(legend.position = "right",
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_rect(size = 2, fill = NA),
panel.background = element_blank())
print(gnmds)
ggsave(here::here("results", paste(i, "_nmds_bubbles.jpeg")), width = 25, height = 12, units = 'cm', dpi = 300)
Sys.sleep(1)
}