This is a classic titanic post, because (as so often) I don't really know what I'm talking about. Through the lockdown I made a half-hearted attempt to learn R, the leading statistical programme. As usual, what caught my eye was an ancient Greek.
Specifically, Eratosthenes of Cyrene. Eratosthenes is one of the great figures of Hellenistic Alexandria. He was head of the library there, the highest intellectual position of the day. He's best remembered today for estimating the circumference of the earth to an astounding degree of accuracy, but he was also a polymath interested in literature, music, and, as we'll see, mathematics. He was called 'Beta' by his peers, not because he was on Reddit, but because he was the second best at everything.
Not too far into my studies in R, I came across Eratosthenes' sieve, a simple algorithm that's often used as an exercise for coding students. What follows is an attempt to unpack one way of coding Eratosthenes' sieve in R, with some help from a friend and some code I found online. I struggled with it a bit myself, which should hopefully put me in the ideal position to explain, since I can remember what confused me (everything), and nothing about it seems obvious to me.
I'll start with the sieve itself. The point of it is to find all the prime numbers up to a certain point (let's say, up to 100). Starting with 2, what you do is to move through the rest of the numbers crossing out all of the multiples of 2 (since if they're multiples of any other number higher than 1 they can't be prime). Then you do the same with 3, 4, and so on, until you can't do it anymore, since all the multiples of the number you started with have already been crossed out (as multiples of an earlier number). Then you can look at all the numbers that haven't been crossed out. Those are your primes.
I guess it makes sense as an exercise for beginner coders because it's a pretty easy to understand algorithm, a set of steps for a brain or computer to work through to get a set of outputs (here, prime numbers).
So - as previously advertized - below is one way of getting the R programme to become, for a thrilling and infinitely repeatable moment, the head librarian of Alexandria. After the code itself I'll break it all up into bits, accompanying each bit with some comments that hopefully a) are correct and b) help you understand what's going on. Code is bold, the commentary (a highly Alexandrian form) not. The commentary is talking to R and telling it what to do (imperative mood).
sieve <- function(n) {
if (n < 2) return(NULL)
a <- rep(T, n)
a[1] <- F
for(i in seq(n)) {
if (a[i]) {
j <- i * i
if (j > n) return(which(a))
a[seq(j, n, by=i)] <- F
}
}
}
sieve <- function(n) {
Make "sieve" a function that does the below to a number we'll call "n." (This just names our algorithm and makes it a function, a way of doing things to what's in the brackets that follow).
a <- rep(T, n)
If n is smaller than 2 then spit out NULL. (In other words, refuse to do this computation if it's on a number of numbers less than 2).
Make a vector called "a" and store in it n repetitions, all marked "true." (T stands for 'true.' This is like writing out the number in the grid above. Saying they're all true effectively means we're starting with the assumption that they're all prime.)
a[1] <- F
Store the first item as false in "a." (Because we want to get rid of 1 immediately?)
for(i in seq(n)) {
Start with i = 1, iterating the code between here and the closing }, incrementing i by one each time, until i = seq(n). (Note the opening curly bracket. The code within the for loop, contained between the curly brackets, will run a number of times equal to the number of elements in seq(n), starting with i = 1, and with the value of i increasing by one each time it runs.)
if (a[i]) {
If a[i] is true, the code within the ensuing curly brackets will execute. If a[i] is false, it won't. (So if an individual number is false, it won't run the code. At this point that's just 1, as specified above, so I think this just stops R running this algorithm with the number 1. That's actually important, because if it did it would cross off all the numbers and we wouldn't get any primes; another way of looking at this is that being a multiple of 1 doesn't mean a number isn't a prime, and the algorithm needs to recognize this.)
j <- i * I
This finds the square of i and stores it in variable j. Remember we're in the for loop, with i incrementing each time the loop repeats. The first time round i = 1 and j = 1; the second, i = 2 and j = 4; the third, i = 3 and j = 9; and so on.
if (j > n)
If (and only if) the square of i is greater than the number of elements in the entire sequence...(i.e. that number times itself is larger than e.g. 100...)
return(which(a))
Return the sequential positions of all those numbers in a which are true (T). (The idea here is that if we've reach a number whose square is greater then the number of numbers in the sequence, e.g. 100, then we've found all the primes already. I guess this is just a separate assumption that happens to be true?)
a[seq(j, n, by=i)] <- F
Mark as F (false, i.e. not prime - in other words, cross out) all the numbers in a between the jth and nth (the nth being the last) that are divisible by (that's what 'by' does) i (and thus aren't prime). This line is the heart and soul of the code, the steely essence of Beta's colander.
OK, I'm still not sure I understand all of that, so if you want to try to help out in the comments be my guest.
No comments:
Post a Comment