# How a RAM disk saved my life

## I don’t want to duel that much

In the last few weeks, I have been risking my life in the corridors at work. Luckily, I have fit coworkers with a good sense of balance. Here is a picture of the horrible situations I’ve found myself in:

(props to xkcd. It’s the greatest webcomic in the universe. read it regularly)

Some of the code I’m dealing with, though, is not bound by the CPU speed or the number of cores I can throw at it. In particular, the ant distpack-opt task of the Scala compiler build, which compiles and packages the documentation, is very much limited by storage transfer speeds. I have state-of-the-art SSDs on some of my machines, yet I still found myself frustrated, waiting in front of the machine way too often.

To be fair, as far as I’m concerned, once is too often. I’m not so good when standing on a rolling chair.

## The RAM disk solution and its imperfections

The obvious solution, then, was to use a RAM disk: insane transfer speeds would definitely help me, right? Except RAM disks have two well-known problems:

• a RAM disk is, fundamentally, a piece of storage — a partition — carved out of your existing RAM. Assuming you have enough of it, it’s very fun until you have no more power. Every time you shut down your machine, you need to transfer the content of your RAM to your hard drive. Every time there is a sudden power outage, you basically lose the whole partition.

• moreover, a RAM partition is inherently limited in size. More interestingly, that’s a hard limit. What happens when you overflow the size of that partition is — at least on Linux — a nightmare: because the partition is treated like a fixed storage device, not a piece of memory (that could overflow into swap), it basically blows up in your face when full.

## The solutions on a modern Arch

Thankfully, the tools to manage those two problems have come to maturity. At the time of this writing, my machines are running a Linux 3.8rc7 kernel on ArchLinux transitioned to systemd, and running RAM disks in that kind of environment is a breeze.

• Ulatencyd (ArchLinux doc) is a daemon that uses cgroups, a recent-ish feature of the Linux kernel that lets you constrain the amount of resources allocated to a process. It watches over your processes, and dynamically gives a fair amount of resources to each of them. Every second, it looks for memory pressure, and either relieves it if possible, or kills the guilty party if necessary. We have all had to face the swap of death, that frightening moment when a memory-leaking process overflows onto the disk at a scary rate, slowing down — nay, taking down — your entire system. Here, Ulatencyd helps you separate the processes on your system into cgroups that are resized dynamically, and that handle memory situations in a fashion that is simple and graceful by default, but amazingly configurable.

The ArchLinux install is amazingly simple. Install the AUR package, and activate it the systemd way (systemctl enable ulatencyd.service) after (optionally) editing the configuration (/etc/ulatencyd/ulatencyd.conf): the default maximal amount of physical memory given to a process is 70%, and since I have 16GB of RAM on all my machines (with a RAM disk moving between 3 and 5 GB), this limit is pretty safe.

An added benefit? In the middle of a long compilation, my machine remains entirely responsive. I may be waiting for the compiler, but I can still surf the Internet in the meantime.

• Anything-sync-daemon (ArchLinux doc) is a small daemon — in fact, a small bash script — that aims at hiding the day-to-day management of a RAM disk from the user. It uses rsync to synchronize the contents of your RAM disk to disk periodically, alleviating the risks coming from the need for your RAM to be powered at all times. It creates the partitions in your RAM and fills them up at boot time, and shelves them at shutdown.

Again, the ArchLinux install is insanely simple. There is an AUR package, the activation is trivial with systemd (systemctl enable asd.service), and the configuration is also easy. I simply edited /etc/asd.conf and put the following directories under RAMDISK:

WHATTOSYNC=('/home/huitseeker/Scala' '/home/huitseeker/.eclipse' '/home/huitseeker/.m2' '/usr/share/eclipse' '/home/huitseeker/workspace' '/home/huitseeker/junit-workspace' '/usr/lib/jvm' '/home/huitseeker/runtime-EclipseApplicationwithEquinoxWeaving' '/home/huitseeker/.sbt' '/home/huitseeker/.ivy2')

This is more than you need to simply compile Scala. As you probably noticed, it puts most of my JVM development environment (Eclipse, sbt, the Maven and Ivy caches, and the JVM itself) in RAM.

## The Results

For the distpack-opt target of the Scala compiler, I have tried this method with :

• a slow, 2010-top-of-the-line Lynnfield core iMac (configuration) with a very well-used, slow hard drive.

• a reasonable X220 Lenovo Thinkpad (config) with an i5-2540M and an SSD.

Both configurations run on full-disk-encrypted storage using LUKS (AVX-accelerated, though), which explains the bad baseline you’re going to see for this disk-bound compilation.

• The slow iMac went from 124 minutes for compilation to 27 minutes.
• The Lenovo went from 24 minutes to 8 minutes 30 seconds.

Yay! No more risking my life fencing in corridors!

# ensime.org is down

But there’s a mirror on ibiblio.org! In other words, if you encounter the following error:

[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::


and don’t have the time to fix it the correct way, just add the following resolver to your plugins/build.sbt:

resolvers += "iBlio Maven Repository" at "http://mirrors.ibiblio.org/maven2/"


# Every single freshman CS question in ~80 lines of Scala

This post is about writing, in Scala, a single program that returns the stream representing any recursive sequence of order $k$, whose $i$th term is $a_i = f(a_{i-1}, a_{i-2}, \ldots, a_{i-k})$, for all $k$. It explains things very slowly: if the meaning of the previous sentence is already 100% clear to you, you can easily skip the next two paragraphs.

All the code in this post can be found as a gist here.

# Recursive sequences and their order

Most first-year computer science questions aim at teaching students recursion. Hence, they involve a flurry of questions about defining recursive sequences of varied orders.

What’s a recursive sequence? It’s a sequence whose value at any rank $n$ is defined by a function of the values at ranks $(n-1) .. (n-k)$, and by the initial values at ranks $1 … k$. $k$ is called the order. For example, the $n$th term $S_n$ of a Syracuse sequence - a recursive sequence of order 1 - is given by:

• $S_n = S_{n-1}/2$ if $S_{n-1}$ is even
• $S_n = 3 S_{n-1} + 1$ if $S_{n-1}$ is odd

Of course, you have to choose an initial term to define the sequence properly, but a famous conjecture is precisely that for any initial term, a Syracuse sequence - i.e. a member of the class of sequences that verify this recurrence relation - converges towards the cycle (1,4,2).
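For concreteness, here is a minimal Scala sketch (the name `syracuse` is mine) that unfolds a Syracuse sequence from a given initial term with `Stream.iterate`:

```scala
// A Syracuse sequence as an infinite Stream: each term is derived
// from the parity of the previous one, per the recurrence above.
def syracuse(s0: Int): Stream[Int] =
  Stream.iterate(s0)(s => if (s % 2 == 0) s / 2 else 3 * s + 1)

// Starting from 6: 6, 3, 10, 5, 16, 8, 4, 2, 1, then it cycles through 4, 2, 1.
```

For instance, `syracuse(6).take(9).toList` yields `List(6, 3, 10, 5, 16, 8, 4, 2, 1)`.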

Another example is the Fibonacci sequence, a recursive sequence of order 2:

• $F_n = F_{n-1}+F_{n-2}$ if n ≥ 2
• $F_1 = 1$
• $F_0 = 0$

Naturally, basic examples about recursion start with sequences of order one, and then move on to sequences of order two, usually with Fibonacci. The student is then asked to notice that the naive way of defining the Fibonacci sequence leads to an exponential number of calls, which allows the teacher to introduce notions such as memoization, or dynamic programming.
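To make that point concrete, here is a sketch (the names `fibNaive` and `fibMemo` are mine) contrasting the naive definition with a memoized one:

```scala
import scala.collection.mutable

// Naive recursion: the number of calls grows exponentially with n.
def fibNaive(n: Int): BigInt =
  if (n < 2) BigInt(n) else fibNaive(n - 1) + fibNaive(n - 2)

// Memoization: each rank is computed once, then looked up in a cache.
val cache = mutable.Map.empty[Int, BigInt]
def fibMemo(n: Int): BigInt = cache.get(n) match {
  case Some(v) => v
  case None =>
    val v = if (n < 2) BigInt(n) else fibMemo(n - 1) + fibMemo(n - 2)
    cache(n) = v
    v
}
```

`fibMemo(50)` answers instantly, where `fibNaive(50)` would make on the order of $2^{50}$ calls.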

# Generic recursive sequences, as streams

A short while later, the student realizes that for a fixed $k$, there is a generic way of computing the $n$th term of a recursive sequence of order $k$ in linear time. Here it is for $k = 2$:

def recTwo[A](f : A ⇒ A ⇒ A) (x:A) (y:A) (n:Int) : A =
  n match {
    case 0 ⇒ x
    case 1 ⇒ y
    case _ ⇒ recTwo (f) (y) (f (x) (y)) (n-1)
  }


We abstract over the function and initial elements here: all you need to do to define the Fibonacci sequence is invoke recTwo ((x:Int) ⇒ (y:Int) ⇒ x + y)(0)(1). The crucial thing that makes the computation of the $n$th term linear is that you are shifting the whole sequence by one rank at each call, or, equivalently, saying that the $n$th term of the sequence with initial terms $(x,y)$ is the $(n-1)$th term of the sequence with the same recurrence relation $f$ and initial terms $(y, f(x,y))$. In effect, it’s a witty memoization trick that reuses the initial-term arguments passed to recTwo.
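As a quick check (a sketch, with the unicode arrows written `=>` so it runs as-is; `fib10` is my name), this indeed yields Fibonacci numbers:

```scala
// Same definition as above, with ASCII arrows.
def recTwo[A](f: A => A => A)(x: A)(y: A)(n: Int): A =
  n match {
    case 0 => x
    case 1 => y
    case _ => recTwo(f)(y)(f(x)(y))(n - 1)
  }

// The 10th Fibonacci number, computed in linear time:
val fib10 = recTwo((x: Int) => (y: Int) => x + y)(0)(1)(10)  // 55
```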

So far, so good. Now, it turns out that it’s not too difficult to reuse the same reasoning to upgrade this function to one returning the Stream of the elements of the recursive sequence. It even makes the whole business simpler, since we don’t have to deal with $n$:

import Stream._

def recStreamTwo[A](f : A ⇒ A ⇒ A) (x:A) (y:A) : Stream[A] =
  x #:: recStreamTwo (f) (y) (f(x)(y))


Then to get the first 10 terms of the Fibonacci sequence:

  recStreamTwo ((x:Int) ⇒ (y:Int) ⇒ x + y)(0)(1) take 10 print


Easy, right? In fact, you can even apply the pattern to a sequence of any order, and define:

def recStreamK[A](f : A ⇒ A ⇒ ... ⇒ A)
                 (x1:A) ... (xk:A) : Stream[A] =
  x1 #:: recStreamK (f) (x2) (x3) ... (xk) (f(x1)(x2) ... (xk))


Good. Now, it turns out that this pattern solves most of the first-year CS questions about defining recursive sequences of a given order. But we still have to write a slightly different function, with a slightly different number of parameters, every time the order $k$ changes.
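For instance, instantiating the pattern at order 3 gives (a sketch; `recStreamThree` and the Tribonacci example are mine):

```scala
// The k = 3 instance of the pattern: shift the window by one rank per call.
def recStreamThree[A](f: A => A => A => A)(x: A)(y: A)(z: A): Stream[A] =
  x #:: recStreamThree(f)(y)(z)(f(x)(y)(z))

// Tribonacci, where each term is the sum of the previous three:
val trib = recStreamThree((x: Int) => (y: Int) => (z: Int) => x + y + z)(0)(1)(1)
// trib.take(7).toList == List(0, 1, 1, 2, 4, 7, 13)
```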

# Generalizing over the order

## The problem

My question is: can we generalize over the order? That is, can we write a single function that, for all $k$, allows us to generate the stream of elements $U_n$, given $a_1, … a_k$ and a function $f$ such that $U_n = f(U_{n-1}, U_{n-2}, … U_{n-k})$?

The instinctive answer is no, at least within the bounds of the type system, since the type of the argument f in recStreamK is fixed by the order of the recursive sequence, which is also the arity of the corresponding recurrence relation defined by f. The common perception would have it that to generalize over the recursive order we need:

• a macro language
• initial arguments passed as a list, or a stream
• type casts, or an equivalent way to bypass the type system
• dependent types

All those solutions are interesting. In fact, some were proposed in a recent Stackoverflow.com question. I learnt a lot from that question, and upvoted accordingly.

But in Scala, we don’t necessarily need to resort to that. It turns out that with Scala’s overlapping implicits, we can define a single constructor that will automatically generate the appropriate stream for any order.

In a type-safe manner.

Here’s how.

## A small proviso

More exactly, “here’s how”, with a temporary proviso on tuples (that we will remove at the end). In Scala, the usual fixed-size tuple of width k is not an instance of a recursively-generated type built from nested pairs: it is a bespoke constant type Tuplek (for a fixed k between 2 and 22), with little structural relation to Tuplek-1 or Tuplek+1. In particular, there is no recursive relation that allows me to consider something structurally akin to a function of type Tuplek ⇒ Tuplek+1 for any given k. As we will see, having some sort of recursive structure recognizable within a fixed-width tuple type is crucial to what we are trying to do, so we will temporarily forgo Scala’s native tuples, and work with our very own tuples, defined in the following manner:

• the singleton of type T is simply T.
• for any tuple type U of a given width k, its successor, the tuple of width k+1, is the type Pair [T,U](1)

Forget everything you know about Scala’s Tuple5 for a minute: when we think (1,2,3,4,5), what we are really going to write is (1,(2,(3,(4,5)))). The same goes for Tuple2 ...Tuple22. In a later, separate step, we will come back to how to apply this to regular arguments.
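Concretely, here is a small sketch of the encoding in plain Scala (the value `t` and the hand-rolled `sum` are mine, only there to show that the nesting exposes a recursive Pair structure):

```scala
// A 5-tuple in the nested-pair encoding: (1,(2,(3,(4,5))))
val t: (Int, (Int, (Int, (Int, Int)))) = (1, (2, (3, (4, 5))))

// Each ._2 peels off one Pair, mirroring the recursive type structure.
val sum = t._1 + t._2._1 + t._2._2._1 + t._2._2._2._1 + t._2._2._2._2  // 15
```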

So, let’s recapitulate: we want a generic function that given two arguments:

• a function taking any k-tuple argument f : Pair[A,Pair[A, ... Pair[A,A]...]] ⇒ A,
• and a corresponding k-tuple of initial elements (a1,(a2,...(ak-1,ak)...)),

returns the Stream[A] consisting of the recursive sequence of order k defined by the recurrence relation given by f and the initial elements a1, ... ak.

We want that single program to work for any k, but we want it to be type-safe: if we receive a function taking k arguments as above, but only k-1 or k+1 initial elements, things should “break” at type-checking.

## Overlapping implicits

As per Fighting Bit Rot With Types:

The foundations of Scala’s implicits are quite simple. A method’s last argument list may be marked as implicit. If such an implicit argument list is omitted at a call site, the compiler will, for each missing implicit argument, search for the implicit value with the most specific type that conforms to the type of that argument. For a value to be eligible, it must have been marked with the implicit keyword, and it must be in the implicit scope at the call site.

As it turns out, those implicit arguments form implicitly-passed dictionaries, and can carry transitive constraints: a method can define an implicit (with the implicit prefix keyword), and itself require implicit arguments - ensuring the propagation of constraints. These two mechanisms:

• automatic instantiation
• constraint propagation

make Scala’s implicits a flavor of type classes. The feature we are going to be mostly interested in is the oft-overlooked overlap resolution between implicits: when two implicits are in scope, Scala follows the overloading resolution rules (Scala ref §6.26.3), and considers the lowest subclass in the subtyping lattice. Which means that if we want to direct implicit instance resolution, we can simply define the alternatives to be tried first in subtypes, and the fallback cases in supertypes. There are examples in several papers.
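Here is a minimal, self-contained sketch of that prioritization trick (all names are mine, not part of the code that follows): the generic fallback instance lives in a plain superclass, the more specific instance in its companion-object subclass, so the specific one wins overload resolution whenever both would apply:

```scala
trait Show[A] { def show(a: A): String }

// Low priority: a generic fallback, considered only through inheritance.
class LowPriorityShow {
  implicit def anyShow[A]: Show[A] = new Show[A] {
    def show(a: A) = "some value"
  }
}

// High priority: defined in a subclass of the fallback's home, so
// overloading resolution prefers it when the types allow.
object Show extends LowPriorityShow {
  implicit val intShow: Show[Int] = new Show[Int] {
    def show(a: Int) = "Int(" + a + ")"
  }
}

def describe[A](a: A)(implicit s: Show[A]): String = s.show(a)

// describe(3) resolves to intShow; describe("hello") falls back to anyShow.
```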

## Recursive Implicits

Let’s consider the stream equation at rank $k$ as we have outlined it in the pattern above:

def recStreamK[A](f : A ⇒ A ⇒ ... ⇒ A)
                 (x1:A) ... (xk:A) : Stream[A] =
  x1 #:: recStreamK (f) (x2) (x3) ... (xk) (f(x1)(x2) ... (xk))


If you consider $f(a1)$ as a higher-order function of arity k-1, the last argument of the recursive call can be made to depend only on the k-1 arguments a2, ..., ak. This is important, because it will let us define this as the return of some recursive function at order $(k-1)$.

Thus we want to define an intermediary function:

prod (f) (a1, ..., ak) = (a1, ..., ak, f(a1, ..., ak))


This lets us define the stream rstream f taking a tuple of width k recursively as a function of some prod (f(a1)) taking a tuple of rank k-1:

rstream (f) (a1, ..., ak) = a1 #:: rstream (f) (a2, ..., ak, f(a1, ..., ak))
                          = a1 #:: rstream (f) (prod (f(a1)) (a2, ..., ak))


The next stage involves defining an implicit trait that defines methods rstream and prod for every tuple, by recognizing the Pair-directed structure of the recursive function’s argument. Since we have a recursive, structural type of tuple, this just involves a relatively simple manipulation:

trait RecStream[S,Base] {
  def prod : (S ⇒ Base) ⇒ S ⇒ Pair[Base, S]
  def rstream : (S ⇒ Base) ⇒ S ⇒ Stream[Base]
}

class RecStreamDefault {
  implicit def ZeroRS[S] = new RecStream[S,S] {
    def prod = xs ⇒ ss ⇒ (ss, xs (ss))
    def rstream = xs ⇒ ss ⇒ ss #:: rstream (xs) (xs(ss))
  }
}

object RecStream extends RecStreamDefault {
  def apply[S,B](f:S ⇒ B)(implicit rs:RecStream[S,B]) : S ⇒ Stream[B] =
    rs.rstream(f)

  implicit def SuccRS[R,B] (implicit rs:RecStream[R,B]) =
    new RecStream[Pair[B,R],B] {
      def prod = f ⇒ a ⇒
        (a._1, rs.prod (y ⇒ f (a._1,y)) (a._2))
      def rstream = f ⇒ a ⇒
        a._1 #:: rstream (f) (rs.prod (y ⇒ f (a._1,y)) (a._2))
    }
}


This is the core of the trick: we already have something that works for any k, and returns the appropriate recursive stream. Here are examples for orders 1 and 2:

def increaser : Int ⇒ Stream[Int] =
  RecStream((x:Int) ⇒ x + 1)

def fibonacci : Pair[Int,Int] ⇒ Stream[Int] =
  RecStream(x ⇒ x._1 + x._2)

fibonacci(0,1).take(10).print
increaser(0).take(10).print


We are type-safe: if we attempt to write fibonacci(0).take(10), we get:

recstream.scala:104: error: type mismatch;
found   : Int(0)
required: (Int, Int)
fibonacci(0).take(10)
^
one error found


The only problem is that if we really want to extend this to some k ≥ 3, then we can’t use a function such as (x:Int) ⇒ (y:Int) ⇒ (z:Int) ⇒ x + y + z. We have to shape it according to our bespoke nested-pairs tuple type, and write the painful x ⇒ x._1 + x._2._1 + x._2._2. Can’t we solve that problem too, and use regular curried functions instead?

## Currying and Uncurrying

Currying is a fancy name for transforming a function that takes a pair as an argument, of type $A × B → C$, into a higher-order function that takes a single argument, and returns the function that takes the second argument before returning a result computed with both. That latter function has, of course, the type $A → (B → C)$. Since arrows are right-associative, the parentheses in the latter expression are optional.

Naturally, uncurrying is the reverse transformation, from the higher-order function to the function taking a pair. To do this with first-order types, you just have to write the following couple of functions:

def curry[A,B,C](f : Pair[A,B] ⇒ C) =
  (x:A) ⇒ (y:B) ⇒ f (x,y)

def uncurry[A,B,C](f : A ⇒ B ⇒ C) =
  (x:Pair[A,B]) ⇒ f (x._1) (x._2)


Now, this is all well and good if you are dealing with ground types, but what if you are dealing, as we are, with nested pairs? How do we apply the transformation recursively to our tuples?

In fact, we can resort to the same trick of prioritized implicits, leaving it to Scala to thread the construction of implicit currying functions through the unification process involved in type inference. Here is the code for recursive currying using prioritized overlapping implicits:

trait Curried[S,R] {
  type Fun
  def curry : (S ⇒ R) ⇒ Fun
}

class CurriedDefault {
  implicit def ZeroC[S,R] = new Curried[S,R] {
    type Fun = S ⇒ R
    def curry = f ⇒ f
  }
}

object Curried extends CurriedDefault {
  def apply[S,R] (f:S ⇒ R) (implicit c:Curried[S,R]) : c.Fun =
    c.curry(f)

  implicit def SuccC[T,S,R] (implicit c:Curried[S,R]) =
    new Curried[Pair[T,S],R] {
      type Fun = T ⇒ c.Fun
      def curry = f ⇒ x ⇒ c.curry(y ⇒ f(x,y))
    }
}


I leave the code for recursive uncurrying to the reader as an exercise. In fact, after looking for it by herself if it suits her, said reader can find the solution as a gist here (along with all the code presented in this post).

Once this is defined, the definition of our generic recursive arbitrary-order streams is a breeze:

def sumthree (x:Int)(y:Int)(z:Int) = x+y+z
def sumthreestream = Curried(RecStream(Uncurried(sumthree _)))


The client function call is (using java-style syntax to emphasize that I am passing arguments one by one):

sumthreestream(0)(0)(1).take(10).print


Not convinced about the type yet? Compiling this with -Xprint:typer yields:

def sumthreestream: Int => Int => Int => Stream[Int]


Isn’t Scala wonderful?

# Further work

I suspect several improvements using prioritized implicits are still possible, including:

• avoiding our ugly kludge of nested pair types entirely
• streamlining the use of currying and uncurrying through views

Don’t hesitate to suggest improvements in the comments!

1. Note that we implicitly forget about heterogeneity, since we won’t need it for our particular application.

# Timeline & stories with Facebook

I’ve recently had a look at Facebook’s new feature, Timeline, and published mine in the process. With it, you can organize your news and shares along a temporal line design that has already been called bloglike in some places. I think it’s a fantastic feature. Its tagline is: Tell your life story with a new kind of profile.

Great. But there’s more stories to tell, and Facebook is getting oh so close to letting us tell them. I wish it would inch a little further.

But before I get into what’s missing, let’s look at why this matters. Malcolm Gladwell has looked in depth at how trends go viral in a famous 2000 book called The Tipping Point. In it, he explains the profile of exactly who helps people adopt those trends. Indeed, those connectors follow some sort of Pareto principle, where 20 percent of the people do 80 percent of the information-exchange work necessary for people to adopt a trend and make it go viral. This means that giving those few key people the means to express themselves matters, even if they only make up a fraction of the general population.

Basically, a trend extends its reach through the actions of connectors, who embody a hub of contacts reaching across different social groups; salesmen, with the charisma to make us believe their story; and mavens, who are information specialists with a genuine desire to help others. Contrary to Gladwell, I’m not sure people fit one or the other profile in all aspects of their lives. But I recognize those roles around me, and I feel relatively familiar with the maven.

Mavens have stories to tell. Detailed, involved, engaging, fascinating stories. But since they’re connectors, they don’t necessarily like the dry, formal lessons you could find in a book. They have human stories that they personally want to take a bunch of their peers through. No wonder I identify.

For instance, among other things, I’m a fan of a genre of music commonly called by such labels as post-rock, especially a branch of it bearing the even scarier moniker of math-rock. I try to know everything I can about it. What most people do know about it, though, is that at the local hipster coffee shop, there’s a dude with abstract art on his T-shirt and a suede jacket who thinks, and says, anything else is crap. And it pains me personally that this is what most people will keep knowing of it.

Math-rock seems so hermetic because, like so much music at the fringes right now, it is an exploratory front of a long history of mixing influences and perspectives. The way I’d go about telling people about math rock is essentially a connective experience: I’d start writing a couple of weekly paragraphs about how the musical story went, tracing links between things that seem very distinct. I’d relate how progressive rock started laying some tracks in the 60s, tell you where Don Cab’s crazy time signatures fit into this, and how the indie scene reappropriated some of this technique with an emotional influence that permeates some of the airier music you have probably enjoyed in the scores of a couple of Michael Mann’s latest movies. I would add links to a couple of tracks you could grab on iTunes at the end of each post and listen to on the car ride home. And by the way, a bunch of us are going to that concert I think you’d like next week, would you like to join? Here’s the Facebook page for the event.

Does that approach mean I’m a maven? No idea. Is that sort of behavior maven-like? You betcha. This kind of story is not reserved for a few select experts — otherwise I would be hard-pressed to explain why I would be entitled to raise my voice on the subject. We all have friends who introduced us, more or less informally, but always personally, to their interests and passions, musical or otherwise.

And it works, more or less. I mean, if the genre is far from your thing, putting it in context alone is not going to make it so. But my experience in ventures of that sort (k-lophonies) shows that people enjoy the story, in the sense that some context makes the music “familiar” enough for them to enjoy. I have myself been on the receiving end of similar get-togethers about classical music, and have greatly benefited from them. And hopefully, next time people meet the obnoxious hipster type at the coffee shop, they’ll be able to steer him or her towards a less annoying approach to talking about music.

Now, Facebook has events, notes, and links. One thing I think is missing from the user experience — and I think one of the reasons people don’t blog on Facebook — is the option to unwind a string between those nuggets of context. Something that would let a friend who skipped the first few installments of my informal introduction to math rock catch up. Something that would, more importantly, form a connective experience through a story over the course of a few weeks.

And Facebook’s Timeline is getting very close to this: I do have a life story, but I have other stories as well. Facebook could help me tell them, and in so doing, connect better with people.

# Certifying RSA correct for breakfast

Reasoning about the correctness of the RSA cryptosystem is much easier with a little bit of algebra: this is a perfect example of a case where the “practical” applications of proof assistants can be helped by libraries dealing with abstract algebra, including SSReflect. The proof we are going to build in the present post is based on a stock SSReflect version 1.3pl1, and is available on github. It can be compared with the RSA contribution to Coq, made on the basis of the standard library (note this is not code golf, but rather an attempt to demonstrate that a big mathematical library helps).

Let us look at the encryption and decryption primitives of RSA: given a message $M$, represented as a number, the ciphertext $C$ is given by $C \equiv M^e \pmod n$. Likewise, decryption is defined by $M \equiv C^d \pmod n$. The public key is the pair $(e,n)$ and the private key is the pair $(d,n)$, with the message $M$ chunked and padded appropriately so that $M \leq n$. In SSReflect:

Definition encrypt (w : nat) := (w ^ e) %% n.

Definition decrypt (w : nat) := (w ^ d) %% n.


The correctness property verifies that the cypher text allows us to recover the original message, i.e.:

$$C^d \equiv M^{e d} \equiv M \pmod n$$

For that, it suffices to choose $e,d$ so that :

$$ed \equiv 1 \pmod{\Phi(n)}$$

where $\Phi$ designates Euler’s totient function: $\Phi(n)$ is the number of integers $1 \leq m < n$ such that $m$ and $n$ are coprime.

This is an easy process: for security reasons, $n$ is chosen to be the product of two randomly chosen large prime numbers $p$, $q$. Since for all $n_1,n_2$ coprime, $$\Phi(n_1n_2) = \Phi(n_1)\Phi(n_2)\label{eq:phi_mul}$$ the totient of $n$ is easily computed to be $\Phi(pq) = (p-1)(q-1) = n - (p + q) + 1$. Naturally, having that last equation as a library lemma (phi_coprime) helps tremendously. Hence, to pick $e,d$ verifying the equation above, all that remains is to pick a random $e$ coprime to $\Phi(n)$, so that we can use Euclid’s algorithm to compute $d$, its multiplicative inverse modulo $\Phi(n)$.

Variable p q : nat.
Variable prime_p : prime p.
Variable prime_q : prime q.
Variable neq_pq: p != q.

Local Notation n := (p * q).

Hypothesis big_p: 2 < p.
Hypothesis big_q: 2 < q.

Lemma pq_coprime: coprime p q.
Proof. by rewrite prime_coprime // dvdn_prime2 //. Qed.

Lemma phi_prime : forall x, prime x -> phi x = x.-1.
Proof.
move=> x; move/phi_pfactor; move/(_ _ (ltn0Sn 0)).
by rewrite expn1 expn0 muln1.
Qed.

Lemma pq_phi: phi(n) = p.-1 * q.-1.
Proof.
rewrite phi_coprime; last by rewrite pq_coprime.
by rewrite !phi_prime //.
Qed.

(*  We pick an exponent e coprime to phi(n) *)

Variable e : nat.
Hypothesis e_phin_coprime: coprime (phi n) e.

Definition d := (fst (egcdn e (phi n))).

(* We check that ed is 1 modulo (phi n) *)

Lemma ed_1_mod_phin: e * d = 1 %[mod (phi n)].
Proof.
by  rewrite -(chinese_modl e_phin_coprime 1 0) /chinese
Qed.


In this introduction of basic lemmas, we noticed that the value of $\Phi$ for a prime wasn’t included in the library. However, the Search tool of SSReflect allowed us to quickly find a more general lemma on the value of $\Phi$ based on the prime decomposition of a number (phi_pfactor above).

And finally, the value of $ed \pmod{\Phi(n)}$ leads to correctness thanks to Euler’s theorem:

For all $n$ and $a$ coprime to $n$, $a^{\Phi(n)} \equiv 1 \pmod n$

Hence, $\exists k$ such that $ed = k \Phi(n)+1$, so that $$M^{ed} = M^{k \Phi(n)+1} = M^{k \Phi(n)} M \equiv M \pmod n$$ is a consequence of Euler’s theorem if $M$ and $n$ are coprime.

The bane of working with a proof assistant is that we can’t ignore an often-overlooked detail: if $M$ and $n$ are not coprime (and if $M \neq n$, otherwise this is trivially true), we can assume, without loss of generality, that $p | M$ and that $q$ and $M$ are coprime. Indeed:

Lemma notcoprime: forall x, 0 < x < n -> ~~ (coprime x n) ->
((p %| x) && coprime x q) || ((q %| x) && coprime x p).
Proof.
move=> x; case/andP=>[x_gt0 x_lt_n]; rewrite coprime_mulr; move/nandP.
move/orP; rewrite coprime_sym [coprime _ q]coprime_sym 2?prime_coprime //.
case Hdvdn: (p %| x) =>/=; rewrite ?negbK ?andbF ?andbT // orbF=>_.
have xdpp_x: (x %/p) * p = x by move: (Hdvdn); rewrite dvdn_eq; move/eqP.
rewrite -xdpp_x; apply/negP=> H; move:H; rewrite (euclid _ _ prime_q).
rewrite [ _ %| p]dvdn_prime2 // eq_sym; move/negbTE: neq_pq=>->.
rewrite orbF=>H; move: (dvdn_mul (dvdnn p) H).
rewrite [_ * (_ %/ _)]mulnC xdpp_x; move/(dvdn_leq x_gt0).
by rewrite leqNgt x_lt_n.
Qed.


In that case $M^{k \Phi(n)} M \equiv M \pmod q$ is a consequence of Euler’s theorem, and $M^{k \Phi(n)} M \equiv M \pmod p$ is trivially true, which allows us to conclude using a Chinese remainder lemma. Most presentations of RSA cop out of treating this case, but it makes the encoding no less correct. We can now prove the theorem:

Theorem rsa_correct1 :
forall w : nat, w <= n -> (decrypt (encrypt w)) = w %[mod n].
Proof.
move=> w w_leq_n; rewrite modn_mod modn_exp -expn_mulr.


This simplifies the goal to w ^ (e * d) = w %[mod n]. Then we are going to want to treat the case where w = 0 first. Since this is a decidable property, we proceed by making a case on 0 < w. We simplify the left part:

case w_gt0: (0 < w); last first.
move/negbT: w_gt0; rewrite lt0n; move/negPn; move/eqP=>->.


We are brought to 0 ^ (e * d) = 0 %[mod n]. The left part forces us to check that 0 < e*d:

have ed_gt0: 0 < e * d.
have:= (divn_eq (e*d) (phi n)); rewrite ed_1_mod_phin => ->.


We are left with 0 < (e * d) %/ phi n * phi n + 1 %% phi n. We would like to simplify $1 \pmod{\Phi(n)}$, which requires $1 < \Phi(n)$:

have phi_gt1 : 1 < (phi n); first rewrite pq_phi -(muln1 1).
by apply: ltn_mul; rewrite -ltnS prednK ?big_p ?big_q ?prime_gt0.


We can then conclude on both 0 < e * d and the original case where w = 0:

    by rewrite [_ * phi n]mulnC (modn_small phi_gt1) addnS lt0n.
by rewrite exp0n ?ltn_addl // ?modn_small.


Back to $w > 0$! Simplifying our goal a little further:

have:= (divn_eq (e*d) (phi n)); rewrite ed_1_mod_phin modn_small // =>->.


We get w ^ ((e * d) %/ phi n * phi n + 1) = w %[mod n], and we are ready to conclude for the case where message and modulus are coprime! This is the core of the proof: a simple use of Euler’s theorem and a few helper lemmas on moduli.

case cp_w_n : (coprime w n).
rewrite expn_add [_ * phi n]mulnC expn_mulr -modn_mul2m; move: cp_w_n.
by move/Euler=>E; rewrite -modn_exp E modn_exp exp1n modn_mul2m mul1n.


Now, in the case where modulus and message are not coprime, we are going to want to use the notcoprime lemma above, for which we also require them not to be equal, so we quickly eliminate that case.

case w_eq_n: (w == n).
by move/eqP: w_eq_n =>->; rewrite -modn_exp modnn exp0n ?mod0n ?addn1.


Finally, we apply the Chinese remainder lemma, to separate our requirement on $w \pmod n$ into requirements on $w \pmod p$ and $w \pmod q$:

move: w_leq_n; rewrite leq_eqVlt {}w_eq_n orFb; move=> w_lt_n.
apply/eqP; rewrite chinese_remainder; last first.
by rewrite prime_coprime // dvdn_prime2.


The goal at this stage is

 (w ^ ((e * d) %/ phi n * phi n + 1) == w %[mod p]) &&
(w ^ ((e * d) %/ phi n * phi n + 1) == w %[mod q])


We want to rewrite this a little further, to make more prominent the prime that $w$ is not divisible by.

rewrite mulnC {1 3}pq_phi {2}[_ * q.-1]mulnC -2!mulnA 2!expn_add expn_mulr.
rewrite [w ^ ( q.-1 * _ )]expn_mulr expn1 -modn_mul2m -(modn_mul2m _ _ q).
rewrite -modn_exp -(modn_exp _ q).


This gives us the goal:

((w ^ p.-1 %% p) ^ (q.-1 * ((e * d) %/ phi n)) %% p * (w %% p) == w %[mod p]) &&
((w ^ q.-1 %% q) ^ (p.-1 * ((e * d) %/ phi n)) %% q * (w %% q) == w %[mod q])


We can now use notcoprime above, apply Euler’s theorem in each case, and conclude !

move/andP: (conj (idP w_gt0) w_lt_n).
move/notcoprime; move/(_ (negbT cp_w_n)); case/orP; move/andP=> [Hdvdn Hncp].
move: Hdvdn; rewrite /dvdn; move/eqP=>->; rewrite muln0 mod0n /=.
move/Euler:Hncp; rewrite (phi_prime prime_q)=>->.
by rewrite modn_exp exp1n modn_mul2m mul1n.
move: Hdvdn; rewrite /dvdn; move/eqP=>->; rewrite muln0 mod0n /= andbT.
move/Euler:Hncp; rewrite (phi_prime prime_p)=>->.
by rewrite modn_exp exp1n modn_mul2m mul1n.
Qed.


This concludes our certification of RSA’s correctness, in itself 30 lines of proof script. Note that we have never used any automated tactic database such as auto, which gives our proof script a particularly deterministic nature (but it’s not a requirement, of course). Here’s a run of coqwc, the Coq line-count utility, on our final file:

     spec    proof comments
       28       53       17 RSA.v


The final file is 126 lines.
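Before moving on, the statement we just certified can also be sanity-checked numerically. Here is a minimal Python sketch, with hypothetical toy key material (p = 5, q = 11, e = 3, d = 27, chosen small so that e*d = 1 mod phi(n)), checking that w^(e*d) = w (mod n) for every message w < n, coprime with n or not:

```python
# Toy RSA parameters (hypothetical, chosen small for the check)
p, q = 5, 11
n, phi_n = p * q, (p - 1) * (q - 1)
e, d = 3, 27                 # 3 * 27 = 81 = 2 * phi_n + 1
assert (e * d) % phi_n == 1

# The certified statement: decryption inverts encryption for every message,
# including messages not coprime with n.
assert all(pow(w, e * d, n) == w % n for w in range(n))
```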

# How do you make a recursive merge sort more efficient in Python ?

This is re-worked from an excerpt of an answer of mine (1) to an SO question that made me discover some interesting things about Python : it turns out it’s not easy to make recursion faster by doing tail call elimination, even by hand. (2)

We mean here to look at methods of tail call elimination and their effect on performance, demonstrated within the limits of the language, with merge sort as our example. Tail call elimination aims at turning tail-recursive calls into iterative loops, providing the benefit of not having to manage a recursive call stack, and thus, hopefully, of making things a bit faster. It is also used to deal with functions that are mutually recursive, or, more generally, whose number of recursive calls makes the call stack grow very fast.
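To fix ideas, here is a minimal (hypothetical) Python illustration of what tail call elimination does : a tail-recursive sum, where nothing remains to be done after the recursive call, and the loop it can mechanically be turned into:

```python
def sum_rec(xs, acc=0):
    if not xs:
        return acc
    return sum_rec(xs[1:], acc + xs[0])  # tail call: the caller has nothing left to do

def sum_iter(xs, acc=0):
    # the tail call becomes a loop iteration: rebind the arguments, jump back
    while xs:
        xs, acc = xs[1:], acc + xs[0]
    return acc

assert sum_rec(list(range(10))) == sum_iter(list(range(10))) == 45
```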

Python is an interesting example for showing tail call elimination within the language itself, because it does not have tail call optimization built in. Some languages (generally functional ones) have built-in tail call elimination, meaning every tail call you make gets compiled to an iterative loop. If you want an iterative merge sort in one of those languages, all you need is to rewrite the classic version to make your recursive calls tail recursive, using a CPS translation, as we will show below.

Of course, if what you want is a fast sort in Python, the native sort, or numpy’s alternative, or even an in-place quicksort are better alternatives. Even more impressively, faster implementations of a sort which closely (3) mimics the recursive pattern of the merge sort have been proposed on reddit in response to this post. And finally, if you wish to break the limits of what the language intends “natively”, it should be interesting to turn to one of the famous tricks for achieving TCO in Python. Our goal here is to apply tail call elimination within language boundaries, and maybe see if we obtain a TCO’ed function faster than the function it was derived from.

The exhaustive benchmark with timeit of all implementations discussed in this post (including bubblesort) is posted as a gist here. Just to give a starting point, I find the following results for the average duration of a single run :

• Python’s native (Tim)sort : 0.0144600081444
• Bubblesort : 26.9620819092

Those are seconds (but since only the relative speed matters, this isn’t very important). As can be expected, Python’s native sort is much faster. And to think that Timsort is supposed to be a (C) merge sort !

• For starters, we can write a relatively efficient, non-tail-recursive merge sort in the following way:

def mergeSort(array):
    n = len(array)
    if n <= 1:
        return array
    left = array[:n/2]
    right = array[n/2:]
    return merge(mergeSort(left), mergeSort(right))

def merge(array1, array2):
    merged_array = []
    while array1 or array2:
        if not array1:
            merged_array.append(array2.pop())
        elif (not array2) or array1[-1] > array2[-1]:
            merged_array.append(array1.pop())
        else:
            merged_array.append(array2.pop())
    merged_array.reverse()
    return merged_array


It runs, as expected, quite a bit faster than the Bubblesort. Average run time : 0.126505711079

Now, our example may not be the best test of the difficulty of managing a huge call stack depth. The number of recursive calls R(n) on a list of size n is (2n-2) : for $k \geq 2$, $R(k) = 2 + 2R(k/2)$, and $R(1) = 0$. But in depth, the stack blowup of the recursive merge sort algorithm is in fact modest : as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (and removing frames). So for 10K-sized lists, you get a call stack that is at any given time at most of depth $\log_2(10\,000) \approx 14$ … pretty modest. But we hope it might still give a hint of what our TCO methods entail.
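The recurrence above is easy to check numerically, at least for power-of-two sizes, where the halving is exact:

```python
# R(1) = 0, R(k) = 2 + 2 * R(k/2): number of recursive calls of mergeSort
def R(k):
    return 0 if k == 1 else 2 + 2 * R(k // 2)

# for power-of-two sizes, the closed form R(n) = 2n - 2 holds exactly
assert all(R(2 ** i) == 2 * 2 ** i - 2 for i in range(11))
```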

• The calls to mergeSort are not tail calls: the function calls itself on both the left and right halves while there is remaining work after those calls (the merge).

But we can make the merge sort tail-recursive by using CPS. Note that this would only be useful as-is if we were using a language that does tail call optimization; without it, it’s a bad idea. Indeed, this is only workable in Python if we do further work on that intermediary function, which, by itself, runs out of stack space for even modest lists (fastmerge, used below, stands for the merge helper benchmarked in the gist):

def cps_merge_sort(array):
    return cpsmergeSort(array, lambda x: x)

def cpsmergeSort(array, continuation):
    n = len(array)
    if n <= 1:
        return continuation(array)
    left = array[:n/2]
    right = array[n/2:]
    return cpsmergeSort(left, lambda leftR:
        cpsmergeSort(right, lambda rightR:
            continuation(fastmerge(leftR, rightR))))


Once this is done, we can do TCO by hand, deferring the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here or there; the trick is originally due to Guy Steele). Trampolining and CPS work great together.

We write a thunking function that “records” and delays application: it takes a function and its arguments, and returns a function that returns (the original function applied to those arguments).

thunk = lambda name, *args: lambda: name(*args)


We then write a trampoline that manages calls to thunks: it keeps applying a thunk until it gets back a result (as opposed to another thunk).

def trampoline(bouncer):
    while callable(bouncer):
        bouncer = bouncer()
    return bouncer
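To see the thunk/trampoline pair in action on something smaller than merge sort, here is a hypothetical tail-recursive factorial that returns a thunk at each step; the trampoline bounces on it until a plain number comes out (the two helpers are repeated to keep the sketch self-contained):

```python
thunk = lambda name, *args: lambda: name(*args)

def trampoline(bouncer):
    while callable(bouncer):
        bouncer = bouncer()
    return bouncer

def fact(n, acc=1):
    if n <= 1:
        return acc
    return thunk(fact, n - 1, acc * n)  # freeze the tail call instead of making it

assert trampoline(fact(5)) == 120
# the call stack never grows: each bounce happens inside trampoline's own frame
```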


Then all that’s left is to “freeze” (thunk) all the recursive calls of the original CPS function, and let the trampoline unwrap them in proper sequence. The function now returns a thunk, without recursing (and thus discarding its own frame), at every call:

def tco_cpsmergeSort(array, continuation):
    n = len(array)
    if n <= 1:
        return continuation(array)
    left = array[:n/2]
    right = array[n/2:]
    return thunk(tco_cpsmergeSort, left, lambda leftR:
        thunk(tco_cpsmergeSort, right, lambda rightR:
            continuation(fastmerge(leftR, rightR))))

mycpomergesort = lambda l: trampoline(tco_cpsmergeSort(l, lambda x: x))


This gives us a recursive merge sort in which the call stack size is at most one at any given time : by returning a thunk, our originally recursive procedure hands back the “address” of the next function to call. But this does not go that fast (recursive mergesort: 0.126505711079, this trampolined version: 0.170638551712). Indeed, each recursive call needs to create a closure (the thunk), and the trampoline loop has to test whether the value is callable at every bounce, so naive trampolining is pretty expensive overall; it only pays off when the recursion would otherwise exhaust the call stack.

• We can do a slightly more involved stack-based recursion elimination in the style of this SO answer (rehashed here). The idea is to consider the recursion as storage of work to be done and of work accomplished so far, with tags to differentiate between the two. Then, reproducing a recursive pattern just consists in managing the order in which work units (here, slices of our array) are treated. This gives the iterative function:

def leftcomb(l):
    maxn, comb = len(l), []
    n = maxn/2
    while maxn > 1:
        comb.append((l[n:maxn], False))
        maxn, n = n, n/2
    return l[:maxn], comb

def tcomergesort(l):
    l, stack = leftcomb(l)
    while stack:  # l is sorted, stack contains tagged slices
        i, ordered = stack.pop()
        if ordered:
            l = fastmerge(l, i)
        else:
            stack.append((l, True))  # store the pending merge
            rsub, ssub = leftcomb(i)
            stack.extend(ssub)  # "recurse" on the unsorted slice
            l = rsub
    return l
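A quick self-contained check (in Python 3 syntax, with a plain list-based merge standing in for fastmerge) that the stack-based version agrees with the built-in sort:

```python
import random

def merge(a, b):  # stand-in for fastmerge: merge two sorted lists
    out = []
    while a or b:
        if not a:
            out.append(b.pop())
        elif (not b) or a[-1] > b[-1]:
            out.append(a.pop())
        else:
            out.append(b.pop())
    out.reverse()
    return out

def leftcomb(l):
    maxn, comb = len(l), []
    n = maxn // 2
    while maxn > 1:
        comb.append((l[n:maxn], False))
        maxn, n = n, n // 2
    return l[:maxn], comb

def tcomergesort(l):
    l, stack = leftcomb(l)
    while stack:
        i, ordered = stack.pop()
        if ordered:
            l = merge(l, i)
        else:
            stack.append((l, True))
            rsub, ssub = leftcomb(i)
            stack.extend(ssub)
            l = rsub
    return l

data = [random.randrange(10000) for _ in range(1000)]
assert tcomergesort(list(data)) == sorted(data)
```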


This goes a tad faster than the inefficient trampolined version above (trampolined mergesort: 0.170638551712, this stack-based version: 0.144994809628). This is not surprising. The interesting fact is that this final iterative version goes slower than the basic, non-tail-recursive, run-of-the-mill merge sort you would have written as CS 101 homework. Now this is cool : CPython’s function stack manipulations are faster than my explicit array manipulations. This is more a statement on the size of the recursive call stack (mentioned above) and the efficiency of CPython than on the quality of my CS homework (indeed, an even faster recursive version has been suggested on reddit). But still. It’s surprising that beating the run-of-the-mill recursive version would take an iterative function that does more than the basic, systematic recursion-to-stack transformation above.

The final result summary ? On my machine (Ubuntu Natty’s stock Python 2.7.1+), the average run timings (out of 100 runs -except for Bubblesort-, lists of size 10000, containing random integers between 0 and 10000000) are:

• Python’s native (Tim)sort : 0.0144600081444
• Bubblesort : 26.9620819092
• non-tail-recursive Mergesort : 0.126505711079
• trampolined CPS non-recursive mergesort  : 0.170638551712
• stack-based non-recursive mergesort : 0.144994809628

1. Initially fraught with errors, and therefore duly ill-scored.

2. This has also benefited a lot from a discussion with Carey Underwood below and in a reddit thread on this post. Many thanks !

# Lightweight profiling for your Coq hacks

Let’s suppose we want to hack on a few functions in the Coq compiler. The classic situation is that we have the base implementation of our function as it stands, and an optimized version for which we want to check that it really improves things.

We all know Donald Knuth’s line on “premature optimization” …

Crucially, we want execution time information, not just call count information: there is no reason for our “optimized” subroutine, deep in the entrails of the compiler, to be called a different number of times.

The easiest way to profile Coq is to compile it as native code, passing the -profile option when running the configure script:

huitseeker@fuwaad:/opt/coq/trunk\$ ./configure --help| grep -A1 -e '-profile'
-profile
Add profiling information in the Coq executables


This will in fact modify the native compilation using ocamlopt, so that it generates profiling information by adding calls to the _mcount hook at each ML definition. The details are in the OCaml manual.

Once you have called the configure script with the appropriate options, and recompiled Coq, you can execute your toplevel as usual:

coqtop.opt  -q  -dont-load-proofs -I .    -compile mytheoryfile


This will generate a gmon.out file that you can then parse with gprof.

gprof /opt/coq/trunk/bin/coqtop.opt gmon.out > ~/gmonoutput.log


However, the problem then becomes that on a number of proof files, Coq spends most of its time within reduction primitives … as can be expected, I guess. In the reports, huge reduction loops that have nothing to do with our specific function often dwarf its measurements. Thankfully, a more lightweight option exists in the source of the compiler, hidden in library/profile.ml. It’s documented in comments in library/profile.mli:

To trace a function f you first need to get a key for it by using :

let fkey = declare_profile "f";;


(the string is used to print the profile information). Warning: this function has a side effect; choose the identifier you want instead of “fkey”. Then, if the function has ONE argument, add the following just after the definition of f, or just after the declare_profile if that one follows the definition of f.

let f = profile1 fkey f;;


If f has two arguments, do the same with profile2; likewise with 3, … For more arguments than provided in this module, make a new copy of the profile function and adapt it for the needed arity.

This is really the 5-minute way to profile your Coq hack (modulo recompilation time, of course) : there is an example of profiling a 4-argument function in the comments of kernel/reductionops.ml, and Coq is even nice enough to already call Profile.print_profile() for you at the end of the toplevel execution. You get a result such as :

Function name              Own time Total time  Own alloc Tot. alloc     Calls
f                            0.28      0.47        116        116      5      4
h                            0.19      0.19          0          0      4      0
g                            0.00      0.00          0          0      0      0
others                       0.00      0.47        392        508             9
Est. overhead/total          0.00      0.47       2752       3260

• Ocamlviz is another option, but we don’t care much about real-time profiling information. Running the profiler as a server corresponds to another use case : optimizing a proof script rather than the compiler itself. Besides, it requires linking with the libocamlviz library, which is somewhat difficult to integrate into the monolithic (and somewhat labyrinthine) build process of Coq (especially when aiming to analyze just a few functions).

• You may also want to have a look at this issue of the OCaml newsletter, which details a few of the alternatives for profiling OCaml code, including oprofile, which I haven’t tested.

In my first year I was taught about the slide rule. They said, “The slide rule is important. Without it you can do nothing. The slide rule is the modern weapon of efficiency. With the slide rule you can get from here to the stars. Buy it, use it – your slide rule!” Within one year it was, “Burn the slide rule. The calculator can add up with none of this fucking sliding the shit around and working out where that bit in the middle goes. Smash it over your head.
Eddie Izzard, Biography, Chapter 5