Probabilities in Proofreading

Suppose you write a program and send the source code to two of your friends. Both friends read the code and, when they finish, A errors have been detected by the first friend, B errors by the second friend, and C errors by both. So, in total, A+B-C errors are detected and can now be eliminated. We wish to estimate the number of errors that remain unnoticed and uncorrected.

The original version of this problem concerns manuscripts and proofreaders instead of source code and programmers. It was posed and solved by George Polya and published in 1976 in The American Mathematical Monthly under the title Probabilities in Proofreading. Because the problem is interesting and Polya’s solution is short and elegant, I have decided to record and share it. Also, since sharing and reading code is a frequent activity in the software development world, estimating the desired value can be helpful for some readers of this blog.

Estimating the number of unnoticed errors

Let E be the number of all errors, noticed and unnoticed, in the source code. Our goal is to estimate the value of E-(A+B-C). Let p be the probability that the first friend notices any given error and q the analogous probability for the second friend. The expected number of errors detected by the first friend is $Ep$ and by the second friend is $Eq$. Assuming that the two friends detect errors independently of each other, the expected number of errors detected by both friends is $Epq$.

Because we are interested in an estimate, we can assume that the expected numbers are approximately equal to the numbers of errors actually detected, that is, $A \approx Ep$, $B \approx Eq$, and $C \approx Epq$. (We use the notation $x \approx y$ to denote that two numbers $x$ and $y$ are approximately equal.)
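To get a feel for how reasonable these approximations are, here is a minimal simulation sketch. It assumes the model described above: every error is noticed by the first friend with probability p and by the second friend with probability q, independently. The function name `simulate_review` and the sample values of E, p and q are mine and purely illustrative.

```python
import random


def simulate_review(total_errors, p, q, seed=None):
    """Simulate two independent readers and return the counts (A, B, C).

    A: errors noticed by the first friend
    B: errors noticed by the second friend
    C: errors noticed by both
    """
    rng = random.Random(seed)
    a = b = c = 0
    for _ in range(total_errors):
        seen_by_first = rng.random() < p
        seen_by_second = rng.random() < q
        a += seen_by_first
        b += seen_by_second
        c += seen_by_first and seen_by_second
    return a, b, c


# Illustrative values only (not taken from Polya's paper).
E, p, q = 10_000, 0.6, 0.5
A, B, C = simulate_review(E, p, q, seed=1)

print("A =", A, "vs Ep  =", E * p)
print("B =", B, "vs Eq  =", E * q)
print("C =", C, "vs Epq =", E * p * q)
```

With a large E the simulated counts should land close to the expected values. With only a handful of errors the gap can be much larger, so the approximation is best read as a rough guide.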

We now have all the ingredients to conclude the solution. Recall that our goal is to estimate the value of E-(A+B-C). We calculate:

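$$
\begin{aligned}
E-(A+B-C) &\approx E - Ep - Eq + Epq \\
&= E(1-p)(1-q) \\
&= \frac{Ep(1-q)\cdot Eq(1-p)}{Epq} \\
&= \frac{(Ep-Epq)(Eq-Epq)}{Epq} \\
&\approx \frac{(A-C)(B-C)}{C}.
\end{aligned}
$$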

This is the desired estimate!
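As a small sketch, the estimate that the calculation arrives at, $(A-C)(B-C)/C$, can be computed directly from the three counts. The function names below are my own. The input checks (C must be positive and cannot exceed A or B) are an assumption I added, since the formula divides by C. The second function uses the fact that $AB/C \approx (Ep)(Eq)/(Epq) = E$ to estimate the total number of errors.

```python
def estimate_unnoticed(a, b, c):
    """Estimate the number of errors neither reader noticed: (A-C)(B-C)/C."""
    if c <= 0:
        raise ValueError("the estimate needs at least one error found by both readers")
    if c > min(a, b):
        raise ValueError("C cannot exceed A or B")
    return (a - c) * (b - c) / c


def estimate_total(a, b, c):
    """Estimate the total number of errors E as AB/C."""
    if c <= 0:
        raise ValueError("the estimate needs at least one error found by both readers")
    return a * b / c


# Made-up example: the first friend finds 20 errors, the second finds 15,
# and 10 of those errors were found by both.
print(estimate_unnoticed(20, 15, 10))  # 5.0
print(estimate_total(20, 15, 10))      # 30.0
```

In this made-up example 20+15-10 = 25 errors were detected, so an estimated total of 30 agrees with the estimate of 5 unnoticed errors. Note that when C is small relative to A and B the estimate becomes large and unstable, and when C = 0 it is undefined.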


One Comment

  1. June Kim wrote:

    Nice piece. This is a well-known practice in ecology for estimating population size (and recently in sociology and policy-making for estimating, for example, the number of homeless people). It’s called mark and recapture or capture-recapture.

    http://en.wikipedia.org/wiki/Mark_and_recapture

    There are even some (not many) papers applying this technique to defect estimation in software.

    Friday, December 11, 2009 at 6:56 am
