## Tuesday, October 27, 2009

OK, a respite from my last two posts about the Research Excellence Framework. The following question occurred to me, I make no promise that it is interesting.

Suppose you repeatedly roll a fair die and write down the sequence of numbers you get. I'm going to delete numbers from the sequence according to the following rule. Let X be the first number in the sequence. I delete the first block of consecutive X's (which will usually just be the single X of course), then I delete the second block of consecutive X's, but retain everything in between. Then I do the same thing to the rest of the sequence of dice rolls - it will start with some Y not equal to X, so I get rid of the first two blocks of consecutive Y's and continue.

For example, I would delete the bold-face elements of the following:

2,3,6,1,1,2,2,4,2,3,3,5,4,5,5,4,6,2,3,2,4,1,5,2,3,2,...

And the question is, is the sequence of undeleted numbers indistinguishable from a completely random sequence? Seems to work for 2-sided dice (coins).

Alex F said...

Intuitively, it seems likely to be a bit different: the procedure looks like it makes "slightly homogeneous" stretches (consisting of all values except X) more probable.

Indeed, take a 3-sided die, and measure the probability that the first 3 digits of the resulting sequence are all different. This is 6/27 for a random-rolls sequence. In this sequence, I can upper-bound this probability by 7/36 via a bit of arithmetic.

(Just add up the probabilities of the head of the original sequence matching, to within some symmetries, the patterns:
^0..011
^0..0122
^0..0121
^0..0120..01..12
^0..010..02..21
^0..010..02..200
^0..010..02..201
^0..010..01..122
^0..010..01..100)

Curious: did this come from any particular application? It certainly seems that the distribution will be quite close to uniform under any common metric that comes to mind.

Paul Goldberg said...

That looks plausible w.r.t. the first three characters of the sequence; but I'm still not sure if we'd be able to identify such a sequence more and more reliably as it gets longer and longer.

And the application? You'll wish you hadn't asked... I was trying to figure out a strategy for winning at randomly generated instances of bubble breaker as the size of the game increases... (winning means getting rid of all the "bubbles" rather than maximizing your score) i.e. the probability that you win should converge to 1 as the size increases (of course, there is always a positive probability that the game is unwinnable).

Anonymous said...
Alex F said...

> but I'm still not sure if we'd
> be able to identify such a
> sequence more and more
> reliably as it gets longer and
> longer.

We can use the probability of contiguous stretches in the modified sequence, which is slightly higher than in the original:

Use R_i for the i'th "roll"/sequence element. P[R_{i+1}=R_i|R_1...R_i] is of course 1/n in the original sequence (for an n-sided die). In the modified sequence, once i symbols have been emitted, there's a 1/n chance that the next element in the underlying sequence will be another copy of R_i, but also a chance that the next element will be a copy of the current "X", initiating a deleted sequence, then followed by a 1 or more copies of some digit that's not R_i, followed by R_i, causing another copy of R_i to be "emitted", for a total probability Pr[R_{i+1}=R_i|R_1..R_i]=1/n + (1/n)(1/(n-1))(1/n).

So there should be a d (I think 5 might be enough, given that the probabilities differ by a (1+Theta(1/n^2)) factor? too lazy to check) such that by measuring average lengths of "contiguous runs of digits" in the sequence over the course of O(n^d) elements, you should be able to distinguish with high confidence the slightly-longer average run lengths that you'd get in the modified sequence from the original sequence.

Alex F said...

(That should be 1/n + (1/n)(1/(n-1))(1/(n-1)), fwiw)

Paul Goldberg said...

Excellent, you have convinced me! And now this reddit.com web site worth browsing some more.