Cheat Engine

thenicelordj

Semi-automated cracking of monoalphabetic and polyalphabetic substitution ciphers

Since a few of the 'crackme' challenges in this topic have been based upon substitution ciphers I thought I'd spend a little time to cover a cracking strategy that seems to elude most people. What I present here is a method of breaking substitution ciphers in any language or format without the usual human approach... that is, without guessing at the likely sequences which help determine the plaintext.

First though, some examples of the classical approach...

Example 1, Monoalphabet with spaces preserved

In a Monoalphabetic substitution cipher you have an alphabet used for the plaintext (For example A-Z). Thus all plaintext messages are expressed in this alphabet. In the example given (uppercase letters) the following is a valid message :

I AM GOING TO THE SHOPS

Now, we choose 26 symbols that will form the encrypted text. It is important to note that the symbols are not always chosen from the same alphabet. Usually though, they are just a rearrangement of the initial alphabet, like this...

CODE
Plain Message : [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
Crypt Message : [FQRPTLVDWYHZIJSKMXNOAUBGCE]

Thus the message [I AM GOING TO THE SHOPS]
becomes [W FI VSWJV OS ODT NDSKN]

Now the usual approach 'frequency distribution' won't work too well. Basically, the principle that E is the most common letter in english text breaks down because the sample is not representative. There is only one E. There are three O's though, so we're already starting badly by mistaking O's for E's The next most common letter in english test is 'T', but as you see here this is ambiguous as T, S and even G have 2 letters. Still, this is enough to guess that ODT might mean THE and thus make a guess at the true symbol for 'E'

But lets put frequency distribution approaches aside for a moment, just realise that frequency distribution will only get you a few characters at best and requires careful application to avoid the 'E/O' problem we encountered above. Ultimately, whatever the results of frequency analysis give us, we must eventually solve this as a logic problem.

Of course, we have a LOT of clues, firstly the spaces are preserved, so we immediately can guess that the lone 'W' represents either an I or an A. We also know that 'OS' represents an 'is, am, it, an, or, be, to, etc...' ... experience tells us that ODT is probably 'and' or 'the', but could also be 'but', 'our'.

What do we see in 'OS ODT' well, the initial letter is the same.

So, 'to the' and 'on our' seem very likely, 'be bad' is a less likely possibility... 'an ass' is obviously ruled out because of the double s... etc...

Continuing with this thinking, and testing various possibilities we eventually get to a point where we can solve the problem. Essentially we are playing a variation of the childrens game 'hangman'. Trying to guess at the gaps by seeing patterns in structure and a few available letters.

After a few minutes we will probably get to the solution.

Example 2, Monoalphabetic with spaces encrypted

Lets say that instead of using 26 symbols in our alphabetic substitution cipher we choose 27. The new one is the space character itself.

CODE
Plain Message : [ABCDEFGHIJKLMNOPQRSTUVWXYZ ]
Crypt Message : [FERPTLVDWYHZIJ SKMXNOAUBGCQ]

Thus the message [I AM GOING TO THE SHOPS]
becomes [WQFIQV WJVQN QNDTQXD SH]

This hides a lot of the clues. The initial approach would probably have the frequency distribution approach misinterpret Q (the spaces) for the letter E or T. In the logic phase they would probably assert that SH is 'it' because of its relationship to the larger preceeding word and the fact that it occurs at the end of the message (maeaning that 'is', 'as', 'am', etc... would be dangling, and thus gramatically unlikely)

Of course, after some time spent analysing this the cracker may realise the mistake and treat Q as a SPACE (The most common SYMBOL in english text) and suddenly everything starts falling into place. They then solve it as before.

The automated solution to all monoalphabetic ciphers

It is an incredibly complex problem to code a program which will emulate your thought processes as you try to crack a monoalpabetic cipher. Describing gramatical relationships alone is a difficult task. Now, you might think, easy, I'll just write a program which tries every possible combination... that is an incredibly wasteful and simplistic approach but you still have a problem, how does the program 'know' when it hits a valid solution ? Do you want it to prompt you whenever it gets a string with the word 'the' in it ? That is going to log countless possibles! Do you want to check everything against a wordlist? What if the majority of the message is names ?

HERBERT GRIMSHAW IS COOLTASTIC

what we need is a system for identifying english-looking structures. A good way to do this is with trigrams or quadrigrams. We will work with quadrigrams (4 letter sequences).

We create an array 27x27x27x27 (A four dimensional array) and then head over to project gutenberg (free online literature) for training. Find a selection of reasonably contemporary piece and download them as text. Strip off the license headers and feed these files through an algorithm.

We read the book from beginning to end using a sliding window. We ignore all punctuation and symbols and merge any whitespace into a single space. This gives us a stream of characters separated only by single spaces. We move through this stream one character at a time keeping a buffer of the last four symbols read.

CODE
Thus, for a book which begins 'It was a cold, dark, stormy night...' our buffer reads the following...

[IT W]
[T WA]
[ WAS]
[WAS ]
[AS A]
[S A ]
[ A C]
[A CO]
[ COL]
[COLD]
[OLD ]
[LD D]

Each buffer represents a quadrigram. Now, for each buffer we do the following. Turn each character into a symbol. A=0, B=1, C=2, D=3..to..Y=25, Z=26, (SPACE)=27. This is easy, just deduct 65 (The value of 'A') from each character in the buffer... if the result is negative then set it to 27

So, the first buffer 'IT(space)W' becomes '8, 19, 27, 23'

Remember our 4 dimensional array ? This counts the occurrences of each unique quadragram pattern. So, when we see [IT W] we increment the value at that position in the four dimensional array:

CODE
In C this looks like 'buffer[8][19][27][23]'
in VB it would look Like 'buffer(8,19,27,23)'

Keep doing this for a few books. If you want to build a quadrigram for breaking spanish messages you'd use modern spanish texts to train your quadrigrams, if russian you would use a different alphabet and russian texts to train on. By the time you're finished you have an array of quadrigrams and their likelyhood. Ones that have never occurred score badly. Ones that occur regularly score highly.

Thus, en english, Q followed by anything other than a U will score very badly indeed...

CODE
For example, the message O QID ITPAA'

[O QI] scores highly negative
[ QID] scores highly negative
[QID ] scores highly negative
[ID I] moderately positive
[D IT] moderately positive
[ ITP] scores highly negative
[ITPA] scores highly negative
[TPAA] scores highly negative
[TPAA] scores highly negative

The result, this is rejected as english text.

CODE
Other constructs score highly twice... 'Is the round ball under the chair'

[IS T] scores high
[S TH] scores high
[ THE] scores very high
[THE ] scores very high
[HE R] high
[E RO] high
[ ROU] moderate
[ROUN] moderate-high
[OUND] moderate-high
[UND ] moderate
[ND B] moderate-high

Such a sentence would very quickly pass the threshold for strong english constructions.

I hope you realize the simplicity and power of the quadragram approach for automated cracking of mono alphabetic substitution ciphers. It can be trained on any language or alphabet you wish, from English to Russian, from dwarvish to klingon. It can also be trained on protocols such as HTML or IRC and structured documents and file formats.

Whilst it will not 'know' when it has the correct answer you can league-table the best results for display, normally the answer is in the first 20 places and, with structured documents such as HTML, is almost always the top result.

Not only does it simplify monoalphabetic substitution but it helps considerably to speed up polyalphabetics and other non-substitution ciphers such as RC4. And, the best thing, you can build a discrimination library of languages, protocols, dialects, formal and informal speech. Anything.

Alphabets with upper AND lower case are even more discriminating. Alphabets with periods will discriminate against 'is' at the end of a sentence thus demonstrating a very simplistic awareness of gramatical structure. Endings with 'ing' 'er' 'ist' score higher too than the same constructs in the middle of a word. It really does take the work out of recognising solutions and discerning between possible languages.

So, stop breaking substitution ciphers by hand, and give this a go. See if you can write a program which breaks monoalphabetics in a semi-automated fashion. If you manage this, try the problem of breaking polyalphabetics (much more of a challenge)

The other thing you should take from this is that substitution ciphers are NOT valid encryption these days, at best they are logic puzzles. Fun to break if you're bored, but ultimately useless.

oib111 · Posted: Thu May 15, 2008 2:52 pm Post subject:

thenicelordj · Posted: Fri May 16, 2008 5:13 am Post subject:

Jake! · Posted: Tue May 20, 2008 8:51 pm Post subject:

Overload · Posted: Tue May 20, 2008 9:01 pm Post subject: