|
Cheat Engine The Official Site of Cheat Engine
|
View previous topic :: View next topic |
Author |
Message |
thenicelordj Cheater Reputation: 0
Joined: 24 Mar 2007 Posts: 48 Location: I'm in your VCR
|
Posted: Thu May 15, 2008 6:23 am Post subject: Solving Substitution Ciphers, Manual versus Semi-Automated |
|
|
Semi-automated cracking of monoalphabetic and polyalphabetic substitution ciphers
Since a few of the 'crackme' challenges in this topic have been based upon substitution ciphers I thought I'd spend a little time to cover a cracking strategy that seems to elude most people. What I present here is a method of breaking substitution ciphers in any language or format without the usual human approach... that is, without guessing at the likely sequences which help determine the plaintext.
First though, some examples of the classical approach...
Example 1, Monoalphabet with spaces preserved
In a Monoalphabetic substitution cipher you have an alphabet used for the plaintext (For example A-Z). Thus all plaintext messages are expressed in this alphabet. In the example given (uppercase letters) the following is a valid message :
I AM GOING TO THE SHOPS
Now, we choose 26 symbols that will form the encrypted text. It is important to note that the symbols are not always chosen from the same alphabet. Usually though, they are just a rearrangement of the initial alphabet, like this...
CODE
Plain Message : [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
Crypt Message : [FQRPTLVDWYHZIJSKMXNOAUBGCE]
Thus the message [I AM GOING TO THE SHOPS]
becomes [W FI VSWJV OS ODT NDSKN]
Now the usual approach 'frequency distribution' won't work too well. Basically, the principle that E is the most common letter in english text breaks down because the sample is not representative. There is only one E. There are three O's though, so we're already starting badly by mistaking O's for E's The next most common letter in english test is 'T', but as you see here this is ambiguous as T, S and even G have 2 letters. Still, this is enough to guess that ODT might mean THE and thus make a guess at the true symbol for 'E'
But lets put frequency distribution approaches aside for a moment, just realise that frequency distribution will only get you a few characters at best and requires careful application to avoid the 'E/O' problem we encountered above. Ultimately, whatever the results of frequency analysis give us, we must eventually solve this as a logic problem.
Of course, we have a LOT of clues, firstly the spaces are preserved, so we immediately can guess that the lone 'W' represents either an I or an A. We also know that 'OS' represents an 'is, am, it, an, or, be, to, etc...' ... experience tells us that ODT is probably 'and' or 'the', but could also be 'but', 'our'.
What do we see in 'OS ODT' well, the initial letter is the same.
So, 'to the' and 'on our' seem very likely, 'be bad' is a less likely possibility... 'an ass' is obviously ruled out because of the double s... etc...
Continuing with this thinking, and testing various possibilities we eventually get to a point where we can solve the problem. Essentially we are playing a variation of the childrens game 'hangman'. Trying to guess at the gaps by seeing patterns in structure and a few available letters.
After a few minutes we will probably get to the solution.
Example 2, Monoalphabetic with spaces encrypted
Lets say that instead of using 26 symbols in our alphabetic substitution cipher we choose 27. The new one is the space character itself.
CODE
Plain Message : [ABCDEFGHIJKLMNOPQRSTUVWXYZ ]
Crypt Message : [FERPTLVDWYHZIJ SKMXNOAUBGCQ]
Thus the message [I AM GOING TO THE SHOPS]
becomes [WQFIQV WJVQN QNDTQXD SH]
This hides a lot of the clues. The initial approach would probably have the frequency distribution approach misinterpret Q (the spaces) for the letter E or T. In the logic phase they would probably assert that SH is 'it' because of its relationship to the larger preceeding word and the fact that it occurs at the end of the message (maeaning that 'is', 'as', 'am', etc... would be dangling, and thus gramatically unlikely)
Of course, after some time spent analysing this the cracker may realise the mistake and treat Q as a SPACE (The most common SYMBOL in english text) and suddenly everything starts falling into place. They then solve it as before.
The automated solution to all monoalphabetic ciphers
It is an incredibly complex problem to code a program which will emulate your thought processes as you try to crack a monoalpabetic cipher. Describing gramatical relationships alone is a difficult task. Now, you might think, easy, I'll just write a program which tries every possible combination... that is an incredibly wasteful and simplistic approach but you still have a problem, how does the program 'know' when it hits a valid solution ? Do you want it to prompt you whenever it gets a string with the word 'the' in it ? That is going to log countless possibles! Do you want to check everything against a wordlist? What if the majority of the message is names ?
HERBERT GRIMSHAW IS COOLTASTIC
what we need is a system for identifying english-looking structures. A good way to do this is with trigrams or quadrigrams. We will work with quadrigrams (4 letter sequences).
We create an array 27x27x27x27 (A four dimensional array) and then head over to project gutenberg (free online literature) for training. Find a selection of reasonably contemporary piece and download them as text. Strip off the license headers and feed these files through an algorithm.
We read the book from beginning to end using a sliding window. We ignore all punctuation and symbols and merge any whitespace into a single space. This gives us a stream of characters separated only by single spaces. We move through this stream one character at a time keeping a buffer of the last four symbols read.
CODE
Thus, for a book which begins 'It was a cold, dark, stormy night...' our buffer reads the following...
[IT W]
[T WA]
[ WAS]
[WAS ]
[AS A]
[S A ]
[ A C]
[A CO]
[ COL]
[COLD]
[OLD ]
[LD D]
Each buffer represents a quadrigram. Now, for each buffer we do the following. Turn each character into a symbol. A=0, B=1, C=2, D=3..to..Y=25, Z=26, (SPACE)=27. This is easy, just deduct 65 (The value of 'A') from each character in the buffer... if the result is negative then set it to 27
So, the first buffer 'IT(space)W' becomes '8, 19, 27, 23'
Remember our 4 dimensional array ? This counts the occurrences of each unique quadragram pattern. So, when we see [IT W] we increment the value at that position in the four dimensional array:
CODE
In C this looks like 'buffer[8][19][27][23]'
in VB it would look Like 'buffer(8,19,27,23)'
Keep doing this for a few books. If you want to build a quadrigram for breaking spanish messages you'd use modern spanish texts to train your quadrigrams, if russian you would use a different alphabet and russian texts to train on. By the time you're finished you have an array of quadrigrams and their likelyhood. Ones that have never occurred score badly. Ones that occur regularly score highly.
Thus, en english, Q followed by anything other than a U will score very badly indeed...
CODE
For example, the message O QID ITPAA'
[O QI] scores highly negative
[ QID] scores highly negative
[QID ] scores highly negative
[ID I] moderately positive
[D IT] moderately positive
[ ITP] scores highly negative
[ITPA] scores highly negative
[TPAA] scores highly negative
[TPAA] scores highly negative
The result, this is rejected as english text.
CODE
Other constructs score highly twice... 'Is the round ball under the chair'
[IS T] scores high
[S TH] scores high
[ THE] scores very high
[THE ] scores very high
[HE R] high
[E RO] high
[ ROU] moderate
[ROUN] moderate-high
[OUND] moderate-high
[UND ] moderate
[ND B] moderate-high
Such a sentence would very quickly pass the threshold for strong english constructions.
I hope you realize the simplicity and power of the quadragram approach for automated cracking of mono alphabetic substitution ciphers. It can be trained on any language or alphabet you wish, from English to Russian, from dwarvish to klingon. It can also be trained on protocols such as HTML or IRC and structured documents and file formats.
Whilst it will not 'know' when it has the correct answer you can league-table the best results for display, normally the answer is in the first 20 places and, with structured documents such as HTML, is almost always the top result.
Not only does it simplify monoalphabetic substitution but it helps considerably to speed up polyalphabetics and other non-substitution ciphers such as RC4. And, the best thing, you can build a discrimination library of languages, protocols, dialects, formal and informal speech. Anything.
Alphabets with upper AND lower case are even more discriminating. Alphabets with periods will discriminate against 'is' at the end of a sentence thus demonstrating a very simplistic awareness of gramatical structure. Endings with 'ing' 'er' 'ist' score higher too than the same constructs in the middle of a word. It really does take the work out of recognising solutions and discerning between possible languages.
So, stop breaking substitution ciphers by hand, and give this a go. See if you can write a program which breaks monoalphabetics in a semi-automated fashion. If you manage this, try the problem of breaking polyalphabetics (much more of a challenge)
The other thing you should take from this is that substitution ciphers are NOT valid encryption these days, at best they are logic puzzles. Fun to break if you're bored, but ultimately useless.
_________________
I'm like a rubix cube, the more you play with me the harder I get. |
|
Back to top |
|
|
oib111 I post too much Reputation: 0
Joined: 02 Apr 2007 Posts: 2947 Location: you wanna know why?
|
Posted: Thu May 15, 2008 2:52 pm Post subject: |
|
|
I suspect a copy paste....
_________________
8D wrote: |
cigs dont make people high, which weed does, which causes them to do bad stuff. like killing |
|
|
Back to top |
|
|
thenicelordj Cheater Reputation: 0
Joined: 24 Mar 2007 Posts: 48 Location: I'm in your VCR
|
|
Back to top |
|
|
Jake! Grandmaster Cheater Supreme Reputation: 4
Joined: 03 Jan 2007 Posts: 1354
|
Posted: Tue May 20, 2008 8:51 pm Post subject: |
|
|
x0r wrote: | thenicelordj wrote: | Indeed I copied my earlier post from cricticalsecurity. I'm DToX. Sure I could've mentioned it but it's funnier when someone finds the original thread and then accuse me. |
Oh sure, now please make a post on criticalsecurity.net verifying this claim |
Bitch got denied.
|
|
Back to top |
|
|
Overload Master Cheater Reputation: 0
Joined: 08 Feb 2008 Posts: 293
|
Posted: Tue May 20, 2008 9:01 pm Post subject: |
|
|
Link wrote: | x0r wrote: | thenicelordj wrote: | Indeed I copied my earlier post from cricticalsecurity. I'm DToX. Sure I could've mentioned it but it's funnier when someone finds the original thread and then accuse me. |
Oh sure, now please make a post on criticalsecurity.net verifying this claim |
Bitch got denied. |
Bitch got owned.
|
|
Back to top |
|
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum You cannot attach files in this forum You cannot download files in this forum
|
|