
Re: Verifying Toolchain Semantics


From: William ML Leslie
Subject: Re: Verifying Toolchain Semantics
Date: Mon, 6 Oct 2014 00:57:49 +1100

On 3 October 2014 22:56, Taylan Ulrich Bayirli/Kammer <address@hidden> wrote:
> William ML Leslie <address@hidden> writes:
>
>> Oh, interesting point. Maybe we should define PDF as an abstract
>> semantics that we can convert into a wide range of equivalent
>> document layout languages? If the attacker can't tell exactly what
>> xsl-fo or dsssl the tool will output, or what software you're using
>> to render the result, it will magically make it more difficult to
>> attack!
>
> Come on, that snark is undue. :-)
>
> It won't make it "magically" more difficult, it will make it
> actually more difficult.

It makes it harder in the same way ASLR makes attacks on memory-unsafe code harder: there is less you can rely on, but it's still basically the same game.

Say, for example, that I can guess you will end up viewing the document with one of five output programs: maybe evince, chrome, tk, gecko translating xsl-fo to xul, or firefox viewing html 4 directly; and say that I only have exploits for the first three.  Now I have a bit of a problem: I need to find a way to get your tool to emit the specific structure that will invoke the exploit.  Since the tool will show a document that looks correct even though the underlying structure may differ, I will have to account for variation in what it emits; but there is only a limited class of algebraic manipulations it can make without changing the semantics of the document.
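
To put toy numbers on that (my assumptions, not measurements: the pipeline picks a renderer uniformly at random, independently per document, and my exploits are reliable once their renderer is hit):

    # Toy model of the five-renderer scenario above.
    renderers = ["evince", "chrome", "tk", "gecko-xul", "firefox-html4"]
    exploited = {"evince", "chrome", "tk"}   # the three I hold exploits for

    p_hit = len(exploited) / len(renderers)  # 0.6: one document lands
    n = 5                                    # documents I get you to open
    p_any = 1 - (1 - p_hit) ** n             # ~0.99: at least one lands

    print(f"per document: {p_hit:.2f}; over {n} documents: {p_any:.2f}")

Randomising the back end costs me reliability on any single document; it doesn't change the game.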

Maybe I would not find a way to /reliably/ produce a document that will infect your machine.  Or maybe I would.  It's hypothetical.  But you know as well as I do that fuzzing a document generator to see what instructions produce exploitable output is easier than either writing a verifiable PDF viewer or writing a tool that bisimulates a PDF document with a class of layout languages.  The cherry on top is that all the extra code you introduce is surely rife with additional vulnerabilities, so maybe I don't need to bother with producing those back-end exploits at all.
> Though I think your analogy is wrong.  PDF is already an abstract
> semantics, like any formal language spec.  The question is whether
> an implementation realizes precisely those semantics, or adds "oh
> and insert a backdoor" to some pieces of code in that language, "as
> an addition to the language semantics."  This augmentation to the
> semantics needn't be apparent in the source code of the
> implementation, since the language it's written in might again be
> compiled by a compiler that "augments" the semantics of the language
> it compiles, and that in turn needn't be apparent in the compiler's
> source ... until you arrive at the long since vanished initial
> source code that planted the virus.

I wasn't suggesting that someone had a Thompson hack in your PDF viewer (do you consider PostScript a reasonable language for a bootstrap compiler? :-).  I was implying that infection via PDF is similar to a Thompson hack in an interesting way: having many possible PDF pipelines makes the problem only marginally harder, and in this, the problem is dual to that of the Thompson hack.

Where it would take a great fool to rely on syntactic properties (a /specific set of instructions/) to recognise and infect a C compiler in practice, it would take similar foolishness to imagine that applying a class of machine-describable translations to a source would do much to reduce an attacker's ability to get exploitable code out the other end.  It is *possible*, at least, to get to the point where you only have to trust the program consuming the output, be it x.org or a linux vt, by verifying that the output program is well typed; but then you could have just written the well-typed viewer and been done with it.

> That's just reiterating what the Thompson hack is.  It relies on us
> using a single lineage of compilers/software.  (Or a couple lineages
> could be compromised separately; that would still be plausible if
> it's plausible for one.)  Because at every layer the semantics can
> only be compromised enough to propagate the compromise through one
> concrete code-base.

No no no.  This is the point.  Anyone smart enough to have pulled off this attack, someone who has compromised the sha function too, is not relying on syntactic features; they are doing decent static analysis of higher-order behaviour.  Saying: this looks, semantically, like register allocation; that looks like it's producing an ELF header (very easy to do); that has appropriate allocation and loop structure to do SSA transformation.  Looking at the higher-order behaviour of code is not difficult; it's exactly what Ian is suggesting we /do/ to combat the problem.  I think it's probably easier to use this technique to propagate a Thompson hack than to combat it.

> In your PDF analogy, the solution is to write a spurious amount of
> PDF implementations.

Except that in doing so, you've structured your vulnerabilities.  Please keep in mind that nobody sets out to write a PDF reader that has vulnerabilities (or, well, maybe.  Hello, Adobe!); they are there by accident.  Producing a wide range of reader implementations just means that the attacker needs to make some guesses, and you may have introduced more severe bugs in the process.

> Or for C, to implement a spurious amount of C compilers.  That is
> impractical because C is complex.  What might be practical might be
> to write one new C compiler (guaranteed clean, since it's new)

I hope by now you understand why I think this to be nonsense.  It's still a C compiler.  You know that, I know that.  Why wouldn't the machine be able to tell?

> , and verify GCC with that[*].  What's more useful in the long term
> is to define a much simpler language, and base our everything on
> that instead of on C.  Then we can verify our toolchain more often,
> since we needn't write a whole freaking C compiler every time. :-)

> [*] Compile version X of GCC with the Clean C Compiler™, compile
> version X of GCC with some GCC binary, then compile some GCC with
> both GCC-X binaries, one of which is guaranteed clean, to see if
> they behave the same while compiling GCC; thus we get both a
> guaranteed clean GCC, and we get to know whether the other was
> compromised in the first place, or whether we wasted a lot of time.
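
For concreteness, here is roughly the procedure you're describing, as I read it.  A sketch only: the paths are hypothetical placeholders of mine, it's in Python for brevity, and it assumes bit-for-bit reproducible builds, which real GCC builds don't give you without extra work.  (It's essentially David A. Wheeler's diverse double-compiling.)

    #!/usr/bin/env python3
    # Sketch of the footnote's cross-check.  All paths are hypothetical.
    import hashlib
    import os
    import subprocess

    GCC_SRC = "/src/gcc-X"             # source tree of GCC version X
    CLEAN_CC = "/opt/clean-cc/bin/cc"  # the new, presumed-clean C compiler
    SUSPECT = "/usr/bin/gcc"           # the possibly compromised GCC binary

    def build_gcc(with_compiler, out):
        """Build the GCC-X sources using `with_compiler` in a fresh
        build directory `out`; return the installed gcc driver's path."""
        os.makedirs(out, exist_ok=True)
        subprocess.run([os.path.join(GCC_SRC, "configure"),
                        "CC=" + with_compiler, "--prefix=" + out],
                       cwd=out, check=True)
        subprocess.run(["make"], cwd=out, check=True)
        subprocess.run(["make", "install"], cwd=out, check=True)
        return os.path.join(out, "bin", "gcc")

    def digest(path):
        """SHA-256 of a file's contents."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Stage 1: the same GCC-X sources, built by each starting compiler.
    clean1 = build_gcc(CLEAN_CC, "/tmp/stage1-clean")
    susp1 = build_gcc(SUSPECT, "/tmp/stage1-suspect")

    # Stage 2: each stage-1 GCC rebuilds GCC-X.  If SUSPECT compiled
    # the sources honestly, both stage-1 binaries implement the same
    # compiler, so their stage-2 outputs should match bit for bit.
    # (A full check would compare every installed artifact, not just
    # the driver binary.)
    clean2 = build_gcc(clean1, "/tmp/stage2-clean")
    susp2 = build_gcc(susp1, "/tmp/stage2-suspect")

    if digest(clean2) == digest(susp2):
        print("stage-2 gccs match: SUSPECT compiled GCC-X honestly")
    else:
        print("mismatch: SUSPECT compromised, or the build is "
              "just nondeterministic")

And note where my objection bites: this only compares outputs.  A hack doing the semantic recognition I described above treats your Clean C Compiler™ as just another C compiler, infects both lineages, and the comparison comes up clean anyway.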

> I think Ian is doing himself a disservice by using many fancy
> abstract terms and, if I may be blunt, acting in some
> tinfoil-hat-like ways, when in fact the underlying ideas are
> relatively sane and simple.  It's all there if you read a little
> between the lines, and a little around the parts written in
> all-caps, like in:
>
> http://lists.gnu.org/archive/html/guile-devel/2014-09/msg00035.html

This assumption that I haven't been reading everything is unusual and I'm not sure where it's coming from.  I guess Ian didn't manage to infect me?  Maybe I wrote my own PDF viewer, maybe I use out-of-date software, or maybe I just read the PostScript in my terminal.

The problem that I have with this discussion is that rather than address many* of the points in my email, he instead resorted to insisting I was a shill.  It's very difficult to have a conversation with someone who uses silencing tactics.  Or rather: it's pointless.

If it were true, then you all lost a long time ago, I'm sorry.  So I perpetrated a Thompson hack on a large, well-known free software codebase while I was in primary school and without regular computer access, and nobody noticed?  And now he doesn't realise I'm editing his emails and PDFs in-flight to make him sound arrogant and clueless?

To be fair, I have seen similar levels of gaslighting before.  It's safe to always assume /something/ is up.  But throwing around that sort of allegation just because someone pokes holes in your arguments is cause enough for me to dismiss you as a nut, imho.

I don't have an issue with fancy abstract terms, but I do see clearly how the proposition only solves a narrow subclass of the problems, and specifically a subclass that would be better addressed with static analysis and ideally types at the assembly level.  Show me a Thompson attack that this 'semantic fixpoint' can work around, and I'll show you a crack that can work around any similar 'semantic fixpoint'.
* Well, he responded to the point about hardware.  If people want to know why he's wrong, I can elaborate.  But by that point, I hoped people could see for themselves that there's no reason to talk any further.

--
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered under copyright law.  You absolutely MAY reproduce any part of it in accordance with the copyright law of the nation you are reading this in.  Any attempt to DENY YOU THOSE RIGHTS would be illegal without prior contractual agreement.
