## What is the problem?

I think that the problem must now already be costing billions of dollars on a yearly basis. This is what tends to happen. Say that we have the following template statement in SQL:

select * from T where f1 = ‘{value1}’ and f2 = {value2}

Now we want to expand it with the following values en provenance from user input:

value1=hello

value2=5

We obtain the following SQL statement that we will submit to the database:

select * from T where f1=’hello’ and f2=5

The problem is that the attacker could construct the following user input:

value1=anything’ or 1=1 or f1=’whatever

value2=5

After expansion, the resulting statement looks like this:

select * from T where f1=’anything’ or 1=1 or f1=’whatever’ and f2=5

The attacker will effectively have managed to change the semantics of the template statement. This problem does not only occur just with SQL, but with any language that ordinarily gets templated, such as HTML and shell scripts.

## Description of a general solution

SQL is an arbitrary but consistent axiomatization in which the tokens and derivation rules act its axiomatic basis. The key word here is “arbitrary”. There are an infinite number of axiomatizations that are homeomorphic with SQL. Every valid statement in such arbitrary homeomorphism can be mapped to exactly one valid statement in SQL and vice versa. If the statement is invalid in such arbitrary homeomorphism it will also be invalid in SQL. It will not be possible for the attacker to produce an injection that satisfies the rules of any possible, arbitrary homeomorphism with SQL.

## Strategy 1: Expand the template statement in another grammar with different derivation rules

### Example 1: Prefix language

SQL uses an infix notation. The infix notation is homeomorphic with possible lisp-style prefix notations. Infix to prefix:

a OP1 b OP2 c <=> (OP1 a (OP2 b c))

With a, b, c identifiers or values, and OP1, OP2 operators or functions.

The example statement in a possible prefix notation:

(select * T (and (= f1 ‘{value1}’) (= f2 {value2})))

The statements are homeomorphic. They are semantically extentional. It is a non-issue to translate automatically SQL infix to this prefix notation and the other way around. However, the attacker’s injection is not valid in the prefix grammar:

(select * T (and (= f1 ‘anything’ or 1=1 or a=’whatever’) (= f2 5)))

This is a syntax error. What the attacker wanted to say, is:

(select * T (or (= f1 ‘anything’) (or (=1 1) (and (= a ‘whatever’) (= f2 5)))))

This is not the same. The attacker’s injection did not produce valid prefix language.

### Example 2: Euler notation

Another alternative would be full Euler notation. Infix to Euler:

a OP1 b OP2 c <=> OP1(a,OP2(b,c))

The example statement would look like:

select( *,T,and(=(f1,'{value1}’),=(f2,{value2})))

Again, the injection would be a syntax error:

select( *,T,and(=(f1,’anything’ or 1=1 or a=’whatever’),=(f2,5)))

What the attacker meant to write is:

select( *,T,or(=(f1,’anything’,or(=(1,1),and(=(a,’whatever’),=(f2,5))))))

What the attacker is doing, is wrong. His injection is not oblivious to the arbitrary notation chosen.

### Example 2: object notation

Yet another alternative would be full object notation. Infix to object:

a OP1 b OP2 c <=> a.OP1(b).OP2(c)

The example would look like:

T.where(f1.=(‘{value1}’).and(f2.=({value2})).select(*)

Again, the injection would again turn into a syntax error:

T.where(f1.=(‘anything’ or 1=1 or a=’whatever’).and(f2.=5)).select(*)

I leave it as an exercise to the reader to correct this into what the attacker actually meant to say.

## Strategy 2: pick other arbitrary tokens for SQL

Keywords are always arbitrary tokens in a language. What matters is their place in the derivation rules, not their arbitrary incarnation. You can always replace the existing keywords by other ones, and translate back and forth. Say that we map the following keywords in SQL to a language that we could call ‘arbitrary brainfuck’:

{“**select**“:”iph0ohKi”, “*****“:”ieZoh4xa”, “**from**“:”aeZi5uja”, “**where**“:”OoJ4aX4n”, “**=**“:”eeQu2Zad”, “**(**“:”eiD5aera”,”**)**“:”Soo2uach”, “**or**“:”Ocaig5Es”}

For the sake of the argument, we will also map identifiers to semi-arbitrarily constructed sequences:

**T** <=> @phai1Oa6@**T**@

**hello** <=> @phai1Oa6@**hello**@

With **phai1Oa6** an arbitrarily-picked sequence of characters. In that case, the example statement:

select * from T where f1 = ‘{value1}’ and f2 = {value2}

becomes:

iph0ohKi ieZoh4xa aeZi5uja @phai1Oa6@T@ OoJ4aX4n @phai1Oa6@f1@ eeQu2Zad ‘{value1}’ @phai1Oa6@and@ @phai1Oa6@f2@ eeQu2Zad {value2}

This is valid arbitrary brainfuck language. After injection, we obtain:

iph0ohKi ieZoh4xa aeZi5uja @phai1Oa6@T@ OoJ4aX4n @phai1Oa6@f1@ eeQu2Zad ‘anything‘ or 1=1 or a=’whatever’ @phai1Oa6@and@ @phai1Oa6@f2@ eeQu2Zad 5

As you can see, it contains the tokens ‘or’ and ‘=’ that are not valid in arbitrary brainfuck language. Our grammar says that you must use:

or <=> Ocaig5Es

= <=> eeQu2Zad

These tokens can also not be identifiers, because then they would have to look like:

**or** <=> @phai1Oa6@**or**@

**=** <=> @phai1Oa6@**=**@

In other words, after injection, the statement has simply become invalid and unusable.

## Strategy 3: verify invariants

When you count the tokens in the example template statement:

[1] select [2] * [3] from [4] T [5] where [6] f1 [7] = [8] ‘{value1}’ [9] and [10] f2 [11] = [12] {value2}

You can see 12 tokens. After expansion of the template, this tally must obviously still be 12, but when we look at the expansion caused by the attack:

[1] select [2] * [3] from [4] T [5] where [6] f1 [7] = [8] ‘anything’ [9] or [10] 1 [11] = [12] 1 [13] or [14] a [15] = [16] ‘whatever’ [17] and [18] f2 [19] = [20] 5

Now there are 20 tokens. A violation of this invariant gives away that there is something fishy in the state of Denmark. The same holds true for the representation of the same statement but then in arbitrary, brainfuck language. Template expansion must never lead to a change in the token count.

In fact, you can try to impose other invariants and verify them after expansion. The attacker will have to comply with all of them.

## Conclusion

Some people advocate that programmers should be more careful when expanding their SQL templates. Addressing the SQL injection problem would simply require more care in programming. That approach is obviously not a solution. People will still make errors in validating user input and still end up accepting malicious user input. In other words, the problem cannot be solved fundamentally by all of us just working harder.

The real solution consists in confronting the attacker with arbitrarization of the SQL language itself, and demand that all existing invariants still hold, in accordance with the rules of any arbitrary homeomorphism constructed. This can be done automatically without any intervention of the programmer.

The attacker will have to remain consistent with the rules of an unknown and arbitrary brainfuck grammar. Achieving compliance with a set of unknown rules is an intractable problem. Therefore, the attacker will generally not be able to achieve this.

## One thought on “Solving the SQL injection problem requires another approach”