Detect Trojan Source Attack

What is the Trojan Source Attack

The Trojan Source attack became widely known on November 1st 2021 through the paper Trojan Source: Invisible Vulnerabilities by Nicholas Boucher and Ross Anderson and the subsequent coverage on security and IT news sites (e.g. 1, 2, 3, 4, 5). The vulnerability is also listed as CVE-2021-42574.

The core of the attack is to use Unicode control characters to reorder tokens in source code. These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.

Compilers and interpreters adhere to the logical ordering of source code, not the visual order. The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.

The vulnerability is confirmed for a number of widespread programming languages including C, C++, C#, JavaScript, Java, Rust, Go, and Python. It is suspected that the attack works in most modern languages.

Example

With the hidden Unicode control character in the right spot, what the user sees (visual order) may look like this:

package main

import "fmt"

func main() {
  var accessLevel = "user"
  if accessLevel != "user" { // Check if admin
    fmt.Println("You are an admin.")
  } else {
    fmt.Println("You are a user.")
  }
}

while at the same time, the compiler sees the following (logical order, hidden Unicode characters shown as [U+nnnn]):

package main

import "fmt"

func main() {
  var accessLevel = "user"
  if accessLevel != "user[U+202E][U+2066]// Check if admin[U+2069][U+2066]" {
    fmt.Println("You are an admin.")
  } else {
    fmt.Println("You are a user.")
  }
}

In this case, the visual order reads as: if someone is not a user, they are an admin. The compiled program behaves differently and makes everyone an admin, because no user will ever have the accessLevel user[U+202E][U+2066]// Check if admin[U+2069][U+2066].
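The effect can be reproduced without any hidden characters by writing the control characters as Go escape sequences. The following is a minimal sketch; the names hiddenLiteral and isAdmin are hypothetical and only serve the illustration.

```go
package main

import "fmt"

// hiddenLiteral is the string literal from the logical-order listing,
// with the control characters spelled out as escapes so they stay visible.
const hiddenLiteral = "user\u202E\u2066// Check if admin\u2069\u2066"

// isAdmin (hypothetical helper) mirrors the condition the compiler sees.
func isAdmin(accessLevel string) bool {
	return accessLevel != hiddenLiteral
}

func main() {
	// No realistic access level equals the literal, so both calls report true.
	fmt.Println(isAdmin("user"))
	fmt.Println(isAdmin("guest"))
}
```

Both calls print true, because the plain string "user" is never equal to the longer literal that contains the control characters and the comment text.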

What are Unicode control characters

Unicode text can be bidirectional, that is, it can contain sections that are read from left to right and other sections that are read from right to left (e.g. Arabic or Hebrew). The Unicode Bidirectional Algorithm (Bidi algorithm for short) translates the logical order (the order in which the characters are stored in memory) into the visual order.

Each Unicode character has a type describing its behavior in bidirectional text. The four types are: strong, weak, neutral and explicit formatting.

For the purpose of the attack, the characters of the last category, explicit formatting, are the interesting ones, because they instruct the Bidi algorithm to deviate from its default behavior. This category is further divided into marks, embeddings, isolates and overrides.

The combination of overrides (which force a direction) and isolates (which treat a section as isolated from its surroundings) makes it possible to alter the visual order so that it differs significantly from the logical order, which is what the compiler processes.
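To make these groups concrete, the following sketch maps the explicit formatting characters to their group and rewrites a string into the [U+nnnn] notation used in the example. It relies on the unicode.Bidi_Control table from the Go standard library; the function names category and reveal are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// category returns the group of an explicit formatting character,
// or "" if r is not one of them.
func category(r rune) string {
	switch r {
	case '\u200E', '\u200F', '\u061C': // LRM, RLM, ALM
		return "mark"
	case '\u202A', '\u202B': // LRE, RLE
		return "embedding"
	case '\u202D', '\u202E': // LRO, RLO
		return "override"
	case '\u202C': // PDF, ends an embedding or an override
		return "pop directional formatting"
	case '\u2066', '\u2067', '\u2068': // LRI, RLI, FSI
		return "isolate"
	case '\u2069': // PDI, ends an isolate
		return "pop directional isolate"
	}
	return ""
}

// reveal replaces every bidirectional control character in s
// with a visible [U+nnnn] placeholder.
func reveal(s string) string {
	var b strings.Builder
	for _, r := range s {
		if unicode.Is(unicode.Bidi_Control, r) {
			fmt.Fprintf(&b, "[U+%04X]", r)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	s := "user\u202E\u2066// Check if admin\u2069\u2066"
	fmt.Println(reveal(s))
	for _, r := range s {
		if c := category(r); c != "" {
			fmt.Printf("U+%04X: %s\n", r, c)
		}
	}
}
```

For the string from the example, reveal yields user[U+202E][U+2066]// Check if admin[U+2069][U+2066], and the loop reports one override followed by isolate characters.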

Attack mitigation for Go

There are different approaches to detecting attacks that use the Trojan Source attack vector:

In "the compiler is the wrong place", Russ Cox (Go tech lead) explained why this kind of issue will not be fixed in the Go compiler itself. It boils down to the fact that it is impossible for a compiler to tell whether some code is good or bad.

The better way to detect these kinds of attacks is to perform proper code review and to use tools that support humans in spotting suspicious Unicode. A lot has been done since the public release of the paper about the Trojan Source vulnerability. For example, GitHub provides hints about source code files that contain hidden Unicode characters. Editors such as Visual Studio Code have also been improved to make these Unicode characters visible.

github.com warning about hidden Unicode characters

Visual Studio Code shows hidden Unicode characters

Even with all these improvements, there is (at least) one option remaining: a linter that checks for hidden Unicode characters at development time or in the continuous integration (CI) system.

This is where bidichk comes in.

bidichk

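bidichk is a small linter for Go source code that warns about occurrences of hidden Unicode characters. By default it considers the following Unicode characters as dangerous:

LEFT-TO-RIGHT-EMBEDDING (U+202A)
RIGHT-TO-LEFT-EMBEDDING (U+202B)
POP-DIRECTIONAL-FORMATTING (U+202C)
LEFT-TO-RIGHT-OVERRIDE (U+202D)
RIGHT-TO-LEFT-OVERRIDE (U+202E)
LEFT-TO-RIGHT-ISOLATE (U+2066)
RIGHT-TO-LEFT-ISOLATE (U+2067)
FIRST-STRONG-ISOLATE (U+2068)
POP-DIRECTIONAL-ISOLATE (U+2069)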

bidichk is also integrated in golangci-lint, so the easiest way to take advantage of bidichk is to enable it in your .golangci.yml file.
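A minimal configuration fragment could look like the following. It only shows the entry that enables the linter; depending on the golangci-lint version in use, the configuration file may require additional top-level fields, so treat this as a sketch rather than a complete file.

```yaml
# .golangci.yml (fragment)
linters:
  enable:
    - bidichk
```

With this in place, a regular golangci-lint run reports hidden Unicode characters alongside the findings of the other enabled linters.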

Bärner Go Meetup

This topic was part of the talk Detect Trojan Source Attack at the Bärner Go Meetup on December 7th 2021.

The slides can be found here.