Tagging of dynamically added Fields in Elasticsearch

At work we use the ELK Stack (Elasticsearch, Logstash and Kibana) to process, store and visualize all kinds of log data. To get the most out of the information stored in Elasticsearch, we maintain a handcrafted Elasticsearch mapping.

As we continuously add more and more log sources, every now and then our Elasticsearch mapping is incomplete. In these cases the dynamic mapping feature of Elasticsearch adds the new fields on its own.

As we want to keep our Elasticsearch mapping up to date, we have been looking for an easy way to identify all dynamically added fields.

Naive approach

The naive approach to finding the differences between our handcrafted mapping and the actual mapping used by Elasticsearch for the respective index (possibly including dynamically added fields) would be a simple diff of two text files.

The handcrafted mapping is stored as a simple JSON file, which Logstash uses to apply the mapping when a new index is created.

The active mapping is accessible via the Elasticsearch API, for example with curl:

curl -XGET 'http://localhost:9200/logstash-2015.10.28/_mapping'

The problem is that these two files are not directly comparable, for the following reasons:

- The formatting and ordering of the two JSON documents differ.
- The handcrafted mapping file explicitly includes field settings that are set to their default values, while Elasticsearch omits such settings from the mapping it returns.

So obviously a naive diff of the two files does not achieve the goal, even if we ensure the same formatting for both files with a tool like jq. Even JSON-aware diff tools like json-delta do not produce the right result, as they cannot determine which field settings in the handcrafted mapping file are merely set to their default values.
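To illustrate the problem, here is a minimal Python sketch (the field settings shown are illustrative, not our real mapping): even a purely structural comparison flags a field as different when the handcrafted file spells out a default value that Elasticsearch omits.

```python
# Sketch: why even a normalized structural diff of the two mappings misleads.
# The handcrafted mapping spells out settings that Elasticsearch leaves at
# their defaults and therefore omits from the mapping it returns.
handcrafted = {"message": {"type": "string", "index": "analyzed"}}
live = {"message": {"type": "string"}}  # default "index" setting omitted

def diff_fields(a, b):
    """Return the names of fields whose settings differ between two mappings."""
    return sorted(f for f in a.keys() | b.keys() if a.get(f) != b.get(f))

print(diff_fields(handcrafted, live))  # reports "message" although both are equivalent
```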

Dynamic field template and custom field analyzer

So we looked for another way to achieve the goal and came up with the following solution.

Elasticsearch provides two features which, combined, offer an easy way to identify all dynamically added fields:

- dynamic templates, which control the mapping applied to dynamically added fields
- custom analyzers, which can be attached to a field's mapping

First, we define a custom analyzer called unknown_field_analyzer:

{
  "template" : "logstash-*",
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "unknown_field_analyzer" : {
          "type" : "whitespace"
        }
      }
    }
  }
...
}

(snippet of the analyzer section of the Elasticsearch template)

We then apply this analyzer in the definition of our dynamic templates:

{
  "dynamic_templates" : [
    {
      "string_fields" : {
        "mapping" : {
          "index" : "analyzed",
          "omit_norms" : true,
          "analyzer" : "unknown_field_analyzer",
          "type" : "string"
        },
        "match_mapping_type" : "string",
        "match" : "*"
      }
    }
  ]
}

(snippet of the dynamic_templates section of the Elasticsearch mapping)

This leads to a mapping in which all dynamically added fields are unambiguously identified by their analyzer setting.
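After indexing a document with a previously unmapped string field, the live mapping will then contain an entry along these lines (the field name newfield is hypothetical):

```json
"newfield" : {
  "type" : "string",
  "index" : "analyzed",
  "omit_norms" : true,
  "analyzer" : "unknown_field_analyzer"
}
```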

Now it is easy to list all dynamically added fields with a small script, using the already mentioned tool jq (version 1.5 or later is required):

curl -XGET 'http://localhost:9200/logstash-2015.10.28/_mapping' 2>/dev/null | \
  jq '.[].mappings | to_entries | .[] | .key as $mapping
      | .value.properties | to_entries | .[] | .key as $property
      | if .value.analyzer? == "unknown_field_analyzer"
        then "mapping: " + $mapping + ", property: " + $property
        else empty end'

This script returns the following example output:

"mapping: mylog, property: newfield"
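For environments without jq 1.5, the same extraction can be sketched in Python; the sample _mapping response below is illustrative, not real output from our cluster.

```python
import json

def find_dynamic_fields(response):
    """Yield (mapping, property) pairs tagged by the unknown_field_analyzer."""
    for index in response.values():
        for mapping, body in index["mappings"].items():
            for prop, settings in body.get("properties", {}).items():
                if settings.get("analyzer") == "unknown_field_analyzer":
                    yield mapping, prop

# Sample _mapping response as returned by the Elasticsearch API (illustrative).
response = json.loads('''{
  "logstash-2015.10.28": {"mappings": {"mylog": {"properties": {
    "known": {"type": "string", "index": "not_analyzed"},
    "newfield": {"type": "string", "analyzer": "unknown_field_analyzer"}
  }}}}
}''')

for mapping, prop in find_dynamic_fields(response):
    print("mapping: %s, property: %s" % (mapping, prop))
```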

Based on the idea behind this script, a monitoring system could easily alert when new fields have been dynamically added.
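A minimal check along these lines could look as follows; the exit codes follow the common Nagios convention, and treating any dynamically added field as a warning is an assumption, not part of the setup described above.

```python
def check_dynamic_fields(fields):
    """Return a Nagios-style (exit_code, message) tuple for a list of
    dynamically added (mapping, property) pairs."""
    if fields:
        names = ", ".join("%s.%s" % (m, p) for m, p in fields)
        return 1, "WARNING - dynamically added fields: " + names
    return 0, "OK - mapping is complete"

code, message = check_dynamic_fields([("mylog", "newfield")])
print(message)
# sys.exit(code) would hand the result over to the monitoring system
```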