Storing HTML Documents in Elasticsearch

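A complete test case for Elasticsearch versions before 7, which still use mapping types. The custom analyzer runs the html_strip character filter before tokenizing, so only the text between the tags is indexed: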

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "html": {
          "type": "text",
          "analyzer": "htmlStripAnalyzer"
        }
      }
    }
  }
}

POST /test/test/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/test/2
{
  "html": "<span>whatever</span>"
}
POST /test/test/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}

Update for Elasticsearch 7 and later, where mapping types were removed:

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "html": {
        "type": "text",
        "analyzer": "htmlStripAnalyzer"
      }
    }
  }
}

POST /test/_doc/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/_doc/2
{
  "html": "<span>whatever</span>"
}
POST /test/_doc/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}
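
To see why the three searches above behave differently, here is a rough Python approximation of the analysis chain (strip the tags, split into words, lowercase). It is only a sketch: it is not how Elasticsearch implements html_strip or the standard tokenizer, and the names TagStripper, approximate_tokens and matching_ids are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# The three test documents indexed above, keyed by document ID.
DOCS = {
    1: "<td><tr>span<td></tr>",
    2: "<span>whatever</span>",
    3: '<html> <body> <h1 style="font-family: Arial">Test</h1> '
       "<span>More test</span> </body> </html>",
}


class TagStripper(HTMLParser):
    """Keeps only text nodes; tags and attributes are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def approximate_tokens(html):
    """Approximates html_strip + standard tokenizer + lowercase filter."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    text = " ".join(stripper.chunks)
    # \w+ is a crude stand-in for the standard tokenizer's word splitting.
    return [token.lower() for token in re.findall(r"\w+", text)]


def matching_ids(term):
    """IDs of the documents whose approximate token list contains the term."""
    return [doc_id for doc_id, html in DOCS.items()
            if term.lower() in approximate_tokens(html)]


print(matching_ids("span"))  # [1]
print(matching_ids("body"))  # []
print(matching_ids("more"))  # [3]
```

In this approximation, "span" only matches document 1, where it appears as text rather than as a tag name; "body" matches nothing because it only ever occurs as a tag; and "more" matches document 3 thanks to lowercasing.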

By default, Elasticsearch dynamically adds any new field it encounters during indexing. As the dynamic mapping documentation puts it:

When Elasticsearch encounters a previously unknown field in a document, it uses dynamic mapping to determine the datatype for the field and automatically adds the new field to the type mapping.

To disable this behavior, the simplest option is to set dynamic to false (new fields are not added to the mapping automatically) or to strict (an exception is thrown and the document is not indexed). In that case, you need to explicitly write the mapping for the tags you want to keep inside your _tags section, and pre-parse the HTML document so that only the tags you are interested in are sent to Elasticsearch.

So let's say you have the following HTML document:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>A simple example</title>
</head>
<body>
  <div>
    <p><span class="ref">A sentence I want to reference from this HTML document</span></p>
    <p><span class="">Something less important</span></p>
  </div>
</body>
</html>

The first thing you need is a static mapping in Elasticsearch. I would do the following (assuming ref is a string; the string type used here is Elasticsearch 2.x syntax, which 5.x replaced with text and keyword):

PUT html
{
  "mappings": {
    "test": {
      "dynamic": "strict",
      "properties": {
        "ref": {
          "type": "string"
        }
      }
    }
  }
}

Now if you try adding a document this way, it will succeed:

PUT html/test/1
{
  "ref": "A sentence I want to reference from this HTML document"
}

But this won't succeed, because some_field is not declared in the strict mapping:

PUT html/test/2
{
  "ref": "A sentence I want to reference from this HTML document",
  "some_field": "Some field"
}

Now the only thing remaining is to parse the HTML to retrieve the "ref" field and build the indexing request shown above (use whatever language you like: Java, Python...).
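
As one possible way to do that parsing step, here is a minimal Python sketch using only the standard library. It assumes the text you want lives in a span with class "ref", as in the sample document above; RefExtractor and build_document are hypothetical names, and actually sending the body to Elasticsearch is left to whatever client you use.

```python
import json
from html.parser import HTMLParser

SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>A simple example</title></head>
<body>
  <div>
    <p><span class="ref">A sentence I want to reference from this HTML document</span></p>
    <p><span class="">Something less important</span></p>
  </div>
</body>
</html>"""


class RefExtractor(HTMLParser):
    """Collects the text of every <span> whose class list contains "ref"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.refs = []
        self._depth = 0   # > 0 while inside a matching span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside a matching span
        elif "ref" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.refs.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def build_document(html):
    """Returns a body containing only the mapped "ref" field, or None."""
    parser = RefExtractor()
    parser.feed(html)
    parser.close()
    return {"ref": parser.refs[0]} if parser.refs else None


# Body to send with: PUT html/test/1
print(json.dumps(build_document(SAMPLE)))
# {"ref": "A sentence I want to reference from this HTML document"}
```

Because the body only ever contains the ref key, it stays compatible with the strict mapping: nothing else from the HTML reaches Elasticsearch.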

Edit: Actually, to store the HTML without indexing it, you simply need to set index to no in your mapping (this is the Elasticsearch 2.x syntax):

"_tags": {
  "type": "nested",
  "dynamic": true,
  "index": "no"
}