How to design a database schema to support tagging with categories?

This is yet another variation on the Entity-Attribute-Value design.

A more recognizable EAV table looks like the following:

CREATE TABLE vehicleEAV (
  vid        INTEGER,
  attr_name  VARCHAR(20),
  attr_value VARCHAR(100),
  PRIMARY KEY (vid, attr_name),
  FOREIGN KEY (vid) REFERENCES vehicles (vid)
);

Some people force attr_name to reference a lookup table of predefined attribute names, to limit the chaos.

What you've done is simply spread an EAV table over three tables, but without improving the order of your metadata:

CREATE TABLE vehicleTag (
  vid         INTEGER,
  cid         INTEGER,
  tid         INTEGER,
  PRIMARY KEY (vid, cid),
  FOREIGN KEY (vid) REFERENCES vehicles(vid),
  FOREIGN KEY (cid) REFERENCES categories(cid),
  FOREIGN KEY (tid) REFERENCES tags(tid)
);

CREATE TABLE categories (
  cid        INTEGER PRIMARY KEY,
  category   VARCHAR(20) -- "attr_name"
);

CREATE TABLE tags (
  tid        INTEGER PRIMARY KEY,
  tag        VARCHAR(100) -- "attr_value"
);

If you're going to use the EAV design, you only need the vehicleTags and categories tables.

CREATE TABLE vehicleTag (
  vid         INTEGER,
  cid         INTEGER,     -- reference to "attr_name" lookup table
  tag         VARCHAR(100, -- "attr_value"
  PRIMARY KEY (vid, cid),
  FOREIGN KEY (vid) REFERENCES vehicles(vid),
  FOREIGN KEY (cid) REFERENCES categories(cid)
);

But keep in mind that you're mixing data with metadata. You lose the ability to apply certain constraints to your data model.

  • How can you make one of the categories mandatory (a conventional column uses a NOT NULL constraint)?
  • How can you use SQL data types to validate some of your tag values? You can't, because you're using a long string for every tag value. Is this string long enough for every tag you'll need in the future? You can't tell.
  • How can you constrain some of your tags to a set of permitted values (a conventional table uses a foreign key to a lookup table)? This is your "softtop" vs. "soft top" example. But you can't make a constraint on the tag column because that constraint would apply to all other tag values for other categories. You'd effectively restrict engine size and paint color to "soft top" as well.

SQL databases don't work well with this model. It's extremely difficult to get right, and querying it becomes very complex. If you do continue to use SQL, you will be better off modeling the tables conventionally, with one column per attribute. If you have need to have "subtypes" then define a subordinate table per subtype (Class-Table Inheritance), or else use Single-Table Inheritance. If you have an unlimited variation in the attributes per entity, then use Serialized LOB.

Another technology that is designed for these kinds of fluid, non-relational data models is a Semantic Database, storing data in RDF and queried with SPARQL. One free solution is RDF4J (formerly Sesame).


I needed to solve this exact problem (same general domain and everything — auto parts). I found that the best solution to the problem was to use Lucene/Xapian/Ferret/Sphinx or whichever full-text indexer you prefer. Much better performance than what SQL can offer.


What you describe are not tags, tags are only values, they do not have an associated key. Tags are normally implemented as a string column, the value being a list of values delimited.

For example #1, a tag field would contain a value such as:

"manufacture_Mercedes,model_SLK32 AMG,convertible_hardtop"

The user then would normally be able to easily filter entries, by the existence of one or more tags. It is essentially schemaless data from a database perspective. There are downsides to tags, but they also avoid the extreme complications that come from using an EAV model. If you really need an EAV model, it also might be worth considering an attributes field, which contains JSON data. It's more painful to query, but still not as horrible as querying EAV across multiple tables.