Entity Framework 6 Code First: what is the best implementation for a base object with 10 child objects?

Another solution would be to implement some kind of CQRS pattern where you have separate databases for writing (command) and reading (query). You could even de-normalize the data in the read database so it is very fast.

Assuming you need at least one normalized model with referential integrity, I think your decision really comes down to Table per Hierarchy (TPH) versus Table per Type (TPT). TPH is reported to have better performance, both by Alex James from the EF team and, more recently, on Microsoft's Data Development site.

Advantages of TPT and why they're not as important as performance:

Greater flexibility, which means the ability to add types without affecting any existing table. Not too much of a concern because EF migrations make it trivial to generate the required SQL to update existing databases without affecting data.

Database validation on account of having fewer nullable fields. Not a massive concern because EF validates data according to the application model. If data is being added by other means, it is not too difficult to run a background script to validate it. Also, TPT and Table per Concrete Type (TPC) are actually worse for validation when it comes to primary keys, because two sub-class tables could potentially contain the same primary key. You are left with the problem of validating that by other means.

Storage space is reduced on account of not needing to store all the null fields. This is a trivial concern, especially if the DBMS has a good strategy for handling 'sparse' columns.

Design and gut-feel. Having one very large table does feel a bit wrong, but that is probably because most db designers have spent many hours normalizing data and drawing ERDs. Having one large table seems to go against the basic principles of database design. This is probably the biggest barrier to TPH. See this article for a particularly impassioned argument.

That article summarizes the core argument against TPH as:

It's not normalized even in a trivial sense, it makes it impossible to enforce integrity on the data, and what's most "awesome:" it is virtually guaranteed to perform badly at a large scale for any non-trivial set of data.

These are mostly wrong. Performance and integrity are addressed above, and TPH does not necessarily mean denormalized. There are just many (nullable) foreign key columns that are self-referential. So we can go on designing and normalizing the data exactly as we would with TPT. In a current database I have many relationships between sub-types and have created an ERD as if it were a TPT inheritance structure. This actually reflects the implementation in code-first Entity Framework. For example, here is my Expenditure class, which inherits from Relationship, which in turn inherits from Content:

using System;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

/// <summary>
/// Inherits from Content: Id, Handle, Description, Parent (is context of expenditure and usually
/// a Project).
/// Inherits from Relationship: Source (the Principal), SourceId, Target (the Supplier), TargetId.
/// </summary>
public class Expenditure : Relationship
{
    // Navigation property to the related Product; ProductId below is its foreign key.
    [Required, InverseProperty("Expenditures"), ForeignKey("ProductId")]
    public Product Product { get; set; }
    public Guid ProductId { get; set; }

    public string Unit { get; set; }
    public double Qty { get; set; }
    public string Currency { get; set; }
    public double TotalCost { get; set; }
}

The InversePropertyAttribute and the ForeignKeyAttribute give EF the information it needs to make the required self-joins within the single table.

The Product type also maps to the same table (also inheriting from Content). Each Product has its own row in the table and rows that contain Expenditures will include data in the ProductId column, which is null for rows containing all other types. So the data is normalized, just placed in a single table.
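To make the inverse side of that relationship concrete, here is a minimal sketch of how the other classes might look. The Content members are a subset of those listed in the comment on the Expenditure class above, and the Expenditures collection name comes from the InverseProperty attribute; the property types, the empty Relationship stub and the virtual collection are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

// Base of the hierarchy. With no [Table] attributes on the subclasses, EF code first
// maps the whole hierarchy to one table (TPH) and adds a Discriminator column.
public class Content
{
    public Guid Id { get; set; }
    public string Handle { get; set; }
    public string Description { get; set; }
}

public class Product : Content
{
    // Inverse end of Expenditure.Product; the name matches InverseProperty("Expenditures").
    public virtual ICollection<Expenditure> Expenditures { get; set; }
}

// Assumed stub: the real Relationship class also carries Source/Target (omitted here).
public class Relationship : Content
{
}

public class Expenditure : Relationship
{
    [Required, InverseProperty("Expenditures"), ForeignKey("ProductId")]
    public Product Product { get; set; }

    // Lives in a nullable column of the shared table: null on rows of other types.
    public Guid ProductId { get; set; }
}
```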

The beauty of using EF code first is we design the database in exactly the same way and we implement it in (almost) exactly the same way regardless of using TPH or TPT. To change the implementation from TPH to TPT we simply need to add an annotation to each sub-class, mapping them to new tables. So, the good news for you is it doesn't really matter which one you choose. Just build it, generate a stack of test data, test it, change strategy, test it again. I reckon you'll find TPH the winner.
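As a rough sketch of that switch, the annotation in question would be the [Table] attribute from System.ComponentModel.DataAnnotations.Schema. The classes below are cut-down stand-ins and the table names are assumptions; remove the two attributes and the same model falls back to the default TPH mapping.

```csharp
using System;
using System.ComponentModel.DataAnnotations.Schema;

// The base class keeps its own table and holds the shared columns.
public class Content
{
    public Guid Id { get; set; }
    public string Handle { get; set; }
}

// TPT: each annotated subclass gets its own table. EF stores only the subclass
// columns there, keyed by the same Id as the matching row in the base table.
[Table("Products")]
public class Product : Content
{
    public string Sku { get; set; }        // hypothetical column
}

[Table("Expenditures")]
public class Expenditure : Content
{
    public double TotalCost { get; set; }
}
```

Because only the mapping changes, the queries and the rest of the application code should not need to change when you swap strategies for a performance test.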


Having experienced similar problems myself, I have a few suggestions. I'm also open to improvements on these suggestions, as it's a complex topic and I don't have it all worked out.

Entity Framework can be very slow when dealing with non-trivial queries on complex entities, i.e. those with multiple levels of child collections. In some performance tests I've tried, it does sit there an awfully long time compiling the query. In theory EF 5 and onwards should cache compiled queries (even if the context gets disposed and re-instantiated) without you having to do anything, but I'm not convinced that this is always the case.

I've read some suggestions that, for a complex database, you should create multiple DbContexts, each covering only a smaller subset of your database entities. If this is practical for you, give it a try! But I imagine there would be maintenance issues with this approach.

1) I know this is obvious but it's worth saying anyway: make sure you have the right foreign keys set up in your database for related entities, as Entity Framework will then keep track of these relationships and be much quicker at generating queries where you need to join using the foreign key.

2) Don't retrieve more than you need. One-size-fits-all methods for getting a complex object are rarely optimal. Say you are getting base objects to show in a list and you only need to display the name and ID of each one. Retrieve only the base object; any navigation properties that aren't specifically needed should not be retrieved.

3) If the child objects are not collections, or they are collections but you only need one item (or an aggregate value such as the count) from them, I would absolutely implement a View in the database and query that instead. It is MUCH quicker. EF doesn't have to do any work; it's all done in the database, which is better equipped for this type of operation.

4) Be careful with .Include(), which goes back to point 2 above. If you are getting a single object plus a child collection property, you are best off not using .Include(): the child collection is then retrieved by a separate query, so you don't get all the base object columns repeated for every row in the child collection.
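Point 2 can be sketched as a projection into a small list-item class. All the names here (BaseObject, ChildObject, BaseObjectListItem, BaseObjectQueries) are made up for illustration, and the method takes an IQueryable so the sketch does not depend on any particular context class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class ChildObject
{
    public int Id { get; set; }
}

public class BaseObject
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public List<ChildObject> Children { get; set; }
}

// Lightweight shape for list screens. It is not an entity type, so a
// LINQ to Entities query is allowed to project straight into it.
public class BaseObjectListItem
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class BaseObjectQueries
{
    // Only Id and Name are selected; Description and Children are never touched.
    public static List<BaseObjectListItem> GetListItems(IQueryable<BaseObject> source)
    {
        return source
            .OrderBy(b => b.Name)
            .Select(b => new BaseObjectListItem { Id = b.Id, Name = b.Name })
            .ToList();
    }
}
```

Called as BaseObjectQueries.GetListItems(db.BaseObjects), where db is your own context, EF should translate the Select into SQL that reads only those two columns rather than materializing whole entities.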

EDIT

Following the comments, here are some further thoughts.

As we are dealing with an inheritance hierarchy, it makes logical sense to have a table for the base class plus separate tables for the additional properties of the inheriting classes. How to make Entity Framework perform well with that layout, though, is still up for debate.

I've used EF (database first) for a similar scenario, though with fewer children, and in that case I didn't use the EF-generated classes as the business objects. The EF objects mapped directly to the DB tables.

I created separate business classes for the base and inheriting classes, and a set of Mappers that would convert to them. A query would look something like

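using System.Collections.Generic;
using System.Linq;

public static List<BaseClass> GetAllItems()
{
  using (var db = new MyDbEntities())
  {
    // Load each sub-type table together with its base row, then map the
    // EF objects to business objects in memory.
    var q1 = db.InheritedClass1.Include("BaseClass").ToList()
       .ConvertAll(x => (BaseClass)InheritedClass1Mapper.MapFromContext(x));
    var q2 = db.InheritedClass2.Include("BaseClass").ToList()
       .ConvertAll(x => (BaseClass)InheritedClass2Mapper.MapFromContext(x));

    // Concat rather than Union: the two lists hold different sub-types,
    // so there are no duplicates to filter out.
    return q1.Concat(q2).ToList();
  }
}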

Not saying this is the best approach, but it might be a starting point? The queries are certainly quick to compile in this case!
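For illustration, one of those mappers might be sketched as below. This is an assumption about its shape, not the actual code: the EF-generated classes are given a Row suffix here purely to keep them apart from the business classes, and the Name and Extra1 properties are invented.

```csharp
using System;

// Stand-ins for the EF-generated (database-first) classes, one per table.
public class BaseClassRow
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class InheritedClass1Row
{
    public int Id { get; set; }
    public string Extra1 { get; set; }
    public BaseClassRow BaseClass { get; set; }   // navigation property filled by Include
}

// Business classes with a real inheritance relationship.
public class BaseClass
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class InheritedClass1 : BaseClass
{
    public string Extra1 { get; set; }
}

public static class InheritedClass1Mapper
{
    // Flattens the sub-type row and its base row into one business object.
    public static InheritedClass1 MapFromContext(InheritedClass1Row row)
    {
        if (row == null) throw new ArgumentNullException(nameof(row));

        return new InheritedClass1
        {
            Id = row.Id,
            Name = row.BaseClass != null ? row.BaseClass.Name : null,
            Extra1 = row.Extra1
        };
    }
}
```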

Comments welcome!


With Table per Hierarchy you end up with only one table, so your CRUD operations will be faster, and this table is abstracted away by your domain layer anyway. The disadvantage is that you lose the ability to use NOT NULL constraints on sub-type columns, so this needs to be handled properly by your business layer to avoid potential data integrity problems. Also, adding or removing entities means that the table changes, but that is manageable too.

With Table per Type, the problem is that the more classes you have in the hierarchy, the slower your CRUD operations become.

All in all, since performance is probably the most important consideration here and you have a lot of classes, I think Table per Hierarchy is the winner in terms of both performance and simplicity.

Also look at this article, more specifically at chapter 7.1.1 (Avoiding TPT in Model First or Code First applications), where they state: "when creating an application using Model First or Code First, you should avoid TPT inheritance for performance concerns."