Values larger than 1/3 of a buffer page cannot be indexed

You have a UNIQUE index on (content, ref_id), called editor_contentmodel_content_2192f49c_uniq:

"editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id)

I'm not sure why this is there to begin with, so let's step back and look at what it does. It makes sure that the combination of content and ref_id is unique. However, in PostgreSQL the UNIQUE constraint is implemented with a btree, which makes this a poor solution: you're building a btree over content, which essentially duplicates this small table and makes for a gigantic index. A gigantic index that is still limited in row size, though -- as you've found. It raises a few questions:

  • Do you care that content is unique? If you do care that content is unique for ref_id, then what you probably want is to store the hash of the content. Something like..

    CREATE TABLE foo ( ref_id int, content text );
    CREATE UNIQUE INDEX ON foo (ref_id,md5(content));
    

    This will instead store the md5 of content in the btree. So long as the content for a given ref_id hashes to a unique md5 (i.e., no md5 collisions within that ref_id), you're good.

  • If you don't care that content is unique, consider removing the constraint entirely.

It may be worth noting that when you implement a UNIQUE constraint with a btree (as PostgreSQL does), you get a usable index for free. Under normal circumstances this has a fringe benefit:

CREATE TABLE foo ( ref_id int, content text );
CREATE UNIQUE INDEX ON foo (ref_id,content);

Will speed up the query

SELECT *
FROM foo
WHERE ref_id = 5
  AND content = 'This content'

However, when you change it to use the functional md5() variant there is no longer an index on content itself, so to make use of that index you'll have to do one of the following:

  1. Only query on ref_id, or
  2. Add to the ref_id condition a clause of the form md5(content) = md5('This content'), as shown below.
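
Against the hypothetical foo table and its (ref_id, md5(content)) index from above, the second form would look something like this:

SELECT *
FROM foo
WHERE ref_id = 5
  AND md5(content) = md5('This content');

Because the md5(content) expression in the WHERE clause matches the index expression exactly, the planner can use the functional index here.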

The whole text = text equality thing is overrated; that's almost never what you want. If you're looking to speed up query time over text, a btree is pretty useless. You likely want to look into one of the following (a pg_trgm sketch follows the list):

  1. pg_trgm
  2. text_pattern_ops
  3. Full text search (FTS)
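
As a rough sketch of option 1 (assuming the pg_trgm extension can be installed, and that content is still plain text at this point):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ON foo USING gin (content gin_trgm_ops);

SELECT *
FROM foo
WHERE content ILIKE '%this content%';

A GIN trigram index can serve LIKE/ILIKE and similarity searches that a plain btree cannot help with.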

UPDATE 1

Based on your JSON, I would suggest storing it as jsonb and then creating the index on md5(content::text); so perhaps rather than the above, run this instead:

ALTER TABLE public.editor_contentmodel
  ALTER COLUMN content
  SET DATA TYPE jsonb
  USING content::jsonb;

CREATE UNIQUE INDEX ON editor_contentmodel (ref_id, md5(content::text));
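
A side note on the jsonb conversion, for what it's worth: jsonb normalizes key order and insignificant whitespace, so md5(content::text) treats differently formatted but equivalent JSON documents as the same value. A quick illustration:

SELECT md5('{"b": 2, "a": 1}'::jsonb::text) = md5('{"a":1,"b":2}'::jsonb::text);

Both sides render to the same canonical text form, so this returns true.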

UPDATE 2

You ask which indexes you should remove

gollahalli_me_django=> create unique index on editor_contentmodel (ref_id, md5(content::text));
CREATE INDEX
gollahalli_me_django=> \d editor_contentmodel;
        Table "public.editor_contentmodel"
  Column   |           Type           | Modifiers
-----------+--------------------------+-----------
 ref_id    | character varying(120)   | not null
 content   | jsonb                    | not null
 timestamp | timestamp with time zone | not null
Indexes:
    "editor_contentmodel_pkey" PRIMARY KEY, btree (ref_id)
    "editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id) <---- 1
    "editor_contentmodel_ref_id_md5_idx" UNIQUE, btree (ref_id, md5(content::text))
    "editor_contentmodel_ref_id_8f74b4f3_like" btree (ref_id varchar_pattern_ops) <----2

Here is the surprising answer: you should remove all of them except editor_contentmodel_pkey, which says that all ref_ids need to be unique. (The drop statements are sketched after the list below.)

  1. editor_contentmodel_content_2192f49c_uniq: this index makes sure you're UNIQUE on ref_id AND content, but if you can't have a duplicate ref_id, you can never have a duplicate content for that ref_id either. You can never violate this index without also violating editor_contentmodel_pkey, which makes it pointless.
  2. editor_contentmodel_ref_id_md5_idx: this index is pointless for the same reason. You can never have a duplicate md5(content::text) over ref_id because, regardless of what the value of md5(content::text) is, you can never have a duplicate ref_id.
  3. editor_contentmodel_ref_id_8f74b4f3_like: this one isn't useless, just suboptimal, because it duplicates the index over ref_id. Keep it only if you actually run pattern (LIKE) queries against ref_id that need varchar_pattern_ops; otherwise it's just extra write overhead.
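
If you do go that route, the clean-up would look something like this (names taken from the \d output above; note that the first one is a constraint, so it is dropped through ALTER TABLE rather than DROP INDEX):

ALTER TABLE public.editor_contentmodel
  DROP CONSTRAINT editor_contentmodel_content_2192f49c_uniq;

DROP INDEX public.editor_contentmodel_ref_id_md5_idx;
DROP INDEX public.editor_contentmodel_ref_id_8f74b4f3_like;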

As a last note, we don't much use varchar(n) in PostgreSQL: it's implemented as a varlena with a length check, there is no gain to it, and nothing is lost by simply using text. So unless there is a concrete reason why ref_id may be 120 characters but must never be 121, I would simply use the text type.
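
If you decide to drop the length cap, it's the same kind of ALTER as before; something like:

ALTER TABLE public.editor_contentmodel
  ALTER COLUMN ref_id
  SET DATA TYPE text;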

UPDATE 3

Let's go back to your earlier problem..

psycopg2.OperationalError: index row size 3496 exceeds maximum 2712 for index "editor_contentmodel_content_2192f49c_uniq"

This is telling you the problem is specifically with index "editor_contentmodel_content_2192f49c_uniq". You've defined that as

"editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id)

So the problem here is that you're trying to create an index over content, and the index itself stores the actual JSON value of content; that is what exceeds the limit. This isn't really the issue, though, because even if that limit weren't in place, editor_contentmodel_content_2192f49c_uniq would be totally useless. Why? Again, you can't add more uniqueness to a row that is already guaranteed to be 100% unique. You don't seem to be getting this, so let's keep it simple.

ref_id | content
1      | 1
1      | 1
1      | 2
2      | 1

In the above, a lone unique index/constraint (with no other indexes) over (ref_id, content) makes sense because it would stop the duplication of (1,1). An index over (ref_id, md5(content)) would also make sense because it would stop the duplication of (1,1) by stopping the duplication of (1, md5(1)). However, all of this only works because, in the example I've given, ref_id is NOT guaranteed to be UNIQUE. Your ref_id is not this ref_id: your ref_id is a PRIMARY KEY, which means it is guaranteed to be UNIQUE.

That means the duplicate (1,1) and the row (1,2) could NEVER be inserted. It also means that indexes over anything in addition to ref_id cannot guarantee any more uniqueness; they can only ever be less strict than the constraint you already have. So your table could only look like this:

ref_id | content
1      | 1
2      | 1
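
To make the same point concrete, here is a throwaway sketch (the table and values are hypothetical):

CREATE TEMP TABLE toy (ref_id int PRIMARY KEY, content text);

INSERT INTO toy VALUES (1, '1');  -- ok
INSERT INTO toy VALUES (1, '1');  -- ERROR: duplicate key value violates unique constraint "toy_pkey"
INSERT INTO toy VALUES (1, '2');  -- same ERROR, even though content differs
INSERT INTO toy VALUES (2, '1');  -- ok

The primary key alone already rejects every duplicate ref_id, so there is nothing left for a constraint over (ref_id, content) to reject.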

"editor_contentmodel_pkey" PRIMARY KEY, btree (ref_id) "editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id)

Since ref_id is the primary key, you can't have duplicate values of it. That means the unique constraint on the combination (content, ref_id) is useless, as anything that violated it would also violate the primary key constraint. Just get rid of it.
