How to store articles or other large texts in a database

Store everything in one big text field, as Alex suggested. For searching, don't hammer your database; use Lucene or htdig to build an index of your rendered pages. That keeps searches very fast. As a side effect, your pages become a little more search-engine friendly: take your keywords field (as backslash suggested) and put its contents in the meta keywords tag.

Edit

Unless you are searching only keywords, having the db do the searches will be horribly slow (ever searched a forum and had it take FOREVER?). An ordinary index cannot help the database with a leading-wildcard query such as

  select.. where FULLTEXTFIELD like '%cookies%'.  
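To see this for yourself, here is a minimal sketch using SQLite from Python. The table and index names are made up for illustration, and other engines word their query plans differently, but the pattern is typically the same: an exact match can seek into the index, while the leading-wildcard LIKE has to visit every row.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

# Hypothetical schema: one indexed text column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_articles_body ON articles (body)")

# Exact match: the planner can seek directly into the index.
exact_plan = query_plan(
    conn, "SELECT id FROM articles WHERE body = ?", ("cookies",)
)

# Leading wildcard: no seek is possible, so every entry is visited.
wildcard_plan = query_plan(
    conn, "SELECT id FROM articles WHERE body LIKE ?", ("%cookies%",)
)

print(exact_plan)
print(wildcard_plan)
```

The first plan should start with SEARCH and the second with SCAN; the exact wording after that varies between SQLite versions.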

It is frustrating to look for an article and have the search miss it because the terms you typed weren't in the keyword field! Htdig lets you search the full text of the article efficiently. Your searches will come back almost instantly, and EVERY term in the article is searchable. Putting the keywords in the meta tags will make matches on those terms rank higher on the results page.
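The reason an external indexer is fast is that it builds an inverted index: a map from each term to the documents that contain it. This is only a toy sketch of that idea in Python (the function names and sample articles are invented, and real indexers like Lucene or htdig do far more), but it shows why a lookup doesn't need to scan every article.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(articles):
    """Map each term to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, body in articles.items():
        for term in tokenize(body):
            index[term].add(article_id)
    return index

def search(index, query):
    """Return the ids of articles that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up sample data standing in for rows from the articles table.
articles = {
    1: "How to bake chocolate chip cookies",
    2: "Browser cookies and session storage",
    3: "Storing large articles in a database",
}
index = build_index(articles)
print(search(index, "cookies"))
print(search(index, "browser cookies"))
```

Each search is a handful of dictionary lookups and set intersections, no matter how long the articles are.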

Another benefit is fuzzy matching. If you search for 'activate', htdig will match pages that contain active, activation, activity, etc. (this is configurable). And if the user misspells a word, it can still be matched. You want your users to have a Google-like experience, not an annoying one. :)
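To give a feel for what fuzzy matching means, here is a rough Python sketch using only the standard library. It is not htdig's algorithm: the shared-prefix rule and the similarity cutoff are assumptions I picked for illustration, and `fuzzy_terms` is a hypothetical helper.

```python
import difflib

def fuzzy_terms(term, vocabulary, prefix_len=5, cutoff=0.8):
    """Expand a search term into related words from the index vocabulary.

    Two crude rules are combined:
    - words that share the first `prefix_len` characters (a naive stand-in
      for stemming), and
    - words that are close in spelling (to catch typos).
    """
    term = term.lower()
    stem = term[:prefix_len]
    related = {word for word in vocabulary if word.startswith(stem)}
    close = set(difflib.get_close_matches(term, vocabulary, n=5, cutoff=cutoff))
    return related | close

# Made-up vocabulary, standing in for the terms found in your pages.
vocabulary = ["active", "activation", "activity", "cookies", "database"]

print(sorted(fuzzy_terms("activate", vocabulary)))
print(sorted(fuzzy_terms("coookies", vocabulary)))
```

A real search engine would then look up every expanded term in its index and merge the results, usually ranking exact matches first.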

You do need a script to create a list of links to all your pages from your database. Have htdig crawl this automatically and you never have to think about it again.
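A script like that can be very small. Here is a sketch in Python with SQLite; the `articles` table, its columns, and the `/article/<id>` URL pattern are all assumptions, so adjust them to match your site.

```python
import sqlite3
from html import escape

def build_link_page(conn, base_url):
    """Render an HTML page that links to every article, for the crawler."""
    rows = conn.execute("SELECT id, title FROM articles ORDER BY id").fetchall()
    items = [
        f'<li><a href="{base_url}/article/{article_id}">{escape(title)}</a></li>'
        for article_id, title in rows
    ]
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"

# Made-up sample data; in practice you would connect to your real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title) VALUES (?, ?)",
    [(1, "Cookies & Cream"), (2, "Storing large texts")],
)

page = build_link_page(conn, "https://example.com")
print(page)
```

Write the result to a file under your web root (or serve it from a URL) and use that page as the crawler's starting point.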

Also, htdig will crawl your non-database pages as well, so your whole site is searchable through the same simple interface.

As for the keyword field, you should have a separate table called keywords with the id of the article and a keyword field (one keyword per row). But for simplicity, having a single field in the db isn't a terrible idea; it makes updating the keywords pretty easy if you put it in a form.
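Here is roughly what the separate-table version could look like, sketched with SQLite from Python. The column names and the two helper functions are hypothetical. Note that you can still edit the keywords as a single comma-separated form field and split it on save.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE keywords (
    article_id INTEGER NOT NULL REFERENCES articles(id),
    keyword    TEXT NOT NULL,
    PRIMARY KEY (article_id, keyword)
);
CREATE INDEX idx_keywords_keyword ON keywords (keyword);
""")

def set_keywords(conn, article_id, form_value):
    """Replace an article's keywords from a comma-separated form field."""
    words = sorted({w.strip().lower() for w in form_value.split(",") if w.strip()})
    conn.execute("DELETE FROM keywords WHERE article_id = ?", (article_id,))
    conn.executemany(
        "INSERT INTO keywords (article_id, keyword) VALUES (?, ?)",
        [(article_id, word) for word in words],
    )

def articles_for_keyword(conn, keyword):
    """Return the ids of articles tagged with the keyword (an indexed lookup)."""
    rows = conn.execute(
        "SELECT a.id FROM articles a "
        "JOIN keywords k ON k.article_id = a.id "
        "WHERE k.keyword = ? ORDER BY a.id",
        (keyword.lower(),),
    )
    return [row[0] for row in rows]

# Made-up sample data.
conn.execute("INSERT INTO articles VALUES (1, 'Baking', 'Long text...')")
conn.execute("INSERT INTO articles VALUES (2, 'Sessions', 'Long text...')")
set_keywords(conn, 1, "cookies, baking")
set_keywords(conn, 2, "Cookies, browser, ")

print(articles_for_keyword(conn, "cookies"))
```

Because the lookup is an exact match on an indexed column, the database can answer it without scanning the article text at all.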

If you don't want the hassle of all that, you can try Google Custom Search. It is far less work, but you have no guarantee that all your pages will get indexed.

Good luck!


Depending on how you have arranged and installed everything, it can be hard to access outside files from remote clients that can access the DB just fine -- so why not save all of the XML into one TEXT field instead? You can refactor things to optimize that later if the DB engine can't handle that load well, but that's the easiest way to get started.
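As a starting point, something like this would do. It is a sketch in Python with SQLite; the `documents` table and the helper names are made up, and any engine with a TEXT-like column should work the same way.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, xml_body TEXT NOT NULL)"
)

def save_document(conn, xml_text):
    """Store the whole XML document in one TEXT field and return its row id."""
    ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    cursor = conn.execute(
        "INSERT INTO documents (xml_body) VALUES (?)", (xml_text,)
    )
    return cursor.lastrowid

def load_document(conn, doc_id):
    """Fetch the XML text and parse it, or return None if the row is missing."""
    row = conn.execute(
        "SELECT xml_body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return ET.fromstring(row[0]) if row else None

# Made-up sample document.
doc_id = save_document(
    conn, "<article><title>Cookies</title><body>Long text</body></article>"
)
print(load_document(conn, doc_id).findtext("title"))
```

Any client that can reach the database can now read the document, with no shared file system needed.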


The TEXT, MEDIUMTEXT, LONGTEXT and similar data types were created to store large amounts of text (from 64 KB up to 4 GB, depending on the type and the RDBMS). In many engines the row holds only a pointer to the text, which is stored outside the table's regular row data. This is almost the same as storing a file path in a VARCHAR field to locate the document, but keeping the text in the database makes it easier to maintain: when you delete the row, the document disappears with it, with no separate step to delete a file. Naturally this makes your database bigger and sometimes harder to back up and move, but moving the documents one by one would be tedious and slow.

As you can see, it depends on the number of documents and rows in the database.

For searching, I recommend adding a "keywords" field to speed things up. You can also search only the first n characters of each document by casting them to CHAR or VARCHAR; that is enough to find the title and subtitle if they don't already have fields of their own.
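Searching only the first n characters might look like the sketch below. It uses SQLite from Python, where `substr` plays the role of the cast; the table, the helper name, and the default of 200 characters are assumptions. It is still a scan over every row, but each comparison looks at far less text.

```python
import sqlite3

def search_prefix(conn, term, n=200):
    """Find articles whose first n characters contain the term.

    Note: LIKE wildcards (% and _) inside `term` are not escaped here.
    """
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id FROM articles WHERE substr(body, 1, ?) LIKE ? ORDER BY id",
        (n, pattern),
    )
    return [row[0] for row in rows]

# Made-up sample data: article 1 mentions the term up front,
# article 2 only mentions it about 600 characters in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO articles VALUES (1, ?)",
    ("Cookies: a short history. " + "More text. " * 100,),
)
conn.execute(
    "INSERT INTO articles VALUES (2, ?)",
    ("Intro " * 100 + "cookies",),
)

print(search_prefix(conn, "cookies"))
print(search_prefix(conn, "cookies", n=1000))
```

The trade-off is obvious from the sample data: a small n is cheaper but misses terms that only appear deeper in the document.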


Take a quick look at native XML databases. There are several, and some very good ones are free.

Search for eXist, Documentum xDB, and Oracle Berkeley DB XML.

If you are persisting, querying and updating semi-structured text and if the structure has any depth at all, you are almost certainly doing it the hard way if you stick with either the RDB of pointers, or stuff-it-in-a-blob techniques -- though there are many exterior reasons that these architectures can be necessary and successful.

Do a little reading on XPath and XQuery before you commit to a design. Here's a good place to start: https://community.emc.com/community/edn/xmltech
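To get a taste of what path-based querying looks like, here is a small Python sketch. The standard library's ElementTree supports only a limited subset of XPath (and no XQuery), and the document structure below is invented, but the idea carries over to a native XML database.

```python
import xml.etree.ElementTree as ET

# Made-up semi-structured article.
xml_text = """
<article id="42">
  <title>Storing large texts</title>
  <section name="intro"><para>Use a TEXT field.</para></section>
  <section name="search">
    <para>Use an external index.</para>
    <para>Add keywords.</para>
  </section>
</article>
"""
root = ET.fromstring(xml_text)

def section_paragraphs(root, section_name):
    """Return the text of every <para> in the section with the given name."""
    path = f".//section[@name='{section_name}']/para"
    return [para.text for para in root.findall(path)]

print(root.findtext("title"))
print(section_paragraphs(root, "search"))
```

With a blob in a relational table you would have to load and parse the whole document to answer a question like this; a native XML database can evaluate the path expression on the server.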

Tags:

Database

Xml