Posts Tagged "sql"

del.icio.us schema

Sunday, April 24th, 2005

Just read this discussion about possible del.icio.us database schema solutions. Interesting postulation. Here’s how I’d do it, and in fact, that’s how I set it up for tags in my blog’s previous incarnation.

I would have three principle tables, much like the completely normalized version in Philipp Keller’s blog. Of course, some modifications would need to be made to normalize this schema further to make it efficient for multiple users bookmarking the same URLs, etc. However, that’s a topic for another day. Here’s the MySQL-compatible table structures:

Items
CREATE TABLE Items (
item_id int NOT NULL AUTO_INCREMENT,
item_description varchar(255) NOT NULL,
item_url varchar(255) NOT NULL,
item_extended varchar(255) NOT NULL,
item_tags varchar(255) NOT NULL,
PRIMARY KEY(item_id),
KEY(item_url)
)
Items_x_Tags
CREATE TABLE Items_x_Tags (
item_id int NOT NULL,
tag_id int NOT NULL,
PRIMARY KEY(x_item_id,x_tag_id)
)
Tags
CREATE TABLE Tags (
tag_id int NOT NULL AUTO_INCREMENT,
tag_name varchar(255) NOT NULL,
tag_item_count int NOT NULL,
PRIMARY KEY(tag_id),
KEY(tag_name),
KEY(tag_item_count)
)

Here are some design decisions that were involved in this design. First of all, we want to be able to run statistics on tags, such as count the number of times they’ve been mentioned, etc. That’s why the Tags table is separate, assigning a unique integer identifier to each tag as a primary key. We’ll keep a count of the number of times that tag is used in this table as well, so it’ll be easy to figure out which tags are most-used. For multiple-user situations, we’d want to do the same thing with a table that has URLs, their hash and a count of the number of times that URL has been bookmarked. Then the user’s private URL name, tags, etc will just link to this table keeping track of hashes/counts for each URL.

When we add new Items to the database, we clean up the plain text string of tags the user submits, then check to see if each tag already exists (and get its tag_id, incrementing the tag_item_count). Otherwise we create a new Tags record with a count of one and take the newly created tag_id and use that. For each tag the user supplied, we insert a record into the Items_x_Tags with the newly inserted Item’s item_id. Note that there’s a field in the Items table that keeps the tag list in its character form. Why? First of all, its nice to have the original tag order. Second, this way we can display items without lots of joins, saving on database thrashing. Yes, its not 100% normalized. But then again, integer primary keys are excessive as well as far as normalization goes when the URL would serve as a good primary key for instance. I prefer to take a mid-line approach to normalization – just enough to make it practical, fast and well-suited for our purpose.

When modifying entries, we just unlink all tags, decrementing counts, then start over with the same procedure as for new items – link them again through the map, incrementing counts.

Ok, so now we have those three tables, populated with our tags. Lets try to design the querries that Philipp Keller’s blog entry mentions – Intersection, Union and Minus.

Intersection (AND)
SELECT *
FROM (Items AS i INNER JOIN Items_x_Tags AS x ON i.item_id=x.item_id)
INNER JOIN Tags AS t ON x.tag_id=t.tag_id
WHERE t.tag_name IN (’bookmark’, ‘webservice’, ’semweb’)
GROUP BY i.id
HAVING COUNT(i.id)
Union (OR)
SELECT *
FROM (Items AS i INNER JOIN Items_x_Tags AS x ON i.item_id=x.item_id)
INNER JOIN Tags AS t ON x.tag_id=t.tag_id
WHERE t.tag_name IN (’bookmark’, ‘webservice’, ’semweb’)
GROUP BY i.i
Minus
SELECT *
FROM ((Items AS i INNER JOIN Items_x_Tags AS x ON i.item_id=x.item_id)
INNER JOIN Tags AS t ON x.tag_id=t.tag_id)
INNER JOIN Tags AS m ON x.tag_id=m.tag_id)
WHERE t.tag_name IN (’bookmark’, ‘webservice’, ’semweb’) AND NOT (m.tag_name IN (’semweb’))
GROUP BY i.id

Haven’t tested the Minus query, but in theory it should work – do a four-table join, since its an INNER JOIN, we’ll have only Items that have ‘bookmark’ and ‘webservice’ tags, but that don’t have the tab ’semweb’. This might take some debugging, but in theory that’s the way I’d approach that kind of Minus query.

Here’s some extra reading: Philipp’s original entry, some code snippets that deal with tags.