Archive for the "Software Engineering" Category

Google Charts API

Saturday, April 25th, 2009

I’ve been looking at the Google Chart API recently. It’s an interesting service, and as far as I can tell, Google really gets nothing at all out of it – I guess the goodwill associated with this kind of service is priceless.  There’s even a PHP object wrapper for the Google Chart API. I suppose it’s too much to ask to get embeddable charts like the ones they use for Google Analytics and Google Finance.

Here’s an example with one of the former slogans for the phdUS Perspectives blog – “Good, Cheap, Fast – pick any two”:
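The whole chart is just a URL you drop into an img tag. Something along these lines would render it as a simple 3D pie chart with the three slices – the chart size, data values and styling here are my own guesses at recreating it, not necessarily what the original image used:

http://chart.apis.google.com/chart?cht=p3&chs=400x150&chd=t:33,33,34&chl=Good|Cheap|Fast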

Maybe I’ll use this API to generate some charts for the Facebook Compulsion Inventory quiz results.

Privacy and Anonymity vs. Authenticated Presence

Thursday, April 23rd, 2009

Martin Zwilling (@startuppro) wrote about opportunities to tackle Internet privacy, and I posed a question to him – “how do you authenticate an online something with an offline entity?”  The reason is that I’ve been thinking about privacy on the Internet as well, but from a slightly different perspective.  I’m not so concerned with things like the seemingly infinite lifetime of something said or posted to websites, or identity theft, or the resale of customer information that lots of shady companies and websites engage in.  Well, I am personally concerned about my own information, of course, but not in the sense of the business problem.  I’m more interested in authenticating actions, statements, uploads and everything else an “offline entity” could do online.  By “offline entity” I mean everything from a teenager whose friends post as him or her when they forget to lock their computer and ruin their reputation – to a public figure or celebrity being impersonated – to a corporate entity like CNN purchasing @CNNbrk (which some people thought was run by CNN in the first place).

Several things have prompted this line of thought for me.  First, a friend of mine asked me how to mitigate negative press about a design brand he runs.  He was asking if it was possible to push things off the first page of a Google search for this brand.  That’s not something that can really be solved, but an authenticated, coherent statement from the brand regarding the bad publicity might actually help mitigate the situation.  There have been several incidents of this type – the Domino’s video prank and its effect on the value of the brand, for instance.

Second, for a recent project I’ve been thinking about how to authenticate somebody who comes to complete and manage an online listing for an offline company.  In this scenario, listings are created for companies, with the initial information populated from an existing public records database.  These companies have a vested interest in expanding their listing and participating on the website.  But how do you invite them to participate – who do you email or send a postcard to?  And once they come to the site, how can you be sure that somebody appropriate at the company got that postcard, and that it’s not a mail clerk from their building signing up to destroy their reputation on this listing site?

I guess what I’m getting at is this – how do you know that something done by me online was really done by me?  Do we take PGP signing to the next level and set up a company that can authenticate a signed post online, based on the assumption that only the real me has my private PGP key?  Do we set up a company that mails secret information to physical addresses and then authenticates an invitation to a website via this shared-secret “token”?  I’m sure there are opportunities to start a company that would do something interesting in this space.  Perhaps it could even tackle the problem Martin is talking about in the process – what if we could submit our private data to websites in an encrypted packet that can only be decrypted by entities we give permission to?  Something that tracks how many times the data has been decrypted and used, for instance – perhaps an intermediary or an escrow sort of service for information?

Any ideas?  It’s hard to resist signing this blog post with my PGP key, but perhaps a WordPress plugin to authenticate posts that way isn’t that hard to make, and maybe that’s a small step I’ll take in this direction sometime soon.

Using MySQL as a Storage Engine

Saturday, February 28th, 2009

I just read about FriendFeed using MySQL as a data storage engine.  It’s a very interesting post that highlights how much people have come to trust MySQL.  FriendFeed basically built their own object storage system inside a single table, and they manually create and update indexes in other tables, in a nicely distributed and sharded database server setup.  Overall it’s what I’d call a “very impressive” implementation.  I can’t help but think back to how BlogLines used (or perhaps still uses) BerkeleyDB to store their data without a traditional RDBMS storage engine.  I wonder if FriendFeed considered a solution like that?
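To make the idea concrete, here’s a rough sketch of the shape of schema their post describes – one table holding opaque serialized objects, plus separate index tables that application code keeps in sync.  The table and column names below are my own guesses for illustration, not FriendFeed’s actual schema:

-- One big table of serialized objects:
CREATE TABLE entities (
    added_id INT NOT NULL AUTO_INCREMENT,   -- insertion order; keeps writes append-friendly
    id CHAR(36) NOT NULL,                   -- the object's own identifier (e.g. a UUID)
    updated TIMESTAMP NOT NULL,
    body MEDIUMBLOB NOT NULL,               -- the serialized object itself
    PRIMARY KEY (added_id),
    UNIQUE KEY (id),
    KEY (updated)
);

-- Each "index" is just another table maintained manually by application code:
CREATE TABLE index_user (
    user_id CHAR(36) NOT NULL,
    entity_id CHAR(36) NOT NULL,
    PRIMARY KEY (user_id, entity_id)
);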

VIM for PHP Programmers

Wednesday, February 11th, 2009

I came across a pretty good presentation by Andrei Zmievski from Yahoo! Inc.  It’s a 77-page walk-through of VIM for PHP Programmers – quite comprehensive!  It covers essentials like setting marks and jumping to them quickly, which was a personal pet peeve of mine.  Little did I know vim had that covered – even global marks across files :) !  Of course, vim is a bottomless bucket if you’re going to pour sweat into learning it; Andrei admits as much quite succinctly on slide 18 – “vim regexp language is too sophisticated to be covered here.” ;)


del.icio.us schema

Sunday, April 24th, 2005

Just read this discussion about possible del.icio.us database schema solutions. Interesting postulation. Here’s how I’d do it, and in fact, that’s how I set it up for tags in my blog’s previous incarnation.

I would have three principal tables, much like the completely normalized version in Philipp Keller’s blog. Of course, some modifications would be needed to normalize this schema further and make it efficient for multiple users bookmarking the same URLs, etc. However, that’s a topic for another day. Here are the MySQL-compatible table structures:

Items
CREATE TABLE Items (
    item_id int NOT NULL AUTO_INCREMENT,
    item_description varchar(255) NOT NULL,
    item_url varchar(255) NOT NULL,
    item_extended varchar(255) NOT NULL,
    item_tags varchar(255) NOT NULL,  -- denormalized copy of the tag string, see below
    PRIMARY KEY(item_id),
    KEY(item_url)
);
Items_x_Tags
CREATE TABLE Items_x_Tags (
    item_id int NOT NULL,
    tag_id int NOT NULL,
    PRIMARY KEY(item_id, tag_id)
);
Tags
CREATE TABLE Tags (
    tag_id int NOT NULL AUTO_INCREMENT,
    tag_name varchar(255) NOT NULL,
    tag_item_count int NOT NULL,
    PRIMARY KEY(tag_id),
    KEY(tag_name),
    KEY(tag_item_count)
);

Here are some of the design decisions behind this schema. First of all, we want to be able to run statistics on tags, such as counting the number of times each has been used. That’s why the Tags table is separate, assigning a unique integer identifier to each tag as a primary key. We keep a count of the number of times each tag is used in this table as well, so it’s easy to figure out which tags are most used. For multiple-user situations, we’d want to do the same thing with a table that holds URLs, their hash and a count of the number of times each URL has been bookmarked; the user’s private URL name, tags, etc. would then just link to this table, which keeps track of hashes and counts for each URL (see the sketch below).
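Purely as a sketch of that multi-user variant – the table and column names here are illustrative and not part of the schema above:

CREATE TABLE Urls (
    url_id int NOT NULL AUTO_INCREMENT,
    url varchar(255) NOT NULL,
    url_hash char(32) NOT NULL,          -- e.g. MD5 of the URL, cheap to index and compare
    url_bookmark_count int NOT NULL DEFAULT 0,
    PRIMARY KEY(url_id),
    KEY(url_hash),
    KEY(url_bookmark_count)
);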

When we add new Items to the database, we clean up the plain-text string of tags the user submits, then check whether each tag already exists. If it does, we get its tag_id and increment its tag_item_count; otherwise we create a new Tags record with a count of one and use the newly created tag_id. For each tag the user supplied, we then insert a record into Items_x_Tags with the newly inserted Item’s item_id. Note that there’s also a field in the Items table that keeps the tag list in its character form. Why? First of all, it’s nice to have the original tag order. Second, this way we can display items without lots of joins, saving on database thrashing. Yes, it’s not 100% normalized. But then again, as far as normalization goes, integer primary keys are arguably excessive too, when the URL itself could serve as a perfectly good primary key. I prefer to take a middle-ground approach to normalization – just enough to make it practical, fast and well-suited for our purpose.
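In SQL terms, the per-tag bookkeeping boils down to something like this – just a sketch, with placeholder values (’semweb’ and the ids 42 and 101) standing in for the real tag name and ids:

-- Does the tag already exist?
SELECT tag_id FROM Tags WHERE tag_name = 'semweb';

-- If it does, bump its usage counter...
UPDATE Tags SET tag_item_count = tag_item_count + 1 WHERE tag_id = 42;

-- ...otherwise create it with a count of one:
INSERT INTO Tags (tag_name, tag_item_count) VALUES ('semweb', 1);

-- Either way, link the new item to the tag through the map table:
INSERT INTO Items_x_Tags (item_id, tag_id) VALUES (101, 42);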

When modifying entries, we just unlink all tags, decrementing counts, then start over with the same procedure as for new items – link them again through the map, incrementing counts.
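Again as a sketch, with a placeholder item_id of 101:

-- Decrement the counters of every tag currently attached to the item...
UPDATE Tags SET tag_item_count = tag_item_count - 1
WHERE tag_id IN (SELECT tag_id FROM Items_x_Tags WHERE item_id = 101);

-- ...drop the old links...
DELETE FROM Items_x_Tags WHERE item_id = 101;

-- ...then re-run the same insert procedure as above for the new tag list.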

OK, so now we have those three tables, populated with our tags. Let’s try to design the queries that Philipp Keller’s blog entry mentions – Intersection, Union and Minus.

Intersection (AND)
SELECT i.*
FROM (Items AS i INNER JOIN Items_x_Tags AS x ON i.item_id = x.item_id)
INNER JOIN Tags AS t ON x.tag_id = t.tag_id
WHERE t.tag_name IN ('bookmark', 'webservice', 'semweb')
GROUP BY i.item_id
HAVING COUNT(i.item_id) = 3
Union (OR)
SELECT i.*
FROM (Items AS i INNER JOIN Items_x_Tags AS x ON i.item_id = x.item_id)
INNER JOIN Tags AS t ON x.tag_id = t.tag_id
WHERE t.tag_name IN ('bookmark', 'webservice', 'semweb')
GROUP BY i.item_id
Minus
SELECT i.*
FROM (Items AS i INNER JOIN Items_x_Tags AS x ON i.item_id = x.item_id)
INNER JOIN Tags AS t ON x.tag_id = t.tag_id
WHERE t.tag_name IN ('bookmark', 'webservice')
AND i.item_id NOT IN (
    SELECT x2.item_id
    FROM Items_x_Tags AS x2
    INNER JOIN Tags AS m ON x2.tag_id = m.tag_id
    WHERE m.tag_name = 'semweb'
)
GROUP BY i.item_id
HAVING COUNT(i.item_id) = 2

Haven’t tested the Minus query, but in theory it should work – the inner joins keep only Items that have both the ‘bookmark’ and ‘webservice’ tags, while the subquery (aliased m) throws out any item that also carries the tag ‘semweb’. This might take some debugging, but in theory that’s the way I’d approach that kind of Minus query.

Here’s some extra reading: Philipp’s original entry, some code snippets that deal with tags.