Distributed Storage Filesystem - Which one/Is there a ready to use product?

Solution 1:

I can't speak to the rest, but you seem to be conflating a 'distributed storage engine' with a 'distributed file system'. They are not the same thing, they shouldn't be mistaken for the same thing, and they will never be the same thing. A filesystem is a way to keep track of where things are located on a hard drive. A storage engine like Hadoop's HDFS is a way to keep track of a chunk of data identified by a key. Conceptually there's not much difference. The problem is that a filesystem is a dependency of a storage engine... after all, the engine needs a way to write to a block device, doesn't it?

All that aside, I can speak to the use of ocfs2 as a distributed filesystem in a production environment. If you don't want the gritty details, stop reading after this line: It's kinda cool, but it may mean more downtime than you think it does.

We've been running ocfs2 in a production environment for the past couple of years. It's OK, but it's not great for a lot of applications. You should really sit down and figure out what your requirements actually are -- you might find that you have a lot more latitude for faults than you thought you did.

As an example, ocfs2 has a journal for each machine in the cluster that's going to mount the partition. So let's say you've got four web machines, and when you make that partition using mkfs.ocfs2, you specify that there will be six machines total to give yourself some room to grow. Each of those journals takes up space, which reduces the amount of data you can store on the disks. Now, let's say you need to scale to seven machines. In that situation, you need to take down the entire cluster (i.e. unmount all of the ocfs2 partitions) and use the tunefs.ocfs2 utility to create an additional journal, provided that there's space available. Then and only then can you add the seventh machine to the cluster (which requires you to distribute a text file to the rest of the cluster unless you're using a utility), bring everything back up, and then mount the partition on all seven machines.
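The scaling procedure above looks roughly like this as shell commands (the device path, mount point, and label are hypothetical examples, and exact flags vary by ocfs2-tools version):

```shell
# Initial format: reserve 6 node slots (one journal each), even though
# only 4 machines will mount it at first. Device and label are examples.
mkfs.ocfs2 -L "shared-web" -N 6 /dev/drbd0

# Later, to grow past 6 machines: unmount on EVERY node in the cluster...
umount /mnt/shared          # run this on all nodes

# ...then add a seventh node slot (only works if the volume has free space):
tunefs.ocfs2 -N 7 /dev/drbd0

# Add the new node to /etc/ocfs2/cluster.conf on every machine (or use
# o2cb_ctl), restart the cluster stack, then remount everywhere:
mount /dev/drbd0 /mnt/shared
```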

See what I mean? It's supposed to be high availability, which is supposed to mean 'always online', but right there you've got a bunch of downtime... and god forbid you're crowded for disk space. You DON'T want to see what happens when you crowd ocfs2.

Keep in mind that evms, which used to be the 'preferred' way to manage ocfs2 clusters, has gone the way of the dodo bird in favor of clvmd and lvm2. (And good riddance to evms.) Also, heartbeat is quickly going to turn into a zombie project in favor of the openais/pacemaker stack. (Aside: When doing the initial cluster configuration for ocfs2, you can specify 'pcmk' as the cluster engine as opposed to heartbeat. No, this isn't documented.)
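Since the 'pcmk' option isn't documented anywhere, here's a sketch of what it looks like in practice (device path and cluster name are hypothetical; this assumes an ocfs2-tools build with pacemaker support, typically 1.6 or later):

```shell
# Format a new volume against the pacemaker cluster stack instead of
# the default o2cb stack (device and cluster name are examples):
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=webcluster -N 4 /dev/drbd0

# Or switch the recorded cluster stack on an existing, unmounted volume:
tunefs.ocfs2 --update-cluster-stack /dev/drbd0
```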

For what it's worth, we've gone back to nfs managed by pacemaker, because the few seconds of downtime or the handful of dropped TCP packets that occur while pacemaker migrates an nfs share to another machine are trivial compared to the amount of downtime we were seeing for basic shared-storage operations, like adding machines, when using ocfs2.
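A pacemaker-managed NFS setup of the kind described above can be sketched with the crm shell like so (the IP address, paths, and resource names are hypothetical; the resource agents `ocf:heartbeat:IPaddr2` and `ocf:heartbeat:nfsserver` are standard ones from resource-agents):

```shell
# Hypothetical pacemaker config: a floating IP plus an NFS server,
# grouped so they fail over together to a surviving node.
crm configure primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.50 cidr_netmask=24 \
    op monitor interval=10s
crm configure primitive p_nfs ocf:heartbeat:nfsserver \
    params nfs_shared_infodir=/srv/nfs/info \
    op monitor interval=30s
# Resources in a group start in order and stay on the same node:
crm configure group g_nfs p_ip p_nfs
```

When a node dies, pacemaker restarts the group elsewhere; clients see the brief stall mentioned above rather than an outage.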

Solution 2:

I think you'll have to abandon the POSIX requirement. Very few systems implement it fully; in fact, even NFS doesn't really (think locks, etc.), and NFS has no redundancy anyway.

Any system which uses synchronous replication is going to be glacially slow; any system which has asynchronous replication (or "eventual consistency") is going to violate POSIX rules and not behave like a "conventional" filesystem.

Solution 3:

I may be misunderstanding your requirements, but have you looked at the list of distributed file systems at http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems ?

Solution 4:

Just to throw my €0.02 in here: can't OpenAFS do what you want?

Solution 5:

Take a look at Chirp (http://www.cse.nd.edu/~ccl/software/chirp/) and Parrot (http://www.cse.nd.edu/~ccl/software/parrot/).