Which DB would you use? MongoDB/Neo4j/SQL... all of them?
<p>I'd like to know which choices you would make for my use case. It's about building a social web app where each user has a personal filesystem.</p>
<hr>
<p><strong>Specification</strong></p>
<ul>
<li>Each user has their own filesystem</li>
<li>File metadata looks like unstructured documents</li>
<li>File content is sent to Amazon S3</li>
<li>Users can create directories and files in this filesystem</li>
<li>Users can share a single directory with other users (like Unix)</li>
<li>Some directories can be set as public (shared with all users)</li>
<li>Users can search for content (their own content, public content, and shared content)</li>
<li>Users can bookmark directories or files</li>
<li>Performance and scalability should be acceptable</li>
</ul>
<hr>
<p><strong>For now, we chose MongoDB for several reasons</strong></p>
<ul>
<li>The unstructured nature of the file metadata</li>
<li>Advice from someone who has already used it</li>
<li>I agreed to contribute to this project in order to discover new technologies on a real use case</li>
<li>The ability to index JSON documents in ElasticSearch for scalable text search</li>
</ul>
<hr>
<p><strong>MongoDB needs denormalization (and ElasticSearch too)</strong></p>
<p>The pain comes directly from the relationship between directories: each directory refers to its parent directory through a parentId attribute. This means that when a bookmarked directory is accessed, its breadcrumb must be available.
Without denormalization of the breadcrumb, this leads to an expensive recursion.</p>
<p>It is the same when running a search query for content: I'd like the breadcrumb of the directory to be available directly in the document (I actually use the same parser to read my objects back from ElasticSearch and MongoDB, since both use JSON/BSON).</p>
<p>So denormalization works fine until a user moves one of their root directories that contains thousands of subdirectories: the breadcrumbs of all those subdirectories must then be updated. MongoDB doesn't really help with consistency here, and it is hard to keep this denormalized breadcrumb up to date.</p>
<hr>
<p><strong>Graph databases seem appropriate for building a filesystem structure, but what about scalability?</strong></p>
<p>I don't know much about graph databases like Neo4j or Titan, but would they help to build the filesystem structure? As far as I know, graphs are hard to distribute, and having a user's directories spread across nodes doesn't seem good for breadcrumb computation.</p>
<p>But each user has their own filesystem, which is a single, isolated graph. So perhaps I could create one graph database per user and shard by user? But then what about permissions for shared directories? Where should I store them?</p>
<p>Anyway, in my search engine I still need a denormalized breadcrumb in the file metadata (at least if I keep using ElasticSearch). And it is hard to denormalize all the shared directory permissions so that a user can search a subset of another user's content.
It seems hard to index a graph for search anyway: <a href="https://stackoverflow.com/questions/9970193/how-to-store-tree-data-in-a-lucene-solr-elasticsearch-index-or-a-nosql-db">How to store tree data in a Lucene/Solr/Elasticsearch index or a NoSQL db?</a></p>
<hr>
<p><strong>MongoDB is perhaps not a good choice for storing structured and nearly static content like users</strong></p>
<p>Another thing that matters is consistency. When creating a new user, I need to create 8 root directories. These root directories are not subdocuments of the user document. So how should I create these directories during user creation? MongoDB doesn't have transactions, so how can I be sure that the 9 inserts (user + 8 directories) are done atomically? We don't want a user created with only half of their directories. And I would rather not rely on an async job plus a flag on the user document to check that the directories have been created.</p>
<p>So a traditional (free) SQL database seems like a good fit for consistency, to store user-related data. Scalability can be handled by partitioning at the application level, as Facebook and Tumblr do. User-related data can be colocated on the same instance so that joins remain possible, for example on the user's filesystem. And I already know SQL and multitenancy strategies.</p>
<hr>
<p>So in the end, I'm totally lost in this NoSQL/SQL world. Could you help me make a choice for this use case?</p>
<p>I'm not trying to over-optimize, just to see what we may need to do in the future.</p>
<p>Does anyone know of a company that is doing something similar?</p>
<p>Something I'm considering is a hybrid solution where, for example, we store structured data in MySQL/PostgreSQL, the file metadata in MongoDB, and directories in (? I don't know). When a user connects, we could cache their whole filesystem graph in an embedded Neo4j database (assuming the graph is big but of acceptable size). Does that seem like a good idea?</p>
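<p>To make the SQL option above concrete, here is a minimal sketch, not a definitive implementation. It assumes Python's built-in sqlite3 module as a stand-in for a relational database that supports recursive queries; the table names, column names, function names and the eight root directory names are all hypothetical. It shows two things: the user and its 8 root directories are inserted in a single transaction, and the breadcrumb is computed from parent_id with a recursive query, so moving a directory requires no breadcrumb rewrite.</p>

```python
import sqlite3

# Hypothetical names for the 8 root directories; the question does not list them.
ROOT_DIRECTORIES = ["documents", "pictures", "music", "videos",
                    "bookmarks", "shared", "public", "trash"]

# Hypothetical schema: each directory points to its parent (NULL for a root).
SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE directories (
    id        INTEGER PRIMARY KEY,
    owner_id  INTEGER NOT NULL REFERENCES users(id),
    parent_id INTEGER REFERENCES directories(id),
    name      TEXT NOT NULL
);
"""

# Walk from the given directory up to its root, then return names root-first.
BREADCRUMB_SQL = """
WITH RECURSIVE ancestors(id, parent_id, name, depth) AS (
    SELECT id, parent_id, name, 0 FROM directories WHERE id = ?
    UNION ALL
    SELECT d.id, d.parent_id, d.name, a.depth + 1
    FROM directories AS d JOIN ancestors AS a ON d.id = a.parent_id
)
SELECT name FROM ancestors ORDER BY depth DESC
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def create_user(conn, name, roots=ROOT_DIRECTORIES):
    """Insert the user and all root directories in one transaction."""
    with conn:  # commits on success, rolls back if any insert raises
        user_id = conn.execute(
            "INSERT INTO users (name) VALUES (?)", (name,)
        ).lastrowid
        conn.executemany(
            "INSERT INTO directories (owner_id, parent_id, name) VALUES (?, NULL, ?)",
            [(user_id, root) for root in roots],
        )
    return user_id


def create_directory(conn, owner_id, parent_id, name):
    with conn:
        return conn.execute(
            "INSERT INTO directories (owner_id, parent_id, name) VALUES (?, ?, ?)",
            (owner_id, parent_id, name),
        ).lastrowid


def find_root(conn, owner_id, name):
    row = conn.execute(
        "SELECT id FROM directories"
        " WHERE owner_id = ? AND parent_id IS NULL AND name = ?",
        (owner_id, name),
    ).fetchone()
    return row[0] if row else None


def move_directory(conn, directory_id, new_parent_id):
    """Re-parent one directory. This sketch does not guard against cycles."""
    with conn:
        conn.execute(
            "UPDATE directories SET parent_id = ? WHERE id = ?",
            (new_parent_id, directory_id),
        )


def breadcrumb(conn, directory_id):
    return [row[0] for row in conn.execute(BREADCRUMB_SQL, (directory_id,))]


if __name__ == "__main__":
    conn = open_db()
    alice = create_user(conn, "alice")
    work = create_directory(conn, alice, find_root(conn, alice, "documents"), "work")
    reports = create_directory(conn, alice, work, "reports")
    print(breadcrumb(conn, reports))
    move_directory(conn, work, find_root(conn, alice, "pictures"))
    print(breadcrumb(conn, reports))
```

<p>With this layout, moving a directory is a single-row update, whatever the number of subdirectories below it. The trade-off is that every breadcrumb read walks the ancestor chain, and a search index such as ElasticSearch would still need its own denormalized copy of the breadcrumb.</p>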