python - Representing reciprocal relationships in MongoDB


I have a group of things (genes) in MongoDB. I'm doing an analysis to see how similar each gene is to each other gene, and I'd like to store this information in the database. I have a separate document in the database for each gene, containing information such as the species the gene came from and its DNA sequence. Each one of course has a unique identifier, _id.

When I do the analysis, I get information on how similar the genes are as a percent (their perc_identity). Typically, the lower bound the analysis can return is ~70%, so not every pair of genes will have a value, but each relationship is reciprocal (e.g. if perc_identity(a:b) == 90 then perc_identity(b:a) == 90).

My question is: what's the best data model to store these relationships so I can retrieve them for further analysis? Sometimes I'll want to grab all pairs with perc_identity > 95. Other times I'll want all of the matches for a particular gene. If it matters, the initial analysis for perc_identity only needs to be done once and already takes quite a long time, so performance on insert matters less than retrieval for later analysis.

Some ideas I had (I'm working with MongoDB in Python, if that matters):

1) Within the document for each gene, have a sub-document that contains all of the matched _ids and their perc_identity. E.g.:

    {
        _id: genea,
        dna_seq: 'aactg...',
        species: 'homo sapiens',
        hits: {
            genea: 100,
            geneb: 92,
            genec: 70,
        }
    },
    {
        _id: geneb,
        dna_seq: 'aattg...',
        species: 'pan troglodytes',
        hits: {
            genea: 92,
            geneb: 100,
        }
    },
    {
        _id: genec,
        dna_seq: 'atggc...',
        species: 'homo erectus',
        hits: {
            genea: 70,
            genec: 100
        }
    }

This would cause duplication of data, but it is closest to how the data is spit out of the initial analysis. Most of the time, I won't care about the other data in the gene document, so I'm not clear whether it will slow things down to have this information nested within it. I'm also not clear whether there is an efficient way to query for, say, perc_identity > 90. And every time I want to do an analysis, I'll retrieve double the amount of data I need.

2) Have a separate document that contains the gene _ids and all of their hits. E.g.:

    {
        _id: 'hits',
        genea: {
            genea: 100,
            geneb: 92,
            genec: 70
        },
        geneb: {
            genea: 92,
            geneb: 100
        },
        # etc.
    }

This has the benefit that I don't have to mess with the gene documents at all. I could have a separate hits collection if that makes a difference. The other thing is that there are ~50k gene records, but only 1-2% of them will have any hits at all, so queries won't have to bother checking the majority of documents. Otherwise, this seems similar to (1) to me.

3) Some way to have no redundancy. I can't think of many ways to do this. One bad way I thought of is to have perc_identity as the key, with a list of _id tuples as the value. I can round to the nearest integer percent. This seems to require checking for the presence of the _id in every tuple within that perc_identity every time I insert something, or inserting and collapsing to a set afterwards. And in this case, retrieving all matches for a particular _id seems horribly inefficient.

Or, since order doesn't matter, something like:

    {
        _id: ?,
        type: 'hit',
        pair1: genea,
        pair2: geneb,
        perc_identity: 92
    },
    {
        _id: ??,
        type: 'hit',
        pair1: genec,
        pair2: genea,
        perc_identity: 70
    },
    # etc.

Any critique of one of these strategies, or suggestions for other ways to represent this, would be appreciated. Let me know if there's other info I should provide or if I can clarify anything. If (1) or (2) seem like good strategies, then I guess my question is what the best way is to construct a query based on a perc_identity threshold.

This is never an easy question to answer! However, the guiding principle should be to decide based on the way you intend to use the data. In this case, you've mentioned two queries:

1. Grab all pairs with perc_identity > 95
2. Get all matches for one gene

(Of course, there may be other common analyses you plan to make; it would help to spell them out.)

Based on this, I would encourage you to go with a denormalized approach like the one you discuss in your third alternative. It does have some downsides, mostly on insert, which you seem to be aware of, but it makes your first type of query very easy:

db.hits.find({perc_identity: {$gt: 95}}) 

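...whereas with the other approaches you would need to iterate over the keys in other documents. For instance, with your first approach, you would need to retrieve the hits subdocuments for every gene, iterate over the keys of those subdocuments, and add to a list any values greater than 95. All of this would need to be done outside of MongoDB, in your Python (PyMongo) code.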
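To make that client-side work concrete, here is a minimal sketch of the loop the first approach would require. The function name `pairs_above` and the sample list are illustrative assumptions, and the documents are plain dicts standing in for what a query over the gene collection would return.

```python
def pairs_above(gene_docs, threshold):
    """Collect reciprocal pairs whose perc_identity exceeds threshold.

    gene_docs: iterable of dicts shaped like approach 1, each with an
    "_id" and an optional "hits" mapping of gene ids to percent values.
    """
    pairs = {}
    for doc in gene_docs:
        for other, perc in doc.get("hits", {}).items():
            if other == doc["_id"]:
                continue  # skip the trivial self-hit
            if perc > threshold:
                # Sort the two ids so a:b and b:a collapse to one key.
                key = tuple(sorted((doc["_id"], other)))
                pairs[key] = perc
    return pairs


# Sample documents mirroring the question's example (sequences omitted).
genes = [
    {"_id": "genea", "hits": {"genea": 100, "geneb": 92, "genec": 70}},
    {"_id": "geneb", "hits": {"genea": 92, "geneb": 100}},
    {"_id": "genec", "hits": {"genea": 70, "genec": 100}},
]

print(pairs_above(genes, 90))
```

Every gene document with hits has to be transferred and scanned, and each pair is seen twice before being collapsed, which is the overhead the separate hit documents avoid.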

The other query is a bit more complicated than with approaches 1 and 2, but not by much:

db.hits.find({$or: [{pair1: <your gene>}, {pair2: <your gene>}]}) 
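Since the question mentions Python, the two shell queries can be sketched as PyMongo-style filter dicts. The helper names (`identity_filter`, `gene_filter`, `matches_for`) and the database name in the comment are assumptions made for illustration; only the filter shapes come from the queries above.

```python
def identity_filter(min_identity):
    """Filter for hit documents with perc_identity above min_identity."""
    return {"perc_identity": {"$gt": min_identity}}


def gene_filter(gene_id):
    """Filter for hit documents that mention gene_id on either side."""
    return {"$or": [{"pair1": gene_id}, {"pair2": gene_id}]}


def matches_for(hits, gene_id):
    """Map each partner of gene_id to its perc_identity.

    hits: iterable of hit documents that all involve gene_id, for
    example the result of a find() using gene_filter(gene_id).
    """
    partners = {}
    for hit in hits:
        other = hit["pair2"] if hit["pair1"] == gene_id else hit["pair1"]
        partners[other] = hit["perc_identity"]
    return partners


# With a running server, usage is assumed to look roughly like this
# (the database name "genes_db" is hypothetical):
#   from pymongo import MongoClient
#   hits = MongoClient()["genes_db"]["hits"]
#   close_pairs = list(hits.find(identity_filter(95)))
#   partners = matches_for(hits.find(gene_filter("genea")), "genea")
```

If these queries run often, indexes on `perc_identity`, `pair1`, and `pair2` (created with PyMongo's `create_index`) would probably be worth testing, although the source does not discuss indexing.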

So at the cost of a little more logic on inserts, the two querying cases you mention become extremely simple and can be handled by the database server itself. If you have other common use cases that are difficult to achieve with the third approach, it might be worth revisiting it, but as it stands, that's what I would choose.
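One way the extra insert logic could look is sketched below: each pair is stored once by sorting the two gene ids before building the document. The `_id` scheme (the two ids joined with `|`) and the helper names are assumptions rather than anything from the question, and the scheme presumes `|` never appears in a gene id.

```python
def make_hit(gene_a, gene_b, perc_identity):
    """Build one hit document with the two gene ids in sorted order."""
    pair1, pair2 = sorted((gene_a, gene_b))
    return {
        "_id": f"{pair1}|{pair2}",  # assumed scheme: same key for a:b and b:a
        "type": "hit",
        "pair1": pair1,
        "pair2": pair2,
        "perc_identity": perc_identity,
    }


def hits_from_results(results):
    """Turn (gene_a, gene_b, perc_identity) tuples into unique hit docs.

    Self-hits are dropped and reciprocal duplicates collapse to one doc.
    """
    unique = {}
    for gene_a, gene_b, perc in results:
        if gene_a == gene_b:
            continue
        hit = make_hit(gene_a, gene_b, perc)
        unique[hit["_id"]] = hit
    return list(unique.values())


# Raw analysis output as it might arrive, with both directions present.
raw = [
    ("genea", "geneb", 92),
    ("geneb", "genea", 92),
    ("genea", "genea", 100),
    ("genec", "genea", 70),
]
docs = hits_from_results(raw)
# With PyMongo, the resulting list could be passed to collection.insert_many(docs).
```

Because the `_id` is the same for both orderings, the reciprocity check happens once in Python instead of on every query.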

Two notes: first, the MongoDB documentation has some advice on data modeling that may be worth reading. Second, given what little I know of your problem domain, this may be one case where a relational database would be a better fit than MongoDB.

