python - Representing reciprocal relationships in MongoDB
I have a group of things (genes) in MongoDB. I'm doing an analysis to see how similar each gene is to each other gene, and I'd like to store that information in the database. I have a different document in the database for each gene, containing information about the species the gene came from and its DNA sequence. Each one of course has a unique identifier _id.
When I do the analysis, I get information on how similar genes are as a percent (their perc_identity). Typically the lower bound the analysis can return is ~70%, so there isn't a number for every gene pair, and each relationship is reciprocal (e.g. if perc_identity(a:b) == 90, then perc_identity(b:a) == 90).
My question is: what's the best data model for storing these relationships so I can retrieve them for further analysis? In other words, sometimes I'll want to grab all pairs with perc_identity > 95. Other times I'll want all of the matches for a particular gene. If it matters, the initial perc_identity analysis only needs to be done once and already takes quite a long time, so performance on insert matters less than retrieval for later analysis.
Some ideas I had (I'm working with MongoDB in Python, if that matters):
1) Within the document for each gene, have a sub-document that contains all of the matched _ids and their perc_identity. E.g.:
{
  _id: genea,
  dna_seq: 'aactg...',
  species: 'Homo sapiens',
  hits: {
    genea: 100,
    geneb: 92,
    genec: 70,
  }
},
{
  _id: geneb,
  dna_seq: 'aattg...',
  species: 'Pan troglodytes',
  hits: {
    genea: 92,
    geneb: 100,
  }
},
{
  _id: genec,
  dna_seq: 'atggc...',
  species: 'Homo erectus',
  hits: {
    genea: 70,
    genec: 100,
  }
}
This would cause duplication of data, but it's closest to how the data is spit out of the initial analysis. Most of the time I won't care about the other data in the gene document, but I'm not clear whether it would slow things down to have that information nested within them. I'm also not clear whether there is an efficient way to query for, say, perc_identity > 90. And every time I want to do an analysis, I'll retrieve double the amount of data I need.
2) Have a separate document that contains the gene _ids and all of the hits. E.g.:
{
  _id: 'hits',
  genea: {
    genea: 100,
    geneb: 92,
    genec: 70
  },
  geneb: {
    genea: 92,
    geneb: 100
  },
  # etc
}
This has the benefit that I don't have to mess with the gene documents at all. I could also make it a separate hits collection if that makes a difference. The other thing is that there are ~50k gene records but only 1-2% of them have any hits at all, so queries won't have to bother checking the majority of documents. Otherwise, it seems similar to (1) to me.
3) Some way to have no redundancy. I can't think of many ways to do this. One (probably bad) way I thought of is to have the perc_identity as a key, and a list of _id tuples as the value. I could round to the nearest integer percent. But this seems to require checking for the presence of an _id in every tuple within a perc_identity every time I insert something, or inserting everything and collapsing to a set afterwards. And in this case, retrieving all the matches for a particular _id seems horribly inefficient.
Or, since order doesn't matter, something like:
{
  _id: ?,
  type: 'hit',
  pair1: genea,
  pair2: geneb,
  perc_identity: 92
},
{
  _id: ??,
  type: 'hit',
  pair1: genec,
  pair2: genea,
  perc_identity: 70
},
# etc
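A small helper can build these pair documents without reciprocal duplicates. This is only a sketch: the function name is made up, and the commented insert_one call and hits collection name are illustrative. Sorting the two ids gives a canonical order, so (a, b) and (b, a) always produce the same document.

```python
# Sketch: build one pair document per relationship (approach 3).
# Sorting the two ids canonicalizes the pair, so (a, b) and (b, a)
# map to the same document and no reciprocal duplicate is stored.
def make_hit_doc(id_a, id_b, perc_identity):
    pair1, pair2 = sorted([id_a, id_b])
    return {"pair1": pair1, "pair2": pair2, "perc_identity": perc_identity}

doc = make_hit_doc("geneb", "genea", 92)
# doc == {"pair1": "genea", "pair2": "geneb", "perc_identity": 92}
# With pymongo you might then insert it (hypothetical collection name):
#   db.hits.insert_one(doc)
```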
Any critique of one of these strategies, or suggestions for other ways to represent this, would be appreciated. Let me know if there's any other info I should provide or if I can clarify anything. If (1) or (2) seem like good strategies, I guess my question is the best way to construct a query based on a perc_identity threshold.
This is never an easy question to answer! However, a guiding principle should be to decide based on the way you intend to use the data. In your case, you've mentioned two queries:
- grab all pairs with perc_identity > 95
- get all matches for one gene
(Of course, there may be other common analyses you plan to make--it would help to spell them out.)
Based on this, I would encourage you to go with the denormalized approach you discuss in your third alternative. It does have downsides on insert, which you seem aware of, but it makes the first type of query easy:
db.hits.find({perc_identity: {$gt: 95}})
...whereas with the other approaches you would need to iterate over the keys in the documents. For instance, with the first approach, you would need to retrieve the hits subdocuments for every gene, iterate over the keys of those subdocuments, and add those greater than 95 to a list. This would need to be done outside of MongoDB, in pymongo.
The other query is a bit more complicated than with approaches 1 and 2, but not by much:
db.hits.find({$or: [{pair1: <your gene>}, {pair2: <your gene>}]})
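In pymongo, both filters are plain dictionaries; a sketch (the hits collection name is carried over from the question, and the commented calls assume a live connection):

```python
# Sketch: the two query filters as pymongo takes them, assuming a
# hits collection of pair documents as in the third approach.
threshold_filter = {"perc_identity": {"$gt": 95}}

gene = "genea"
match_filter = {"$or": [{"pair1": gene}, {"pair2": gene}]}

# Hypothetical usage against a live collection:
#   db.hits.find(threshold_filter)
#   db.hits.find(match_filter)
# An index keeps the threshold query from scanning every document:
#   db.hits.create_index("perc_identity")
```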
So at the cost of a bit more logic on inserts, the two querying cases you mention become extremely simple and can be handled by the database server itself. If you have other common use cases that are difficult to achieve with the third approach, it may be worth revisiting it--but as it stands, that's what I would choose.
Two notes: first, the MongoDB documentation has some advice on data modeling that may be worth reading. Second, given how little we know about your problem domain, this may be one case where a relational database is a better fit.