Re: searching substrings with interpositions
comp.lang.python archive

From: Andrew Dalke <dalke@dalkescientific.com>
Date: Tue May 24 2005 - 18:04:43 CEST

borges2003xx@yahoo.it wrote:
> the next step of my job is to make limits of lenght of interposed
> sequences (if someone can help me in this way i'll apreciate a lot)
> thanx everyone.

Kent Johnson had the right approach, with regular expressions.
For a bit of optimization, use non-greedy quantifiers. That will
give you shorter matches.

Suppose you want no more than 10 bases between terms. You could
use this pattern.

    a.{,10}?t.{,10}?c.{,10}?g.{,10}?

>>> import re
>>> pat = re.compile('a.{,10}?t.{,10}?c.{,10}?g.{,10}?')
>>> m = pat.search("tcgaacccgtaaaaagctaatcg")
>>> m.group(0), m.start(0), m.end(0)
('aacccgtaaaaagctaatcg', 3, 23)
>>>

>>> pat.search("tcgaacccgtaaaaagctaatttttttg")
<_sre.SRE_Match object at 0x9b950>
>>> pat.search("tcgaacccgtaaaaagctaattttttttg")
>>>
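
To see what "shorter matches" means in practice, here is a minimal
sketch comparing greedy and non-greedy gaps. The sequence "atcgtcg" is
a made-up example, not from the original data, chosen so the two forms
stop at different places. The trailing gap after the g is left out
here because a non-greedy gap at the end of a pattern matches nothing.

```python
import re

# Same terms and the same 10-character limit; only the gap quantifier differs.
greedy = re.compile(r"a.{,10}t.{,10}c.{,10}g")
lazy = re.compile(r"a.{,10}?t.{,10}?c.{,10}?g")

# Hypothetical sequence with two places where the match could end.
seq = "atcgtcg"

greedy_match = greedy.search(seq).group(0)
lazy_match = lazy.search(seq).group(0)

print(greedy_match)  # atcgtcg
print(lazy_match)    # atcg
```

Both forms find a match whenever one exists within the limits. The
non-greedy form just stops at the first place each term can be found.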

If you want to know the location of each of the bases, and
you'll have fewer than 100 of them (I think that's the limit on
groups), then you can use groups in the regular expression language:

>>> def make_pattern(s, limit = None):
...     if limit is None:
...         t = ".*?"
...     else:
...         t = ".{,%d}?" % (limit,)
...     text = []
...     for c in s:
...         text.append("(%s)%s" % (c, t))
...     return "".join(text)
...
>>> make_pattern("atcg")
'(a).*?(t).*?(c).*?(g).*?'
>>> make_pattern("atcg", 10)
'(a).{,10}?(t).{,10}?(c).{,10}?(g).{,10}?'
>>> pat = re.compile(make_pattern("atcg", 10))
>>> m = pat.search("tcgaacccgtaaaaagctaatttttttg")
>>> m
<_sre.SRE_Match object at 0x8ea70>
>>> m.groups()
('a', 't', 'c', 'g')
>>> for i in range(1, len("atcg")+1):
...     print m.group(i), m.start(i), m.end(i)
...
a 3 4
t 9 10
c 16 17
g 27 28
>>>
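
For anyone on Python 3, the same idea can be sketched as below. This
is a variant of the session above, not the same code: it drops the
trailing gap after the last term, passes each term through re.escape
in case it is a regex metacharacter, and wraps the lookup in a helper
called find_positions, which is a name made up for this example.

```python
import re

def make_pattern(s, limit=None):
    # Non-greedy gap: unbounded, or at most `limit` characters.
    gap = ".*?" if limit is None else ".{,%d}?" % limit
    # One capturing group per term, joined by the gap.
    # re.escape is an addition; the session above inserts each character as-is.
    return gap.join("(%s)" % re.escape(c) for c in s)

def find_positions(s, seq, limit=None):
    # Hypothetical helper: list of (term, start, end), or None if no match.
    m = re.search(make_pattern(s, limit), seq)
    if m is None:
        return None
    return [(m.group(i), m.start(i), m.end(i)) for i in range(1, len(s) + 1)]

print(make_pattern("atcg", 10))
# (a).{,10}?(t).{,10}?(c).{,10}?(g)

print(find_positions("atcg", "tcgaacccgtaaaaagctaatttttttg", 10))
# [('a', 3, 4), ('t', 9, 10), ('c', 16, 17), ('g', 27, 28)]

print(find_positions("atcg", "tcgaacccgtaaaaagctaattttttttg", 10))
# None
```

The second sequence has 11 bases between the c and the final g, one
more than the limit, so it returns None.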

                                Andrew
                                dalke@dalkescientific.com
Received on Thu Sep 29 16:13:21 2005