Parse RNA into codons

Retina, 39 38 32 30 bytes


The trailing linefeed is significant.

Output as a linefeed-separated list.

Try it online.



This is a match stage, which turns the input into a linefeed-separated list of all matches (due to the !). The regex matches every codon starting from the first AUG, using two alternatives. AUG matches unconditionally, so that it can start the list of matches. The second alternative can be any codon (... matches any three characters), but \G is a special anchor which ensures that it can only match immediately after another match. The only problem is that \G also matches at the beginning of the string, which we don't want. Since the input consists only of word characters, we use \B (any position that is not a word boundary) to ensure that this alternative is not used at the beginning of the input.


This finds the first stop codon, matched as U(AA|AG|GA), together with everything after it, and removes both from the string. Since the first stage split the codons into separate lines, we know that this match is properly aligned with the start codon. We use \D (non-digits) to match any character, since . wouldn't go past the linefeeds and the input won't contain digits.
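For readers who don't know Retina, the two stages can be modelled roughly as below. This is only a sketch of the behaviour described above, not the Retina program itself: the names matches, dropFromStop and the inRun flag are made up for illustration, with inRun standing in for the \B\G check (true only directly after a previous match, never at the start of the input).

```haskell
import Data.List (isPrefixOf)

-- Stage 1 (sketch): list every match of AUG|\B\G... from left to right.
-- 'inRun' is a hypothetical stand-in for \B\G: it is True only when the
-- previous match ended exactly at the current position.
matches :: String -> [String]
matches = go False
  where
    go :: Bool -> String -> [String]
    go _ [] = []
    go inRun s@(_ : rest)
      -- directly after a match, any three characters form the next codon
      | inRun, (c@[_, _, _], s') <- splitAt 3 s = c : go True s'
      -- otherwise only a literal AUG may start (or restart) the list
      | "AUG" `isPrefixOf` s = "AUG" : go True (drop 3 s)
      -- no match here: move one character to the right
      | otherwise = go False rest

-- Stage 2 (sketch): drop the first stop codon and everything after it.
dropFromStop :: [String] -> [String]
dropFromStop = takeWhile (`notElem` ["UAA", "UAG", "UGA"])

main :: IO ()
main = putStr . unlines . dropFromStop . matches $ "CCAUGCUUUAAGG"
```

With the input CCAUGCUUUAAGG, the first stage yields AUG, CUU and UAA (the trailing GG is too short to match ...), and the second stage leaves AUG and CUU, one per line.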

Haskell, 115 112 bytes

import Data.Lists
fst.break(\e->elem e["UAA","UAG","UGA"]||length e<3).chunksOf 3.snd.spanList((/="AUG").take 3)

Usage example:

*Main> ( fst.break(\e->elem e["UAA","UAG","UGA"]||length e<3).chunksOf 3.snd.spanList((/="AUG").take 3) ) "AUGCUUAUGAAUGGCAUGUACUAAUAGACUCACUUAAGCGGUGAUGAA"

How it works:

                spanList((/="AUG").take 3)  -- split input at the first "AUG"
             snd                            -- take 2nd part ("AUG" + rest)
     chunksOf 3                             -- split into 3 element lists
fst.break(\e->                              -- take elements from this list until
           elem e["UAA","UAG","UGA"]||      -- we see a stop codon
           length e<3)                      -- or hit an incomplete trailing chunk
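Data.Lists is not part of base, so here is an ungolfed sketch of the same pipeline using only Data.List. It assumes, as the explanation above implies, that spanList applies its predicate to the whole remaining suffix. The helper names fromStart, chunks3 and codons are made up for this example.

```haskell
import Data.List (isPrefixOf, tails)

-- Hypothetical helper: the suffix of the input beginning at the first
-- "AUG", or the empty string if there is no start codon.
fromStart :: String -> String
fromStart s = case filter ("AUG" `isPrefixOf`) (tails s) of
  (t : _) -> t
  []      -> ""

-- Split into chunks of three; the last chunk may be shorter.
chunks3 :: String -> [String]
chunks3 [] = []
chunks3 s  = take 3 s : chunks3 (drop 3 s)

-- Keep codons until a stop codon or an incomplete chunk appears.
codons :: String -> [String]
codons = takeWhile ok . chunks3 . fromStart
  where
    ok c = length c == 3 && c `notElem` ["UAA", "UAG", "UGA"]

main :: IO ()
main = print (codons "AUGCUUAUGAAUGGCAUGUACUAAUAGACUCACUUAAGCGGUGAUGAA")
-- prints ["AUG","CUU","AUG","AAU","GGC","AUG","UAC"]
```

Here takeWhile with the negated condition plays the role of fst.break in the golfed version.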