EMBL file with space before quoted, multi-line qualifier value

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

EMBL file with space before quoted, multi-line qualifier value

Adam Sjøgren-2
  Hi.

I just stumbled over an EMBL file that has space after equal-sign that
follows the qualifier name.

BioPerl doesn't parse folded lines when that happens, because the "-char
is expected to be the first character after the '='.

I have whittled it down to this example:

ID   TEST standard; DNA; 10 BP.
XX
AC   TEST;
XX
DT   09-APR-2015
XX
DE   Test of space before quoted qualifier.
XX
XX
FH   Key             Location/Qualifiers
FT   CDS             1..10
FT                   /*tag= x
FT                   /gene= "someT"
FT                   /product= "somewordandt extthatisquite lon
FT                   gandthereforewraps"
XX
SQ   Sequence 10 BP; 10 A; 0 C; 0 G; 9 T; 0 U; 0 Other;
     aaaaaaaaaa       10
//

And this "one"-liner:

  ~$ perl -e 'use warnings; use strict; use Bio::SeqIO; my $in=Bio::SeqIO->new("-file"=>"white_space.embl", "-format"=>"embl"); my $seq=$in->next_seq; foreach my $feature ($seq->all_SeqFeatures) { print $feature->primary_tag . "\n"; map { print "  /" . $_ . "=" . (join " ", $feature->get_tag_values($_)) . "\n"; } $feature->get_all_tags }'

  ------------- EXCEPTION: Bio::Root::Exception -------------
  MSG: Can't see new qualifier in: gandthereforewraps"
  from:
  /*tag= x
  /gene= "someT"
  /product= "somewordandt extthatisquite lon
  gandthereforewraps"

  STACK: Error::throw
  STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
  STACK: Bio::SeqIO::embl::_read_FTHelper_EMBL /usr/share/perl5/Bio/SeqIO/embl.pm:1361
  STACK: Bio::SeqIO::embl::next_seq /usr/share/perl5/Bio/SeqIO/embl.pm:400
  STACK: -e:1
  -----------------------------------------------------------

shows the error.

If I add this patch:

--- Bio/SeqIO/embl.pm.orig 2015-04-09 14:27:08.035573910 +0200
+++ Bio/SeqIO/embl.pm 2015-04-09 14:27:46.952373300 +0200
@@ -1358,7 +1358,7 @@
     # intact to provide informative error messages.)
   QUAL: for (my $i = 0; $i < @qual; $i++) {
         $_ = $qual[$i];
-        my( $qualifier, $value ) = m{^/([^=]+)(?:=(.+))?}
+        my( $qualifier, $value ) = m{^/([^=]+)*(?:=\s*(.+))?}
             or $self->throw("Can't see new qualifier in: $_\nfrom:\n"
                             . join('', map "$_\n", @qual));
         if (defined $value) {

then the output is:

  CDS
    /*tag=x
    /gene=someT
    /product=somewordandt extthatisquite lon gandthereforewraps

which seems more reasonable, even if the format does not allow
whitespace after the =-sign (I haven't checked).

What do you think?

  Best regards,

    Adam

--
                                                          Adam Sjøgren
                                                    [hidden email]

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: EMBL file with space before quoted, multi-line qualifier value

Adam Sjøgren-2
Adam writes:

> -        my( $qualifier, $value ) = m{^/([^=]+)(?:=(.+))?}
> +        my( $qualifier, $value ) = m{^/([^=]+)*(?:=\s*(.+))?}
                                                 |
I got a '*' to many there -----------------------'

I (only) meant to add the '\s*' after the '=':

+        my( $qualifier, $value ) = m{^/([^=]+)(?:=\s*(.+))?}

Sorry!

--
                                                          Adam Sjøgren
                                                    [hidden email]

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: EMBL file with space before quoted, multi-line qualifier value

Fields, Christopher J
On Apr 9, 2015, at 8:03 AM, Adam Sjøgren <[hidden email]> wrote:

>
> Adam writes:
>
>> -        my( $qualifier, $value ) = m{^/([^=]+)(?:=(.+))?}
>> +        my( $qualifier, $value ) = m{^/([^=]+)*(?:=\s*(.+))?}
>                                                 |
> I got a '*' to many there -----------------------'
>
> I (only) meant to add the '\s*' after the '=':
>
> +        my( $qualifier, $value ) = m{^/([^=]+)(?:=\s*(.+))?}
>
> Sorry!
>
> --
>                                                          Adam Sjøgren
>                                                    [hidden email]

As long as this passes current tests I don’t have a problem with adding it in.  I would suggest adding a simple test case for it; you could modify a current EMBL file in the data directory if needed for a test case, probably no need to add a new file.

I’ve long felt there's a fine line between having a parser being a strict validation tool and having it be flexible enough to allow for idiosyncrasies from various tools (e.g. see any GenBank output from anywhere).  I tend to veer in the direction of flexibility within reason; having a test suite helps quite a bit.

chris

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: EMBL file with space before quoted, multi-line qualifier value

Hamish McWilliam
On 9 April 2015 at 16:18, Fields, Christopher J <[hidden email]> wrote:
On Apr 9, 2015, at 8:03 AM, Adam Sjøgren <[hidden email]> wrote:
>
> Adam writes:
>
>> -        my( $qualifier, $value ) = m{^/([^=]+)(?:=(.+))?}
>> +        my( $qualifier, $value ) = m{^/([^=]+)*(?:=\s*(.+))?}
>                                                 |
> I got a '*' to many there -----------------------'
>
> I (only) meant to add the '\s*' after the '=':
>
> +        my( $qualifier, $value ) = m{^/([^=]+)(?:=\s*(.+))?}
>
> Sorry!
>
> --
>                                                          Adam Sjøgren
>                                                    [hidden email]

As long as this passes current tests I don’t have a problem with adding it in.  I would suggest adding a simple test case for it; you could modify a current EMBL file in the data directory if needed for a test case, probably no need to add a new file.

I’ve long felt there's a fine line between having a parser being a strict validation tool and having it be flexible enough to allow for idiosyncrasies from various tools (e.g. see any GenBank output from anywhere).  I tend to veer in the direction of flexibility within reason; having a test suite helps quite a bit.


In general I agree that having some wiggle room when reading is good. However it is also good to have the option of stricter interpretations of the data format specification, to catch errors like this and give users the option of informing the source of such data that their output needs to be adjusted to match the format specification. This makes it easier to ensure that tools which write these formats use stricter interpretations than those that read them, and outcome which makes everyone happier.

All the best,

Hamish
 
chris

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l



--
----
"Saying the internet has changed dramatically over the last five years is cliché – the internet is always changing dramatically" - Craig Labovitz, Arbor Networks.

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: EMBL file with space before quoted, multi-line qualifier value

Peter Cock
On Thu, Apr 9, 2015 at 5:05 PM, Hamish McWilliam
<[hidden email]> wrote:

> On 9 April 2015 at 16:18, Fields, Christopher J <[hidden email]> wrote:
>>
>> As long as this passes current tests I don’t have a problem with adding it
>> in.  I would suggest adding a simple test case for it; you could modify a
>> current EMBL file in the data directory if needed for a test case, probably
>> no need to add a new file.
>>
>> I’ve long felt there's a fine line between having a parser being a strict
>> validation tool and having it be flexible enough to allow for idiosyncrasies
>> from various tools (e.g. see any GenBank output from anywhere).  I tend to
>> veer in the direction of flexibility within reason; having a test suite
>> helps quite a bit.
>>
>
> In general I agree that having some wiggle room when reading is good.
> However it is also good to have the option of stricter interpretations of
> the data format specification, to catch errors like this and give users the
> option of informing the source of such data that their output needs to be
> adjusted to match the format specification. This makes it easier to ensure
> that tools which write these formats use stricter interpretations than those
> that read them, and outcome which makes everyone happier.
>
> All the best,
>
> Hamish

What we're doing with the Biopython GenBank/EMBL parser
(and others) on 'problematic' things where we think we can
parse them unambiguously, is to parse them but give a
warning. The Python warnings framework lets the user
silence all our parser warning if they want to.

Where unambiguous parsing is a problem, I vote for an
error.

I've not checked this specific issue with the extra space in
a feature qualifier yet...

Peter

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

t/Align/SimpleAlign.t: Bareword "unified_diff" not allowed while "strict subs" in use

Adam Sjøgren-2
In reply to this post by Fields, Christopher J
Christopher writes:

> As long as this passes current tests I don’t have a problem with
> adding it in. I would suggest adding a simple test case for it;

In starting to work in this, I did a fresh fork on github of
bioperl-live, and ran the tests - just to get a baseline before changing
anything.

The first problem I hit is this:

  adsj:~/work/bioperl/bioperl-live$ prove -v --timer -r t/Align/SimpleAlign.t
  [08:08:53] t/Align/SimpleAlign.t ..
  1..206
  ok 1 - use Bio::SimpleAlign;
  ok 2 - use Bio::AlignIO;
  ok 3 - use Bio::SeqFeature::Generic;
  ok 4 - use Bio::Location::Simple;
  ok 5 - use Bio::Location::Split;
  Bareword "unified_diff" not allowed while "strict subs" in use at t/Align/SimpleAlign.t line 816.
  Execution of t/Align/SimpleAlign.t aborted due to compilation errors.
  # Looks like you planned 206 tests but ran 5.
  # Looks like your test exited with 255 just after 5.
  Dubious, test returned 255 (wstat 65280, 0xff00)
  Failed 201/206 subtests
  [08:08:53]

  Test Summary Report
  -------------------
  t/Align/SimpleAlign.t (Wstat: 65280 Tests: 5 Failed: 0)
    Non-zero exit status: 255
    Parse errors: Bad plan.  You planned 206 tests but ran 5.
  Files=1, Tests=5,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.10 cusr  0.01 csys =  0.13 CPU)
  Result: FAIL

I can't find any mention of "unified_diff" anywhere else in the
checkout.

This is on Perl v5.14.2, but the line was introduced in 2011 (0a959bac),
so.

Searching a little, it seems that "use Test::Differences;" is somehow
missing from t/Align/SimpleAlign.t - or is running 'prove' on the .t
files not the correct way to run the tests?


  Best regards,

    Adam

--
                                                          Adam Sjøgren
                                                    [hidden email]

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: t/Align/SimpleAlign.t: Bareword "unified_diff" not allowed while "strict subs" in use

Adam Sjøgren-2
Adam writes:

> Searching a little, it seems that "use Test::Differences;" is somehow
> missing from t/Align/SimpleAlign.t - or is running 'prove' on the .t
> files not the correct way to run the tests?

Ok, Test::Differences is included in lib/Bio/Root/Test.pm, so you just
have to check out Bio-Root as well, them you can run prove like this:

  $ prove -I../Bio-Root/lib/ -v --timer -r t/Align/SimpleAlign.t

I guess moving Bio::Root out of bioperl-live is in work in progress,
which is why README, INSTALL, DEPENDENCIES et al don't mention this.


  Best regards,

    Adam

--
                                                          Adam Sjøgren
                                                    [hidden email]

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: EMBL file with space before quoted, multi-line qualifier value

Adam Sjøgren-2
In reply to this post by Fields, Christopher J
Christopher writes:

> As long as this passes current tests I don’t have a problem with
> adding it in. I would suggest adding a simple test case for it; you
> could modify a current EMBL file in the data directory if needed for a
> test case, probably no need to add a new file.

None of the tests fail on my machine after adding the patch.

I have created a pull request, including a test, now:

 * https://github.com/bioperl/bioperl-live/pull/111

It looked cleaner to me to add the simple embl test file, than to add a
"bogus" feature to one of the existing files, but I can of course change
that, if that is preferred.


  Best regards,

    Adam

--
                                                          Adam Sjøgren
                                                    [hidden email]

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: t/Align/SimpleAlign.t: Bareword "unified_diff" not allowed while "strict subs" in use

Fields, Christopher J
In reply to this post by Adam Sjøgren-2
On Apr 10, 2015, at 3:12 AM, Adam Sjøgren <[hidden email]> wrote:

>
> Adam writes:
>
>> Searching a little, it seems that "use Test::Differences;" is somehow
>> missing from t/Align/SimpleAlign.t - or is running 'prove' on the .t
>> files not the correct way to run the tests?
>
> Ok, Test::Differences is included in lib/Bio/Root/Test.pm, so you just
> have to check out Bio-Root as well, them you can run prove like this:
>
>  $ prove -I../Bio-Root/lib/ -v --timer -r t/Align/SimpleAlign.t
>
> I guess moving Bio::Root out of bioperl-live is in work in progress,
> which is why README, INSTALL, DEPENDENCIES et al don't mention this.
>
>
>  Best regards,
>
>    Adam

Yes, I’ve been giving serious thought about moving the root modules back into master and focusing on a top-down approach for removing code (which has worked better IMHO).

chris

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: EMBL file with space before quoted, multi-line qualifier value

Fields, Christopher J
In reply to this post by Adam Sjøgren-2
On Apr 10, 2015, at 3:35 AM, Adam Sjøgren <[hidden email]> wrote:

>
> Christopher writes:
>
>> As long as this passes current tests I don’t have a problem with
>> adding it in. I would suggest adding a simple test case for it; you
>> could modify a current EMBL file in the data directory if needed for a
>> test case, probably no need to add a new file.
>
> None of the tests fail on my machine after adding the patch.
>
> I have created a pull request, including a test, now:
>
> * https://github.com/bioperl/bioperl-live/pull/111
>
> It looked cleaner to me to add the simple embl test file, than to add a
> "bogus" feature to one of the existing files, but I can of course change
> that, if that is preferred.
>
>
>  Best regards,
>
>    Adam

Works for me, thanks for the fix Adam!  I just merged the pull into master.

chris

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: t/Align/SimpleAlign.t: Bareword "unified_diff" not allowed while "strict subs" in use

Brian Osborne-2
In reply to this post by Fields, Christopher J
Chris,

Yes, please do. Installation is complex enough for newcomers with Bio::Root in master, a big +1 from me on this.

BIO

On Apr 10, 2015, at 12:24 PM, Fields, Christopher J <[hidden email]> wrote:

Yes, I’ve been giving serious thought about moving the root modules back into master and focusing on a top-down approach for removing code (which has worked better IMHO).


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: EMBL file with space before quoted, multi-line qualifier value

Adam Sjøgren
In reply to this post by Fields, Christopher J
Christopher writes:

> Works for me, thanks for the fix Adam! I just merged the pull into
> master.

Great - thanks for the feedback!


  Best regards,

    Adam

--
 "XHTML was history over five years ago. It's time you        Adam Sjøgren
  XML fanatics accept that HTML is a different language  [hidden email]
  with different rules and stop this incessant
  kvetching."

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: t/Align/SimpleAlign.t: Bareword "unified_diff" not allowed while "strict subs" in use

Fields, Christopher J
In reply to this post by Brian Osborne-2
That’s probably all the incentive I need.  If I don’t hear any dissent, I’ll start working on reintegration of Bio::Root into master in the next few weeks.

chris

On Apr 10, 2015, at 11:56 AM, Brian Osborne <[hidden email]> wrote:

Chris,

Yes, please do. Installation is complex enough for newcomers with Bio::Root in master, a big +1 from me on this.

BIO

On Apr 10, 2015, at 12:24 PM, Fields, Christopher J <[hidden email]> wrote:

Yes, I’ve been giving serious thought about moving the root modules back into master and focusing on a top-down approach for removing code (which has worked better IMHO).



_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: t/Align/SimpleAlign.t: Bareword "unified_diff" not allowed while "strict subs" in use

Brian Osborne-2
Chris,

Tell me if you want assistance, happy to help.

BIO


On Apr 10, 2015, at 10:18 PM, Fields, Christopher J <[hidden email]> wrote:

That’s probably all the incentive I need.  If I don’t hear any dissent, I’ll start working on reintegration of Bio::Root into master in the next few weeks.

chris

On Apr 10, 2015, at 11:56 AM, Brian Osborne <[hidden email]> wrote:

Chris,

Yes, please do. Installation is complex enough for newcomers with Bio::Root in master, a big +1 from me on this.

BIO

On Apr 10, 2015, at 12:24 PM, Fields, Christopher J <[hidden email]> wrote:

Yes, I’ve been giving serious thought about moving the root modules back into master and focusing on a top-down approach for removing code (which has worked better IMHO).




_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: t/Align/SimpleAlign.t: Bareword "unified_diff" not allowed while "strict subs" in use

Fields, Christopher J
All comes down to whether we want the easy way or not (e.g. copy them back with or w/o history since the move out).  

chris

On Apr 10, 2015, at 9:26 PM, Brian Osborne <[hidden email]> wrote:

Chris,

Tell me if you want assistance, happy to help.

BIO


On Apr 10, 2015, at 10:18 PM, Fields, Christopher J <[hidden email]> wrote:

That’s probably all the incentive I need.  If I don’t hear any dissent, I’ll start working on reintegration of Bio::Root into master in the next few weeks.

chris

On Apr 10, 2015, at 11:56 AM, Brian Osborne <[hidden email]> wrote:

Chris,

Yes, please do. Installation is complex enough for newcomers with Bio::Root in master, a big +1 from me on this.

BIO

On Apr 10, 2015, at 12:24 PM, Fields, Christopher J <[hidden email]> wrote:

Yes, I’ve been giving serious thought about moving the root modules back into master and focusing on a top-down approach for removing code (which has worked better IMHO).





_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Loading...