Boulder

 view release on metacpan or  search on metacpan

Boulder/Genbank.pm  view on Meta::CPAN

shortcut syntax in which you provde new() with a list of accession
numbers:

  $gb = new Boulder::Genbank('M57939','M28274','L36028');

=item newFh()

This works like new(), but returns a filehandle.  To recover each
GenBank record read from the filehandle with the <> operator:

  $fh = Boulder::GenBank->newFh('M57939','M28274','L36028');
  while ($record = <$fh>) {
     print $record->asString;
  }

=item get()

The get() method is inherited from I<Boulder::Stream>, and simply
returns the next parsed Genbank Stone, or undef if there is nothing
more to fetch.  It has the same semantics as the parent class,
including the ability to restrict access to certain top-level tags.

The object returned is a L<Stone::GB_Sequence> object, which is a
descendent of L<Stone>.

=item put()

The put() method is inherited from the parent Boulder::Stream class,
and will write the passed Stone to standard output in Boulder format.
This means that it is currently not possible to write a
Boulder::Genbank object back into Genbank flatfile form.

=back

=head2 Extended Entrez Parameters

The Entrez accessor recognizes extended parameters that allow you the
ability to customize the search.  Instead of passing a query string
scalar or a list of accession numbers as the B<-fetch> argument, pass
a hash reference.  The hashref should contain one or more of the
following keys:

=over

=item B<-query>

The Entrez query to process.

=item B<-accession>

The list of accession numbers to fetch, as an array ref.

=item B<-db>

The database to search.  This is a single-letter database code
selected from the following list:

  m  MEDLINE
  p  Protein
  n  Nucleotide
  s  Popset

=item B<-proxy>

An HTTP proxy to use.  For example:

   -proxy => http://www.firewall.com:9000

If you think you need this, get the correct URL from your system
administrator.

=back

As an example, here's how to search for ESTs from Oryza sativa that
have been entered or modified since 1999.

  my $gb = new Boulder::Genbank( -accessor=>Entrez, 
				 -query=>'Oryza sativa[Organism] AND EST[Keyword] AND 1999[MDAT]', 
                                 -db   => 'n'   
                                });

=head1 METHODS DEFINED BY THE GENBANK STONE OBJECT

Each record returned from the Boulder::Genbank stream defines a set of
methods that correspond to features and other fields in the Genbank
flat file record.  L<Stone::GB_Sequence> gives the full details, but
they are listed for reference here:

=head2 $length = $entry->length

Get the length of the sequence.

=head2 $start = $entry->start

Get the start position of the sequence, currently always "1".

=head2 $end = $entry->end

Get the end position of the sequence, currently always the same as the
length.

=head2 @feature_list = $entry->features(-pos=>[50,450],-type=>['CDS','Exon'])

features() will search the entry feature list for those features that
meet certain criteria.  The criteria are specified using the B<-pos>
and/or B<-type> argument names, as shown below.

=over 4

=item -pos

Provide a position or range of positions which the feature must
B<overlap>.  A single position is specified in this way:

   -pos => 1500;         # feature must overlap postion 1500

or a range of positions in this way:

   -pos => [1000,1500];  # 1000 to 1500 inclusive

If no criteria are provided, then features() returns all the features,

Boulder/Genbank.pm  view on Meta::CPAN

      if (my $data = $self->_getline) {
	$self->_cleanup(\$data);
	return $data;
      } elsif (!$self->{accession} || @{$self->{accession}} == 0) {  # nothing more to do
	return;
      }
    }

    die "Must provide either a list of accession numbers or an Entrez query"
      unless $self->{accession} || $self->{query};

    return unless $self->get_entries;

    my $data = $self->_getline;
    $self->_cleanup(\$data);
    return $data;
}

sub _cleanup {
  my ($self,$d) = @_;
  $$d =~ s/\A\s+//;
  $$d=~s!//\n$!!;
  return unless $self->{format} eq 'fasta';
  chomp $$d;
  substr($$d,0,0)='>' unless $$d =~/^>/;
}

sub get_accessions {
  my $self = shift;
  my $query = shift;
  my $sock    = $self->_build_connection(ENTREZ_HOST) or return;

  # bug here: assume that the server will give us everything when we ask for 1 billion entries
  my $request = $self->_build_post(ENTREZ_HOST,
				   QUERY_URI,
				   undef,
				   sprintf("db=%s&dispmax=%d&report=gen&mode=text&tool=boulder&term=%s",$self->db,$self->limit, escape($query)));

  print $sock $request;
  my $status = $self->_read_header($sock);
  return unless $status == 200;

  local $/ = ' ';
  my $line = $sock->getline;
  chomp $line;
  warn "*** ENTREZ: $line ***" unless $line =~ /^\d+$/;
  my @accessions = $line;
  while (defined ($line = $sock->getline)) {
    chomp $line;
    push @accessions,$line;
  }
  return \@accessions;
}

sub db {
  my $self = shift;
  my $db = $self->{db} || 'n';
  my $translated = { m => 'medline',
		     p => 'protein',
		     n => 'nucleotide',
		     s => 'popset' }->{$db};
  $translated || $db;
}

# BUG: one billion = infinity
sub limit {
  shift->{limit} || 1_000_000_000;
}

sub _build_connection {
  my $self = shift;
  my $host = shift;
  my ($hostent,$peer,$peerport);

  if (my $proxy = $self->{proxy}) {
    $proxy =~ m!^http://([^/]+)/?! or return;
    $hostent = $1;
  } else {
    $hostent = $host;
  }
  
  ($peer,$peerport) = split(':',$hostent);
  $peerport ||= 'http(80)';

  my $sock = IO::Socket::INET->new(
				   PeerAddr => $peer,
				   PeerPort => $peerport,
				   Proto    => 'tcp'
				  );

  $sock;
}

sub _build_post {
  my $self = shift;
  my ($host,$uri,$type,$param) = @_;

  my $path = $self->{proxy} ? "http://$host$uri" : $uri;
  $type ||= 'application/x-www-form-urlencoded';
  my $length = length($param);
  my $request = join (CRLF,
		      "POST $path ".PROTO,
		      "User-agent: Mozilla/5.0 [en] (PalmOS)",
		      "Content-Type: $type",
		      "Content-Length: $length",
		      CRLF
		      );
  $request.$param;
}

sub _read_header {
  my $self = shift;
  my $sock = shift;
  local $/ = CRLF.CRLF;
  my $header = $sock->getline;
  return 500 unless $header;
  return 500 unless $header =~ /^HTTP\/[\d.]+ (\d+)/;
  $1;
}

sub get_entries {



( run in 0.629 second using v1.01-cache-2.11-cpan-71847e10f99 )