HTML-Tree

lib/HTML/Tree/Scanning.pod

if any matches.  So, here, 

  sub {
    not $_[0]->look_down('_tag', 'img')
  }

means "return true only if this element has no 'img' element among its
descendants (and isn't an 'img' element itself)."

This correctly filters out the first "h1" that contains the ad, but it
also incorrectly filters out the second "h1" that contains a
non-advertisement photo besides the headline text you want.
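
To see that over-filtering concretely, here is a minimal sketch run
against a made-up two-headline document.  The markup, URLs, and
headline text are assumptions for illustration only, not taken from any
real press release:

  use strict;
  use warnings;
  use HTML::TreeBuilder;

  # Hypothetical sample: an ad in the first h1, and a real headline
  # (with a photo) in the second.
  my $tree = HTML::TreeBuilder->new_from_content(q{
    <html><body>
    <h1><a href="/dyna/ad?id=1"><img src="/ads/banner.gif"></a></h1>
    <h1><img src="/photos/schreck.jpg">Schreck Wins Election</h1>
    </body></html>
  });

  # List context: collect every h1 that has no img at or under it.
  my @imgless = $tree->look_down(
    '_tag', 'h1',
    sub { not $_[0]->look_down('_tag', 'img') },
  );

  # Both h1s contain an img, so nothing survives the filter.
  print scalar(@imgless), " h1 element(s) passed the no-img test\n";

  $tree->delete;  # free the tree when done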

There clearly are detectable differences between the first and second
"h1" elements -- only the second one contains the string "Schreck", and
we could just test for that:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      $_[0]->as_text =~ m{Schreck}
    }
  );

And that works fine for this one example, but unless all thousand of
your press releases have "Schreck" in the headline, that's just not a
general solution.  However, if all the ads-in-"h1"s that you want to
exclude involve a link whose URL involves "/dyna/", then you can use
that:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $link = $_[0]->look_down('_tag','a');
      return 1 unless $link;
        # no link means it's fine
      return 0 if ($link->attr('href') || '') =~ m{/dyna/};
        # a link to there is bad
      return 1; # otherwise okay
    }
  );

Or you can look at it another way and say that you want the first "h1"
element that either contains no images, or else whose image has a "src"
attribute whose value contains "/photos/":

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $img = $_[0]->look_down('_tag','img');
      return 1 unless $img;
        # no image means it's fine
      return 1 if ($img->attr('src') || '') =~ m{/photos/};
        # good if a photo
      return 0; # otherwise bad
    }
  );
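
As a rough check that the two formulations really pick the same
element, the sketch below runs both against a small made-up document.
The sample markup and URLs are assumptions for illustration; the
C<|| ''> guards are there only so that an element lacking the attribute
doesn't trigger an "uninitialized value" warning:

  use strict;
  use warnings;
  use HTML::TreeBuilder;

  # Hypothetical sample: ad headline first, real headline second.
  my $tree = HTML::TreeBuilder->new_from_content(q{
    <html><body>
    <h1><a href="/dyna/ad?id=1"><img src="/ads/banner.gif"></a></h1>
    <h1><img src="/photos/schreck.jpg">Schreck Wins Election</h1>
    </body></html>
  });

  # Criterion 1: reject an h1 whose first link points into /dyna/.
  my $by_link = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $link = $_[0]->look_down('_tag', 'a');
      return 1 unless $link;
      return 0 if ($link->attr('href') || '') =~ m{/dyna/};
      return 1;
    },
  );

  # Criterion 2: accept an h1 with no image, or with a /photos/ image.
  my $by_img = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $img = $_[0]->look_down('_tag', 'img');
      return 1 unless $img;
      return 1 if ($img->attr('src') || '') =~ m{/photos/};
      return 0;
    },
  );

  die "No acceptable h1 found" unless $by_link and $by_img;
  print "By link:  ", $by_link->as_text, "\n";
  print "By image: ", $by_img->as_text, "\n";

  $tree->delete;

Which criterion is the safer bet depends on which regularity you trust
more across the whole set of files: the shape of the ad links, or the
location of the legitimate photos.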

Recall that this use of C<look_down> in a scalar context means to return
the first element at or under C<$tree> that matches all the criteria.
But if you notice that you can formulate criteria that'll match several
possible "h1" elements, some of which may be bogus but the I<last> one
of which is always the one you want, then you can use C<look_down> in a
list context, and just use the last element of that list:

  my @h1s = $tree->look_down(
    '_tag', 'h1',
    # ...maybe more criteria...
  );
  die "What, no h1s here?" unless @h1s;
  my $real_h1 = $h1s[-1]; # last or only
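
The difference between the two calling contexts can be sketched with a
made-up document containing three "h1" elements, of which only the last
is the real headline (the headline texts here are invented for
illustration):

  use strict;
  use warnings;
  use HTML::TreeBuilder;

  # Hypothetical sample: two bogus h1s followed by the real one.
  my $tree = HTML::TreeBuilder->new_from_content(q{
    <html><body>
    <h1>Sponsored: Buy Widgets</h1>
    <h1>Site Banner</h1>
    <h1>Actual Headline</h1>
    </body></html>
  });

  my $first = $tree->look_down('_tag', 'h1');  # scalar: first match only
  my @h1s   = $tree->look_down('_tag', 'h1');  # list: all matches, in order
  die "What, no h1s here?" unless @h1s;

  print "Scalar context gave: ", $first->as_text,   "\n";
  print "Last of the list:    ", $h1s[-1]->as_text, "\n";

  $tree->delete;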

=head2 A Case Study: Scanning Yahoo News's HTML

The above (somewhat contrived) case involves extracting data from a
bunch of pre-existing HTML files.  In that sort of situation, if your
code works for all the files, then you know that the code I<works> --
since the data it's meant to handle won't go changing or growing; and,
typically, once you've used the program, you'll never need to use it
again.

The other kind of situation faced in many data extraction tasks is
where the program is run repeatedly to handle new data -- such as
from ever-changing Web pages.  As a real-world example of this,
consider a program that you could use (suppose it's crontabbed) to
extract headline-links from subsections of Yahoo News
(C<http://dailynews.yahoo.com/>).

Yahoo News has several subsections:

=over

=item http://dailynews.yahoo.com/h/tc/ for technology news

=item http://dailynews.yahoo.com/h/sc/ for science news

=item http://dailynews.yahoo.com/h/hl/ for health news

=item http://dailynews.yahoo.com/h/wl/ for world news

=item http://dailynews.yahoo.com/h/en/ for entertainment news

=back

and others.  All of them are built on the same basic HTML template --
and a scarily complicated template it is, especially when you look at
it with an eye toward making up rules that will select where the real
headline-links are, while screening out all the links to other parts of
Yahoo, other news services, etc.  You will need to puzzle
over the HTML source, and scrutinize the output of
C<$tree-E<gt>dump> on the parse tree of that HTML.
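
As a minimal sketch of that kind of inspection, the following parses a
tiny stand-in for a news page and prints both the C<dump> outline and
every link found.  The markup and link targets are invented for
illustration and are far simpler than the real template:

  use strict;
  use warnings;
  use HTML::TreeBuilder;

  # Hypothetical, heavily simplified stand-in for a news page.
  my $tree = HTML::TreeBuilder->new_from_content(q{
    <html><body>
    <table><tr>
      <td><a href="/h/tc/">Technology</a></td>
      <td><a href="/h/tc/story1.html">Some headline</a></td>
    </tr></table>
    </body></html>
  });

  # Print an indented outline of the parse tree to STDOUT.
  $tree->dump;

  # List every link, so you can see which ones your rules must exclude.
  my @links = $tree->look_down('_tag', 'a');
  for my $link (@links) {
    printf "%-20s => %s\n", $link->as_text, $link->attr('href') || '';
  }

  $tree->delete;

Reading the two listings side by side is usually the quickest way to
spot what distinguishes a headline-link from a navigation link.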

Sometimes the only way to pin down what you're after is by position in
the tree. For example, headlines of interest may be in the third
column of the second row of the second table element in a page:

  my $table = ( $tree->look_down('_tag','table') )[1];
  my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
  my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
  # ...then do things with $col3...
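
Positional selection fails silently (with an undef) whenever the page
has fewer tables, rows, or columns than expected, so for a program that
runs unattended it may be worth checking each step.  Here is one
possible sketch; C<nth_tag> is a hypothetical helper written for this
example, not part of HTML::Element, and the sample markup is made up:

  use strict;
  use warnings;
  use HTML::TreeBuilder;

  # Hypothetical helper: return the Nth (0-based) element with the
  # given tag at or under $elem, or die with a readable message.
  sub nth_tag {
    my ($elem, $tag, $n) = @_;
    my @found = $elem->look_down('_tag', $tag);
    die "Wanted '$tag' number " . ($n + 1)
      . " but found only " . scalar(@found) . "\n"
      unless @found > $n;
    return $found[$n];
  }

  # Hypothetical sample: a navigation table, then the content table.
  my $tree = HTML::TreeBuilder->new_from_content(q{
    <html><body>
    <table><tr><td>nav</td></tr></table>
    <table>
      <tr><td>a1</td><td>a2</td><td>a3</td></tr>
      <tr><td>b1</td><td>b2</td><td>headlines here</td></tr>
    </table>
    </body></html>
  });

  my $table = nth_tag($tree,  'table', 1);  # second table
  my $row2  = nth_tag($table, 'tr',    1);  # its second row
  my $col3  = nth_tag($row2,  'td',    2);  # that row's third column

  print "Found: ", $col3->as_text, "\n";

  $tree->delete;

Note that C<look_down> searches all descendants, so if tables are
nested, the counts include the inner tables' rows and cells too.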


