Bundle-WWW-Scraper-Job


lib/WWW/Scraper/JustTechJobs.pm

        </TABLE>
</BODY>
</HTML>

This scaffold describes the relevant skeleton of an HTML document; there are HTML and BODY elements, of course.
Then the <TABLE> entry tells our parser to skip to the TABLE in the HTML named "name", or to skip "number" TABLE entries
(default=0, which picks up the first TABLE element).
Then the TABLE is described. The first <TR> is described as a "header" row;
the parser throws that one away. The second <TR> is a "detail" row (the "*" means multiple detail rows, of course).
The parser picks up each <TD> element, extracts its content, and places that in the hash entry corresponding to its
BIND= attribute. Thus, the first TD goes into $result->_elem('title')
(I needed to learn to use LWP::MemberMixin. Thanks, another lesson learned!)
The second TD goes into $result->_elem('description'), etc.
(Of course, some of these are _elem_array, but these details will be resolved later.)
The PARSE= attribute in the url TD suggests a way for our parser to do special handling of a data element.
The generic scaffold parser would take this XML and convert it to a hash/array to be processed at run time;
we wouldn't actually use XML at run time. A backend author would use that hash/array in his native_setup_search() code,
calling the "scaffolder" scanner with that hash as a parameter.

As I said, this works great if the response is TABLE structured,
but I haven't seen any responses that aren't that way already.

This converts to an array tree that looks like this:

    my $scaffold = [ 'HTML', 
                     [ [ 'BODY', 
                       [ [ 'TABLE', 'name' ,                  # or 'name' = undef; multiple <TABLE number=n> mean n 'TABLE's here ,
                         [ [ 'NEXT', 1, 'NEXT &gt;' ] ,       # meaning how to find the NEXT button.
                           [ 'TR', 1 ] ,                      # meaning "header".
                           [ 'TR', 2 ,                        # meaning "detail*"
                             [ [ 'TD', 1, 'title' ] ,         # meaning clear text binding to _elem('title').
                               [ 'TD', 1, 'description' ] ,
                               [ 'TD', 1, 'location' ] ,
                               [ 'TD', 2, 'url' ]             # meaning anchor parsed text binding to _elem('url').
                             ]
                         ] ]
                       ] ]
                     ] ]
                  ];
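
To make the shape of that tree concrete, here is a minimal sketch of how a scanner might walk it
and collect the BIND names of the TD entries. The scaffold_bindings() helper is hypothetical
(it is not part of WWW::Scraper), and it assumes that every node is [ tag, args..., [ children ] ]
and that a TD node is [ 'TD', parse mode, bind name ], as in the tree above.

```perl
    use strict;
    use warnings;

    # Hypothetical helper, for illustration only: walk a scaffold tree and
    # return the bind names of its TD entries in document order.
    sub scaffold_bindings {
        my ($node) = @_;
        my ($tag, @rest) = @$node;
        my @names;

        # Assumed layout of a TD node: [ 'TD', $parse_mode, $bind_name ].
        push @names, $rest[1] if $tag eq 'TD';

        # Any array-ref element of a node is taken to be its list of children.
        for my $children (grep { ref($_) eq 'ARRAY' } @rest) {
            push @names, scaffold_bindings($_) for @$children;
        }
        return @names;
    }

    # The same tree as above, without the comments.
    my $scaffold = [ 'HTML',
                     [ [ 'BODY',
                       [ [ 'TABLE', 'name',
                         [ [ 'NEXT', 1, 'NEXT &gt;' ],
                           [ 'TR', 1 ],
                           [ 'TR', 2,
                             [ [ 'TD', 1, 'title' ],
                               [ 'TD', 1, 'description' ],
                               [ 'TD', 1, 'location' ],
                               [ 'TD', 2, 'url' ]
                             ]
                         ] ]
                       ] ]
                     ] ]
                  ];

    my @bindings = scaffold_bindings($scaffold);
    print join(', ', @bindings), "\n";    # prints "title, description, location, url"
```

A real scanner would also have to act on the TABLE name, the NEXT entry and the parse mode;
this sketch only shows that the nesting can be traversed with one short recursive routine.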
 

=cut                     

    # JustTechJobs.com sets the 'whichTech' both in its domain name, and
    #   in the CGI program's location; ergo, we need this translation table.
    # (NOTE: the ones with upper/lowercase still in the second term have not been verified, gdw.01.05.02)
    my %JustTechJobsDirectories = (
            'ACCESS' => ["http://www.JustAccessJobs.com",'jAccess j'] ,
            'AS/400' => ["http://www.JustAS400Jobs.com", 'jAS/400 j'] ,
            'ASP' => ["http://www.JustASPJobs.com", 'jASP j'] ,
            'BAAN' => ["http://www.JustBaanJobs.com", 'jBaan j'] ,
            'C/C++' => ["http://www.JustcJobs.com", 'jcj'] ,
            'CAD' => ["http://www.JustCADJobs.com", 'jCAD j'] ,
            'COBOL' => ["http://www.JustCOBOLJobs.com", 'jCOBOL j'] ,
            'COLDFUSION' => ["http://www.JustColdFusionJobs.com", 'jColdFusion j'] ,
            'CREATIVE' => ["http://www.JustCreativeJobs.com", 'jCreative j'] ,
            'DB2' => ["http://www.JustDB2Jobs.com", 'jDB2 j'] ,
            'DELPHI' => ["http://www.JustDelphiJobs.com", 'jDelphi j'] ,
            'E-COMMERCE' => ["http://www.Juste-CommerceJobs.com", 'je-Commerce j'] ,
            'ELECTRICAL ENGINEERING' => ["http://www.JustEEJobs.com", 'jElectrical Engineering j'] ,
            'EMBEDDED' => ["http://www.JustEmbeddedJobs.com", 'jembj'] ,
            'EXCHANGE' => ["http://www.JustExchangeJobs.com", 'jExchange j'] ,
            'FOXPRO' => ["http://www.JustFoxProJobs.com", 'jFoxPro j'] ,
            'HELPDESK' => ["http://www.JustHelpdeskJobs.com", 'jHelpdesk j'] ,
            'INFORMIX' => ["http://www.JustInformixJobs.com", 'jInformix j'] ,
            'JAVA' => ["http://www.JustJavaJobs.com", 'jjavj'] ,
            'JD EDWARDS' => ["http://www.JustJDEdwardsJobs.com", 'jJD Edwards j'] ,
            'MAINFRAME' => ["http://www.JustMainframeJobs.com", 'jMainframe j'] ,
            'NETWARE' => ["http://www.JustNetWareJobs.com", 'jNetWare j'] ,
            'NETWORKING' => ["http://www.JustNetworkingJobs.com", 'jnetj'] ,
            'NOTES' => ["http://www.JustNotesJobs.com", 'jnjr'] ,
            'OLAP' => ["http://www.JustOLAPJobs.com", 'jolaj'] ,
            'ORACLE' => ["http://www.JustOracleJobs.com", 'jOracle j'] ,
            'PDA' => ["http://www.JustPDAJobs.com", 'jPDA j'] ,
            'PEOPLESOFT' => ["http://www.JustPeopleSoftJobs.com", 'jPeopleSoft j'] ,
            'PERL' => ["http://www.JustPerlJobs.com", 'JSSearchJobs.asp?'] ,
            'POWERBUILDER' => ["http://www.JustPowerBuilderJobs.com", 'jPowerBuilder j'] ,
            'PROGRESS' => ["http://www.JustProgressJobs.com", 'jProgress j'] ,
            'PROJECT MANAGER' => ["http://www.JustProjectManagerJobs.com", 'jProject Manager j'] ,
            'QA' => ["http://www.JustQAJobs.com", 'jQA j'] ,
            'SAP' => ["http://www.JustSAPJobs.com", 'jSAP j'] ,
            'SECURITY' => ["http://www.JustSecurityJobs.com", 'jSecurity j'] ,
            'SIEBEL' => ["http://www.JustSiebelJobs.com", 'jSiebel j'] ,
            'SQL SERVER' => ["http://www.JustSQLServerJobs.com", 'jSQL Server j'] ,
            'SYBASE' => ["http://www.JustSybaseJobs.com", 'jSybase j'] ,
            'TECH SALES' => ["http://www.JustTechSalesJobs.com", 'jTech Sales j'] ,
            'TECH WRITER' => ["http://www.JustTechWriterJobs.com", 'jTech Writer j'] ,
            'TELEPHONY' => ["http://www.JustTelephonyJobs.com", 'jTelephony j'] ,
            'UNIX' => ["http://www.JustUNIXJobs.com", 'jUNIX j'] ,
            'VISUAL BASIC' => ["http://www.JustVBJobs.com", 'jVisual Basic j'] ,
            'WEB' => ["http://www.JustWebJobs.com", 'jWeb j'] ,
            'WINDOWS' => ["http://www.JustWindowsJobs.com", 'jWindows j'] ,
            'WIRELESS' => ["http://www.JustWirelessJobs.com", 'jWireless j'] ,
            'XML' => ["http://www.JustXMLJobs.com", 'jxmlj']
        );

# This is the LOCA list as of 2.May.2001.
# You're welcome to keep it up to date as you wish! ;-)
    my %locationList = (
        'All Locations' => 'All-Locations',
        'All US Locations' => 'US-All',
        'Alabama-All' => 'US-AL-All',
        'Alabama-Anniston' => 'US-AL-Anniston',
        'Alabama-Birmingham' => 'US-AL-Birmingham',
        'Alabama-Mobile/Dothan' => 'US-AL-Mobile/Dothan',
        'Alabama-Montgomery' => 'US-AL-Montgomery',
        'Alabama-Northern/Huntsville' => 'US-AL-Northern/Huntsville',
        'Alabama-Tuscaloosa' => 'US-AL-Tuscaloosa',
        'Alaska-All' => 'US-AK-All',
        'Alaska-Anchorage' => 'US-AK-Anchorage',
        'Alaska-Fairbanks' => 'US-AK-Fairbanks',
        'Alaska-Juneau' => 'US-AK-Juneau',
        'Arizona-All' => 'US-AZ-All',
        'Arizona-Flagstaff' => 'US-AZ-Flagstaff',
        'Arizona-Phoenix' => 'US-AZ-Phoenix',
        'Arizona-Tucson' => 'US-AZ-Tucson',
        'Arizona-Yuma' => 'US-AZ-Yuma',
        'Arkansas-All' => 'US-AR-All',
        'Arkansas-Eastern' => 'US-AR-Eastern',
        'Arkansas-Little Rock' => 'US-AR-Little Rock',
        'Arkansas-Western' => 'US-AR-Western',


