Bundle-WWW-Scraper-Job
lib/WWW/Scraper/JustTechJobs.pm
</TABLE>
</BODY>
</HTML>
This scaffold describes the relevant skeleton of an HTML document; there are HTML and BODY elements, of course.
Then the <TABLE> entry tells our parser to skip to the TABLE in the HTML named "name", or to skip "number" TABLE entries
(default=0, which picks up the first TABLE element).
Then the TABLE is described. The first <TR> is described as a "header" row.
The parser throws that one away. The second <TR> is a "detail" row (the "*" means multiple detail rows, of course).
The parser picks up each <TD> element, extracts its content, and places that in the hash entry corresponding to its
BIND= attribute. Thus, the first TD goes into $result->_elem('title')
(I needed to learn to use LWP::MemberMixin. Thanks, another lesson learned!)
The second TD goes into $result->_elem('description'), etc.
(Of course, some of these are _elem_array, but these details will be resolved later).
The PARSE= in the url TD suggests a way for our parser to do special handling of a data element.
The generic scaffold parser would take this XML and convert it to a hash/array to be processed at run time;
we wouldn't actually use XML at run time. A backend author would use that hash/array in his native_setup_search() code,
calling the "scaffolder" scanner with that hash as a parameter.
As I said, this works great if the response is TABLE structured,
but I haven't seen any responses that aren't that way already.
This converts to an array tree that looks like this:
my $scaffold = [ 'HTML',
[ [ 'BODY',
[ [ 'TABLE', 'name' , # or 'name' = undef; multiple <TABLE number=n> mean n 'TABLE's here ,
[ [ 'NEXT', 1, 'NEXT >' ] , # meaning how to find the NEXT button.
[ 'TR', 1 ] , # meaning "header".
[ 'TR', 2 , # meaning "detail*"
[ [ 'TD', 1, 'title' ] , # meaning clear text binding to _elem('title').
[ 'TD', 1, 'description' ] ,
[ 'TD', 1, 'location' ] ,
[ 'TD', 2, 'url' ] # meaning anchor parsed text binding to _elem('url').
]
] ]
] ]
] ]
];
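
To make the shape of that tree concrete, here is a minimal sketch of walking it. It assumes the convention the listing above suggests: each node is [ TAG, arguments..., optional array of child nodes ], and a TD node carries a parse mode followed by the name of the field it binds to. The collect_bindings() helper is hypothetical; it is not part of WWW::Scraper and is not the scaffolder itself.

```perl
use strict;
use warnings;

# Same tree as above, re-indented so the nesting is visible.
my $scaffold = [ 'HTML',
    [ [ 'BODY',
        [ [ 'TABLE', 'name',
            [ [ 'NEXT', 1, 'NEXT >' ],
              [ 'TR', 1 ],
              [ 'TR', 2,
                [ [ 'TD', 1, 'title' ],
                  [ 'TD', 1, 'description' ],
                  [ 'TD', 1, 'location' ],
                  [ 'TD', 2, 'url' ],
                ]
              ] ]
        ] ]
    ] ]
];

# Hypothetical helper (an assumption for illustration only):
# recursively gather every TD binding found in the tree.
sub collect_bindings {
    my ($node, $found) = @_;
    my ($tag, @rest) = @$node;

    # Assumed convention: a trailing array ref holds the child nodes.
    my $children = (@rest && ref($rest[-1]) eq 'ARRAY') ? pop @rest : [];

    # For a TD node, the remaining arguments are (parse mode, field name).
    push @$found, { mode => $rest[0], field => $rest[1] } if $tag eq 'TD';

    collect_bindings($_, $found) for @$children;
    return $found;
}

my $bindings = collect_bindings($scaffold, []);
printf "%-12s parse mode %d\n", $_->{field}, $_->{mode} for @$bindings;
```

A real scaffolder would match these nodes against the parsed HTML rather than just listing them; the sketch only shows that the four TD entries can be recovered, in order, from the nested arrays.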
=cut
# JustTechJobs.com sets the 'whichTech' both in its domain name, and
# in the CGI program's location; ergo, we need this translation table.
# (NOTE: the ones with upper/lowercase still in the second term have not been verified, gdw.01.05.02)
my %JustTechJobsDirectories = (
'ACCESS' => ["http://www.JustAccessJobs.com",'jAccess j'] ,
'AS/400' => ["http://www.JustAS400Jobs.com", 'jAS/400 j'] ,
'ASP' => ["http://www.JustASPJobs.com", 'jASP j'] ,
'BAAN' => ["http://www.JustBaanJobs.com", 'jBaan j'] ,
'C/C++' => ["http://www.JustcJobs.com", 'jcj'] ,
'CAD' => ["http://www.JustCADJobs.com", 'jCAD j'] ,
'COBOL' => ["http://www.JustCOBOLJobs.com", 'jCOBOL j'] ,
'COLDFUSION' => ["http://www.JustColdFusionJobs.com", 'jColdFusion j'] ,
'CREATIVE' => ["http://www.JustCreativeJobs.com", 'jCreative j'] ,
'DB2' => ["http://www.JustDB2Jobs.com", 'jDB2 j'] ,
'DELPHI' => ["http://www.JustDelphiJobs.com", 'jDelphi j'] ,
'E-COMMERCE' => ["http://www.Juste-CommerceJobs.com", 'je-Commerce j'] ,
'ELECTRICAL ENGINEERING' => ["http://www.JustEEJobs.com", 'jElectrical Engineering j'] ,
'EMBEDDED' => ["http://www.JustEmbeddedJobs.com", 'jembj'] ,
'EXCHANGE' => ["http://www.JustExchangeJobs.com", 'jExchange j'] ,
'FOXPRO' => ["http://www.JustFoxProJobs.com", 'jFoxPro j'] ,
'HELPDESK' => ["http://www.JustHelpdeskJobs.com", 'jHelpdesk j'] ,
'INFORMIX' => ["http://www.JustInformixJobs.com", 'jInformix j'] ,
'JAVA' => ["http://www.JustJavaJobs.com", 'jjavj'] ,
'JD EDWARDS' => ["http://www.JustJDEdwardsJobs.com", 'jJD Edwards j'] ,
'MAINFRAME' => ["http://www.JustMainframeJobs.com", 'jMainframe j'] ,
'NETWARE' => ["http://www.JustNetWareJobs.com", 'jNetWare j'] ,
'NETWORKING' => ["http://www.JustNetworkingJobs.com", 'jnetj'] ,
'NOTES' => ["http://www.JustNotesJobs.com", 'jnjr'] ,
'OLAP' => ["http://www.JustOLAPJobs.com", 'jolaj'] ,
'ORACLE' => ["http://www.JustOracleJobs.com", 'jOracle j'] ,
'PDA' => ["http://www.JustPDAJobs.com", 'jPDA j'] ,
'PEOPLESOFT' => ["http://www.JustPeopleSoftJobs.com", 'jPeopleSoft j'] ,
'PERL' => ["http://www.JustPerlJobs.com", 'JSSearchJobs.asp?'] ,
'POWERBUILDER' => ["http://www.JustPowerBuilderJobs.com", 'jPowerBuilder j'] ,
'PROGRESS' => ["http://www.JustProgressJobs.com", 'jProgress j'] ,
'PROJECT MANAGER' => ["http://www.JustProjectManagerJobs.com", 'jProject Manager j'] ,
'QA' => ["http://www.JustQAJobs.com", 'jQA j'] ,
'SAP' => ["http://www.JustSAPJobs.com", 'jSAP j'] ,
'SECURITY' => ["http://www.JustSecurityJobs.com", 'jSecurity j'] ,
'SIEBEL' => ["http://www.JustSiebelJobs.com", 'jSiebel j'] ,
'SQL SERVER' => ["http://www.JustSQLServerJobs.com", 'jSQL Server j'] ,
'SYBASE' => ["http://www.JustSybaseJobs.com", 'jSybase j'] ,
'TECH SALES' => ["http://www.JustTechSalesJobs.com", 'jTech Sales j'] ,
'TECH WRITER' => ["http://www.JustTechWriterJobs.com", 'jTech Writer j'] ,
'TELEPHONY' => ["http://www.JustTelephonyJobs.com", 'jTelephony j'] ,
'UNIX' => ["http://www.JustUNIXJobs.com", 'jUNIX j'] ,
'VISUAL BASIC' => ["http://www.JustVBJobs.com", 'jVisual Basic j'] ,
'WEB' => ["http://www.JustWebJobs.com", 'jWeb j'] ,
'WINDOWS' => ["http://www.JustWindowsJobs.com", 'jWindows j'] ,
'WIRELESS' => ["http://www.JustWirelessJobs.com", 'jWireless j'] ,
'XML' => ["http://www.JustXMLJobs.com", 'jxmlj']
);
# This is the LOCA list as of 2.May.2001.
# You're welcome to keep it up to date as you wish! ;-)
my %locationList = (
'All Locations' => 'All-Locations',
'All US Locations' => 'US-All',
'Alabama-All' => 'US-AL-All',
'Alabama-Anniston' => 'US-AL-Anniston',
'Alabama-Birmingham' => 'US-AL-Birmingham',
'Alabama-Mobile/Dothan' => 'US-AL-Mobile/Dothan',
'Alabama-Montgomery' => 'US-AL-Montgomery',
'Alabama-Northern/Huntsville' => 'US-AL-Northern/Huntsville',
'Alabama-Tuscaloosa' => 'US-AL-Tuscaloosa',
'Alaska-All' => 'US-AK-All',
'Alaska-Anchorage' => 'US-AK-Anchorage',
'Alaska-Fairbanks' => 'US-AK-Fairbanks',
'Alaska-Juneau' => 'US-AK-Juneau',
'Arizona-All' => 'US-AZ-All',
'Arizona-Flagstaff' => 'US-AZ-Flagstaff',
'Arizona-Phoenix' => 'US-AZ-Phoenix',
'Arizona-Tucson' => 'US-AZ-Tucson',
'Arizona-Yuma' => 'US-AZ-Yuma',
'Arkansas-All' => 'US-AR-All',
'Arkansas-Eastern' => 'US-AR-Eastern',
'Arkansas-Little Rock' => 'US-AR-Little Rock',
'Arkansas-Western' => 'US-AR-Western',