streaming results from the CPAN

RDFStore


the Perl language environment.
By using the Perl TIE interface, a generic application script can access RDF
triplets using normal key/value hashes; the data can be stored either in
in-memory data structures (not tied) or on the local filesystem by using the
DB_File.pm or BerkeleyDB.pm modules. An experimental remote storage service
is also provided using a custom DBMS.pm module coupled with a fast TCP/IP
daemon (http://rdfstore.sourceforge.net/dbms.html). The daemon is written
entirely in C and stores the data in Berkeley DB v1.x files; this software is
similar to the rdfdb (http://web1.guha.com/rdfdb/) approach from Guha. The
input RDF files are parsed and processed by a streaming SiRPAC-like parser
written completely in Perl. This implementation includes most of the proposed
bug fixes and updates suggested on the W3C RDF-interest-Group mailing list and
on the SiRPAC Web site. A strawman parser for a simplified syntax proposed
by Jonathan Borden at http://www.openhealth.org/RDF/rdf_Syntax_and_Names.htm,
Jason Diamond at http://www.injektilo.org/rdf/rdf.xsl and Dan Connolly at
http://www.w3.org/XML/2000/04rdf-parse/ is also included. By using the Sablotron
XSLT engine it is then possible to easily transform XML documents to RDF and
query them from the Perl language.
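
As a rough illustration of the TIE approach described above, the following
minimal sketch ties an ordinary hash to a toy in-memory class. The class name
ToyTripleStore and the "subject predicate" key layout are hypothetical and are
not part of RDFStore; they only show how normal hash syntax can be routed to a
custom storage back end.

```perl
use strict;
use warnings;

# Hypothetical toy back end (NOT an RDFStore class): keeps everything in a
# plain in-memory hash, keyed by "subject predicate", with the object as value.
package ToyTripleStore;

sub TIEHASH  { my ($class) = @_; return bless { data => {} }, $class; }
sub STORE    { my ($self, $key, $value) = @_; $self->{data}{$key} = $value; }
sub FETCH    { my ($self, $key) = @_; return $self->{data}{$key}; }
sub EXISTS   { my ($self, $key) = @_; return exists $self->{data}{$key}; }
sub DELETE   { my ($self, $key) = @_; return delete $self->{data}{$key}; }
sub FIRSTKEY {
    my ($self) = @_;
    my $reset = keys %{ $self->{data} };    # reset the internal iterator
    return each %{ $self->{data} };
}
sub NEXTKEY  { my ($self) = @_; return each %{ $self->{data} }; }

package main;

# The application only sees a normal hash; every access goes through the class.
tie my %triples, 'ToyTripleStore';
$triples{'http://example.org/doc dc:title'}   = 'An example';
$triples{'http://example.org/doc dc:creator'} = 'Somebody';

print "$_ => $triples{$_}\n" for sort keys %triples;
```

Swapping the toy class for a file-backed one would leave the application code
using %triples unchanged, which is the point of the TIE interface.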

INSTALLATION

doc/SWADe-rdfstore.html

<H2>The compression algorithm</H2>

Both the graph and the free-text word index are relatively sparsely populated, which makes simple compression possible. The bit arrays used in each can grow to very significant sizes; in the order of several, if not tens of page multiples. Comb...
Initially a Run Length Encoding method was used, with two small optimizations. The first optimization was early termination; i.e. if the remainder of the row would solely contain zeros it would simply not list those explicitly. The second optimizati...
The first issue is that certain values, such as a reference to a schema or a common property, are disproportionately over-represented, by several orders of magnitude (e.g. the rdf:type property or contextual information). Secondly certain other values; suc...
For this reason a variant of Variable Run Length encoding is used along with part of the above RLE method. This method is still applicable to the word indexing but adds the ability to recognize short patterns; and code the patterns which occur...
At this point (de-)compression keeps storage volumes reasonable and transfer volumes manageable, and we do not expect to give priority to work in this area. However, we expect to examine this issue again and will be looking...
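
The early-termination idea mentioned above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not RDFStore's actual on-disk format: rows are modelled as strings of '0' and '1', runs are stored as alternating lengths starting with a run of zeros, and the sub names rle_encode and rle_decode are hypothetical.

```perl
use strict;
use warnings;

# Encode a row of bits (a string of '0'/'1') as alternating run lengths,
# starting with a run of zeros. Trailing zeros are never listed.
sub rle_encode {
    my ($bits) = @_;
    $bits =~ s/0+\z//;                      # early termination: drop trailing zeros
    my @runs;
    my $expected = '0';
    while (length $bits) {
        $bits =~ s/\A(\Q$expected\E*)//;    # consume the (possibly empty) run
        push @runs, length $1;
        $expected = $expected eq '0' ? '1' : '0';
    }
    return @runs;
}

# Rebuild a row of the given width; bits not covered by the runs are zeros.
sub rle_decode {
    my ($width, @runs) = @_;
    my $bits = '';
    my $bit  = '0';
    for my $len (@runs) {
        $bits .= $bit x $len;
        $bit = $bit eq '0' ? '1' : '0';
    }
    return $bits . ('0' x ($width - length $bits));
}

print join(',', rle_encode('0001100000')), "\n";    # prints 3,2
print rle_decode(10, 3, 2), "\n";                   # prints 0001100000
```

Because the decoder pads with zeros up to the known row width, a sparse row whose set bits are all near the start costs only a few run lengths, regardless of how long the row is.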

<H2>Conclusion: RDFStore</H2>

RDFStore <a href="#47">[47]</a> is a Perl/C toolkit to process, store, retrieve and manage RDF; it consists of a programming API, streaming RDF/XML and N-Triples parsers and a generic hashed data storage which implements the indexing algorithm as des...
<BR>
RDFStore has been successfully used for the development of several Semantic Web applications <a href="#16a">[16a]</a><a href="#16b">[16b]</a><a href="#16c">[16c]</a> and others which read/write and query RDF descriptions using RDQL.

<H2>References</H2>

<a name="1">[1]</a> "A Relational Model of Data for Large Shared Data Banks", E.F. Codd, Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387. <a href="http://www.acm.org/classics/nov95/toc.html">http://www.acm.org/classics/nov95/toc.htm...
<a name="2">[2]</a> P. Buneman, S. Davidson, G. Hillebrand and D. Suciu, "A query language and optimization techniques for unstructured data". In SIGMOD, San Diego, 1996<BR>
<a name="3">[3]</a> S. Abiteboul, D. Quass, J. McHugh, J. Widom and J. Wiener "The lorel query language for semistructured data" 1996 ftp://db.stanford.edu/pub/papers/lorel96.ps<BR>
<a name="4">[4]</a> Dan Brickley, R.V. Guha "RDF Vocabulary Description Language 1.0: RDF Schema" <a href="http://www.w3.org/TR/rdf-schema/">http://www.w3.org/TR/rdf-schema/</a><BR>
<a name="5">[5]</a> Grady Booch "Object-Oriented Analysis and Design with Applications" p. 71-72<BR>

lib/DBD/RDFStore.pm

# i.e. $sth->{'result'} = ( '?x' => 1, '?y' => Test1 )
#
sub _nextMatch {
        my( $sth, $rpi, $gp, $tpi, %bind ) = @_;

	if($DBD::RDFStore::st::debug>1) {
		print STDERR (" " x $tpi);
		print STDERR "$tpi BEGIN\n";	
		};

	# if we have a previous state try to recover it (this is needed for streaming results)
	my $bind_state = pop @{ $sth->{'binds'} };

	if(	( $bind_state ) && ($DBD::RDFStore::st::debug>1) ) {
		print STDERR (" " x $tpi);
		print STDERR "RECOVER previous state for $tpi\n";
		};

	_nextMatch( $sth, $rpi, $gp, $tpi+1, %{$bind_state} )
		if( $bind_state );

lib/DBD/RDFStore.pm

        my($sth) = @_;

	if($sth->{'RDF_or_XML_stream_finished'}) {
		$sth->{'RDF_or_XML_stream_finished'} = 0;
		return;
		};

	return _fetchrow_RDF_or_XML( $sth );
	};

# fetch the whole matching graph in one call (i.e. not streaming)
# returns an RDFStore::Model of the matching statements, or undef if nothing matched
sub fetchallgraph {
        my($sth) = @_;

	my $whole_graph;
	while ( my $graph = fetchsubgraph($sth) ) {
		# the first subgraph becomes the accumulator; do not add it to itself
		unless($whole_graph) {
			$whole_graph = $graph;
			next;
			};
		my $e = $graph->elements;
		while(my $ss = $e->each) {
			$whole_graph->add($ss);
			};
		};

	return $whole_graph;
	};

# should be streaming
sub _fetchrow_RDF_or_XML {
        my($sth, $syntax) = @_;

	return
		if($sth->{'RDF_or_XML_stream_finished'});

	unless($syntax) {
		$syntax = $sth->{'results'}->{'syntax'}
			if(exists $sth->{'results'}->{'syntax'});
		};

	return
		unless(	(!$syntax) ||
			($syntax =~ m#(RDF/XML|N-Triples|dawg-results|rdf-for-xml|dawg-xml)#i) );

	my $result = '';

	my $mm = new RDFStore::Model; # we want streaming - that's why this...

	# DESCRIBE <URI> are done once in one single subgraph / match
	if(	( $sth->{'Statement'}->getQueryType eq 'DESCRIBE' ) &&
		( grep m/^<([^>]+)>/, @{ $sth->{'Statement'}->{'describes'} }) ) {
		foreach my $d ( @{ $sth->{'Statement'}->{'describes'} } ) {
			next
				unless($d =~ m/^<([^>]+)>/);

			$d = $1;

lib/RDFStore/Parser/NTriples.pm

	return 'genid' . $class->{iReificationCounter}++;
	};

1;
};

__END__

=head1 NAME

RDFStore::Parser::NTriples - This module implements a streaming N-Triples parser 

=head1 SYNOPSIS

	use RDFStore::Parser::NTriples;
        use RDFStore::NodeFactory;
        my $p=new RDFStore::Parser::NTriples(
		ErrorContext => 2,
                Handlers        => {
                        Init    => sub { print "INIT\n"; },
                        Final   => sub { print "FINAL\n"; },

lib/RDFStore/Parser/NTriples.pm

                                        persistent      =>      1,
                                        seevalues       =>      1,
                                        store_options         =>      { Name => '/tmp/test' }
                                }
        );
	$pstore->parsefile('http://www.gils.net/bsr-gils.nt');


=head1 DESCRIPTION

This module implements an N-Triples I<streaming> parser.

=head1 METHODS

=over 4

=item new

This is a class method, the constructor for RDFStore::Parser::NTriples. B<Options> are passed as keyword/value
pairs. Recognized options are:

lib/RDFStore/Parser/SiRPAC.pm


	$expat->{SiRPAC}->{EXPECT_Element} = $newElement
		if($setScanModeElement);

	my $sLiteralValue;
	if($expat->{SiRPAC}->{scanMode} ne 'SKIPPING') {

		# goes through the attributes of newElement to see
	 	# 1. if there are symbolic references to other nodes in the data model.
		# in which case they must be stored for later resolving with
		# resolveLater method (fix aboutEach on streaming!!!)
		# 2. if there is an identity attribute, it is registered using
		# registerResource or registerID method. 
	
       		my $sResource;
       		$sResource = getAttributeValue($expat,$newElement->{attlist}, $RDFStore::Parser::SiRPAC::RDFMS_resource);
		if (defined $sResource) {
       	 		$newElement->{sResource} = normalizeResourceIdentifier($expat,$sResource);
		} else {
       			$sResource = getAttributeValue($expat,$newElement->{attlist}, $RDFStore::Parser::SiRPAC::RDFMS_nodeID);
			if (defined $sResource) {

lib/RDFStore/Parser/SiRPAC.pm

	sub namespace { };
};

1;
};

__END__

=head1 NAME

RDFStore::Parser::SiRPAC - This module implements a streaming RDF Parser as a direct implementation of XML::Parser::Expat(3)

=head1 SYNOPSIS

	use RDFStore::Parser::SiRPAC;
        use RDFStore::NodeFactory;
        my $p=new RDFStore::Parser::SiRPAC(
		ErrorContext => 2,
                Handlers        => {
                        Init    => sub { print "INIT\n"; },
                        Final   => sub { print "FINAL\n"; },

lib/RDFStore/Parser/SiRPAC.pm

                                }
        );
	my $rdfstore_model = $pstore->parsefile('http://www.gils.net/bsr-gils.rdfs');

	# using the expat non-blocking feature (generally for large XML streams) - see XML::Parser::Expat(3)
	my $rdfstore_stream_model = $pstore->parsestream(*STDIN);
	

=head1 DESCRIPTION

This module implements a Resource Description Framework (RDF) I<streaming> parser completely in 
Perl using the XML::Parser::Expat(3) module. The actual RDF parsing is done by an instance of XML::Parser::Expat with the Namespaces option enabled and start/stop and char handlers set.
The RDF-specific code is based on Sergey Melnik's modified version of SiRPAC in Java; many
changes and adaptations were needed to run it under Perl.
Expat options may be provided when the RDFStore::Parser::SiRPAC object is created. These options are then passed on to the Expat object on each parse call.

Exactly like XML::Parser(3), the behavior of the parser is controlled either by the Style and/or the Handlers options, or by the setHandlers method. The...

For examples of how to use it, see the sections below and the samples and utils directories that come with this software distribution.

E.g.

lib/RDFStore/Parser/SiRPAC.pm

 Benchmarking XML Parsers by Clark Cooper - http://www.xml.com/pub/Benchmark/article.html

 See also http://www.w3.org/RDF/Implementations/SiRPAC/SiRPAC-defects.html

 RDF::Parser(3) from http://www.pro-solutions.com

=head1 AUTHOR

	Alberto Reggiori <areggiori@webweaving.org>

	Sergey Melnik <melnik@db.stanford.edu> is the original author of the streaming version of SiRPAC in Java
	Clark Cooper is the author of the XML::Parser(3) module together with Larry Wall
