XML-Minify

lib/XML/Minify.pm  view on Meta::CPAN

	my $newnode = $doc->createElement($name);

	if($outnode) {
		$outnode->addChild($newnode);
	}

	$outnode = $newnode;

	# Copy attributes ($attr rather than $a, which is reserved for sort)
	my @attrs = $node->attributes;
	foreach my $attr (@attrs) {
		$outnode->setAttribute($attr->nodeName, $attr->value);
	}

	foreach my $child ($node->childNodes) {
		if($child->nodeType eq XML_TEXT_NODE) {
			my $str = $child->data;

			
			if($do_not_remove_blanks{$child->parentNode->getName()}) {
				# DO NOT REMOVE, PROTECTED BY DTD ELEMENT DECL	
			} else {
				# These substitutions remove the indentation that people tend to put in XML files,
				# or simply clean on demand (the default behavior keeps these blanks)

				# "Blanks" covers several things: spaces, tabs, LF, CR, vertical space...

				# Configurable with remove_blanks_start : remove blanks at the start of the string
				$opt{remove_blanks_start} and $str =~ s/\A\s+//;
				# Configurable with remove_blanks_end : remove blanks at the end of the string
				$opt{remove_blanks_end} and $str =~ s/\s+\z//;

				# Only CR and LF

				# Configurable with remove_cr_lf_everywhere : remove CR/LF everywhere
				$opt{remove_cr_lf_everywhere} and $str =~ s/\R+//g;

				# "Spaces" covers two things: spaces and tabs

				# Configurable with remove_spaces_line_start : remove spaces or tabs at the start of each line
				$opt{remove_spaces_line_start} and $str =~ s/^[ \t]+//mg;
				# Configurable with remove_spaces_line_end : remove spaces or tabs at the end of each line
				$opt{remove_spaces_line_end} and $str =~ s/[ \t]+$//mg;
				# Configurable with remove_spaces_everywhere : remove spaces and tabs everywhere
				$opt{remove_spaces_everywhere} and $str =~ s/[ \t]+//g;

				# Configurable with remove_empty_text : empty text nodes that contain only blanks
				$opt{remove_empty_text} and $str =~ s/\A\s*\z//;
			}
			
			# Let me explain, we could have text nodes basically everywhere, and we don't know if whitespaces are ignorable or not. 
			# As we want to minify the xml, we can't just keep all blanks, because it is generally indentation or spaces that could be ignored.
			# Here is the strategy : 
			# A. If we have <name>   </name> we should keep it anyway (unless forced with argument)
			# B. If we have </name>   </person> we should *maybe* remove (in this case parent node contains more than one child node : text node + element node)
			# C. If we have <person>   <name> we should *maybe* remove it (in this case parent node contains more than one child node : text node + element node)
			# D. If we have </person>   <person> we should *maybe* remove it (in this case parent node contains more than one child node : text node + element node)
			# B, C, D : remove... unless explicitly declared in the DTD as a potential #PCDATA container OR unless it contains something...
			# *something* is a comment (not removed), some other text not empty, some cdata.
			# Imagine </name>   <!-- comment --> some text </person> then we don't want to remove spaces in the first text node
			# Same with </name>   <!-- comment -->   </person>
			# But if comments are removed then the latter piece of code will become </name></person>

			my $empty = 1;
			
			my $childbak = $child;
			my @siblings = ();
			# We want to inspect siblings to the right until we reach an element
			while($child = $child->nextSibling) {
				if($child->nodeType eq XML_ELEMENT_NODE) {
					last;
				}
				push @siblings, $child;
			}
			$child = $childbak;
			# We inspect to the left also
			while($child = $child->previousSibling) {
				if($child->nodeType eq XML_ELEMENT_NODE) {
					last;
				}
				push @siblings, $child;
			}

			# Then we look at each sibling to check
			# whether it is an empty text node
			# and whether it is something that will be removed
			foreach my $child (@siblings) {
				if($child->nodeType eq XML_TEXT_NODE) {
					if($child->data =~ m/[^ \t\r\n]/) {
						# Not empty
						$empty = 0;
						last;
					}
				}
				if($child->nodeType eq XML_COMMENT_NODE and $opt{keep_comments}) {
					$empty = 0;
					last;
				}
				if($child->nodeType eq XML_CDATA_SECTION_NODE and $opt{keep_cdata}) {
					$empty = 0;
					last;
				}
				if($child->nodeType eq XML_PI_NODE and $opt{keep_pi}) {
					$empty = 0;
					last;
				}
				# Entity refs : we can choose to expand or not... but not to drop them
				if($child->nodeType eq XML_ENTITY_REF_NODE) {
					$empty = 0;
					last;
				}
			}


			$child = $childbak;

			# Were all siblings empty ?
			# Are we alone ? (count child nodes from parent instead of filtered siblings)
			# If there is a DTD, probably we can remove even in the leaves (I'm not doing this at the moment)
			if($do_not_remove_blanks{$child->parentNode->getName()}) {
				# DO NOT REMOVE, PROTECTED BY DTD ELEMENT DECL
			} elsif($we_have_infos_from_dtd or ($empty and @{$child->parentNode->childNodes()} > 1)) {
				# With DTD infos we only trust the DTD (no need to consider leaf or node).
				# Without them we need blank siblings and a parent with more than one child node.
				# Should it be configurable ?
				$str =~ s/\A\s*\z//;
			}
			$outnode->appendText($str);
		} elsif($child->nodeType eq XML_ENTITY_REF_NODE) {
			# Configuration will be done above when creating document
			my $er = $doc->createEntityReference($child->getName());
			$outnode->addChild($er); 
		} elsif($child->nodeType eq XML_COMMENT_NODE) {
			# Configurable with keep_comments
			my $com = $doc->createComment($child->getData());
			$opt{keep_comments} and $outnode->addChild($com); 
		} elsif($child->nodeType eq XML_CDATA_SECTION_NODE) {
			# Configurable with keep_cdata
			#my $cdata = $child->cloneNode(1);
			my $cdata = $doc->createCDATASection($child->getData());
			$opt{keep_cdata} and $outnode->addChild($cdata);
		} elsif($child->nodeType eq XML_PI_NODE) {
			# Configurable with keep_pi
			#my $pi = $child->cloneNode(1);
			my $pi = $doc->createPI($child->nodeName, $child->getData());
			$opt{keep_pi} and $outnode->addChild($pi);
		} elsif($child->nodeType eq XML_ELEMENT_NODE) {
			$outnode->addChild(traverse($child, $outnode)); 
		}
	} 
	return $outnode;
}


1;

__END__

=encoding utf-8

=head1 NAME

XML::Minify - A configurable XML minifier.

=head1 WARNING

The API (option names) is almost stable, but not fully, so it may still change slightly.

=head1 SYNOPSIS

Here is the simplest way to use XML::Minify :

    use XML::Minify;

    my $maxi = "<person>   <name>tib   </name>   <level>  42  </level>  </person>";
    my $mini = minify($maxi);

But a typical use would include some parameters like this :

    use XML::Minify qw(minify);

    my $maxi = "<person>   <name>tib   </name>   <level>  42  </level>  </person>";
    my $mini = minify($maxi, no_prolog => 1, aggressive => 1);

That will produce :

    <person><name>tib</name><level>42</level></person>

B<aggressive>, B<destructive> and B<insane> are shortcuts that each define a set of parameters.

You can also set parameters individually:

    use XML::Minify qw(minify);

    my $maxi = "<person>   <name>tib   </name>   <level>  42  </level>  </person>";
    my $mini = minify($maxi, no_prolog => 1, aggressive => 1, keep_comments => 1, remove_indent => 1);

The code above means "minify this string in aggressive mode BUT keep comments and, in addition, remove indentation".

Not every parameter has a B<keep_> or B<remove_> counterpart; please see below for the detailed list.

=head2 DEFAULT MINIFICATION

The minifier has a predefined set of options enabled by default.

The author chose them as sensible defaults, but you can disable them individually with the B<keep_> options.

=over 4

=item Merge elements when empty

=item Remove DTD (configurable).

=item Remove processing instructions (configurable)

=item Remove comments (configurable).

=item Remove CDATA (configurable).

=back

In addition, the minifier drops all blanks between first-level children.
Whatever sits between first-level children is not supposed to be meaningful data, so we can safely remove formatting there.
For instance, we can remove a carriage return between the prolog and a processing instruction (or even inside a DTD).

In addition again, the minifier will I<smartly> remove blanks between tags. By I<smart> I mean that it will not remove blanks if we are in a leaf (more chances to be meaningful blanks) or if the node contains something that will persist (a I<not remo...

If there is no DTD (which is very often the case), we are blind and simply use the approach described above (keep blanks in leaves, remove blanks in nodes if all siblings contain only blanks).


Everything listed above is the default and should be perceived as almost lossless minification in terms of semantics (for humans).

It is not completely lossless if you consider these things to be data, but in that case you simply cannot minify, since you cannot touch anything ;)
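
The no-DTD heuristic described above can be sketched in plain Perl. This is a simplified model and not the module's implementation: children are plain hashes rather than XML::LibXML nodes, and the C<blank_is_removable> name as well as the C<type> and C<data> keys are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical model: each child is a hash with a 'type' key
# ('text', 'element', 'comment', 'cdata', 'pi', 'entity_ref')
# and, for text nodes, a 'data' key.
sub blank_is_removable {
    my ($children, $index, %opt) = @_;

    # Only whitespace-only text nodes are candidates.
    return 0 if $children->[$index]{data} =~ /\S/;

    # A lone text node is a leaf: keep its blanks.
    return 0 if @$children <= 1;

    # Collect non-element neighbours on both sides, stopping at an element.
    my @siblings;
    for (my $i = $index + 1; $i < @$children; $i++) {
        last if $children->[$i]{type} eq 'element';
        push @siblings, $children->[$i];
    }
    for (my $i = $index - 1; $i >= 0; $i--) {
        last if $children->[$i]{type} eq 'element';
        push @siblings, $children->[$i];
    }

    for my $sib (@siblings) {
        # Non-blank text nearby: the blanks may be meaningful.
        return 0 if $sib->{type} eq 'text' && $sib->{data} =~ /[^ \t\r\n]/;
        # A neighbour that survives minification also protects the blanks.
        return 0 if $sib->{type} eq 'comment' && $opt{keep_comments};
        return 0 if $sib->{type} eq 'cdata'   && $opt{keep_cdata};
        return 0 if $sib->{type} eq 'pi'      && $opt{keep_pi};
        # Entity references are never dropped.
        return 0 if $sib->{type} eq 'entity_ref';
    }
    return 1;
}
```

In this model, a whitespace-only text node next to a comment is reported as removable unless C<keep_comments> is set, because a surviving comment means the blanks around it might matter.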


=head2 EXTRA MINIFICATION

In addition, you could enable mode B<aggressive>, B<destructive> or B<insane> to remove characters in the text nodes (sort of "cleaning") : 

=head3 Aggressive

=over 4

=item Remove empty text nodes.

=item Remove starting blanks (carriage return, line feed, spaces...).

=item Remove ending blanks (carriage return, line feed, spaces...).

=back

=head3 Destructive

=over 4

=item Remove indentation.

=item Remove invisible spaces and tabs at the end of line.

=back 

=head3 Insane

=over 4

=item Remove carriage returns and line feeds everywhere in text nodes.

=item Remove spaces everywhere in text nodes.

=back 
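
To make the three modes more concrete, here is a minimal sketch of the substitutions they enable on the content of a single text node. The C<clean_text> helper is hypothetical and not part of the module's API; only the option names are taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the enabled cleanups to one text node's content.
sub clean_text {
    my ($str, %opt) = @_;

    # Aggressive: blanks at the very start and very end of the text.
    $str =~ s/\A\s+// if $opt{remove_blanks_start};
    $str =~ s/\s+\z// if $opt{remove_blanks_end};

    # Insane: line breaks anywhere in the text.
    $str =~ s/\R+//g if $opt{remove_cr_lf_everywhere};

    # Destructive: spaces and tabs at the start and end of each line.
    $str =~ s/^[ \t]+//mg if $opt{remove_spaces_line_start};
    $str =~ s/[ \t]+$//mg if $opt{remove_spaces_line_end};

    # Insane: spaces and tabs anywhere in the text.
    $str =~ s/[ \t]+//g if $opt{remove_spaces_everywhere};

    # Aggressive: a text made only of blanks becomes empty.
    $str =~ s/\A\s*\z// if $opt{remove_empty_text};

    return $str;
}
```

The substitutions are applied in a fixed order, so the line-based cleanups only see the line breaks that C<remove_cr_lf_everywhere> has left in place.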

=head2 OPTIONS

You can give various options:

=over 4

=item B<expand_entities>

Expand entities. An entity is like 
    
    &foo; 

=item B<process_xincludes>

Process the xincludes. An xinclude is like 

=item B<remove_spaces_line_start>

Remove spaces and tabs at the start of each line in text nodes. 
It's like removing indentation actually.

For instance 

    <tag>
           foo 
           bar    
       </tag> 

will become 

    <tag>
    foo 
    bar
    </tag>

=item B<remove_spaces_line_end>

Remove spaces and tabs at the end of each line in text nodes.
It's like removing invisible things.

=item B<remove_empty_text>

Remove (pseudo) empty text nodes (containing only spaces, carriage return, line feed...). 

For instance 
  
    <tag>

    </tag>

will become 

    <tag/>

=item B<remove_cr_lf_everywhere>

Remove carriage returns and line feeds everywhere (inside text!).

For instance 

    <tag>foo
    bar
    </tag> 

will become 

    <tag>foobar</tag>

It is aggressive and therefore lossy compression.

=item B<keep_comments>

Keep comments (by default they are removed).

A comment is something like :

    <!-- comment -->

=item B<keep_cdata>

Keep CDATA sections (by default they are removed).

A CDATA section is something like :

    <![CDATA[ my cdata ]]>

=item B<keep_pi>

Keep processing instructions. 

A processing instruction is something like :

    <?xml-stylesheet href="style.css"?>

=item B<keep_dtd>

Keep DTD.

=item B<ignore_dtd>

When set, the minifier ignores information from the DTD (typically the declarations that say where blanks are meaningful).

This option can be combined with B<keep_dtd>: you can decide to read information from the DTD and then remove it, or to keep the DTD while ignoring its information.

To be clear, B<ignore_dtd> is NOT the opposite of B<keep_dtd>.

=item B<no_prolog>

Do not output the prolog (omitting the prolog is aggressive for XML readers).

The prolog is at the start of the XML file and looks like this :

    <?xml version="1.0" encoding="UTF-8"?>

=item B<version>

Specify version.

=item B<encoding>

Specify encoding.

=item B<aggressive>

Enable B<aggressive> mode. This enables the options B<remove_blanks_start>, B<remove_blanks_end> and B<remove_empty_text>, but only if they are not already defined.
Other options keep their values.

=item B<destructive>

Enable B<destructive> mode. This enables the options B<remove_spaces_line_start> and B<remove_spaces_line_end>, but only if they are not already defined.
It also enables B<aggressive> mode.
Other options keep their values.

=item B<insane>

Enable B<insane> mode. This enables the options B<remove_cr_lf_everywhere> and B<remove_spaces_everywhere>, but only if they are not already defined.
It also enables B<destructive> mode and B<aggressive> mode.
Other options keep their values.

=back 
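
The way the mode shortcuts cascade into individual options can be sketched as follows. This is only an illustration of the rule "enable unless already defined", not the module's actual code: the C<expand_modes> helper is hypothetical, and it assumes that a stronger mode always switches on the milder ones.

```perl
use strict;
use warnings;

# Hypothetical helper: expand mode shortcuts into individual options.
# The defined-or assignment (//=) leaves explicitly set options untouched.
sub expand_modes {
    my (%opt) = @_;

    if ($opt{insane}) {
        $opt{remove_cr_lf_everywhere}  //= 1;
        $opt{remove_spaces_everywhere} //= 1;
        $opt{destructive} = 1;    # assumption: insane implies destructive
    }
    if ($opt{destructive}) {
        $opt{remove_spaces_line_start} //= 1;
        $opt{remove_spaces_line_end}   //= 1;
        $opt{aggressive} = 1;     # assumption: destructive implies aggressive
    }
    if ($opt{aggressive}) {
        $opt{remove_blanks_start} //= 1;
        $opt{remove_blanks_end}   //= 1;
        $opt{remove_empty_text}   //= 1;
    }
    return %opt;
}
```

With this sketch, passing C<< destructive => 1, remove_blanks_end => 0 >> turns on every destructive and aggressive cleanup except B<remove_blanks_end>, which keeps the value given by the caller.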

=head1 LICENSE

Copyright (C) Thibault DUPONCHELLE.


