Alien-Judy
view release on metacpan or search on metacpan
src/judy-1.0.5/test/SLcompare.c view on Meta::CPAN
9/2002 (dlb).
This is the output of this program when run on a 750Mhz P3 laptop.
This output is using a file available from www.google.com. It looks
like 'http' web pages and was one of the files used in a March 2002
contest by www.google.com
wc google/data/repos.00
3440314 14613027 162822437 google/data/repos.00
lines avg_linelen getline stored RAMused/line store/ln lookup/ln ADT
2807737 58.0 byts 0.856 uS 1455201 104.0 byts 4.859 uS 4.566 uS SPLAY
2807737 58.0 byts 0.862 uS 1455201 96.0 byts 2.523 uS 2.101 uS HASH
2807737 58.0 byts 0.856 uS 1455201 104.0 byts 4.601 uS 4.001 uS REDBLK
2807737 58.0 byts 0.875 uS 1455201 78.7 byts 3.898 uS 2.687 uS JUDY
With = 1.46 million lines stored and 2.81 million non-blank lines,
all the algorithms seem to keep their performance. The hash table is a
perfect size for this data set. I suspect only the HASH algorithm will
slow down with larger data sets unless the hash table is increased larger
than 2**20 million entrys. I have now tried a data set with about 10
million unique lines on a 64 bit machine (google/data/repos.*).
The speeds are about the same. Lookup times are in the 2-4 uSec range
and this is probably insignificant compared to the application needing
the data. I think all four methods are fine for speed, and Judy is a
standout for its memory efficiency -- frequently using less memory than
the strings alone.
I plan on tuning JudySL in the near future. This should improve the
performance of JudySL with short (a text word) strings. I have a data
base of all (28.8 million) domain names on the internet as a test case.
wc /data/domnames.txt
28826508 28826508 488227095 /data/domnames.txt
For the curious, JudySL (I.E JSLI/G macros in this program) is simply an
application subroutine that uses JudyL to the work. JudyL is a very
scaleable word to word (or pointer) mapper. JudySL is implemented as a
digital tree using JudyL arrays as 2^32 (or 2^64 in 64 bit machines)
wide (virtual) nodes. Each level in the digital tree decodes the next 4
(or 8) bytes of the line/string until only one entry would be in the
node. Then the remaining undecoded portion of the line is stored as a
string from the parent node. A digital tree does not need rotation or
balancing. In practice, digital trees are rarely use because of their
poor memory efficiency in the nodes and a tendency to have large depths.
These problems are solved in the Judy algorithm. For more details see
application notes on: www.sourcejudy.com. And for the really curious,
a technical paper is available in the application section. If you would
like to be a reviewer, contact Doug Baskins at <doug@sourcejudy.com>
*/
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h> // printf
#include <fcntl.h>
#include <string.h>
#include <errno.h> // errno
#include <sys/mman.h> // mmap()
#include <sys/stat.h> // stat()
#include <sys/time.h> // gettimeofday()
// remove all uses of this in production code.
int gDupCnt = 0; // counter for duplicate strings
//=======================================================================
// T I M I N G M A C R O S
//=======================================================================
// if your machine is one of the supported types in the following header
// file then uncomment this corresponding to what the header file says.
// This will give very high timing resolution.
//
// #define JU_xxxxxxx 1 // read timeit.h
// #define JU_LINUX_IA32 1 // I.E. IA32 Linux
//
//
// optional for high resolution times and compile with timeit.c
// cc -static -O -o Judy -DJUDYMETHOD SLcompare.c timeit.c -lJudy
//
//#include "timeit.h"
double DeltaUSec; // Global for remembering delta times
#ifndef _TIMEIT_H
// Note: I have found some Linux systems (2.4.18-6mdk) to have bugs in the
// gettimeofday() routine. Sometimes it returns a negative ~2840 microseconds
// instead of 0 or 1. If you use the above #include "timeit.h" and compile with
// timeit.c and use -DJU_LINUX_IA32, that problem will be eliminated. This is
// because for delta times less than .1 sec, the hardware free running timer
// is used instead of gettimeofday(). I have found the negative time problem
// appears about 40-50 times per second with consecutive gettimeofday() calls.
#define TIMER_vars(T) struct timeval __TVBeg_##T, __TVEnd_##T
#define STARTTm(T) gettimeofday(&__TVBeg_##T, NULL)
#define ENDTm(D,T) \
{ \
gettimeofday(&__TVEnd_##T, NULL); \
(D) = (double)(__TVEnd_##T.tv_sec - __TVBeg_##T.tv_sec) * 1E6 + \
((double)(__TVEnd_##T.tv_usec - __TVBeg_##T.tv_usec)); \
}
#endif // _TIMEIT_H
//=======================================================================
// M E M O R Y S I Z E M A C R O S
//=======================================================================
// Most mallocs have mallinfo()
// define this if your malloc does not have mallinfo();
#define NOMALLINFO 1
double DeltaMem; // for remembering
#ifndef NOMALLINFO
#include <malloc.h> // mallinfo()
src/judy-1.0.5/test/SLcompare.c view on Meta::CPAN
return (tmp);
}
#define INITARRAY ANSREC ans = {0}
#define STORESTRING(STRING) redblackinsert(&ans, STRING)
#define GETSTRING(STRING) redblacksearch(&ans, STRING)
#endif // REDBLACKMETHOD
//=======================================================================
// M A I N P R O G R A M -by- Doug Baskins
//=======================================================================
// error routine for system routines for accessing file
#define FILERROR \
{ \
printf("%s: Cannot open file \"%s\": %s " \
"(errno = %d)\n", argv[0], argv[1], strerror(errno), \
errno); \
fprintf(stderr, "%s: Cannot open file \"%s\": %s " \
"(errno = %d)\n", argv[0], argv[1], strerror(errno), \
errno); \
exit(1); \
}
//=======================================================================
// M E A S U R E A D T S P E E D and M E M O R Y U S A G E
//=======================================================================
int
main(int argc, char *argv[])
{
TIMER_vars(tm); // declare timer variables
int fd; // to read file.
struct stat statbuf; // to get size of file
char *Pfile; // ram address of file
size_t fsize; // file size in bytes
char *FSmap; // start address of mapped file
char *FEmap; // end address+1 of mapped file
char String[MAXLINE]; // input buffer
int StrCnt; // line counter
int ii; // temp
INITARRAY;
// CHECK FOR REQUIRED INPUT FILE PARAMETER:
if (argc != 2)
{
printf("Usage: %s <text file>\n", argv[0]);
exit(1);
}
// GET FILE SIZE
if (stat(argv[1], &statbuf) == -1)
FILERROR;
fsize = statbuf.st_size;
// OPEN INPUT FILE:
if ((fd = open(argv[1], O_RDONLY)) == -1)
FILERROR;
// MEMORY MAP FILE
Pfile =
(char *)mmap(NULL, fsize, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
if (Pfile == (char *)-1)
FILERROR;
FEmap = Pfile + fsize; // set end+1 address
// print command name + run arguments
printf("# ");
for (ii = 0; ii < argc; ii++)
printf("%s ", argv[ii]);
printf("\n");
fflush(stdout);
// file is in RAM, read again to measure time
STARTTm(tm); // start timer
for (StrCnt = 0, FSmap = Pfile; FSmap < FEmap; )
{
GETLINE(String, FSmap); // 'read' next string
StrCnt++;
}
ENDTm(DeltaUSec, tm); // end timer
//
// print header - 6 fields
printf("# lines avg_linelen getline stored");
printf(" RAMused/line store/ln lookup/ln ADT\n");
// print numb lines filesize/line
printf("%8u", StrCnt); // Number of lines
printf(" %6.1f byts", (double)fsize / (double)StrCnt); // filesize/line
fflush(stdout);
// print time to read a line
printf(" %5.3f uS", DeltaUSec / (double)StrCnt);
fflush(stdout);
// INPUT/STORE LOOP:
// read each input line into String and store into ADT
STARTmem; // current malloc() mem usage
STARTTm(tm); // start timer
( run in 3.059 seconds using v1.01-cache-2.11-cpan-39bf76dae61 )