[ExtractStream] Random SplitStream Thoughts...

Warren Toomey wkt at t...
Sun, 10 Feb 2002 09:02:54 +1000 (EST)


[ This time with some explanations! ]

> Just looking through the code right now, I can see it uses fwrite()
> to write out the data, and fread() to read chunks in. They're pretty
> efficient, and I doubt that you would improve matters by buffering any
> more. You might be able to improve things a little bit with the use
> of setvbuf(), but I wouldn't expect much.
> 
> [ Just tried it, I got a 5% improvement by using a CHUNK_SIZE
> buffer and doing setvbuf(in_fp, mybuf, _IOFBF, sizeof(mybuf)); ]

The standard I/O library on Unix (i.e. fopen(), fread(), fwrite(), fseek() etc.)
uses an internal buffer to help improve I/O performance. So, for example, if
you read 20 bytes, it will actually read a whole BUF of data, and then
give you the 20 bytes. The next fread() will get its data out of the buffer,
and not do any disk I/O.

Normally, BUF is around 4K or 8K, so when you are reading very large amounts
of data in one hit, it's effectively useless. By using setvbuf(), see

http://www.freebsd.org/cgi/man.cgi?query=setbuf&sektion=3&apropos=0&manpath=Red+Hat+Linux%2fi386+7.2

you can set the size of the standard I/O library's internal buffer, and
get it to read ahead as much as you want, without having to recode
splitstream to do the extra buffering by hand. So if Windoze can do this,
you might try setvbuf()ing to 12 Megs and see if it helps.
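
Here is a rough sketch of what the setvbuf() approach might look like. The
helper name open_with_big_buffer() and the BIG_BUF_SIZE constant are made up
for illustration and are not taken from splitstream; only fopen(), setvbuf(),
malloc() and free() are real library calls. The one rule to remember is that
setvbuf() has to be called after fopen() and before any read on the stream.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical size for illustration: 12 Megs, as suggested above. */
#define BIG_BUF_SIZE (12UL * 1024 * 1024)

/*
 * Open a file for reading and ask stdio to use a large, fully buffered
 * read-ahead buffer. On return, *bufp is either the buffer now owned by
 * the stream (free it only after fclose()) or NULL if the stream is
 * still using its default buffer.
 */
static FILE *open_with_big_buffer(const char *path, char **bufp)
{
    FILE *fp;

    assert(path != NULL && bufp != NULL);
    *bufp = NULL;

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    *bufp = malloc(BIG_BUF_SIZE);
    /* setvbuf() must come before any other operation on the stream. */
    if (*bufp == NULL || setvbuf(fp, *bufp, _IOFBF, BIG_BUF_SIZE) != 0) {
        /* Could not install the big buffer: keep the default one. */
        free(*bufp);
        *bufp = NULL;
    }
    return fp;
}
```

The existing fread() calls stay exactly as they are; after fclose() on the
stream, free() the buffer that came back in *bufp.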


> However, there might be one way of improving I/O performance, and that
> is to use memory mapping. I'm a BSD person and I would recommend mmap()
> and friends here. On SysV and relatives, something like shmat() and friends
> could be used. I'm not a Linux person, but just looking at a Debian system
> I can see something called memp_open().

What I'm suggesting here is to _map_ the entire input file into the
program's memory space, i.e. make it look like a huge char[] array
that starts at some memory location.

The advantage of this is that, once you have mapped the file in, you
don't have to keep calling fread() to read from it; you just have to
access the memory locations and the file is there.

This can be a big win because fread() first reads data into the internal
buffer, and then copies it from there into the buffer which you pass in
to fread(), so that's a double copy. With memory-mapped files, there is
no double copying.
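
As a minimal sketch of the idea on a POSIX system (BSD, Linux and so on),
something like the following maps a file read-only and then treats it as a
plain array. The names map_file() and count_byte() are hypothetical helpers
invented for this example; open(), fstat(), mmap() and close() are the real
calls. This assumes the whole file fits in the process's address space, which
may not hold for very large files on a 32-bit machine.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a whole file read-only. Returns the base address and stores the
 * length in *lenp, or returns NULL on failure (including an empty file,
 * which cannot be mapped). Release the mapping with munmap().
 */
static const unsigned char *map_file(const char *path, size_t *lenp)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && lenp != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *lenp = (size_t)st.st_size;
    return p;
}

/* Example consumer: count occurrences of a byte by plain array indexing. */
static size_t count_byte(const unsigned char *data, size_t len,
                         unsigned char byte)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (data[i] == byte)
            n++;
    return n;
}
```

Note that there is no fread() anywhere: the kernel pages the file in as the
loop touches each part of the array.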

See a tutorial I wrote about this at:

http://www.cs.adfa.edu.au/teaching/studinfo/osrts/Tutes/tute3.html

and it looks like Windoze can do this too by using CreateFileMapping():
http://leb.net/wine/WinDoc/msdn/sdk/platforms/doc/sdk/win32/func/src/f09_11.htm
so this could be a big win.

> Suggestion if it is I/O intensive, make sure the input file and the
> output files are on physically different disks. Then the operating
> system can schedule I/O operations on each drive concurrently, and this
> will speed things up.

You can at least document this in the user's manual, and those people
who are lucky enough to have multiple drives can take advantage of it.

I hope the above makes more sense, and yell if there's anything else
I can do.

Warren