
Re: Linux Filesystems Do Not Suffer Fragmentation. Here's Why.

  • Subject: Re: Linux Filesystems Do Not Suffer Fragmentation. Here's Why.
  • From: "Rex Ballard" <rex.ballard@xxxxxxxxx>
  • Date: 18 Aug 2006 19:51:17 -0700
  • Complaints-to: groups-abuse@google.com
  • In-reply-to: <12ecjlicd98ke60@news.supernews.com>
  • Injection-info: m73g2000cwd.googlegroups.com; posting-host=67.80.98.116; posting-account=W7I-5gwAAACdjXtgBZS0v1SA93ztSMgH
  • Newsgroups: comp.os.linux.advocacy
  • Organization: http://groups.google.com
  • References: <7471666.eJY3XFx0ak@schestowitz.com><MsmFg.6952$395.5524@edtnps90> <12ecjlicd98ke60@news.supernews.com>
  • User-agent: G2/0.2
  • Xref: news.mcc.ac.uk comp.os.linux.advocacy:1142293
Tim Smith wrote:
> In article <MsmFg.6952$395.5524@edtnps90>, Oliver Wong wrote:
> ><quote> Windows tries to put all files as close to the start of the hard
> >drive as it can, thus it constantly fragments files when they grow larger
> >and there's no free space available.

The key issue here is that any application on Windows tends to just
append wherever it can.  This is even true of copy commands.

> > Linux scatters files all over the disk so there's plenty of free space if
> > the file's size changes. It also re-arranges files on-the-fly, since it
> > has plenty of empty space to shuffle around.
> ></quote>

Linux allocates file space using i-nodes.  Each i-node block is about
1024 bytes (or whatever has been defined as the "block size") with a
small header; the remaining bytes are pointers to the allocated
"blocks".  When the file is closed, the unused blocks are returned to
the "free pool" and grouped with the nearest free blocks.  The
"defrag" effectively happens as the space is freed.

If a "block" is 1 kbyte, you can map about 250 kbytes per inode.  An
IDE "track" is 63 512 byte sectors, or about 32k, and 255 tracks per
cylinder, which is around 8 megabytes.  You can index an entire
cylinder in about 33 inodes.
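
For anyone who wants to check the arithmetic, here is a tiny program
that reproduces those figures, assuming 4-byte block pointers; the 250
kbyte and 33 i-node numbers above come from rounding the same values:

    #include <stdio.h>

    int main(void)
    {
        long block     = 1024;            /* 1 kbyte block size                     */
        long ptrs      = block / 4;       /* 4-byte block pointers per block = 256  */
        long per_inode = ptrs * block;    /* ~256 kbytes mapped per indirect block  */
        long track     = 63L * 512;       /* one IDE track is ~32 kbytes            */
        long cylinder  = 255 * track;     /* 255 tracks is roughly 8 megabytes      */

        printf("mapped per indirect block: %ld kbytes\n", per_inode / 1024);
        printf("cylinder size: %.1f megabytes\n", cylinder / (1024.0 * 1024.0));
        printf("indirect blocks to map one cylinder: ~%ld\n",
               (cylinder + per_inode - 1) / per_inode);
        return 0;
    }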

In most Linux file systems, the i-nodes are kept together, usually in
the middle of the disk, and are usually cached when a file is opened.

If an application such as cp knows that it will need a large amount of
space, it can create the file in contiguous space, similar to the way
swap files are created.
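
Whether a given cp implementation actually does this depends on the
version, but the standard POSIX call an application can use to ask for
space up front (and so give the filesystem a chance to allocate it
contiguously) is posix_fallocate().  A minimal sketch, with the file
name purely illustrative:

    /* Minimal sketch: ask the filesystem for space up front so it can
     * try to hand out contiguous blocks before any data is written.
     * Error handling is abbreviated. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("output.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Request 10 megabytes; the filesystem decides how to lay it out. */
        int err = posix_fallocate(fd, 0, 10L * 1024 * 1024);
        if (err != 0)
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

        /* ... write the actual data here ... */

        close(fd);
        return 0;
    }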

Linux administrators often use partitions to help reduce fragmentation
as well.  Putting log files and other "growing" files into a /var
partition means that the applications that really want files to be
contiguous have a better chance of getting them.

When a partition is very close to full, Linux may not be able to
completely eliminate fragments, but it can attempt to allocate space
from the same cylinder.  Linux has large disk buffers and often reads
a cylinder at a time (most hard drives read all tracks from all
platters into a local buffer or cache, and Linux then reads in the
entire buffered cylinder).  The net result is that even if the file is
slightly fragmented, it can usually be retrieved in a single rotation
of the hard drive (10-30 ms), instead of reading 5-10 fragments, one
seek at a time (50-300 ms).
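
That read-ahead happens automatically, but for what it's worth an
application can also hint the kernel that a file will be read
sequentially, which encourages more aggressive read-ahead.  A small
sketch using the standard posix_fadvise() call (the file name is made
up):

    /* Sketch: hint the kernel that we will read this file sequentially,
     * so it can read ahead aggressively and absorb seeks in large,
     * contiguous transfers. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        char buf[64 * 1024];
        while (read(fd, buf, sizeof buf) > 0)
            ;   /* process each chunk here */

        close(fd);
        return 0;
    }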

> I've not seen any other references to this on-the-fly rearranging.  Anyone
> have a link?

There are several different file systems.  One can review the source
code to nearly all of them, in combination with the source code to the
Linux kernel.  Each targets specific niches, which is why many Linux
systems will have different partitions using different file systems,
each optimized for specific needs (databases, programs and read-mostly
data, log files, high reliability, small fast files, large archives,
and so on).

Keep in mind that each directory off the root directory was put there
so that it could easily be placed in its own partition if desired.  On
really large multi-disk systems, administrators will tune the daylights
out of those partitions.

> > I'm actually interested in a more detailed analysis, as I have the feeling
> > the author skimmed over a lot of details, and this is a topic I don't have
> > a strong background in.
>
> I suspect that this is one of those areas that hasn't been adequately
> researched.

Actually, it's been extremely well researched.  That's why Linux has so
many different file systems instead of "one size fits all" like NTFS.
BSD has even more, but not all of them are supported by the Linux
interfaces.

> It's fairly well-known (or at least well-believed...) that on
> multi-user systems, fragmentation is not much of a problem, because you are
> sharing the system with other users.

Actually, that's part of the whole design of the file system.  It
doesn't matter whether you have 100 users running shell scripts and
emacs, or 1 user running 1000 daemons in the background and 20 windows
on 4 desktops in the "foreground".  If you have multiple processes
writing to files at nearly the same time, the operating system and file
system need to find ways to do this as efficiently as possible.

>  Basically, you are only going to be
> allowed to do a certain amount of I/O before the system switches to work on
> the I/O of another user.

Actually, each time you do ANY I/O, you pass control to the kernel.  If
you do a "write(fd, buf, 1000)", the kernel knows you want to write
1000 bytes starting at buf into file descriptor fd.  If you have LOTS
of memory, you can change the call to "write(fd, buf, 10000000)" and
tell the kernel that you want to write 10 megabytes.  The kernel can
then look for about 40 i-nodes (at ~250 kbytes each) that point to
space that is close together.  Even if you don't get a "perfect"
defrag, you can usually get most of the blocks allocated near each
other.

>  Eventually, the system gets back to you, and at
> that point, it doesn't matter if the next part of the file you are working
> on is contiguous with the prior part or not.

Again, with proper use of cylinder and i-node caching, it's possible to
read up to 8 megabytes in about 10 milliseconds, figure out which
readers need what you have pulled in, map it to the correct file
handles (including pre-fetch where possible), and be ready to pull
another 8 megabytes by the time the next cylinder is sought and read.

> It's not at all clear that this applies when you take a multi-user system,
> like a member of the Unix family, and run it as essentially a single user
> system.  Spreading out the files, as described in the article under
> discussion, might actually hurt.

Again, much of this has to do with knowing what you are trying to
store, and letting the operating system and file system (remember, on
Linux the file system is a "function" independent of the operating
system, like a plug-in) manage the allocation and optimization of
i-nodes.
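
A rough sketch of that "plug-in" idea: the kernel calls into whichever
filesystem owns a mount point through a table of function pointers, so
each filesystem can supply its own allocation policy.  (The real Linux
structures are struct file_operations and struct super_operations; the
fields below are simplified and illustrative.)

    /* Illustrative only: each filesystem registers its own operations,
     * and the kernel dispatches through the table, so ext2, reiserfs,
     * jfs, etc. can each implement allocation differently. */
    #include <stddef.h>
    #include <sys/types.h>

    struct vfs_ops {
        int     (*open)(const char *path, int flags);
        ssize_t (*read)(int fd, void *buf, size_t len);
        ssize_t (*write)(int fd, const void *buf, size_t len);
        int     (*allocate_blocks)(int fd, off_t len);   /* fs-specific policy */
    };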

Compare this to some of the Microsoft operating systems, which require
the system to find not only the allocated space but also the "table of
contents".  Microsoft knows where the "free space" is, but doesn't have
a well-managed area for the table of contents.  As a result, you have
to conduct multiple accesses, which could be on many different
cylinders, before you can get a "cluster", typically between 4K and
32K, at a time.

> Most of our conventional wisdom on disk performance developed years ago,
> when things were quite a bit different.

This is probably more true of Microsoft than of Linux.  For example,
FAT was developed 25 years ago, when one needed to know the exact
geometry of the floppy or hard drive, and storage and memory were
limited.  In that setting, finding free space was important, but
minimizing the space consumed on write-mostly floppy disks was
critical.  FAT16 and FAT32 increased the amount of storage available,
but were still focused on optimal storage of the "stable" portions of
the system, including the operating system, bundled applications, and
preinstalled applications.  The "user space" was less of a concern
because the assumption was that users would only store a limited amount
of personal information on the hard drive and would back it up and
restore it frequently as needed.

NTFS was based on HPFS, which was developed for OS/2.  Even that
environment was designed for relatively small drives (20 megabytes or
so) and relatively limited memory (4-8 megabytes of RAM).

>  Disks today have much lower seek
> time, higher rotational rates, the disks have much bigger caches, and so do
> the computers.

Keep in mind that BSD 4.0 and later UNIX systems pretty much STARTED
with 300 megabyte drives and 1-2 megabytes of RAM.  They were designed
for big drives, and multiple drives or partitions, and they were
designed to manage lots of little files (e-mail was often 1 file per
message), growing files (log files, spool files, and such), big files
(databases), and user files (home directories).  Over time, and
especially as Linux found its niche as both a server and a
workstation, different filesystems were designed and optimized for
each application.  In addition, some file systems were optimized to
support striping and RAID configurations.

>  Performance depends not only on the absolute value of these
> parameters, but on the relative values, because there are tradeoffs to be
> made, balancing one against the other.

This is precisely why Linux supports so many different filesystems
instead of just one or two.  Microsoft, on the other hand, wants the
OEMs to place only one partition on the drive, and that drive is to be
used exclusively for Windows and has been optimized for the
preinstalled software.  It really wasn't designed to support the
thousands of little files that are cached by web browsers like IE, or
lots of little files like e-mail messages or news articles.  It isn't
even optimized to support databases and other "volatile" systems.

> What this all means is that it would probably be a good idea for someone
> with a good understanding of all these factors to do a series of careful
> experiments, and update the conventional wisdom if it is no longer correct.

Actually, Microsoft has done such studies, and has used this
information to help design benchmarks which give Microsoft the
advantage.  For example, if you create lots of "standard sized" files
that are exact multiples of the cluster size, do reads and writes
that are exact multiples of the cluster, and limit the memory to around
64 megabytes, you can make NTFS about 5-10% faster than Linux
configured with an ext2 filesystem using the smallest block size.  This
is because you end up with files that don't use the i-nodes
efficiently, and you don't allocate blocks efficiently.  You can really
skew the results by properly "de-tuning" the file system on Linux and
running it against an optimized NTFS file system (large cache, large
cluster size, large memory, drive caching, and carefully chosen seek
targets).

The Linux community studied these results back in 1996-7 and created
new file systems such as ext3, ReiserFS, JFS, and so on, along with
FAT32 support and even NTFS drivers.  Linux distributors strongly
recommend against writing to NTFS partitions, primarily because, when
journalling is disabled to make it run faster, it is very vulnerable to
power failures and "hard boots".  With Linux, you have a choice, but
choosing NTFS simply "because it's there" is considered very unwise.

> -- 
> --Tim Smith

