New inode, dcache and transname implementation for Linux.

This patch is based on 2.0.27; it may work with later versions, but I
have not tried. It is in alpha state - do not use it with valuable data.

Features:

- The old and new code for inode.c and dcache.c exist in parallel; the
  new code can be selected by enabling the new config options you will
  find in the "filesystems" section.

- The new dcache can hold "negative entries", i.e. names that are known
  to *not* exist. This saves unnecessary lookups for non-existing names
  and is particularly useful for the transname facility. (A small
  illustrative sketch of the idea follows these lists.)

- /proc/<pid>/fd/ now contains symlinks with the full absolute path to
  the inode, so you can see where the inode is.

- Deleted files are kept in a two-level basket: if a used inode is
  unlinked, the name foo is moved to .deleted-<nr>.foo and is kept in
  the dcache (not on disk!), so it remains accessible via
  /proc/<pid>/fd/ (and via normal lookup, of course). Try the shell
  script in the appendix to see what it does!

- If the fs supports it (currently only ext2), deleted files go to the
  second-level basket as soon as the i_count becomes zero. The
  second-level basket keeps up to a constant number (currently 100) of
  deleted files, but only as long as no umount() is done (e.g. until
  the next reboot) and as long as sufficient space remains on the disk.
  If the second-level basket needs to be reduced (e.g. because of space
  shortage), the files are freed in LRU order. (A second sketch of this
  policy also follows these lists.)

Bugs:

- MP-safety is not yet fully implemented (missing vfs_lock()s etc.).
- The omirr code is not tested.
- There is lots of debug code / debug messages that will finally
  disappear...
- Please report any other bugs.

Things that I do not want to implement, but that should rather be done
by the maintainers of the corresponding packages / kernel parts:

- arch/{sparc,alpha} contain readdir implementations that I did not
  update, since I have no such machines. Either make the same changes
  as I did in fs/readdir.c, or perhaps we should think about making
  less code arch-dependent (I suspect only the dirent format differs
  between architectures, while the algorithm can remain the same).

- Filesystems that want to support the second-level basket must use a
  callback to free_ibasket() if they get short of space. I only
  implemented a provisional version for ext2 that is not quite correct
  (it calls free_ibasket() if a cylinder group gets short of space),
  just to demonstrate the effects. This should be fixed/implemented by
  the maintainers of the respective filesystems.

- "rm -rf some_directory" does multiple scans of the directory. The
  problem is that the first rm pass moves all files to their
  .deleted-0001.* form, so the directory *appears* to be non-empty.
  However, the directory on the *filesystem* *is* empty; it is only
  readdir() which simulates the existence of the basket entries.
  rm -rf then sweeps over the directory a second time, thus clearing
  all recursive entries forever. This should be fixed to clear only
  non-.deleted-* entries, so that the whole directory tree stays
  intact in its .deleted-* form.

Things that I want to implement in the near future:

- Simplify the code (less state information) - please tell me if you
  have good ideas for this.
- Move the whole thing to the 2.1.* kernel series (perhaps to be done
  by David Miller).
- Introduce an interface for VFS extensions, in a most general form.
  Please report ideas to me if you have some.
- Read/write locks for directories (and perhaps other kernel
  structures): an untested prototype is already implemented.
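To make the negative-entry idea concrete, here is a minimal userspace
sketch. It is *not* the patch's dcache code and all names in it
(cache_entry, cache_lookup, the sample inode numbers) are made up; it
only shows how a cached name with no inode behind it lets a repeated
lookup fail without touching the filesystem again.

/* negative-entry sketch: a cached name with ino == 0 is known not to exist */
#include <stdio.h>
#include <string.h>

struct cache_entry {
    const char *name;
    long        ino;      /* 0 means "negative": name known not to exist */
};

/* a tiny fixed cache standing in for the real dcache */
static struct cache_entry cache[] = {
    { "passwd", 4711 },   /* positive entry: name -> inode number */
    { "shadow", 4712 },
    { "core",      0 },   /* negative entry: a lookup already failed once */
};

/* return 1 if the name is cached (positive or negative), fill *ino */
static int cache_lookup(const char *name, long *ino)
{
    size_t i;
    for (i = 0; i < sizeof(cache) / sizeof(cache[0]); i++) {
        if (strcmp(cache[i].name, name) == 0) {
            *ino = cache[i].ino;
            return 1;
        }
    }
    return 0;             /* not cached: a real lookup would be needed */
}

int main(void)
{
    const char *names[] = { "passwd", "core", "hosts" };
    size_t i;

    for (i = 0; i < 3; i++) {
        long ino;
        if (!cache_lookup(names[i], &ino))
            printf("%-8s: cache miss, ask the filesystem\n", names[i]);
        else if (ino == 0)
            printf("%-8s: negative entry, fail without disk access\n", names[i]);
        else
            printf("%-8s: positive entry, inode %ld\n", names[i], ino);
    }
    return 0;
}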
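And here is an equally hypothetical userspace sketch of the
second-level basket policy: deleted files are parked in a fixed-size
basket and only freed for good in LRU order, either because the basket
is full or because the fs asks for space back (free_ibasket() in
spirit; the actual kernel signature is not shown here). BASKET_SIZE
and all function names are invented for the sketch.

/* second-level basket sketch: fixed capacity, freed in LRU order */
#include <stdio.h>
#include <string.h>

#define BASKET_SIZE 4          /* the patch keeps up to 100 entries */

static char basket[BASKET_SIZE][32];
static int  nbasket;           /* number of parked names, oldest first */

/* really release the least recently deleted entry (here: just report it) */
static void free_oldest(void)
{
    if (nbasket == 0)
        return;
    printf("  freeing '%s' for good\n", basket[0]);
    memmove(basket[0], basket[1], (size_t)(nbasket - 1) * sizeof(basket[0]));
    nbasket--;
}

/* what a filesystem short of space would call (free_ibasket()-style) */
static void fs_short_of_space(void)
{
    printf("fs short of space:\n");
    free_oldest();
}

/* park a just-deleted name in the basket, evicting in LRU order if full */
static void basket_add(const char *name)
{
    if (nbasket == BASKET_SIZE)
        free_oldest();
    snprintf(basket[nbasket++], sizeof(basket[0]), "%s", name);
    printf("parked '%s' (%d in basket)\n", name, nbasket);
}

int main(void)
{
    const char *victims[] = { "a.txt", "b.txt", "c.txt", "d.txt", "e.txt" };
    size_t i;

    for (i = 0; i < sizeof(victims) / sizeof(victims[0]); i++)
        basket_add(victims[i]);   /* the fifth add evicts a.txt */

    fs_short_of_space();          /* space shortage evicts b.txt */
    return 0;
}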
Other extension proposals, perhaps to be done by others:

- Change the i_count policy globally, such that most of it is done in
  the VFS and the fs'es no longer have to maintain it. For example,
  _lookup() and others that get an inode as parameter should not touch
  the i_count any more.

- Move some more tests (e.g. in _unlink()) from the fs'es to the VFS,
  to make it more generic and redundancy-free.

- Use the rw_lock at VFS level to control concurrency on directory
  inodes; the fs'es can be simplified *much* after this (I'm thinking
  of all that retry/versioning stuff in ext2 etc.). At least, if people
  fear losing some concurrency (because the current behaviour is in
  essence an optimistic strategy), introduce a flag
  FS_CONTROL_CONCURRENCY (or similar) that says whether the particular
  fs is responsible for locking or not.

- Another alternative would be to do the optimistic strategy at VFS
  level, at least in those places where performance is essential, such
  as concurrent readdir(). The VFS could maintain all those version
  counters and re-call the fs routines in case of update conflicts.
  Please tell me your ideas on that. (A hypothetical sketch of such a
  retry loop follows the appendix.)

- Some caching functionality, currently spread over many parts of the
  kernel, could be made more generic.

- Centralize all kernel debugging options in one Config.in, so that
  developers have less effort if their changes affect other parts of
  the kernel.

Send bugs, comments, flames, fixes etc. to
schoebel@informatik.uni-stuttgart.de

Greetings,

-- Thomas

-----------------------------------------------------------------------------

#!/bin/sh
# unlink a file that is still open and inspect it via /proc/<pid>/fd
i=3
while [ $i -gt 0 ]
do
    echo "Hello #$i" > file
    sleep 20 < file &
    rm -f file
    ls -l /proc/$!/fd/0
    cat /proc/$!/fd/0
    i=`expr $i - 1`
done
echo After:
ls -a
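For the VFS-level optimistic strategy proposed above, a purely
hypothetical sketch of the retry loop follows. None of this is in the
patch; struct dir, fs_readdir() and vfs_readdir() are stand-in names,
and in this single-threaded demo the retry never actually fires.

/* optimistic retry sketch: re-call the fs routine on update conflicts */
#include <stdio.h>

struct dir {
    unsigned long version;      /* bumped on every directory update */
};

/* stand-in for an fs readdir routine; returns 0 on success */
static int fs_readdir(struct dir *d)
{
    printf("scanning directory at version %lu\n", d->version);
    return 0;
}

/* VFS wrapper: retry the fs call when the directory changed under us */
static int vfs_readdir(struct dir *d)
{
    unsigned long seen;
    int ret;

    do {
        seen = d->version;           /* sample before the call */
        ret = fs_readdir(d);
    } while (seen != d->version);    /* changed meanwhile: retry */

    return ret;
}

int main(void)
{
    struct dir d = { 1 };
    return vfs_readdir(&d);
}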