StatsVsHierarchy - abrenner/robinhood-multifs-web GitHub Wiki

Stats vs Hierarchy Model

Stats Model

The stats model copies the stats that Robinhood maintains in its ACCT_STAT table as it scans the filesystem. This is ideal for most installations, as grabbing these stats is extremely fast.

The stats model tracks file and folder ownership regardless of location on the filesystem.

To use the stats model, request the following URL from cron or your browser:

http://my-website/index.php/cron/getStats
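For cron use, the request can be made from a small script. The following is a minimal sketch, not part of robinhood-multifs-web: the helper names `cron_url` and `trigger` and the one-hour timeout are assumptions, and `http://my-website` is the placeholder host used above.

```python
import urllib.request


def cron_url(base_url, action="getStats"):
    """Build the cron endpoint URL for the given site and action."""
    return f"{base_url.rstrip('/')}/index.php/cron/{action}"


def trigger(url, timeout=3600):
    """Request the endpoint and return the HTTP status code.

    The generous timeout is an assumption; stats generation may take a while.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status
```

A script run from cron could then call `trigger(cron_url("http://my-website"))` on whatever schedule suits the installation.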

Hierarchy Model

Only the 2.4.3 database schema of Robinhood Policy Manager is supported. Versions older than 2.4.3 may work; newer versions will not.

The hierarchy model attempts to address the shortcomings of the stats model in cases where file ownership should follow a file's location on the filesystem.

The idea of the hierarchy model is to count all files and folders in a specific directory tree towards a specific user and Unix group.

Currently, RBH does not support this. Its stats are based on file ownership, matched against the UID scheme on the system.

An example case:

Fullpath                             Owner (user:group)
/fhgfs/aebrenne/file1.txt            aebrenne:staff
/fhgfs/aebrenne/anotherUser.txt      hmangala:staff
/fhgfs/aebrenne/fileFromWeb.tar.gz   87291:21323

In the above example, the hierarchy model assumes that the second segment of Fullpath (aebrenne) is the user (configurable) and that the group is predefined (configurable). Without the hierarchy model, RBH would assign anotherUser.txt to the user hmangala and fileFromWeb.tar.gz to an unknown user, which makes disk usage figures and other reports inaccurate. This situation is extremely common in shared cluster environments, like UCI's High Performance Computing Cluster.

The hierarchy model ignores the actual user and group values and treats all files, folders and symlinks under the directory tree /fhgfs/aebrenne/ as owned by aebrenne, regardless of the ownership reported by the filesystem. It does this for each directory tree, and the trees are configured via the database.
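The difference between the two attributions can be sketched in a few lines of Python. This is an illustration only, not the code used by robinhood-multifs-web: the function names are hypothetical, the default group `staff` is an assumed stand-in for the predefined group, and the rows are the three example files from the table above.

```python
from collections import Counter

# Example rows from the table above: (fullpath, actual user, actual group).
ENTRIES = [
    ("/fhgfs/aebrenne/file1.txt", "aebrenne", "staff"),
    ("/fhgfs/aebrenne/anotherUser.txt", "hmangala", "staff"),
    ("/fhgfs/aebrenne/fileFromWeb.tar.gz", "87291", "21323"),
]


def hierarchy_owner(fullpath, user_segment=2, group="staff"):
    """Derive (user, group) from the path instead of the file's ownership.

    user_segment is 1-based: in /fhgfs/aebrenne/file1.txt the second
    segment is "aebrenne". The group is a fixed, preconfigured value.
    """
    segments = [s for s in fullpath.split("/") if s]
    if len(segments) < user_segment:
        raise ValueError(f"path has no segment {user_segment}: {fullpath!r}")
    return segments[user_segment - 1], group


def count_by_actual_owner(entries):
    """Count entries per (user, group) as reported by the filesystem."""
    return Counter((user, group) for _, user, group in entries)


def count_by_hierarchy(entries):
    """Count entries per (user, group) derived from the directory tree."""
    return Counter(hierarchy_owner(path) for path, _, _ in entries)
```

Counting by actual ownership spreads the three files over three different owners, while counting by hierarchy attributes all three to `("aebrenne", "staff")`.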

Because these stats have to be generated, this process is more time-consuming than the stats model. In addition, I currently do not know a better way to generate the hierarchy model stats (short of working on the RBH C code) than to grab the data from the database. Databases in general, including MySQL, are horrible with self-referencing data structures (one table with id and parent_id), and I have not found a way to write a single query that returns all children, grandchildren, great-grandchildren, etc. (if there is one, please let me know!). As such, the best design I have come up with is to create an index on the first 40 characters of the fullpath column in the ENTRIES table of each Robinhood database. Note that this is an ordinary prefix index, not a FULLTEXT index:

CREATE INDEX fullpath ON rbh_bio.ENTRIES (fullpath(40)) ;

Replace rbh_bio with the name of your database.
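An index on fullpath helps because every descendant of a directory tree shares the tree's path as a prefix, so a left-anchored LIKE pattern can find them without walking parent_id links. The sketch below illustrates the idea with Python's built-in sqlite3 as a stand-in for MySQL. The one-column ENTRIES table is a simplification, the helper names are hypothetical, and the last three paths are made up for illustration.

```python
import sqlite3


def like_prefix(tree):
    """Return a LIKE pattern matching every path below `tree`.

    LIKE wildcards in the path itself are escaped with a backslash.
    """
    escaped = (
        tree.rstrip("/")
        .replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    return escaped + "/%"


def count_descendants(conn, tree):
    """Count all entries, at any depth, under the given directory tree."""
    row = conn.execute(
        "SELECT COUNT(*) FROM ENTRIES WHERE fullpath LIKE ? ESCAPE '\\'",
        (like_prefix(tree),),
    ).fetchone()
    return row[0]


# In-memory stand-in for a Robinhood database (simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ENTRIES (id INTEGER PRIMARY KEY, fullpath TEXT)")
# SQLite has no prefix-length syntax, so the whole column is indexed here.
conn.execute("CREATE INDEX idx_fullpath ON ENTRIES (fullpath)")
conn.executemany(
    "INSERT INTO ENTRIES (fullpath) VALUES (?)",
    [
        ("/fhgfs/aebrenne/file1.txt",),
        ("/fhgfs/aebrenne/anotherUser.txt",),
        ("/fhgfs/aebrenne/fileFromWeb.tar.gz",),
        ("/fhgfs/aebrenne/sub/deep/file2.txt",),  # hypothetical nested file
        ("/fhgfs/aebrenne2/other.txt",),          # hypothetical sibling tree
        ("/fhgfs/hmangala/data.csv",),            # hypothetical other user
    ],
)

print(count_descendants(conn, "/fhgfs/aebrenne"))
```

The trailing slash in the pattern matters: without it, /fhgfs/aebrenne2/ would be counted towards aebrenne as well.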

Of course, the trade-off is that every INSERT and UPDATE now pays an extra penalty for maintaining the index. For a medium-sized installation (~75 million files, as we have seen at UCI's High Performance Computing Cluster), the penalty is manageable, given that we rely heavily on OS-level caching and use SSDs.

Creating the index only when needed and dropping it afterwards is another option. The time to create the index depends on the number of entries in the table and can range from a few minutes to hours. Dropping the index is nearly instant.

Please use the MySQLTuner script to optimize your MySQL installation. Take special note of:
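One way to keep the create-then-drop pattern tidy is a context manager that guarantees the index is dropped again even if stats generation fails. This is a hedged sketch rather than part of robinhood-multifs-web: the helper names are hypothetical, and `cursor` is assumed to be any object with an `execute(sql)` method, such as a DB-API cursor connected to MySQL.

```python
import re
from contextlib import contextmanager

# Database names are interpolated into SQL, so accept only plain identifiers.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")


def index_statements(database, prefix_len=40):
    """Return the (create, drop) statements for the fullpath prefix index."""
    if not _IDENT.match(database):
        raise ValueError(f"unexpected database name: {database!r}")
    create = (
        f"CREATE INDEX fullpath ON {database}.ENTRIES "
        f"(fullpath({int(prefix_len)}))"
    )
    drop = f"DROP INDEX fullpath ON {database}.ENTRIES"
    return create, drop


@contextmanager
def temporary_fullpath_index(cursor, database, prefix_len=40):
    """Create the index, run the enclosed block, then always drop the index."""
    create, drop = index_statements(database, prefix_len)
    cursor.execute(create)
    try:
        yield
    finally:
        cursor.execute(drop)
```

The hierarchy stats run would go inside a `with temporary_fullpath_index(cursor, "rbh_bio"):` block, so scans outside that window do not pay the index maintenance cost.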

innodb_file_per_table=1
innodb_buffer_pool_size

Obviously, the ideal solution is to generate the correct stats at insertion time during scanning, and work is being done on that. Until then, the approach above is a workable way to produce hierarchy model stats.

To use the hierarchy model, request the following URL from cron or your browser:

http://my-website/index.php/cron/getStatsHierarchical

Please note that you will need to fill out the configHierarchy table within the robinhood-multifs-web internal database. The required fields include fullpath and fsInodeNumber.