Suggestions from Dave - NCAR/gufiwrappers GitHub Wiki

Great suggestions from Dave:

Sidd,

while it's fresh, here are my thoughts. these are not necessarily exhaustive, but they're most of the major items. more examples in the documentation may help a lot -- they could be less detailed than the couple you have, just the command and a single sentence of explanation.

gwrap

is there a clearer way to select the storage system than by the absolute path? it seems "/" is not particularly unique to HPSS. perhaps a named argument

--storage=[glade|campaign|hpss]

and allow for an optional path relative to the root path for querying subsets.
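One way the named argument could be wired up, sketched with Python's argparse. The `--storage` choices come from the suggestion above; the root paths in `STORAGE_ROOTS`, the positional `subpath`, and the `resolve_trees` helper are assumptions for illustration only, not the current gwrap interface.

```python
import argparse
import posixpath

# Hypothetical root paths; the real roots for each system are not given here.
STORAGE_ROOTS = {
    "glade": "/glade",
    "campaign": "/campaign",
    "hpss": "/",
}


def build_parser():
    parser = argparse.ArgumentParser(prog="gwrap")
    # Named selection of the storage system instead of an absolute path.
    parser.add_argument(
        "--storage",
        choices=sorted(STORAGE_ROOTS),
        default=None,
        help="storage system to query (default: all of them)",
    )
    # Optional path relative to the storage root, for querying a subset.
    parser.add_argument(
        "subpath",
        nargs="?",
        default="",
        help="path relative to the storage root",
    )
    return parser


def resolve_trees(args):
    """Return the absolute tree(s) to query for the parsed arguments."""
    names = [args.storage] if args.storage else sorted(STORAGE_ROOTS)
    sub = args.subpath.strip("/")
    return [
        posixpath.join(STORAGE_ROOTS[name], sub) if sub else STORAGE_ROOTS[name]
        for name in names
    ]


if __name__ == "__main__":
    print(resolve_trees(build_parser().parse_args()))
```

With no arguments at all, this sketch falls back to every known storage root, which matches the "do something useful by default" request further down.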

From the docs, I really didn't get a good idea of how/why to use gwrap (or why it's called gwrap -- what is it wrapping? and should users care?). Needs more explanation with the arguments and/or some examples. Perhaps gwrap should be "glist" instead, since it just produces file lists (i think?).

It also appears to fail silently if you don't use the --list option. There should be a default list format (all fields?) if one is not provided. That will make it simpler to use and more robust.

it'd be nice if gwrap (or glist?) worked if you only invoked it as "gwrap" with no arguments, such that it listed all of my own files in all storage with all data fields for all time periods. if that's too intensive, then gwrap --storage=hpss (or however you choose to limit by storage system).

grprt

It's not clear which list/.dat file grprt acts on. does it do all of them every time? the script gives no way to select one .dat file or another. (and if you did provide that option, the file naming approach would make picking files challenging.)

what about (optionally) allowing the user to specify the output/input file name in both gwrap and grprt?

the default report is not terribly intuitive to me, particularly the histogram part. You need to document how to read the histogram chart and maybe use fewer bins by default (4 or 5, maybe), or even 0 as a default (i.e., no histogram by default).

can you add write-period / read-period options to grprt, to allow filtering from within a larger gwrap result? (that would allow you to run a big gwrap once, then lots of grprts on that same file.)
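A sketch of the kind of post-filter this implies, assuming each record from the gwrap output has already been parsed into a dict with an `mtime` field in epoch seconds. The record layout and the `filter_by_write_period` name are hypothetical; the actual .dat format is not described here.

```python
from datetime import datetime, timezone


def in_period(epoch_seconds, start, end):
    """True if the timestamp falls in the half-open interval [start, end)."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return start <= stamp < end


def filter_by_write_period(records, start, end):
    """Keep only records whose (assumed) 'mtime' field lies in [start, end)."""
    return [r for r in records if in_period(r["mtime"], start, end)]
```

A read-period filter would look the same with an access-time field in place of `mtime`, so one large gwrap result could feed many narrower reports.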

i don't understand how the "treename" argument would be used for grprt. examples may be helpful here.

the file names for the grprt and the raw gwrap output need to be more distinct, either by filename extension or prefix. as is, they only differ by timestamp. (i found myself wanting to bring the results to my laptop and it's easiest to do that by scp'ing all rep* files, but that dragged over both the very small reports and the very large raw .dat files.)

I would probably reorder the columns in the by-user report, to put the user first and histogram last.

user, num_files, size(TB), %age, cum%, [histogram]

Similarly, I'd probably reorder the columns in the by-project report, to put the project first.

project, num_files, size(TB), %age, cum%, [histogram]

i'm not a huge fan of the "e+03" number format for size and file counts. I'd prefer the actual numbers, but I understand that makes file columns difficult.
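For comparison, a small sketch of plain-number formatting with Python format specs. The helper names and the column width of 12 are assumptions; right-aligning inside a fixed width is one way to keep columns lined up without falling back to exponent notation.

```python
def format_count(n):
    """Integer file count with thousands separators."""
    return f"{n:,}"


def format_size_tb(size_tb, decimals=2):
    """Size in TB as a plain decimal number with thousands separators."""
    return f"{size_tb:,.{decimals}f}"


if __name__ == "__main__":
    print(f"{1530:.2e}")                    # 1.53e+03 (exponent style)
    print(format_count(1530))               # 1,530
    print(format_size_tb(1234.5))           # 1,234.50
    print(f"{format_count(1530):>12}|")     # right-aligned in 12 characters
```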

what about allowing for a less formatted report output, such as .csv format (minus the histogram). That would allow users to easily pull it into Excel for more slicing and dicing.
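A rough sketch of a CSV version of the by-user report, using the column order proposed above and leaving out the histogram. The input rows, the `report_to_csv` name, sorting by size, and computing the percentage from size are all assumptions about how the report is built.

```python
import csv
import io

USER_COLUMNS = ["user", "num_files", "size_tb", "pct", "cum_pct"]


def report_to_csv(rows):
    """Render rows (dicts with user, num_files, size_tb) as CSV text."""
    total = sum(r["size_tb"] for r in rows)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(USER_COLUMNS)
    cumulative = 0.0
    # Largest consumers first, so the cumulative column is easy to read.
    for r in sorted(rows, key=lambda r: r["size_tb"], reverse=True):
        pct = 100.0 * r["size_tb"] / total if total else 0.0
        cumulative += pct
        writer.writerow(
            [r["user"], r["num_files"], r["size_tb"], f"{pct:.1f}", f"{cumulative:.1f}"]
        )
    return out.getvalue()


if __name__ == "__main__":
    sample = [
        {"user": "user2", "num_files": 5, "size_tb": 1.0},
        {"user": "user1", "num_files": 10, "size_tb": 3.0},
    ]
    print(report_to_csv(sample), end="")
```

The same function would cover the by-project report with `project` as the first column.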

And as with gwrap, it'd be nice if grprt did something useful when invoked just as "grprt" (or "grprt -i gwrap-list-file.dat"), e.g., it runs the "by-users" report with no histogram on the root treename.

Dave

Thanks, Sidd. Yes, some elaboration in the documentation may address some of my comments (e.g. using --list or not). And also just some simple examples with very brief explanations. What I have in mind is something like

  • gwrap

Using gwrap alone extracts information for the invoking user's files on all storage systems.

  • gwrap --owners=user1,user2 /

Extracts HPSS file information for user1 and user2.

and so on.

Dave