Los Alamos National Lab (LANL) Archive Daily Metadata Summaries:

Overview: From LANL we have 13 months of daily metadata summaries, produced by FSstats in the form of histograms, for a single archive comprising approximately 285 TB of disk and 1 PB (petabyte) of tape. Users are allocated a directory in the archive when they are allocated supercomputer time. They may or may not actually make use of the archive.

The metadata summaries come in two forms. One is a daily group of histograms over the entire archive's contents. The other covers the *same* archive, but with a separate set of histograms for each top-level directory. The top-level directories roughly correspond to individual users and projects.

File Descriptions: Each compressed file (the files named below are tar.gz archives) contains a number of directories, and each directory corresponds to a single day. In the total-archive fsstats data, each directory contains a single csv file holding that day's histograms. In the top-level-directory data, each directory contains multiple files, one per obfuscated top-level directory name, with histograms in the same format.
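As a rough illustration of the layout described above, the following sketch groups the files inside one tarball by the directory that contains them, so that each key should correspond to one day. The function name is hypothetical, and the exact nesting of directories inside the real tarballs is an assumption, so the files are grouped by their immediate parent directory rather than by a fixed depth.

```python
import posixpath
import tarfile
from collections import defaultdict


def group_members_by_directory(archive_path):
    """Map each directory inside the tarball to the sorted file names it holds."""
    groups = defaultdict(list)
    # "r:*" lets tarfile detect the compression (gzip for the .tar.gz files).
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if member.isfile():
                directory = posixpath.dirname(member.name)
                groups[directory].append(posixpath.basename(member.name))
    return {directory: sorted(names) for directory, names in groups.items()}
```

For a total-archive tarball, each directory would be expected to map to a single csv file. For a top-level-directory tarball, each directory would map to many files.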

Files:

Names:

full_archive_projects_fsstats_2010_07_10.tar.gz

full_archive_fsstats_2010_06_18.tar.gz

Location:

/external/ssrc-nas-1/ucsc/Data/crawls/scientific/archive/lanl/

Histogram Descriptions:

Each file, whether it is for the total archive summary or for an individual top-level directory, has the following histograms.

1)File Size-Files broken apart by reported EOF.

2)Capacity-The amount of space actually allocated to a file. For example, in a file system with 1k blocks, a very sparse file with an EOF at 8k may have been allocated only a single block.

3)Positive Overhead-The difference between a file’s reported EOF and actual space allocated *above* the EOF. It is effectively a measure of internal fragmentation. For example, a 5k file may have been allocated two 4k blocks (8k in total), effectively wasting 3k.

4)Negative Overhead-The difference between a file’s reported EOF and the amount of space that was allocated *beneath* the reported EOF. For example, a 10 MB file that was allocated only 1 MB would have 9 MB of negative overhead. This is effectively a measure of a file’s sparseness.

5)Directory Entries-A simple count of directory entries

6)Directory Size-A measure of directory size in KB

7)Filename length-A count of filename lengths

8)Link Count-A count of files by their number of links.

9)mtime (files)-This histogram tracks files’ last reported modification times. It is a count of how many files fall into the respective histogram bucket day ranges.

10)mtime (KB)-As above, this histogram tracks the modification time of contents, but this time it does it by the fraction of allocated space. Effectively it is a count of what fraction of the total allocated space falls into respective histogram day ranges.

11)atime(files)-This histogram tracks the last access time of files. NOTE-THIS IS NOT TO BE TRUSTED AS ATIMES WERE DISABLED WITHIN THE SYSTEM WE HAVE DATA ON.

12)atime(KB)-This histogram tracks the last access time of total archive contents. NOTE-THIS IS NOT TO BE TRUSTED AS ATIMES WERE DISABLED WITHIN THE SYSTEM WE HAVE DATA ON.
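The positive and negative overhead definitions in items 3 and 4 can be sketched as simple arithmetic on a file's reported EOF and its allocated space. This is only one reading of the descriptions above, not FSstats's actual implementation, and the function name is hypothetical.

```python
def overhead(eof_bytes, allocated_bytes):
    """Return (positive_overhead, negative_overhead) in bytes.

    Assumption: a file has positive overhead when more space is allocated
    than its EOF, and negative overhead when less space is allocated.
    """
    positive = max(allocated_bytes - eof_bytes, 0)
    negative = max(eof_bytes - allocated_bytes, 0)
    return positive, negative


KB = 1024
MB = 1024 * KB

# The 5k file stored in two 4k blocks wastes 3k.
print(overhead(5 * KB, 2 * 4 * KB))  # (3072, 0)

# The 10 MB file with only 1 MB allocated has 9 MB of negative overhead.
print(overhead(10 * MB, 1 * MB))  # (0, 9437184)
```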

Histogram Details:

Each histogram has a block of self-descriptive text summarizing its contents at a high level. For example:

histogram,atime (files)

count,5376412,items

average,43.399220,days

min,0,days

max,158,days

This says that the histogram is for atime (files) and covers 5376412 items. The average atime was about 43 days, with a minimum of 0 and a maximum of 158.
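A minimal sketch of reading such a summary block into a dictionary is shown below. It assumes that every summary line other than the `histogram` line has exactly three comma-separated fields (name, value, unit), as in the example above. The function name is hypothetical.

```python
def parse_summary(lines):
    """Parse the descriptive block of one histogram into a dict.

    Numeric entries are stored as (value, unit) tuples.
    """
    summary = {}
    for line in lines:
        parts = [part.strip() for part in line.strip().split(",")]
        if len(parts) == 2 and parts[0] == "histogram":
            summary["name"] = parts[1]
        elif len(parts) == 3:
            key, value, unit = parts
            # Keep whole numbers as int, everything else as float.
            number = float(value) if "." in value else int(value)
            summary[key] = (number, unit)
    return summary


block = [
    "histogram,atime (files)",
    "count,5376412,items",
    "average,43.399220,days",
    "min,0,days",
    "max,158,days",
]
summary = parse_summary(block)
print(summary["name"], summary["count"], summary["max"])
```

Blank lines between entries are ignored, so the block can be passed in exactly as it appears in the file.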

At the start of each histogram, there is a descriptive row:

bucket min,bucket max,count,percent,cumulative pct,val count,percent,cumulative pct

Bucket Min-the minimum value of the bucket

Bucket Max-the maximum value of the bucket

count-the count of individual items that fall into this range

percent-the percentage of all counted items that fall into this bucket

cumulative pct-a running total of the current and all prior percentages

Val Count (Value Count)-the total value of the items that fall into a bucket. For example, 10 2kb files would have a value of 20.

percent (following val count)-the percentage of the total value count that falls into this bucket

cumulative pct (following val count)-a running total of the current and all prior value-count percentages

All data below that row corresponds to those headings.
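Because the header row repeats the names `percent` and `cumulative pct`, reading the rows with a name-keyed reader such as `csv.DictReader` would let the second pair overwrite the first. The sketch below reads the rows by position instead and gives each column a distinct name. The `Bucket` field names and the function name are hypothetical, the sample numbers are made up for illustration, and the code assumes every data cell is a plain number with no `%` sign.

```python
import csv
from typing import NamedTuple


class Bucket(NamedTuple):
    bucket_min: float
    bucket_max: float
    count: int
    count_pct: float
    count_cum_pct: float
    val_count: float
    val_pct: float
    val_cum_pct: float


def parse_bucket_rows(lines):
    """Turn the data rows of one histogram into Bucket tuples."""
    buckets = []
    for row in csv.reader(lines):
        if len(row) != 8:
            continue  # blank line or summary line
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            continue  # header row or other non-numeric line
        buckets.append(Bucket(values[0], values[1], int(values[2]), *values[3:]))
    return buckets


# Made-up rows, shaped like the header shown above.
sample = [
    "bucket min,bucket max,count,percent,cumulative pct,val count,percent,cumulative pct",
    "0,2,10,50.0,50.0,20,25.0,25.0",
    "2,4,10,50.0,100.0,60,75.0,100.0",
]
for bucket in parse_bucket_rows(sample):
    print(bucket.bucket_min, bucket.bucket_max, bucket.count, bucket.val_count)
```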

NOTES (Very important! Do not skip!)

1)Missing Days: There are a few missing days scattered throughout.

2)atimes: As mentioned earlier, atime tracking was explicitly disabled. Atime histograms are therefore not accurate.

3)Many of the top-level directory histograms are effectively empty. This means that, for whatever reason, that user/project did not make use of their allocated space.

4)Many files are very sparse. Though the total file size count adds up to over a petabyte, actual allocated storage is generally around 100 TB. This is worth noting, as the mismatch between the counts can be confusing if you are not aware of it.

5)If a particular histogram bucket has no items in it, it is skipped! For example, if there are no items between 8 and 16 the histogram would note 4-8 and 16-32 but would *not* have a bucket for 8 to 16.
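When comparing histograms across days, it can help to put the skipped buckets from note 5 back in with a count of zero. The sketch below does this under the assumption that bucket boundaries double at each step, as the 4-8, 8-16, 16-32 example suggests. This may not hold for every histogram type, so a gap that does not line up with doubling is left as it is. The function name and the counts are hypothetical.

```python
def fill_skipped_buckets(buckets):
    """Insert zero-count buckets into gaps between (min, max, count) tuples.

    Assumption: each bucket's max is twice its min.
    """
    filled = []
    previous_max = None
    for bucket_min, bucket_max, count in sorted(buckets):
        # Walk across the gap one doubling step at a time.
        while (
            previous_max is not None
            and previous_max > 0
            and previous_max * 2 <= bucket_min
        ):
            filled.append((previous_max, previous_max * 2, 0))
            previous_max *= 2
        filled.append((bucket_min, bucket_max, count))
        previous_max = bucket_max
    return filled


# Made-up counts: the 8-16 bucket was skipped because it was empty.
print(fill_skipped_buckets([(4, 8, 3), (16, 32, 5)]))
# [(4, 8, 3), (8, 16, 0), (16, 32, 5)]
```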

SSRCWiki: SoftwareTraces/LANLArchiveCrawls (last edited 2010-09-09 17:46:23 by IanAdams)