1. Introduction
In the field of oceanographic research, one of the most common measurement techniques
has been to acquire “depth profiles,” i.e., to measure a parameter, or
group of parameters, as a function of depth. One of the earliest, and perhaps the most
common, profile measurements is the temperature profile. As technology has advanced,
scientists have added many more sensors to their profiling equipment, such as sensors
that measure conductivity, chlorophyll fluorescence, and optical backscatter. For
decades, such sensor packages were lowered on a hydro-wire, but more recently, they have
been deployed on paravanes that can be towed behind a ship and
“yo-yo’ed” up and down to acquire a dense series of
“profiles.” (Although such measurements are actually making a 2-D
sawtooth or sine-wave pattern behind the ship, each vertical excursion is often treated
as a depth profile. This approximation is justified by the fact that horizontal
coherence scales are usually several orders of magnitude greater than vertical coherence
scales [1. D. Olbers, Ocean Waves Volume 3: Oceanography, J. Sündermann, ed.
(Springer-Verlag, 1986), Vol. 3, Chap. 6; 2. R. E. Thomson, S. E. Roth, and J. Dymond,
“Near-inertial motions over a mid-ocean ridge; effects of topography and hydrothermal
plumes,” J. Geophys. Res. 95, 7261–7278 (1990)].) An example of a commercially available
paravane system is the Chelsea
Aquashuttle [3; 4. E. Keegan, “Aquashuttle monitors Exxon Valdez oil spill,” Spill
Science & Technology Bulletin 2, 87–88 (1995); 5. R. Burt, “The growth in towed
undulating vehicles for oceanographic data gathering,” in Oceans 2000 MTS/IEEE
0-7803-6551-8 and 0-7803-6552, Volume 1, Providence, RI, Sept 11, 2000 (IEEE, New York,
2000), 641–645]. Another example is the SeaSoar
system built and deployed by the Woods Hole Oceanographic Institution (WHOI) [6. K. H.
Brink, F. Bahr, and R. K. Shearman, “Alongshore currents and mesoscale variability near
the shelf edge off northwestern Australia,” J. Geophys. Res. 112, C05013,
doi:10.1029/2006JC003725 (2007)]. Such systems can acquire thousands of
multi-variate profiles in a period of a few days.
Besides the paravane-based systems, the recent proliferation of Autonomous Underwater
Vehicles (AUVs) has introduced still another method of gathering densely sampled,
multi-variate profile data. For example, Rutgers University has deployed Slocum AUV
gliders (built by Webb Research, Inc.) off the New Jersey coast, in the Mediterranean
Sea, in the Baltic Sea, and off the coast of Australia [7. O. Schofield, J. Kohut,
D. Aragon, L. Creed, J. Graver, C. Haldeman, J. Kerfoot, H. Roarty, C. Jones, D. Webb,
and S. Glenn, “Slocum gliders: robust and ready,” J. Field Robotics 24, 1–13 (2007)].
Such glider deployments can last several weeks and obtain more than 10,000
profiles. The main point is that, unlike the time-consuming deployments of sensors from
a ship via a hydro-wire, the AUV vastly reduces the time between individual depth
“profiles” from days or hours to minutes. As a result, AUVs collect
unprecedented data volumes that are typically many orders of magnitude greater than
traditional ship-based water profiling surveys. However, due to the inherent spatial
coherence scales in the ocean, each profile is generally not significantly different
from the preceding profile.
The influx of this “over-sampled” AUV data provides much finer temporal
and spatial sampling, but it poses at least two analysis problems: (1) how to
examine/quality check thousands of data profiles per AUV survey, and (2) which of the
profiles to archive to capture the inherent ocean variability that was measured by the
AUV. With respect to the first problem, even with modern high-speed computers, the
sheer number of profiles gathered from these deployments makes graphing and processing
the data time consuming. Furthermore, it is not uncommon to exhaust a computer’s
available memory when displaying results for a single AUV deployment.
With respect to the second problem—which and how many of the original profiles
to archive—this issue directly impacts the World-wide Ocean Optics Database
(WOOD). For such historical archives, the question really becomes one of how to
sub-sample the original space-time series of profiles so that the resultant dataset
accurately represents the original conditions but does not over-sample the environment.
For example, in open ocean regions free of oceanic fronts, hundreds of successive
profiles may look virtually identical. In contrast, when a front is crossed, or when one
approaches a shoreline, conditions are likely to vary quite rapidly. Our solution was to
develop data-thinning software that intelligently and automatically extracts only the
essential data from the original dataset, saving only those profiles that are necessary
to accurately represent the collected data. This software is able to
“thin” the data based on several parameters, including the distance
between profiles, the time between profiles, and, more importantly, the differences in
the structure of the data between profiles. Furthermore, it simultaneously keeps track
of each data parameter in the AUV deployment, such as temperature, salinity, and the beam
attenuation and optical backscattering coefficients, and it ensures that if a profile
of one parameter is kept as a unique feature in the set, then the corresponding
profiles of the other parameters are kept as well.
This paper documents the capabilities and algorithms associated with this software and
describes in detail how the software intelligently “thins” over-sampled
datasets. Examples are provided to show the effectiveness of this methodology in
processing AUV or SeaSoar data. It also discusses future options for the
software’s development and introduces features, such as automatic spike editing,
as possibilities for future improvements in software functionality.
In applying our data-thinning algorithm, we make use of the following terms. First,
during our initial data processing, we organize the data into single-variable files.
These files encompass all data of one data type gathered from one
“cruise” or AUV deployment. Each file contains many profiles: each
profile is a collection of a single parameter (such as chlorophyll concentration) versus
depth obtained at (nominally) one geographical location and one time. Profiles have a
metadata header that maintains important features such as location, date, time, a common
cruise number, and the identification number of the profile within the larger file. The
profile/file structure of these datasets is important to remember, for it determines how
the data-thinning algorithm uniquely traverses and analyzes sets of files from many
different parameters.
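The profile/file organization described above can be pictured with a small sketch. The class and field names below (`Profile`, `ParameterFile`, `cast_number`, and so on) are hypothetical and chosen only for illustration; the on-disk layout actually used by the software is not specified here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Profile:
    """One parameter versus depth at (nominally) one location and one time."""
    cruise_number: int      # common to every profile from one deployment
    cast_number: int        # identification number of the profile within the file
    latitude: float         # decimal degrees
    longitude: float        # decimal degrees
    time: datetime
    depths: List[float] = field(default_factory=list)   # metres
    values: List[float] = field(default_factory=list)   # one value per depth


@dataclass
class ParameterFile:
    """All profiles of a single parameter from one cruise or AUV deployment."""
    parameter: str
    profiles: List[Profile] = field(default_factory=list)


# Illustrative values only, not measured data.
chl = ParameterFile("chlorophyll")
chl.profiles.append(
    Profile(cruise_number=1, cast_number=1, latitude=39.5, longitude=-74.0,
            time=datetime(2006, 6, 30, 12, 0),
            depths=[1.0, 2.0, 3.0], values=[0.8, 0.9, 0.7])
)
```

Keeping the cast number in every header is what later allows the same profile to be located in each single-variable file.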
2. Methodology and application
As mentioned above, the World-wide Ocean Optics Database (WOOD) provides an archive of
bio-optical data from a wide variety of sources. One of these sources is the AUV, a
source that has become increasingly common in the past few years. WOOD became a logical
testbed for the development and testing of software that can compare and adaptively
reduce (or “thin”) raw AUV data files. (The same methods apply to
SeaSoar data.) As described further below, AUV data to be stored in the WOOD are thinned
based on four criteria: distance between measurements, elapsed time relative to previous
profiles, vertical extent of the data, and changes in the relative structure of
successive data profiles.
AUVs collect data profiles while they traverse pre-programmed or user-directed paths
through the ocean. Adjacent profiles along such a traverse are in close proximity to one
another (usually <1 km), and therefore rarely differ much from one another. In fact,
in many open ocean areas, a profile (such as temperature or chlorophyll) may not change
significantly for tens or even hundreds of kilometers. As a result, thinning such data
via a criterion that is solely based on statistically significant variations could
result in huge spatial gaps in the final output. To avoid such problems, one requirement
of the data-thinning software is that, regardless of meaningful changes within the data,
it will keep a complete set of profiles at some “reasonable”
(user-specified) minimum spatial interval.
The second requirement of the data-thinning algorithm involves elapsed time. Even if the
AUV were to make continuous circles at one location, and presuming the profile remained
constant during that time, one would still want to keep a sufficient number of those
profiles to provide a “representative” time series of the original
series. The raw dataset must therefore be sub-sampled in relation to both space and
time. If the original dataset provides a profile every 5 minutes, then storing a
profile, for example, every 4 hours would provide a good representation of that
day’s data while saving precious storage space and dramatically reducing loading
and retrieval times in a data archive like WOOD.
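As a rough illustration of the elapsed-time criterion alone, the sketch below keeps a profile whenever a minimum interval has passed since the last saved profile. The function name `thin_by_elapsed_time` is hypothetical, and the inclusive comparison (`>=`) is an assumption. With profiles every 5 minutes and a 4-hour interval, one profile in 48 is retained.

```python
from datetime import datetime, timedelta


def thin_by_elapsed_time(times, min_interval):
    """Return indices of the profiles kept when one profile is saved each time
    at least `min_interval` has elapsed since the last saved profile."""
    if not times:
        return []
    kept = [0]  # the first profile is always the initial reference
    for i in range(1, len(times)):
        if times[i] - times[kept[-1]] >= min_interval:
            kept.append(i)
    return kept


# One day of profiles at 5-minute spacing (288 profiles).
start = datetime(2006, 6, 30)
times = [start + timedelta(minutes=5 * k) for k in range(288)]
kept = thin_by_elapsed_time(times, timedelta(hours=4))
print(len(times), len(kept))  # 288 6
```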
In addition to the fixed (and somewhat arbitrary) geospatial and chronological thinning
criteria described above, additional criteria are imposed to ensure that a sufficient
number of profiles are retained to accurately capture any significant physical
variability that occurs in adjacent profiles, such as when the depth extent (usually due
to bathymetric variability) varies significantly or when an oceanographic front is
crossed. To meet these data-sensitive thinning criteria, the algorithm assigns the first
data profile in a file to be a “reference” profile. It then iterates
through every subsequent profile in the data file, comparing the current profile against
the reference profile, which is always the most recently saved profile in the thinned file. This
comparison involves an examination of the depth extent of the data and a calculation of
the change in the structure exhibited by the profile. For the change in the structure, a
percent change as well as an absolute mean change is computed. The percent change
specification is not given on a parameter-by-parameter basis because it is meant to be
used as a single metric for the entire thinning process. In contrast, parameter-specific
absolute mean change criteria are used to handle the situation that results when data
values change from one very small value (e.g., 0.01) to another (e.g., 0.02). While
the percent change from 0.01 to 0.02 is 100%, the mean change value is only 0.01; thus,
the mean change criterion is useful when deciding whether or not to keep profiles having
such small absolute changes in structure.
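The 0.01 to 0.02 case can be worked through numerically. The sketch below assumes one plausible definition of the two metrics (summed absolute difference relative to the summed reference values, and the mean absolute difference); these definitions, the function names, and the requirement that both profiles share the same depth grid are assumptions made for illustration, not the software's documented formulas.

```python
def mean_abs_change(y, yref):
    """Mean of |y_i - yref_i| over the depth points of the two profiles."""
    return sum(abs(a - b) for a, b in zip(y, yref)) / len(y)


def percent_change(y, yref):
    """Summed |y_i - yref_i| expressed as a percentage of summed |yref_i|."""
    total_ref = sum(abs(b) for b in yref)
    if total_ref == 0.0:
        return float("inf") if any(a != 0.0 for a in y) else 0.0
    return 100.0 * sum(abs(a - b) for a, b in zip(y, yref)) / total_ref


def is_significant(y, yref, pct_threshold, abs_threshold):
    """Both the percentage and the absolute mean criteria must be exceeded."""
    return (percent_change(y, yref) > pct_threshold
            and mean_abs_change(y, yref) > abs_threshold)


yref = [0.01, 0.01, 0.01, 0.01]   # reference profile, very small values
y = [0.02, 0.02, 0.02, 0.02]      # profile being tested
print(round(percent_change(y, yref), 6))    # 100.0
print(round(mean_abs_change(y, yref), 6))   # 0.01
print(is_significant(y, yref, 30.0, 0.05))  # False
```

Here the 100% relative change would pass a 30% percentage threshold, but the profile is still discarded because the mean change of 0.01 is below the absolute threshold of 0.05.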
The respective change equations are given below:
In the equations above, the variables y_i and yref_i are continually updated as new
profiles are tested and saved: y_i represents the ith data point in the profile being
tested, while yref_i is the ith data point in the profile that was previously saved, or
the “reference” profile. After each equation is evaluated, the results are compared to
the threshold change criteria the user has defined as part of a user-input file (see
Table 1), and, if the change criteria are exceeded, then the test profile becomes the new
“reference” profile. Nominal values for the various change criteria are summarized in
Table 1, but users are free to modify any of these settings in a text input file.
Table 1. Nominal Change Criteria Used to Thin Vertical Profile Data
| Elapsed time | Distance | Depth extent change | Percentage change in structure |
|---|---|---|---|
| 4 hr | 10 km | 30 % | 30 % |
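For readers who want the two structure metrics in symbolic form, one way to write them that is consistent with the definitions of y_i and yref_i is sketched below. The exact normalization is an assumption, as are the symbols P and M; Z_1 and Z_2 denote the percentage and absolute mean change thresholds, and N is the number of depth points common to the two profiles.

```latex
\[
P = 100 \times
    \frac{\sum_{i=1}^{N} \left| y_i - y_{\mathrm{ref},i} \right|}
         {\sum_{i=1}^{N} \left| y_{\mathrm{ref},i} \right|} > Z_1 ,
\qquad
M = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - y_{\mathrm{ref},i} \right| > Z_2
\]
```

Under this form, a profile that changes uniformly from 0.01 to 0.02 gives P = 100% and M = 0.01, matching the worked values quoted in the text.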
Change criteria for data thinning based on the absolute mean value strongly depend on
the parameter, the season, and the ambient conditions. For example, when thinning a
dataset of absorption and beam attenuation (at multiple wavelengths), temperature,
salinity, and uncalibrated chlorophyll fluorometry, the following absolute mean change
criteria resulted in about a 65 % reduction in the overall file size:
Absorption (400 to 700 nm): 0.05/m
Beam attenuation (400 to 700 nm): 0.2/m
Temperature: 0.5 °C
Salinity: 0.2 ppt
Fluorometry: 0.05 V
To reiterate, the relative and absolute mean changes are computed for each of the
parameters collected in a given profile. The percentage change criteria result in the
saving of a given profile only if the absolute mean change also exceeds the
user-provided threshold. The additional constraint for a minimum mean change of Z2 is
based on the fact that at some low parameter value, even a large percentage change is
unimportant. For example, if chlorophyll falls below 0.2 mg/m³, then even a change of
0.1 mg/m³ is still too small to be significant. Appendix A gives a more complete list of
recommended absolute change thresholds to use with Eq. (2) to determine whether to save a
given profile.
As previously discussed, AUV deployments collect more than one type of data. In
deployments of interest to the WOOD archives, the AUVs are typically equipped with
sensors to measure stratification (temperature and conductivity), biological properties
(e.g., chlorophyll and bioluminescence), and optical properties (e.g., beam attenuation
and scattering coefficients) as a function of depth. Because the data are multi-variate,
the data-thinning algorithm must be able to concurrently examine the depth profiles of
each variable. For a given profile, if any one of the variables exhibits a sufficient
change to justify keeping that parameter’s depth profile, then the profiles from
all the other parameters are stored as well. For example, temperature and salinity might
change less than the specified criteria, but if the chlorophyll concentration exceeds
its threshold, then all three variables are stored in the thinned data files. This
approach ensures the maintenance of a synoptic, coherent representation of the
multi-variate data: all thinned files contain the same profiles so that data may be
compared across various parameters. This method produces a matching set of files that
can be easily compared using the “joined query” option in WOOD. (The
joined query option searches across multiple parameter tables using the unique profile
identifier number that ensures a given profile is from the same original multi-variate
profile as that selected from another parameter table.)
To run the data-thinning program (called SUBSMP4.EXE), one first sets up a
simple input file, such as the one shown in Table 2. This file contains the filenames
of all files (parameters) to be thinned.
Next, the distance criteria for thinning are specified, followed by elapsed time and
percentage depth change criteria. In this example, if any of the following fixed
conditions occur, then the profile becomes the new reference profile and is added to the
thinned file:
• The distance from the reference profile exceeds 5 nmi.
• Elapsed time exceeds 1.5 hr.
• The depth extent of the profile changes by more than 30 %.
The input file also has the parameter-specific Z1 (percentage change in structure) and
Z2 (absolute mean change) “threshold” values to use to identify
“significant” feature changes (i.e., a change that causes that profile
to be kept in the thinned file regardless of the changes in the distance, elapsed time,
or depth extent).
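The three fixed conditions listed above can be sketched as a single test. This is a minimal illustration under stated assumptions: the haversine formula is used for distance (the method used by the software is not stated), the depth-extent change is taken relative to the reference profile, and the function names and dictionary keys are hypothetical.

```python
import math

# Mean Earth radius (6371.0088 km) converted to nautical miles (1 nmi = 1.852 km).
EARTH_RADIUS_NMI = 6371.0088 / 1.852


def distance_nmi(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in nautical miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(a))


def exceeds_fixed_criteria(ref, cur, max_nmi=5.0, max_hours=1.5, max_depth_pct=30.0):
    """ref and cur are dicts with keys lat, lon, hours, and max_depth (metres)."""
    far = distance_nmi(ref["lat"], ref["lon"], cur["lat"], cur["lon"]) > max_nmi
    late = (cur["hours"] - ref["hours"]) > max_hours
    if ref["max_depth"] > 0:
        depth_pct = 100.0 * abs(cur["max_depth"] - ref["max_depth"]) / ref["max_depth"]
    else:
        depth_pct = 0.0
    return far or late or depth_pct > max_depth_pct


ref = {"lat": 0.0, "lon": 0.0, "hours": 0.0, "max_depth": 100.0}
near = {"lat": 0.0, "lon": 0.05, "hours": 0.5, "max_depth": 110.0}    # about 3 nmi away
shallow = {"lat": 0.0, "lon": 0.05, "hours": 0.5, "max_depth": 60.0}  # 40 % shallower
print(exceeds_fixed_criteria(ref, near), exceeds_fixed_criteria(ref, shallow))  # False True
```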
This input file, called SUBSMP4.INP, is provided to the thinning program from a DOS
window by using the input re-direction symbol (SUBSMP4 <SUBSMP4.INP). The inputs are
stored in memory and the program begins iterating through profiles in each input file.
The first profile from each file is chosen as the initial reference profile, and this
profile is then compared to subsequent profiles to determine whether they should be
kept. For every parameter being thinned, this reference profile is copied to a new,
thinned file. The algorithm then chooses the next profile to compare with the reference,
and the absolute mean difference between them, as well as the percentage change of mean
difference, is computed and compared to the threshold values provided at the start of
the process. If the two values derived from the comparison are greater than their
respective thresholds, then the profile is saved to the new thinned file and is chosen
as the new reference profile. (This test is done across all the parameters being
thinned, so if any one parameter experiences a change that exceeds the threshold
criterion, then this latest profile is saved for all the parameters and assigned as the
new reference profile.) If the threshold is not met by any of the parameters under
consideration, the profile is ignored, and the subsequent profile is tested. The process
repeats itself until the entire set of files has been examined. In this way, an accurate
representation of each parameter is generated, and each file contains only the profiles
that the algorithm deemed noteworthy across all data types.
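The reference-profile loop described above can be summarized in a short sketch. It covers only the structure test (the fixed distance, time, and depth-extent criteria are omitted for brevity), it assumes every parameter file holds the same profiles in the same order on a common depth grid, and the metric definitions and all names are illustrative assumptions rather than the FORTRAN implementation.

```python
def mean_abs_change(y, yref):
    """Mean of |y_i - yref_i| over the depth points."""
    return sum(abs(a - b) for a, b in zip(y, yref)) / len(y)


def percent_change(y, yref):
    """Summed |y_i - yref_i| as a percentage of summed |yref_i|."""
    total_ref = sum(abs(b) for b in yref)
    if total_ref == 0.0:
        return float("inf") if any(a != 0.0 for a in y) else 0.0
    return 100.0 * sum(abs(a - b) for a, b in zip(y, yref)) / total_ref


def thin(files, thresholds):
    """files: {parameter: [profile, ...]}, each profile a list of values.
    thresholds: {parameter: (percent_threshold, abs_threshold)}.
    Returns the indices of the kept profiles, shared by every parameter."""
    n = len(next(iter(files.values())))
    if n == 0:
        return []
    kept = [0]  # the first profile is the initial reference
    for i in range(1, n):
        ref = kept[-1]
        for name, profiles in files.items():
            z1, z2 = thresholds[name]
            y, yref = profiles[i], profiles[ref]
            if percent_change(y, yref) > z1 and mean_abs_change(y, yref) > z2:
                kept.append(i)  # one parameter triggers: keep for all parameters
                break
    return kept


files = {
    "temperature": [[10.0, 9.0]] * 5,
    "chlorophyll": [[0.5, 0.5], [0.5, 0.5], [0.52, 0.5], [1.5, 1.2], [1.5, 1.2]],
}
thresholds = {"temperature": (15.0, 0.5), "chlorophyll": (15.0, 0.2)}
print(thin(files, thresholds))  # [0, 3]
```

In this example the temperature never changes, yet profile 3 is retained for both parameters because the chlorophyll change exceeds both of its thresholds.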
Table 2. Sample Input File for the SUBSMP4 Executable Data-Thinning Program
| (Text that is separated from numerical values by a comma is a comment and is ignored by the program.) |
| 2, #files to decimate/create; the file names are: |
| bb470.mat |
| bb532.mat |
| 5, difference in nmi to keep |
| 1.5, difference in hours to keep |
| 30, difference in percent depth to keep |
| 15, mean percentage change for file #1 |
| .0002, mean change in value (/m) for file #1 |
| 15, mean percentage change for file #2 |
| .0002, mean change in value (/m) for file #2 |
| 0, write out list of profile numbers (1=yes; 0=no) |
| 0, write to a log file (1=yes; 0=no) |
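To make the layout of the sample input file concrete, the sketch below reads the same lines into a dictionary. The parser, its function name, and the dictionary keys are hypothetical; the actual program reads this file through its own FORTRAN input statements, whose details are not given here.

```python
def parse_thinning_input(text):
    """Parse an input file laid out like the sample in Table 2."""
    # Keep only the token before the first comma; the remainder is a comment.
    tokens = [line.split(",", 1)[0].strip()
              for line in text.splitlines() if line.strip()]
    it = iter(tokens)
    n_files = int(next(it))
    cfg = {
        "files": [next(it) for _ in range(n_files)],
        "max_nmi": float(next(it)),
        "max_hours": float(next(it)),
        "max_depth_pct": float(next(it)),
    }
    # One (percentage change, absolute mean change) pair per file, in file order.
    cfg["thresholds"] = [(float(next(it)), float(next(it))) for _ in range(n_files)]
    cfg["write_profile_list"] = next(it) == "1"
    cfg["write_log"] = next(it) == "1"
    return cfg


sample = """2, #files to decimate/create; the file names are:
bb470.mat
bb532.mat
5, difference in nmi to keep
1.5, difference in hours to keep
30, difference in percent depth to keep
15, mean percentage change for file #1
.0002, mean change in value (/m) for file #1
15, mean percentage change for file #2
.0002, mean change in value (/m) for file #2
0, write out list of profile numbers (1=yes; 0=no)
0, write to a log file (1=yes; 0=no)
"""
cfg = parse_thinning_input(sample)
print(cfg["files"], cfg["max_nmi"], cfg["thresholds"][0])
```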
2.1 Preconditions for effective use of the thinning software
To achieve optimal results, several conditions should be met prior to using the
software. First, to avoid falsely triggering the percentage or mean change criteria,
the files used as input to this software should be “cleaned.”
Cleaning entails the removal of any significant data artifacts, sporadic biases,
errors, and noise spikes in the data. Large spikes caused by instrument malfunction,
or sudden changes in value due to, for example, light scattering off the ocean floor,
are likely to be interpreted by this software as real variability in data structure.
Note that the software’s inherent tendency to save such bad data is almost
impossible to change because the algorithm is designed to preserve variability, and
it has no way to discriminate between spurious and real changes. Thus, it is
important to remove any false spikes and significant errors in the data prior to
running them through this software. In some cases, one has to completely remove a
profile that is deemed bad. However, this removal will produce unwanted discrepancies
relative to the other parameters if the removal is not done for all the parameter
files. As discussed next, software has been written to remove these discrepancies to
ensure that the thinning algorithm still works properly when the cleansing/editing
process creates differences in the number of profiles across multiple variables.
A second precondition for running the software is to ensure that the files containing
the various parameters have identical sets of profiles. The reason is that the
software iterates sequentially through the individual data profiles within a given
data file, and the software does this process concurrently for multiple files. If any
one file has a missing profile, then it needs to be missing in all the parameter
files to avoid causing a mismatch in the profile being thinned. To force every data
file to have the same number of profiles, a FORTRAN program called MATCH2 was written
to sort through two files, find all the common profile “cast
numbers,” and then output two new files that match profiles identically. A
slower, but more general-purpose, Matlab program called “profile_intersect” was
also written. This function allows the user to enter any
number of input file names. The output is a set of files (with an extension of INT,
which stands for INTERSECTION) having identically matching sets of profiles.
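The matching step amounts to a set intersection over cast numbers. The sketch below is not the MATCH2 or profile_intersect code; it is a minimal illustration with a hypothetical function name, assuming each file is represented as a mapping from cast number to profile.

```python
def intersect_profiles(files):
    """files: {parameter: {cast_number: profile}}.
    Returns the same mapping restricted to cast numbers present in every file."""
    if not files:
        return {}
    common = set.intersection(*(set(profiles) for profiles in files.values()))
    return {name: {cast: profiles[cast] for cast in sorted(common)}
            for name, profiles in files.items()}


files = {
    "temperature": {1: [10.0, 9.5], 2: [10.1, 9.6], 3: [10.3, 9.9], 4: [10.2, 9.8]},
    # Cast 2 was removed from the chlorophyll file during cleaning.
    "chlorophyll": {1: [0.5, 0.4], 3: [0.6, 0.5], 4: [0.7, 0.5]},
}
matched = intersect_profiles(files)
print(sorted(matched["temperature"]), sorted(matched["chlorophyll"]))  # [1, 3, 4] [1, 3, 4]
```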
3. Results
To date, this algorithm has been used on many sets of over-sampled multi-variate data,
each time with positive results.
Figures 1 and 2 show how this algorithm significantly
reduces the size of a dataset without compromising the overall structure of the data.
This particular dataset was reduced by 72% but maintained its overall shape. It is
important to note that this relatively small dataset (179 profiles) was used as a
test, and that, generally, this algorithm would be most useful for files containing
thousands or even hundreds of thousands of profiles.
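The quoted reduction follows directly from the profile counts given in the captions of Figs. 1 and 2 (179 profiles before thinning, 50 after):

```latex
\[
\frac{179 - 50}{179} = \frac{129}{179} \approx 0.72
\]
```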
Fig. 1. Dataset prior to thinning: 179 total absorption (“AT”)
profiles.
Fig. 2. Dataset after thinning: 50 total absorption (“AT”) profiles.
While the above figures are certainly evidence of this algorithm’s ability to
maintain data structure, they do not demonstrate the capability of thinning based on the
combined effects of structural (oceanic) changes plus elapsed time plus geographical
location. The following map (Fig. 3) is the
result of running the algorithm on a complicated dataset of 4,967 temperature, salinity,
beam attenuation coefficient (c660 nm), yellow matter fluorescence, and chlorophyll
fluorescence profiles taken off the coast of Australia. (These SeaSoar data were
provided by Frank Bahr at WHOI.) When plotted on this scale, the original
(“un-thinned”) profile locations merge into what appears to be a single
blue line (they are actually discrete asterisks so close together that they become
indistinguishable). After being thinned for time, space, depth extent, and data
structure, the thinned data occur at only the red asterisks. As expected, the thinned
data are sampled less frequently in the deeper waters than in the waters closer to the
continental shelf (indicated by the 200-m depth contour).
Fig. 3. Map of thinned data (red) vs. un-thinned data (blue) off the coast of
Australia.
The corresponding sets of 1,269 thinned profiles have been examined, and they provide a
representative subset of the original data. For example, Fig. 4 shows the chlorophyll
fluorescence profiles before and
after the thinning algorithm has been applied, and it is clear that the algorithm tracks
the structural variations that occur along the track of the sensor system.
Fig. 4. Chlorophyll fluorescence depth profiles of thinned data (red) vs. un-thinned data
(black) off the coast of Australia (locations are shown in Fig. 3).
By examining the diagnostic outputs from the program, one finds that, in this example,
most of the spacing in the deeper water is due to the software’s capability to
thin based on elapsed time and distance. The sampling gets closer in space and time in
the shallower waters where more structural (and depth extent) variations tend to
occur.
The examples shown in Figs. 1 through 4 demonstrate how the data-thinning software works
qualitatively. To quantitatively ensure that the code was working correctly, several
simple test cases were run in which a single criterion was configured to trigger every
saved profile. For
example, a test file was created that had many copies of a single profile, but
successive copies were truncated in discrete percentages of the original depths to test
the percentage depth change software. In another test, all the threshold criteria except
distance were set to the equivalent of infinite values to ensure they would not cause a
profile to be saved. A test dataset with a geographical spacing of exactly 1.0 nmi was
created and then run through the code. The output was correctly thinned to the expected
5-nmi interval. For more details about the quantitative testing performed to date, see
page 12 of Barrett and Smart [8. K. Barrett and J. Smart, “Sub-sampling software for
environmental profile data,” The Johns Hopkins University APL Internal Memorandum
STF-06-095, 30 June 2006].
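The distance-only test case can be reproduced in miniature. The sketch below simplifies the geometry to one-dimensional along-track positions, and the function name is hypothetical. It also shows that a 5-nmi output spacing from 1.0-nmi input requires an inclusive comparison; a strict "exceeds" test would first trigger at 6 nmi. Which comparison the software uses is not stated, so both are shown.

```python
def thin_by_distance(positions_nmi, min_spacing, inclusive=True):
    """Keep a profile when its along-track distance from the last kept profile
    reaches (inclusive) or strictly exceeds (not inclusive) `min_spacing`."""
    if not positions_nmi:
        return []
    kept = [0]
    for i in range(1, len(positions_nmi)):
        gap = positions_nmi[i] - positions_nmi[kept[-1]]
        triggered = (gap >= min_spacing) if inclusive else (gap > min_spacing)
        if triggered:
            kept.append(i)
    return kept


# Synthetic track with profiles exactly 1.0 nmi apart, from 0 to 30 nmi.
positions = [float(k) for k in range(31)]
print(thin_by_distance(positions, 5.0, inclusive=True))   # [0, 5, 10, 15, 20, 25, 30]
print(thin_by_distance(positions, 5.0, inclusive=False))  # [0, 6, 12, 18, 24, 30]
```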
4. Discussion and recommendations
For certain specialized applications, such as an assessment of internal wave activity,
data thinning should not be applied. However, for the purpose of archiving a
representative sample in a relational database (i.e., our specific application), some
kind of thinning is virtually essential. One might argue that our change thresholds
(Table 2 and Appendix A) are too arbitrary
and that the thinning criteria should be computed based on an initial analysis of the
degree of variability present in the original data. This argument definitely has some
merit, and the fact is that we can and do sometimes adjust the thinning criteria to
account for the degree of variability in the original data. However, we do not automate
that process because of the following danger: if we apply a purely statistics-based
approach to a dataset with almost no variability, then we will create very small change
thresholds. Similarly, a dataset with a high degree of variability would lead to
unusually large change thresholds. We have avoided that problem by using
“subject matter expert” thresholds that can be kept uniform across
similar datasets and are (if anything) overly conservative (we err on the side of
setting the thresholds too low to avoid losing any meaningful variability). A
semi-automated approach for setting thresholds is discussed further in Appendix A.
Regardless of how change thresholds are obtained and applied, one must first ensure that
the data are free of noise spikes or other artifacts. Although data artifacts are common
in raw bio-optical data, there is still a significant lack of robust tools for cleaning
files and especially for removing noise spikes. Although we have developed a powerful
Matlab on-screen editor that allows the user to manually identify and remove spikes and
bad data using a mouse, we still lack a reliable automated spike editor, i.e., a tool
capable of scanning a file and intelligently removing spikes due to bad data while
ignoring genuine changes. We have asked numerous oceanographers and several commercial
companies if they have such a tool, and we have done web searches to discover what other
scientists are using for this purpose, but to date, we have not found a general-purpose
solution for this need. As the data-thinning software works best when noise spikes are
absent, a reliable automated spike editor would be an extremely beneficial
complement.
Finally, the software is currently implemented in FORTRAN. While this is a very fast
solution, it is also very limiting, as FORTRAN does not provide easy editing or
extension of the code. Ideally, in the future, this algorithm could be implemented in
Java, and could then be more easily expanded and shared. A reimplementation in Java
would not be as fast as the current FORTRAN version; however, the benefit of easy
extension probably outweighs this disadvantage.
The advent of autonomous data-gathering techniques, such as AUVs, has yielded a massive
influx of oceanographic data. This newfound wealth of data, while helpful for analysis
and research, creates a dilemma when sharing these data through internet databases. As a
result, a solution was needed to take a large amount of data and extract an accurate but
much smaller representation. The algorithm discussed above does this job quite well, but
there is still room for improvement. For example, adding a pattern recognition
capability and an associated change threshold look-up table would make the software
easier to use, and updating the software’s architecture/language (e.g., to Java)
would make the code easier to extend to other applications and also more portable. As
large datasets become increasingly prevalent, solutions like this one will become
increasingly important to the scientific community and especially to those charged with
archiving their results. Finally, the task of thinning over-sampled data will be
significantly aided by the development of a capable suite of data-cleansing/editing
tools.
Appendix A
Equation (2) defined the method used to
test for a significant absolute change in a given parameter. The table below provides
the recommended values for “yref_i” used in that equation.
Table 3. Recommended Minimal Change in Value by Parameter
| Parameter Name | Minimal Change in Value |
|---|---|
| Absorption | 0.05/m |
| Backscatter | 0.0002/m |
| Beam Attenuation | 0.2/m |
| Chlorophyll | 0.2 mg/m³ |
| Diffuse Attenuation | 0.05/m |
| Relative Fluorescence | 0.25 (relative units; values nominally 0 to 5) |
| Relative Turbidity | 0.25 (relative units; values nominally 0 to 5) |
| Salinity | 0.2 ppt |
| Temperature | 0.5 °C |
The current software requires the user to manually provide such threshold values for
every file processed. We are considering the following improvement: give the algorithm a
memory that spans multiple uses; this memory would record threshold values versus
results and would rank these pairs by accuracy and user preference. This table of ranked
threshold-result pairs would then be available to the software as a reference for
“recommended” or default threshold values. Instead of the user manually
trying several threshold values before settling on an optimal set, the software itself
would be able to look up “good” threshold values determined from
previous usage. In this way, extended use of the software would result in faster, better
results; the software would “learn” the unique patterns and properties
of each parameter and be able to thin them more efficiently. However, such learning
should probably be associated with regions and seasons that are known to have similar
properties. Research needs to be done to determine how to specify a set of properties
defining similarity in such a supervised learning system.
The absolute change criteria provided in this Appendix are based on years of experience
working with this particular kind of data, and the values also reflect the known
accuracy of measurement systems. Nevertheless, a more objective approach would be to
first screen each variable for its inherent variability within the unthinned data, and
to provide statistics (such as the mean and standard deviation, the minimum and maximum
values, the 5th and 95th percentile values, etc.) to the user. This information could then
be combined with the expert’s knowledge about such data to make a
better-informed decision as to the change criteria for each parameter.
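A screening step of this kind could be as simple as the sketch below, which summarizes one variable from the unthinned data using Python's standard `statistics` module. The function name and the choice of the "inclusive" quantile method are assumptions; the summary is meant to inform an expert's choice of threshold, not to set it automatically.

```python
import statistics


def summarize(values):
    """Summary statistics for one variable across the unthinned data."""
    # n=20 yields 19 cut points at 5 %, 10 %, ..., 95 %.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "p05": cuts[0],
        "p95": cuts[-1],
    }


# Illustrative values only: 101 evenly spaced samples from 0 to 100.
values = [float(k) for k in range(101)]
stats = summarize(values)
print(stats["mean"], stats["min"], stats["max"], stats["p05"], stats["p95"])
# 50.0 0.0 100.0 5.0 95.0
```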