Matt,
 thanks for such a thoughtful response. I certainly don't know
exactly what will work myself, but focusing on answers to
specific data-handling questions sounds to me like a more
fruitful path.
 To explain a bit where I'm coming from, I've recently been
involved in the FORCE11 data citation implementation group (DCIG):
https://www.force11.org/datacitationimplementation
and in particular with those looking at what it means for a
dataset identifier to be "machine-actionable", as the "joint
declaration" puts it in principle #4:
"A data citation should include a persistent method for
identification that is machine actionable, globally unique, and
widely used by a community."
This really speaks to your third question, "where can
selections of data be obtained", but maybe it feeds into the
others as well.
The question that came up with DCIG was, suppose you have an
identifier for a dataset. What can an automated system reliably do
from that point? Suppose the user goal is to load some selection
of the data into an application - what would it take to do that
with the least amount of further user interaction if you just
start from the identifier?
One consensus seemed to be that the identifier (a DOI, say) should
resolve to a landing page that provides (for appropriate queries)
some form of metadata about the dataset. There's an in-progress
Google spreadsheet the group has been working on that compares
the various dataset metadata standards out there:
https://drive.google.com/folderview?id=0B-3fjDTO3dDaRlJWSzZFYlJUZTg&usp=sharing
So there is some hope that metadata of this sort can be expected
to be available after resolving a dataset identifier.
Machine actionability then, to me, could imply the following
(a rough code sketch of the first couple of steps follows the list):
* An application (an interoperability layer, not necessarily the
final application for analysis) accepts an identifier (e.g. a DOI)
from some sort of user interaction, for example following a citation.
* The application resolves the DOI and receives metadata.
* The application could display title, date, size, format, etc. to
the user and ask for confirmation.
* The application may download data from the "file location" and
unpack it (into multiple files - a zipped "bagit" format was
suggested, or maybe OAI-ORE).
* But then what? It needs to know something more about the data to
do anything useful with it. If there is some way for the data files
to be linked to something like the RDA data type registry, then
maybe at least some of it could be pulled directly into the
application. Otherwise further user interaction would be needed to
select specific files, columns, etc. - but that's making it hard
for the user again.
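To make the first couple of steps concrete, here's a rough sketch,
assuming the identifier is a DataCite-style DOI and that content
negotiation against doi.org hands back JSON metadata; the example
DOI and the particular metadata fields I print are just my
assumptions for illustration:

    # Rough sketch: resolve a dataset DOI to metadata via content negotiation.
    # Assumes a DataCite-style DOI and that doi.org returns JSON metadata when
    # asked for it; the DOI and field names below are illustrative only.
    import requests

    def fetch_dataset_metadata(doi):
        resp = requests.get(
            "https://doi.org/" + doi,
            headers={"Accept": "application/vnd.datacite.datacite+json"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    def describe(metadata):
        # Show the user enough to confirm this is the dataset they expected.
        for key in ("titles", "publicationYear", "sizes", "formats"):
            print(key, ":", metadata.get(key))

    if __name__ == "__main__":
        describe(fetch_dataset_metadata("10.5061/dryad.example"))  # hypothetical DOI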
This problem has essentially been solved for some communities, like
the IVOA, with things like the VOTable standard:
http://www.ivoa.net/documents/VOTable/20130920/REC-VOTable-1.3-20130920.html
(along with a lot of other components). Maybe it would be helpful
to start from there (or another example) and generalize?
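For what it's worth, the tooling on the VO side already makes the
last mile pretty painless; as a sketch (assuming astropy is
installed and the file name is whatever you actually downloaded),
loading a VOTable into an analysis application is about this much
work:

    # Sketch: load a VOTable file with astropy; the file name is hypothetical.
    from astropy.io.votable import parse_single_table

    table = parse_single_table("catalog.votable.xml")
    data = table.array  # numpy structured (masked) array of the table rows
    first_column = table.fields[0].name
    print(first_column, data[first_column][:5])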
  Arthur
(PS for everybody except Matt, if this isn't appropriate for this
general mailing list please let me know!)
On 11/1/14, 2:06 PM, Matthew Turk wrote:
Hi Arthur,
Sorry it's taken me a few days to reply, but I've been
pondering your email for a while and trying to formulate a
response to it.
I think perhaps what I've been trying to get at, in thinking
about NDS Labs and trying to spur the conversation, is to figure
out the best possible way to foster interoperability --
specifically, what can we do, now, to create an environment in
which to explore and experiment. I originally thought we could do
this by providing:
* Communication mechanisms
* Gradual growth and incubation of components that are
connectable
* Simple start, complex end
Perhaps, though, approaching this from "service discovery" is
the wrong way. Adding more indirection, reimplementing things
that have been done before -- both are tricky, like you point
out, and probably are best to avoid for the time being.
Service discovery in general may simply be too *big* a
sandbox for NDSLabs, where individual projects and instances
will likely number in the handful, not the hundreds, and where
the N^2 process of developing interoperability is going to be
relatively small. If we can standardize generic k-v pairs but
can't manage service discovery without them, we probably are
doing something wrong! :)
So let's try what you suggested, and hammer down on what the
specific, difficult technical things are that we want to do, and
then figure out how to implement them.
Here are the things that, for a "next gen epiphyte", I know I
would want:
* What are the possible applications I can send input data to
* Where can the resultant data be sent
* Where can selections of data be obtained
I'd also like to provide as much as possible in the simplest
technology available -- which means that OAI will be a goal, but
not the only conduit for data.
For epiphyte, we're also taking the tack that (for now) the
data is all in the form of files on disk. This won't work
forever, but it will for the time being. If we were going to
take those three "services," I think I would want to know a
host/port, and some format for the transmittance of data --
perhaps even just REST posting, or a URI to a
filesystem/filesystem-like thing. If we had that information it
might be enough to provide some degree of interop at that level,
with a basis of understanding we can build more complex ideas on
top of later.
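Just to make that concrete, I'm imagining something as small as
the sketch below; the descriptor fields, the host, and the /data
endpoint are made-up names, purely for illustration:

    # Purely illustrative: what "send input data to an application" could look
    # like if each service advertised just a host/port plus an accepted format
    # and took a plain REST POST. All names here are invented for the sketch.
    import requests

    service = {"host": "10.0.0.12", "port": 8080, "accepts": "text/csv"}

    def send_data(service, payload):
        url = "http://{host}:{port}/data".format(**service)
        resp = requests.post(url, data=payload,
                             headers={"Content-Type": service["accepts"]})
        resp.raise_for_status()
        return resp

    with open("selection.csv", "rb") as f:  # hypothetical selection of data
        send_data(service, f)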
Does that help to focus the ideas more? Or is this still
complexity in search of need?
-Matt
On Thu Oct 30 2014 at 2:17:34 PM Arthur Smith <apsmith@xxxxxxx> wrote:
That does sound interesting. However, it also reminds
me of RFC 1925:
http://tools.ietf.org/html/rfc1925
in particular "6a - It is always possible to add another
level of indirection. " and perhaps #11 as well... Lots of
wisdom in the old IETF...
I really liked your talk about what you'd done with
Epiphyte - in particular making hard things easy. Very
impressive work. Is there some way to organize this by
starting from the "hard" use cases NDS labs is trying to
address, and drill down to the technology components
really needed to make that happen? Discovery does seem
likely to be a good part of it, but if it's based on
key-value pairs (for example), how does the user know what
keys to query, and who sets the standards for those keys and
the meanings of the corresponding values? That's aside from
knowing where exactly the etcd server (or whatever is doing
that work) is. There's got to be some base starting point -
a system that knows enough to help the user do things - can
we work from there?
On 10/30/14, 11:42 AM, Matthew Turk wrote:
Hi all,
In the other thread, Arthur brought up that we don't
want "just another cloud infrastructure," which I think
was really apt, and something that deserves thought for
any NDS Labs project. So I wanted to start a couple
topics about what can be provided on top of a standard
cloud infrastructure that might be of use.
I'm wondering about discovering data services within
a region, where that region is either some subnet on a
cloud provider, or even more globally across locations.
If we are thinking about interoperability of services,
then there are probably a few verbs that could be
identified as being necessary. If we can have services
identify themselves as providing verb endpoints, that
could provide an environment for testing interop.
Kacper and I have been experimenting with this
ourselves, mostly looking at the various service
discovery mechanisms that operate on docker containers
being orchestrated across machines. Some of these do
this via introspection, and some will even set up
automatic (nginx) reverse proxies for docker containers
running inside a system. Right now it looks like etcd
is a pretty good solution for this, as it allows key/value
pairs to be stored and is itself discoverable.
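As a minimal sketch of what I mean - assuming etcd's v2 HTTP
keys API on its default port, and a key layout under /services/
that is just an example convention - registering and discovering
a service could be as simple as:

    # Sketch: register and look up a service as a key/value pair in etcd,
    # assuming the v2 HTTP keys API on localhost:4001 (2379 on newer builds).
    # The /services/ key convention and the value fields are just examples.
    import json
    import requests

    ETCD = "http://127.0.0.1:4001/v2/keys"

    def register(name, host, port, verbs):
        value = json.dumps({"host": host, "port": port, "verbs": verbs})
        resp = requests.put(ETCD + "/services/" + name,
                            data={"value": value, "ttl": 60})
        resp.raise_for_status()

    def discover(name):
        resp = requests.get(ETCD + "/services/" + name)
        resp.raise_for_status()
        return json.loads(resp.json()["node"]["value"])

    register("epiphyte-ingest", "10.0.0.12", 8080, ["accept-data"])
    print(discover("epiphyte-ingest"))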
I think having a discussion about what we want
services to be able to do is perhaps a much bigger
topic, but I wonder if this type of thing --
particularly etcd -- would be useful to any projects,
and would be a good avenue for service discovery and
interop. Is there something else that would be better?
-Matt