Re: Beyond another cloud: data service discovery for NDSLabs


It's one of the tools, in an ecosystem of distributed tools, called by Polyglot.  A small Software Server script provides the interface to it:

https://opensource.ncsa.illinois.edu/stash/projects/POL/repos/polyglot/browse/scripts/sh/daffodil_convert_csv_xml.sh
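
Roughly, and only as a sketch, the wrapper does something like the following (shown in Python rather than shell for illustration; the linked script is the authoritative version, and the schema path below is a made-up placeholder):

    # Illustration of what a Daffodil conversion wrapper might do; the
    # real wrapper is the shell script linked above.  The schema file
    # name here is an invented placeholder.
    import subprocess
    import sys

    def csv_to_xml(input_csv, output_xml, schema="csv.dfdl.xsd"):
        """Parse a CSV file into an XML infoset via the Daffodil CLI."""
        subprocess.check_call([
            "daffodil", "parse",
            "-s", schema,      # DFDL schema describing the CSV layout
            "-o", output_xml,  # resulting XML infoset is written here
            input_csv,
        ])

    if __name__ == "__main__":
        csv_to_xml(sys.argv[1], sys.argv[2])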

The script takes care of letting Polyglot know when to use the tool and which schema to apply.  The DFDL implementation, Daffodil:

https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL

is then exposed by a Software Server, running on one or more machines or VMs alongside Daffodil, as a resource to a Polyglot Steward, which coordinates its use along with the other tools made available to it. The video at this time point shows it being used to pull image pixel values out of a file as XML:


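From the client side none of this machinery is visible; a request simply goes to the Software Server (or to the Steward fronting several of them) over HTTP. A minimal sketch of such a call, assuming a made-up host and URL layout (not the actual Software Server REST API, which is documented with Polyglot):

    # Hypothetical client call to a Software Server; the host, port, and
    # route are invented for illustration and are not the actual API.
    import requests

    SERVER = "http://softwareserver.example.org:8182"  # assumed host/port

    def convert(path, output_format="xml"):
        """Ask the Software Server to convert a file; return the
        location of the result (assumed to come back as plain text)."""
        with open(path, "rb") as f:
            r = requests.post("%s/convert/%s" % (SERVER, output_format),
                              files={"file": f})  # assumed route
        r.raise_for_status()
        return r.text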
We plan to use it as a means of accessing and preserving information stored within ad hoc file formats (e.g., text files with arbitrary layouts that students and labs create to store their data).  Some additional details can be found on the Brown Dog wiki here:


Kenton McHenry, Ph.D.
Senior Research Scientist, Adjunct Assistant Professor of Computer Science
Image and Spatial Data Analysis Division (http://isda.ncsa.illinois.edu)
National Center for Supercomputing Applications

On Nov 6, 2014, at 1:54 PM, Arthur Smith <apsmith@xxxxxxx> wrote:

Interesting, and certainly partly analogous (mime types ~ data types) - I don't see how you are using DFDL there, though; can you describe that a bit more, or point to a reference on how it's all put together? There always seems to be a lot to learn in these discussions!

   Arthur

On 11/6/14, 1:52 PM, McHenry, Kenton Guadron wrote:
Hi Arthur,

Would something like this be along some of those lines?:


The format portion is based on two tools, Polyglot:


and DFDL:

https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl

Kenton McHenry, Ph.D.
Senior Research Scientist, Adjunct Assistant Professor of Computer Science
Image and Spatial Data Analysis Division (http://isda.ncsa.illinois.edu)
National Center for Supercomputing Applications

On Nov 6, 2014, at 9:43 AM, Arthur Smith <apsmith@xxxxxxx> wrote:

On 11/5/14, 11:43 AM, Matthew Turk wrote:
[...]  I'd rather we *not* provide an answer to this from the perspective of the infrastructure within which applications can run, but instead determine a matchmaking system for data.
Wow. That's making a ton of sense - and resonates completely with that email from the RDA datatype registry group (did they really just send that out earlier this week?) It looks to me like the "About" and "Scope" pages at http://typeregistry.org were updated recently too, along these lines?


What do you think about combining service discovery with the Datatype Registry for matchmaking applications to data? I'd rather we supply the ability for applications to fail than try to cover every possible aspect of their success. As a concrete example, imagine that applications get spawned and register themselves as working with a given datatype; data gets inserted into the system and, either during a tiling step or as part of ingestion, is identified as fitting a given datatype from the DTR.  When the data is selected to be acted upon, the available services would be returned.  In addition to this, we could provide standard services as well -- generic Python, R[Studio], shell, etc. data manipulation methods.
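
To make that flow concrete, here is a toy, in-memory sketch of the register/identify/match loop; all names and identifiers are invented, and a real deployment would back this with the DTR and a shared discovery service rather than a dict:

    # Toy version of the matchmaking flow described above.
    services_by_datatype = {}  # datatype id -> registered service endpoints

    def register_service(datatype, endpoint):
        """Called by an application when it spawns, declaring the
        datatype it works with."""
        services_by_datatype.setdefault(datatype, []).append(endpoint)

    def identify_datatype(path):
        """Stand-in for identification at ingestion time; a real system
        would match against datatypes registered in the DTR."""
        if path.endswith(".csv"):
            return "dtr:tabular-csv"  # hypothetical DTR identifier
        raise ValueError("no matching datatype for %s" % path)

    def available_services(path):
        """When data is selected to be acted upon, return the matching
        services, falling back to the generic manipulation tools."""
        datatype = identify_datatype(path)
        return services_by_datatype.get(
            datatype, ["generic-python", "generic-r", "generic-shell"])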

So I'm imagining something similar to the way mime types and associated applications are registered with web browsers right now. For each content type, I as a user have a default application to open it in (if I've seen that type before), but also other options available that I can select from or change to. Perhaps the DTR could be, at root, an extension of the content-type system? Except we're imagining the data handling applications registered not with the local web browser but through some sort of online discovery service. But I might want to add some more local ones of my own (like the Python, R, etc. examples). Still not entirely clear to me how this ought to work, but it seems like there should be a way to get there.
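
In code, that browser-style arrangement is little more than a per-type record of one default plus alternatives, where some alternatives are the user's own local additions; the names below are invented:

    # Per-datatype handler table in the style of browser mime-type
    # preferences; "local:" entries are the user's own additions.
    handlers = {
        "dtr:tabular-csv": {
            "default": "http://discovery.example.org/apps/csv-viewer",
            "alternatives": ["local:python-pandas", "local:rstudio"],
        },
    }

    def open_with(datatype, choice=None):
        """Return the user's chosen handler if valid, else the default."""
        entry = handlers[datatype]
        if choice in entry["alternatives"]:
            return choice
        return entry["default"]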

I really like this idea, and I think it blends very well with what RDA is trying to come up with. So to drill down from this idea - what are the technology components we need?
* A DTR is one piece (perhaps organized a bit differently from the RDA example as it stands).
* Some kind of discovery service to link applications with data types they support.
* Another service to link datasets or particular portions of datasets (individual files?) to data types (or some way to represent that in the dataset metadata?). I'm thinking something based on the Open Annotation model, perhaps - see the sketch after this list.
* An interoperability layer that can link up a dataset with a default application, either online or local, or present a list of options, through the above services
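
For the third component, an Open Annotation-style link from one file in a dataset to a DTR datatype might look roughly like this (written as a Python dict for consistency with the sketches above; the URIs are invented placeholders):

    # Rough Open Annotation-shaped record classifying a file as a
    # datatype; the target and body URIs are placeholders.
    annotation = {
        "@context": "http://www.w3.org/ns/oa.jsonld",
        "@type": "oa:Annotation",
        "motivation": "oa:classifying",
        "target": "http://repo.example.org/dataset42/run1.csv",  # the file
        "body": "http://typeregistry.org/types/tabular-csv",     # the datatype
    }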

This doesn't sound too overwhelming... Are there other pieces needed?

  Arthur




