Data Cloud | Global Architecture & Information Network Initiative | www.GAINInitiative.net
The Internet is great for human interaction - with people and with information sources. What is still developing is the Internet as a "Data Cloud" - information that can be found, understood, queried, linked and re-purposed. It is this kind of data cloud that would support President Obama's initiatives. What would such a cloud look like?
Imagine that anything a person or organization knows and wants to share can be easily published to the data cloud. Once in the cloud, every published fact can be referenced, commented on, analyzed and linked to. Facts can be "mashed up" with other facts and presented in user-friendly ways. Information isn't just accessible through a web page and the limited search mechanisms on the publisher's web site - the data itself can be analyzed and compared. Others can use that information, add to it, comment on it and evolve it - with controls and security. We don't know in advance what a data set may be useful for - by publishing the data we allow "a thousand flowers to bloom" in applications that consume, analyze and present that information. This is visibility!
Information in the cloud doesn't have to be limited to tables of facts and figures - it can contain plans, architectures, designs - everything from the business process the OMB may use to approve a budget to detailed technology data structures. It could contain fishery data and nuclear safety statistics. Some information may be "raw", while other information could be analyzed and digested.
A data cloud is a necessary enabler of the administration's objectives - it will serve as the gateway to government visibility, the portal for participation and the basis for collaboration. The difference between the data cloud and what we have now is crucial - today we have web pages showing us data; we need access to the data itself, with a way to use it and evolve it.
The data cloud is not a new idea - standards already exist for Linked Open Data and the Semantic Web as defined by the W3C. These standards make an excellent basis for the data cloud - they are already Internet based and standardized, provide for open query and federation of data, and allow the semantics of the data to be well defined. Commercial and open source technologies already exist to implement these standards and can be used to make the data cloud a reality. What has not existed is a way to bring these diverse technologies together into a solution for visibility, participation and collaboration. We need a "platform" for the data cloud that is open, immediately usable and pervasive.
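The core idea behind Linked Open Data is simple: every fact is a subject-predicate-object triple, and every term can be a web identifier that others can reference and link to. The sketch below illustrates this with a minimal pure-Python triple store and pattern query; all URIs and data values are hypothetical examples, not real endpoints.

```python
# A minimal sketch of the Linked Data idea: facts as
# (subject, predicate, object) triples. All URIs are illustrative.

triples = [
    ("http://example.gov/data/fishery42", "catchTons", "1200"),
    ("http://example.gov/data/fishery42", "publishedBy", "http://example.gov/agency/NOAA"),
    ("http://example.gov/data/fishery42", "seeAlso", "http://example.org/other/fishery42"),
]

def query(s=None, p=None, o=None):
    """Return every triple matching the pattern (None acts as a wildcard)."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Anyone can reference the fishery data set and follow its links.
print(query(s="http://example.gov/data/fishery42", p="publishedBy"))
```

Because subjects and predicates are global identifiers rather than local column names, any other data set can link to them - which is what makes "mashing up" data across publishers possible.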
Imagine there is some information of interest in the data cloud - can you trust it? Who did it come from, and how do you know it is the real thing? Trust in information, particularly information on the Internet, is not a simple matter. Just because some fact is "on the Internet" does not make it true - the question is: do you trust it, and why? Personally, if it is a question of units of measurement, I trust the U.S. National Institute of Standards and Technology (NIST) quite a bit - they are both competent and have no reason to deceive. So I trust information about units of measurement from NIST. When I am doing analysis, creating data, researching information or mashing up data, I should be able to register my trust in this NIST data, and it should be recorded that anything I produce depends on what they say.
On the other hand, I may not trust some terrorist organization about U.S. policy - but that doesn't make what they say uninteresting, just not trusted. These examples demonstrate a basic principle of the data cloud - it will contain information of all kinds, and it is up to the consumer to decide what they trust and what to leverage. It is not expected that all data will agree or even be consistent - the data in the cloud is essentially the opinion of the publisher. So my trust in the data is derived from my trust in the provider. Some applications of the data cloud may use only trusted data, whereas others may specialize in analyzing how various sources of data agree or conflict.
So the data, raw and analyzed, in the data cloud will form a "trust network" - we should always be able to point to the source of our data and understand who and what we are trusting. Who says something is just as important as what is said. It is the job of the technology supporting the data cloud to make sure the data really comes from who we think it does and that it has not been compromised along the way.
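The trust network described above can be sketched as a simple provenance record: every fact names its publisher, derived facts name what they depend on, and a consumer can walk the chain to see exactly whose word a result ultimately rests on. The fact names and publishers below are hypothetical illustrations.

```python
# A sketch of a provenance "trust network": each fact records its source,
# and derived facts record their dependencies, so any result can be
# traced back to the publishers being trusted. All names are illustrative.

facts = {}  # fact id -> {"source": publisher, "depends_on": [fact ids]}

def publish(fact_id, source, depends_on=()):
    facts[fact_id] = {"source": source, "depends_on": list(depends_on)}

def trust_chain(fact_id):
    """Return every publisher this fact ultimately depends on."""
    sources = {facts[fact_id]["source"]}
    for dep in facts[fact_id]["depends_on"]:
        sources |= trust_chain(dep)
    return sources

publish("meter-definition", "NIST")
publish("bridge-span-data", "DOT")
publish("bridge-analysis", "analyst",
        depends_on=["meter-definition", "bridge-span-data"])

print(trust_chain("bridge-analysis"))  # the analyst plus NIST and DOT
```

A consumer who distrusts any publisher in the chain knows to discount the derived fact - which is exactly the "who says it" principle made mechanical.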
Security and Identity
To trust the data we have to trust the source and know that it is the "real deal". If NIST publishes something, we want to know that it is really the NIST data - not a copycat. This implies that both the publishers of data and the data itself need a strong "identity"; that is, we need to be able to know who or what it is and not confuse it with anything else. Security is then important to make sure that the data has not been tampered with and also that, if the data is restricted, it is only available to those with access rights.
It is the responsibility of the data cloud's technology platform to provide for identity and security, particularly for data publishers.
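One standard way a platform can provide this is by having publishers sign their data, so consumers can detect both impersonation and tampering. The sketch below uses an HMAC from Python's standard library for brevity; a real deployment would use public-key signatures so that consumers need no shared secret. The key and data values are purely illustrative.

```python
# A sketch of integrity checking: the publisher signs data with a key,
# and a consumer verifies the signature before trusting the data.
# HMAC is used here only for brevity; production systems would use
# public-key signatures. Key and data values are illustrative.
import hashlib
import hmac

def sign(data: bytes, key: bytes) -> str:
    return hmac.new(key, data, hashlib.sha256).hexdigest()

def verify(data: bytes, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign(data, key), signature)

publisher_key = b"publisher-secret"       # held by the data publisher
data = b"1 inch = 25.4 mm"
sig = sign(data, publisher_key)

print(verify(data, sig, publisher_key))              # genuine data: True
print(verify(b"1 inch = 25 mm", sig, publisher_key)) # tampered copy: False
```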
Shared Concept Hubs
A nemesis of sharing data has always been the diversity of terms, symbols and structure used in the data - the vocabulary of the data itself. There are three basic approaches to this problem:
- at one end of the spectrum, every data source is different - it has its own terms and structure. If you want some other vocabulary or structure you have to convert it, but creating these converters is expensive and time consuming, and the approach doesn't lend itself to federation or a global solution.
- at the other end of the spectrum is the universal data model - every term and structure is controlled and normalized. The problem here is that it is nearly impossible to get agreement on these terms and structures; the debates are never ending.
- a modern approach is to use "ontologies", where the meaning of each term is very well defined and can be mapped by technology. As exciting as this is, it has proved very difficult to define the semantics of everything, and the technologies for matching terms are in an early state - getting agreement on meaning is just as hard as getting agreement on terms.
We propose a middle ground - shared concept "hubs". A hub is simply a trusted source of common information, one that defines a vocabulary of terms and concepts. Concepts in a hub are shared among the data sets that want to leverage it. Publishers are able to trust one or more of these hubs and relate their own terms (if they are different) to the hub terms. So it is up to the publisher of information to accept a hub, and there can be more than one. The vocabulary within a hub is controlled by the hub's publisher, and we expect any one hub to cover a domain in which that publisher has some credibility. By being able to "ground" data in these hubs and allowing for multiple hubs, we produce a "marketplace" of trusted interoperability points - those that are most trusted will grow and become established, linking the data grounded in them. Shared concept hubs become the metadata of the data cloud.
Since hubs are just data they can also be grounded - hubs can be grounded in other hubs, making a network of trusted vocabularies. Hubs can also use ontologies to define their concepts, so that as semantic technology becomes more practical we will have more automatic ways to ground our data and our hubs.
It is the job of the data cloud's platform to provide for trusted hubs and "grounding" published data sets in hubs.
The data in the cloud itself would have minimal structure - it would be the job of the platform, using the hubs, to present information in the vocabulary and structure that make sense to each user.
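Grounding can be sketched as a simple mapping exercise: each publisher relates its local field names to concepts a hub defines, and two data sets grounded in the same hub become directly comparable even though their local terms differ. The hub concepts and agency mappings below are hypothetical illustrations, not real vocabularies.

```python
# A sketch of "grounding" publisher vocabularies in a shared concept hub.
# Two publishers use different local terms but map them to the same hub
# concept, making their data interoperable. All names are hypothetical.

units_hub = {"metre", "kilogram", "second"}   # concepts the hub defines

# Each publisher relates its own terms to hub concepts.
agency_a_terms = {"m": "metre", "kg": "kilogram"}
agency_b_terms = {"meter": "metre", "sec": "second"}

def ground(term, mapping, hub):
    """Resolve a local term to a hub concept, or None if ungrounded."""
    concept = mapping.get(term)
    return concept if concept in hub else None

# Differently named fields turn out to mean the same hub concept.
print(ground("m", agency_a_terms, units_hub))      # metre
print(ground("meter", agency_b_terms, units_hub))  # metre
```

Note that neither agency had to adopt the other's terms, and neither had to wait for a universal data model - agreement is needed only on the hub each chooses to trust.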
Support for Multiple Data Formats
There should be a standard format and programmatic interface (SOA based API) for interacting with the cloud - one that supports a wide variety of information. Our current suggestion is that the Resource Description Framework (RDF) standard from the W3C be used. This format works well on the Internet and provides for information federation, semantic grounding and cross-Internet query. However, there is a lot of data in other formats, and there are reasons to want to get data out in other formats as well. The data cloud's platform technology should be able to transform information in these other formats to and from the standard format of the cloud. The Model Driven Architecture (MDA) standards of the OMG provide a framework for multiple data formats - the OMG has defined models for many of these formats and provides mechanisms for transformation between them.
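The kind of transformation intended here can be sketched for the most common case, tabular data: each row becomes a subject and each column a predicate, yielding triples in the cloud's standard shape. The column names, base URI and mapping convention below are illustrative assumptions, not a standard.

```python
# A sketch of lifting tabular data into triple form: each row becomes a
# subject, each column a predicate. The base URI and column names are
# hypothetical; real platforms would use a declared mapping.
import csv
import io

raw = "station,catch_tons\nA1,1200\nB7,340\n"
base = "http://example.gov/fisheries/"

triples = []
for row in csv.DictReader(io.StringIO(raw)):
    subject = base + row["station"]          # the row's identifier
    for column, value in row.items():
        if column != "station":              # other columns become predicates
            triples.append((subject, base + column, value))

print(triples)
# [('http://example.gov/fisheries/A1', 'http://example.gov/fisheries/catch_tons', '1200'),
#  ('http://example.gov/fisheries/B7', 'http://example.gov/fisheries/catch_tons', '340')]
```

The reverse transformation - selecting triples back out into a table - is equally mechanical, which is what lets the platform serve the same data in whatever format a consumer prefers.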
Information, Architectures and Vocabularies
There are three kinds of data that are crucial for the data cloud:
- Information - this is the kind of "raw data" we are used to - tables, facts and figures, etc.
- Architectures - business and technology architectures define the services, processes, information and rules about our organizations and systems. Architectures also define the structure of technology resources and data
- Vocabularies define the common terms and concepts that are used in information and architectures, enabling us to understand the data in the cloud.
While the kinds of data in the cloud are intended to be very open, the core platform should provide built-in capabilities for raw data, architectures and vocabularies.
The Data Cloud Platform - GAIN
The data cloud for the administration's initiatives needs a platform - one that can be used easily by government agencies, departments, companies and individuals to publish information into the cloud, with a participatory process for maintaining it and open capabilities for utilizing that information. The Global Architecture & Information Network (GAIN) is an initiative to produce the platform for the data cloud as an open source and pervasive infrastructure. It is anticipated that commercial and open source technologies will then be able to "plug into" the cloud for enhanced capabilities. Please see http://www.GAINInitiative.net