Recipe to Enable President Obama's Initiatives

 

President Obama's Initiatives to transform government to be more transparent, participatory and collaborative will impact the culture, processes and perception of the government and the way citizens interact with it.  These initiatives will also rely on and have an impact on the technologies that support government.  This page will outline a Recipe to Enable President Obama's Initiatives with respect to the technologies and architectures required.

Ingredients for the recipe

The requirements for this recipe are derived from President Obama's Initiatives as they relate to technology and architectural support.  There are also substantial requirements for cultural and policy changes that we will not address here, but that will be supported by these technologies and architectures.  Culture, in particular, can be affected by the resources available and the infrastructures for collaboration and information sharing.  Business architectures reflect goals and policy and can initiate cultural changes where those architectures become embodied in the processes, technologies and services of our organizations.  These ingredients enable the administration's initiatives.

The Data Cloud

The Internet is great for human interaction with people or information sources.  What is still developing is the Internet as a "Data Cloud" - information that can be found, understood, queried, linked and re-purposed.  It is this kind of data cloud that would support President Obama's Initiatives.  What would such a cloud look like?


Imagine that whatever any person or organization knows and wants to share can be easily published to the data cloud.  Once in the cloud every published fact can be referenced, commented on, analyzed and linked to.  Facts can be "mashed up" with other facts and presented in user-friendly ways.  Information isn't just accessible through the web page and limited search mechanisms on the publisher's web site - data can be "mashed up", analyzed and compared.  Others can use that information, add to it, comment on it and evolve it - with controls and security.  We don't know what a data set may be useful for - by publishing the data we allow "a thousand flowers to bloom" that consume, analyze and present that information.  This is visibility!

Information in the cloud doesn't have to be limited to tables of facts and figures - it can contain plans, architectures, designs - everything from the business process the OMB may use to approve a budget to detailed technology data structures.  It could contain fishery data and nuclear safety statistics.  Some information may be "raw", while other information could be analyzed and digested.  

A data cloud is a necessary enabler of the administration's objectives - it will serve as the gateway to government visibility, the portal for participation and the basis for collaboration.  The difference between the data cloud and what we have now is crucial - now we have web pages showing us data; what we need is access to the data itself, with a way to use it and evolve it.

The data cloud is not a new idea - standards already exist for Open Linked Data and the Semantic Web as defined by the W3C.  These standards make an excellent basis for the data cloud - they are already Internet based and standardized, provide for open query and federation of data and allow the semantics of the data to be well defined.  Commercial and open source technologies already exist to implement these standards and these can be used to make the data cloud a reality.  What has not existed is a way to bring these diverse technologies together into a solution for visibility, participation and collaboration.  We need a "platform" for the data cloud that is open, immediately usable and pervasive.

Trust 

Imagine there is some information of interest in the data cloud - can you trust it?  Who did it come from and how; do you know it is the real thing?  Trust in information, particularly information on the Internet, is not a simple matter.  Just because some fact is "on the Internet" does not make it true - the question is whether you trust it, and why.  Personally, if it is a question of units of measurement, I trust the U.S. National Institute of Standards and Technology (NIST) quite a bit - they are both competent and have no reason to deceive.  So I trust information about units of measurement from NIST.  When I am doing analysis, creating data, researching information or mashing up data I should be able to register my trust in this NIST data and it should be recorded that anything I produce depends on what they say.

On the other hand, I may not trust some terrorist organization about U.S. policy - but that doesn't make what they say uninteresting, just not trusted.  These examples demonstrate a basic principle of the data cloud - it will contain information of all kinds and it is up to the consumer to decide what they trust and what to leverage.  It is not expected that all data will agree or even be consistent - the data in the cloud is essentially the opinion of the publisher.  So my trust in the data is derived from my trust in the provider.  Some applications of the data cloud may only use trusted data whereas others may specialize in analyzing how various sources of data agree or conflict.

So the data, raw and analyzed, in the data cloud will form a "trust network" - we should always be able to point to the source of our data and understand who and what we are trusting.  Who says something is just as important as what is said.  It is the job of technology supporting the data cloud to make sure the data really comes from who we think it does and that it has not been compromised along the way.
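The trust network described above can be sketched in a few lines of Python.  This is only an illustration under our own assumptions: the Fact record, the publisher names and the rule that a derived fact is trusted only when every publisher behind it is trusted are hypothetical choices, not part of any existing platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A published statement plus who said it and what it was derived from."""
    subject: str
    predicate: str
    value: str
    publisher: str            # who asserts this fact
    derived_from: tuple = ()  # other Fact objects this one depends on


def sources(fact):
    """Return every publisher this fact ultimately depends on."""
    found = {fact.publisher}
    for parent in fact.derived_from:
        found |= sources(parent)
    return found


def is_trusted(fact, trusted_publishers):
    """Assumed rule: trusted only if all publishers behind the fact are trusted."""
    return sources(fact) <= set(trusted_publishers)


# Hypothetical example data.
inch = Fact("inch", "lengthInMetres", "0.0254", publisher="NIST")
report = Fact("pipe-42", "lengthInMetres", "0.3048", publisher="me",
              derived_from=(inch,))
claim = Fact("policy-x", "summary", "some statement", publisher="unvetted-source")

my_trust = {"me", "NIST"}
print(sorted(sources(report)))       # the publishers the report depends on
print(is_trusted(report, my_trust))
print(is_trusted(claim, my_trust))
```

Note that the untrusted claim is still stored and can still be examined; trust only decides what a given consumer chooses to build on.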

Security and Identity

To trust the data we have to trust the source and know that it is the "real deal".  If NIST publishes something, we want to know that it is really the NIST data - not a copycat.  This implies that both the publishers of data and the data itself need a strong "identity"; that is, we need to be able to know who or what it is and not confuse it with anything else.  Security is then important to make sure that the data has not been tampered with and also that, if the data is restricted, it is only available to those with access rights.

It is the responsibility of the data cloud's technology platform to provide for identity and security, particularly for data publishers.
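As a rough illustration of tamper detection, the sketch below attaches a keyed digest to published data using Python's standard hmac module.  It is a simplification under stated assumptions: the key and the data are hypothetical, and a shared secret key stands in for the public-key signatures that a real platform would more likely use to establish a publisher's identity.

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> str:
    """Compute a keyed SHA-256 digest that travels with the published data."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """Check that the data still matches the digest the publisher attached."""
    return hmac.compare_digest(sign(data, key), tag)


# Hypothetical key and data; a shared secret is an assumption of this sketch.
publisher_key = b"hypothetical-secret-held-by-the-publisher"
published = b"inch lengthInMetres 0.0254"
tag = sign(published, publisher_key)

print(verify(published, publisher_key, tag))                    # unchanged data
print(verify(b"inch lengthInMetres 0.03", publisher_key, tag))  # altered data
```

Access control for restricted data is a separate concern that this sketch does not address.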

Shared Concept Hubs

A nemesis of sharing data has always been the diversity of terms, symbols and structure used in the data - the vocabulary of the data itself.  There are three basic approaches to this problem:

  • on the one end every data source is different - it has its own terms and structure.  If you want some other vocabulary or structure you have to convert it, but creating these converters is expensive and time consuming, and it doesn't lend itself to federation and a global solution.
  • on the other end of this spectrum you have the universal data model - every term and structure is controlled and normalized.  The problem here is that it is nearly impossible to get agreement on these terms and structures; the debates are never ending.
  • a modern approach is to use "ontologies" where the meaning of each term is very well defined and can be mapped by technology.  As exciting as this is, it has proved very difficult to define the semantics of everything and the technologies to match terms are in an early state - getting agreement on meaning is just as hard as getting agreement on terms.

We propose a middle ground - shared concept "hubs".  A hub is just a trusted source of common information, one that defines a vocabulary of terms and concepts.  Concepts in a hub are then shared among the data sets that want to leverage it.  Publishers are then able to trust one or more of these hubs and relate their terms (if they are different) to the hub terms.  So it is up to the publisher of information to accept a hub and there can be more than one.  The vocabulary within a hub is controlled by the hub's publisher, but that is acceptable since they "own" that hub and (we expect) any one hub covers some domain where that publisher has some credibility.  By being able to "ground" data in these hubs and allowing for multiple hubs, we will produce a "marketplace" of trusted interoperability points - those that are the most trusted will grow and become established, and thus link the data grounded in those hubs.  Shared concept hubs become the metadata of the data cloud.

Since hubs are just data they can also be grounded - hubs can be grounded in other hubs, making a network of trusted vocabularies.  Hubs can also use ontologies to define their concepts - so that as semantics becomes more and more practical we will have more automatic ways to ground our data and our hubs.

It is the job of the data cloud's platform to provide for trusted hubs and "grounding" published data sets in hubs.

The data in the cloud itself would have minimal structure - it would be the job of the platform to convert information in the hub to the vocabulary and structure that makes sense to a user.
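To make "grounding" more concrete, here is a minimal Python sketch.  Everything in it is an assumption made for illustration: the vocabulary names, the terms and the idea of storing groundings as a simple lookup table are hypothetical, not a description of how the platform stores them.

```python
# Each entry grounds a (vocabulary, term) pair in a term of another vocabulary.
# All names below are hypothetical.
GROUNDINGS = {
    ("agency-a", "zip"): ("postal-hub", "postalCode"),
    ("agency-b", "postcode"): ("postal-hub", "postalCode"),
    # Hubs are just data, so a hub term can itself be grounded in another hub.
    ("postal-hub", "postalCode"): ("address-hub", "postalCode"),
}


def ground(vocabulary, term, groundings=GROUNDINGS):
    """Follow groundings until reaching a term that is not grounded further."""
    seen = set()
    current = (vocabulary, term)
    while current in groundings:
        if current in seen:
            raise ValueError(f"circular grounding at {current}")
        seen.add(current)
        current = groundings[current]
    return current


def same_concept(a, b):
    """Two terms interoperate if they ground out in the same hub concept."""
    return ground(*a) == ground(*b)


print(ground("agency-a", "zip"))
print(same_concept(("agency-a", "zip"), ("agency-b", "postcode")))
print(same_concept(("agency-a", "zip"), ("agency-a", "city")))
```

Neither agency had to agree on a term with the other; each only had to relate its own term to a hub it trusts, which is the point of the middle ground.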

Support for Multiple Data Formats

There should be a standard format and programmatic interface (SOA based API) for interacting with the cloud - one that supports a wide variety of information.  Our current suggestion is that the Resource Description Framework (RDF) standard from the W3C be used.  This format works well on the Internet and can provide for information federation, semantic grounding and cross-Internet query.  However, there is a lot of data in other formats and there are reasons to want to be able to get data in other formats.  The data cloud's platform technology should be able to transform information in these other formats to and from the standard format of the cloud.  The Model Driven Architecture (MDA) standards of the OMG provide a framework for multiple data formats - the OMG has defined models for many of these formats and provides mechanisms for transformation between them.
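As one small example of transforming another format into RDF, the sketch below turns rows of CSV into N-Triples, a line-based RDF syntax, using only the Python standard library.  The namespace, column names and sample figures are hypothetical, and the sketch assumes that row identifiers and column names are safe to use inside IRIs; it is not a full converter and says nothing about how MDA-based transformations would be defined.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace for this sketch


def escape(text):
    """Escape the characters that N-Triples requires escaping in literals."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text, id_column):
    """Emit one triple per non-identifier cell: <row> <column> "value" ."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"
        for column, value in row.items():
            if column != id_column:
                lines.append(f'{subject} <{BASE}{column}> "{escape(value)}" .')
    return "\n".join(lines)


# Hypothetical sample data.
sample = "station,species,count\nS1,cod,120\nS2,haddock,45\n"
triples = csv_to_ntriples(sample, "station")
print(triples)
```

Every value here becomes a plain string literal; a fuller transformation would add datatypes and ground the column names in a shared concept hub.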

Information, Architectures and Vocabularies

There are three kinds of data that are crucial for the data cloud:

  • Information - this is the kind of "raw data" we are used to - tables, facts and figures, etc.
  • Architectures - business and technology architectures define the services, processes, information and rules about our organizations and systems.  Architectures also define the structure of technology resources and data.
  • Vocabularies define the common terms and concepts that are used in information and architectures, enabling us to understand the data in the cloud.

While the kinds of data in the cloud are intended to be very open, the core platform should provide built-in capabilities for raw data, architectures and vocabularies.

The Data Cloud Platform - GAIN

The data cloud for the administration's initiatives needs a platform - one that can be used easily by government agencies, departments, companies and individuals to publish information into the cloud, have a participatory process for maintaining it and open capabilities to utilize that information.  The Global Architecture & Information Network (GAIN) is an initiative to produce the platform for the data cloud as an open source and pervasive infrastructure.  It is anticipated that commercial and open source technologies will then be able to "plug into" the cloud for enhanced capabilities.  Please see http://www.GAINInitiative.net