If managing the growing volume of data in your organisation seems like a challenge, spare a thought for NASA, where hundreds of terabytes of incoming data has to be processed on a daily basis. NASA's Jet Propulsion Laboratory (JPL) in Pasadena, California is one place where the challenge of storing, processing and accessing data on such a scale is being wrestled with.
Storing, indexing and processing digital records from dozens of ongoing missions is critical to allowing engineers and scientists to better understand Earth and the universe beyond. This is data gathering and analysis on a universal scale.
The reality for information-age-based organizations is that our success is throttled by our ability to rapidly and comprehensively navigate the big data universe.
To understand just how much data is streaming in, consider the Square Kilometre Array (SKA), a planned array of thousands of telescopes in South Africa and Australia designed to scan the skies for radio waves coming from the earliest galaxies known. It is estimated to generate a colossal 700 terabytes of data every day. For comparison, this is the entire size of Facebook's Graph Search posts index for its 1.11 billion users. Amazon, with over 59 million active customers, has about 42 terabytes of data, while YouTube holds at least 45 terabytes of video.
The other data behemoth here would be Google, which, as of 2008, claimed to be processing 20,000 terabytes (20 petabytes) of data a day. Of course, the 700 terabytes is for a single NASA project in a single day. The number for the agency as a whole is likely to be immense, given that one centre alone, the Center for Climate Simulation, housed 32 petabytes of data in 2012, with a total capacity of 37 petabytes.
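To put those daily totals into more familiar terms, the short sketch below converts them into the average rate a system would have to sustain. It assumes decimal units (1 TB = 1,000 GB) and a perfectly even flow over 24 hours, neither of which is stated in the figures themselves; the function name is purely illustrative.

```python
# Back-of-the-envelope conversion of daily data volumes into sustained rates.
# Assumptions: decimal units (1 TB = 1,000 GB) and an even 24-hour flow.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
GB_PER_TB = 1_000


def sustained_gb_per_second(terabytes_per_day: float) -> float:
    """Average rate in GB/s needed to keep up with a given daily volume."""
    return terabytes_per_day * GB_PER_TB / SECONDS_PER_DAY


ska_rate = sustained_gb_per_second(700)        # SKA estimate quoted above
google_rate = sustained_gb_per_second(20_000)  # Google's 2008 figure quoted above

print(f"SKA:    {ska_rate:.1f} GB/s")     # SKA:    8.1 GB/s
print(f"Google: {google_rate:.1f} GB/s")  # Google: 231.5 GB/s
```

Under these assumptions, the SKA estimate works out to a little over 8 gigabytes arriving every second, around the clock.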
The Five 'V's
In facing the big data challenge, Tom Soderstrom, IT chief technology officer at JPL, defines a big data problem as having five or more of the following Vs:
High Volume: A significantly large amount of incoming data.
Rapid Velocity: Data arriving at high speed or at short intervals.
Large Variety: The need to combine different data structures.
High Viscosity: Difficult data discovery and manipulation methods (detecting, extracting and combining).
Significant Value: Must build a business case which determines problems to take on first.
Soderstrom mentions that in building a crisp case for business value and return on investment (ROI), the final V is key to prioritising and solving big data problems.
Stressing the importance of combining diverse data sets, he also points to what NASA hopes to achieve with its big data strategy: "If we could effectively combine data from various sources -- such as oceans data with ozone data with hurricane data -- we could detect new science without needing to build new instruments or launch new spacecraft."
Open Source Information Architecture
Instead of purchasing software or building it from scratch to process these torrents of data, NASA is frequently turning to existing open source tools and modifying the code to fit its requirements.
NASA has used open source to address project and mission needs, to accelerate software development, and to maximize public awareness and impact of our research.
Some of the software projects that JPL has openly released back into the community via the Open Channel Foundation also provide an indicator of the challenges that are being addressed. 'JadeDisplay', for example, is a package that uses Java Advanced Imaging for loading, computing and displaying image tiles for images that are frequently over 2 GB in size. This spares the user from waiting for the entire image to load while trying to scroll via the GUI (Graphical User Interface).
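The idea behind tile-based display can be illustrated with a small sketch. This is not JadeDisplay's code (which is Java and built on Java Advanced Imaging); it is a hypothetical Python function, with made-up image and tile sizes, that works out which tiles overlap the part of the image currently on screen so that only those need to be loaded.

```python
import math


def visible_tiles(view_x, view_y, view_w, view_h, tile_size, image_w, image_h):
    """Return (col, row) indices of the tiles that overlap a viewport.

    All values are in image pixels. The viewport is clipped to the image,
    and an empty list is returned if it lies entirely outside the image.
    """
    # Clip the viewport rectangle to the image bounds.
    x0, y0 = max(0, view_x), max(0, view_y)
    x1 = min(image_w, view_x + view_w)
    y1 = min(image_h, view_y + view_h)
    if x0 >= x1 or y0 >= y1:
        return []

    # Integer division maps a pixel coordinate to the tile containing it.
    first_col, first_row = x0 // tile_size, y0 // tile_size
    last_col, last_row = (x1 - 1) // tile_size, (y1 - 1) // tile_size

    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# Hypothetical example: a 40,000 x 30,000 pixel image cut into 256-pixel
# tiles, viewed through a 1024 x 768 window positioned at (1000, 2000).
tiles = visible_tiles(1000, 2000, 1024, 768, 256, 40_000, 30_000)
total_tiles = math.ceil(40_000 / 256) * math.ceil(30_000 / 256)

print(len(tiles), "of", total_tiles, "tiles needed")  # 20 of 18526 tiles needed
```

In this illustrative case only 20 of more than 18,000 tiles have to be read to fill the window, which is why scrolling can stay responsive even when the full image is several gigabytes.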
We can't just pick up a commercial off-the-shelf system, plug it in and we're good to go.
Among other packages, there is 'Pomegranate', a Python application that 'webifies' science data files, while also simplifying the retrieval of metadata and data from files. This essentially addresses the processing and accessing components of the data management strategy.
As a presentation by Soderstrom made clear in 2012, these are all part of an effort to "work with anyone, from anywhere, with any data, using any device at any time".
Analysis, Visualisation and Distribution
While the most complex problems faced by NASA are handled by its Linux-based supercomputers, the main priority is to allow scientists and engineers to make their own choices on the analysis of raw data. Here, once again, free and open source software tools play a key role in providing much of the analytical horsepower.
Traditionally, users would query a metadata server and order their data from the data centre. The data centre would then fulfil this request by preparing the data (subsetting, resampling, averaging, etc.) and placing it on an FTP server. After receiving a notification that the data was ready, the user could proceed to download, process and analyse it locally. Some of the processed data was usually uploaded back to the data centre as well.
NASA is now seeking to move away from this slow and cumbersome process, particularly given the sizes of data involved. Part of this strategy involves moving the storage and large-scale processing of datasets onto the open-source Apache Hadoop framework and streamlining processes for the user. In order to minimise the movement of data, this system will also allow some analytical processes to be carried out remotely, close to where the data is produced and stored.
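The principle of running analysis close to the data, rather than shipping raw files to the user, can be sketched in a few lines. The example below is plain Python, not the Hadoop API, and the readings and node layout are invented for illustration. Each storage node reduces its own data to a tiny summary (a "map" step), and only those summaries travel over the network to be merged (a "reduce" step).

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Small summary a storage node returns instead of its raw records."""
    total: float = 0.0
    count: int = 0


def summarise_locally(readings):
    """'Map' step: runs next to the data and reduces it to one Partial."""
    partial = Partial()
    for value in readings:
        partial.total += value
        partial.count += 1
    return partial


def combine(partials):
    """'Reduce' step: merges the small summaries into a global mean."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return total / count if count else float("nan")


# Hypothetical readings held on three separate storage nodes.
node_data = [
    [14.0, 15.0, 16.0],
    [10.0, 20.0],
    [15.0],
]

# Each node summarises its own chunk; only two numbers per node are moved.
partials = [summarise_locally(chunk) for chunk in node_data]
global_mean = combine(partials)

print(global_mean)  # 15.0
```

However large each node's chunk grows, the amount of data sent back stays the same: one total and one count per node.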
The importance of visualisation in meeting user requirements is evident both in NASA's software requirements and, separately, in the Center for Climate Simulation's 'Visualization Wall'. Its 16 Linux-based servers split images across a 17-by-6-foot wall, creating one huge, high-resolution medium on which scientists can display still images, video and animated content from data.
As well as providing data for its own scientists and engineers, part of NASA's big data approach ties in with its open data strategy. In exploring innovative approaches to extremely large datasets, NASA encourages users (internally and externally) to utilise raw datasets in new ways to perform analysis, experiments, and learning. This big data culture recognises the immense value in crowd-sourced intelligence and avoiding unnecessary restrictions on information dissemination.
Two examples demonstrating the wider end result of NASA's public-side data archiving and distribution efforts are the Atmospheric Science Data Center (ASDC) and the Planetary Data System (PDS). With a data archive of more than 3 petabytes, the ASDC specialises in data important to understanding the causes and processes of global climate change and the consequences of human activities on the climate. Focusing more on NASA planetary missions, the PDS provides a 100 terabyte peer-reviewed and well-documented system of online catalogues that are organised by planetary disciplines.
NASA's approach to big data provides a number of valuable lessons for all organisations seeking to survive and succeed in the information age:
Define the problem: NASA reached a point at which they understood that their current data strategies would be insufficient for meeting their goals. In building a case for big data, a strong value and ROI case had to be made. Each organisation will have its own unique set of problems that big data can potentially address. As such, value and ROI cases will have to be accordingly adjusted.
Start simple: Aim for the low hanging big data fruit first. As Soderstrom puts it, "These are problems that have significant business impact if solved. An end user, who is facilitated by the data scientist, articulates these business problems. They are short enough that they can be prototyped and demonstrated within a three-month period and with a low budget." This learn-by-doing approach also ensures that you are better able to handle further big data projects.
Understand and meet user requirements: A feature such as visualisations might play an important role at NASA, but each organisation will have its own needs based on the type of data being analysed and what helps users to better do their job. Instead of an inefficient and costly reactive approach, ensure that you pro-actively analyse all such needs in advance. Here, people who can bridge the user-IT divide will be especially valuable.
Bring the right people on board: Data professionals capable of modernising information systems and approaches to data, covering everything from implementation to analysis and security, should be brought together to lead your big data strategy. Furthermore, existing personnel should be convinced of the benefits that this strategy will provide.
Build an open data culture: Avoid placing unnecessary barriers in the way of the free flow of internal information, both upstream and downstream. By providing easy access to data, you place more eyes on problems while also allowing others to learn and carry out their own analyses. This collaborative approach taps collective intelligence and significantly contributes to organisational learning.
Leverage the open source advantage: While you might be able to find proprietary out-of-the-box solutions, open source software can provide a far greater degree of flexibility in terms of customisation. Flexibility, interoperability and the ability to bring together diverse data sets are significant open source advantages for a big data strategy.
As the volume of data and opportunities for analysis continue to grow, developing new approaches to understanding, analysing, visualising and disseminating data will be of crucial importance to all organisations. This is now the reality for the information age organisation, and whether companies choose to address it or not, this data revolution will continue to forge full steam ahead, either carrying its occupants onwards to success or running over those who stubbornly resist.
Photos: NASA APPEL, MATEUS_27:24&25 and NASA Goddard Space Flight Center
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.