AUTOMATION OF GEOSPATIAL RASTER DATA ANALYSIS AND METADATA UPDATING: AN IN-DATABASE APPROACH

This paper proposes a spatial data infrastructure (SDI) module for the management of a continuous flow of geospatial images and related metadata. Examples of such flows are continuously acquired map scans from the digitization of an old map collection, or satellite imagery retrieved through a receiving station. Storage of the raster data in a database is a key feature of the system, which enhances the usual tasks and usability of SDI systems. The analytical procedures deployed within the data store perform automated raster analysis and content-based metadata extraction. This functionality is illustrated with two experiments: improving the display of early map scans, and snow and cloud detection in satellite images. Applications of the proposed approach and the utilization of the prototype application by geographers and cartographers are discussed.


Introduction
At the Faculty of Science of Charles University in Prague, a huge amount of descriptive, statistical and geometric spatial data is used for research and education purposes. A Spatial Data Infrastructure (SDI) implements a complex framework of technology, geographic data, metadata and users in order to use spatial data in an efficient way. Recently, the amount of spatial raster data has grown significantly due to a new satellite imagery receiving station and the progress of the digitization of the old map collection. The increased pressure on human and technological resources revealed how unsuitable the current approaches and tools designed for vector data are for raster images, which are much larger in data volume and more variable in storage formats. This places higher demands on the management and administration of the data store mechanism.

Related work
Data and metadata records are two crucial components of any SDI. Metadata enable data discovery and access for users and provide information about the purpose, currency and accuracy of spatial data sets (Olfat 2013). However, the manual creation, updating and authoring of image metadata are considered monotonous, time-consuming and labor-intensive tasks (Trilles 2012). That is why challenges arise regarding metadata collection, storage, updating and integration in metadata catalogues (Batcheller 2009; Grill 2009; Olfat 2013).
Therefore, there is a need for an automated administration of collected raster data and related metadata that is more efficient than current archiving approaches. The key idea for increasing the effectiveness of existing SDI solutions is the application of analytical tools for raster data integrated within the archiving system. This shift of application logic is made possible by the newly introduced support for in-database storage of raster data by several database management system (DBMS) vendors (PostgreSQL 2013; Oracle 2013).
The effectiveness of this approach is multiplied by having all available data in one place within the SDI, enabling the extraction of regions from the images or intersection analyses using available vector data. It also performs the vital functions that make spatial data interoperable, i.e., capable of being shared between systems.
To increase the effectiveness of searching for desired raster data, content-based metadata of such images are needed. Their creation and retrieval fill the gap between low-level information that can be processed by computers and high-level semantic information understandable and applicable by humans (Akrivas 2007; Zhang 2012). The application of analytical tools in a raster processing line can automate the generation of such metadata or image annotations.
To prove the functionality of the proposed solution and to demonstrate its potential usage, two case studies are presented in this paper.
The Floreo (Demonstration of ESA Environments in support to FLOod Risk Earth Observation monitoring) research project and the existence of the receiving station for satellite data at the Faculty of Science were the motivation for the implementation of a cloud and snow detection procedure. The receiving station continuously provides AVHRR/NOAA images. A variety of snow and cloud detection methods using Advanced Very High Resolution Radiometer (AVHRR) data has already been reported. While it is relatively uncomplicated to separate snow-free land from snow-covered land using spectral characteristics, it is no easy task to discriminate between snow and clouds (Höppner 2002). Numerous works address the extraction of snow or clouds from NOAA/AVHRR data (Allen 1990; Gesell 1989; Saunders 1986; Simpson 1998; Voigt 1999).
The TEMAP (Technology for discovering of map collections) research project (TEMAP 2014) aims at applying advancements in geospatial web technologies to facilitate access to early maps for the end-user. The map collection of the Faculty of Science, Charles University in Prague contains tens of thousands of maps, with more than 35,000 already digitized and catalogued. As an example of similar initiatives, the David Rumsey Map Collection (2014) can be mentioned. It contains more than 150,000 maps, of which 42,000 are digitized and georeferenced.
Solutions being developed within the TEMAP project adapt and further extend the latest technologies for searching and distributing digitized maps, such as MapRank (2013), which enables geographic searching by map location and coverage in Google Maps, or Georeferencer, designed for crowdsourced georeferencing of map collections (Fleet 2012).

SDI module for raster data management: implementation architecture
The initial work on the SDI solution was introduced by Hettler (2012) to provide means for the automatic management of continuously acquired raster data and metadata. The implementation architecture consists of several components, as depicted in Figure 1.
The administration layer provides an environment for the initialization and configuration of the solution for automatic raster data and metadata archiving and publishing. Technically, the administration layer is based on the Java application MtdtRasPub, which, in addition to controlling the raster data flow, assembles the metadata record for each raster image from all available sources (World Files, raster headers, bibliographic records). Metadata records follow the ISO 19115:2003 standard, the current best-practice standard defining the geospatial metadata format.
The storage layer, which includes the databases used to store data and metadata, is based on the PostgreSQL database platform. Within this layer, appropriate data structures for data and metadata are built in order to perform their automatic publication for the system's users.
The service layer manages the communication of the data store with the metadata catalogue and map server, employing for this purpose the GeoNetwork opensource (GeoNetwork 2013) metadata catalogue and the GeoServer (GeoServer 2013) map server. The web-based graphical user interface presents the data and metadata to users.
This SDI module for raster data administration provides automated raster data storage and distribution via the web user interface, together with all available metadata. However, to use such a solution for analytical image processing, the data must first be transferred from the data store to an external application. Thus, to fully exploit the advantage of storing raster data in the database, an extension of this architecture is needed. Consequently, the next section presents the shift of image processing from an external application onto the storage layer and discusses the requirements for the deployment of image processing functionality within the DBMS. Furthermore, case studies of custom-made image analysis functionality are presented, followed by a discussion on the potential usage of the solution by a broad geographic community.

In-database image analysis approach
The current usual practice is based on out-of-the-database storage of raster data. In a relational database, only the metadata describing the image are kept. For any processing or analysis, the raster data must first be transferred to separate processing and analytical software applications, as suggested in Figure 2. The analytical task can result in raster editing. In such a case, a new raster representation needs to be transferred back to the data store, causing extra data transfer overhead.
The advantage of keeping the data out of the database, i.e., as binary large objects, appears when the publication of stored raster data from the database system (such as map service publishing) is the only objective of an application. The first performance evaluation results presented in Hettler (2012) show that publication from the native PostGIS raster format is slightly slower than from the alternative binary raster storage, whose use, however, precludes the employment of analytical tools.
In-database storage. The in-database strategy (Xie 2013; Obe 2011) employed by the solution for geospatial images proposed in this paper has several features that enhance the storage and analysis of big geospatial images.
The first feature is moving the image processing closer to the data to avoid moving large data sets from the databases to detached analytical software. The second feature is parallel processing provided by the database for the in-database raster format storage. The third feature is concurrent processing, which enables leveraging the power of computer clusters to process numerous images at once (Xie 2013). Figure 3 depicts the retrieval of the results of an image analysis performed on the database side.
In-database image processing functions. There are countless possible image analysis functions. Raster data processing and analysis involve a large set of operations, such as radiometric and geometric corrections, image transformation and mosaicking, image enhancement, pattern recognition and raster map algebra, to name a few (Gonzales 2006).
Database platforms with raster data support implement only core functions that are required for database management or that improve the effectiveness of data manipulation, such as image updates, processing or aggregation. These complement traditional GIS applications and can be reused by complex or custom-made analytical procedures deployed for a specific purpose, providing effective raster data manipulation for newly developed procedures.
In order to fully utilize the effectiveness of the in-database approach for analytical procedures over spatial raster data, the database platform is supposed to support the following key functionality:
- raster band accessors,
- raster pixel accessors and setters,
- raster band statistics,
- map algebra over individual pixels,
- spatial indexing,
- datum definition and coordinate system transformation.
This allows for basic analysis and, moreover, supports the development of custom-made functionality, provided a procedural language is available. With respect to the architecture presented above, PostgreSQL with the spatial extension PostGIS (PostGIS 2013) was chosen for the implementation of the proposed SDI enhancement. PostgreSQL offers this key functionality for raster data processing and also provides the PL/pgSQL procedural language.
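The role of these accessor and map-algebra primitives can be sketched outside the database. The following Python/NumPy fragment is only an illustration of the kind of operations involved; the function names `band`, `pixel`, `band_stats` and `map_algebra` are hypothetical stand-ins loosely analogous to the PostGIS raster functions, not their actual API:

```python
import numpy as np

# A toy 2-band "raster": shape (bands, rows, cols).
rast = np.array([[[10.0, 20.0], [30.0, 40.0]],
                 [[1.0, 2.0], [3.0, 4.0]]])

def band(rast, b):
    """Band accessor: return one band of the raster."""
    return rast[b]

def pixel(rast, b, row, col):
    """Pixel accessor: return a single pixel value of one band."""
    return rast[b, row, col]

def band_stats(rast, b):
    """Band statistics: (min, max, mean) of one band."""
    v = rast[b]
    return v.min(), v.max(), v.mean()

def map_algebra(rast, b, expr):
    """Map algebra: apply a per-pixel expression to one band."""
    return expr(rast[b])

print(pixel(rast, 0, 1, 1))                      # 40.0
print(band_stats(rast, 1))                       # (1.0, 4.0, 2.5)
print(map_algebra(rast, 0, lambda v: v * 2.0))   # band 0 doubled
```

In the actual system, these primitives are provided natively by the database, so custom PL/pgSQL procedures compose them without moving pixels out of the data store.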

Experiment
Ongoing research projects like TEMAP, which aims to develop technologies and procedures for discovering old map collections, or FLOREO, which is concerned with snow detection from satellite images, provided the motivation and data for the tests of the proposed solution. The continuous flow of very large raster data acquired from the satellite image receiving station and from the digitization of tens of thousands of old maps required the greatest possible reduction of raster data movements for processing and analysis.
This requirement was met by employing the in-database approach and developing specialized analytical functions within the data store. The implementation of this functionality is enabled through procedural languages. The PL/pgSQL procedural language of the PostgreSQL database platform was utilized for these purposes.

Cloud and snow detection
The algorithm for snow detection introduced within the FLOREO project was designed based on the past work of Romanov (2000). The cloud detection part is adopted from the AVHRR Processing over Clouds, Land and Ocean (APOLLO) scheme developed by Saunders (1986). The implementation of these procedures within the data store aims at retrieving basic information about the snow and cloud coverage in an image. This information is utilized for two purposes: first, the automatic identification of images appropriate for classification; second, the automatic creation of content-based metadata to increase the effectiveness of searches in the metadata catalogue.

Fig. 2 Raster analysis in detached data storage and data analysis applications.

Fig. 3 Raster analysis performed on raster data stored in-database.
NOAA-AVHRR data. NOAA is a polar satellite that circles at an altitude of approximately 850 km. The satellite scans each place on Earth at least twice a day, with increasing frequency at places closer to the poles. The Advanced Very High Resolution Radiometer (AVHRR) instrument on board NOAA has 5 (or 6) wavelength channels. The channels are optimized to measure cloud and surface characteristics with minimum contamination from other atmospheric constituents. The channel specifications are presented in Table 1.

Table 1 AVHRR channel specifications: channel number, resolution at nadir, and wavelength (μm).

The spectral signatures of snow and clouds can be very similar and depend on various environmental factors. Snow and cloud discrimination relies on threshold value estimates, such as the minimum possible surface temperature of snow compared to clouds, as determined by a histogram analysis of temperature and reflectivity. Threshold values are instrument specific.

Snow detection.
The snow detection procedure is implemented based on a series of tests adopted from the algorithm developed by Romanov (2000). An image pixel is identified as snow by a threshold method, which tests whether the signal in a channel or combination of channels corresponds to the spectral characteristics defined for a cloud-free atmosphere. The main daytime tests apply such thresholds to individual channels and channel combinations.
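The threshold logic can be sketched generically as follows. Note that the class codes and all threshold values below are illustrative placeholders only, not the calibrated, instrument-specific values of Romanov (2000) or APOLLO:

```python
import numpy as np

# Illustrative placeholder thresholds -- NOT calibrated AVHRR values.
REFL_BRIGHT_MIN = 0.30   # min visible reflectance for bright targets (assumed)
TEMP_SNOW_MAX = 275.0    # max brightness temperature [K] for snow (assumed)
TEMP_CLOUD_MAX = 265.0   # cloud tops assumed colder than this [K] (assumed)

LAND, SNOW, CLOUD = 0, 1, 2

def classify(refl_vis, temp_ir):
    """Classify each pixel as land, snow or cloud by simple thresholding."""
    cls = np.full(refl_vis.shape, LAND, dtype=np.uint8)
    bright = refl_vis > REFL_BRIGHT_MIN            # snow and clouds are bright
    cold = temp_ir < TEMP_CLOUD_MAX                # very cold tops -> cloud
    cls[bright & cold] = CLOUD
    cls[bright & ~cold & (temp_ir < TEMP_SNOW_MAX)] = SNOW
    return cls

refl = np.array([[0.10, 0.50], [0.60, 0.70]])      # visible reflectance
temp = np.array([[280.0, 270.0], [250.0, 280.0]])  # brightness temperature [K]
print(classify(refl, temp))  # classes: [[0 1], [2 0]]
```

The real tests also account for channel differences and environmental conditions; the sketch only shows how per-pixel threshold tests combine into a land/snow/cloud classification of the kind shown in Figure 4.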

Improving raster display
In image processing, normalization is an image enhancement technique that improves the contrast of an image by stretching the range of pixel intensity values into a desired range of values (Gonzales 2006).
The implementation of a procedure for improving raster display is motivated by the effort to increase the readability of early map scans. The original documents have faded due to the long period of time since their creation, the character of the materials used and the colouring techniques.
The PL/pgSQL procedure transforms an image I with intensity values in the range (Min, Max) into a new image I_N with intensity values in the range (newMin, newMax). The linear normalization is performed according to the formula:

I_N = (I - Min) * (newMax - newMin) / (Max - Min) + newMin

Figure 5 illustrates the outcome of the histogram stretching function. The implementation of such a procedure is straightforward and computationally efficient, owing to the optimized functions for raster data manipulation, editing, and computation of image statistics or histograms. These functions are natively provided by the PostgreSQL raster support and are reusable within other developed functions.
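As a minimal sketch of this linear stretch (in Python/NumPy rather than the actual PL/pgSQL procedure):

```python
import numpy as np

def normalize(img, new_min=0.0, new_max=255.0):
    """Linearly stretch pixel intensities into (new_min, new_max)."""
    old_min, old_max = img.min(), img.max()
    return (img - old_min) * (new_max - new_min) / (old_max - old_min) + new_min

# A faded scan occupying only a narrow intensity range:
scan = np.array([[100.0, 120.0], [140.0, 180.0]])
stretched = normalize(scan)
print(stretched.min(), stretched.max())  # 0.0 255.0
```

In the database, the same computation is expressed as a per-pixel map algebra operation, with Min and Max obtained from the band statistics functions.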

Metadata
The GeoNetwork opensource metadata catalogue is employed by the proposed solution to provide means of describing various types of geographic data (vector or raster layers, map services, statistical data).
The metadata document is formed within the administration application from available sources in accordance with the ISO 19115:2003 rules. Only a small portion of elements is used, following the GeoNetwork and INSPIRE (2014) recommendations on required or highly recommended elements to properly describe geographic data. The compliance of metadata with these rules is checked by GeoNetwork when metadata records are imported or updated. For this purpose, GeoNetwork's xml.metadata.insert service is employed. Three groups of metadata fields in a resulting metadata record can be identified based on the source of their origin.
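The assembly step can be sketched as follows; the element names here are a heavily simplified, hypothetical subset, whereas the real records produced by MtdtRasPub follow the full ISO 19115:2003 schema and are validated by GeoNetwork on import:

```python
import xml.etree.ElementTree as ET

def build_record(identifier, title, abstract):
    """Assemble a minimal metadata record (simplified, hypothetical
    element names; real records use the full ISO 19115 schema)."""
    root = ET.Element("MD_Metadata")
    ET.SubElement(root, "fileIdentifier").text = identifier
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "abstract").text = abstract
    return ET.tostring(root, encoding="unicode")

record = build_record("avhrr-scene-001", "NOAA AVHRR scene", "Cloud coverage: 25%")
print(record)
```

The resulting XML document is then submitted to the catalogue's insert service, which validates it against the schema before storing it.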
Available descriptive metadata. Depending on the data source, descriptive metadata can be retrieved from World Files, raster headers or bibliographic records. To carry this out automatically, the format of the source document must be known, e.g., a catalogued description of an old map in an XML document following the MARC 21 standard. The cataloguing procedure itself follows the methodology described in Novotná (2013).

System generated metadata. Metadata generated automatically by components of the system, such as the Online resource linkage (URL of the source disseminated by GeoServer) or Data quality info, belong to this category. It also includes metadata fields whose values are set in the system administration application by a person responsible for the dataset, such as Presentation form, Organization name, Role, and Maintenance and update frequency. Also, the values of fields like Abstract or Purpose can be determined from the objective of the specific satellite mission and automatically set by the MtdtRasPub administration application for all images of the dataset.
Metadata as an analysis product. Additional information acquired through raster analysis is encoded in the Supplemental Information field. This refers to the percentage of cloud and snow coverage in satellite images and also to the original and new minimum and maximum intensity values of old map scans. The unknown reference system or map scale of old maps can stand as another example. In this case, such information cannot be retrieved from the bibliographic record; a cartometric analysis of the old map scan, however, can provide estimates of these parameters (Bayer 2014).
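The coverage figures written into Supplemental Information can be obtained from a classified image by simple pixel counting, as this sketch illustrates (the class codes are assumptions carried over from the classification step):

```python
import numpy as np

LAND, SNOW, CLOUD = 0, 1, 2  # assumed class codes

def coverage_percentages(classified):
    """Return (snow %, cloud %) of pixels in a classified image."""
    total = classified.size
    snow = 100.0 * np.count_nonzero(classified == SNOW) / total
    cloud = 100.0 * np.count_nonzero(classified == CLOUD) / total
    return snow, cloud

classes = np.array([[LAND, SNOW], [CLOUD, SNOW]])
print(coverage_percentages(classes))  # (50.0, 25.0)
```

In the deployed system, this counting runs inside the data store over the classification result, and the percentages are then written into the metadata record.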
The assignment of metadata fields to the categories above is not strict and depends on the unique characteristics of the datasets. The update of existing metadata records is enabled by the system and carried out on an as-needed basis using the xml.metadata.insert service. A unique identifier prevents the creation of duplicate records. The updating of existing metadata is required to encode the analysis results.

Conclusions
An SDI module for the management of a continuous flow of raster data and related metadata was proposed. This module addresses the need for the automation of raster data archiving, analysis and distribution.
System evaluation. The functional parts of the prototype system have been developed in cooperation with the researchers and end-users of the provided services.
The developers of the metadata solution, along with map archivists and end-users, defined the fields from bibliographic records that would be relevant both for the public and for scientists working in the fields of geography and cartography. The selected metadata elements provided sufficient map description and search capabilities within the SDI system.
Also, the role of content-based image description proved to be key to the effective management of satellite imagery. Due to the huge data volumes regularly produced, older or unused images are moved to archives on backup media such as magnetic tape. The image description is then crucial for searching archived images, whose display is not available on-line.
The analysis of snow and cloud coverage demonstrated one way of creating content-based metadata. Romanov (2000) presented an evaluation of classifications for satellite-based snow products. The results of the method varied from 75% to 85% correct classification depending on environmental conditions. The comparison of results obtained by manual processing with the automatically acquired results fits into this range. Nevertheless, to improve the image search capabilities and to fully answer the needs of end-users, a more complex analysis of snow and cloud characteristics and a more detailed placement of such phenomena are necessary.
The presented experiments provided valuable results for the related research projects. The retrieved metadata enhanced the search capabilities and provided information about the suitability of an image for further analyses (like the classification of snow characteristics). The normalization of old map scans improved the readability of such documents. The main contribution, however, lies in proving the functionality of the in-database analysis approach, followed by automatic content-based metadata creation, which is a major open research problem not only for metadata catalogues or SDI systems but in other fields too.
Future development. The introduced approach provides many opportunities for geographers of various specializations to facilitate the retrieval of information from huge rasters, to share it and to effectively search for available data within the SDI.
The extension of the automatic processing of old map scans is an example. With a reasonable amount of effort, analytical procedures for determining the level of damage of a historical document, like that in Figure 5, can be implemented.
As another example, the enhancement of a georeferenced mosaic created by historical cartographers from early map series, such as the military survey (Molnár 2011), can be mentioned. Image histogram equalization would provide a unified appearance by removing differences in contrast between individual sheets caused by different materials or archiving approaches. More advanced procedures may deal with automatic map field extraction for such map series.
Another related example of potential future development is the extension of the proposed snow and cloud detection, aiming at the automation of snow and cloud typology classification or land use classification within the data store.
As shown in Dang (2012), in-database storage is also a promising approach for the effective distribution and visualization of data changing continuously in time and space. Examples of such phenomena are temperature, pressure, precipitation, snow cover, land use and population density.
Future steps in application development will aim at the implementation of additional analytical procedures: (a) the image histogram equalization for the sake of mosaicking, (b) the embedding of existing procedures on cartometric analysis into the SDI system and (c) the integration of the application with solutions for crowdsourced georeferencing.

Fig. 1 Implementation architecture for automated raster data management.

Fig. 4 The classification of the original NOAA image (a) into the categories of land, snow and cloud coverage (b).

Fig. 5 A segment of an early map with damage, changes of paper colouring and faded labelling: (a) before and (b) after application of normalization.
The methodology facilitates the identification of the key corresponding fields in both standards: the source MARC 21 bibliographic standard and the target ISO 19115:2003 geo-information standard. The following fields are acquired by processing the source documents: Title, Date, Date Type, Abstract, Purpose, Descriptive Keywords, Language, Topic Category, Scale Denominator, Temporal Extent, Geographic Bounding Box, Reference System Info.