private thoughts about life behind data processing



Releasing Open Source Part #2: Licenses

In the previous post I gave a number of reasons why open source is a terrific thing, especially for public organisations. I also promised to share some practical considerations about releasing software as open source. Here you are. My experience is based on releases at my employer, the Finnish Meteorological Institute, where I was actively involved. I’m mainly talking about releasing old, previously proprietary software as open source.

First the boring part: licences. I’m not a lawyer, so I may be inaccurate or even wrong in some cases. But especially when handling open source software, every engineer needs to think about licenses at least a bit.

The first question is of course: what license should I choose? To analyse this, one first has to decide between a permissive1 and a copyleft license2. The most important difference is that under copyleft licences, derivative work has to be published under the same license as the original work. This makes copyleft licenses quite scary for many organisations.

MIT license

Another notable aspect of licenses is the requirement to provide modifications back. For example, GPL3 requires that modifications be provided back to the community when the software is distributed. This is a nice idea and makes GPL quite popular with many. But it also makes using such projects very hard for many organisations.

The correct choice also depends on the status of the project. If you have unique software with an established position, GPL may be a good choice. If you have to fight for your position and building a community is hard, MIT or the like is much better. It’s often a good idea to select a license familiar to the target community: for Java developers Apache4 is familiar, while for some other communities MIT5 may be friendlier.

To choose a license one also has to ask: what license can I use? Because of the copyleft aspect, using for example GPL-licensed components requires the publisher either to use GPL or to remove the component. It’s notable, though, that GPL’s copyleft does not extend to system libraries. In practice, if you install your libraries from a common repository and link them dynamically, you probably don’t have to worry about their licenses.

In order to check what license can be used, one has to know what licenses are already used in the software. In an old and large software project this may be quite tricky. Happily, there are ready-made tools available. Black Duck Software6 provides this as a service. FOSSology7, developed mainly by Siemens, is probably the best open source option. FOSSology is also available as a Docker image8, which makes it very easy to run.

Running these tools is easy: just upload the source code and read the results. Reading the results, however, is not so easy. In practice, some manual checking is always required. In Finland, the non-profit organisation Validos9 has been a great resource for interpreting the results. For a membership fee, Validos checks the licenses of the libraries and components you use and shares the results with all members, so each component has to be checked only once.
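To give a feel for the automated first pass these tools perform, here is a deliberately naive sketch of my own (not how FOSSology works internally) that walks a source tree and greps file headers for a few common license markers. Real scanners match against hundreds of license texts and fingerprints, which is exactly why the manual review described above is still needed.

```python
import os
import re

# Very rough markers for a few common licenses; real tools such as FOSSology
# use far more sophisticated matching than this sketch.
LICENSE_MARKERS = {
    "GPL": re.compile(r"GNU GENERAL PUBLIC LICENSE|GPL", re.IGNORECASE),
    "MIT": re.compile(r"MIT License|Permission is hereby granted, free of charge"),
    "Apache-2.0": re.compile(r"Apache License,? Version 2\.0"),
}

def scan_tree(root):
    """Walk a source tree and report which license markers each file mentions."""
    findings = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    head = f.read(4096)  # license headers usually sit near the top
            except OSError:
                continue
            hits = [lic for lic, rx in LICENSE_MARKERS.items() if rx.search(head)]
            if hits:
                findings[path] = hits
    return findings

if __name__ == "__main__":
    for path, licenses in sorted(scan_tree(".").items()):
        print(f"{path}: {', '.join(licenses)}")
```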

Sample FOSSology output

Note that libraries and components added to the project after publication also need to be compatible with the selected license. This checking process has to be light enough that it actually gets followed and does not slow down software development. There are other aspects to selecting libraries as well: is the library viable? Who is developing it? What is their business model? The license check is a good moment to go through these concerns too.

So, dealing with licenses is boring. It’s hard, but not that hard. By asking what license you can use and whether you want a permissive or a copyleft license, you can get far. With ready-made tools like FOSSology and with dynamic linking, handling licenses is relatively easy.

Having covered this boring but necessary part, I will next move on to more interesting questions such as handling the source code and other practical issues.


1 https://opensource.org/faq#permissive

2 https://opensource.org/faq#copyleft

3 https://opensource.org/licenses/GPL-3.0, https://opensource.org/licenses/GPL-2.0

4 https://www.apache.org/licenses/LICENSE-2.0

5 https://opensource.org/licenses/MIT

6 https://www.blackducksoftware.com/

7 https://www.fossology.org

8 https://hub.docker.com/r/fossology/fossology/

9 http://validos.org/fi/

Why Open Source is a Great Thing for Public Organisations

No activity here lately; I have been busy with my other recreational project, visitkoli.fi1. This post is based on the talk I gave2 at the WMO EC-693 side event last May.

The weather and climate business is a somewhat special niche of the IT field. The sector is large enough that significant systems get built; even the world’s most powerful computers are used to calculate climate scenarios4. But it is also small enough that no proper software ecosystem has arisen for processing and handling the data. The sector also consists mainly of public institutes and large companies, which tends to keep the developed software private and bespoke.

This is why my employer, the Finnish Meteorological Institute (FMI), also has a variety of different kinds of software for handling weather and climate information.

FMI, though, is making a remarkable effort to build that ecosystem. In 2012 it decided to release its data as open data, and in 2013 a great share of all available data was published. In 2016, FMI decided to release its software as open source.5 In theory the decision concerned all software, old and new. In practice, every piece of software has to be considered separately, and no resources have been allocated to the releases.

Despite these caveats, some internationally very relevant software has been published. For example, SmartMet Server6 (a data and product server for MetOcean data) and HIMAN7 (a post-processing tool for weather data) are world-class software. As ’product owner’, publishing SmartMet Server has been my task. It has been hard work, and I intend to tell how hard in the next post. Before that, let me tell you why publishing software as open source is a great thing for any public organisation to do.

First of all, the ecosystem is changing. Public organisations and typical companies can’t control the whole value chain anymore. Demand for very specific services has increased, and there are plenty of companies happy to meet it. At the same time, large companies like IBM and Google are becoming more and more relevant players. So far, many public-sector tasks such as weather forecasting and cartography have been so complicated and resource-intensive that only governments have been able to do them. Today, many technology giants have enough resources and know-how to take them on. (Although they often base their work on open source models developed by the public sector.)

This, of course, poses some challenges to public organisations that want to stay in the business:

  1. Public organisations need more efficiency in development and operations.
    Even without healthy competition, the public sector will need to achieve more with fewer resources in the future.
  2. They need to ensure their authoritative voice.
    In some cases, objective and trustworthy information is vital: tornado warnings, statistics and news are a few examples.
  3. Most importantly, they need to ensure the impact of the information they produce.
    For example, national meteorological and hydrological institutes are successful when weather and climate don’t cause unanticipated, unwanted impacts on society.

Increasing collaboration is one of the most powerful ways to increase efficiency. National meteorological institutes, for example, are very well connected when it comes to observation and modelling activities. They have had to be: countries can’t make observations on each other’s territory, and very few countries have the resources to run a global weather forecast model on their own. But there is a long way from model output to end-user information, and the closer we come to the end user, the more protective organisations tend to be.

Why not extend this collaboration further down the value chain as well? Unfortunately, international co-operation of the kind the weather business requires is especially challenging due to differences in culture, motivation, environment, legislation and so on. Using open source software can lower these barriers a lot. It makes collaboration open and easy. There is no need for long and burdensome negotiations (or at least one topic can be handled easily). No one falls into vendor lock-in. If changes or modifications are needed, they can be made in-house or ordered from a third party.

Open source software enables every organisation to provide its own services without everyone doing the same work many times over.

Ensuring an authoritative voice and the impact of the produced information is mainly a question of how widely and correctly that information reaches the relevant audience. Maximal coverage requires several different channels and services, and one organisation can’t handle them all. Publishing data and products as open data helps, but in many cases the data is so complicated that the information distorts easily. Providing tools to handle and analyse the data empowers third-party users to utilise it correctly. Proper tools thus ensure that the information stays consistent regardless of channel and service provider.

There is one more aspect of open source that concerns public organisations in particular. Software produced by a government institute is ultimately produced and owned by the whole society. Releasing it as open source is a very simple and neutral way to ensure that everyone can utilise it: companies, private developers and other countries alike.

”No one is left behind”8.


1 https://visitkoli.fi/

2 http://meetings.wmo.int/EC-69/_layouts/15/WopiFrame.aspx?sourcedoc=/EC-69/Presentations/EC-69%20FMI_Possibilities%20of%20open%20source%20code%20for%20meteorology,%2011%20May_Geneva_2017.pdf&action=default

3 http://meetings.wmo.int/EC-69/SitePages/Session%20Information.aspx

4 http://www.bbc.com/future/story/20170621-meet-the-worlds-most-powerful-computer

5 http://en.ilmatieteenlaitos.fi/open-source-code

6 https://github.com/fmidev/smartmet-server

7 https://github.com/fmidev/himan

8 https://sustainabledevelopment.un.org/hlpf/2016

Retrospective to FMI Open Data

Three years have passed since the Finnish Meteorological Institute released its open data portal, so now is a good time for a retrospective. When the Open Data project started, I had just finished my M.Sc. thesis (although I had already worked full time at FMI for several years) and I was ready to take a serious role in something cool. And a serious role I took indeed. In the beginning I worked in the project as an OGC2 and INSPIRE3 specialist, but soon I found myself acting as the overall architect and being responsible for the Download Service.

It’s worth noting that this is my personal blog, and this text reflects only my own part in, and my own opinions about, the open data work.

FMI’s open data initiative covers a significant amount of frequently updated data. The service contains weather, marine, air quality and radiation observations, weather and marine forecasts, and different kinds of climate data. To browse the whole content one can access either the Data Catalog4 or the Download Service’s stored query listing5. Many of the data sets are large (up to ~30 gigabytes) and almost all of them are updated at intervals of between one minute and six hours. This combination of large size and high update frequency makes weather data quite unique. The FMI Open Data Portal is practically fully INSPIRE compliant, with the correct services, harmonised data formats and a proper catalog.

The INSPIRE directive requires every EU state agency to share its geospatial data in certain formats via certain services, but it does not require open data. The simplest way to meet the directive is to create a simple RSS feed with links to downloadable items. The more sophisticated services, relying heavily on OGC standards, do not have to be ready until 21 October 20206. By that deadline, metadata needs to be shared via a CSW interface7 in ISO 19115 format8. The data has to be accessible via a View Service, meaning a WMS 1.3.0 interface9, and a Download Service has to be established for downloading the data itself. WFS 2.010 is one obvious choice, though others can also be used. The data has to be provided in INSPIRE-specific data models (derived from the OGC O&M model11), either directly from the Download Service or via a separate Transformation Service.


In 2012 we were sitting in a large meeting room, maybe twenty of us, starting a project established to meet the INSPIRE requirements. The data group had drafted several different scenarios for the data release, but at the beginning nothing had been decided.

One notable peculiarity of INSPIRE is that it offers these two different ways to provide the services. While some people preferred RSS feeds as the simpler option, I recommended going with WFS right from the beginning; double work very rarely makes sense. Luckily I got some support, and we decided to implement a WFS interface in our data server, SmartMet Server12. There was also a clear consensus in the project team that we would meet the INSPIRE requirements and any possible open data release with the same solution.

A few months went by designing data formats, populating the catalog and speculating about WMS view services. In August the Ministry of Transport and Communications decided to support FMI open data with 5.8 million euros13 (part of the budget increase was needed to compensate for decreased data sales revenue, and part went to the data release project itself). And things got going.

We had chosen our path. But how did we do? Was following the INSPIRE specifications a good choice? Of course we had alternatives. We could have chosen RSS feeds for INSPIRE and some other approach for open data. Or maybe, just maybe, we could have chosen to act as a backend for our open data users, as HSL does14.

Steven Adler, IBM Chief Data Strategist, laid out the requirements for well-published open data at an OGC meeting in January 2016:

  • It has to be linked to other data.
  • It has to be linked spatially (and has to be referable internationally).
  • The data has to be machine readable.
  • The data and the publishing process have to be well governed.
  • The project has to be public; people have to be able to find the data.
  • The data has to be openly usable.

Note that this list does not include ”easily usable”, although that is of course a good thing. Things should never be more complicated than they need to be, but not any simpler either. Let’s see how we measure up against these requirements.

Machine Readability

Choosing to follow the INSPIRE directive has drawn some criticism but also some praise. The criticism has mainly come from private developers and the praise from larger companies. In my previous post15 I weighed the good and bad sides of the directive: it is based on standards and includes a very good semantic structure, but it is unduly complicated. Probably because of that complexity, it is not very popular. INSPIRE is also poorly documented; there is certainly a lot of documentation, but very few practical examples.

But was INSPIRE a good choice? Simple data formats enable quick and simple development but soon fall short on features. Serious business requires reliable interfaces, and using a well-defined generic standard is the best way to ensure a reliable design. When fully implemented by all European countries, INSPIRE will provide the maximum amount of technical reusability. And it certainly provides good semantics. Well-designed open data has to be linked to other data, linked spatially (and georeferenced) and referable. One might think these aspects come naturally with weather data, but that is not the case: the metadata used in FMI’s internal systems wouldn’t tell an external user much. After all, INSPIRE was designed with the 5-star model16 in mind.


Is INSPIRE commonly used for open data? No. The Open Knowledge Foundation recommends certain file formats17, but even though they march for openness, they don’t give much support to OGC standards. That’s a serious issue, but another story for another rainy day. As a result, open data portals follow no common standards, which causes interoperability problems and a huge amount of extra work for users.

Well Governed

Data governance is a serious issue. Unlike many other businesses, the weather industry has always lived off its data, and there are long traditions of producing, managing, archiving and consuming it. Still, the open data project improved our data governance. First of all, it helped us get all of our data into one place, behind one service. It also improved our metadata practices by forcing us to describe our data sets very precisely. Our users don’t know our data the way we do.

One especially good aspect18 of the FMI Open Data Portal is that the open data is published through the same mechanisms FMI uses itself. This helps keep the data correct and up to date.

Maybe the most challenging aspect of opening the data is the culture change. Suddenly people have to work in a showcase: the results of their work are public, their processes should be public, and they get comments and criticism. One example of this is the opening phase itself. In practice, not all data can be opened at once, so selecting which data sets are opened should be done in a transparent way19. The process should be clear and open for comments.

I think that change management can’t be emphasised enough!

Openly Usable

Even with open data, ”openly usable” is not a trivial question. Open data ain’t free: someone pays for it, and the funder typically wants some kind of control and tracking. But every kind of control and monitoring decreases openness.

FMI, too, requires its users to register. After registration, the user gets an API key which has to be added to every WMS and WFS request. The requirement stems from the need to report usage information: 5.8 million euros is not a small amount of money, and anyone investing that much surely wants to know what they get for it. The API key requirement was also put in place to ensure that everyone gets similar access to the data, since unduly heavy usage can easily be tracked and shut down.
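For illustration, a request with the key embedded might look roughly like the sketch below. The endpoint layout and the stored query id follow the pattern I recall from the open data manual29, but treat them as placeholders and check the current documentation before relying on them.

```python
import requests

# Illustrative only: the endpoint path and stored query id are placeholders in
# the style of the FMI open data manual, not guaranteed to be current.
APIKEY = "insert-your-apikey-here"
BASE_URL = f"http://data.fmi.fi/fmi-apikey/{APIKEY}/wfs"

params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "storedquery_id": "fmi::observations::weather::simple",
    "place": "Helsinki",
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()
print(response.text[:500])  # the payload is INSPIRE-flavoured GML/XML
```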

An API key requirement is quite typical; Facebook and Google, for example, use the same kind of authentication. Still, the registration has drawn some criticism. Reasonable or not, it has one severe implication: resources can’t be linked directly20. One can’t hand out URIs to the data, which makes it less referable and thus less usable. In practice, a personal API key also makes documentation and presentations much more complicated, because examples can’t be working, easily testable requests.

Another issue that can reduce usability is the licence of the data. FMI, and the Finnish government in general21, uses CC BY 4.022 as the licence. It’s an open and free licence which only requires crediting the original author. There are few common licences more open than this, but even so it has drawn some criticism.


The ideology behind the open data movement is strong and well reasoned. Still, it’s good to remember that open data isn’t free, and some compromises have to be made on the user side as well. Especially in the beginning, when organisations are moving to open data, they may need tighter control and more detailed information about how their data is used. Once the ecosystem is ready and stable, things can be loosened.

Public

Open data doesn’t help if nobody knows about it. FMI is in the fortunate position of being a very well known organisation, so we didn’t need to advertise our data. But we sure had to work hard describing how to use the data portal. We held workshops, a press conference and talks23, participated in the Apps4Finland24 coding competition, and spread the word on Facebook25 and Twitter26. We also created example programs27 and JavaScript libraries28 to lower the barrier to using the portal.

And of course we wrote quite a bit of documentation29. I think this all went relatively well. People do know about our data, and I haven’t met anyone who hasn’t been able to use the portal. Our documentation is complete and thorough, but it could still use some practical step-by-step quick start guides. When documenting systems like this, it’s always a good idea to find someone who doesn’t know the system beforehand to write the user’s guide.

Impacts

So, once things got going at full speed, the whole of FMI worked very hard to get everything ready. In March 2013 we opened a beta version of the portal, and in June it gained official status. We got all systems up and running and most data sets opened. I ended up with over a hundred extra hours on my working time account.

Did we do well? Measuring the impact is hard. The Danish government did an impact analysis of its data before and after publishing it as open data30. That is a clever approach, but measuring the impact is still hard; even getting the numbers is a serious task.


Our numbers are pretty good. At the moment we have over 10 000 registered users and over five data downloads every second; in total, over 300 million transactions have occurred since 2013. What is also notable is that the growth has been steady, though never very rapid. Systems using the data are relatively complex and take time to implement. Anyone publishing open data should be patient: even three years is a short time to expect significant results.

Another notable point about the numbers is that they are still very modest compared to FMI’s Customer Data Service (which provides the very same data to fmi.fi31, mobile applications32 and clients). While 300 million open data requests have been made since 2013, over 32 billion requests have been served by the Customer Data Service in the same period. A premium service remains a vital complement to the open data portal in serving society.

The numbers also hide one distressing issue: only 40–50 % of registered users actually download anything. One can only guess who this majority is that registered but never downloaded anything. Some may have expected a nice user interface for pulling the data into Excel. Some may have planned to build something in their free time but never finished the project. Who knows what else.

Open data by definition makes it hard to keep in touch with users. The API key brings some granularity to the numbers, but that’s pretty much all. That’s also one reason why it’s extremely important to be out there talking with people: there is no better way to learn what’s relevant than talking to your users.

I consider FMI Open Data a success, both personally and from the organisation’s point of view. But the world moves forward, and one should not stop at publishing open data. Even large organisations often need partnerships33 to get the maximum effect out of their data, and open source code is needed as well34. To my delight, FMI is on a very good track. But that’s another story.


1 https://en.ilmatieteenlaitos.fi/open-data

2 http://www.opengeospatial.org

3 http://inspire.ec.europa.eu

4 http://catalog.fmi.fi/geonetwork/srv/en/main.home

5 http://en.ilmatieteenlaitos.fi/open-data-manual-fmi-wfs-services

6 http://inspire.ec.europa.eu/inspire-roadmap/61

7 http://portal.opengeospatial.org/files/?artifact_id=5929&version=2

8 http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=53798

9 http://www.opengeospatial.org/standards/wms

10 http://www.opengeospatial.org/standards/wfs

11 http://www.opengeospatial.org/standards/om

12 https://github.com/fmidev/smartmet-server

13 https://www.lvm.fi/lvm-mahti-portlet/download?did=89517

14 http://dev.hsl.fi/

15 http://roopetervo.com/some-words-about-inspire/

16 http://5stardata.info/en/

17 http://opendatahandbook.org/guide/en/appendices/file-formats/

18 http://opendatatoolkit.worldbank.org/en/technology.html

19 http://sunlightfoundation.com/opendataguidelines/#prioritization

20 https://www.w3.org/TR/gov-data/#concepts.link

21 http://www.jhs-suositukset.fi/suomi/jhs189

22 https://creativecommons.org/licenses/by/4.0/deed.fi

23 http://www.slideshare.net/tervo

24 https://trello.com/b/Fg4ESbPs/apps4finland-2014

25 https://www.facebook.com/fmibeta/

26 https://twitter.com/meteorologit

27 https://github.com/fmidev/metoclient-ui

28 https://github.com/fmidev/metolib

29 http://en.ilmatieteenlaitos.fi/open-data-manual

30 http://inspire.ec.europa.eu/events/conferences/inspire_2014/pdfs/19.06_4_09.00_Tina_Svan_Colding.pdf & http://opendatahandbook.org/value-stories/en/danish-address-registry/

31 http://ilmatieteenlaitos.fi/

32 http://ilmatieteenlaitos.fi/palvelunumerot-ja-mobiilisaa

33 http://sunlightfoundation.com/opendataguidelines/#partnerships

34 http://sunlightfoundation.com/opendataguidelines/#open-code

Some words about INSPIRE

The INSPIRE conference is underway, and as I’m giving two presentations there, it’s a good time to write down some thoughts about the directive. I’ve been somewhat active around it for four years now, mostly on the practical side: the Finnish Meteorological Institute (FMI) has very extensive INSPIRE services, since the same services are used for both open data and INSPIRE.

INSPIRE is designed to ease data exchange between European countries. It is a hugely complicated EU directive which isn’t very widely known or utilised, although by now all European agencies should provide INSPIRE services. In short: it requires all public spatial data providers to make their data available via OGC services in a certain form and to publish ISO 19115 metadata about it. The required data formats vary a bit depending on the domain, but all of them are based on the Geography Markup Language (GML) and the Observations and Measurements (O&M) data model. The directive does not require open data.

In general, there is no single type of API to rule them all. While INSPIRE represents the OGC world with the O&M data model, WFS, WMS and other three-letter standards, many modern-edge developers would like to see much simpler REST APIs with JSON-based data models. There is also always an old guard who prefer FTP and more traditional data models and file formats; BUFR and GRIB, for example, are popular in the meteorological domain.

All of these technologies have their upsides and downsides. The old guard’s file-based FTP systems have worked for decades; it’s a stable technology and there are ready tools to implement and use it. But generating and moving files for every transaction quickly falls short in a modern service-based architecture. It’s better to move applications near the data than the data near the applications.

Modern Edge

Modern-edge developers prefer everything quick and easy. They have no interest in wrestling with FTP, complicated command line tools or C libraries just to process data for their specific use case. Python and JavaScript libraries, together with Node.js and other modern web development tools, have enabled rapid prototyping and application development. A simple RESTful JSON interface combined with these powerful, easy-to-use development frameworks is indeed an excellent combination for creating small applications and web pages.

But most JSON-based data formats and REST interfaces quickly fall short in serious geospatial business, partly because of their technical features and mostly because they oversimplify things. Some of these aspects are listed below:

  1. There are very few standards in the REST-JSON world. For example, there is no standard way to select the area of interest (a bounding box, say), the time or the output format. Every single client has to be a bespoke development, and every data source has to be mapped case by case. It’s fast to develop a single client, but it doesn’t take you far.
  2. For the standards that DO exist, there isn’t enough expressive power. For example, the GeoJSON community has gone so far in simplification that only the WGS84 CRS is allowed, even though that limits accuracy to a couple of hundred metres in some use cases. I hope the future autopilots in aeroplanes are not based on GeoJSON.
  3. Typical JSON interfaces don’t specify a standard way to indicate missing data (as opposed to a broken pipeline) or changes in data content. Environmental observations may be missing for countless reasons: snow depth may be missing because the observation station is down, because the station has no snow depth sensor, or simply because it’s summer. A typical JSON payload just says ”not a number”. The sketch after this list illustrates the ambiguity.
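To make the missing-data point concrete, here is a small made-up comparison: a plain JSON observation where an absent snow depth is just null, next to a payload that carries an explicit reason in the spirit of O&M nilReason attributes. The field names are hypothetical, not taken from any real FMI or INSPIRE schema.

```python
import json

# A typical ad-hoc JSON payload: the consumer cannot tell whether the sensor
# is broken, missing, or whether there simply is no snow.
plain = json.loads('{"station": "Kumpula", "snow_depth": null}')

# An O&M-inspired payload (field names invented for illustration): the missing
# value carries an explicit reason, so a client can react sensibly.
with_reason = json.loads(
    '{"station": "Kumpula",'
    ' "snow_depth": {"value": null, "nilReason": "inapplicable (season)"}}'
)

def describe(obs):
    """Explain what a client can conclude from the snow_depth field."""
    field = obs["snow_depth"]
    if isinstance(field, dict):
        if field["value"] is None:
            return f"missing, reason: {field['nilReason']}"
        return f"{field['value']} cm"
    return "missing, reason unknown" if field is None else f"{field} cm"

print(describe(plain))        # -> missing, reason unknown
print(describe(with_reason))  # -> missing, reason: inapplicable (season)
```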
Some interfaces and data types mapped against real life interoperability vs. rich interaction

In addition, it’s hard to build intelligent programs on top of overly simple data services, since there is no way to know what the content really is; a human is always needed to interpret it. Unless you happen to be Google-wise.

INSPIRE

INSPIRE, allied with the OGC, tries to tackle the challenges of both the old guard and the modern-edge developers. It comes from the top as an order, and it’s not very popular. I can see why. When I first saw INSPIRE-compliant data, I hated it (largely because of the overhead). The data format is too complicated, and one needs a hell of a lot of time to get familiar with it. It doesn’t help that it is also very poorly documented from the user’s point of view.

So why is INSPIRE actually a pretty good thing?

First, INSPIRE provides a single framework for all European geospatial data providers to follow. Second, it’s based on standards. After wading through all the theoretical jargon and actually getting an INSPIRE-compliant client developed (which certainly takes more time than a simple REST-JSON client), one should be able to use (almost) the same client with ALL European geospatial services.

At least if INSPIRE is really followed in all countries. Some data specification experts seem to have taken a do-as-little-as-possible strategy, driving only their nation’s goals rather than the common European goal. As a result, no complete harmonisation has really been achieved for large data sets such as imagery or meteorological data.

Another challenge with INSPIRE is that the data formats do not perform very well, mostly because of the metadata they carry. (OK, the overhead compresses very well, but still: all the compression and decompression requires a lot of CPU resources, and the uncompressed data is still slow to handle.) But INSPIRE is meant for data exchange, not to serve as the backend of a client. It is meant to let a single client handle several different types of data, and that kind of interoperability can’t be achieved with a very simple design.

Relative INSPIRE data model file sizes. Demonstration done with 138 weather stations, 11 parameters and 12 hours of data.

From the user’s point of view there is an even more serious challenge: there are very few INSPIRE-compliant clients. INSPIRE arrived as a top-down order, and it does not (at least yet) have very wide community support. This makes utilising the somewhat complicated services even harder. Time may help, but I wouldn’t hold my breath.

Conclusion

So, every type of interface has its pros and cons. Simple data formats and interfaces enable quick and simple development, but when nothing is given, nothing is gained. INSPIRE provides a good standards-based solution, but it has a lot of disadvantages. Most importantly, while OGC services with GML output have many excellent features, they have proven too complicated for many use cases. JSON and REST should not supplant standards; rather, the standards should (and will) start supporting JSON and REST.

Will INSPIRE succeed? That depends on how widely it is accepted among data providers. For wider acceptance, the standards and data models should adopt JSON and REST, and much clearer documentation with examples is needed. After that we could expect to see some general open source INSPIRE-compliant clients.


http://inspire.ec.europa.eu/events/conferences/inspire_2016/page/home

https://en.ilmatieteenlaitos.fi/open-data

http://inspire.ec.europa.eu/

http://www.opengeospatial.org/

http://www.iso.org/iso/catalogue_detail.htm?csnumber=53798

http://www.opengeospatial.org/standards/gml

http://www.opengeospatial.org/standards/om

https://en.wikipedia.org/wiki/Application_programming_interface

http://www.opengeospatial.org/standards/wfs

http://www.opengeospatial.org/standards/wms

https://en.wikipedia.org/wiki/Representational_state_transfer

http://www.json.org/

https://www.wmo.int/pages/prog/gcos/documents/gruanmanuals/ECMWF/bufr_reference_manual.pdf

https://en.wikipedia.org/wiki/GRIB

https://nodejs.org/en/

http://geojson.org/

https://en.wikipedia.org/wiki/Geodetic_datum

It’s all about I/O

In previous posts I’ve talked about data dissemination and how crucial post-processing and exporting the data are. Producing a huge amount of data is a useless effort if it can’t be refined into, and delivered as, information. But a weather model is not the source of all information. Weather models are based on many kinds of observations of the atmosphere, and before the model is run, those observations need to be processed and read in. That is far from a trivial task.

A few months ago there was a buzz about Panasonic’s new weather model1. The TV maker has been developing its own weather model alongside its TAMDAR2 (Tropospheric Airborne Meteorological Data Reporting) services, a weather observation system attached to airplanes. While providing weather services for airlines, it gathers a large amount of data useful for weather forecasting. Instead of selling all that data away, Panasonic took the GFS3 weather model and developed its own version to take full advantage of the observations it had.

This is the first time a private company has been able to develop and run a global weather model that can compete with governments’ models. I don’t find the news itself very surprising. Creating a well-functioning weather model core certainly requires a huge amount of research, but the physics is commonly known. And many global weather models are open source, which gives private companies a good starting point for further development.

But there is another notable aspect in the article. One of the main advantages Panasonic claims is its large volume of TAMDAR observations. Still, it’s not that government modellers haven’t had enough observations available; they just haven’t been able to use them. Weather forecasting has been a driving force in IT development for decades. It created the Global Telecommunication System (GTS)4 to share weather observations before the World Wide Web5 existed, and today the most powerful computers are used for weather and climate modelling6. But a long and glorious history has its disadvantages as well. Models are typically written in Fortran7, which has great mathematical power but lacks modern data I/O capabilities. Model developers are traditionally physicists, not IT specialists, with more interest and skill in the models’ scientific performance than in building the most sophisticated data flow. Public-sector weather modellers are also typically running and developing operational systems with a large amount of legacy, which slows down development.

Regardless of the partly inevitable disadvantages that traditional public-sector weather forecasters face, the article emphasises how crucial data assimilation (reading the data in) is. A good data assimilation process8 is also one reason why the ECMWF model is better than GFS9. (These training course lecture notes10 give an idea of how complicated data assimilation can be.) Having the best available IT resources for building fluent data I/O may be one more advantage for companies like Panasonic.


1 http://arstechnica.com/science/2016/04/tv-maker-panasonic-says-it-has-developed-the-worlds-best-weather-model/

2 https://en.wikipedia.org/wiki/TAMDAR

3 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

4 https://www.wmo.int/pages/prog/www/TEM/index_en.html

5 https://en.wikipedia.org/wiki/World_Wide_Web

6 http://www.wired.com/2016/06/fastest-supercomputer-sunway-taihulight/

7 https://en.wikipedia.org/wiki/Fortran

8 http://www.ecmwf.int/en/research/data-assimilation

9 http://arstechnica.com/science/2016/03/the-european-forecast-model-already-kicking-americas-butt-just-improved/

10 http://www.ecmwf.int/sites/default/files/Data%20assimilation%20concepts%20and%20methods.pdf

Circles of Influence

In the last post I picked out three points to be learned from holacracy1, a new way of organising companies. Let’s recall them:

  1. Meeting habits: in particular, one tension at a time2 is a very useful principle for keeping meetings concise.
  2. ”What do you need?” is a great question that emphasises the responsibility of individuals.
  3. Drawing the circles as they exist at the moment and as they should exist is a great exercise. The diagram can be used in many ways.

Now it’s time to dig into the circles and see how they can be used.

First, some background about our group. We are responsible for FMI’s B2B solutions. We are a DevOps3 team: we install and operate our servers from the OS level up. Basically, our group has two major parts: SmartMet Server, providing data interfaces, and (mainly web-based) UI development. SmartMet Server4 is a very high-performance C++ server environment; at the moment it holds about 2 TB of frequently updated data and responds to over 30 million requests per day. Our web solutions are mainly concentrated on the ’Ilmanet’ product delivery network, where account managers can manage clients and their orders.

We are 15 people in total at the moment. Most people in our team are able to do most of its tasks, and all core members have to be able to operate the whole system and fix problems if necessary. I believe in generalisation over specialisation5. Although there may be some efficiency hit6, generalisation provides agility and adaptivity7 and keeps everyone’s skills up to date8.

Our circles look like the following. Dashed lines stand for circles and solid lines represent roles. The outermost circle is our team within the organisation, and the numbers under role names tell how many people hold that particular role. There are of course more overlaps between the circles than the image shows; it would be useful to visualise these overlaps, but I did not manage to keep the diagram readable while doing so. The sizes of the circles in the image roughly illustrate the importance of each circle, not its headcount.

So, what’s the point of drawing circles like this? The diagram can be used in several ways:

1) It’s a great tool for conversations. Who is in each circle? Should I be in that circle? What skills would I like to, or need to, learn in the future? Should I leave some circle, or join another? In my internal version I of course have people’s names in the circles instead of a head count.

2) The image above describes the current situation, but of course it’s useful to ask what circles we should have, and especially whether we should get rid of some. For example, we should consider establishing a circle of consultants that acquires, and provides to the other circles and our clients, special knowledge about standards and other technologies needed in our domain. Members of the circle could map out together what kind of knowledge should be acquired in the future, organise workshops for the other circles, and so on. The OGC9 Specialist, INSPIRE10 Specialist, Data Specialist and Consultant roles would at least belong to that group.

3) Should we merge or split some activities? Note that merging and splitting activities has direct implications for what skills people will have. In our case the Sysadmin circle could, and maybe should, be dissolved into the other circles, although keeping it separate makes it easier to treat developing the production environment as its own role.

4) This kind of illustration is a great tool for structuring the group’s work. Having a clear idea of what is needed and how one’s work supports others helps everyone see what’s important and what’s not. It also motivates people: why is my work important? What are my ten cents to this community? Holacracy is indeed a very purpose-driven way to organise, and this kind of exercise encourages defining clear purposes for all circles. Here are some examples based on the image above:

  • Customer Solutions: Customer Solutions makes the best weather services in the world. We are a fast and agile group with state-of-the-art knowledge of the newest technology and special skills for handling large data volumes.
  • Sysadmin: Sysadmins provide an easy-to-use, high-performance infrastructure for the other circles to build services on. They keep the systems and services up and running, ensuring high availability.
  • SmartMet Server: The SmartMet Server circle develops a high-performance, high-availability data server for internal and external systems and user interfaces. Thanks to them, the other circles, FMI open data users and FMI customers can easily fetch the data they need.
  • […]

Holacracy’s circles also remind me of Stephen R. Covey’s Circle of Concern and Circle of Influence11, introduced in his famous book The 7 Habits of Highly Effective People12. Covey noted that people have a circle of concern encompassing the range of concerns in their life, and a circle of influence, the personal range within which they can actually affect things. Proactive people act on the concerns they can affect and enlarge their circle of influence through numerous techniques; reactive people tend to focus their energy outside their range, and their circle of influence shrinks. When concerns arise at work, it may be beneficial to map the concerns and the circles onto the same picture and see whether there is a circle and a role to handle the problem.

I encourage every leader, or anyone interested, to do the same exercise and discuss it with others. It may not always be so clear how things are and how they should be. This work could of course be taken further; to name one thing, it would be nice to visualise the dependencies between the circles and the most important stakeholders.



1 https://en.wikipedia.org/wiki/Holacracy

2 https://blog.holacracy.org/one-thread-at-a-time-7b297718bc59

3 https://en.wikipedia.org/wiki/DevOps

4 http://www.slideshare.net/tervo/foss4g-fmi-opendataroopetervo

5 http://noop.nl/2008/04/specialization.html

6 http://www.bryanbraun.com/2012/01/22/generalizing-vs-specializing

7 http://tech.co/specialization-vs-generalization-business-strategy-2015-08

8 http://www.wisebread.com/is-it-better-to-specialize-or-generalize

9 http://www.opengeospatial.org

10 http://inspire.ec.europa.eu

11 http://uthscsa.edu/gme/documents/Circles.pdf

12 https://www.stephencovey.com/7habits/7habits.php

Three Points to Learn from Holacracy

There’s a lot of buzz around holacracy1, a new way of organising. My great mentor Tony Virtanen2 recommended me a book by Brian J. Robertson, Holacracy: The Revolutionary Management System That Abolishes Hierarchy3. That is a book worth reading.

In brief, holacracy replaces the traditional workplace hierarchy with a very strict structure where everyone is responsible for their own domain. Instead of units, it defines circles, roles and policies. Circles and roles can be rearranged in governance meetings4 on the initiative of anyone in the circles. If you want to learn more about holacracy, I suggest consulting Wikipedia1 and http://www.holacracy.org5.

Holacracy certainly is a beautiful idea. It is said to support adaptivity6 and keep motivation up7, and it may also make working days more efficient. But it relies heavily on people’s motivation and readiness to take responsibility. Holacracy also requires that the purpose of the company and of the circles (the closest equivalent to departments in traditional organisations) be clearly stated, which is good. But what if the goals of different circles don’t converge? That is the case in many government institutes and large corporations alike. Are holacratic organisations able to form new super-circles, for example to build larger alliances with other companies? That may be hard, especially if the new goals conflict with the sub-circles’ goals. I see a lot of potential for sub-optimisation here.

 


But even if holacracy is not the right choice for everyone, there are a couple of good ideas worth exploiting:

  1. The meeting habits, especially the tactical meetings8 defined in the holacracy constitution9 (and described in the book), can be useful. Even if one doesn’t want to follow the structure slavishly, one tension at a time10 is a very useful principle for keeping meetings concise.
  2. ”What do you need?” is a great question for resolving tensions. It emphasises the responsibility of individuals.
  3. Drawing the circles as they exist at the moment and as they should exist is a great exercise. The diagram can be used in many ways; I’m going to dig into that in my next post.

To summarise, holacracy sounds great but can work only if crucial, and unfortunately rare, preconditions are met.



1 https://en.wikipedia.org/wiki/Holacracy

2 https://fi.linkedin.com/in/tonyvirtanen

3 http://amzn.com/0241205859

4 http://www.holacracy.org/governance-meetings

5 http://holacracy.org

6 http://www.fastcompany.com/3045848/hit-the-ground-running/heres-why-you-should-care-about-holacracy

7 http://www.forbes.com/sites/stevedenning/2015/05/23/is-holacracy-succeeding-at-zappos/

8 http://www.holacracy.org/tactical-meetings

9 http://www.holacracy.org/constitution

10 https://blog.holacracy.org/one-thread-at-a-time-7b297718bc59

Large Volume Data #Demand Side

In my last post I analysed the requirements that have to be met to fully utilise new, frequently updated, large-volume data. Now, after discussing the topic for two days, I have some further thoughts, mostly about demand.

Big (environmental) data producers have surprisingly similar thoughts about what is needed. As I pointed out in the previous post, data volumes are growing fast and data is updated faster and faster. For example, ECMWF (the European Centre for Medium-Range Weather Forecasts) receives about 300 million observations and produces over 8 TB of new data every day.

It’s not hard to produce this much data. It’s hard to handle it.

The traditional workflow is pretty straightforward:

  1. you get some new information,
  2. you run your model and get some data,
  3. you trigger product generation and
  4. push the products to users.

This doesn’t work any more. Measurements can’t be seen as single events but as a constant flow of raw data. Like a stream of photons passing by: you are not interested in the individual photons and you don’t try to store them, but together they give you a great amount of information about the current moment. This is why all this data is produced; more light, more accurate picture. Users want information, not data. And they want it when they need it, not when the next photon happens to hit their retina.

The huge number of measurements and the large data volumes have been driving a paradigm shift from push to pull. Instead of good old FTP, users are given interfaces for requesting information whenever they need it, trusting that they always get the best information available at that moment. Data can be fetched when needed, and only when needed.

Data producers have also had to overhaul their product generation architectures. Five years ago the Finnish Meteorological Institute was producing over a million images a day that were fetched only about five million times before they went stale. Nowadays, more and more products are generated on demand, only when users request them.
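As a toy sketch of the on-demand pattern (purely illustrative; this is not how SmartMet Server or any FMI system is actually built): products are rendered lazily at request time and cached until the underlying data changes, so only products that somebody actually asks for consume rendering capacity.

```python
import time

CACHE = {}          # product_id -> (rendered_product, data_version)
DATA_VERSION = 0    # bumped whenever new model or observation data arrives

def render_product(product_id):
    """Stand-in for an expensive rendering step (e.g. drawing a map image)."""
    time.sleep(0.1)  # pretend this is heavy work
    return f"product {product_id} rendered from data version {DATA_VERSION}"

def get_product(product_id):
    """Return a product, rendering it only if no fresh cached copy exists."""
    cached = CACHE.get(product_id)
    if cached and cached[1] == DATA_VERSION:
        return cached[0]                      # fresh enough, no work done
    product = render_product(product_id)      # render on demand
    CACHE[product_id] = (product, DATA_VERSION)
    return product

# Only products that are actually requested ever get rendered.
print(get_product("temperature-map-europe"))
print(get_product("temperature-map-europe"))  # served from the cache
```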


The on-demand paradigm can also be utilised in processing the information. A great deal of post-processing can be done only when users want to investigate a phenomenon of interest. Take a look at NOAA’s version of this, for example.

So how is this achieved? In the last post I listed three basic requirements:

  1. standardisation and harmonisation,
  2. rich interfaces to retrieve the data (even at the cost of ease of use) and
  3. bringing users to the data by providing processing capabilities where the data is located.

The truth can still be found there. With these guidelines, anyone can become an island of information in the sea of data.

 

 


 


Sources:

http://www.copernicus.eu/sites/default/files/library/Big_Data_at_ECMWF_01.pdf

http://www.ncdc.noaa.gov/wct/

Future Infrastructure for Large Data Sets

Next week I’m attending a highly interesting EUMETSAT1 Data Services Workshop to find a future path for disseminating and processing environmental satellite data. Here are some considerations on the topic.

Traditionally, data dissemination has been file-based: large data sets are chopped into pieces and delivered via an FTP server or the like. NOAA’s GFS dissemination system2 is a typical example of the traditional way to provide data.


Today, as data sets grow bigger and bigger3 and update frequencies get higher and higher, the traditional way is not enough. When handling large, frequently updated data sets, I/O (both network and disk) is the bottleneck. That’s why there should be as little copying and transferring of data as possible. To achieve this, three requirements should be met:

1) Data has to be in a standard, open format with good software support, so that data users don’t have to convert it to other formats.

Providing a good open standard format is a domain-specific problem. Still, it is a good principle for software developers to support the original formats and transform data into the data model required by the software on demand. Converting everything into a software-specific data format may seem seductive at first but quickly becomes a bottleneck.

2) Users need to be given the possibility to retrieve only the subset of the data they are interested in. The area of interest, for example, may change routinely (consider a moving ship in the Arctic), so retrieving a subset needs to be automatable.

In the geospatial domain, the Open Geospatial Consortium (OGC)4 provides a good set of interface standards to follow. The WMS5, WFS6 and WCS7 standards are enough for most cases. Notably, these standards are soon to be extended with a PubSub standard8, which helps with data dissemination.
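A tiny sketch of what automated subsetting can look like from the client side: recompute a bounding box around the ship’s latest position and request only that window from a WMS. The endpoint and layer name below are made up; the parameters follow the standard WMS 1.3.0 GetMap request.

```python
from urllib.parse import urlencode

def bbox_around(lon, lat, half_width_deg=2.0):
    """Bounding box centred on a position, in lon/lat degrees (CRS:84 order)."""
    return (lon - half_width_deg, lat - half_width_deg,
            lon + half_width_deg, lat + half_width_deg)

def wms_getmap_url(base_url, layer, bbox):
    """Build a WMS 1.3.0 GetMap request for just the area of interest."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "CRS": "CRS:84",  # lon/lat axis order, avoids EPSG:4326 axis confusion
        "BBOX": ",".join(f"{c:.3f}" for c in bbox),
        "WIDTH": 512,
        "HEIGHT": 512,
        "FORMAT": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical ship position in the Arctic; endpoint and layer are placeholders.
ship_lon, ship_lat = 25.0, 78.5
print(wms_getmap_url("https://example.org/wms", "ice:concentration",
                     bbox_around(ship_lon, ship_lat)))
```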

For industrial use a proper SLA is required, but since OGC services can be chained, the data provider does not need to act as the backend for every client. This means that request limits can be kept relatively tight.

3) Where possible, users can be given access to the data at or near its original source. NOAA’s Data Alliance program9 and the Landsat program10 are excellent examples of this. Users on Amazon AWS can attach volumes containing the open data sets10 to their server instances and use the data without copying or converting it at all. Google Cloud Platform11 also provides data within its services.

Producers of large data sets have significant ICT infrastructure of their own, so it should not be a tremendous task to open up some processing capacity for users. Users could purchase virtual servers with read access to the raw data, or do their processing via WPS12.

As Jason Fried noted, you should always sell your by-products13.

Sample infrastructure for large volume data dissemination


1 http://www.eumetsat.int/website/home/index.html

2 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

3 http://www.copernicus.eu/sites/default/files/library/Big_Data_at_ECMWF_01.pdf

4 http://www.opengeospatial.org/

5 http://www.opengeospatial.org/standards/wms

6 http://www.opengeospatial.org/standards/wfs

7 http://www.opengeospatial.org/standards/wcs

8 http://www.opengeospatial.org/projects/groups/pubsubswg

9 https://data-alliance.noaa.gov/

10 https://aws.amazon.com/public-data-sets/

11 https://cloud.google.com/noaa-big-data/

12 http://www.opengeospatial.org/standards/wps

13 https://signalvnoise.com/posts/1620-sell-your-by-products

Completely Fair Scheduler

It’s often said that concentrating on one thing is good. Many agile methods assume there is only one project going on. The principles of Lean Development¹ also imply the glory of the single project by encouraging short cycle times: deliver as fast as possible.

One project at a time is good. Among many benefits, it helps to

  • deliver fast;
  • keep queues short;
  • keep focused;
  • prevent multitasking.

But there is another side. According to Bernice Eiduson’s extensive studies², the most successful scientists tend to have many fields of interest and often shift their focus between them. It’s natural: connecting things is vital to creativity. (Actually, I believe that real creativity happens when solutions conflict and create a new reality.)

I don’t see why having multiple projects going on would be good only for scientists. Being able to see different aspects and having a wide variety of skills is good for anyone doing creative work, such as software development. As Mr. Appelo puts it³, having multiple projects:

  • keeps your days versatile;
  • prevents you from banging your head against a single problem all day long;
  • provides flexibility, as there is normally always some project waiting for something.

It’s good to remember, though, that multitasking is still bad: it destroys your brain and decreases performance4. But switching tasks a couple of times a day is not the same as checking email every minute. Actually, distractions may be the best thing happening to you5; a distraction ”may provide the break you need to disengage from a fixation on the ineffective solution”5. And even without distractions, switching tasks every now and then may help your creativity. According to Shelley Carson’s and Justin Moore’s studies, task switching slowed problem solving but increased divergent thinking6.

So how many projects should one have? Of course it’s a personal thing… but some analysis can be done. First, the feedback loop needs to be fast enough that things stay in mind and one can react to feedback from customers and other sources. Second, although it’s not good to be stuck on one problem for too long, the harder the problem, the more time you need for it. Complex problems can’t be solved in seconds.

In his book ’The Principles of Product Development Flow: Second Generation Lean Product Development’7, Donald G. Reinertsen talks about the Marines, who put all their power into the most focal point. If one is not a Marine or founding a startup, it’s impossible to have only one focal point. But there should not be more projects going on than there are ”best value” projects available; it is always better to put more effort into fewer valuable projects than to keep theatrical projects around just to show activity. And a final reminder: too many projects cause multitasking.

Let’s take a real-life example. In my job we do product development based on account managers’ orders, and we have more account managers than software developers: for every software developer there are two account managers, each placing orders. Most of the orders are quite small; on average we use about five days to complete one. Still, we always have a couple of long infrastructure projects going on. We are a DevOps team, so we have our own infrastructure development and upkeep activities running all the time.

So at any given time, for every developer there are two account managers waiting for an order to be completed and at least one task for maintaining the system. In theory, every software developer has three projects going on. What would that make a developer’s day look like (assuming she has understood that multitasking is bad and does things in chunks):

  • 2 hours for ”doing nothing in particular”
  • 1 hour for mail, keeping things in order and bureaucracy
  • 1 hour for project 1
  • 1 hour for project 2
  • 1 hour for project 3
  • 1 hour for preparing the next project and learning

Pretty fair.
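And since the post borrows its name from the Linux scheduler, here is a tongue-in-cheek sketch of a ’completely fair scheduler’ for a working day: a fixed overhead slot, with the remaining hours split evenly between the ongoing projects and one learning slot. The split loosely mirrors the list above.

```python
def schedule_day(projects, day_hours=8.0, overhead_hours=4.0):
    """Split the hours left after overhead evenly between the projects and one
    learning/preparation slot, like a very simplified completely fair scheduler."""
    slot = (day_hours - overhead_hours) / (len(projects) + 1)
    plan = {"doing nothing in particular, mail and bureaucracy": overhead_hours}
    for name in projects:
        plan[name] = slot
    plan["preparing the next project and learning"] = slot
    return plan

# Three concurrent projects per developer, as in the example above.
for task, hours in schedule_day(["project 1", "project 2", "project 3"]).items():
    print(f"{hours:.1f} h  {task}")
```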


¹ https://en.m.wikipedia.org/wiki/Lean_software_development

² https://www.psychologytoday.com/blog/imagine/200903/arts-and-crafts-keys-scientific-creativity

³ http://noop.nl/2015/09/multitasking-is-bad-multiprojecting-is-good.html

4 http://www.theguardian.com/science/2015/jan/18/modern-world-bad-for-brain-daniel-j-levitin-organized-mind-information-overload

5 http://www.shelleycarson.com/blog/when-being-distracted-is-a-good-thing

6 http://timharford.com/2015/09/multi-tasking-how-to-survive-in-the-21st-century

7 http://www.amazon.com/The-Principles-Product-Development-Flow/dp/1935401009