I was recently asked to write a section for a policy document touching on the intersection of databases and semantics, as a follow-up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches on the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.

I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to my core competence, which is hardcore tech, and leave management to those who have time for it.

When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.

In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples: my wife's grandmother, for example, decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.

To operate efficiently, whether in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching it to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think we have and to have it take root. It will if it will, and if it does not, it will take constant follow-up, like the would-be weeping willow.

Now, in more specific terms, what can we realistically expect to teach about computer science?

Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference between synchronous and asynchronous operation and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.
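
To make the third point concrete, here is a minimal micro-benchmark sketch of my own (Python; not from any particular curriculum): the same loop is timed with and without acquiring a mutex on every iteration, so the per-operation cost of synchronization shows up even when there is no contention at all.

```python
import threading
import time

N = 1_000_000
lock = threading.Lock()

def plain_loop():
    # In-process work with no waiting of any kind.
    total = 0
    for i in range(N):
        total += i
    return total

def locked_loop():
    # The same work, but a mutex is acquired and released on every step.
    total = 0
    for i in range(N):
        with lock:
            total += i
    return total

for fn in (plain_loop, locked_loop):
    start = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")
```

The gap widens further once the wait is for a disk or a network message rather than an uncontended in-process lock.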

Understanding how a database works would be immensely helpful for almost any application development task, but this is probably asking too much.

Then there is the question of engineering. Where do we put interfaces, and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.

I once tried to tell the SPARQL committee that parameterized queries and array parameters are self-evident on the database side. These are an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize them. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident to databasers. This is also an example of ignoring precept 2 above, the point about throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear by itself in due time, no worry.
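
To show what the database side takes for granted, here is a small sketch of mine using Python's sqlite3 module (just an illustration; the table is made up, and this is not anything the committee discussed in this form). The same parameterized INSERT is run once per row and then once with an array of parameter rows, so the per-call overhead of the interface is paid once instead of once per row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (id INTEGER, name TEXT)")
rows = [(i, f"name-{i}") for i in range(10_000)]

# One trip through the interface per row: the statement is parameterized,
# but the per-call overhead is paid ten thousand times.
for row in rows:
    con.execute("INSERT INTO person VALUES (?, ?)", row)

# Array parameters: the same parameterized statement, one call, many
# parameter rows; the per-call overhead is paid once.
con.executemany("INSERT INTO person VALUES (?, ?)", rows)
con.commit()
```

executemany is the SQL-side analogue of the array parameters in question; a SPARQL protocol with the same trait would let a client ship thousands of bindings across the wire in a single round trip.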

Interfaces seem to be overvalued in education. This is tricky, because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project these also where they do not apply, thereby missing some of the connections and the porosity of real interfaces.

LarKC (the EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.

Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation) and then tries to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., one pays lip service to the values of structure, information hiding, and reuse, which one is never allowed to challenge, while not disclosing the competitive edge, which is pretty much always a breach of these same principles.

I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:

The edge is created in the "Wild West": there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter to the cowboys on the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."

So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are at least not made in the university but well before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this goes against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.

But let us move to the specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general-purpose CS basics? Let us remember that in semantic technology, where we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.

  • Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter-weight, community-driven, tag-folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.

  • Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base; a toy sketch of the contrast follows this list.

    The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do, but real-world problems will be more complex and less neat.

  • Deal with ambiguity. The data on which semantic technologies will be applied will be dirty, with errors ranging from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction-free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.
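
As the toy illustration promised above for the "answer only questions that are actually asked" precept (my own sketch; the class names are invented), the subclass question below is answered on demand by walking up the hierarchy when somebody actually asks, rather than materializing the full rdfs:subClassOf closure up front.

```python
# Toy subclass hierarchy; the class names are invented for illustration.
subclass_of = {
    "Tenor": "Singer",
    "Singer": "Musician",
    "Musician": "Person",
}

def is_subclass(sub, sup):
    """Answer an rdfs:subClassOf question on demand by walking upward,
    instead of precomputing the transitive closure for every class."""
    current = sub
    while current is not None:
        if current == sup:
            return True
        current = subclass_of.get(current)
    return False

print(is_subclass("Tenor", "Person"))   # True, found by walking three steps up
print(is_subclass("Person", "Tenor"))   # False
```

Materializing the closure is the right call only when most of the implied facts will actually be asked for; otherwise it is work, storage, and maintenance for answers nobody needs.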

Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow, but the idea of core precepts is not as well formed.

So we can approach the question from the angle of needed skills rather than precepts of science. What should the certified semantician be able to do?

  • Data integration. Given heterogeneous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend them, and then map the relational data to them (a minimal mapping sketch follows this list). Once the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.

  • Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, ideally to the point of being able to tune the extraction rule sets of various NLP frameworks.

  • Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.

  • Lightweight knowledge engineering. The experience of building expert systems in the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seems all but inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of building full expert systems, will be desirable.

  • Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.
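
As the minimal sketch promised in the data integration item above (my own illustration; the table, its columns, and the ex: namespace are invented, while foaf:name is the real FOAF property), here is an ETL-style pass that turns the rows of a relational table into RDF triples against an assumed vocabulary. An on-the-fly mapping layer would do the same thing conceptually, just at query time instead of load time.

```python
import sqlite3

# A stand-in relational source; the schema and data are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT, country TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Alice", "FI"), (2, "Bob", "NL")])

EX = "http://example.org/"            # made-up namespace for this example
FOAF = "http://xmlns.com/foaf/0.1/"   # FOAF, a real and widely used vocabulary

def customer_triples(con):
    """ETL-style mapping: each relational row becomes a handful of RDF triples."""
    for cid, name, country in con.execute("SELECT id, name, country FROM customer"):
        subject = f"<{EX}customer/{cid}>"
        yield (subject, f"<{FOAF}name>", f'"{name}"')
        yield (subject, f"<{EX}country>", f'"{country}"')

for s, p, o in customer_triples(con):
    print(s, p, o, ".")   # N-Triples-like output, one triple per line
```

The hard part in practice is not emitting the triples but choosing the ontology, agreeing on identifiers for the entities, and deciding how much of this runs as batch ETL versus live mapping; the sketch only shows the mechanical core.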

Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills and must be capable of communicating effectively with different audiences and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.

Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.

The semanticists I have met fit the scholar profile more than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.

We could start by taking people who already do data integration projects and educating them in semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes onto this public is a source of bias and error.

If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever it is that makes a good IT consultant. Thus semantic technology studies must be positioned so as to attract people with this profile. As quoted before, the dream job of each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment, and for a few moments to come, this is the data geek, or maybe even the semantician profile, if we take the data geek past statistics and traditional business intelligence skills.

The semantic tech community, especially its academic branch, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will shift away from theoretical computer science towards the hands-on matters of databases, large-system performance, and the practicalities of getting data-intensive projects delivered.
