OpenLink Software
Burlington, United States
Virtuoso loads 110,500 triples-per-second on LUBM 8000
[
Orri Erling
]
LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068 million triples, on the newest Virtuoso.
The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16 GB of 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes), one partition per CPU core. Each partition had its database striped over 6 disks; the 6 disks on the system were shared between the 8 database processes.
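As a quick check of the arithmetic, the rate follows from the triple count divided by the wall-clock time. The sketch below uses the rounded figure of 1,068 million triples, so it lands slightly below the reported 110,532; the exact triple count is not given in this post.

```python
# Figures quoted in the post; the triple count is the rounded 1,068 million.
TRIPLES = 1_068_000_000
LOAD_MINUTES, LOAD_SECONDS = 161, 3


def triples_per_second(triples: int, minutes: int, seconds: int) -> float:
    """Average load rate over the whole wall-clock run."""
    elapsed = minutes * 60 + seconds  # 161m 3s = 9663 s
    return triples / elapsed


rate = triples_per_second(TRIPLES, LOAD_MINUTES, LOAD_SECONDS)
print(f"{rate:,.0f} triples per second")
```

With the rounded count this gives about 110,525 triples per second, within 0.01% of the reported rate.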
The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. Here, 100% corresponds to one CPU core or one disk being constantly busy.
The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. Entailed triples were not materialized.
In comparison, Bigdata reported 200K triples-per-second for the first 8000 LUBM universities on a 15-blade box. We expect to do about that much on one new dual Xeon board; we'll publish the result when this is done.
We think that LUBM loading is not a realistic real-world benchmark, but since other people publish such numbers, so do we.
|
06/29/2009 12:12 GMT
|
Modified:
06/29/2009 12:22 GMT
|
Linked Data Rules Simplified
[
Kingsley Uyi Idehen
]
As a complement to the most recent Linked Data Design Issues note by TimBL, I would like to add this subtle tweak to the enumerated rules:
- Identify or Name things using HTTP URIs
- Describe things using the RDF metadata model
- Increase link data mesh density on the Web by linking (referring) to things in other data spaces using their HTTP URIs.
If you perform the steps above on any HTTP network (e.g., the World Wide Web), you implicitly bind the Names/Identifiers of things to negotiable representations of the documents that bear their metadata (descriptions).
Also note, you can create and deploy the resulting RDF metadata using any of the following approaches:
- RDFa within (X)HTML documents
- N3, Turtle, TriX, RDF/XML etc. based documents
- Programmatically generated variants of the two above.
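To make the programmatic option concrete, here is a minimal sketch that follows the three rules: it names a thing with an HTTP URI, describes it with RDF (serialized as Turtle), and links it to a thing in another data space. The `example.org` subject URI and the `turtle_description` helper are hypothetical, and the function does no escaping or validation of the URIs it is given.

```python
def turtle_description(subject: str, properties: dict[str, str]) -> str:
    """Serialize one subject and its URI-valued properties as Turtle.

    Assumes every key and value is an absolute URI that needs no escaping.
    """
    pairs = [f"    <{p}> <{o}>" for p, o in properties.items()]
    return f"<{subject}>\n" + " ;\n".join(pairs) + " .\n"


doc = turtle_description(
    "http://example.org/artist/dr-dre#this",  # hypothetical HTTP URI naming the thing
    {
        # Describe the thing using the RDF model.
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type":
            "http://xmlns.com/foaf/0.1/Person",
        # Link to the same thing in another data space.
        "http://www.w3.org/2002/07/owl#sameAs":
            "http://dbpedia.org/resource/Dr._Dre",
    },
)
print(doc)
```

A real deployment would more likely use an RDF library for serialization; plain string formatting is used here only to keep the sketch self-contained.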
|
06/26/2009 10:49 GMT
|
Modified:
06/26/2009 23:18 GMT
|
BBC Linked Data Meshup In 3 Steps
[
Kingsley Uyi Idehen
]
Situation Analysis:
Dr. Dre is one of the artists in the Linked Data Space we host for the BBC. He is also referenced in music oriented data spaces such as DBpedia, MusicBrainz and Last.FM (to name a few).
Challenge:
How do I obtain a holistic view of the entity "Dr. Dre" across the BBC, MusicBrainz, and Last.FM data spaces? We know the BBC publishes Linked Data, but what about Last.FM and MusicBrainz? Both of these data spaces only expose XML or JSON data via REST APIs.
Solution:
A simple 3-step Linked Data Meshup, courtesy of Virtuoso's in-built RDFizer Middleware "the Sponger" (think ODBC Driver Manager for the Linked Data Web) and its numerous Cartridges (think ODBC Drivers for the Linked Data Web).
Steps:
- Go to Last.FM and search using pattern: Dr. Dre (you will end up with this URL: http://www.last.fm/music/Dr.+Dre)
- Go to the Virtuoso powered BBC Linked Data Space home page and enter: http://bbc.openlinksw.com/about/html/http://www.last.fm/music/Dr.+Dre
- Go to the BBC Linked Data Space home page and type full text pattern (using default tab): Dr. Dre, then view Dr. Dre's metadata via the Statistics Link.
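The URL entered in the second step is simply the Sponger proxy prefix followed by the source page URL. A minimal sketch of that composition, assuming the pattern generalizes from the single example shown (the `sponger_url` helper is hypothetical, and whether other source URLs need escaping is not stated here):

```python
# Proxy prefix taken from the URL in the second step above.
SPONGER_PROXY = "http://bbc.openlinksw.com/about/html/"


def sponger_url(source_url: str, proxy: str = SPONGER_PROXY) -> str:
    """Prefix a source page URL with the Sponger proxy path."""
    return proxy + source_url


print(sponger_url("http://www.last.fm/music/Dr.+Dre"))
```

For the Last.FM page this reproduces the URL given in the second step.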
What Happened?
The following took place:
- Virtuoso Sponger sent an HTTP GET to Last.FM
- It distilled the "Artist" entity "Dr. Dre" from the page and made a Linked Data graph
- Inverse Functional Property and sameAs reasoning handled the Meshup (augmented graph from a conjunctive query processing pipeline)
- Links for "Dr. Dre" were established across BBC (sameAs) and Last.FM (seeAlso) via the DBpedia URI.
The new enhanced URI for Dr. Dre now provides a rich holistic view of the aforementioned "Artist" entity. This URI is usable anywhere on the Web for Linked Data Conduction :-)
|
06/12/2009 14:09 GMT
|
Modified:
06/12/2009 16:38 GMT
|
Understanding the BBC's Virtuoso Powered Linked Data Space
[
Kingsley Uyi Idehen
]
The BBC's recently announced Linked Data space for Programmes and Music data joins a growing list of immediately useful "Virtuoso Powered" linked data spaces driving the burgeoning Web of Linked Data. Others include DBpedia, Bio2RDF, NeuroCommons, etc. (the click-friendly version of the LOD-Cloud diagram reveals a snapshot of other Virtuoso driven linked data spaces).
Why is it important?
As a leading media organization, the BBC's use of Linked Data provides a clear beacon to other media players re. the imminence of a serious Linked Data induced sector inflection. In a nutshell, every Web Site has to evolve into a Linked Data Space: a location on the Web that provides granular access to discrete data items in line with the core principles of the Linked Data meme.
Remember, the essence of the Linked Data meme is simply this: you reference data items and access their metadata, in a variety of formats, via a single HTTP based URI. This approach to Web data publishing is compatible with any HTTP aware user agent (e.g., your Web Browser or tools & applications that provide abstracted access to HTTP).
How Do I use it?
There are a number of very powerful things available to end-users and developers alike.
End-Users:
The most powerful feature of our variant of the BBC's Linked Data Space is the exposure of Faceted Find (think Search++ and beyond). Thus, you can go to the home page of the service and commence data discovery and exploration via any of the following interfaces:
- Full Text Search Tab -- type in a full text pattern and then experience Linked Data Entity Ranking as opposed to Page Ranking
- URI Lookup (By Label) Tab -- type in part of a URI and let the system auto-complete by looking up Entity Labels
- URI Lookup (Raw String Pattern) Tab -- type in part of a URI and let the system auto-complete by looking up the raw URI
- OpenLink Data Explorer Service -- "deceptively simple" Linked Data explorer and Data Mesher (simply type in a URI or Text pattern, then view the data via a myriad of entity type specific viewer tabs).
Once you are comfortable with at least one of the items above, you can exploit the system further by performing any of the following:
Information Architects & Developers
Disambiguated Search (aka. Search++ or Find)
In line with the time-tested "embrace and extend" pattern, we provide Full Text search capability, but unlike Google, Yahoo!, Bing, and other search engines, we don't use a "Page Rank" algorithm to sort results. Instead, we use an "Entity Rank" algorithm, since we are dealing with an RDF based Graph model DBMS where links exist between entities across instance data and data dictionary (vocabularies, schemas, ontologies) boundaries. In addition, when you get results (by clicking "show values" or "show values with distinct counts") that list entities associated with a full text search pattern, you can go beyond what search engines offer by using "Entity Type" and/or "Entity Properties" (all of these have HTTP URIs too) to set your own context for what you seek.
Much more to come in the form of BBC specific demo queries and tutorials :-)
Related
- Live LOD Cloud Cache instance that combines BBC data with other data sets from the LOD Cloud (in a single Virtuoso RDF DBMS hosting 5 Billion+ triples & counting)
|
06/11/2009 17:59 GMT
|
Modified:
06/26/2009 23:15 GMT
|
Comparing Virtuoso Performance on Different Processors
[
Orri Erling
]
Over the years we have run Virtuoso on different hardware. Here we give a few figures that help identify the best price point for machines running Virtuoso.
Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.
The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single-thread number; the 20-client run gives the multi-threaded number. The test uses about 2-3 GB of data, so everything is in RAM, but the data is large enough not to fit entirely in processor cache.
All times reported are real times, starting from the start of the first client and ending with the completion of the last client.
Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.
| Test | Platform | Load (seconds) | Run (seconds) | GHz / cores / threads per core |
| 1 | Amazon EC2 Extra Large (4 virtual cores) | 340 | 42 | 1.2 GHz? / 4 / 1 |
| 1 | Amazon EC2 Extra Large (4 virtual cores) | 305 | 43.3 | 1.2 GHz? / 4 / 1 |
| 2 | 1 x dual-core AMD 5900 | 263 | 58.2 | 2.9 GHz / 2 / 1 |
| 3 | 2 x dual-core Xeon 5130 ("Woodcrest") | 245 | 35.7 | 2.0 GHz / 4 / 1 |
| 4 | 2 x quad-core Xeon 5410 ("Harpertown") | 237 | 18.0 | 2.33 GHz / 8 / 1 |
| 5 | 2 x quad-core Xeon 5520 ("Nehalem") | 162 | 18.3 | 2.26 GHz / 8 / 2 |
We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8 GB. The 3 Xeon configurations are Supermicro boards with 667 MHz memory for the Xeon 5130 ("Woodcrest") and Xeon 5410 ("Harpertown"), and 800 MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 ("Nehalem"), 72 GB RAM, and 8 x 500 GB SATA disks.
Caveat: Due to slow memory (we could not get faster memory within the available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., the memory subsystem. We will revisit this with faster memory.
The operating systems were various 64-bit Linux distributions.
We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.
We tried some RDF operations on the last two systems:
| Operation | Harpertown | Nehalem |
| Build text index for DBpedia | 1080s | 770s |
| Entity Rank iteration | 263s | 251s |
Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.
For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.
Then we took a more memory-intensive piece that read the RDF quads table in the order of one index and, for each row, checked that an equal row exists on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.
On the host OS of the Nehalem system:
| n | cpu% | rows per second |
| 1 query | 503 | 906,413 |
| 2 queries | 1263 | 1,578,585 |
| 3 queries | 1204 | 1,566,849 |
In a VM under Xen, on the Nehalem system:
| n | cpu% | rows per second |
| 1 query | 652 | 799,293 |
| 2 queries | 1266 | 1,486,710 |
| 3 queries | 1222 | 1,484,093 |
On the host OS of the Harpertown system:
| n | cpu% | rows per second |
| 1 query | 648 | 1,041,448 |
| 2 queries | 708 | 1,124,866 |
The CPU percentages are as reported by the OS: user + system CPU divided by real time.
So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.
We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, including tasks with no I/O. The VM was set up to have all of the CPU for the test, and the queries did not do disk I/O.
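The multithreading and Xen figures can be read straight off the cross-partition join tables. The sketch below only restates the published rows-per-second numbers and derives two ratios from them; the function names are ours, not part of any Virtuoso tooling.

```python
# Rows per second from the three join tables above, keyed by concurrent query count.
ROWS_PER_SECOND = {
    "Nehalem host": {1: 906_413, 2: 1_578_585, 3: 1_566_849},
    "Nehalem Xen VM": {1: 799_293, 2: 1_486_710, 3: 1_484_093},
    "Harpertown host": {1: 1_041_448, 2: 1_124_866},
}


def scaling(system: str, queries: int) -> float:
    """Throughput with `queries` concurrent queries relative to one query."""
    rates = ROWS_PER_SECOND[system]
    return rates[queries] / rates[1]


def xen_overhead(queries: int) -> float:
    """Fraction of host throughput lost when running in the Xen VM."""
    host = ROWS_PER_SECOND["Nehalem host"][queries]
    guest = ROWS_PER_SECOND["Nehalem Xen VM"][queries]
    return 1.0 - guest / host


for name in ROWS_PER_SECOND:
    print(f"{name}: 2 queries = {scaling(name, 2):.2f}x of 1 query")
for n in (1, 2, 3):
    print(f"Xen overhead at {n} queries: {xen_overhead(n):.1%}")
```

On these numbers, going from one to two concurrent queries gives roughly 1.74x on the Nehalem host against roughly 1.08x on Harpertown, and the Xen penalty for this join is about 12% with one query and under 6% with two or three.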
The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load.
The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got had 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it comes with 1333 MHz memory; otherwise the best case will be no more than 50% faster than a 667 MHz Core 2-based Xeon.
Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.
If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.
|
05/28/2009 10:54 GMT
|
Modified:
05/28/2009 11:15 GMT
|
Library of Congress & Reasonable Linked Data
[
Kingsley Uyi Idehen
]
While exploring the Subject Headings Linked Data Space (LCSH) recently unveiled by the Library of Congress, I noticed that the URI for the subject heading: World Wide Web, exposes an "owl:sameAs" link to resource URI: "info:lc/authorities/sh95000541" -- in fact, a URI.URN that isn't HTTP protocol scheme based.
The observations above triggered a discussion thread on Twitter that involved @edsu, @iand, and moi. Naturally, it morphed into a live demonstration of human vs. machine interpretation of claims expressed in the RDF graph.
What makes this whole thing interesting?
It showcases (in Man vs Machine style) the issue of unambiguously discerning the meaning of the owl:sameAs claim expressed in the LCSH Linked Data Space.
Perspectives & Potential Confusion
From the Linked Data perspective, it may spook a few people to see owl:sameAs values such as: "info:lc/authorities/sh95000541", that cannot be de-referenced using HTTP.
It may confuse a few people or user agents that see URI de-referencing as not necessarily HTTP specific, thereby attempting to de-reference the URI.URN on the assumption that it's associated with a "handle system", for instance.
It may even confuse RDFizer / RDFization middleware that use owl:sameAs as a data provider attribution mechanism via hint/nudge URI values derived from original content / data URI.URLs that de-reference to nothing e.g., an original resource URI.URL plus "#this" which produces URI.URN-URL -- think of this pattern as "owl:shameAs" in a sense :-)
Unambiguously Discerning Meaning
Simply bring OWL reasoning (inference rules and reasoners) into the mix, thereby negating human dialogue about interpretation which ultimately unveils a mesh of orthogonal view points. Remember, OWL is all about infrastructure that ultimately enables you to express yourself clearly i.e., say what you mean, and mean what you say.
Path to Clarity (using Virtuoso, its in-built Sponger Middleware, and Inference Engine):
- GET the data into the Virtuoso Quad store -- what the sponger does via its URIBurner Service (while following designated predicates such as owl:sameAs in case they point to other mesh-able data sources)
- Query the data in Quad Store with "owl:sameAs" inference rules enabled
- Repeat the last step with the inference rules excluded.
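The last two steps can be sketched as a pair of query strings that differ only in whether owl:sameAs inference is switched on. This is an illustration, not the queries used for this post: it assumes Virtuoso's `DEFINE input:same-as "yes"` SPARQL pragma as the way to enable the inference, and the graph IRI and the `describe_query` helper are hypothetical.

```python
# Assumed Virtuoso pragma for enabling owl:sameAs inference in a query.
SAME_AS_PRAGMA = 'DEFINE input:same-as "yes"'


def describe_query(uri: str, graph: str, same_as: bool) -> str:
    """Build a SPARQL query listing the properties of `uri` in `graph`."""
    prefix = SAME_AS_PRAGMA + "\n" if same_as else ""
    return (
        f"{prefix}SELECT ?p ?o\n"
        f"FROM <{graph}>\n"
        f"WHERE {{ <{uri}> ?p ?o }}"
    )


URN = "info:lc/authorities/sh95000541"
GRAPH = "http://id.loc.gov/authorities/sh95000541"  # hypothetical graph IRI

with_inference = describe_query(URN, GRAPH, same_as=True)
without_inference = describe_query(URN, GRAPH, same_as=False)
print(with_inference)
print(without_inference)
```

Running both against the same graph and comparing the result sets is what shows whether the reasoner treats the two names as one entity.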
Actual SPARQL Queries:
Observations:
The SPARQL queries against the Graph generated and automatically populated by the Sponger reveal, without human intervention, that "info:lc/authorities/sh95000541" is just an alternative name for <http://id.loc.gov/authorities/sh95000541#concept>, and that the graph produced by LCSH is self-describing enough for an OWL reasoner to figure this all out courtesy of the owl:sameAs property :-).
Hopefully, this post also provides a simple example of how OWL facilitates "Reasonable Linked Data".
|
05/05/2009 13:53 GMT
|
Modified:
05/06/2009 14:26 GMT
|
Social Web Camp (#5 of 5)
[
Orri Erling
]
(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.
In my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks; one can have a lot of breadth-of-scope, but this too depends on not getting sidetracked by constant information overload.
Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus, and this cost will have to be managed somehow, for example by automation or outsourcing. But if social media consists only of automated fronts tweeting and retweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.
There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don't get lost in it.
There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.
I participated in discussions on security and privacy and on mobile social networks and context.
For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.
There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic's Data Patrol should be a part of the social web infrastructure of the future.
People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we look at the history of secrets at all, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.
In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.
For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one's location at the granularity of the city; for some other purposes, one would say which conference room one is in.
Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.
There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.
|
04/30/2009 12:14 GMT
|
Modified:
04/30/2009 12:51 GMT
|
I participated in discussions on security and privacy and on mobile social networks and context.
For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.
There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic's Data Patrol should be a part of the social web infrastructure of the future.
People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.
In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.
For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one's location at the granularity of the city; for some other purposes, one would say which conference room one is in.
Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.
There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.
|
04/30/2009 12:14 GMT
|
Modified:
04/30/2009 12:51 GMT
|
|
|