<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/dav/dav-blog-1/">
  <rss:title>OpenLink Community Blog</rss:title>
  <rss:link>http://www.openlinksw.com/weblog/dav/dav-blog-1/</rss:link>
  <rss:description>A Collection of blogs by OpenLink Staff</rss:description>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">kidehen@openlinksw.com</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-11-23T11:40:37Z</dc:date>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-11-11#1588" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-11-11#1587" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-10-27#1586" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-10-27#1585" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1573" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-09-01#1572" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-08-19#1571" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-08-19#1570" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-08-14#1569" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-08-14#1568" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-06-29#1563" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-06-29#1562" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-05-28#1558" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-05-28#1557" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1552" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1550" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-27#1545" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-04-27#1544" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-01#1541" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-04-01#1540" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-03-05#1529" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-03-05#1528" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-02-16#1527" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-02-16#1526" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-01-09#1516" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-01-09#1515" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-01-02#1511" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-01-02#1510" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-18#1507" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-18#1506" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1505" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1503" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1502" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-16#1499" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-11#1495" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-11#1494" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-27#1488" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-27#1487" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-20#1485" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1481" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1477" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1476" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-03#1473" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-03#1471" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1467" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1466" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1464" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-24#1460" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-24#1459" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1451" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1450" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1449" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-30#1446" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-09-30#1445" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1436" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1435" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-09-08#1434" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-09-08#1433" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-05#1432" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-09-05#1431" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-08-27#1423" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-08-27#1422" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-08-25#1419" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-08-25#1418" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-08-06#1410" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-07-30#1401" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-07-30#1400" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-07-17#1393" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-07-17#1392" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1381" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1380" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1379" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1376" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1375" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1374" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-05-30#1369" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-05-30#1368" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-05-09#1359" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-05-09#1358" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-04-30#1354" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-04-30#1353" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1350" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1349" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1348" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-04-29#1347" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-04-29#1346" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-04-29#1345" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-04-14#1340" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-04-14#1339" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-04-14#1338" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-04-14#1337" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-04-14#1336" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-04-14#1335" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-03-25#1327" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-03-25#1326" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-03-06#1322" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-03-06#1321" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-02-04#1309" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-02-04#1308" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-02-01#1305" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1304" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1302" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-01-16#1297" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-01-16#1296" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-12-18#1287" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-12-18#1286" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-12-07#1285" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-12-06#1284" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-11-21#1273" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-11-21#1271" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-11-20#1270" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-11-08#1268" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-09-06#1251" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-09-06#1250" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-08-28#1248" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-08-28#1246" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-08-27#1245" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-08-27#1244" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-07-19#1230" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-07-19#1229" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-07-12#1226" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-07-12#1225" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-05-23#1198" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-05-23#1199" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-05-23#1201" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-05-23#1197" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-05-23#1196" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-05-23#1195" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-05-23#1194" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-04-12#1184" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-04-12#1183" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-03-16#1159" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-02-05#1131" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-02-05#1132" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-01-10#1116" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-01-10#1117" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-01-09#1113" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-01-09#1110" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-01-09#1109" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-12-22#1108" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-11-01#1075" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-11-01#1074" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-09-25#1112" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-09-19#1111" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-08-10#1025" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-08-10#1024" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-07-31#1022" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-07-31#1021" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-07-17#1008" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-07-17#1007" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-07-13#1003" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-07-13#1002" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-07-13#1001" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-07-13#1000" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-07-11#999" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-07-11#998" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-04-27#963" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-04-27#964" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-04-24#962" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-04-24#961" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-04-17#958" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-04-11#950" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-04-11#949" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-06-09#266" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-11-11#1588">
  <rss:title>RDF Geography With Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-11-11T17:17:27Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have just added a geometry data type and corresponding R-tree index to Virtuoso. This follows the general scheme of SQL/MM, as is implemented by PostGIS and many others. We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins. We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes. The geometry support is for both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed-literal of the virtrdf:Geometry type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL MM predicates and functions can be used with SPARQL, like this: PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; SELECT ?class COUNT (*) WHERE { ?m geo:geometry ?geo . ?m a ?class . FILTER ( &lt;bif:st_intersects&gt; ( ?geo, &lt;bif:st_point&gt; (0, 52), 100 ) ) } GROUP BY ?class ORDER BY DESC 2 This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London. For any data set with WGS 84 geo:long and geo:lat values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry property of the subject with the long/lat. This then enables fast spatial access to arbitrary location data in RDF. Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities. As these get adopted we will support them. For scalability, we tried the implementation with OpenStreetMap&#39;s 350 million or so points. 
The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object&#39;s key rather than by coordinate range. This way, the items are evenly spread even though the coordinate distribution is highly uneven. We can do spatial joins like this: SELECT ?s ( &lt;sql:num_or_null&gt; (?p) ) COUNT (*) WHERE { ?s &lt;http://dbpedia.org/ontology/populationTotal&gt; ?p . FILTER ( &lt;sql:num_or_null&gt; (?p) &gt; 1000000 ) . ?s geo:geometry ?geo . FILTER ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) ) . ?xx geo:geometry ?pt } GROUP BY ?s ( &lt;sql:num_or_null&gt; (?p) ) ORDER BY DESC 3 LIMIT 20 This takes the DBpedia subjects that have a population over 1 million and a geometry. We then count all the geometries within 5 km of the point location of the first geometry. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result: http://dbpedia.org/resource/Munich 1356594 117280 http://dbpedia.org/resource/London 7355400 81486 http://dbpedia.org/resource/Davao_City 1363337 58640 http://dbpedia.org/resource/Belo_Horizonte 2412937 58640 http://dbpedia.org/resource/Chengde 3610000 58640 http://dbpedia.org/resource/Hamburg 1769117 51664 http://dbpedia.org/resource/San_Diego%2C_California 1266731 47685 http://dbpedia.org/resource/Bursa 1562828 47685 http://dbpedia.org/resource/Port-au-Prince 1082800 47685 http://dbpedia.org/resource/Oakland_County%2C_Michigan 1194156 45636 http://dbpedia.org/resource/Sana%27a 1747627 40923 http://dbpedia.org/resource/Milan 1303437 40923 http://dbpedia.org/resource/Campinas 1059420 40923 http://dbpedia.org/resource/Hohhot 2580000 40923 http://dbpedia.org/resource/Brussels 1031215 40923 http://dbpedia.org/resource/Bogra_District 2988567 40923 http://dbpedia.org/resource/Cort%C3%A9s_Department 1202510 40923 http://dbpedia.org/resource/Berlin 3416300 35668 
http://dbpedia.org/resource/New_York_City 8274527 30810 http://dbpedia.org/resource/Los_Angeles%2C_California 3849378 25614 20 Rows. -- 1733 msec. Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s 664% cpu 2% read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm cache. Fair enough for a first crack; this can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous. We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online we will show more demo queries. For more about SQL/MM, you can look to a couple of PDFs: SQL/MM Spatial: The Standard to Manage Spatial Data in Relational Database Systems by Knut Stolze SQL Multimedia and Application Packages (SQL/MM) by Jim Melton and Andrew Eisenberg</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have just added a geometry <a href="http://dbpedia.org/resource/Data" id="link-id0x22a7df78">data</a> type and corresponding <a href="http://dbpedia.org/resource/R-tree" id="link-id0x132dfbe0">R</a>-tree index to <a href="http://virtuoso.openlinksw.com" id="link-id0x1e41e1b0">Virtuoso</a>.  This follows the general scheme of <a href="http://dbpedia.org/resource/SQL" id="link-id0x14568960">SQL</a>/MM, as is implemented by <a href="http://dbpedia.org/resource/PostGIS" id="link-id0x141653b0">PostGIS</a> and many others.  We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins.  We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes.</p>

<p>The geometry support is for both SQL and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x11563f40">SPARQL</a>.  On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x2209a3a8">RDF</a>, a geometry can occur as the object of a quad.  If the object is a typed literal of the <code>virtrdf:Geometry</code> type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed.  After this, SQL/MM predicates and functions can be used with SPARQL, like this:</p>

<blockquote>
 <pre><code>  PREFIX  geo:  &lt;<a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x21051a00">http</a>://www.w3.org/2003/01/geo/wgs84_pos#&gt;  
  SELECT  ?class
          COUNT (*) 
   WHERE  { ?m  geo:geometry  ?geo    . 
            ?m  a             ?class  . 
                FILTER ( &lt;bif:st_intersects&gt; 
                          ( ?geo, 
                            &lt;bif:st_point&gt; (0, 52), 
                            100
                          )
                       )
          } 
GROUP BY  ?class 
ORDER BY  DESC 2 </code>
 </pre></blockquote>


<p>This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.</p>
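
<p>As a hedged variation on the query above (a sketch, not part of the original test run), the same filter can list the individual subjects instead of counting them per class.  It uses only the <code>bif:st_point</code> and <code>bif:st_intersects</code> functions already shown; the 10 km radius and the <code>LIMIT</code> are arbitrary illustrative values.  Since (0, 52) is a point near London, the first argument of <code>bif:st_point</code> is the longitude and the second is the latitude.</p>

<blockquote>
 <pre><code>  # Illustrative sketch: list subjects near (0, 52) rather than counting per class.
  # The 10 km radius and the LIMIT are assumptions chosen for the example.
  PREFIX  geo:  &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
  SELECT  ?m ?class
   WHERE  { ?m  geo:geometry  ?geo    .
            ?m  a             ?class  .
                FILTER ( &lt;bif:st_intersects&gt;
                          ( ?geo,
                            &lt;bif:st_point&gt; (0, 52),
                            10
                          )
                       )
          }
   LIMIT  20 </code>
 </pre></blockquote>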

<p>For any data set with <a href="http://dbpedia.org/resource/World_Geodetic_System" id="link-id0x20050048">WGS 84</a> <code>geo:long</code> and <code>geo:lat</code> values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the <code>geo:geometry</code> property of the subject with the long/lat.  This then enables fast spatial access to arbitrary location data in RDF.</p>
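
<p>To see which subjects such a function would still have to process, a plain SPARQL query along the following lines could list those that carry coordinates but no <code>geo:geometry</code> yet.  This is only an illustrative sketch: it is not the SQL function itself, and it assumes the coordinates are stored under the <code>geo:</code> namespace used in the query above.</p>

<blockquote>
 <pre><code>  # Illustrative sketch: subjects with geo:long / geo:lat but no geo:geometry.
  # Standard SPARQL only; the LIMIT is an arbitrary value for the example.
  PREFIX  geo:  &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
  SELECT  ?s ?long ?lat
   WHERE  { ?s  geo:long  ?long  .
            ?s  geo:lat   ?lat   .
            OPTIONAL { ?s  geo:geometry  ?geo }
            FILTER ( !bound (?geo) )
          }
   LIMIT  20 </code>
 </pre></blockquote>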

<p>Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities.  As these get adopted we will support them.</p>

<p>For scalability, we tried the implementation with <a href="http://www.openstreetmap.org/" id="link-id0x100e0188">OpenStreetMap</a>&#39;s 350 million or so points.  The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object&#39;s key rather than by coordinate range.  This way, the items are evenly spread even though the coordinate distribution is highly uneven.</p>

<p>We can do spatial joins like this:</p>

<blockquote>
 <pre><code>   SELECT  ?s 
           ( &lt;sql:num_or_null&gt; (?p) )  
           COUNT (*) 
    WHERE  { ?s   &lt;http://<a href="http://dbpedia.org/resource/DBpedia" id="link-id0x25465c68">dbpedia</a>.org/ontology/populationTotal&gt;  ?p    . 
             FILTER 
               ( &lt;sql:num_or_null&gt; (?p) &gt; 1000000 )                      . 
             ?s   geo:geometry                                   ?geo  .
             FILTER 
               ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) )                  . 
             ?xx  geo:geometry                                   ?pt 
           } 
 GROUP BY  ?s 
           ( &lt;sql:num_or_null&gt; (?p) )
 ORDER BY  DESC 3 
    LIMIT  20 </code> </pre></blockquote>

<p>This takes the DBpedia subjects that have a population over 1 million and a geometry.  We then count all the geometries within 5 km of the point location of the first geometry.  With DBpedia (about 5 million points), <a href="http://www.geonames.org/" id="link-id0x230328d0">GeoNames</a> (7 million points), and OpenStreetMap (350 million points), we get the result:</p>

<blockquote>
 <pre><code>http://dbpedia.org/resource/Munich                        1356594    117280
http://dbpedia.org/resource/London                        7355400     81486
http://dbpedia.org/resource/Davao_City                    1363337     58640
http://dbpedia.org/resource/Belo_Horizonte                2412937     58640
http://dbpedia.org/resource/Chengde                       3610000     58640
http://dbpedia.org/resource/Hamburg                       1769117     51664
http://dbpedia.org/resource/San_Diego%2C_California       1266731     47685
http://dbpedia.org/resource/Bursa                         1562828     47685
http://dbpedia.org/resource/Port-au-Prince                1082800     47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan    1194156     45636
http://dbpedia.org/resource/Sana%27a                      1747627     40923
http://dbpedia.org/resource/Milan                         1303437     40923
http://dbpedia.org/resource/Campinas                      1059420     40923
http://dbpedia.org/resource/Hohhot                        2580000     40923
http://dbpedia.org/resource/Brussels                      1031215     40923
http://dbpedia.org/resource/Bogra_District                2988567     40923
http://dbpedia.org/resource/Cort%C3%A9s_Department        1202510     40923
http://dbpedia.org/resource/Berlin                        3416300     35668
http://dbpedia.org/resource/New_York_City                 8274527     30810
http://dbpedia.org/resource/Los_Angeles%2C_California     3849378     25614<br />
20 Rows. -- 1733 msec.<br />
Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s  664% <a href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x139250c8">cpu</a> 2%  read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs
</code></pre></blockquote>

<p>This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm <a href="http://dbpedia.org/resource/Cache" id="link-id0x216d1070">cache</a>.  Fair enough for a first crack; this can obviously be optimized further.  Still, the geo part of the processing is already as good as instantaneous.</p>
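
<p>For experimenting with a smaller piece of the same join, the pattern can be narrowed to a single subject.  The sketch below is an assumption-laden illustration rather than one of the measured queries: it fixes the subject to the DBpedia London resource that appears in the result list and counts the geometries within 5 km of it.  It has not been run against the data sets above, so no result is shown.</p>

<blockquote>
 <pre><code>  # Illustrative sketch: count geometries within 5 km of one fixed subject.
  # The choice of London is arbitrary; any subject with a geo:geometry would do.
  PREFIX  geo:  &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
  SELECT  COUNT (*)
   WHERE  { &lt;http://dbpedia.org/resource/London&gt;  geo:geometry  ?geo  .
            ?xx                                   geo:geometry  ?pt   .
            FILTER
              ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) )
          } </code>
 </pre></blockquote>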

<p>We will shortly have the geography features installed on DBpedia and the other data sets we host.  As these come online we will show more demo queries.</p>

<p>For more about SQL/MM, you can look to a couple of PDFs:</p>
<ul>
<li>
<a href="http://www.fer.hr/_download/repository/SQLMM_Spatial-_The_Standard_to_Manage_Spatial_Data_in_Relational_Database_Systems.pdf" id="link-id133775f0">SQL/MM Spatial: The Standard to Manage Spatial Data in
Relational Database Systems</a> by Knut Stolze</li>
<li>
  <a href="http://www.sigmod.org/record/issues/0112/standards.pdf" id="link-id1433c5e0">SQL Multimedia and Application Packages (SQL/MM)</a> by Jim Melton and Andrew Eisenberg</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-11-11#1587">
  <rss:title>RDF Geography With Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-11-11T17:17:27Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have just added a geometry data type and corresponding R-tree index to Virtuoso. This follows the general scheme of SQL/MM, as is implemented by PostGIS and many others. We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins. We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes. The geometry support is for both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed-literal of the virtrdf:Geometry type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL MM predicates and functions can be used with SPARQL, like this: PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; SELECT ?class COUNT (*) WHERE { ?m geo:geometry ?geo . ?m a ?class . FILTER ( &lt;bif:st_intersects&gt; ( ?geo, &lt;bif:st_point&gt; (0, 52), 100 ) ) } GROUP BY ?class ORDER BY DESC 2 This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London. For any data set with WGS 84 geo:long and geo:lat values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry property of the subject with the long/lat. This then enables fast spatial access to arbitrary location data in RDF. Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities. As these get adopted we will support them. For scalability, we tried the implementation with OpenStreetMap&#39;s 350 million or so points. 
The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object&#39;s key rather than by coordinate range. This way, the items are evenly spread even though the coordinate distribution is highly uneven. We can do spatial joins like this: SELECT ?s ( &lt;sql:num_or_null&gt; (?p) ) COUNT (*) WHERE { ?s &lt;http://dbpedia.org/ontology/populationTotal&gt; ?p . FILTER ( &lt;sql:num_or_null&gt; (?p) &gt; 1000000 ) . ?s geo:geometry ?geo . FILTER ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) ) . ?xx geo:geometry ?pt } GROUP BY ?s ( &lt;sql:num_or_null&gt; (?p) ) ORDER BY DESC 3 LIMIT 20 This takes the DBpedia subjects that have a population over 1 million and a geometry. We then count all the geometries within 5 km of the point location of the first geometry. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result: http://dbpedia.org/resource/Munich 1356594 117280 http://dbpedia.org/resource/London 7355400 81486 http://dbpedia.org/resource/Davao_City 1363337 58640 http://dbpedia.org/resource/Belo_Horizonte 2412937 58640 http://dbpedia.org/resource/Chengde 3610000 58640 http://dbpedia.org/resource/Hamburg 1769117 51664 http://dbpedia.org/resource/San_Diego%2C_California 1266731 47685 http://dbpedia.org/resource/Bursa 1562828 47685 http://dbpedia.org/resource/Port-au-Prince 1082800 47685 http://dbpedia.org/resource/Oakland_County%2C_Michigan 1194156 45636 http://dbpedia.org/resource/Sana%27a 1747627 40923 http://dbpedia.org/resource/Milan 1303437 40923 http://dbpedia.org/resource/Campinas 1059420 40923 http://dbpedia.org/resource/Hohhot 2580000 40923 http://dbpedia.org/resource/Brussels 1031215 40923 http://dbpedia.org/resource/Bogra_District 2988567 40923 http://dbpedia.org/resource/Cort%C3%A9s_Department 1202510 40923 http://dbpedia.org/resource/Berlin 3416300 35668 
http://dbpedia.org/resource/New_York_City 8274527 30810 http://dbpedia.org/resource/Los_Angeles%2C_California 3849378 25614 20 Rows. -- 1733 msec. Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s 664% cpu 2% read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm cache. Fair enough for a first crack, this can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous. We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online we will show more demo queries. For more about SQL/MM, you can look to a couple of PDFs: SQL/MM Spatial: The Standard to Manage Spatial Data in Relational Database Systems by Knut Stolze SQL Multimedia and Application Packages (SQL/MM) by Jim Melton and Andrew Eisenberg</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have just added a geometry <a href="http://dbpedia.org/resource/Data" id="link-id0x1c4085f8">data</a> type and corresponding <a href="http://dbpedia.org/resource/R-tree" id="link-id0x1c2ea830">R</a>-tree index to <a href="http://virtuoso.openlinksw.com" id="link-id0x201556b0">Virtuoso</a>.  This follows the general scheme of <a href="http://dbpedia.org/resource/SQL" id="link-id0x20152fc0">SQL</a>/MM, as is implemented by <a href="http://dbpedia.org/resource/PostGIS" id="link-id0x1c1a7610">PostGIS</a> and many others.  We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins.  We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes.</p>

<p>The geometry support is for both SQL and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1c0fe6f8">SPARQL</a>.  On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x20d637e8">RDF</a>, a geometry can occur as the object of a quad.  If the object is a typed-literal of the <code>virtrdf:Geometry</code> type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed.  After this, SQL/MM predicates and functions can be used with SPARQL, like this:</p>

<blockquote>
 <pre><code>  PREFIX  geo:  &lt;<a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1c4f4b50">http</a>://www.w3.org/2003/01/geo/wgs84_pos#&gt;  
  SELECT  ?class
          COUNT (*) 
   WHERE  { ?m  geo:geometry  ?geo    . 
            ?m  a             ?class  . 
                FILTER ( &lt;bif:st_intersects&gt; 
                          ( ?geo, 
                            &lt;bif:st_point&gt; (0, 52), 
                            100
                          )
                       )
          } 
GROUP BY  ?class 
ORDER BY  DESC 2 </code>
 </pre></blockquote>


<p>This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.</p>
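
<p>To make the radius test concrete, here is a minimal Python sketch of a "within N km" check between two WGS 84 points.  It is only an illustration: the haversine formula, the mean Earth radius, and the names <code>haversine_km</code> and <code>within_km</code> are assumptions of the sketch, not a description of how Virtuoso evaluates <code>st_intersects</code>.  It reads (0, 52) as longitude 0, latitude 52, which is what "a point near London" implies.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two WGS 84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(point_a, point_b, radius_km):
    """True when two (long, lat) points are no more than radius_km apart."""
    return radius_km >= haversine_km(*point_a, *point_b)


center = (0.0, 52.0)         # the query point, as (long, lat)
london = (-0.1276, 51.5074)  # approximate coordinates, for illustration only
paris = (2.3522, 48.8566)    # approximate coordinates, for illustration only

print(within_km(london, center, 100))  # London is inside the 100 km radius
print(within_km(paris, center, 100))   # Paris is well outside it
```

<p>A linear scan with such a test would be far too slow over hundreds of millions of points, which is why the R-tree index matters: it narrows the candidates before any exact distance is computed.</p>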

<p>For any data set with <a href="http://dbpedia.org/resource/World_Geodetic_System" id="link-id0x1fa3a1d0">WGS 84</a> <code>geo:long</code> and <code>geo:lat</code> values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the <code>geo:geometry</code> property of the subject with the long/lat.  This then enables fast spatial access to arbitrary location data in RDF.</p>
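
<p>The idea of deriving a point geometry from each long/lat pair can be sketched in a few lines of Python.  This is not the SQL function mentioned above; the function name <code>add_point_geometries</code>, the <code>example.org</code> subjects, and the WKT-style <code>POINT(long lat)</code> text form are assumptions made purely for illustration.</p>

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def add_point_geometries(triples):
    """Return (s, geo:geometry, point) triples for subjects having both long and lat."""
    longs, lats = {}, {}
    for s, p, o in triples:
        if p == GEO + "long":
            longs[s] = float(o)
        elif p == GEO + "lat":
            lats[s] = float(o)
    # Only subjects with a complete coordinate pair get a geometry.
    return [
        (s, GEO + "geometry", f"POINT({longs[s]} {lats[s]})")
        for s in sorted(longs.keys() & lats.keys())
    ]


triples = [
    ("http://example.org/a", GEO + "long", "0.0"),
    ("http://example.org/a", GEO + "lat", "52.0"),
    ("http://example.org/b", GEO + "long", "13.4"),  # no geo:lat, so it is skipped
]

for triple in add_point_geometries(triples):
    print(triple)
```

<p>Subjects with only one of the two coordinates are left alone, so the step can be rerun safely as more data is loaded.</p>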

<p>Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities.  As these get adopted we will support them.</p>

<p>For scalability, we tried the implementation with <a href="http://www.openstreetmap.org/" id="link-id0x1c4207d0">OpenStreetMap</a>&#39;s 350 million or so points.  The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object&#39;s key, thus not by range of coordinates or such.  Like this, the items are evenly spread even though the coordinate distribution is highly uneven.</p>
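
<p>The effect of partitioning by key rather than by coordinate range can be shown with a toy Python example.  The eight servers, the modulo function standing in for the real partitioning hash, and the 45-degree longitude bands are all hypothetical; the post does not describe the actual partitioning function beyond its use of the geometry object's key.</p>

```python
from collections import Counter

N_SERVERS = 8  # hypothetical cluster size for the illustration


def by_key(geo_id):
    """Partition on the geometry's key; modulo stands in for the real hash."""
    return geo_id % N_SERVERS


def by_longitude_band(lon):
    """Naive alternative: one server per 45-degree band of longitude."""
    return int((lon + 180.0) // 45.0) % N_SERVERS


# 10,000 synthetic points, all crowded into a 0.1-degree strip of longitude.
points = [(i, 0.1 + (i % 100) / 1000.0) for i in range(10_000)]

key_load = Counter(by_key(i) for i, _ in points)
band_load = Counter(by_longitude_band(lon) for _, lon in points)

print(sorted(key_load.values()))   # every server holds the same number of points
print(sorted(band_load.values()))  # a single server holds all of them
```

<p>The price of key-based partitioning is that a spatial lookup must consult every server's slice, much as a full text lookup does, but no single server becomes a hot spot for dense areas.</p>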

<p>We can do spatial joins like this:</p>

<blockquote>
 <pre><code>   SELECT  ?s 
           ( &lt;sql:num_or_null&gt; (?p) )  
           COUNT (*) 
    WHERE  { ?s   &lt;http://<a href="http://dbpedia.org/resource/DBpedia" id="link-id0x10da9b08">dbpedia</a>.org/ontology/populationTotal&gt;  ?p    . 
             FILTER 
               ( &lt;sql:num_or_null&gt; (?p) &gt; 1000000 )                      . 
             ?s   geo:geometry                                   ?geo  .
             FILTER 
               ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) )                  . 
             ?xx  geo:geometry                                   ?pt 
           } 
 GROUP BY  ?s 
           ( &lt;sql:num_or_null&gt; (?p) )
 ORDER BY  DESC 3 
    LIMIT  20 </code> </pre></blockquote>

<p>This takes the DBpedia subjects that have a population over 1 million and a geometry.  We then count all the geometries within 5 km of the point location of the first geometry.  With DBpedia (about 5 million points), <a href="http://www.geonames.org/" id="link-id0x21af78d0">GeoNames</a> (7 million points), and OpenStreetMap (350 million points), we get the result:</p>

<blockquote>
 <pre><code>http://dbpedia.org/resource/Munich                        1356594    117280
http://dbpedia.org/resource/London                        7355400     81486
http://dbpedia.org/resource/Davao_City                    1363337     58640
http://dbpedia.org/resource/Belo_Horizonte                2412937     58640
http://dbpedia.org/resource/Chengde                       3610000     58640
http://dbpedia.org/resource/Hamburg                       1769117     51664
http://dbpedia.org/resource/San_Diego%2C_California       1266731     47685
http://dbpedia.org/resource/Bursa                         1562828     47685
http://dbpedia.org/resource/Port-au-Prince                1082800     47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan    1194156     45636
http://dbpedia.org/resource/Sana%27a                      1747627     40923
http://dbpedia.org/resource/Milan                         1303437     40923
http://dbpedia.org/resource/Campinas                      1059420     40923
http://dbpedia.org/resource/Hohhot                        2580000     40923
http://dbpedia.org/resource/Brussels                      1031215     40923
http://dbpedia.org/resource/Bogra_District                2988567     40923
http://dbpedia.org/resource/Cort%C3%A9s_Department        1202510     40923
http://dbpedia.org/resource/Berlin                        3416300     35668
http://dbpedia.org/resource/New_York_City                 8274527     30810
http://dbpedia.org/resource/Los_Angeles%2C_California     3849378     25614<br />
20 Rows. -- 1733 msec.<br />
Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s  664% <a href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c4406a0">cpu</a> 2%  read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs
</code></pre></blockquote>

<p>This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm <a href="http://dbpedia.org/resource/Cache" id="link-id0x1c420158">cache</a>.  Fair enough for a first crack; this can obviously be optimized further.  Still, the geo part of the processing is already as good as instantaneous.</p>

<p>We will shortly have the geography features installed on DBpedia and the other data sets we host.  As these come online we will show more demo queries.</p>

<p>For more about SQL/MM, you can look to a couple of PDFs:</p>
<ul>
<li>
<a href="http://www.fer.hr/_download/repository/SQLMM_Spatial-_The_Standard_to_Manage_Spatial_Data_in_Relational_Database_Systems.pdf" id="link-id133775f0">SQL/MM Spatial: The Standard to Manage Spatial Data in
Relational Database Systems</a> by Knut Stolze</li>
<li>
  <a href="http://www.sigmod.org/record/issues/0112/standards.pdf" id="link-id1433c5e0">SQL Multimedia and Application Packages (SQL/MM)</a> by Jim Melton and Andrew Eisenberg</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-10-27#1586">
  <rss:title>European Commission and the Data Overflow</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-10-27T18:29:51Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data. Since the questionnaire is public, I am publishing my answers below. Data and data types What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? Private data warehouses of corporations have more than doubled in size every year in recent years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news. This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema. Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like. The volume of the well-known linked data sets is around 10 billion statements. With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction. Relevant sections of this mass of data are a potential addition to any present or future analytics application. 
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth. Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models. By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas. What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)? All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know. Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred. Interleaving of all database functions and types becomes increasingly important. Industries, communities Who is producing these data and why? Could they do it better? How? Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. 
These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs). Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread. Who is consuming these data and why? Could they do it better? How? Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird&#39;s eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages. Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. 
We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean. What industrial sectors in Europe could become more competitive if they became much better at managing data? Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media. Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support? The regulation landscape drives database demand through data retention requirements and the like. With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science. For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help. More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile. This is a tall order and implementing something of the sort is an open question. What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers. We have come across the following: Knowing that the data exists in the first place. If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like. 
Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data. Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this. Services, software stacks, protocols, standards, benchmarks What combinations of components are needed to deal with these problems? Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction. Interoperability and at least de facto standards in ways of doing this will emerge. What data exchange and processing mechanisms will be needed to work across platforms and programming languages? HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient. There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF. For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate. 
What data environments are today so wastefully messy that they would benefit from the development of standards? RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together. Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish. Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example. What kind of performance is expected or required of these systems? Who will measure it reliably? How? Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases. The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need. These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing. We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached. The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction. Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. 
But the TPC-H rules do not allow official reporting of such. Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs. If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high end TPC-C results), then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish. Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction. Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding. Usability and training How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier? In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL. Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned. There is no single solution. 
For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches. Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea. For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities. For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming. This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so. One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more. This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. 
These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer. How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries? For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors. RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these. A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries. For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize. The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable. As the data overflow proceeds, its victims will multiply and create demand for solutions. 
The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off. If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination. The problem is creating the pull. Challenges What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ? The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better; reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges. Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight. The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact. The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work more closely together. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain. 
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed. What should one do to set up such a challenge, administer, and monitor it? The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed. The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&amp;D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US. There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress. Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big <a href="http://dbpedia.org/resource/Data" id="link-id0x43bae00">data</a>.</p>
 
<p>Since the <a href="http://cordis.europa.eu/fp7/ict/content-knowledge/consultation_en.html" id="link-id1191c0f8">questionnaire is public</a>, I am publishing my answers below.</p>

<ol type="1" start="1">
<li>
  <p>
    <b>Data and data types</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? </b>
    </p>

<p>Private data warehouses of corporations have more than doubled in size every year in recent years; hundreds of TB is not exceptional.  This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrability through use of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x5c7add0">RDF</a> and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x5c7adb8">linked data</a> principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data.  There is convergence around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x5c7ada0">DBpedia</a> identifiers for real-world entities, e.g., most things that would be in the news.</p>

<p>This also means that internal data processes and silos may be enriched with this content.  There is consequent pressure for accommodating more diversity of data, with more flexible <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x7d87a88">schema</a>.</p>

<p>Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data.  Examples are product catalogs, price lists, event schedules, and the like.</p>

<p>The volume of the well-known linked data sets is around 10 billion statements.  With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable.  This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.</p>

<p>Relevant sections of this mass of data are a potential addition to any present or future analytics application.</p>

<p>Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data.  This will drive database innovation in the coming years even more than the continued classical warehouse growth.</p>

<p>Science data is another driver of the data overflow.  For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data.  This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data.  Data and <a href="http://dbpedia.org/resource/Metadata" id="link-id0x7a3fb40">metadata</a> should travel together but may have different data models.</p>

<p>By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible.  Restricted circles can and likely will implement similar ideas.</p>
    </li>

<li>
    <p>
        <b>What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x5a48058">knowledge</a> graphs, 3D, sensor streams...)?</b>
    </p>

<p>All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., <i>photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.</i>
      </p>

<p>Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.</p>

<p>Interleaving of all database functions and types becomes increasingly important.</p>
</li>
  </ol>
</li>


<li>
  <p>
    <b>Industries, communities</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>Who is producing these data and why? Could they do it better? How?</b>
    </p>

<p>Right now, projects such as <a href="http://www.bio2rdf.org/" id="link-id0x2a29de8">Bio2RDF</a>, <a href="http://neurocommons.org/page/Main_Page" id="link-id0x7ddaed0">Neurocommons</a>, and DBpedia produce this data.  The processes are in place and are reasonable.  Incremental improvement is to be expected.  These processes, along with the <a href="http://www.w3.org/DesignIssues/LinkedData.html" id="link-id0xbab4dfd0">linked data meme</a> generally taking off, drive demand for better <a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x51f4e0">NLP</a> (<a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x51a1b48">Natural Language Processing</a>), e.g., <a href="http://dbpedia.org/resource/Entity" id="link-id0x956680">entity</a> and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).</p>

<p>Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this.  The required baseline level has been reached; the rest is a matter of automating deployment.  Within the enterprise, there are advantages to be gained for <a href="http://dbpedia.org/resource/Information" id="link-id0x7da9e80">information</a> integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x71673f8">URI</a>.  Some of this information may even be published on an <a href="http://dbpedia.org/resource/Extranet" id="link-id0x9aa6e0">extranet</a> for self-service and web-service interfaces.  This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier.  Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.</p>
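<p>The idea of giving everything a URI can be illustrated with a minimal sketch that turns one relational row into triples.  This is not the mapping language under development at the W3C; the base URI, the table, and the column names are hypothetical and chosen only for illustration.</p>

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.com/crm/"   # hypothetical base URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table: str, key: str, row: Dict[str, object]) -> List[Triple]:
    """Turn one relational row into triples: one subject URI per row,
    one predicate URI per column."""
    subject = f"{BASE}{table}/{row[key]}"
    triples = [(subject, RDF_TYPE, f"{BASE}schema/{table}")]
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is already in the subject URI; NULLs yield no triple
        triples.append((subject, f"{BASE}schema/{table}#{column}", str(value)))
    return triples


# Hypothetical CRM row; the NULL email produces no statement.
customer = {"id": 42, "name": "ACME", "email": None}
for triple in row_to_triples("customer", "id", customer):
    print(triple)
```

<p>Once rows from the CRM, email, and support systems share such identifiers, joining them becomes a matter of following URIs rather than reconciling keys by hand.</p>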

</li>
	<li>
    <p>
        <b>Who is consuming these data and why? Could they do it better? How?</b>
    </p>

<p>Consumers are various.  The greatest need is for tools that summarize complex data and give a bird&#39;s eye view of what data is available in the first place.  Consuming the data is hindered by users not necessarily even knowing what data exists.  This is somewhat new; traditionally, the business analyst knew the schema of the warehouse and was proficient with <a href="http://dbpedia.org/resource/SQL" id="link-id0x7f7b148">SQL</a> report generators and statistics packages.</p>

<p>Where Web 2.0 made the <i>citizen journalist</i>, the web of linked data will make the <i>citizen analyst</i>.  For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful.  We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.</p>

</li>
	<li>
    <p>
        <b>What industrial sectors in Europe could become more competitive if they became much better at managing data?</b>
    </p>

<p>Any sector could benefit.  Early adopters are seen in the biomedical field and to an extent in media.  </p>

</li>
	<li>
    <p>
        <b>Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support?</b>
    </p>

<p>The regulation landscape drives database demand through data retention requirements and the like.</p>

<p>With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online.   Regulation is needed to protect individuals, but integration should still be possible for science.</p>

<p>For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF.  This is possible but needs some more work.  Also, creating on-the-fly-anonymizing views on data might help.</p>
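<p>A minimal sketch of what a policy-based, anonymizing view over schema-last data could look like follows.  It applies a per-predicate policy, as a loose analogue of row level security for triples.  The roles, predicates, and policy tables are hypothetical, and hashing a value is only a stand-in here, not real anonymization.</p>

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical policy: which predicates each role may see, and which
# predicates carry identifying values that must be masked for that role.
VISIBLE: Dict[str, Set[str]] = {
    "researcher": {"ex:diagnosis", "ex:birthYear", "ex:name"},
    "clinician": {"ex:diagnosis", "ex:birthYear", "ex:name", "ex:address"},
}
PSEUDONYMIZE: Dict[str, Set[str]] = {"researcher": {"ex:name"}}


def pseudonym(value: str) -> str:
    """Stable stand-in for an identifying value (a sketch, not real anonymization)."""
    return "anon-" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


def policy_view(triples: Iterable[Triple], role: str) -> Iterator[Triple]:
    """Yield only the triples the role may see, masking identifying objects."""
    allowed = VISIBLE.get(role, set())      # unknown roles see nothing
    masked = PSEUDONYMIZE.get(role, set())
    for s, p, o in triples:
        if p not in allowed:
            continue
        yield (s, p, pseudonym(o) if p in masked else o)


data = [
    ("ex:p1", "ex:name", "Alice"),
    ("ex:p1", "ex:address", "1 Main St"),
    ("ex:p1", "ex:diagnosis", "flu"),
]
for triple in policy_view(data, "researcher"):
    print(triple)
```

<p>The view is computed on the fly for each request, so the stored data stays intact while each user profile sees only what its policy permits.</p>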

<p>More research is needed for reconciling the need for security with the advantages of broad-based <i>ad hoc</i> integration.  Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile.  This is a tall order and implementing something of the sort is an open question.</p>


</li>
	<li>
    <p>
        <b>What are the main practical problems identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.</b>
    </p>

<p>We have come across the following:</p>

<ul>
        <li>Knowing that the data exists in the first place.</li>
<li>If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.</li>
<li>Compatible subject matter but incompatible representation:  For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument.  It is only to be expected that the time interval between measurements is not the same.  So there is need for a lot of one-off programming to align data.</li>
      </ul>
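<p>The one-off alignment work mentioned in the last item often looks like the following sketch, which resamples one time series at the timestamps of another by linear interpolation.  The data layout (lists of time and value pairs, sorted by time) and the function names are assumptions made for illustration.</p>

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (time, value)


def interpolate(series: Sequence[Sample], t: float) -> float:
    """Linearly interpolate a time-sorted series at time t (clamped at the ends)."""
    times = [s[0] for s in series]
    i = bisect_left(times, t)
    if i == 0:
        return series[0][1]          # t is at or before the first sample
    if i == len(series):
        return series[-1][1]         # t is after the last sample
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(a: Sequence[Sample], b: Sequence[Sample]) -> List[Tuple[float, float, float]]:
    """Resample b at the timestamps of a, giving (time, a_value, b_value) rows."""
    return [(t, v, interpolate(b, t)) for t, v in a]


# Hypothetical readings taken at different intervals.
map_values = [(0.0, 10.0), (10.0, 20.0)]
instrument = [(0.0, 1.0), (4.0, 5.0), (12.0, 13.0)]
print(align(map_values, instrument))
```

<p>Each new pair of sources tends to need its own variant of such code (different units, clamping rules, or spatial matching), which is why this kind of work is hard to amortize.</p>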

<p>Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network.  Computation needs to go to the data, and databases need to support this.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Services, software stacks, protocols, standards, benchmarks</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What combinations of components are needed to deal with these problems?</b>
    </p>

<p>Recent times have seen a proliferation of special-purpose databases.  Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently separate functionality into an integrated system with sufficient flexibility.  We see some of this in the integration of map-reduce and scale-out databases.  The former antagonists have become partners.  Vertica, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x7a94e70">Greenplum</a>, and OpenLink <a href="http://virtuoso.openlinksw.com" id="link-id0x2ab2868">Virtuoso</a> are examples of DBMSs featuring work in this direction.</p>

<p>Interoperability and at least <i>de facto</i> standards in ways of doing this will emerge.</p>

</li>
	<li>
    <p>
        <b>What data exchange and processing mechanisms will be needed to work across platforms and programming languages?</b>
    </p>

<p>
        <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x78a0458">HTTP</a>, <a href="http://dbpedia.org/resource/XML" id="link-id0x7ff2360">XML</a>, and RDF are in fact very verbose, yet these are the formats and models that have uptake.  Thus, these will continue to be used even though one might think binary formats to be more efficient.</p>

<p>There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.</p>

<p>For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue.  Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.</p>


</li>
	<li>
    <p>
        <b>What data environments are today so wastefully messy that they would benefit from the development of standards?</b>
    </p>


<p>RDF and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x5643d70">OWL</a> are not messy but they could use some more performance; we are working on this.  <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x152ab18">SPARQL</a> is finally acquiring the capabilities of a serious query language, so things are slowly coming together.</p>

<p>Community process for developing application domain specific vocabularies works quite well, even though one could argue it is <i>ad hoc</i> and not up to what a modeling purist might wish.</p>

<p>Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.</p>

</li>
	<li>
    <p>
        <b>What kind of performance is expected or required of these systems? Who will measure it reliably? How?</b>
    </p>

<p>Relational databases have a history of substantial investment in <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0xecc100">optimization</a> and some of them are very good for what they do, e.g., the newer generation of analytics databases.</p>

<p>The very large, schema-last, NoSQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.</p>

<p>These trends will merge:  Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.</p>

<p>We find RDF augmented with some binary types at this crossroads.  This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model.  The added cost of schema-last and inference must come down.  We are working on this.  Research work such as carried out with <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x7ae2890">MonetDB</a> gives clues as to how these aims can be reached.</p>

<p>The separation of query language and inference is artificial.  After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.</p>

<p>Benchmarks are key.  Some gain can be had even from repurposing standard relational benchmarks like <a href="http://www.tpc.org/" id="link-id0x71eb528">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x5e16a40">H</a>.  But the TPC-H rules do not allow official reporting of such.</p>

<p>Development of benchmarks for RDF, complex queries, and inference is needed.  This is a bold challenge to the community; such a benchmark should be rooted in real-life integration needs and involve high heterogeneity.  A key-value store benchmark might also be conceived.  A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.</p>

<p>If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high end TPC-C results), then TPC-style rules and processes would be quite adequate.  The threshold to publish should be lowered:  Everybody runs the TPC workloads internally but few publish.</p>

<p>Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government.  Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.</p>

<p>Benchmarks should be run by software vendors on their own systems, tuned by themselves.  But there should be a process of disclosure and auditing; the TPC rules give an example.  Compliance should not be too expensive or time consuming.  Some community development for automating these things would be a worthwhile target for EC funding.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Usability and training</b>
  </p>

<ol type="a" start="1">

	<li>
    <p>
        <b>How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?</b>
    </p>

<p>In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL.  For the linked data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x7761e50">web</a>, the same will take place behind SPARQL.</p>

<p>Beyond these, programming an arbitrary algorithm with MPI, for example, so that it makes good use of a cluster platform is quite difficult.  The casual amateur is hereby warned.</p>

<p>There is no single solution.  Explicit, programmatic parallelization, with MPI for example, scales very poorly in terms of required skill, so we should favor declarative and/or functional approaches that lend themselves to automatic parallelization.</p>

<p>Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.</p>

<p>For procedural workloads, things like Erlang may be good in some cases and are not overly difficult in principle, especially if there are good debugging facilities.</p>

<p>For shipping functions in a cluster or cloud, the <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x5494b0">BOOM</a> (<a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x7f1f148">Berkeley Orders Of Magnitude</a>) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce.  The question is whether a <a href="http://dbpedia.org/resource/PHP" id="link-id0x5c758c8">PHP</a> developer can be made to do logic programming.</p>

<p>This bridge will be crossed only with actual need and even then reluctantly.  We may look at the Web 2.0 practice of sharding <a href="http://dbpedia.org/resource/MySQL" id="link-id0x432f868">MySQL</a>, inconvenient as this may be, for an example.  There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, <i>post hoc</i>, often a point solution.  One could argue that planning ahead would be smarter but by and large the world does not work so.</p>

<p>One part of the answer is an infinitely scalable SQL database that expands and shrinks in the cloud, with the usual semantics, maybe optional eventual consistency and built-in map-reduce.  If such a thing is inexpensive enough and syntax-level-compatible with the present installed base, many developers do not have to learn very much more.</p>

<p>This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this.  Therefore we wish to go for bold new application types for which the client-server database application is not the model.  Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there.  These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.</p>

</li>
	<li>
    <p>
        <b>How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?</b>
    </p>

<p>For the most part, developers do not learn things for the sake of learning.  When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camp interaction.  The research world is often similarly insular.  A new inflection in the application landscape is needed to drive learning.  This inflection is provided by the <a href="https://wiki.mozilla.org/Labs/Ubiquity" id="link-id0x7f051c8">ubiquity</a> of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.</p>

<p>RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML.  These new things should, as far as possible, be deployed in the usual technology stack, <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29" id="link-id0x77151e0">LAMP</a> or Java.  Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.</p>

<p>A lot of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x7940cd0">semantic web</a> potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.</p>

<p>For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM as the programming model would be easy enough to learn and utilize.</p>

<p>The question is one of providing challenges.  Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training.  With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.</p>

<p>As the data overflow proceeds, its victims will multiply and create demand for solutions.  The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.</p>

<p>If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT.  This would create interest, and interest would drive training and dissemination.</p>

<p>The problem is creating the pull.</p>
</li>
  </ol>
</li>

<li>
  <p>
    <b>Challenges</b>
  </p>
<ol type="a" start="1">

	<li>
    <p>
        <b>What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, <a href="http://dbpedia.org/resource/Google" id="link-id0x7e72f40">Google</a> Lunar X Prize, etc. ... ?</b>
    </p>

<p>The EC itself no doubt suffers from data overflow in one function or another.  Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start.  The more real the data, the better; reality is consistently more complex and surprising than imagination.  Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.</p>

<p>Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.</p>

<p>The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.</p>

<p>The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded.  Otherwise investing in existing business development will be more interesting to industry.  Some industry participation seems necessary; we would wish academia and industry to work closer.  Also, having industry supply the baseline guarantees that academia actually does further the state of the art.  This is not always certain.</p>

<p>If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia.  Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.</p>


</li>
	<li>
    <p>
        <b>What should one do  to set up such a challenge, administer, and monitor it?</b>
    </p>

<p>The EC should probably circulate a call for actual problem scenarios involving big data.  If the matter of the overflow is as dire as represented, cases should be easy to find.  A few should be selected and then anonymized if needed.</p>

<p>The party with the use case would benefit by hopefully having the best in the field work on it.  The contestants would benefit from having real-world needs guide R&amp;D.  The EC would not have to do very much, except possibly use some money for funding the best proposals.  The winner would possibly get a large account and related sales and service income.  The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.</p>

<p>There may be a good benchmark at the time, possibly resulting from FP7 itself.  In such a case, the EC could offer a prize for winners.  Details would have to be worked out case by case.  Such a challenge could be repeated a few times, since benchmark-driven progress, in databases or in TREC for example, has taken some years to reach the point of slowing down.</p>

<p>Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</p>

</li>
  </ol>
</li>
</ol>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-10-27#1585">
  <rss:title>European Commission and the Data Overflow</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-10-27T18:29:51Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data. Since the questionnaire is public, I am publishing my answers below. Data and data types What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news. This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema. Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like. The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction. Relevant sections of this mass of data are a potential addition to any present or future analytics application. 
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth. Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models. By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas. What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)? All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know. Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred. Interleaving of all database functions and types becomes increasingly important. Industries, communities Who is producing these data and why? Could they do it better? How? Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. 
These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs). Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread. Who is consuming these data and why? Could they do it better? How? Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird&#39;s eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages. Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. 
We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean. What industrial sectors in Europe could become more competitive if they became much better at managing data? Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media. Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support? The regulation landscape drives database demand through data retention requirements and the like. With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science. For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help. More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile. This is a tall order and implementing something of the sort is an open question. What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers. We have come across the following: Knowing that the data exists in the first place. If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like. 
Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data. Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this. Services, software stacks, protocols, standards, benchmarks What combinations of components are needed to deal with these problems? Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction. Interoperability and at least de facto standards in ways of doing this will emerge. What data exchange and processing mechanisms will be needed to work across platforms and programming languages? HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient. There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF. For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate. 
What data environments are today so wastefully messy that they would benefit from the development of standards? RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together. Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish. Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example. What kind of performance is expected or required of these systems? Who will measure it reliably? How? Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases. The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need. These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing. We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached. The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction. Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. 
But the TPC-H rules do not allow official reporting of such. Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs. If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high-end TPC-C results), then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish. Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction. Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding. Usability and training How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier? In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL. Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned. There is no single solution. 
For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches. Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea. For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities. For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming. This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so. One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more. This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. 
These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer. How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries? For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors. RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these. A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries. For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize. The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable. As the data overflow proceeds, its victims will multiply and create demand for solutions. 
The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off. If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination. The problem is creating the pull. Challenges What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ? The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better; reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges. Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight. The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact. The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain. 
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed. What should one do to set up such a challenge, administer, and monitor it? The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed. The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&amp;D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US. There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress. Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big <a href="http://dbpedia.org/resource/Data" id="link-id0x79cfe58">data</a>.</p>
 
<p>Since the <a href="http://cordis.europa.eu/fp7/ict/content-knowledge/consultation_en.html" id="link-id1191c0f8">questionnaire is public</a>, I am publishing my answers below.</p>

<ol type="1" start="1">
<li>
  <p>
    <b>Data and data types</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? </b>
    </p>

<p>Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional.  This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x7d7e7a0">RDF</a> and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x7f2a788">linked data</a> principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data.  There is convergence around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x7dfbca8">DBpedia</a> identifiers for real-world entities, e.g., most things that would be in the news.</p>

<p>This also means that internal data processes and silos may be enriched with this content.  There is consequent pressure for accommodating more diversity of data, with more flexible <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x7babaf8">schema</a>.</p>

<p>Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data.  Examples are product catalogs, price lists, event schedules, and the like.</p>

<p>The volume of the well-known linked data sets is around 10 billion statements.  With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable.  This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.</p>

<p>Relevant sections of this mass of data are a potential addition to any present or future analytics application.</p>

<p>Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data.  This will drive database innovation for the next years even more than the continued classical warehouse growth.</p>

<p>Science data is another driver of the data overflow.  For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data.  This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data.  Data and <a href="http://dbpedia.org/resource/Metadata" id="link-id0x96ce60">metadata</a> should travel together but may have different data models.</p>

<p>By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible.  Restricted circles can and likely will implement similar ideas.</p>
    </li>

<li>
    <p>
        <b>What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x7e8e248">knowledge</a> graphs, 3D, sensor streams...)?</b>
    </p>

<p>All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., <i>photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.</i>
      </p>

<p>Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.</p>

<p>Interleaving of all database functions and types becomes increasingly important.</p>
</li>
  </ol>
</li>


<li>
  <p>
    <b>Industries, communities</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>Who is producing these data and why? Could they do it better? How?</b>
    </p>

<p>Right now, projects such as <a href="http://www.bio2rdf.org/" id="link-id0x43bd098">Bio2RDF</a>, <a href="http://neurocommons.org/page/Main_Page" id="link-id0x5c074b0">Neurocommons</a>, and DBpedia produce this data.  The processes are in place and are reasonable.  Incremental improvement is to be expected.  These processes, along with the <a href="http://www.w3.org/DesignIssues/LinkedData.html" id="link-id0x72131d0">linked data meme</a> generally taking off, drive demand for better <a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x71e7798">NLP</a> (<a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x7e0e2f0">Natural Language Processing</a>), e.g., <a href="http://dbpedia.org/resource/Entity" id="link-id0x71ab500">entity</a> and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).</p>

<p>Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this.  The required baseline level has been reached; the rest is a matter of automating deployment.  Within the enterprise, there are advantages to be gained for <a href="http://dbpedia.org/resource/Information" id="link-id0x7a8e9a8">information</a> integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x599f630">URI</a>.  Some of this information may even be published on an <a href="http://dbpedia.org/resource/Extranet" id="link-id0x2a28f98">extranet</a> for self-service and web-service interfaces.  This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier.  Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.</p>
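<p>As a minimal sketch of what giving everything a URI can look like, the following Python fragment turns one relational row into subject-predicate-object triples.  The example.com namespace, the table and column names, and the row_to_triples helper are all hypothetical; this is not the mapping being standardized by the W3C working group, only an illustration of the principle.</p>

```python
from urllib.parse import quote

# Hypothetical namespace; a real deployment would use its own domain.
BASE = "http://example.com/crm/"


def row_to_triples(table, key_column, row):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    # The primary key becomes part of the subject URI.
    subject = f"{BASE}{table}/{quote(str(row[key_column]))}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key lives in the URI; NULL columns yield no triple
        predicate = f"{BASE}schema/{table}#{column}"
        triples.append((subject, predicate, value))
    return triples


# Made-up sample row.
customer = {"id": 42, "name": "ACME", "country": "FI", "fax": None}
triples = row_to_triples("customer", "id", customer)
for triple in triples:
    print(triple)
```

<p>Once emails and support tickets are mapped the same way, they can refer to the customer by the same URI, which is where the integration benefit would come from.</p>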

</li>
	<li>
    <p>
        <b>Who is consuming these data and why? Could they do it better? How?</b>
    </p>

<p>Consumers are various.  The greatest need is for tools that summarize complex data and allow getting a bird&#39;s eye view of what data is in the first instance available.  Consuming the data is hindered by the user not even necessarily knowing what data there is.  This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with <a href="http://dbpedia.org/resource/SQL" id="link-id0x5999558">SQL</a> report generators and statistics packages.</p>

<p>Where Web 2.0 made the <i>citizen journalist</i>, the web of linked data will make the <i>citizen analyst</i>.  For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful.  We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.</p>

</li>
	<li>
    <p>
        <b>What industrial sectors in Europe could become more competitive if they became much better at managing data?</b>
    </p>

<p>Any sector could benefit.  Early adopters are seen in the biomedical field and to an extent in media.  </p>

</li>
	<li>
    <p>
        <b>Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support?</b>
    </p>

<p>The regulation landscape drives database demand through data retention requirements and the like.</p>

<p>With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online.   Regulation is needed to protect individuals, but integration should still be possible for science.</p>

<p>For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF.  This is possible but needs some more work.  Also, creating on-the-fly-anonymizing views on data might help.</p>
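<p>To make the idea concrete, here is a small Python sketch of a policy-filtered, anonymizing view over quads, assuming that the named graph is the unit of classification.  The clearance levels, the graph and predicate names, the salt, and the view function are hypothetical and do not describe any existing product feature.</p>

```python
import hashlib
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # named graph, used here as the unit of classification


# Hypothetical policy: minimum clearance needed to read each graph.
GRAPH_CLEARANCE = {"urn:g:public": 0, "urn:g:clinical": 2}
# Hypothetical set of predicates that directly identify a person.
IDENTIFYING = {"urn:p:name"}
SALT = b"demo-salt"  # a real system would keep this secret


def pseudonym(subject):
    """Replace a subject URI with a stable but opaque identifier."""
    digest = hashlib.sha256(SALT + subject.encode("utf-8")).hexdigest()
    return "urn:anon:" + digest[:12]


def view(quads, clearance, anonymize):
    """Return only the quads this clearance may see, optionally anonymized."""
    result = []
    for q in quads:
        required = GRAPH_CLEARANCE.get(q.graph)
        if required is None or required > clearance:
            continue  # unclassified graphs are hidden by default
        if anonymize:
            if q.predicate in IDENTIFYING:
                continue  # drop directly identifying statements
            q = q._replace(subject=pseudonym(q.subject))
        result.append(q)
    return result


# Made-up sample data.
DATA = [
    Quad("urn:patient:1", "urn:p:name", "Jane Doe", "urn:g:clinical"),
    Quad("urn:patient:1", "urn:p:diagnosis", "J45", "urn:g:clinical"),
    Quad("urn:drug:7", "urn:p:label", "Salbutamol", "urn:g:public"),
]
```

<p>Hashing identifiers is only pseudonymization; whether it is sufficient for a given regulation is a separate question that this sketch does not answer.</p>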

<p>More research is needed for reconciling the need for security with the advantages of broad-based <i>ad hoc</i> integration.  Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile.  This is a tall order and implementing something of the sort is an open question.</p>


</li>
	<li>
    <p>
        <b>What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.</b>
    </p>

<p>We have come across the following:</p>

<ul>
        <li>Knowing that the data exists in the first place.</li>
<li>If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.</li>
<li>Compatible subject matter but incompatible representation:  For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument.  It is only to be expected that the time interval between measurements is not the same.  So there is need for a lot of one-off programming to align data.</li>
      </ul>
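<p>The one-off alignment code mentioned in the last item often looks something like the following Python sketch, which resamples one time series onto the timestamps of another by linear interpolation.  The function names and the sample values are made up for illustration, and linear interpolation is an assumption that may not suit every kind of measurement.</p>

```python
from bisect import bisect_left


def interpolate(series, t):
    """Linearly interpolate a time-sorted [(time, value), ...] series at time t."""
    times = [point[0] for point in series]
    i = bisect_left(times, t)
    if i != len(times) and times[i] == t:
        return series[i][1]  # exact hit, no interpolation needed
    if i == 0 or i == len(times):
        return None  # t lies outside the measured range: do not extrapolate
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(a, b):
    """Resample series b onto the timestamps of series a."""
    return [(t, value_a, interpolate(b, t)) for t, value_a in a]


# Two instruments measuring at different intervals (made-up values).
series_a = [(0, 1.0), (10, 2.0), (20, 3.0)]
series_b = [(5, 10.0), (15, 20.0)]
aligned = align(series_a, series_b)
print(aligned)
```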

<p>Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network.  Computation needs to go to the data, and databases need to support this.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Services, software stacks, protocols, standards, benchmarks</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What combinations of components are needed to deal with these problems?</b>
    </p>

<p>Recent times have seen a proliferation of special purpose databases.  Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility.  We see some of this in integration of map-reduce and scale-out databases.  The former antagonists have become partners. Vertica, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x45ecfa0">Greenplum</a>, and OpenLink <a href="http://virtuoso.openlinksw.com" id="link-id0x7f73fc8">Virtuoso</a> are examples of DBMSs featuring work in this direction.</p>

<p>Interoperability and at least <i>de facto</i> standards in ways of doing this will emerge.</p>

</li>
	<li>
    <p>
        <b>What data exchange and processing mechanisms will be needed to work across platforms and programming languages?</b>
    </p>

<p>
        <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x776a1a0">HTTP</a>, <a href="http://dbpedia.org/resource/XML" id="link-id0x2a4e8d0">XML</a>, and RDF are in fact very verbose, yet these are the formats and models that have uptake.  Thus, these will continue to be used even though one might think binary formats to be more efficient.</p>
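<p>The size difference is easy to see with a small experiment.  The Python sketch below encodes the same 100 made-up sensor readings once as XML and once as fixed-width binary records; the element and attribute names and the record layout are assumptions chosen only for illustration.</p>

```python
import struct
import xml.etree.ElementTree as ET

# Made-up readings: (unix timestamp, temperature).
readings = [(1258000000 + i, 20.0 + i * 0.5) for i in range(100)]

# Text form: one XML element per reading.
root = ET.Element("readings")
for ts, value in readings:
    ET.SubElement(root, "reading", time=str(ts), value=str(value))
xml_bytes = ET.tostring(root, encoding="utf-8")

# Binary form: 32-bit unsigned timestamp plus 64-bit float per reading.
# "=" selects native byte order with standard sizes and no padding.
RECORD = "=Id"
binary_bytes = b"".join(struct.pack(RECORD, ts, value) for ts, value in readings)

print(len(xml_bytes), "bytes as XML")
print(len(binary_bytes), "bytes as binary records")

# The binary form still round-trips to the same values.
first = struct.unpack(RECORD, binary_bytes[: struct.calcsize(RECORD)])
```

<p>The XML form is several times larger, but it is self-describing and can be read by any tool, which is a large part of why such formats have uptake.</p>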

<p>There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.</p>

<p>For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue.  Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.</p>


</li>
	<li>
    <p>
        <b>What data environments are today so wastefully messy that they would benefit from the development of standards?</b>
    </p>


<p>RDF and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x2a35960">OWL</a> are not messy but they could use some more performance; we are working on this.  <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x12362e8">SPARQL</a> is finally acquiring the capabilities of a serious query language, so things are slowly coming together.</p>

<p>Community process for developing application domain specific vocabularies works quite well, even though one could argue it is <i>ad hoc</i> and not up to what a modeling purist might wish.</p>

<p>Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some of the WS-* standards, for example.</p>

</li>
	<li>
    <p>
        <b>What kind of performance is expected or required of these systems? Who will measure it reliably? How?</b>
    </p>

<p>Relational databases have a history of substantial investment in <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0x7b2d7c8">optimization</a> and some of them are very good for what they do, e.g., the newer generation of analytics databases.</p>

<p>The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.</p>

<p>These trends will merge:  Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.</p>

<p>We find RDF augmented with some binary types at this crossroads.  This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model.  The added cost of schema-last and inference must come down.  We are working on this.  Research work such as carried out with <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x794ee48">MonetDB</a> gives clues as to how these aims can be reached.</p>

<p>The separation of query language and inference is artificial.  After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.</p>

<p>Benchmarks are key.  Some gain can be had even from repurposing standard relational benchmarks like <a href="http://www.tpc.org/" id="link-id0x7d45c58">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x45b0198">H</a>.  But the TPC-H rules do not allow official reporting of such.</p>

<p>Development of benchmarks for RDF, complex queries, and inference is needed.  A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity.  A key-value store benchmark might also be conceived.  A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.</p>

<p>If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high-end TPC-C results), then TPC-style rules and processes would be quite adequate.  The threshold to publish should be lowered:  Everybody runs the TPC workloads internally but few publish.</p>

<p>Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government.  Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.</p>

<p>Benchmarks should be run by software vendors on their own systems, tuned by themselves.  But there should be a process of disclosure and auditing; the TPC rules give an example.  Compliance should not be too expensive or time consuming.  Some community development for automating these things would be a worthwhile target for EC funding.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Usability and training</b>
  </p>

<ol type="a" start="1">

	<li>
    <p>
        <b>How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?</b>
    </p>

<p>In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL.  For the linked data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x7e01618">web</a>, the same will take place behind SPARQL.</p>

<p>Beyond these, matters get harder.  For example, programming an arbitrary algorithm with MPI so that it makes good use of a cluster platform is quite difficult.  The casual amateur is hereby warned.</p>

<p>There is no single solution.  For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.</p>
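<p>As a small illustration of the functional style, the Python sketch below expresses a word count as a pure function mapped over chunks, followed by an associative merge.  The developer states what is computed per chunk and how results combine; the executor decides where and when each chunk runs.  The function names are hypothetical, and a thread pool stands in here for a real cluster runtime.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def word_count(chunk):
    """Pure function: no shared state, so chunks can run in any order."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge(left, right):
    """Associative combination of two partial results."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def count_words(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(word_count, chunks)  # scheduling is left to the pool
        return reduce(merge, partials, {})


result = count_words(["a b a", "b c"])
print(result)
```

<p>Nothing in word_count or merge mentions threads, processes, or messages, which is the property that keeps the required skill level low compared to explicit MPI programming.</p>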

<p>Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.</p>

<p>For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.</p>

<p>For shipping functions in a cluster or cloud, the <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x43665a8">BOOM</a> (<a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x7718f00">Berkeley Orders Of Magnitude</a>) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce.  The question is whether a <a href="http://dbpedia.org/resource/PHP" id="link-id0x7d64f68">PHP</a> developer can be made to do logic programming.</p>

<p>This bridge will be crossed only with actual need and even then reluctantly.  We may look at the Web 2.0 practice of sharding <a href="http://dbpedia.org/resource/MySQL" id="link-id0xbab1ae98">MySQL</a>, inconvenient as this may be, for an example.  There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, <i>post hoc</i>, often a point solution.  One could argue that planning ahead would be smarter but by and large the world does not work that way.</p>

<p>One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map-reduce.  If such a thing is inexpensive enough and syntax-level-compatible with the present installed base, many developers do not have to learn very much more.</p>

<p>This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this.  Therefore we wish to go for bold new application types for which the client-server database application is not the model.  Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there.  These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.</p>

</li>
	<li>
    <p>
        <b>How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?</b>
    </p>

<p>For the most part, developers do not learn things for the sake of learning.  When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camp interaction.  The research world is often similarly insular.  A new inflection in the application landscape is needed to drive learning.  This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.</p>

<p>RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML.  These new things should, within possibility, be deployed in the usual technology stack, <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29" id="link-id0x55596a8">LAMP</a> or Java.  Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.</p>

<p>A lot of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x3d5378">semantic web</a> potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.</p>

<p>For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM as the programming model would be easy enough to learn and utilize.</p>

<p>The question is one of providing challenges.  Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training.  With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.</p>

<p>As the data overflow proceeds, its victims will multiply and create demand for solutions.  The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.</p>

<p>If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT.  This would create interest, and interest would drive training and dissemination.</p>

<p>The problem is creating the pull.</p>
</li>
  </ol>
</li>

<li>
  <p>
    <b>Challenges</b>
  </p>
<ol type="a" start="1">

	<li>
    <p>
        <b>What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, <a href="http://dbpedia.org/resource/Google" id="link-id0x6a6c2b0">Google</a> Lunar X Prize, etc. ... ?</b>
    </p>

<p>The EC itself no doubt suffers from data overflow in one function or another.  Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start.  The more real the data, the better: reality is consistently more complex and surprising than imagination.  Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.</p>

<p>Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.</p>

<p>The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.</p>

<p>The incentives should be sufficient, and part of the expenses arising from running for such challenges could be funded.  Otherwise investing in existing business development will be more interesting to industry.  Some industry participation seems necessary; we would wish academia and industry to work more closely together.  Also, having industry supply the baseline guarantees that academia actually does further the state of the art.  This is not always certain.</p>

<p>If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia.  Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.</p>


</li>
	<li>
    <p>
        <b>What should one do to set up such a challenge, administer, and monitor it?</b>
    </p>

<p>The EC should probably circulate a call for actual problem scenarios involving big data.  If the matter of the overflow is as dire as represented, cases should be easy to find.  A few should be selected and then anonymized if needed.</p>

<p>The party with the use case would benefit by having hopefully the best work on it.  The contestants would benefit from having real world needs guide R&amp;D.  The EC would not have to do very much, except possibly use some money for funding the best proposals.  The winner would possibly get a large account and related sales and service income.  The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.</p>

<p>There may be a good benchmark at the time, possibly resulting from FP7 itself.  In such a case, the EC could offer a prize for winners.  Details would have to be worked out case by case.  Such a challenge could be repeated a few times, since benchmark-driven progress, in databases or TREC for example, has taken some years to reach the point of slowing down.</p>

<p>Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</p>

</li>
  </ol>
</li>
</ol>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1573">
  <rss:title>Provenance and Reification in Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T14:44:08Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front. RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road. For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level. If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph. On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query. One could write queries like: SELECT ?pub WHERE { GRAPH ?g { ?person foaf:knows ?contact } ?contact foaf:name &quot;Alice&quot; . ?g xx:has_publisher ?pub } This would return the publishers of graphs that assert that somebody knows Alice. Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. 
Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it. If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity. Virtuoso&#39;s RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O. Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed. In SPARQL, one could use an extension syntax like: SELECT * WHERE { ?person foaf:knows ?connection OPTION ( time ?ts ) . ?connection foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2009-08-08T00:00:00&quot;^^xsd:dateTime ) } This would return everybody recorded as knowing Alice in a quad timestamped later than 2009-08-08. This presupposes that the quad table has been extended with a datetime column. The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. 
In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause. If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like: SELECT * WHERE { ?person foaf:knows ?contact OPTION ( reify ?r ) . ?r xx:assertion_time ?ts . ?contact foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2008-08-08T00:00:00&quot;^^xsd:dateTime ) } We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don&#39;t think there would be huge demand. Who knows? You tell us.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>These days, <a href="http://dbpedia.org/resource/Data" id="link-id0x37019c8">data</a> provenance is a big topic across the board, ranging from the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x53c3620">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x4aa3848">web</a>, to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x385aff0">RDF</a> in general, to any kind of data integration, with or without RDF.  Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc.  Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata.  And if they do, the approach is often a proprietary relational schema with web services in front.</p>

<p>RDF and linked data principles could evidently be a great help.  This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.</p>

<p>For now, I will talk about possible ways of dealing with provenance annotations in <a href="http://virtuoso.openlinksw.com" id="link-id0x51c4da0">Virtuoso</a> at a fairly technical level.</p>

<p>If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph.  Annotations can then be made on the graph.  The graph IRI will simply occur as the subject of a triple in the same or some other graph.  For example, all such annotations could go into a special annotations graph.</p>
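
<p>As a minimal sketch of such an annotation, assuming a server that accepts SPARQL 1.1 Update, the following records a publisher for one source graph in a separate annotations graph.  The graph IRIs, the person IRI, and the <code>xx:</code> namespace are hypothetical placeholders chosen for illustration:</p>

<blockquote>
 <code><pre>PREFIX xx: &lt;http://example.org/xx#&gt;

INSERT DATA
  {
    GRAPH  &lt;http://example.org/annotations&gt;
      {
        &lt;http://example.org/graphs/alice&gt;  xx:has_publisher  &lt;http://example.org/people/alice#me&gt;
      }
  }</pre>
 </code>
</blockquote>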

<p>On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme <a href="http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfindexes" id="link-id142a0798">discussed in the Virtuoso documentation</a>.  If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.</p>


<p>One could write queries like:</p>

<blockquote>
 <code><pre>SELECT  ?pub 
  WHERE 
    { 
      GRAPH  ?g 
        { 
          ?person  foaf:knows  ?contact 
        } 
      ?contact  foaf:name         &quot;Alice&quot;  . 
      ?g        xx:has_publisher  ?pub 
    }</pre>
 </code>
</blockquote>

<p>This would return the publishers of graphs that assert that somebody knows Alice.</p>

<p>Of course, the <a href="http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification" id="link-id14fa9488">RDF reification vocabulary</a> can be used as-is to say things about single triples.  It is however very inefficient and is not supported by any specific optimization.  Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.</p>

<p>If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.</p>
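
<p>To make the overhead concrete, here is a sketch of a query over the standard reification vocabulary.  Each annotated triple is described by a statement node carrying <code>rdf:type</code>, <code>rdf:subject</code>, <code>rdf:predicate</code>, and <code>rdf:object</code> triples, and the annotation hangs off that node.  The <code>xx:assertion_time</code> property and the <code>xx:</code> namespace are hypothetical:</p>

<blockquote>
 <code><pre>PREFIX rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xx:   &lt;http://example.org/xx#&gt;

SELECT  ?person  ?ts
  WHERE
    { ?stmt     rdf:type           rdf:Statement  .
      ?stmt     rdf:subject        ?person        .
      ?stmt     rdf:predicate      foaf:knows     .
      ?stmt     rdf:object         ?contact       .
      ?stmt     xx:assertion_time  ?ts            .
      ?contact  foaf:name          &quot;Alice&quot;
    }</pre>
 </code>
</blockquote>

<p>Four extra triples are stored for every annotated triple before any annotation is attached, and the query needs four joins just to reassemble the statement.</p>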

<p>Virtuoso&#39;s <code>RDF_QUAD</code> table can be altered to have more columns.  The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns.  A <a href="http://dbpedia.org/resource/SQL" id="link-id0x4784bf0">SQL</a> update statement can be used to set values for these additional columns if one knows the <code>G,S,P,O</code>. </p>

<p>Suppose we annotated each quad with the user who inserted it and a timestamp.  These would be columns in the <code>RDF_QUAD</code> table.  The next choice would be whether these were primary key parts or dependent parts.  If primary key parts, these would be non-<code>NULL</code> and would occur on every index.  The same quad would exist for each distinct user and time this quad had been inserted.  For loading functions to work, these columns would need a default.  In practice, we think that having such metadata as a dependent part is more likely, so that <code>G,S,P,O</code> are the unique identifier of the quad.  Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.</p>

<p>In <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x4a8a7c0">SPARQL</a>, one could use an extension syntax like:</p>

<blockquote>
 <code><pre>SELECT  * 
  WHERE 
    { ?person      foaf:knows  ?connection 
                   OPTION ( time  ?ts )     . 
      ?connection  foaf:name   &quot;Alice&quot;      . 
      FILTER ( ?ts &gt; &quot;2009-08-08T00:00:00&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>This would return everybody recorded as knowing Alice in a quad timestamped later than 2009-08-08.  This presupposes that the quad table has been extended with a datetime column.</p>

<p>The <code>OPTION (time ?ts)</code> syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of <code>RDF_QUAD</code> via a column <code>?variable</code> syntax in the <code>OPTION</code> clause.</p>


<p>If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of <code>GSPO</code> and a dependent part of <code>R</code>, where <code>R</code> would be the reification <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x49e6108">URI</a> of the quad.  Reification statements would then be made with <code>R</code> as a subject.  This would be more compact than the reification vocabulary and would not modify the <code>RDF_QUAD</code> table.  The syntax for referring to this could be something like:</p>

<blockquote>
 <code><pre>SELECT * 
  WHERE 
    { ?person   foaf:knows         ?contact 
                OPTION ( reify  ?r )          . 
      ?r        xx:assertion_time  ?ts       . 
      ?contact  foaf:name          &quot;Alice&quot;   . 
      FILTER ( ?ts &gt; &quot;2008-08-08T00:00:00&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary.  But since it is so unwieldy I don&#39;t think there would be huge demand.  Who knows?  You tell us.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-09-01#1572">
  <rss:title>Provenance and Reification in Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T14:44:08Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front. RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road. For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level. If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph. On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query. One could write queries like: SELECT ?pub WHERE { GRAPH ?g { ?person foaf:knows ?contact } ?contact foaf:name &quot;Alice&quot; . ?g xx:has_publisher ?pub } This would return the publishers of graphs that assert that somebody knows Alice. Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. 
Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it. If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity. Virtuoso&#39;s RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O. Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed. In SPARQL, one could use an extension syntax like: SELECT * WHERE { ?person foaf:knows ?connection OPTION ( time ?ts ) . ?connection foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2009-08-08T00:00:00&quot;^^xsd:dateTime ) } This would return everybody recorded as knowing Alice in a quad timestamped later than 2009-08-08. This presupposes that the quad table has been extended with a datetime column. The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. 
In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause. If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like: SELECT * WHERE { ?person foaf:knows ?contact OPTION ( reify ?r ) . ?r xx:assertion_time ?ts . ?contact foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2008-08-08T00:00:00&quot;^^xsd:dateTime ) } We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don&#39;t think there would be huge demand. Who knows? You tell us.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>These days, <a href="http://dbpedia.org/resource/Data" id="link-id0x4a44870">data</a> provenance is a big topic across the board, ranging from the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x4e10e60">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x4738350">web</a>, to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1fe33310">RDF</a> in general, to any kind of data integration, with or without RDF.  Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc.  Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata.  And if they do, the approach is often a proprietary relational schema with web services in front.</p>

<p>RDF and linked data principles could evidently be a great help.  This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.</p>

<p>For now, I will talk about possible ways of dealing with provenance annotations in <a href="http://virtuoso.openlinksw.com" id="link-id0x36581e8">Virtuoso</a> at a fairly technical level.</p>

<p>If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph.  Annotations can then be made on the graph.  The graph IRI will simply occur as the subject of a triple in the same or some other graph.  For example, all such annotations could go into a special annotations graph.</p>
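
<p>As a minimal sketch of such an annotation, assuming a server that accepts SPARQL 1.1 Update, the following records a publisher for one source graph in a separate annotations graph.  The graph IRIs, the person IRI, and the <code>xx:</code> namespace are hypothetical placeholders chosen for illustration:</p>

<blockquote>
 <code><pre>PREFIX xx: &lt;http://example.org/xx#&gt;

INSERT DATA
  {
    GRAPH  &lt;http://example.org/annotations&gt;
      {
        &lt;http://example.org/graphs/alice&gt;  xx:has_publisher  &lt;http://example.org/people/alice#me&gt;
      }
  }</pre>
 </code>
</blockquote>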

<p>On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme <a href="http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfindexes" id="link-id142a0798">discussed in the Virtuoso documentation</a>.  If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.</p>


<p>One could write queries like:</p>

<blockquote>
 <code><pre>SELECT  ?pub 
  WHERE 
    { 
      GRAPH  ?g 
        { 
          ?person  foaf:knows  ?contact 
        } 
      ?contact  foaf:name         &quot;Alice&quot;  . 
      ?g        xx:has_publisher  ?pub 
    }</pre>
 </code>
</blockquote>

<p>This would return the publishers of graphs that assert that somebody knows Alice.</p>

<p>Of course, the <a href="http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification" id="link-id14fa9488">RDF reification vocabulary</a> can be used as-is to say things about single triples.  It is however very inefficient and is not supported by any specific optimization.  Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.</p>

<p>If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.</p>
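
<p>To make the overhead concrete, here is a sketch of a query over the standard reification vocabulary.  Each annotated triple is described by a statement node carrying <code>rdf:type</code>, <code>rdf:subject</code>, <code>rdf:predicate</code>, and <code>rdf:object</code> triples, and the annotation hangs off that node.  The <code>xx:assertion_time</code> property and the <code>xx:</code> namespace are hypothetical:</p>

<blockquote>
 <code><pre>PREFIX rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xx:   &lt;http://example.org/xx#&gt;

SELECT  ?person  ?ts
  WHERE
    { ?stmt     rdf:type           rdf:Statement  .
      ?stmt     rdf:subject        ?person        .
      ?stmt     rdf:predicate      foaf:knows     .
      ?stmt     rdf:object         ?contact       .
      ?stmt     xx:assertion_time  ?ts            .
      ?contact  foaf:name          &quot;Alice&quot;
    }</pre>
 </code>
</blockquote>

<p>Four extra triples are stored for every annotated triple before any annotation is attached, and the query needs four joins just to reassemble the statement.</p>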

<p>Virtuoso&#39;s <code>RDF_QUAD</code> table can be altered to have more columns.  The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns.  A <a href="http://dbpedia.org/resource/SQL" id="link-id0x4b1d938">SQL</a> update statement can be used to set values for these additional columns if one knows the <code>G,S,P,O</code>. </p>

<p>Suppose we annotated each quad with the user who inserted it and a timestamp.  These would be columns in the <code>RDF_QUAD</code> table.  The next choice would be whether these were primary key parts or dependent parts.  If primary key parts, these would be non-<code>NULL</code> and would occur on every index.  The same quad would exist for each distinct user and time this quad had been inserted.  For loading functions to work, these columns would need a default.  In practice, we think that having such metadata as a dependent part is more likely, so that <code>G,S,P,O</code> are the unique identifier of the quad.  Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.</p>

<p>In <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x472afb0">SPARQL</a>, one could use an extension syntax like:</p>

<blockquote>
 <code><pre>SELECT  * 
  WHERE 
    { ?person      foaf:knows  ?connection 
                   OPTION ( time  ?ts )     . 
      ?connection  foaf:name   &quot;Alice&quot;      . 
      FILTER ( ?ts &gt; &quot;2009-08-08T00:00:00&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>This would return everybody recorded as knowing Alice in a quad timestamped later than 2009-08-08.  This presupposes that the quad table has been extended with a datetime column.</p>

<p>The <code>OPTION (time ?ts)</code> syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of <code>RDF_QUAD</code> via a column <code>?variable</code> syntax in the <code>OPTION</code> clause.</p>


<p>If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of <code>GSPO</code> and a dependent part of <code>R</code>, where <code>R</code> would be the reification <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x365b190">URI</a> of the quad.  Reification statements would then be made with <code>R</code> as a subject.  This would be more compact than the reification vocabulary and would not modify the <code>RDF_QUAD</code> table.  The syntax for referring to this could be something like:</p>

<blockquote>
 <code><pre>SELECT * 
  WHERE 
    { ?person   foaf:knows         ?contact 
                OPTION ( reify  ?r )          . 
      ?r        xx:assertion_time  ?ts       . 
      ?contact  foaf:name          &quot;Alice&quot;   . 
      FILTER ( ?ts &gt; &quot;2008-08-08T00:00:00&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary.  But since it is so unwieldy I don&#39;t think there would be huge demand.  Who knows?  You tell us.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-08-19#1571">
  <rss:title>More On Parallel RDF/Text Query Evaluation </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-08-19T17:28:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have received some more questions about Virtuoso&#39;s parallel query evaluation model. In answer, we will here explain how we do search engine style processing by writing SPARQL. There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce. The point is that what used to require programming can often be done in a generic query language. The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit. But by combining these two things, we are a step closer to the web being the database. I will here show how we do some joins combining full text, RDF conditions, and aggregates and ORDER BY. The sample task is finding the top 20 entities with New York in some attribute value. Then we specify the search further by only taking actors associated with New York. The results are returned in the order of a composite of entity rank and text match score. The basic query is: SELECT ( sql:s_sum_page ( &lt;sql:vector_agg&gt; ( &lt;bif:vector&gt; ( ?c1 , ?sm ) ), bif:vector ( &#39;new&#39;, &#39;york&#39; ) ) ) AS ?res WHERE { { SELECT ( &lt;SHORT_OR_LONG::&gt;(?s1) ) AS ?c1 ( &lt;sql:S_SUM&gt; ( &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 ) , &lt;SHORT_OR_LONG::&gt; ( ?s1textp ) , &lt;SHORT_OR_LONG::&gt; ( ?o1 ) , ?sc ) ) AS ?sm WHERE { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) } ORDER BY DESC ( &lt;sql:sum_rank&gt; (( &lt;sql:S_SUM&gt; ( &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 ) , &lt;SHORT_OR_LONG::&gt; ( ?s1textp ) , &lt;SHORT_OR_LONG::&gt; ( ?o1 ) , ?sc ) )) ) LIMIT 20 } } This takes some explaining. The basic part is { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) } This just makes tuples where ?s1 is the subject, ?s1textp the property, and ?o1 the literal which contains &quot;New York&quot;. 
For a single ?s1, there can of course be many properties which all contain &quot;New York&quot;. The rest of the query gathers all the &quot;New York&quot; containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities. After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between &quot;New York&quot; and the strings containing &quot;New York&quot;. The text hit score is higher if the words repeat often and in close proximity. The s_sum function is a user-defined aggregate which takes 4 arguments: The rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score. These are grouped by the subject of the triple. After this, these are sorted by sum_score of the aggregate constructed with s_sum. The sum_score is a SQL function combining the entity rank with the text scores of the different literals. This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple. The text index entries of an object are stored in the same partition as the object. But the entity rank is a property of the subject and is partitioned by the subject. Also the GROUP BY is by the subject. Thus the data is produced from all partitions, then streamed into the receiving partitions, determined by the subject. This partition can then get the score and group the matches by the subject. Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top k sort can be done for each partition separately. Finally, the top 20 of each partition are merged into the global top 20. This is then passed to a final function s_sum_page that turns this all into an XML fragment that can be processed with XSLT for inclusion on a web page. 
This differs from a text search engine in that the query pipeline can contain arbitrary cross-partition joins. Also, the string &quot;New York&quot; is a common label that occurs in many distinct entities. Thus one text match, to one document containing only the string &quot;New York&quot;, will get many entities, likely all from different partitions. So, if we only want actors with a mention of &quot;New York&quot;, we need to get the inner part of the query as: { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) . ?s1 a &lt;http://umbel.org/umbel/sc/Actor&gt; } Whether an entity is an actor can be checked in the same partition as the rank of the entity. Thus the query plan gets this check right before getting the rank. This is natural since there is no point in getting the rank of something that is not an actor. The &lt;short_or_long::sql:func&gt; notation means that we call func, which is a SQL stored procedure with the arguments in their internal form. Thus, if a variable bound to an IRI is passed, the short_or_long specifies that it is passed as its internal ID and is not converted into its text form. This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end. Now, when we run this on a collection of 4.5 billion triples of linked data, once we have the working set, we can get the top 20 &quot;New York&quot; occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy. (The hardware is two boxes with two quad-core Xeon 5345 each.) If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy. This gets about 200K &quot;New York&quot; strings, which become about 400K entities with New York somewhere, for which a rank then has to be retrieved. 
After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups. These are then sorted. This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort. By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions. There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously. In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous. Doing JOINs between partitions and partitioned GROUP BY/ORDER BY is pretty regular database stuff. Applying this to RDF is a most natural thing. If we do not parallelize the user-defined aggregate for grouping all the &quot;New York&quot; occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we&#39;d have to either bring all the data to a central point before the top k, which would destroy performance, or we would have to do procedures with explicit parallel procedure calls which is hard to write, surely too hard for ad hoc queries. Results of live execution may not be complete on initial load, as this link includes a &quot;Virtuoso Anytime&quot; timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above. Engineering matters. 
If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have received some more questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x15ca9a30">Virtuoso</a>&#39;s parallel query evaluation model.</p>

<p>In answer, we will here explain how we do search engine style processing by writing <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1574c560">SPARQL</a>.  There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce.</p>

<p>The point is that what used to require programming can often be done in a generic query language.  The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit.  But by combining these two things, we are a step closer to the web being the database.</p>

<p>I will here show how we do some joins combining full text, <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x15949970">RDF</a> conditions, and aggregates and <code>ORDER BY</code>.  The sample task is finding the top 20 entities with New York in some attribute value.  Then we specify the search further by only taking actors associated with New York.  The results are returned in the order of a composite of <a href="http://dbpedia.org/resource/Entity" id="link-id0x213bf310">entity</a> rank and text match score.</p>
 
<p>The basic query is:</p>

<blockquote>
 <code><pre>
SELECT 
  ( 
    <a href="http://dbpedia.org/resource/SQL" id="link-id0x23632230">sql</a>:s_sum_page 
      ( 
        &lt;sql:vector_agg&gt; 
          (
            &lt;bif:vector&gt; ( ?c1 , ?sm )
          ), 
        bif:vector 
          ( &#39;new&#39;, &#39;york&#39; )
      )
  ) AS ?res
WHERE 
  {
    { 
      SELECT 
        ( 
          &lt;SHORT_OR_LONG::&gt;(?s1) 
        ) AS ?c1
        ( 
          &lt;sql:S_SUM&gt; 
            (
               &lt;SHORT_OR_LONG::IRI_RANK&gt;  ( ?s1 )      ,
               &lt;SHORT_OR_LONG::&gt;          ( ?s1textp ) ,
               &lt;SHORT_OR_LONG::&gt;          ( ?o1 ) ,
               ?sc 
             )
         ) AS ?sm
      WHERE 
        { 
          ?s1  ?s1textp      ?o1             . 
          ?o1  bif:contains  &quot;new AND york&quot; 
            OPTION ( SCORE ?sc )
        }
      ORDER BY 
        DESC 
          ( 
            &lt;sql:sum_rank&gt; 
              ((
                 &lt;sql:S_SUM&gt; 
                   (
                     &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 )      ,
                     &lt;SHORT_OR_LONG::&gt;         ( ?s1textp ) ,
                     &lt;SHORT_OR_LONG::&gt;         ( ?o1 )      ,
                     ?sc 
                   ) 
              )) 
          ) 
        LIMIT 20 
    } 
  }
</pre>
 </code>
</blockquote>

<p>This takes some explaining.  The basic part is</p>

<blockquote>
 <code><pre>{ 
  ?s1  ?s1textp      ?o1             . 
  ?o1  bif:contains  &quot;new AND york&quot;  
    OPTION ( SCORE ?sc )
}</pre>
 </code>
</blockquote>
          
<p>This just makes tuples where <code>?s1</code> is the subject, <code>?s1textp</code> the property, and <code>?o1</code> the literal which contains &quot;New York&quot;.  For a single <code>?s1</code>, there can of course be many properties which all contain &quot;New York&quot;.</p>

<p>The rest of the query gathers all the &quot;New York&quot; containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities.</p>

<p>After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between &quot;New York&quot; and the strings containing &quot;New York&quot;.  The text hit score is higher if the words repeat often and in close proximity.</p>

<p>The <code>s_sum</code> function is a user-defined aggregate which takes 4 arguments: The rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score.</p>

<p>These are grouped by the subject of the triple.  After this, the groups are sorted by the <code>sum_rank</code> of the aggregate constructed with <code>s_sum</code>.  The <code>sum_rank</code> is a SQL function combining the entity rank with the text scores of the different literals.</p>
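
<p>To make the shape of such an aggregate concrete, here is a minimal Python sketch.  It is not Virtuoso&#39;s implementation: the class <code>SubjectSummary</code>, the function <code>combined_rank</code>, and the additive formula are assumptions made for illustration, since the actual scoring formula is not given here.</p>

```python
from dataclasses import dataclass, field


@dataclass
class SubjectSummary:
    """State of a hypothetical s_sum-like aggregate for one subject."""
    rank: float = 0.0
    hits: list = field(default_factory=list)  # (predicate, literal, text_score)

    def accumulate(self, rank, predicate, literal, text_score):
        # The entity rank is the same for every row of a subject; keep it once.
        self.rank = rank
        self.hits.append((predicate, literal, text_score))


def combined_rank(summary):
    # Assumed formula: entity rank plus the sum of the text match scores.
    return summary.rank + sum(score for _, _, score in summary.hits)


def aggregate(rows):
    """rows: iterable of (subject, rank, predicate, literal, text_score)."""
    groups = {}
    for subject, rank, predicate, literal, score in rows:
        summary = groups.setdefault(subject, SubjectSummary())
        summary.accumulate(rank, predicate, literal, score)
    return groups


rows = [
    ("s1", 2.0, "title", "New York", 10.0),
    ("s1", 2.0, "abstract", "New York, New York", 15.0),
    ("s2", 9.0, "label", "York, New", 1.0),
]
groups = aggregate(rows)
# Subjects ordered by descending combined rank.
ordered = sorted(groups, key=lambda s: combined_rank(groups[s]), reverse=True)
```

<p>The aggregate keeps every matching predicate and literal, not just a number, so that a later step can still build a text summary from the group.</p>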

<p>This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple.  The text index entries of an object are stored in the same partition as the object.  But the entity rank is a property of the subject and is partitioned by the subject.  Also the <code>GROUP BY</code> is by the subject.  Thus the <a href="http://dbpedia.org/resource/Data" id="link-id0x15da01b8">data</a> is produced from all partitions, then streamed into the receiving partitions, determined by the subject.  This partition can then get the score and group the matches by the subject.  Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top <code>k</code> sort can be done for each partition separately.  Finally, the top 20 of each partition are merged into the global top 20.  This is then passed to a final function <code>s_sum_page</code> that turns this all into an <a href="http://dbpedia.org/resource/XML" id="link-id0x15d59fc8">XML</a> fragment that can be processed with XSLT for inclusion on a web page.</p>
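
<p>The partition-local top <code>k</code> followed by a global merge can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not the engine&#39;s code: the partition count, the hash in <code>partition_of</code>, and the additive score are all hypothetical.</p>

```python
import heapq
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, for illustration only
TOP_K = 20


def partition_of(subject):
    # Hypothetical placement rule: a stable hash of the subject picks the partition.
    return sum(subject.encode("utf-8")) % NUM_PARTITIONS


def global_top_k(matches, entity_rank, k=TOP_K):
    """matches: iterable of (subject, text_score) pairs produced by any partition."""
    # 1. Stream each match to the partition that owns its subject.
    inbox = defaultdict(list)
    for subject, score in matches:
        inbox[partition_of(subject)].append((subject, score))

    # 2. Each partition groups by subject and keeps only its local top k.
    local_tops = []
    for rows in inbox.values():
        groups = defaultdict(float)
        for subject, score in rows:
            groups[subject] += score
        ranked = [(entity_rank.get(s, 0.0) + total, s) for s, total in groups.items()]
        local_tops.append(heapq.nlargest(k, ranked))

    # 3. A subject lives in exactly one partition, so merging the local
    #    top-k lists is enough to obtain the global top k.
    return heapq.nlargest(k, (item for top in local_tops for item in top))
```

<p>Because no group is split across partitions, each partition sends at most <code>k</code> rows to the final merge, however many groups it holds.</p>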

<p>This differs from a text search engine in that the query pipeline can contain arbitrary cross-partition joins.  Also, the string &quot;New York&quot; is a common label that occurs in many distinct entities.  Thus one text match, to one document containing only the string &quot;New York&quot;, will get many entities, likely all from different partitions.</p>

<p>So, if we only want actors with a mention of &quot;New York&quot;, we need to get the inner part of the query as:</p>

<blockquote>
 <code><pre>{ 
  ?s1  ?s1textp      ?o1            . 
  ?o1  bif:contains  &quot;new AND york&quot;  
    OPTION ( SCORE ?sc )              . 
  ?s1  a             &lt;<a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x15befb10">http</a>://<a href="http://umbel.org/about/" id="link-id0x15c92330">umbel</a>.org/umbel/sc/Actor&gt; 
}</pre>
 </code>
</blockquote>

<p>Whether an entity is an actor can be checked in the same partition as the rank of the entity.  Thus the query plan gets this check right before getting the rank.  This is natural since there is no point in getting the rank of something that is not an actor.</p>

<p>The <code>&lt;short_or_long::sql:func&gt;</code> notation means that we call <code>func</code>, which is a SQL stored procedure with the arguments in their internal form.  Thus, if a variable bound to an IRI is passed, the <code>short_or_long</code> specifies that it is passed as its internal ID and is not converted into its text form.  This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end.</p>
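
<p>The benefit of ranking on internal IDs and converting only the winners to text can be sketched as follows.  This is an illustrative Python sketch, not Virtuoso code: the <code>IRI_TEXT</code> table, the <code>iri_text</code> helper, and the example IRIs are all hypothetical.</p>

```python
import heapq

# Hypothetical dictionary from internal IRI IDs to their text form.
IRI_TEXT = {
    101: "http://example.org/entity/a",
    102: "http://example.org/entity/b",
    103: "http://example.org/entity/c",
    104: "http://example.org/entity/d",
}

lookups = []  # records every ID that was actually converted to text


def iri_text(iri_id):
    # Stands in for the comparatively expensive ID-to-string lookup.
    lookups.append(iri_id)
    return IRI_TEXT[iri_id]


def top_k_as_text(scored_ids, k):
    """scored_ids: iterable of (score, iri_id); only the k winners are resolved."""
    winners = heapq.nlargest(k, scored_ids)
    return [(iri_text(iri_id), score) for score, iri_id in winners]


result = top_k_as_text([(0.5, 101), (3.0, 102), (1.5, 103), (2.0, 104)], 2)
```

<p>All sorting happens on the compact IDs; the string lookup runs once per returned row rather than once per candidate.</p>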

<p>Now, when we run this on a collection of 4.5 billion triples of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x153772e8">linked data</a>, once we have the working set, we can get the top 20 &quot;New York&quot; occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy.  (The hardware is two boxes with two quad-core Xeon 5345 each.)</p>

<p>If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy.  This gets about 200K &quot;New York&quot; strings, which become about 400K entities with New York somewhere, for which a rank then has to be retrieved.  After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups.  These are then sorted.  This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort.</p>

<p>By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions.  There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously.  In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous.</p>

<p>Doing <code>JOINs</code> between partitions and partitioned <code>GROUP BY</code>/<code>ORDER BY</code> is pretty regular database stuff. Applying this to RDF is a most natural thing.</p>

<p>If we do not parallelize the user-defined aggregate for grouping all the &quot;New York&quot; occurrences, the query takes 8s instead of 1.1s.  If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we&#39;d have to either bring all the data to a central point before the top k, which would destroy performance, or we would have to do procedures with explicit parallel procedure calls which is hard to write, surely too hard for <i>ad hoc</i> queries.</p>

<p><a href="http://bit.ly/4jAVHC" id="link-id114d58f0">Results of live execution</a> may not be complete on initial load, as this link includes a &quot;Virtuoso Anytime&quot; timeout of 10 seconds.  Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above.</p>

<p>Engineering matters.  If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints.  If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-08-19#1570">
  <rss:title>More On Parallel RDF/Text Query Evaluation </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-08-19T17:28:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have received some more questions about Virtuoso&#39;s parallel query evaluation model. In answer, we will here explain how we do search engine style processing by writing SPARQL. There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce. The point is that what used to require programming can often be done in a generic query language. The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit. But by combining these two things, we are a step closer to the web being the database. I will here show how we do some joins combining full text, RDF conditions, and aggregates and ORDER BY. The sample task is finding the top 20 entities with New York in some attribute value. Then we specify the search further by only taking actors associated with New York. The results are returned in the order of a composite of entity rank and text match score. The basic query is: SELECT ( sql:s_sum_page ( &lt;sql:vector_agg&gt; ( &lt;bif:vector&gt; ( ?c1 , ?sm ) ), bif:vector ( &#39;new&#39;, &#39;york&#39; ) ) ) AS ?res WHERE { { SELECT ( &lt;SHORT_OR_LONG::&gt;(?s1) ) AS ?c1 ( &lt;sql:S_SUM&gt; ( &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 ) , &lt;SHORT_OR_LONG::&gt; ( ?s1textp ) , &lt;SHORT_OR_LONG::&gt; ( ?o1 ) , ?sc ) ) AS ?sm WHERE { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) } ORDER BY DESC ( &lt;sql:sum_rank&gt; (( &lt;sql:S_SUM&gt; ( &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 ) , &lt;SHORT_OR_LONG::&gt; ( ?s1textp ) , &lt;SHORT_OR_LONG::&gt; ( ?o1 ) , ?sc ) )) ) LIMIT 20 } } This takes some explaining. The basic part is { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) } This just makes tuples where ?s1 is the subject, ?s1textp the property, and ?o1 the literal which contains &quot;New York&quot;. 
For a single ?s1, there can of course be many properties which all contain &quot;New York&quot;. The rest of the query gathers all the &quot;New York&quot; containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities. After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between &quot;New York&quot; and the strings containing &quot;New York&quot;. The text hit score is higher if the words repeat often and in close proximity. The s_sum function is a user-defined aggregate which takes 4 arguments: The rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score. These are grouped by the subject of the triple. After this, the groups are sorted by the sum_rank of the aggregate constructed with s_sum. The sum_rank is a SQL function combining the entity rank with the text scores of the different literals. This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple. The text index entries of an object are stored in the same partition as the object. But the entity rank is a property of the subject and is partitioned by the subject. Also the GROUP BY is by the subject. Thus the data is produced from all partitions, then streamed into the receiving partitions, determined by the subject. This partition can then get the score and group the matches by the subject. Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top k sort can be done for each partition separately. Finally, the top 20 of each partition are merged into the global top 20. This is then passed to a final function s_sum_page that turns this all into an XML fragment that can be processed with XSLT for inclusion on a web page. 
This differs from a text search engine in that the query pipeline can contain arbitrary cross-partition joins. Also, the string &quot;New York&quot; is a common label that occurs in many distinct entities. Thus one text match, to one document containing only the string &quot;New York&quot;, will get many entities, likely all from different partitions. So, if we only want actors with a mention of &quot;New York&quot;, we need to get the inner part of the query as: { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) . ?s1 a &lt;http://umbel.org/umbel/sc/Actor&gt; } Whether an entity is an actor can be checked in the same partition as the rank of the entity. Thus the query plan gets this check right before getting the rank. This is natural since there is no point in getting the rank of something that is not an actor. The &lt;short_or_long::sql:func&gt; notation means that we call func, which is a SQL stored procedure with the arguments in their internal form. Thus, if a variable bound to an IRI is passed, the short_or_long specifies that it is passed as its internal ID and is not converted into its text form. This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end. Now, when we run this on a collection of 4.5 billion triples of linked data, once we have the working set, we can get the top 20 &quot;New York&quot; occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy. (The hardware is two boxes with two quad-core Xeon 5345 each.) If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy. This gets about 200K &quot;New York&quot; strings, which become about 400K entities with New York somewhere, for which a rank then has to be retrieved. 
After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups. These are then sorted. This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort. By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions. There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously. In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous. Doing JOINs between partitions and partitioned GROUP BY/ORDER BY is pretty regular database stuff. Applying this to RDF is a most natural thing. If we do not parallelize the user-defined aggregate for grouping all the &quot;New York&quot; occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we&#39;d have to either bring all the data to a central point before the top k, which would destroy performance, or we would have to do procedures with explicit parallel procedure calls which is hard to write, surely too hard for ad hoc queries. Results of live execution may not be complete on initial load, as this link includes a &quot;Virtuoso Anytime&quot; timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above. Engineering matters. 
If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have received some more questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x266cd288">Virtuoso</a>&#39;s parallel query evaluation model.</p>

<p>In answer, we will here explain how we do search engine style processing by writing <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x23c628b8">SPARQL</a>.  There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce.</p>

<p>The point is that what used to require programming can often be done in a generic query language.  The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit.  But by combining these two things, we are a step closer to the web being the database.</p>

<p>I will here show how we do some joins combining full text, <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x22ff08b0">RDF</a> conditions, and aggregates and <code>ORDER BY</code>.  The sample task is finding the top 20 entities with New York in some attribute value.  Then we specify the search further by only taking actors associated with New York.  The results are returned in the order of a composite of <a href="http://dbpedia.org/resource/Entity" id="link-id0x22da5258">entity</a> rank and text match score.</p>
 
<p>The basic query is:</p>

<blockquote>
 <code><pre>
SELECT 
  ( 
    <a href="http://dbpedia.org/resource/SQL" id="link-id0x237a6530">sql</a>:s_sum_page 
      ( 
        &lt;sql:vector_agg&gt; 
          (
            &lt;bif:vector&gt; ( ?c1 , ?sm )
          ), 
        bif:vector 
          ( &#39;new&#39;, &#39;york&#39; )
      )
  ) AS ?res
WHERE 
  {
    { 
      SELECT 
        ( 
          &lt;SHORT_OR_LONG::&gt;(?s1) 
        ) AS ?c1
        ( 
          &lt;sql:S_SUM&gt; 
            (
               &lt;SHORT_OR_LONG::IRI_RANK&gt;  ( ?s1 )      ,
               &lt;SHORT_OR_LONG::&gt;          ( ?s1textp ) ,
               &lt;SHORT_OR_LONG::&gt;          ( ?o1 ) ,
               ?sc 
             )
         ) AS ?sm
      WHERE 
        { 
          ?s1  ?s1textp      ?o1             . 
          ?o1  bif:contains  &quot;new AND york&quot; 
            OPTION ( SCORE ?sc )
        }
      ORDER BY 
        DESC 
          ( 
            &lt;sql:sum_rank&gt; 
              ((
                 &lt;sql:S_SUM&gt; 
                   (
                     &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 )      ,
                     &lt;SHORT_OR_LONG::&gt;         ( ?s1textp ) ,
                     &lt;SHORT_OR_LONG::&gt;         ( ?o1 )      ,
                     ?sc 
                   ) 
              )) 
          ) 
        LIMIT 20 
    } 
  }
</pre>
 </code>
</blockquote>

<p>This takes some explaining.  The basic part is</p>

<blockquote>
 <code><pre>{ 
  ?s1  ?s1textp      ?o1             . 
  ?o1  bif:contains  &quot;new AND york&quot;  
    OPTION ( SCORE ?sc )
}</pre>
 </code>
</blockquote>
          
<p>This just makes tuples where <code>?s1</code> is the subject, <code>?s1textp</code> the property, and <code>?o1</code> the literal which contains &quot;New York&quot;.  For a single <code>?s1</code>, there can of course be many properties which all contain &quot;New York&quot;.</p>

<p>The rest of the query gathers all the &quot;New York&quot; containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities.</p>

<p>After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between &quot;New York&quot; and the strings containing &quot;New York&quot;.  The text hit score is higher if the words repeat often and in close proximity.</p>

<p>The <code>s_sum</code> function is a user-defined aggregate which takes 4 arguments: The rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score.</p>

<p>These are grouped by the subject of the triple.  After this, the groups are sorted by the <code>sum_rank</code> of the aggregate constructed with <code>s_sum</code>.  The <code>sum_rank</code> is a SQL function combining the entity rank with the text scores of the different literals.</p>
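
<p>To make the shape of such an aggregate concrete, here is a minimal Python sketch.  It is not Virtuoso&#39;s implementation: the class <code>SubjectSummary</code>, the function <code>combined_rank</code>, and the additive formula are assumptions made for illustration, since the actual scoring formula is not given here.</p>

```python
from dataclasses import dataclass, field


@dataclass
class SubjectSummary:
    """State of a hypothetical s_sum-like aggregate for one subject."""
    rank: float = 0.0
    hits: list = field(default_factory=list)  # (predicate, literal, text_score)

    def accumulate(self, rank, predicate, literal, text_score):
        # The entity rank is the same for every row of a subject; keep it once.
        self.rank = rank
        self.hits.append((predicate, literal, text_score))


def combined_rank(summary):
    # Assumed formula: entity rank plus the sum of the text match scores.
    return summary.rank + sum(score for _, _, score in summary.hits)


def aggregate(rows):
    """rows: iterable of (subject, rank, predicate, literal, text_score)."""
    groups = {}
    for subject, rank, predicate, literal, score in rows:
        summary = groups.setdefault(subject, SubjectSummary())
        summary.accumulate(rank, predicate, literal, score)
    return groups


rows = [
    ("s1", 2.0, "title", "New York", 10.0),
    ("s1", 2.0, "abstract", "New York, New York", 15.0),
    ("s2", 9.0, "label", "York, New", 1.0),
]
groups = aggregate(rows)
# Subjects ordered by descending combined rank.
ordered = sorted(groups, key=lambda s: combined_rank(groups[s]), reverse=True)
```

<p>The aggregate keeps every matching predicate and literal, not just a number, so that a later step can still build a text summary from the group.</p>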

<p>This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple.  The text index entries of an object are stored in the same partition as the object.  But the entity rank is a property of the subject and is partitioned by the subject.  Also the <code>GROUP BY</code> is by the subject.  Thus the <a href="http://dbpedia.org/resource/Data" id="link-id0x24381030">data</a> is produced from all partitions, then streamed into the receiving partitions, determined by the subject.  This partition can then get the score and group the matches by the subject.  Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top <code>k</code> sort can be done for each partition separately.  Finally, the top 20 of each partition are merged into the global top 20.  This is then passed to a final function <code>s_sum_page</code> that turns this all into an <a href="http://dbpedia.org/resource/XML" id="link-id0x2363d6c0">XML</a> fragment that can be processed with XSLT for inclusion on a web page.</p>
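
<p>The partition-local top <code>k</code> followed by a global merge can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not the engine&#39;s code: the partition count, the hash in <code>partition_of</code>, and the additive score are all hypothetical.</p>

```python
import heapq
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, for illustration only
TOP_K = 20


def partition_of(subject):
    # Hypothetical placement rule: a stable hash of the subject picks the partition.
    return sum(subject.encode("utf-8")) % NUM_PARTITIONS


def global_top_k(matches, entity_rank, k=TOP_K):
    """matches: iterable of (subject, text_score) pairs produced by any partition."""
    # 1. Stream each match to the partition that owns its subject.
    inbox = defaultdict(list)
    for subject, score in matches:
        inbox[partition_of(subject)].append((subject, score))

    # 2. Each partition groups by subject and keeps only its local top k.
    local_tops = []
    for rows in inbox.values():
        groups = defaultdict(float)
        for subject, score in rows:
            groups[subject] += score
        ranked = [(entity_rank.get(s, 0.0) + total, s) for s, total in groups.items()]
        local_tops.append(heapq.nlargest(k, ranked))

    # 3. A subject lives in exactly one partition, so merging the local
    #    top-k lists is enough to obtain the global top k.
    return heapq.nlargest(k, (item for top in local_tops for item in top))
```

<p>Because no group is split across partitions, each partition sends at most <code>k</code> rows to the final merge, however many groups it holds.</p>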

<p>This differs from a text search engine in that the query pipeline can contain arbitrary cross-partition joins.  Also, the string &quot;New York&quot; is a common label that occurs in many distinct entities.  Thus one text match, to one document containing only the string &quot;New York&quot;, will get many entities, likely all from different partitions.</p>

<p>So, if we only want actors with a mention of &quot;New York&quot;, we need to get the inner part of the query as:</p>

<blockquote>
 <code><pre>{ 
  ?s1  ?s1textp      ?o1            . 
  ?o1  bif:contains  &quot;new AND york&quot;  
    OPTION ( SCORE ?sc )              . 
  ?s1  a             &lt;<a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x237110b8">http</a>://<a href="http://umbel.org/about/" id="link-id0x2318e198">umbel</a>.org/umbel/sc/Actor&gt; 
}</pre>
 </code>
</blockquote>

<p>Whether an entity is an actor can be checked in the same partition as the rank of the entity.  Thus the query plan gets this check right before getting the rank.  This is natural since there is no point in getting the rank of something that is not an actor.</p>

<p>The <code>&lt;short_or_long::sql:func&gt;</code> notation means that we call <code>func</code>, which is a SQL stored procedure with the arguments in their internal form.  Thus, if a variable bound to an IRI is passed, the <code>short_or_long</code> specifies that it is passed as its internal ID and is not converted into its text form.  This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end.</p>
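
<p>The benefit of ranking on internal IDs and converting only the winners to text can be sketched as follows.  This is an illustrative Python sketch, not Virtuoso code: the <code>IRI_TEXT</code> table, the <code>iri_text</code> helper, and the example IRIs are all hypothetical.</p>

```python
import heapq

# Hypothetical dictionary from internal IRI IDs to their text form.
IRI_TEXT = {
    101: "http://example.org/entity/a",
    102: "http://example.org/entity/b",
    103: "http://example.org/entity/c",
    104: "http://example.org/entity/d",
}

lookups = []  # records every ID that was actually converted to text


def iri_text(iri_id):
    # Stands in for the comparatively expensive ID-to-string lookup.
    lookups.append(iri_id)
    return IRI_TEXT[iri_id]


def top_k_as_text(scored_ids, k):
    """scored_ids: iterable of (score, iri_id); only the k winners are resolved."""
    winners = heapq.nlargest(k, scored_ids)
    return [(iri_text(iri_id), score) for score, iri_id in winners]


result = top_k_as_text([(0.5, 101), (3.0, 102), (1.5, 103), (2.0, 104)], 2)
```

<p>All sorting happens on the compact IDs; the string lookup runs once per returned row rather than once per candidate.</p>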

<p>Now, when we run this on a collection of 4.5 billion triples of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x24381160">linked data</a>, once we have the working set, we can get the top 20 &quot;New York&quot; occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy.  (The hardware is two boxes with two quad-core Xeon 5345 each.)</p>

<p>If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy.  This gets about 200K &quot;New York&quot; strings, which become about 400K entities with New York somewhere, for which a rank then has to be retrieved.  After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups.  These are then sorted.  This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort.</p>

<p>By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions.  There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously.  In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous.</p>

<p>Doing <code>JOINs</code> between partitions and partitioned <code>GROUP BY</code>/<code>ORDER BY</code> is pretty regular database stuff. Applying this to RDF is a most natural thing.</p>

<p>If we do not parallelize the user-defined aggregate for grouping all the &quot;New York&quot; occurrences, the query takes 8s instead of 1.1s.  If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we&#39;d have to either bring all the data to a central point before the top k, which would destroy performance, or we would have to do procedures with explicit parallel procedure calls which is hard to write, surely too hard for <i>ad hoc</i> queries.</p>

<p><a href="http://bit.ly/4jAVHC" id="link-id114d58f0">Results of live execution</a> may not be complete on initial load, as this link includes a &quot;Virtuoso Anytime&quot; timeout of 10 seconds.  Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above.</p>

<p>Engineering matters.  If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints.  If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-08-14#1569">
  <rss:title>Updated hardware improves LUBM 8000 load rate in Virtuoso 6</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-08-14T19:01:30Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same. The load rate is now 160,739 triples-per-second. Virtuoso 6 (previous run): 1 blade; 2 x Xeon 5410; 16G 667 MHz memory; reported load rate 110,532 triples-per-second. Virtuoso 6 (new run): 1 blade; 2 x Xeon 5520; 72G 1333 MHz memory; reported load rate 160,739 triples-per-second. Virtuoso 6 (newest run): 2 blades; 2 x Xeon 5520 + 2 x Xeon 5410 with 1x1GigE interconnect; 72G 1333 MHz + 16G 667 MHz memory respectively; reported load rate 214,188 triples-per-second. Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We repeated the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1562" id="link-id173d3068">earlier LUBM 8000 experiment</a> on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.</p>

<p>The load rate is now 160,739 triples-per-second.</p>

<table>
<tr>
<th></th>
<td>   </td>
<th align="center"><a href="http://virtuoso.openlinksw.com" id="link-id0x240daf38">Virtuoso</a> 6 <br /> (previous run)</th>
<td>   </td>
<th align="center">Virtuoso 6 <br /> (new run)</th>
<td>   </td>
<th align="center">Virtuoso 6 <br /> (newest run)</th>
</tr>
<tr>
<td align="left">blades</td>
<td>   </td>
<td align="center">1 </td>
<td>   </td>
<td align="center">1 </td>
<td>   </td>
<td align="center">2</td>
</tr>
<tr>
<td align="left">processors</td>
<td>   </td>
<td align="center">2 x Xeon 5410</td>
<td>   </td>
<td align="center">2 x Xeon 5520</td>
<td>   </td>
<td align="center"> 2 x Xeon 5520 <br />+ <br />2 x Xeon 5410 <br />with 1x1GigE <br />interconnect </td>
</tr>
<tr>
<td align="left">memory</td>
<td>   </td>
<td align="center"> 16G 667 MHz</td>
<td>   </td>
<td align="center">72G 1333 MHz</td>
<td>   </td>
<td align="center">72G 1333 MHz <br />+ <br /> 16G 667 MHz <br /> respectively</td>
</tr>
<tr>
<td align="left">reported load rate<br />triples-per-second</td>
<td>   </td>
<td align="center"> 110,532 </td>
<td>   </td>
<td align="center"> 160,739 </td>
<td>   </td>
<td align="center"> 214,188  </td>
</tr>
</table>
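
<p>As a quick reading aid, the figures in the table can be turned into relative speedups. The sketch below uses only the three reported load rates; the dictionary keys and the "cluster efficiency" measure (the cluster rate divided by the sum of the two single-machine rates) are our own illustrative choices, not something measured separately.</p>

```python
# Reported load rates from the table above, in triples per second.
load_rates = {
    "previous": 110_532,  # 1 blade, 2 x Xeon 5410
    "new": 160_739,       # 1 blade, 2 x Xeon 5520
    "newest": 214_188,    # both machines as a cluster
}

baseline = load_rates["previous"]
speedups = {name: rate / baseline for name, rate in load_rates.items()}

# How much of the combined single-machine capacity the cluster reaches.
cluster_efficiency = load_rates["newest"] / (
    load_rates["previous"] + load_rates["new"]
)

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x the previous run")
print(f"cluster efficiency: {cluster_efficiency:.0%}")
```

<p>On these numbers the newer machine alone loads about 1.45 times as fast as the older one, and the two-machine cluster reaches roughly 79% of the sum of the two single-machine rates.</p>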

<p>Again, if others talk about loading LUBM, so must we.  Otherwise, this metric is rather uninteresting.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-08-14#1568">
  <rss:title>Updated hardware improves LUBM 8000 load rate in Virtuoso 6</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-08-14T19:01:30Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same. The load rate is now 160,739 triples-per-second. Virtuoso 6 (previous run): 1 blade; 2 x Xeon 5410; 16G 667 MHz memory; reported load rate 110,532 triples-per-second. Virtuoso 6 (new run): 1 blade; 2 x Xeon 5520; 72G 1333 MHz memory; reported load rate 160,739 triples-per-second. Virtuoso 6 (newest run): 2 blades; 2 x Xeon 5520 + 2 x Xeon 5410 with 1x1GigE interconnect; 72G 1333 MHz + 16G 667 MHz memory respectively; reported load rate 214,188 triples-per-second. Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We repeated the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1562" id="link-id173d3068">earlier LUBM 8000 experiment</a> on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.</p>

<p>The load rate is now 160,739 triples-per-second.</p>

<table>
<tr>
<th></th>
<td>   </td>
<th align="center"><a href="http://virtuoso.openlinksw.com" id="link-id0x199b9740">Virtuoso</a> 6 <br /> (previous run)</th>
<td>   </td>
<th align="center">Virtuoso 6 <br /> (new run)</th>
<td>   </td>
<th align="center">Virtuoso 6 <br /> (newest run)</th>
</tr>
<tr>
<td align="left">blades</td>
<td>   </td>
<td align="center">1 </td>
<td>   </td>
<td align="center">1 </td>
<td>   </td>
<td align="center">2</td>
</tr>
<tr>
<td align="left">processors</td>
<td>   </td>
<td align="center">2 x Xeon 5410</td>
<td>   </td>
<td align="center">2 x Xeon 5520</td>
<td>   </td>
<td align="center"> 2 x Xeon 5520 <br />+ <br />2 x Xeon 5410 <br />with 1x1GigE <br />interconnect </td>
</tr>
<tr>
<td align="left">memory</td>
<td>   </td>
<td align="center"> 16G 667 MHz</td>
<td>   </td>
<td align="center">72G 1333 MHz</td>
<td>   </td>
<td align="center">72G 1333 MHz <br />+ <br /> 16G 667 MHz <br /> respectively</td>
</tr>
<tr>
<td align="left">reported load rate<br />triples-per-second</td>
<td>   </td>
<td align="center"> 110,532 </td>
<td>   </td>
<td align="center"> 160,739 </td>
<td>   </td>
<td align="center"> 214,188  </td>
</tr>
</table>

<p>Again, if others talk about loading LUBM, so must we.  Otherwise, this metric is rather uninteresting.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-06-29#1563">
  <rss:title>Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-06-29T16:12:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso. The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes. The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy. The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made. We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>LUBM load speed still seems to be a metric that is quoted in comparisons of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id142df6e8">RDF</a> stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest <a href="http://virtuoso.openlinksw.com" id="link-id1389dfa0">Virtuoso</a>.</p>
 
<p>The real time for the load was 161m 3s.  The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM.  The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core.  Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.</p>
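
<p>The reported rate follows directly from the triple count and the wall-clock time. The short check below assumes the rounded figure of 1,068 million triples quoted above, so it lands within a few triples-per-second of the reported 110,532 rather than exactly on it.</p>

```python
# Figures quoted in the post; the triple count is the rounded value.
triples = 1_068_000_000
elapsed_seconds = 161 * 60 + 3  # 161m 3s of real time

rate = triples / elapsed_seconds
print(f"{elapsed_seconds} s, {rate:,.0f} triples per second")
```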
 
<p>The load was done on 8 streams, one per server process.   At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.</p>
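
<p>Since 100% stands for one fully busy core, the percentages above can be normalized to the 8 cores of the machine. The helper below is a hypothetical convenience function for that conversion, not part of any Virtuoso tooling.</p>

```python
CORES = 8  # 2 x quad-core Xeon 5410


def utilization(cpu_percent, cores=CORES):
    """Convert an OS-style CPU figure (100% per core) to a 0..1 fraction."""
    return cpu_percent / (100 * cores)


start = utilization(740)  # beginning of the load
end = utilization(700)    # end of the load, with some disk wait
print(f"start: {start:.1%}, end: {end:.1%}")
```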
 
<p>The RDF store was configured with the default two indices over quads, these being GSPO and OGPS.  Text indexing of literals was not enabled.  No materialization of entailed triples was made.</p>
 
<p>We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-06-29#1562">
  <rss:title>Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-06-29T16:12:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso. The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes. The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy. The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made. We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>LUBM load speed still seems to be a metric that is quoted in comparisons of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id142df6e8">RDF</a> stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest <a href="http://virtuoso.openlinksw.com" id="link-id1389dfa0">Virtuoso</a>.</p>
 
<p>The real time for the load was 161m 3s.  The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM.  The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core.  Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.</p>
 
<p>The load was done on 8 streams, one per server process.   At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.</p>
 
<p>The RDF store was configured with the default two indices over quads, these being GSPO and OGPS.  Text indexing of literals was not enabled.  No materialization of entailed triples was made.</p>
 
<p>We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-05-28#1558">
  <rss:title>Comparing Virtuoso Performance on Different Processors</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-05-28T14:54:59Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Over the years we have run Virtuoso on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso. Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal. The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache. All times reported are real times, starting from the start of the first client and ending with the completion of the last client. Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable. Results, listed as test, platform, load time, run time, GHz / cores / threads: (1) Amazon EC2 Extra Large (4 virtual cores), 340 s, 42 s, 1.2 GHz? / 4 / 1; (1) Amazon EC2 Extra Large (4 virtual cores), 305 s, 43.3 s, 1.2 GHz? / 4 / 1; (2) 1 x dual-core AMD 5900, 263 s, 58.2 s, 2.9 GHz / 2 / 1; (3) 2 x dual-core Xeon 5130 (&quot;Woodcrest&quot;), 245 s, 35.7 s, 2.0 GHz / 4 / 1; (4) 2 x quad-core Xeon 5410 (&quot;Harpertown&quot;), 237 s, 18.0 s, 2.33 GHz / 8 / 1; (5) 2 x quad-core Xeon 5520 (&quot;Nehalem&quot;), 162 s, 18.3 s, 2.26 GHz / 8 / 2. We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8G. The 3 Xeon configurations are Supermicro boards with 667MHz memory for the Xeon 5130 (&quot;Woodcrest&quot;) and Xeon 5410 (&quot;Harpertown&quot;), and 800MHz memory for the Nehalem. 
The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 (&quot;Nehalem&quot;), 72 GB RAM, and 8 x 500 GB SATA disks. Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We&#39;ll see another time with faster memories. The operating systems were various 64 bit Linux distributions. We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel. We tried some RDF operations on the two last systems: operation Harpertown Nehalem Build text index for DBpedia 1080s 770s Entity Rank iteration 263s 251s Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected. For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s. Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. 
Running multiple such queries should show effects of core multithreading since we expect frequent cache misses. On the host OS of the Nehalem system (n, cpu%, rows per second): 1 query, 503, 906,413; 2 queries, 1263, 1,578,585; 3 queries, 1204, 1,566,849. In a VM under Xen, on the Nehalem system (n, cpu%, rows per second): 1 query, 652, 799,293; 2 queries, 1266, 1,486,710; 3 queries, 1222, 1,484,093. On the host OS of the Harpertown system (n, cpu%, rows per second): 1 query, 648, 1,041,448; 2 queries, 708, 1,124,866. The CPU percentages are as reported by the OS: user + system CPU divided by real time. So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache (12 MB vs 8 MB). We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, also tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O. The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load. The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got was at 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory, else the best case will not be over 50% over a 667 MHz Core 2-based Xeon. Nehalem remains a better deal for us because of more memory per board. 
One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen. If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Over the years we have run <a href="http://virtuoso.openlinksw.com" id="link-id0x16735e20">Virtuoso</a> on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.</p>

<p>Our test is very simple: <i>Load 20 warehouses of <a href="http://dbpedia.org/resource/TPC-C" id="link-id0x16e0dba8">TPC-C</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x14ff4f80">data</a>, and then run one client per warehouse for 10,000 new orders</i>. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.</p>

<p>The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.</p>

<p>All times reported are real times, starting from the start of the first client and ending with the completion of the last client.</p>

<p>Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.</p>

    <style type="text/css">
      TABLE  { background: none; border: none }
      TH     { text-align: center; font-weight: bold }
      TR.top { background:  }
      TD     { text-align: center; border: none }
    </style>

<table align="center" cellspacing="10">
<tr>
  <th>Test</th>
  <th>Platform</th>
  <th>Load<br />(seconds)</th>
  <th>Run<br />(seconds)</th>
  <th>GHz / cores / threads</th>
</tr>
<tr>
  <td>1</td>
  <td>Amazon <a href="http://aws.amazon.com/ec2/" id="link-id0x15d68e20">EC2</a> Extra Large<br />(4 virtual cores)</td>
  <td>340</td>
  <td>42</td>
  <td>1.2 GHz? / 4 / 1</td>
</tr>
<tr>
  <td>1</td>
  <td>Amazon EC2 Extra Large<br />(4 virtual cores)</td>
  <td>305</td>
  <td>43.3</td>
  <td>1.2 GHz? / 4 / 1</td>
</tr>
<tr>
  <td>2</td>
  <td>1 x dual-core AMD 5900</td>
  <td>263</td>
  <td>58.2</td>
  <td>2.9 GHz / 2 / 1</td>
</tr>
<tr>
  <td>3</td>
  <td>2 x dual-core Xeon 5130 (&quot;Woodcrest&quot;)</td>
  <td>245</td>
  <td>35.7</td>
  <td>2.0 GHz / 4 / 1</td>
</tr>
<tr>
  <td>4</td>
  <td>2 x quad-core Xeon 5410 (&quot;Harpertown&quot;)</td>
  <td>237</td>
  <td>18.0</td>
  <td>2.33 GHz / 8 / 1</td>
</tr>
<tr>
  <td>5</td>
  <td>2 x quad-core Xeon 5520 (&quot;Nehalem&quot;)</td>
  <td>162</td>
  <td>18.3</td>
  <td>2.26 GHz / 8 / 2</td>
</tr>
</table>

<p>We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8G. The 3 Xeon configurations are Supermicro boards with 667MHz memory for the Xeon 5130 (&quot;Woodcrest&quot;) and Xeon 5410 (&quot;Harpertown&quot;), and 800MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 (&quot;Nehalem&quot;), 72 GB RAM, and 8 x 500 GB SATA disks.</p>
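
<p>To compare platforms, it helps to express the table as ratios. The sketch below uses only the load and run times reported above; the dictionary keys and the <code>speedup</code> helper are illustrative names of our own. A ratio above 1 means the second platform finished sooner than the first.</p>

```python
# (load seconds, run seconds) as reported in the table above.
results = {
    "ec2_a": (340, 42.0),
    "ec2_b": (305, 43.3),
    "amd": (263, 58.2),
    "woodcrest": (245, 35.7),
    "harpertown": (237, 18.0),
    "nehalem": (162, 18.3),
}


def speedup(baseline, other):
    """Return (load ratio, run ratio) of baseline time over other time."""
    base_load, base_run = results[baseline]
    other_load, other_run = results[other]
    return base_load / other_load, base_run / other_run


load_ratio, run_ratio = speedup("harpertown", "nehalem")
print(f"Nehalem vs Harpertown: load {load_ratio:.2f}x, run {run_ratio:.2f}x")
```

<p>With these figures Nehalem is about 1.46 times as fast on the single-threaded load but marginally slower (about 0.98x) on the 20-client run.</p>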

<p>
<i>Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We&#39;ll see another time with faster memories.</i>
</p>

<p>The operating systems were various 64 bit Linux distributions.</p>

<p>We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.</p>

<p>We tried some <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1460b688">RDF</a> operations on the two last systems:</p>

<table align="center" cellspacing="10">
<tr>
  <th>operation</th>
  <th> Harpertown</th>
  <th>Nehalem</th>
</tr>

<tr>
  <th>Build text index for <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x16a94590">DBpedia</a></th>
  <td>1080s</td>
  <td>770s</td>
</tr>
<tr>
  <th><a href="http://dbpedia.org/resource/Entity" id="link-id0xc37f380">Entity</a> Rank iteration</th>
  <td>263s</td>
  <td>251s</td>
</tr>
</table>

<p>Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in <a href="http://dbpedia.org/resource/SQL" id="link-id0x15842a20">SQL</a> to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.</p>

<p>For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.</p>

<p>Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.</p>

<ol>
<li>On the host OS of the Nehalem system:
<table align="center" cellspacing="10">
<tr>
      <th>n</th>
      <th>cpu%</th>
      <th>rows per second</th>
    </tr>
<tr>
      <th>1 query</th>
      <td>503</td>
      <td>906,413</td>
    </tr>
<tr>
      <th>2 queries</th>
      <td>1263</td>
      <td>1,578,585</td>
    </tr>
<tr>
      <th>3 queries</th>
      <td>1204</td>
      <td>1,566,849</td>
    </tr>
</table>
</li>
<li>In a VM under Xen, on the Nehalem system:
<table align="center" cellspacing="10">
<tr>
      <th>n</th>
      <th>cpu%</th>
      <th>rows per second</th>
    </tr>
<tr>
      <th>1 query</th>
      <td>652</td>
      <td>799,293</td>
    </tr>
<tr>
      <th>2 queries</th>
      <td>1266</td>
      <td>1,486,710</td>
    </tr>
<tr>
      <th>3 queries</th>
      <td>1222</td>
      <td>1,484,093</td>
    </tr>
</table>
</li>
<li>On the host OS of the Harpertown system:
<table align="center" cellspacing="10">
<tr>
      <th>n</th>
      <th>cpu%</th>
      <th>rows per second</th>
    </tr>
<tr>
      <th>1 query</th>
      <td> 648 </td>
      <td> 1,041,448 </td>
    </tr>
<tr>
      <th>2 queries</th>
      <td> 708 </td>
      <td> 1,124,866 </td>
    </tr>
</table>
</li>
</ol>

<p>The CPU percentages are as reported by the OS: user + system CPU divided by real time.</p>

<p>So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache (12 MB vs 8 MB).</p>

<p>We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, also tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.</p>
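
<p>The throughput tables above can be condensed into two kinds of ratios: the gain from running more than one query at a time, and the throughput lost under Xen. The sketch below uses only the reported rows-per-second figures; the variable and function names are our own. Note that a single query on the Nehalem host used only 503% CPU, so the two-query gain reflects idle cores being filled as well as core multithreading.</p>

```python
# Rows per second by number of concurrent queries, from the tables above.
nehalem_host = {1: 906_413, 2: 1_578_585, 3: 1_566_849}
nehalem_xen = {1: 799_293, 2: 1_486_710, 3: 1_484_093}
harpertown_host = {1: 1_041_448, 2: 1_124_866}


def gain(rates, n):
    """Throughput with n concurrent queries relative to a single query."""
    return rates[n] / rates[1]


def xen_overhead(n):
    """Fraction of host throughput lost when running n queries under Xen."""
    return 1 - nehalem_xen[n] / nehalem_host[n]


print(f"Nehalem, 2 queries vs 1: {gain(nehalem_host, 2):.2f}x")
print(f"Harpertown, 2 queries vs 1: {gain(harpertown_host, 2):.2f}x")
for n in (1, 2, 3):
    print(f"Xen overhead with {n} queries: {xen_overhead(n):.1%}")
```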

<p>The executables were compiled with <code>gcc</code> with default settings. Specifying <code>-march=nocona</code> (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load.</p>

<p>The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got was at 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory, else the best case will not be over 50% over a 667 MHz Core 2-based Xeon.</p>

<p>Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.</p>
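
<p>The price side of this argument can be checked with the two prices quoted earlier in this post: 5000 US dollars for the Nehalem configuration with 72 GB, and 550 US dollars for the AMD dual-core with 8 GB. The sketch below is plain arithmetic on those two figures; it ignores space, power, and networking, and the variable names are our own.</p>

```python
# Prices (US dollars) and memory (GB) as quoted earlier in this post.
nehalem_price, nehalem_gb = 5000, 72
amd_price, amd_gb = 550, 8

nehalem_usd_per_gb = nehalem_price / nehalem_gb
amd_usd_per_gb = amd_price / amd_gb

# How many AMD boxes the price of one Nehalem box would buy.
amd_boxes = nehalem_price // amd_price
amd_total_gb = amd_boxes * amd_gb

print(f"Nehalem: {nehalem_usd_per_gb:.2f} USD per GB")
print(f"AMD: {amd_usd_per_gb:.2f} USD per GB")
print(f"{amd_boxes} AMD boxes give {amd_total_gb} GB for {amd_boxes * amd_price} USD")
```

<p>On these prices the cost per GB is nearly identical, so the choice comes down to density and to how much processor power is attached to each GB.</p>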

<p>If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-05-28#1557">
  <rss:title>Comparing Virtuoso Performance on Different Processors</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-05-28T14:54:59Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Over the years we have run Virtuoso on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso. Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal. The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache. All times reported are real times, starting from the start of the first client and ending with the completion of the last client. Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable. Results, listed as test, platform, load time, run time, GHz / cores / threads: (1) Amazon EC2 Extra Large (4 virtual cores), 340 s, 42 s, 1.2 GHz? / 4 / 1; (1) Amazon EC2 Extra Large (4 virtual cores), 305 s, 43.3 s, 1.2 GHz? / 4 / 1; (2) 1 x dual-core AMD 5900, 263 s, 58.2 s, 2.9 GHz / 2 / 1; (3) 2 x dual-core Xeon 5130 (&quot;Woodcrest&quot;), 245 s, 35.7 s, 2.0 GHz / 4 / 1; (4) 2 x quad-core Xeon 5410 (&quot;Harpertown&quot;), 237 s, 18.0 s, 2.33 GHz / 8 / 1; (5) 2 x quad-core Xeon 5520 (&quot;Nehalem&quot;), 162 s, 18.3 s, 2.26 GHz / 8 / 2. We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8G. The 3 Xeon configurations are Supermicro boards with 667MHz memory for the Xeon 5130 (&quot;Woodcrest&quot;) and Xeon 5410 (&quot;Harpertown&quot;), and 800MHz memory for the Nehalem. 
The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 (&quot;Nehalem&quot;), 72 GB RAM, and 8 x 500 GB SATA disks. Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We&#39;ll see another time with faster memories. The operating systems were various 64 bit Linux distributions. We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel. We tried some RDF operations on the two last systems: operation Harpertown Nehalem Build text index for DBpedia 1080s 770s Entity Rank iteration 263s 251s Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected. For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s. Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. 
Running multiple such queries should show effects of core multithreading since we expect frequent cache misses. On the host OS of the Nehalem system: 1 query, 503% CPU, 906,413 rows per second; 2 queries, 1263% CPU, 1,578,585 rows per second; 3 queries, 1204% CPU, 1,566,849 rows per second. In a VM under Xen, on the Nehalem system: 1 query, 652% CPU, 799,293 rows per second; 2 queries, 1266% CPU, 1,486,710 rows per second; 3 queries, 1222% CPU, 1,484,093 rows per second. On the host OS of the Harpertown system: 1 query, 648% CPU, 1,041,448 rows per second; 2 queries, 708% CPU, 1,124,866 rows per second. The CPU percentages are as reported by the OS: user + system CPU divided by real time. So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache (12 MB vs 8 MB). We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, including tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O. The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect setting such affinity to increase performance since we have many equal-sized processes with even load. The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got had 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory; otherwise the best case will be no more than 50% better than a 667 MHz Core 2-based Xeon. Nehalem remains a better deal for us because of more memory per board. 
One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen. If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Over the years we have run <a href="http://virtuoso.openlinksw.com" id="link-id0xd420b90">Virtuoso</a> on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.</p>

<p>Our test is very simple: <i>Load 20 warehouses of <a href="http://dbpedia.org/resource/TPC-C" id="link-id0xdaaec90">TPC-C</a> <a href="http://dbpedia.org/resource/Data" id="link-id0xca1b7e0">data</a>, and then run one client per warehouse for 10,000 new orders</i>. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.</p>

<p>The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20-client run gives the multi-threaded number. The test uses about 2-3 GB of data, so everything fits in RAM, but the data set is too large to fit entirely in processor cache.</p>

<p>All times reported are real times, starting from the start of the first client and ending with the completion of the last client.</p>
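
<p>The timing protocol can be sketched as follows. This is only an illustration in Python, not the actual test driver: <code>run_clients</code> and <code>demo_work</code> are hypothetical names, and the sleep stands in for the real new-order workload.</p>

<pre>
```python
import threading
import time


def run_clients(n_clients, work):
    """Run work(client_id) on n_clients threads and return wall-clock seconds.

    The clock starts just before the first client is started and stops
    once the last client has finished.
    """
    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_clients)]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - started


def demo_work(client_id):
    # Hypothetical stand-in for "10,000 new orders against one warehouse".
    time.sleep(0.01)


elapsed = run_clients(20, demo_work)
print(f"20 clients finished in {elapsed:.3f} s of real time")
```
</pre>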

<p>Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.</p>

    <style type="text/css">
      TABLE  { background: none; border: none }
      TH     { text-align: center; font-weight: bold }
      TR.top { background:  }
      TD     { text-align: center; border: none }
    </style>

<table align="center" cellspacing="10">
<tr>
  <th>Test</th>
  <th>Platform</th>
  <th>Load<br />(seconds)</th>
  <th>Run<br />(seconds)</th>
  <th>GHz / cores / threads</th>
</tr>
<tr>
  <td>1</td>
  <td>Amazon <a href="http://aws.amazon.com/ec2/" id="link-id0xdaab030">EC2</a> Extra Large<br />(4 virtual cores)</td>
  <td>340</td>
  <td>42</td>
  <td>1.2 GHz? / 4 / 1</td>
</tr>
<tr>
  <td>1</td>
  <td>Amazon EC2 Extra Large<br />(4 virtual cores)</td>
  <td>305</td>
  <td>43.3</td>
  <td>1.2 GHz? / 4 / 1</td>
</tr>
<tr>
  <td>2</td>
  <td>1 x dual-core AMD 5900</td>
  <td>263</td>
  <td>58.2</td>
  <td>2.9 GHz / 2 / 1</td>
</tr>
<tr>
  <td>3</td>
  <td>2 x dual-core Xeon 5130 (&quot;Woodcrest&quot;)</td>
  <td>245</td>
  <td>35.7</td>
  <td>2.0 GHz / 4 / 1</td>
</tr>
<tr>
  <td>4</td>
  <td>2 x quad-core Xeon 5410 (&quot;Harpertown&quot;)</td>
  <td>237</td>
  <td>18.0</td>
  <td>2.33 GHz / 8 / 1</td>
</tr>
<tr>
  <td>5</td>
  <td>2 x quad-core Xeon 5520 (&quot;Nehalem&quot;)</td>
  <td>162</td>
  <td>18.3</td>
  <td>2.26 GHz / 8 / 2</td>
</tr>
</table>

<p>We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8 GB of RAM. The 3 Xeon configurations are Supermicro boards with 667 MHz memory for the Xeon 5130 (&quot;Woodcrest&quot;) and Xeon 5410 (&quot;Harpertown&quot;), and 800 MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 (&quot;Nehalem&quot;), 72 GB RAM, and 8 x 500 GB SATA disks.</p>

<p>
<i>Caveat: Due to slow memory (we could not get faster memory in the time available), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., its memory subsystem. We will revisit this another time with faster memory.</i>
</p>

<p>The operating systems were various 64-bit Linux distributions.</p>

<p>We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.</p>

<p>We tried some <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xce85438">RDF</a> operations on the last two systems:</p>

<table align="center" cellspacing="10">
<tr>
  <th>Operation</th>
  <th>Harpertown</th>
  <th>Nehalem</th>
</tr>

<tr>
  <th>Build text index for <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xab826a8">DBpedia</a></th>
  <td>1080s</td>
  <td>770s</td>
</tr>
<tr>
  <th><a href="http://dbpedia.org/resource/Entity" id="link-id0xcbb9938">Entity</a> Rank iteration</th>
  <td>263s</td>
  <td>251s</td>
</tr>
</table>

<p>Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in <a href="http://dbpedia.org/resource/SQL" id="link-id0xcd62218">SQL</a> to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.</p>

<p>For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.</p>

<p>Then we took a more memory-intensive piece that read the RDF quads table in the order of one index and, for each row, checked that the same row exists on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random lookups per second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes at about 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.</p>
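
<p>The access pattern of this join can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Virtuoso code: the rows are synthetic, <code>build_indices</code> and <code>check_join</code> are hypothetical names, a sorted list stands in for the sequentially read index, and hash-partitioned sets stand in for the differently-partitioned index.</p>

<pre>
```python
import random
import time


def build_indices(n_rows, n_partitions, seed=0):
    """Build two hypothetical indices over the same synthetic rows.

    `ordered` stands in for the index that is scanned sequentially;
    `partitions` stands in for the second index, hash-partitioned so
    that each lookup lands in an effectively random partition.
    """
    rng = random.Random(seed)
    rows = [(rng.randrange(1000), rng.randrange(1000), i) for i in range(n_rows)]
    ordered = sorted(rows)
    partitions = [set() for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row) % n_partitions].add(row)
    return ordered, partitions


def check_join(ordered, partitions):
    """Scan one index in order and probe the other for every row.

    Returns (rows_matched, lookups_per_second).
    """
    n_partitions = len(partitions)
    matched = 0
    started = time.perf_counter()
    for row in ordered:
        if row in partitions[hash(row) % n_partitions]:
            matched += 1
    elapsed = max(time.perf_counter() - started, 1e-9)
    return matched, len(ordered) / elapsed


ordered, partitions = build_indices(100_000, 8)
matched, rate = check_join(ordered, partitions)
print(f"{matched} rows matched, {rate:,.0f} random lookups per second")
```
</pre>

<p>A single-threaded Python loop says nothing about the absolute numbers reported here; the sketch only shows why one side of the join is sequential and the other is dominated by random probes.</p>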

<ol>
<li>On the host OS of the Nehalem system:
<table align="center" cellspacing="10">
<tr>
      <th>n</th>
      <th>cpu%</th>
      <th>rows per second</th>
    </tr>
<tr>
      <th>1 query</th>
      <td>503</td>
      <td>906,413</td>
    </tr>
<tr>
      <th>2 queries</th>
      <td>1263</td>
      <td>1,578,585</td>
    </tr>
<tr>
      <th>3 queries</th>
      <td>1204</td>
      <td>1,566,849</td>
    </tr>
</table>
</li>
<li>In a VM under Xen, on the Nehalem system:
<table align="center" cellspacing="10">
<tr>
      <th>n</th>
      <th>cpu%</th>
      <th>rows per second</th>
    </tr>
<tr>
      <th>1 query</th>
      <td>652</td>
      <td>799,293</td>
    </tr>
<tr>
      <th>2 queries</th>
      <td>1266</td>
      <td>1,486,710</td>
    </tr>
<tr>
      <th>3 queries</th>
      <td>1222</td>
      <td>1,484,093</td>
    </tr>
</table>
</li>
<li>On the host OS of the Harpertown system:
<table align="center" cellspacing="10">
<tr>
      <th>n</th>
      <th>cpu%</th>
      <th>rows per second</th>
    </tr>
<tr>
      <th>1 query</th>
      <td> 648 </td>
      <td> 1,041,448 </td>
    </tr>
<tr>
      <th>2 queries</th>
      <td> 708 </td>
      <td> 1,124,866 </td>
    </tr>
</table>
</li>
</ol>

<p>The CPU percentages are as reported by the OS: user + system CPU divided by real time.</p>
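
<p>As a minimal sketch of that definition (the function name and the sample timings are made up for illustration and are not measurements from these tests):</p>

<pre>
```python
def cpu_percent(user_seconds, system_seconds, real_seconds):
    """CPU utilization as a percentage: (user + system) / real * 100.

    Values above 100 mean more than one core was busy on average.
    """
    return 100.0 * (user_seconds + system_seconds) / real_seconds


# Hypothetical timings, chosen only to illustrate the formula:
# 780 s user + 52 s system over 128 s of real time.
print(cpu_percent(780.0, 52.0, 128.0))  # 650.0
```
</pre>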

<p>So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache (12 MB vs 8 MB).</p>

<p>We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, including tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.</p>
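
<p>The ratios behind these remarks can be recomputed from the rows-per-second figures in the three tables. The helper names below are illustrative, and the calculation covers only the cross-partition join, not the other tests.</p>

<pre>
```python
# Rows per second from the three tables above, keyed by number of concurrent queries.
NEHALEM_HOST = {1: 906_413, 2: 1_578_585, 3: 1_566_849}
NEHALEM_XEN = {1: 799_293, 2: 1_486_710, 3: 1_484_093}
HARPERTOWN_HOST = {1: 1_041_448, 2: 1_124_866}


def scaling(rates, n):
    """Throughput with n concurrent queries relative to a single query."""
    return rates[n] / rates[1]


def overhead_percent(native, virtualized, n):
    """Throughput lost under virtualization, as a percentage of native."""
    return 100.0 * (1.0 - virtualized[n] / native[n])


for n in sorted(NEHALEM_XEN):
    loss = overhead_percent(NEHALEM_HOST, NEHALEM_XEN, n)
    print(f"{n} concurrent: Xen overhead {loss:.1f}%")
print(f"Nehalem, 2 queries vs 1: {scaling(NEHALEM_HOST, 2):.2f}x")
print(f"Harpertown, 2 queries vs 1: {scaling(HARPERTOWN_HOST, 2):.2f}x")
```
</pre>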

<p>The executables were compiled with <code>gcc</code> with default settings. Specifying <code>-march=nocona</code> (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect setting such affinity to increase performance since we have many equal-sized processes with even load.</p>

<p>The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got had 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory; otherwise the best case will be no more than 50% better than a 667 MHz Core 2-based Xeon.</p>

<p>Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.</p>
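
<p>The price and memory figures quoted earlier (550 US dollars for the 8 GB AMD box, 5000 US dollars for the 72 GB Nehalem configuration) can be put side by side with a little arithmetic. This is a rough sketch only: it ignores network gear, rack space, and power, the helper names are illustrative, and the 12-box cluster is hypothetical.</p>

<pre>
```python
# Figures quoted in this post (US dollars and GB of RAM).
AMD = {"price": 550, "ram_gb": 8}
NEHALEM = {"price": 5000, "ram_gb": 72}


def cluster_totals(box, count):
    """Total price and memory for `count` identical boxes."""
    return {"price": box["price"] * count, "ram_gb": box["ram_gb"] * count}


def gb_per_1000_usd(box):
    """Memory bought per 1000 US dollars."""
    return 1000.0 * box["ram_gb"] / box["price"]


twelve_amd = cluster_totals(AMD, 12)
print(twelve_amd)  # {'price': 6600, 'ram_gb': 96}
print(round(gb_per_1000_usd(AMD), 1), round(gb_per_1000_usd(NEHALEM), 1))  # 14.5 14.4
```
</pre>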

<p>If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1552">
  <rss:title>Short Recap of Virtuoso Basics (#3 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-30T15:49:53Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.) There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it. Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all. If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application-specific data structures as RDF applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can federate tables from heterogeneous DBMS. On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like. Installation is simple, just one exe and one config file. There is a huge amount of code in installers (application code, test suites, and such), but none of this is needed when you deploy. Scale goes from a 25 MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters. Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL. To condense further: Scalable Delivery of Linked Data; SPARQL and SQL (Arbitrary RDF Data + Relational; Also From 3rd Party RDBMS); Easy Deployment; Standard Interfaces (ODBC, JDBC, OLE DB, ADO.NET, XMLA; Jena, Sesame, etc.; All Web Protocols).</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(Third of five posts related to the <a href="http://www2009.org/" id="link-id0x14b582b8">WWW 2009</a> conference, held the week of April 20, 2009.)

</p>
<p>There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.</p>

<p>
<a href="http://virtuoso.openlinksw.com" id="link-id0x14bf48b8">Virtuoso</a> is a DBMS. We pitch it primarily to the <a href="http://dbpedia.org/resource/Data" id="link-id0x16bc4490">data</a> web space because this is where we see the emerging frontier. Virtuoso does both <a href="http://dbpedia.org/resource/SQL" id="link-id0x1223dc30">SQL</a> and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x170eec88">SPARQL</a> and can do both at large scale and high performance. The popular perception of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x15a05fc0">RDF</a> and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.</p>

<p>If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. <a href="http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews" id="link-id14ddc7c8">Rendering application-specific data structures as RDF</a> applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can <a href="http://docs.openlinksw.com/virtuoso/qsvdbsrv.html" id="link-id14aaea70">federate tables from heterogeneous DBMS</a>.</p>

<p>On top of this, there is a <a href="http://docs.openlinksw.com/virtuoso/qswebserver.html" id="link-id16fcde60">web server built in</a>, so that no extra server is needed for web services, web pages, and the like.</p>

<p>Installation is simple, just one exe and one config file. There is a huge amount of code in <a href="http://docs.openlinksw.com/virtuoso/installation.html" id="link-id16767b40">installers</a> (application code, test suites, and such), but none of this is needed when you deploy. Scale goes from a 25 MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.</p>

<p>Clusters (coming in Release 6) and SQL federation are <a href="http://download.openlinksw.com/download/product_matrix.vsp?p=l_os&amp;c=39&amp;df=16" id="link-id16722550">commercial only</a>; the rest can be had <a href="http://sourceforge.net/project/showfiles.php?group_id=161622" id="link-id131080a8">under GPL</a>.</p>

<p>To condense further:</p>

<ul>
<li>Scalable Delivery of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1060ad98">Linked Data</a>
</li>
<li>SPARQL and SQL
<ul>
    <li>Arbitrary RDF Data + Relational</li>
<li>Also From 3rd Party <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x16bbce60">RDBMS</a>
    </li>
  </ul>
</li>
<li>Easy Deployment </li>
<li>Standard Interfaces
<ul>
    <li>
      <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x12e284d8">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xb5e1400">JDBC</a>, OLE DB, <a href="http://dbpedia.org/resource/ADO.NET" id="link-id0x15a55db8">ADO</a>.<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x16beb070">NET</a>, XMLA</li>
<li>
      <a href="http://jena.sourceforge.net/" id="link-id0x122b5008">Jena</a>, <a href="http://sourceforge.net/projects/sesame/" id="link-id0x148d4078">Sesame</a>, etc.</li>
<li>All Web Protocols </li>
  </ul>
</li>
</ul>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1550">
  <rss:title>Short Recap of Virtuoso Basics (#3 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-30T15:49:53Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.) There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it. Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all. If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application-specific data structures as RDF applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can federate tables from heterogeneous DBMS. On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like. Installation is simple, just one exe and one config file. There is a huge amount of code in installers (application code, test suites, and such), but none of this is needed when you deploy. Scale goes from a 25 MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters. Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL. To condense further: Scalable Delivery of Linked Data; SPARQL and SQL (Arbitrary RDF Data + Relational; Also From 3rd Party RDBMS); Easy Deployment; Standard Interfaces (ODBC, JDBC, OLE DB, ADO.NET, XMLA; Jena, Sesame, etc.; All Web Protocols).</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(Third of five posts related to the <a href="http://www2009.org/" id="link-id0x1081fe40">WWW 2009</a> conference, held the week of April 20, 2009.)

</p>
<p>There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.</p>

<p>
<a href="http://virtuoso.openlinksw.com" id="link-id0xd0e85f0">Virtuoso</a> is a DBMS. We pitch it primarily to the <a href="http://dbpedia.org/resource/Data" id="link-id0x14a294d8">data</a> web space because this is where we see the emerging frontier. Virtuoso does both <a href="http://dbpedia.org/resource/SQL" id="link-id0x108042f8">SQL</a> and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x10889878">SPARQL</a> and can do both at large scale and high performance. The popular perception of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x107d3b40">RDF</a> and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.</p>

<p>If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. <a href="http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews" id="link-id14ddc7c8">Rendering application-specific data structures as RDF</a> applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can <a href="http://docs.openlinksw.com/virtuoso/qsvdbsrv.html" id="link-id14aaea70">federate tables from heterogeneous DBMS</a>.</p>

<p>On top of this, there is a <a href="http://docs.openlinksw.com/virtuoso/qswebserver.html" id="link-id16fcde60">web server built in</a>, so that no extra server is needed for web services, web pages, and the like.</p>

<p>Installation is simple, just one exe and one config file. There is a huge amount of code in <a href="http://docs.openlinksw.com/virtuoso/installation.html" id="link-id16767b40">installers</a> (application code, test suites, and such), but none of this is needed when you deploy. Scale goes from a 25 MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.</p>

<p>Clusters (coming in Release 6) and SQL federation are <a href="http://download.openlinksw.com/download/product_matrix.vsp?p=l_os&amp;c=39&amp;df=16" id="link-id16722550">commercial only</a>; the rest can be had <a href="http://sourceforge.net/project/showfiles.php?group_id=161622" id="link-id131080a8">under GPL</a>.</p>

<p>To condense further:</p>

<ul>
<li>Scalable Delivery of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x12211da8">Linked Data</a>
</li>
<li>SPARQL and SQL
<ul>
    <li>Arbitrary RDF Data + Relational</li>
<li>Also From 3rd Party <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x168db0e0">RDBMS</a>
    </li>
  </ul>
</li>
<li>Easy Deployment </li>
<li>Standard Interfaces
<ul>
    <li>
      <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x10473bf0">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x12187f58">JDBC</a>, OLE DB, <a href="http://dbpedia.org/resource/ADO.NET" id="link-id0x10354e48">ADO</a>.<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x16eeadd0">NET</a>, XMLA</li>
<li>
      <a href="http://jena.sourceforge.net/" id="link-id0x12e3fe08">Jena</a>, <a href="http://sourceforge.net/projects/sesame/" id="link-id0x15e62470">Sesame</a>, etc.</li>
<li>All Web Protocols </li>
  </ul>
</li>
</ul>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-27#1545">
  <rss:title>Linked Data at WWW 2009 (#1 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-27T21:28:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(First of five posts related to the WWW 2009 conference, held the week of April 20, 2009.) We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments. The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF. To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key. People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times. Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player. We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. 
What used to require special code and special parallel programming models can now be done in SQL and SPARQL. To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider. Entry costs into relatively high-end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today&#39;s retail prices and fits under a desk. For intermittent use, the rent for 1 TB of RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. As Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone. A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy. I answered as follows, which apparently cannot be repeated too often: First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent. Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. 
But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions. There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated &quot;on the fly&quot; as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0&#39;s built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entity LOD Cloud. Virtuoso 6.0 (Open Source Edition) &quot;TP1&quot; is now publicly available as a Technology Preview (beta). We heard that there is an effort for porting Freebase&#39;s Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting. As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. 
The data in turn is &quot;virtualized&quot; into the database cloud or the local secure server, as the use case may require. For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG. Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions. For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(First of five posts related to the <a href="http://www2009.org/" id="link-id0x12d8ed90">WWW 2009</a> conference, held the week of April 20, 2009.)</p>

<p>We gave a talk at the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x152bf430">Linked Open Data</a> workshop, <a href="http://events.linkeddata.org/ldow2009/" id="link-id0x191721c8">LDOW 2009</a>, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.</p>

<p>The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance <a href="http://dbpedia.org/resource/Data" id="link-id0x17c8f998">data</a> is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xd562090">RDF</a>.</p>

<p>To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.</p>

<p>People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like <i>x is a bit like y</i> is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.</p>

<p>Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.</p>

<p>We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code with special parallel programming models, can now be done in <a href="http://dbpedia.org/resource/SQL" id="link-id0x60bd0c48">SQL</a> and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x13db1ff0">SPARQL</a>.</p>

<p>To showcase this, we do <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x10a5dde8">linked data</a> search, browsing, and so on, but are essentially a platform provider.</p>

<p>Entry costs into relatively high-end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today&#39;s retail prices and fits under a desk. For intermittent use, the rent for 1 TB of RAM is $1228 per day on <a href="http://aws.amazon.com/ec2/" id="link-id0xa59039d8">EC2</a>. With this on one side and <a href="http://virtuoso.openlinksw.com" id="link-id0x19f86c10">Virtuoso</a> on the other, a lot that was impractical in the past is now within reach. As <a href="http://g1o.net/foaf.rdf#me" id="link-id0xa1853af8">Giovanni Tummarello</a> put it for airplanes, the physics are as they were for <a href="http://dbpedia.org/resource/Leonardo_da_Vinci" id="link-id0x12df02e0">da Vinci</a>, but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.</p>

<p>A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.</p>

<p>I answered as follows, which apparently cannot be repeated too much:</p>

<p>First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. <i>Ceterum censeo</i>, for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.</p>

<p>Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.</p>

<p>There was a talk about <a href="http://semanticweb.org/wiki/MashQL" id="link-id0x1642a780">MashQL</a> by <a href="http://data.semanticweb.org/person/mustafa-jarrar" id="link-id0x116e5af8">Mustafa Jarrar</a>, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated &quot;on the fly&quot; as they are being composed. As it happens, we already have an <a href="http://dbpedia.org/resource/XML" id="link-id0x11442520">XML</a>-based query abstraction layer incorporated into Virtuoso 6.0&#39;s built-in <a href="http://lod.openlinksw.com/fct/facet.vsp" id="link-id0x6a9ebfe0">Faceted Data Browser Service</a>, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: <i>Use the <a href="http://lod.openlinksw.com/fct/facet.vsp" id="link-id0x1676e158">faceted browser service</a> to get vastly higher performance when querying against Linked Data, be it <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x12653418">DBpedia</a> or the <a href="http://dbpedia.org/resource/Entity" id="link-id0x10a61e78">entity</a> <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x164150d8">LOD</a> <a href="http://lod.openlinksw.com/" id="link-id0xc5ec918">Cloud</a>. Virtuoso 6.0 (Open Source Edition) &quot;<a href="http://sourceforge.net/project/showfiles.php?group_id=161622&amp;package_id=319652&amp;release_id=677866" id="link-id12159728">TP1</a>&quot; is now publicly available as a Technology Preview (beta).</i>
</p>

<p>We heard that there is an effort for porting Freebase&#39;s Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.</p>

<p>As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is &quot;virtualized&quot; into the database cloud or the local secure server, as the use case may require. </p>

<p>For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might <i>de facto</i> have that in just a bit, whenever there is a working draft from the SPARQL WG.</p>

<p>Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x12563ea0">semantic web</a> papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.</p>

<p>For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-04-27#1544">
  <rss:title>Linked Data at WWW 2009 (#1 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-27T21:28:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(First of five posts related to the WWW 2009 conference, held the week of April 20, 2009.) We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments. The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF. To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key. People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times. Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player. We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. 
What was done with special code with special parallel programming models, can now be done in SQL and SPARQL. To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider. Entry costs into relatively high-end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today&#39;s retail prices and fits under a desk. For intermittent use, the rent for 1 TB of RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. As Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci, but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone. A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy. I answered as follows, which apparently cannot be repeated too much: First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent. Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. 
But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions. There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated &quot;on the fly&quot; as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0&#39;s built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entity LOD Cloud. Virtuoso 6.0 (Open Source Edition) &quot;TP1&quot; is now publicly available as a Technology Preview (beta). We heard that there is an effort for porting Freebase&#39;s Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting. As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. 
The data in turn is &quot;virtualized&quot; into the database cloud or the local secure server, as the use case may require. For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG. Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions. For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(First of five posts related to the <a href="http://www2009.org/" id="link-id0x114c2450">WWW 2009</a> conference, held the week of April 20, 2009.)</p>

<p>We gave a talk at the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x166e10f0">Linked Open Data</a> workshop, <a href="http://events.linkeddata.org/ldow2009/" id="link-id0x19c2b1f0">LDOW 2009</a>, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.</p>

<p>The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance <a href="http://dbpedia.org/resource/Data" id="link-id0x12c65250">data</a> is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x441128e0">RDF</a>.</p>

<p>To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.</p>

<p>People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like <i>x is a bit like y</i> is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.</p>

<p>Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.</p>

<p>We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code with special parallel programming models, can now be done in <a href="http://dbpedia.org/resource/SQL" id="link-id0x16766eb0">SQL</a> and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1645ddc8">SPARQL</a>.</p>

<p>To showcase this, we do <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0xa167e698">linked data</a> search, browsing, and so on, but are essentially a platform provider.</p>

<p>Entry costs into relatively high-end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today&#39;s retail prices and fits under a desk. For intermittent use, the rent for 1 TB of RAM is $1228 per day on <a href="http://aws.amazon.com/ec2/" id="link-id0xa1a67b70">EC2</a>. With this on one side and <a href="http://virtuoso.openlinksw.com" id="link-id0x1622d4e0">Virtuoso</a> on the other, a lot that was impractical in the past is now within reach. As <a href="http://g1o.net/foaf.rdf#me" id="link-id0x3d5c8b50">Giovanni Tummarello</a> put it for airplanes, the physics are as they were for <a href="http://dbpedia.org/resource/Leonardo_da_Vinci" id="link-id0x198e7cc0">da Vinci</a>, but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.</p>

<p>A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.</p>

<p>I answered as follows, which apparently cannot be repeated too much:</p>

<p>First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. <i>Ceterum censeo</i>, for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.</p>

<p>Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.</p>

<p>There was a talk about <a href="http://semanticweb.org/wiki/MashQL" id="link-id0x60bd57b0">MashQL</a> by <a href="http://data.semanticweb.org/person/mustafa-jarrar" id="link-id0xa1fb98d8">Mustafa Jarrar</a>, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated &quot;on the fly&quot; as they are being composed. As it happens, we already have an <a href="http://dbpedia.org/resource/XML" id="link-id0x1923a380">XML</a>-based query abstraction layer incorporated into Virtuoso 6.0&#39;s built-in <a href="http://lod.openlinksw.com/fct/facet.vsp" id="link-id0x67712740">Faceted Data Browser Service</a>, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: <i>Use the <a href="http://lod.openlinksw.com/fct/facet.vsp" id="link-id0x70df2798">faceted browser service</a> to get vastly higher performance when querying against Linked Data, be it <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1b3fd608">DBpedia</a> or the <a href="http://dbpedia.org/resource/Entity" id="link-id0x13ecd708">entity</a> <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x17f16970">LOD</a> <a href="http://lod.openlinksw.com/" id="link-id0x54334250">Cloud</a>. Virtuoso 6.0 (Open Source Edition) &quot;<a href="http://sourceforge.net/project/showfiles.php?group_id=161622&amp;package_id=319652&amp;release_id=677866" id="link-id12159728">TP1</a>&quot; is now publicly available as a Technology Preview (beta).</i>
</p>

<p>We heard that there is an effort for porting Freebase&#39;s Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.</p>

<p>As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is &quot;virtualized&quot; into the database cloud or the local secure server, as the use case may require. </p>

<p>For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might <i>de facto</i> have that in just a bit, whenever there is a working draft from the SPARQL WG.</p>

<p>Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x185d6bf8">semantic web</a> papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.</p>

<p>For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-01#1541">
  <rss:title>Web Scale and Fault Tolerance</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-01T15:18:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related. It has been said many times: when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers. Approaches to Fault Tolerance Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of write speed. With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks. There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough. 
In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous. This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application&#39;s requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context. Basics of Partition Fail-Over For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site. The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application. Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application. 
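The routing rule described above can be sketched in a few lines of Python. This is only an illustration, not Virtuoso code: the Partition class, its method names, the round-robin balancing, and the choice to send locking reads to the first copy that is still online are all assumptions made for the example.

```python
class Partition:
    """Hypothetical model of one partition kept in several copies."""

    def __init__(self, copies):
        self.copies = list(copies)      # copies[0] is the first copy
        self.online = set(self.copies)  # copies currently reachable
        self._next = 0                  # round-robin cursor (assumed policy)

    def _live(self):
        live = [c for c in self.copies if c in self.online]
        if not live:
            raise RuntimeError("no copy of this partition is online")
        return live

    def route_write(self):
        # Writes go to all copies; with a copy offline the write fails
        # until writes are explicitly re-enabled (not modeled here).
        if len(self.online) != len(self.copies):
            raise RuntimeError("write rejected: a copy is offline")
        return list(self.copies)

    def route_read(self, locking):
        live = self._live()
        if locking:
            # Repeatable or serializable reads go to the first copy
            # (assumption: the first one that is still online).
            return live[0]
        # Committed reads without repeatability are balanced over copies.
        target = live[self._next % len(live)]
        self._next += 1
        return target
```

With two copies, non-locking reads alternate between them. Once one copy is removed from the online set, reads continue against the survivor while route_write() raises, which corresponds to the first stage of failure handling.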
When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go via two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled, but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier. For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the spare copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the spare but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later. If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come online. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved. 
As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action. Levels of Tolerance and Performance The only way to make failures totally invisible is to have all in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. This is why we can do better, using the spare capacity for more than standby. Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources. The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage. Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory. Background Bulk Processing When loading data, the system is online in principle, but query response can be quite bad. 
A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server&#39;s partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference. But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate. Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost. This technique applies to all data intensive background tasks: calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating. Configurations of Redundancy Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. 
Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, there is at most a 1/10 chance that it will, which is the case when both earlier failures fell in the same group. We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers&#39; RAM when in normal operation. There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon&#39;s Dynamo does something similar at the database level. The analogies are not exact, though. If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra placement logic, the second copies of the slices formerly hosted by the failed server will be spread more or less randomly over the remaining servers, so nearly any second failure removes the last copy of some slice. This scheme equalizes load better but is less resilient. Maintenance and Continuity Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime. Present Status The basics of replicated partitions are operational. 
The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>One concern about <a href="http://virtuoso.openlinksw.com" id="link-id0x719d2f8">Virtuoso</a> Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0xa9a1d8d8">Linked Data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x25201030">Web</a>. We will see how large scale, continuous operation, and redundancy are related.</p>

<p>It has been said many times: when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema <a href="http://dbpedia.org/resource/Information" id="link-id0x4548898">information</a> in synchronous replicas on all servers.</p>

<h2>Approaches to Fault Tolerance</h2>

<p>Fault tolerance has many aspects but it starts with keeping <a href="http://dbpedia.org/resource/Data" id="link-id0x18757400">data</a> in at least two copies. There are shared-disk cluster databases like <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x711c900">Oracle</a> RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of their write speed.</p>

<p>With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.</p>

<p>There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.</p>

<p>In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.</p>

<p>This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application&#39;s requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.</p>

<h2>Basics of Partition Fail-Over</h2>

<p>For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.</p>

<p>The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.</p>
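
<p>The routing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Virtuoso's implementation: the names <code>Partition</code>, <code>Access</code> and <code>targets</code> are hypothetical, the "first copy" is taken to be the first copy still online, and balancing is plain round-robin.</p>

```python
import itertools
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    WRITE = "write"
    LOCKING_READ = "locking_read"      # repeatable or serializable reads
    COMMITTED_READ = "committed_read"  # committed state, no repeatability


@dataclass
class Partition:
    copies: list                                  # copy names in declared order
    online: set = field(default_factory=set)      # copies currently reachable
    _rr: itertools.count = field(default_factory=itertools.count, repr=False)

    def __post_init__(self):
        if not self.online:
            self.online = set(self.copies)

    def targets(self, access):
        """Return the copies that an operation of the given kind is sent to."""
        live = [c for c in self.copies if c in self.online]
        if not live:
            raise RuntimeError("no copy of this partition is online")
        if access is Access.WRITE:
            return live                           # writes go to all copies
        if access is Access.LOCKING_READ:
            return [live[0]]                      # locking reads go to the first copy
        # Committed reads are balanced among all copies (round-robin here).
        return [live[next(self._rr) % len(live)]]
```

<p>With two copies, successive committed reads alternate between them, while a locking read always lands on the same copy until that copy drops out of <code>online</code>.</p>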

<p>Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.</p>

<p>When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice proceed in two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled, but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.</p>

<p>For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the spare copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the spare but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.</p>
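
<p>The two stages can be pictured as a small state machine. The sketch below is a simplification under assumptions: the class and state names are hypothetical, the 30-second grace period is an arbitrary stand-in for "well under a minute", and a server that restarts within the grace period is assumed to rejoin directly because no writes were accepted in the meantime.</p>

```python
from enum import Enum


class CopyState(Enum):
    ONLINE = "online"
    SUSPECT = "suspect"    # stage one: reads rerouted, writes give an error
    REMOVED = "removed"    # stage two: writes re-enabled on the surviving copy


class FailoverMonitor:
    def __init__(self, grace_seconds=30.0):
        self.grace_seconds = grace_seconds   # assumed value, not from the post
        self.state = CopyState.ONLINE
        self.failed_at = None

    def on_failure(self, now):
        """Record that the copy stopped responding at time `now` (seconds)."""
        if self.state is CopyState.ONLINE:
            self.state = CopyState.SUSPECT
            self.failed_at = now

    def tick(self, now):
        """Give up waiting once the grace period has elapsed."""
        if self.state is CopyState.SUSPECT and now - self.failed_at >= self.grace_seconds:
            self.state = CopyState.REMOVED

    def on_restart(self):
        """Report how a restarted copy may come back."""
        if self.state is CopyState.SUSPECT:
            self.state = CopyState.ONLINE
            self.failed_at = None
            return "rejoined"
        if self.state is CopyState.REMOVED:
            return "needs_sync"              # the complex sync cycle is required
        return "already_online"

    def writes_allowed(self):
        # Writes fail only during the waiting window.
        return self.state is not CopyState.SUSPECT
```

<p>The only point of the sketch is the asymmetry: a quick restart costs a short window of failed writes, while a late restart costs a sync cycle.</p>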

<p>If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.</p>

<p>As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.</p>

<h2>Levels of Tolerance and Performance</h2>

<p>The only way to make failures totally invisible is to have all in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. This is why we can do better, using the spare capacity for more than standby.</p>

<p>Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.</p>

<p>The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.</p>
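
<p>As a rough sketch of this read routing, assume as in the example that each partition covers a contiguous span of 64K IDs and that the span is cut into equal sub-ranges, one per copy. The function name and the fall-through to the next online copy are assumptions made for illustration only.</p>

```python
PARTITION_SPAN = 64 * 1024   # IDs per partition, as in the example above


def read_copy(row_id, online=(True, True)):
    """Return (partition index, copy index) that should serve a read of row_id.

    `online` has one flag per copy; with two copies each one normally serves
    a 32K sub-range and so keeps its own working set in RAM.
    """
    n_copies = len(online)
    partition, offset = divmod(row_id, PARTITION_SPAN)
    preferred = offset * n_copies // PARTITION_SPAN
    for step in range(n_copies):
        candidate = (preferred + step) % n_copies
        if online[candidate]:
            return partition, candidate   # falls through to a surviving copy
    raise RuntimeError("no copy of partition %d is online" % partition)
```

<p>For instance, <code>read_copy(96 * 1024)</code> picks the second copy of the partition holding IDs 64K - 128K, and the same call with <code>online=(True, False)</code> falls back to the first copy, at the price of a colder cache.</p>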

<p>Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory.</p>

<h2>Background Bulk Processing</h2>

<p>When loading data, the system is online in principle, but query response can be quite bad. A large <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x19fd9c18">RDF</a> load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server&#39;s partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.</p>

<p>But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.</p>
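
<p>A toy model of this split, with each copy reduced to a Python dictionary, may make the sequence easier to follow. Everything here is hypothetical naming, and the model assumes, as the text requires, that the bulk-loaded keys and the online keys do not overlap.</p>

```python
class ReplicatedPartition:
    def __init__(self):
        self.first = {}          # serves online queries during the load
        self.second = {}         # receives the bulk load
        self.serving = "first"

    def online_update(self, key, value):
        # Online updates are committed on both copies.
        self.first[key] = value
        self.second[key] = value

    def bulk_load(self, rows):
        # The load is committed on the second copy only.
        self.second.update(rows)

    def query(self, key):
        source = self.first if self.serving == "first" else self.second
        return source.get(key)

    def finish_load(self):
        # Redirect online activity to the second copy, then overwrite the
        # first copy so that all data is again held in duplicate.
        self.serving = "second"
        self.first = dict(self.second)
```

<p>While the load runs, queries do not see the loaded rows; after <code>finish_load()</code> both copies hold the loaded data plus every online update that arrived in the meantime.</p>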

<p>Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.</p>

<p>This technique applies to all data intensive background tasks: calculation of <a href="http://dbpedia.org/resource/Entity" id="link-id0x20b7a568">entity</a> search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.</p>

<h2>Configurations of Redundancy</h2>

<p>Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, there is at most a 1/10 chance that it will, which is the case when both earlier failures fell in the same group.</p>
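
<p>These odds can be checked with exact fractions. The sketch assumes that failures strike the surviving servers uniformly at random and independently; the function names are made up for the illustration.</p>

```python
from fractions import Fraction
from math import comb


def chance_next_failure_stops(group_size, failed_per_group):
    """Chance that one more random failure removes the last copy of a group.

    failed_per_group lists how many servers have already failed in each group.
    """
    alive = group_size * len(failed_per_group) - sum(failed_per_group)
    # A server is critical when it is the last survivor of its group.
    critical = sum(1 for failed in failed_per_group if group_size - failed == 1)
    return Fraction(critical, alive)


def chance_outage_at_earliest(group_size, n_groups):
    """Chance that the first group_size failures all hit the same group."""
    total = group_size * n_groups
    return Fraction(n_groups, comb(total, group_size))
```

<p>With four pairs and one server down, <code>chance_next_failure_stops(2, [1, 0, 0, 0])</code> gives 1/7. With four triples, the result is 1/10 when both earlier failures hit one group (<code>[2, 0, 0, 0]</code>) and 0 when they hit different groups (<code>[1, 1, 0, 0]</code>), so that over all failure orders the third failure stops the system with probability 1/55.</p>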

<p>We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers&#39; RAM when in normal operation.</p>

<p>There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon&#39;s Dynamo does something similar at the database level. The analogies are not exact, though.</p>

<p>If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra placement logic, the second copies of the slices formerly hosted by the failed server will be spread more or less randomly over the remaining servers, so nearly any second failure removes the last copy of some slice. This scheme equalizes load better but is less resilient.</p>
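
<p>A back-of-the-envelope calculation shows why the second failure is so likely to hurt. It assumes that "1K slices" means 1024 slices and that each slice's two copies land on a uniformly random pair of distinct servers, which is one reading of "without extra logic"; the function name is hypothetical.</p>

```python
from math import comb


def chance_pair_shares_no_slice(n_servers, n_slices):
    """Chance that two given servers hold no slice in common.

    This is the chance that losing both of them loses no data.
    """
    pairs = comb(n_servers, 2)            # possible placements of one slice
    return (1 - 1 / pairs) ** n_slices    # no slice was placed on this pair
```

<p>For eight servers and 1024 slices the result is vanishingly small, so under these assumptions any two failed servers almost certainly share at least one slice. With fixed groups of two, by contrast, only one of the seven possible second failures is fatal.</p>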

<h2>Maintenance and Continuity</h2>

<p>Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.</p>

<h2>Present Status</h2>

<p>The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-04-01#1540">
  <rss:title>Web Scale and Fault Tolerance</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-01T15:18:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related. It has been said many times: when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers. Approaches to Fault Tolerance Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of their write speed. With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks. There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough. 
In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous. This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application&#39;s requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context. Basics of Partition Fail-Over For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site. The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application. Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application. 
When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice proceed in two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled, but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier. For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the spare copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the spare but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later. If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved. 
As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action. Levels of Tolerance and Performance The only way to make failures totally invisible is to have all in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. This is why we can do better, using the spare capacity for more than standby. Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources. The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage. Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory. Background Bulk Processing When loading data, the system is online in principle, but query response can be quite bad. 
A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server&#39;s partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference. But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate. Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost. This technique applies to all data intensive background tasks: calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating. Configurations of Redundancy Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. 
Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, there is at most a 1/10 chance that it will, which is the case when both earlier failures fell in the same group. We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers&#39; RAM when in normal operation. There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon&#39;s Dynamo does something similar at the database level. The analogies are not exact, though. If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra placement logic, the second copies of the slices formerly hosted by the failed server will be spread more or less randomly over the remaining servers, so nearly any second failure removes the last copy of some slice. This scheme equalizes load better but is less resilient. Maintenance and Continuity Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime. Present Status The basics of replicated partitions are operational. 
The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>One concern about <a href="http://virtuoso.openlinksw.com" id="link-id0x3b82c38">Virtuoso</a> Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x22c42e10">Linked Data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x1e4f0b58">Web</a>. We will see how large scale, continuous operation, and redundancy are related.</p>

<p>It has been said many times: when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema <a href="http://dbpedia.org/resource/Information" id="link-id0x224401c0">information</a> in synchronous replicas on all servers.</p>

<h2>Approaches to Fault Tolerance</h2>

<p>Fault tolerance has many aspects but it starts with keeping <a href="http://dbpedia.org/resource/Data" id="link-id0x230b7500">data</a> in at least two copies. There are shared-disk cluster databases like <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0xa9a1d8d8">Oracle</a> RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of their write speed.</p>

<p>With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.</p>

<p>There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.</p>

<p>In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.</p>

<p>This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application&#39;s requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.</p>

<h2>Basics of Partition Fail-Over</h2>

<p>For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.</p>

<p>The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.</p>
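
<p>The routing rule above can be sketched in a few lines of Python. This is only an illustration, not Virtuoso code: the <code>Partition</code> class, the <code>ReadMode</code> names, and the round-robin balancing are assumptions made for the example, as is the choice to send locking reads to the first copy that is still online when the first copy itself is down.</p>

```python
import itertools
from enum import Enum


class ReadMode(Enum):
    LOCKING = "locking"      # repeatable or serializable reads
    COMMITTED = "committed"  # committed state, no repeatability guarantee


class Partition:
    """One logical partition stored in several copies (hypothetical model)."""

    def __init__(self, copies):
        self.copies = list(copies)       # e.g. ["host1", "host5"]
        self.online = set(self.copies)   # copies currently reachable
        self._next = itertools.count()   # round-robin counter for balancing

    def write_targets(self):
        # Writes go to all copies; what happens when one of them is
        # offline is a separate fail-over decision, not made here.
        return list(self.copies)

    def read_target(self, mode):
        live = [c for c in self.copies if c in self.online]
        if not live:
            raise RuntimeError("no copy of this partition is online")
        if mode is ReadMode.LOCKING:
            # Assumption: locking reads go to the first copy still online,
            # so all lock state for the partition lives in one place.
            return live[0]
        # Reads of committed state are balanced over all live copies.
        return live[next(self._next) % len(live)]
```

<p>With two copies, <code>Partition(["a", "b"])</code> sends locking reads to <code>"a"</code> and alternates committed reads between <code>"a"</code> and <code>"b"</code>; if <code>"a"</code> drops out of <code>online</code>, all reads go to <code>"b"</code> without the caller needing to know.</p>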

<p>Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.</p>

<p>When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go through two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled, but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.</p>

<p>For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both raise alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the spare copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the spare, but now the failed server may no longer rejoin the cluster without a complex sync cycle. This can all happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.</p>
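
<p>The wait-then-re-enable behavior described above amounts to a small state machine. The following Python sketch is a hedged illustration only: the state names, the <code>PartitionPair</code> class, and the 30-second grace period are assumptions for the example and do not come from Virtuoso itself.</p>

```python
from enum import Enum


class CopyState(Enum):
    ONLINE = 1       # in sync, takes reads and writes
    SUSPECT = 2      # just failed; it may still restart by itself
    OUT_OF_SYNC = 3  # writes continued without it; needs a sync cycle


class PartitionPair:
    """Tracks the state of the failed copy of a two-copy partition."""

    def __init__(self, grace_seconds=30.0):  # hypothetical default
        self.grace_seconds = grace_seconds
        self.state = CopyState.ONLINE
        self.failed_at = None

    def on_failure(self, now):
        self.state = CopyState.SUSPECT
        self.failed_at = now

    def tick(self, now):
        # Once the grace period has passed, give up waiting and let
        # writes proceed on the surviving copy only.
        if (self.state is CopyState.SUSPECT
                and now - self.failed_at >= self.grace_seconds):
            self.state = CopyState.OUT_OF_SYNC

    def writes_allowed(self, now):
        self.tick(now)
        # Writes give an error only while we wait for a self-restart.
        return self.state is not CopyState.SUSPECT

    def on_restart(self, now):
        self.tick(now)
        if self.state is CopyState.SUSPECT:
            # No writes were committed in the meantime, so the copy
            # can rejoin without the complex sync cycle.
            self.state = CopyState.ONLINE
            self.failed_at = None
        return self.state
```

<p>A restart inside the grace period returns the copy to <code>ONLINE</code>; a restart after it leaves the copy <code>OUT_OF_SYNC</code>, which is the case that needs the atomic window discussed earlier.</p>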

<p>If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come online. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200 GB of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.</p>

<p>As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.</p>

<h2>Levels of Tolerance and Performance</h2>

<p>The only way to make failures totally invisible is to have everything in duplicate, provisioned so that the system never runs at more than half of its total capacity. This is often neither economical nor necessary. We can do better by using the spare capacity for more than standby.</p>

<p>Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.</p>

<p>The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.</p>
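
<p>As a rough sketch of this routing, assuming 64K-ID partitions as in the example above: the function names are hypothetical, and the partition lookup is a plain range division for brevity, whereas the text describes hash partitioning underneath.</p>

```python
SLICE = 64 * 1024  # IDs per partition, as in the 64K example


def partition_of(row_id):
    # Simplification: contiguous ID ranges stand in for hash partitioning.
    return row_id // SLICE


def preferred_copy(row_id, n_copies=2):
    """Index of the copy whose cache should hold this ID."""
    offset = row_id % SLICE             # position inside the partition
    return offset * n_copies // SLICE   # first half -> 0, second half -> 1


def route_read(row_id, online_copies, n_copies=2):
    """Pick a copy for a read, falling back if the preferred one is down."""
    if not online_copies:
        raise RuntimeError("no copy of this partition is online")
    want = preferred_copy(row_id, n_copies)
    if want in online_copies:
        return want
    # Degraded mode: any surviving copy holds the full partition, at the
    # cost of a colder working set.
    return min(online_copies)
```

<p>Because every copy stores the whole partition, the fallback in <code>route_read</code> is always correct; only cache locality suffers while a copy is offline.</p>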

<p>Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory.</p>

<h2>Background Bulk Processing</h2>

<p>When loading data, the system is online in principle, but query response can be quite bad. A large <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x3c0cfb8">RDF</a> load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server&#39;s partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.</p>

<p>But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving online activity also do some updates, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up-to-date state, including changes that came in during the bulk load. The online activity can now be redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.</p>

<p>Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.</p>

<p>This technique applies to all data intensive background tasks: calculation of <a href="http://dbpedia.org/resource/Entity" id="link-id0x3b38ac0">entity</a> search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.</p>

<h2>Configurations of Redundancy</h2>

<p>Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, namely if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, the chance is at most 1/10, and that only if both earlier failures fell in the same group.</p>
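
<p>The figures above can be checked with a few lines of Python. The sketch assumes that each failure hits one of the surviving servers uniformly at random and independently of earlier failures; the function names are made up for the example.</p>

```python
from fractions import Fraction


def worst_case_stop_chance(n_groups, group_size):
    """Chance that the next failure stops the system, given that all
    earlier failures already hit the same group and left one member."""
    total = n_groups * group_size
    already_failed = group_size - 1
    return Fraction(1, total - already_failed)


def chance_of_earliest_outage(n_groups, group_size):
    """Chance that the first group_size failures all land in one group,
    i.e. that the system stops at the earliest possible failure."""
    total = n_groups * group_size
    p = Fraction(1)
    for k in range(1, group_size):
        # After k failures in one group, group_size - k of its members
        # remain among total - k surviving servers.
        p *= Fraction(group_size - k, total - k)
    return p
```

<p>For four groups of two this gives 1/7 in both cases. For four groups of three, the worst case is 1/10, while the chance that three consecutive failures actually line up in one group is 2/11 times 1/10, or 1/55.</p>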

<p>We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers&#39; RAM when in normal operation.</p>

<p>There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon&#39;s Dynamo does something similar at the database level. The analogies are not exact, though.</p>

<p>If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, the second copies of the slices formerly hosted by the failed server will be spread randomly over the remaining servers, so a second failure is likely to take out the last copy of some slice. This scheme equalizes load better but is less resilient.</p>
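
<p>To see why the second failure "probably will" break operations, here is a small estimate in Python. It assumes that each slice places its two copies on a uniformly random pair of distinct servers, independently of other slices; the function name is hypothetical.</p>

```python
from math import comb


def p_second_failure_fatal(n_servers, n_slices):
    """Chance that two given failed servers held both copies of a slice."""
    pairs = comb(n_servers, 2)       # possible placements of one slice
    p_slice_safe = 1 - 1 / pairs     # this slice is not on the failed pair
    return 1 - p_slice_safe ** n_slices
```

<p>With 8 servers there are 28 possible server pairs, so with 1024 slices it is practically certain that some slice has both copies on any two given servers. With fixed groups of two, by contrast, only one of the seven possible second failures is fatal.</p>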

<h2>Maintenance and Continuity</h2>

<p>Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.</p>

<h2>Present Status</h2>

<p>The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-03-05#1529">
  <rss:title>An Update on Virtuoso Development</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-05T10:23:49Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It is time for an update on Virtuoso developments. We continue enhancing our hosting of the Linked Open Data (LOD) cloud at http://lod.openlinksw.com. We have now added result ranking for both text and URIs. Text hit scores are based on word frequency and proximity; URI scores are based on link density. We calculate each URI&#39;s rank by adding up references and weighting these by the score of the referrer. This is as in web search. Each iteration of the ranking joins every referred URI to each of its referrers. We do about 1.2 million such joins per second, across partitions, over 2.2 billion triples and 400M distinct subjects without any great optimization, just using SQL stored procedures and partitioned function calls. This is a sort of SQL map-reduce. This would run over twice as fast if it were all in C but this is adequate for now. The more interesting bit will be tuning the scoring based on what type of link we have. This is what the web search engines cannot do as well, since document links are untyped. We are moving toward a decent user interface for the LOD hosting, including offering ready-made domain-specific queries, e.g., biomedical. Things like &quot;URI finding with autocomplete&quot; are done and just have to be put online. With linked data, there is the whole question of identifier choice. We will have a special page just for this. There we show reference statistics, synonyms declared by owl:sameAs, synonyms determined by shared property values, etc. In this way we become a terminology lookup service. Copies of the LOD cluster system are available for evaluators, on a case by case basis. We will also make this publicly available on EC2 before too long. Otherwise, we continue working on productization, primarily things like reliability and recovery. One exercise is running TPC-C with intentionally stupid partitioning, so that almost all joins and deadlocks are distributed. 
Then we simulate a cluster interconnect that drops messages now and then, sometimes kill server processes, and still keep full ACID properties. Cloud capable, also in bad weather. The open source release of Virtuoso 6 (no cluster) is basically ready to go; what remains is mostly a question of logistics. I will talk about these things in greater individual detail next week.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>It is time for an update on <a href="http://virtuoso.openlinksw.com" id="link-id0x151e89e8">Virtuoso</a> developments.</p>

<p>We continue enhancing our hosting of the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1464e168">Linked Open Data</a> (<a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x151e7f38">LOD</a>) cloud at <a href="http://lod.openlinksw.com" id="link-id11ac2448">http://lod.openlinksw.com</a>.</p>

<p>We have now added result ranking for both text and URIs.  Text hit scores are based on word frequency and proximity; <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x14f9fae0">URI</a> scores are based on link density.</p>

<p>We calculate each URI&#39;s rank by adding up references and weighting these by the score of the referrer.  This is as in web search.  Each iteration of the ranking joins every referred URI to each of its referrers.  We do about 1.2 million such joins per second, across partitions, over 2.2 billion triples and 400M distinct subjects without any great optimization, just using <a href="http://dbpedia.org/resource/SQL" id="link-id0xaa36a458">SQL</a> stored procedures and partitioned function calls. This is a sort of SQL map-reduce.  This would run over twice as fast if it were all in <a href="http://dbpedia.org/resource/C%2B%2B" id="link-id0x1571e270">C</a> but this is adequate for now.  The more interesting bit will be tuning the scoring based on what type of link we have.  This is what the web search engines cannot do as well, since document links are untyped.</p>
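
<p>One iteration of such a ranking can be sketched as follows. This is a single-process Python illustration, not the partitioned SQL procedure described above; the function name and data layout are assumptions, and the sketch leaves out any normalization or damping, which the text does not specify.</p>

```python
def rank_iteration(links, scores):
    """Compute new scores from one pass over the links.

    links  -- iterable of (referrer, referred) URI pairs
    scores -- dict mapping each URI to its current score
    """
    new_scores = {uri: 0.0 for uri in scores}
    for referrer, referred in links:
        # Each reference adds the referrer's current score to the
        # referred URI; this is the join of referred to referrer.
        new_scores[referred] = (
            new_scores.get(referred, 0.0) + scores.get(referrer, 0.0)
        )
    return new_scores
```

<p>In a partitioned setting, the loop body corresponds to a lookup of the referrer's score that may land on another partition, which is why the join rate across partitions is the figure of interest.</p>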

<p>We  are moving toward a decent user interface for the LOD hosting, including offering ready-made domain-specific queries, e.g., biomedical.</p>

<p>Things like &quot;URI finding with autocomplete&quot; are done and just have to be put online.</p>

<p>With <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x14325e08">linked data</a>, there is the whole question of identifier choice.  We will have a special page just for this.  There we show reference statistics, synonyms declared by <code><a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x638b3900">owl</a>:sameAs</code>, synonyms determined by shared property values, etc.  In this way we become a terminology lookup service.</p>

<p>Copies of the LOD cluster system are available for evaluators, on a case by case basis.  We will also make this publicly available on EC2 before too long.</p>

<p>Otherwise, we continue working on productization, primarily things like reliability and recovery.  One exercise is running <a href="http://dbpedia.org/resource/TPC-C" id="link-id0x144d00f0">TPC-C</a> with intentionally stupid partitioning, so that almost all joins and deadlocks are distributed.  Then we simulate a cluster interconnect that drops messages now and then, sometimes kill server processes, and still keep full ACID properties.  Cloud capable, also in bad weather.</p>

<p>The open source release of Virtuoso 6 (no cluster) is basically ready to go; what remains is mostly a question of logistics.</p>

<p>I will talk about these things in greater individual detail next week.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-03-05#1528">
  <rss:title>An Update on Virtuoso Development</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-05T10:23:49Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It is time for an update on Virtuoso developments. We continue enhancing our hosting of the Linked Open Data (LOD) cloud at http://lod.openlinksw.com. We have now added result ranking for both text and URIs. Text hit scores are based on word frequency and proximity; URI scores are based on link density. We calculate each URI&#39;s rank by adding up references and weighting these by the score of the referrer. This is as in web search. Each iteration of the ranking joins every referred URI to each of its referrers. We do about 1.2 million such joins per second, across partitions, over 2.2 billion triples and 400M distinct subjects without any great optimization, just using SQL stored procedures and partitioned function calls. This is a sort of SQL map-reduce. This would run over twice as fast if it were all in C but this is adequate for now. The more interesting bit will be tuning the scoring based on what type of link we have. This is what the web search engines cannot do as well, since document links are untyped. We are moving toward a decent user interface for the LOD hosting, including offering ready-made domain-specific queries, e.g., biomedical. Things like &quot;URI finding with autocomplete&quot; are done and just have to be put online. With linked data, there is the whole question of identifier choice. We will have a special page just for this. There we show reference statistics, synonyms declared by owl:sameAs, synonyms determined by shared property values, etc. In this way we become a terminology lookup service. Copies of the LOD cluster system are available for evaluators, on a case by case basis. We will also make this publicly available on EC2 before too long. Otherwise, we continue working on productization, primarily things like reliability and recovery. One exercise is running TPC-C with intentionally stupid partitioning, so that almost all joins and deadlocks are distributed. 
Then we simulate a cluster interconnect that drops messages now and then, sometimes kill server processes, and still keep full ACID properties. Cloud capable, also in bad weather. The open source release of Virtuoso 6 (no cluster) is basically ready to go; what remains is mostly a question of logistics. I will talk about these things in greater individual detail next week.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>It is time for an update on <a href="http://virtuoso.openlinksw.com" id="link-id0x14b05090">Virtuoso</a> developments.</p>

<p>We continue enhancing our hosting of the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x14395f00">Linked Open Data</a> (<a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x13e0cf98">LOD</a>) cloud at <a href="http://lod.openlinksw.com" id="link-id11ac2448">http://lod.openlinksw.com</a>.</p>

<p>We have now added result ranking for both text and URIs.  Text hit scores are based on word frequency and proximity; <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0xa9f5ff20">URI</a> scores are based on link density.</p>

<p>We calculate each URI&#39;s rank by adding up references and weighting these by the score of the referrer.  This is as in web search.  Each iteration of the ranking joins every referred URI to each of its referrers.  We do about 1.2 million such joins per second, across partitions, over 2.2 billion triples and 400M distinct subjects without any great optimization, just using <a href="http://dbpedia.org/resource/SQL" id="link-id0x79e77aa0">SQL</a> stored procedures and partitioned function calls. This is a sort of SQL map-reduce.  This would run over twice as fast if it were all in <a href="http://dbpedia.org/resource/C%2B%2B" id="link-id0x11a3f3e8">C</a> but this is adequate for now.  The more interesting bit will be tuning the scoring based on what type of link we have.  This is what the web search engines cannot do as well, since document links are untyped.</p>

<p>We  are moving toward a decent user interface for the LOD hosting, including offering ready-made domain-specific queries, e.g., biomedical.</p>

<p>Things like &quot;URI finding with autocomplete&quot; are done and just have to be put online.</p>

<p>With <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x14334a48">linked data</a>, there is the whole question of identifier choice.  We will have a special page just for this.  There we show reference statistics, synonyms declared by <code><a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x59a757b0">owl</a>:sameAs</code>, synonyms determined by shared property values, etc.  In this way we become a terminology lookup service.</p>

<p>Copies of the LOD cluster system are available for evaluators, on a case by case basis.  We will also make this publicly available on EC2 before too long.</p>

<p>Otherwise, we continue working on productization, primarily things like reliability and recovery.  One exercise is running <a href="http://dbpedia.org/resource/TPC-C" id="link-id0x1f6797c0">TPC-C</a> with intentionally stupid partitioning, so that almost all joins and deadlocks are distributed.  Then we simulate a cluster interconnect that drops messages now and then, sometimes kill server processes, and still keep full ACID properties.  Cloud capable, also in bad weather.</p>

<p>The open source release of Virtuoso 6 (no cluster) is basically ready to go; what remains is mostly a question of logistics.</p>

<p>I will talk about these things in greater individual detail next week.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-02-16#1527">
  <rss:title>Facets and Large Ontologies of the LOD Cloud</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-02-16T11:21:05Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have just submitted this paper to the WWW09 Linked Open Data Workshop. The thing is intermittently live with both DBpedia on one instance and a LOD Cloud data collection of about 2 billion triples on another. We will give out the links once we have tested a bit more. The present activity is all about testing Virtuoso 6 for release, cluster and otherwise.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have just submitted <a href="http://www.openlinksw.com/weblog/oerling/lodw.pdf" id="link-id13d9bc68">this paper</a> to the WWW09 <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x51c2fd00">Linked Open Data</a> Workshop.</p>
<p>The thing is intermittently live with both <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xa16f87a0">DBpedia</a> on one instance and a <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1a8212d8">LOD</a> Cloud <a href="http://dbpedia.org/resource/Data" id="link-id0xc1097a8">data</a> collection of about 2 billion triples on another.  We will give out the links once we have tested a bit more.</p>
<p>The present activity is all about testing <a href="http://virtuoso.openlinksw.com" id="link-id0x17d64f90">Virtuoso</a> 6 for release, cluster and otherwise.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-02-16#1526">
  <rss:title>Facets and Large Ontologies of the LOD Cloud</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-02-16T11:21:05Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have just submitted this paper to the WWW09 Linked Open Data Workshop. The thing is intermittently live with both DBpedia on one instance and a LOD Cloud data collection of about 2 billion triples on another. We will give out the links once we have tested a bit more. The present activity is all about testing Virtuoso 6 for release, cluster and otherwise.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have just submitted <a href="http://www.openlinksw.com/weblog/oerling/lodw.pdf" id="link-id13d9bc68">this paper</a> to the WWW09 <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1a2b37a0">Linked Open Data</a> Workshop.</p>
<p>The thing is intermittently live with both <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xc0dd578">DBpedia</a> on one instance and a <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1d721b60">LOD</a> Cloud <a href="http://dbpedia.org/resource/Data" id="link-id0xa141b238">data</a> collection of about 2 billion triples on another.  We will give out the links once we have tested a bit more.</p>
<p>The present activity is all about testing <a href="http://virtuoso.openlinksw.com" id="link-id0xd69ff70">Virtuoso</a> 6 for release, cluster and otherwise.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-01-09#1516">
  <rss:title>Faceted Search:  Unlimited Data in Interactive Time</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-01-09T22:03:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Why not see the whole world of data as facets? Well, we&#39;d like to, but there is the feeling that this is not practical. The old problem has been that it is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting. The actual matches take time. Well, neither is in fact necessary. When there are large numbers of items matching the conditions, counting them can take time but then this is the beginning of the search, and the user is not even likely to look very closely at the counts. It is enough to see that there are many of one and few of another. If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed. The faceted view for guiding search and precise analytics are two different problems. There are client-side faceted views like Exhibit or our own ODE. The problem with these is that there are a few orders of magnitude difference between the actual database size and what fits on the user agent. This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible, but we are talking about the data web, the very soul of serendipity and unexpected discovery. So we made a web service that will do faceted search on arbitrary RDF. If it does not get complete results within a timeout, it will return what it has counted so far, using Virtuoso&#39;s Anytime feature. Looking for subjects with some specific combination of properties is however a bit limited, so this will also do JOINs. Many features are one or two JOINs away; take geographical locations or social networks, for example. Yet a faceted search should be point-and-click, and should not involve a full query construction. 
We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise. At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style. This is good enough for queries like &#39;what do Harry Potter fans also like&#39; or &#39;who are the authors of articles tagged semantic web and machine learning and published in 2008&#39;. For complex grouping, sub-queries, arithmetic or such, one must write the actual query. But one can begin with facets, and then continue refining the query by hand since the service also returns SPARQL text. We made a small web interface on top of the service with all logic server side. This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy. Also, the problem of syncing between a user-agent-based store and a database is entirely gone. If we are working with a known data structure, the user interface should choose the display by the data type and offer links to related reports. This is all easy to build as web pages or AJAX. We show how the generic interface is done in Virtuoso PL, and you can adapt that or rewrite it in PHP, Java, JavaScript, or anything else, to accommodate use-case specific navigation needs such as data format. The web service takes an XML representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax. The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics. The source of the PL functions, Web Service and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher. A Programmer&#39;s Guide will be available as part of the standard Virtuoso Documentation collection, including the Virtuoso Open Source Edition Website.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Why not see the whole world of <a href="http://dbpedia.org/resource/Data" id="link-id0xc3f6b38">data</a> as facets?  Well, we&#39;d like to, but there is the feeling that this is not practical.</p>

<p>The old problem has been that it is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting. The actual matches take time.</p>

<p>Well, neither is in fact necessary.  When there are large numbers of items matching the conditions, counting them can take time but then this is the beginning of the search, and the user is not even likely to look very closely at the counts.  It is enough to see that there are many of one and few of another.  If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed.  The faceted view for guiding search and precise analytics are two different problems.</p>

<p>There are client-side faceted views like Exhibit or our own <a href="http://ode.openlinksw.com/" id="link-id0x1bc1cfe0">ODE</a>.  The problem with these is that there are a few orders of magnitude difference between the actual database size and what fits on the user agent.  This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible, but we are talking about the data web, the very soul of serendipity and unexpected discovery.</p>

<p>So we made a web service that will do faceted search on arbitrary <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xbb62170">RDF</a>. If it does not get complete results within a timeout, it will return what it has counted so far, using <a href="http://virtuoso.openlinksw.com" id="link-id0xb122b00">Virtuoso</a>&#39;s <a href="http://www.openlinksw.com/weblog/oerling/?id=1494" id="link-id117b0df0"><b>Anytime</b></a> feature.  Looking for subjects with some specific combination of properties is however a bit limited, so this will also do <code>JOINs</code>.  Many features are one or two <code>JOINs</code> away; take geographical locations or social networks, for example.</p>

<p>Yet a faceted search should be point-and-click, and should not involve a full query construction.  We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise.  At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style.</p>

<p>This is good enough for queries like &#39;what do Harry Potter fans also like&#39; or &#39;who are the authors of articles tagged <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0xbee32d8">semantic web</a> and machine learning and published in 2008&#39;.  For complex grouping, sub-queries, arithmetic or such, one must write the actual query.</p>

<p>But one can begin with facets, and then continue refining the query by hand since the service also returns <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xbcc9f38">SPARQL</a> text.  We made a small web interface on top of the service with all logic server side.  This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy.  Also, the problem of syncing between a user-agent-based store and a database is entirely gone.</p>

<p>If we are working with a known data structure, the user interface should choose the display by the data type and offer links to related reports.  This is all easy to build as web pages or AJAX.  We show how the generic interface is done in Virtuoso PL, and you can adapt that or rewrite it in <a href="http://dbpedia.org/resource/PHP" id="link-id0xcdbe268">PHP</a>, Java, JavaScript, or anything else, to accommodate use-case specific navigation needs such as data format.</p>

<p>The web service takes an <a href="http://dbpedia.org/resource/XML" id="link-id0xc019c08">XML</a> representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax.  The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics.</p>

<p>The source of the PL functions, Web Service and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher.  A Programmer&#39;s Guide will be available as part of the standard Virtuoso Documentation collection, including the Virtuoso Open Source Edition Website.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-01-09#1515">
  <rss:title>Faceted Search:  Unlimited Data in Interactive Time</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-01-09T22:03:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Why not see the whole world of data as facets? Well, we&#39;d like to, but there is the feeling that this is not practical. The old problem has been that it is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting. The actual matches take time. Well, neither is in fact necessary. When there are large numbers of items matching the conditions, counting them can take time but then this is the beginning of the search, and the user is not even likely to look very closely at the counts. It is enough to see that there are many of one and few of another. If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed. The faceted view for guiding search and precise analytics are two different problems. There are client-side faceted views like Exhibit or our own ODE. The problem with these is that there are a few orders of magnitude difference between the actual database size and what fits on the user agent. This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible, but we are talking about the data web, the very soul of serendipity and unexpected discovery. So we made a web service that will do faceted search on arbitrary RDF. If it does not get complete results within a timeout, it will return what it has counted so far, using Virtuoso&#39;s Anytime feature. Looking for subjects with some specific combination of properties is however a bit limited, so this will also do JOINs. Many features are one or two JOINs away; take geographical locations or social networks, for example. Yet a faceted search should be point-and-click, and should not involve a full query construction. 
We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise. At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style. This is good enough for queries like &#39;what do Harry Potter fans also like&#39; or &#39;who are the authors of articles tagged semantic web and machine learning and published in 2008&#39;. For complex grouping, sub-queries, arithmetic or such, one must write the actual query. But one can begin with facets, and then continue refining the query by hand since the service also returns SPARQL text. We made a small web interface on top of the service with all logic server side. This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy. Also, the problem of syncing between a user-agent-based store and a database is entirely gone. If we are working with a known data structure, the user interface should choose the display by the data type and offer links to related reports. This is all easy to build as web pages or AJAX. We show how the generic interface is done in Virtuoso PL, and you can adapt that or rewrite it in PHP, Java, JavaScript, or anything else, to accommodate use-case specific navigation needs such as data format. The web service takes an XML representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax. The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics. The source of the PL functions, Web Service and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher. A Programmer&#39;s Guide will be available as part of the standard Virtuoso Documentation collection, including the Virtuoso Open Source Edition Website.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Why not see the whole world of <a href="http://dbpedia.org/resource/Data" id="link-id0x1a0319c0">data</a> as facets?  Well, we&#39;d like to, but there is the feeling that this is not practical.</p>

<p>The old problem has been that it is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting. The actual matches take time.</p>

<p>Well, neither is in fact necessary.  When there are large numbers of items matching the conditions, counting them can take time but then this is the beginning of the search, and the user is not even likely to look very closely at the counts.  It is enough to see that there are many of one and few of another.  If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed.  The faceted view for guiding search and precise analytics are two different problems.</p>

<p>There are client-side faceted views like Exhibit or our own <a href="http://ode.openlinksw.com/" id="link-id0xc3db130">ODE</a>.  The problem with these is that there are a few orders of magnitude difference between the actual database size and what fits on the user agent.  This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible, but we are talking about the data web, the very soul of serendipity and unexpected discovery.</p>

<p>So we made a web service that will do faceted search on arbitrary <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xbdbf198">RDF</a>. If it does not get complete results within a timeout, it will return what it has counted so far, using <a href="http://virtuoso.openlinksw.com" id="link-id0x17691878">Virtuoso</a>&#39;s <a href="http://www.openlinksw.com/weblog/oerling/?id=1494" id="link-id117b0df0"><b>Anytime</b></a> feature.  Looking for subjects with some specific combination of properties is however a bit limited, so this will also do <code>JOINs</code>.  Many features are one or two <code>JOINs</code> away; take geographical locations or social networks, for example.</p>

<p>Yet a faceted search should be point-and-click, and should not involve a full query construction.  We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise.  At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style.</p>

<p>This is good enough for queries like &#39;what do Harry Potter fans also like&#39; or &#39;who are the authors of articles tagged <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x160eb950">semantic web</a> and machine learning and published in 2008&#39;.  For complex grouping, sub-queries, arithmetic or such, one must write the actual query.</p>

<p>But one can begin with facets, and then continue refining the query by hand since the service also returns <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x17e82228">SPARQL</a> text.  We made a small web interface on top of the service with all logic server side.  This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy.  Also, the problem of syncing between a user-agent-based store and a database is entirely gone.</p>

<p>If we are working with a known data structure, the user interface should choose the display by the data type and offer links to related reports.  This is all easy to build as web pages or AJAX.  We show how the generic interface is done in Virtuoso PL, and you can adapt that or rewrite it in <a href="http://dbpedia.org/resource/PHP" id="link-id0xbe41d60">PHP</a>, Java, JavaScript, or anything else, to accommodate use-case specific navigation needs such as data format.</p>

<p>The web service takes an <a href="http://dbpedia.org/resource/XML" id="link-id0xc2fc358">XML</a> representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax.  The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics.</p>

<p>The source of the PL functions, Web Service and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher.  A Programmer&#39;s Guide will be available as part of the standard Virtuoso Documentation collection, including the Virtuoso Open Source Edition Website.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-01-02#1511">
  <rss:title>Linked Data &amp; The Year 2009 (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-01-02T16:17:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">As is fitting for the season, I will editorialize a bit about what has gone before and what is to come. Sir Tim said it at WWW08 in Beijing: linked data and the linked data web is the semantic web and the Web done right. The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general. The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft&#39;s Entity Framework technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first; schema last. Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is no less important; quite the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel. It is against this backdrop that this year will play out. As concerns research, I will again quote Harry Halpin at ESWC 2008: &quot;Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&quot; [or words to this effect]. Something like the data web, and even the semantic web, will happen. Harry&#39;s question was whether this would be the descendant of what is today called semantic web research. I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though. Why should such a thing be RDF-based? 
If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF? The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can&#39;t do unless you own the RDBMS. Source access is not enough; you also need the knowledge. Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo. Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction. In hoc signo vinces. In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core. For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. 
As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords. For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect. For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side. For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. 
On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&#39;s Líon project. These will provide platforms for further demonstrating the &quot;web&quot; in data web, as in web-scale smart databasing. 2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one&#39;s own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time. Exciting 2009 to all.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.</p>

<p>
<a href="http://www.w3.org/People/Berners-Lee/card#i" id="link-id1119f250">Sir Tim</a> said it at WWW08 in <a href="http://www2008.org/" id="link-id0x1dcb93a0">Beijing</a>: <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x13a3efb8">linked data</a> and the linked data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x13a44cd0">web</a> is the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x10d25788">semantic web</a> and the Web done right.</p>

<p>The grail of <i>ad hoc</i> analytics on infinite <a href="http://dbpedia.org/resource/Data" id="link-id0xa201d518">data</a> has lost none of its appeal.  We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.</p>

<p>The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft&#39;s <a href="http://dbpedia.org/resource/Entity" id="link-id0x12fa4e40">Entity</a> Framework technology is an example.  Agility has been a buzzword for a long time.  Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first; schema last.</p>

<p>Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized.  <a href="http://dbpedia.org/resource/Information" id="link-id0x175b32e8">Information</a>, as the asset it is, is no less important; quite the contrary.  But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.</p>

<p>It is against this backdrop that this year will play out.</p>

<p>As concerns research, I will <a href="http://www.openlinksw.com/weblog/oerling/?id=1374" id="link-id1151b128">again quote</a> <a href="http://www.ibiblio.org/hhalpin/#" id="link-id141cb740">Harry Halpin</a> at <a href="http://www.eswc2008.org/" id="link-id0x18a8a858">ESWC 2008</a>: &quot;Men will fight in a war, and even lose a war, for what they believe just.  And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&quot; [or words to this effect].</p>

<p>Something like the data web, and even the semantic web, will happen. Harry&#39;s question was whether this would be the descendant of what is today called semantic web research.</p>

<p>I heard in conversation about a project for making a very large metadata store.  I also heard that the makers did not particularly insist on this being <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x3c39ed80">RDF</a>-based, though.</p>

<p>Why should such a thing be RDF-based?  If it is already accepted that there will be <i>ad hoc</i> schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?</p>

<p>The justification of RDF is in reusing and linking-to data and terminology out there.  Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an <a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x14a77880">entity</a>-attribute-value (<a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x5f978e88">EAV</a>, i.e., triple) store on a generic <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x391bdcd8">RDBMS</a>.  The sem-web world has been there, trust me.  We came out well because we put all inside the RDBMS, lowest level, which you can&#39;t do unless you own the RDBMS.  Source access is not enough; you also need the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x138a3a00">knowledge</a>.</p>

<p>Technicalities aside, the question is one of proprietary vs. standards-based.  This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. <a href="http://www.zemanta.com/" id="link-id0x5f92cb38">Zemanta</a> and <a href="http://www.opencalais.com/" id="link-id0x139c3200">OpenCalais</a> serving <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1731dc78">DBpedia</a> URIs are examples.  Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.</p>

<p>Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata).  As on the web, so on the enterprise <a href="http://dbpedia.org/resource/Intranet" id="link-id0x1324ada8">intranet</a>.  In this lies the strength of RDF as opposed to proprietary flexible database schemes.  This is a qualitative distinction.</p>
<p align="center">
 <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData" id="link-id117178a8"><img src="http://www.openlinksw.com/images/logos/LoDLogo.gif" alt="Linking Open Data project logo" />
 </a>
<br />
 <a href="http://dbpedia.org/resource/In_hoc_signo_vinces" id="link-id115f47e8"><i>In hoc signo vinces.</i>
 </a>
</p>

<p>In this light, we welcome the <a href="http://semanticweb.org/wiki/VoiD" id="link-id0x67cf560">voiD</a> (<a href="http://semanticweb.org/wiki/VoiD" id="link-id0x1898c908">VOcabulary of Interlinked Data</a>), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.</p>

<p>For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace.  <a href="http://dbpedia.org/resource/XML" id="link-id0x1588d6a8">XML</a> is for the transaction; RDF is for the discovery, query, and analytics.  As with databases in general, first there was the transaction; then there was the query.  Same here.  For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota.  For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the <a href="http://www.openlinksw.com/weblog/oerling/?id=1374" id="link-id110b8668">Virtuoso Anytime</a> feature.  With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage.  Of course, we do not forget advertising.  When data has explicit semantics, contextuality is better than with keywords.</p>

<p>For these visions to materialize on top of the linked data platform, linked data must join the world of data.  This means messaging that is geared towards the database public.  They know the problem, but the RDF proposition is still not well enough understood for it to connect.</p>

<p>For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping.  We are also bringing out new Microsoft Entity <a href="http://dbpedia.org/resource/ADO.NET_Entity_Framework" id="link-id0x13a50fd8">Framework</a> components.  This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.</p>

<p>For <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id0x1d2ea7f0">OpenLink Software</a>, 2008 was about developing technology for scale, RDF as well as generic relational.  We did show a tiny preview with the <a href="http://challenge.semanticweb.org/" id="link-id0x658fbc8">Billion Triples Challenge</a> demo.  Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale.  We <a href="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?id=1489" id="link-id150c6090">started offering ready-to-go Virtuoso-hosted linked open data sets</a> on Amazon EC2 in December.  Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available.  Technical specifics are amply discussed on this <a href="http://dbpedia.org/resource/Blog" id="link-id0x1424ec20">blog</a>.  There are still some new technology things to be developed this year; first among these are strong <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x14b8ca88">SPARQL</a> federation, and on-the-fly resizing of server clusters.  On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&#39;s <a href="https://lion.deri.ie/" id="link-id115c02f8">Líon project</a>.  These will provide platforms for further demonstrating the &quot;web&quot; in data web, as in web-scale smart databasing.</p>

<p>2009 will see change through scale.  The things that exist will start interconnecting and there will be emergent value.  Deployments will be larger and scale will be readily available through a services model or by installation at one&#39;s own facilities.  We may see the start of Search becoming Find, like <a href="http://myopenlink.net/dataspace/person/kidehen#this" id="link-id14e43050">Kingsley</a> says, meaning semantics of data guiding search.  Entity extraction will multiply data volumes and bring parts of the data web to real time.</p>

<p>Exciting 2009 to all.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-01-02#1510">
  <rss:title>Linked Data &amp; The Year 2009 (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-01-02T16:17:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">As is fitting for the season, I will editorialize a bit about what has gone before and what is to come. Sir Tim said it at WWW08 in Beijing: linked data and the linked data web is the semantic web and the Web done right. The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general. The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft&#39;s Entity Framework technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first; schema last. Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is no less important; quite the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel. It is against this backdrop that this year will play out. As concerns research, I will again quote Harry Halpin at ESWC 2008: &quot;Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&quot; [or words to this effect]. Something like the data web, and even the semantic web, will happen. Harry&#39;s question was whether this would be the descendant of what is today called semantic web research. I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though. Why should such a thing be RDF-based? 
If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF? The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can&#39;t do unless you own the RDBMS. Source access is not enough; you also need the knowledge. Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo. Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction. In hoc signo vinces. In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core. For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. 
As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords. For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect. For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side. For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. 
On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&#39;s Líon project. These will provide platforms for further demonstrating the &quot;web&quot; in data web, as in web-scale smart databasing. 2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one&#39;s own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time. Exciting 2009 to all.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.</p>

<p>
<a href="http://www.w3.org/People/Berners-Lee/card#i" id="link-id1119f250">Sir Tim</a> said it at WWW08 in <a href="http://www2008.org/" id="link-id0x14ab66b0">Beijing</a>: <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x115a4588">linked data</a> and the linked data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0xa5c678">web</a> is the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x7cbe5540">semantic web</a> and the Web done right.</p>

<p>The grail of <i>ad hoc</i> analytics on infinite <a href="http://dbpedia.org/resource/Data" id="link-id0xa4b25428">data</a> has lost none of its appeal.  We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.</p>

<p>The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft&#39;s <a href="http://dbpedia.org/resource/Entity" id="link-id0x1c3c72b0">Entity</a> Framework technology is an example.  Agility has been a buzzword for a long time.  Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first; schema last.</p>

<p>Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized.  <a href="http://dbpedia.org/resource/Information" id="link-id0xa0743bd8">Information</a>, as the asset it is, is no less important; quite the contrary.  But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.</p>

<p>It is against this backdrop that this year will play out.</p>

<p>As concerns research, I will <a href="http://www.openlinksw.com/weblog/oerling/?id=1374" id="link-id1151b128">again quote</a> <a href="http://www.ibiblio.org/hhalpin/#" id="link-id141cb740">Harry Halpin</a> at <a href="http://www.eswc2008.org/" id="link-id0x28f68040">ESWC 2008</a>: &quot;Men will fight in a war, and even lose a war, for what they believe just.  And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&quot; [or words to this effect].</p>

<p>Something like the data web, and even the semantic web, will happen. Harry&#39;s question was whether this would be the descendant of what is today called semantic web research.</p>

<p>I heard in conversation about a project for making a very large metadata store.  I also heard that the makers did not particularly insist on this being <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x13c8af68">RDF</a>-based, though.</p>

<p>Why should such a thing be RDF-based?  If it is already accepted that there will be <i>ad hoc</i> schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?</p>

<p>The justification of RDF is in reusing and linking to data and terminology out there.  Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an <a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x1ca17b20">entity</a>-attribute-value (<a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x1c9d6050">EAV</a>, i.e., triple) store on a generic <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x557dff0">RDBMS</a>.  The sem-web world has been there, trust me.  We came out well because we put everything inside the RDBMS, at the lowest level, which you can&#39;t do unless you own the RDBMS.  Source access is not enough; you also need the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1470c748">knowledge</a>.</p>

<p>Technicalities aside, the question is one of proprietary vs. standards-based.  This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. <a href="http://www.zemanta.com/" id="link-id0x524bea0">Zemanta</a> and <a href="http://www.opencalais.com/" id="link-id0x46132d38">OpenCalais</a> serving <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x13624fb8">DBpedia</a> URIs are examples.  Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.</p>

<p>Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata).  As on the web, so on the enterprise <a href="http://dbpedia.org/resource/Intranet" id="link-id0xa1392eb8">intranet</a>.  In this lies the strength of RDF as opposed to proprietary flexible database schemes.  This is a qualitative distinction.</p>
<p align="center">
 <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData" id="link-id117178a8"><img src="http://www.openlinksw.com/images/logos/LoDLogo.gif" alt="Linking Open Data project logo" />
 </a>
<br />
 <a href="http://dbpedia.org/resource/In_hoc_signo_vinces" id="link-id115f47e8"><i>In hoc signo vinces.</i>
 </a>
</p>

<p>In this light, we welcome the <a href="http://semanticweb.org/wiki/VoiD" id="link-id0x12352cc0">voiD</a> (<a href="http://semanticweb.org/wiki/VoiD" id="link-id0x722c18">VOcabulary of Interlinked Data</a>), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.</p>

<p>For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace.  <a href="http://dbpedia.org/resource/XML" id="link-id0x2c21c00">XML</a> is for the transaction; RDF is for the discovery, query, and analytics.  As with databases in general, first there was the transaction; then there was the query.  Same here.  For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota.  For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the <a href="http://www.openlinksw.com/weblog/oerling/?id=1374" id="link-id110b8668">Virtuoso Anytime</a> feature.  With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage.  Of course, we do not forget advertising.  When data has explicit semantics, contextuality is better than with keywords.</p>

<p>For these visions to materialize on top of the linked data platform, linked data must join the world of data.  This means messaging that is geared towards the database public.  They know the problem, but the RDF proposition is still not well enough understood for it to connect.</p>

<p>For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping.  We are also bringing out new Microsoft Entity <a href="http://dbpedia.org/resource/ADO.NET_Entity_Framework" id="link-id0x723080">Framework</a> components.  This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.</p>

<p>For <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id0x11e1dfc0">OpenLink Software</a>, 2008 was about developing technology for scale, RDF as well as generic relational.  We did show a tiny preview with the <a href="http://challenge.semanticweb.org/" id="link-id0x722d08">Billion Triples Challenge</a> demo.  Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale.  We <a href="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?id=1489" id="link-id150c6090">started offering ready-to-go Virtuoso-hosted linked open data sets</a> on Amazon EC2 in December.  Now we continue doing this based on our next-generation server, as well as making Virtuoso 6 Cluster commercially available.  Technical specifics are amply discussed on this <a href="http://dbpedia.org/resource/Blog" id="link-id0x10fc1930">blog</a>.  There are still some new technology things to be developed this year; first among these are strong <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x7fd25590">SPARQL</a> federation, and on-the-fly resizing of server clusters.  On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&#39;s <a href="https://lion.deri.ie/" id="link-id115c02f8">Líon project</a>.  These will provide platforms for further demonstrating the &quot;web&quot; in data web, as in web-scale smart databasing.</p>

<p>2009 will see change through scale.  The things that exist will start interconnecting and there will be emergent value.  Deployments will be larger and scale will be readily available through a services model or by installation at one&#39;s own facilities.  We may see the start of Search becoming Find, like <a href="http://myopenlink.net/dataspace/person/kidehen#this" id="link-id14e43050">Kingsley</a> says, meaning semantics of data guiding search.  Entity extraction will multiply data volumes and bring parts of the data web to real time.</p>

<p>Exciting 2009 to all.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-18#1507">
  <rss:title>Virtuoso 6 FAQ directory</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-18T15:46:18Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have received various inquiries on high-end metadata stores. I will here go through some salient questions. The requested features include: Scaling to trillions of triples Running on clusters of commodity servers Running in federated environments, possibly over wide area networks Built-in inference Transactions Security Support for extra triple level metadata, such as security attributes Q: What is the storage cost per triple? answer Q: What is the cost to insert a triple? answer Q: What is the cost to delete a triple? (For the deletion itself, as well as for updating any indices) answer Q: What is the cost to search on a given property? answer Q: What data types are supported? answer Q: What inferencing is supported? answer Q: Is the inferencing dynamic or is an extra step required before inferencing can be used? answer Q: Do you support full text search? answer Q: What programming interfaces are supported? Do you support standard SPARQL protocol? answer Q: How can data be partitioned across multiple servers? answer Q: How many triples can a single server handle? answer Q: What is the performance impact of going from the billion to the trillion triples? answer Q: Do you support additional metadata for triples, such as timestamps, security tags, etc.? answer Q: Should we use RDF for our large metadata store? What are the alternatives? answer Q: How multithreaded is Virtuoso? answer Q: Can multiple servers run off a single shared disk database? answer Q: Can Virtuoso run on a SAN? answer Q: How does Virtuoso join across partitions? answer Q: Does Virtuoso support federated triple stores? If there are multiple SPARQL end points, can Virtuoso be used to do queries joining between these? answer Q: How many servers can a cluster contain? answer Q: How do I reconfigure a cluster, adding and removing machines, etc.? answer Q: How will Virtuoso handle regional clusters? 
answer Q: Is there a mechanism for terminating long running queries? answer Q: Can the user be asynchronously notified when a long running query terminates? answer Q: How many concurrent queries can Virtuoso handle? answer Q: What is the relative performance of SPARQL queries vs. native relational queries? answer Q: Does Virtuoso support property tables? answer Q: What performance metrics does Virtuoso offer? answer Q: What support do you provide for concurrency/multithreading operation? Is your interface thread-safe? answer Q: What level of ACID properties are supported? answer Q: Do you provide the ability to atomically add a set of triples, where either all are added or none are added? answer Q: Do you provide the ability to add a set of triples, respecting the isolation property (so concurrent accessors either see none of the triple values, or all of them)? answer Q: What is the time to start a database, create/open a graph? answer Q: What sort of security features are built into Virtuoso? answer</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have received various inquiries on high-end metadata stores.   I will here go through some salient questions.  The requested features include:</p>

<ul>
<li>Scaling to trillions of triples</li>
<li>Running on clusters of commodity servers</li>
<li>Running in federated environments, possibly over wide area networks</li>
<li>Built-in inference</li>
<li>Transactions</li>
<li>Security</li>
<li>Support for extra triple level metadata, such as security attributes</li>
</ul>


<p>Q: What is the storage cost per triple? <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StorageCostPerTriple" id="link-id147f61e8">answer</a>
</p>

<p>Q: What is the cost to insert a triple?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TripleInsertionCost" id="link-id112e2488">answer</a>
</p>

<p>Q: What is the cost to delete a triple? (For the deletion itself, as well as for updating any indices)  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TripleDeletionCost" id="link-id11728528">answer</a>
</p>

<p>Q: What is the cost to search on a given property?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PropertySearchCost" id="link-id1586e360">answer</a>
</p>

<p>Q: What <a href="http://dbpedia.org/resource/Data" id="link-id14688e38">data</a> types are supported?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SupportedDataTypes" id="link-id1593dbf0">answer</a>
</p>

<p>Q: What inferencing is supported?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SupportedInferencing" id="link-id112f3248">answer</a>
</p>

<p>Q: Is the inferencing dynamic or is an extra step required before inferencing can be used?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#InferencingDynamism" id="link-id1477e2e0">answer</a>
</p>

<p>Q: Do you support <a href="http://dbpedia.org/resource/Full_text_search" id="link-id1177b198">full text search</a>?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#FullTextSearchSupport" id="link-id1543b170">answer</a>
</p>

<p>Q: What programming interfaces are supported?  Do you support standard <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id14bb69c0">SPARQL protocol</a>?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SupportedProgrammingInterfaces" id="link-id14d4eb18">answer</a>
</p>

<p>Q: How can data be partitioned across multiple servers?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#MultipleServerDataPartitioning" id="link-id13722e00">answer</a>
</p>

<p>Q: How many triples can a single server handle?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SingleServerTripleLimits" id="link-id14046e58">answer</a>
</p>

<p>Q: What is the performance impact of going from the billion to the trillion triples?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PerformanceImpactBillionToTrillion" id="link-id113cfc10">answer</a>
</p>

<p>Q: Do you support additional metadata for triples, such as timestamps, security tags, etc.?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TripleMetadataSupport" id="link-id14c75fa8">answer</a>
</p>

<p>Q: Should we use <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id11342010">RDF</a> for our large metadata store?  What are the alternatives?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#LargeMetadataStoreFormat" id="link-id1478db38">answer</a>
</p>

<p>Q: How multithreaded is <a href="http://virtuoso.openlinksw.com" id="link-id1651d028">Virtuoso</a>?   <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#VirtuosoMultiThreading" id="link-id152ad310">answer</a> </p>

<p>Q: Can multiple servers run off a single shared disk database?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#MultipleServersOneDiskDatabase" id="link-id14d9d528">answer</a>
</p>

<p>Q: Can Virtuoso run on a SAN?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#VirtuosoOnSAN" id="link-id111b55d0">answer</a>
</p>

<p>Q: How does Virtuoso join across partitions?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#CrossPartitionJoins" id="link-id11094db8">answer</a>
</p>

<p>Q: Does Virtuoso support federated triple stores?  If there are multiple <a href="http://dbpedia.org/resource/SPARQL" id="link-id19156b48">SPARQL</a> end points, can Virtuoso be used to do queries joining between these?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#FederatedTripleStoresAndQueries" id="link-id15447ef8">answer</a>
</p>

<p>Q: How many servers can a cluster contain?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ClusterServerLimit" id="link-id125fe0d0">answer</a>
</p>

<p>Q: How do I reconfigure a cluster, adding and removing machines, etc.?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ClusterReconfiguration" id="link-id1150c448">answer</a>
</p>

<p>Q: How will Virtuoso handle regional clusters?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#RegionalClustering" id="link-id1596ca48">answer</a>
</p>

<p>Q: Is there a mechanism for terminating long running queries?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TerminatingLongRunningQueries" id="link-id116bbd60">answer</a>
</p>

<p>Q: Can the user be asynchronously notified when a long running query terminates?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#AsynchNotificationOfQueryTermination" id="link-id15a59a50">answer</a>
</p>

<p>Q: How many concurrent queries can Virtuoso handle?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ConcurrentQueryLimits" id="link-id110a8c00">answer</a>
</p>

<p>Q: What is the relative performance of SPARQL queries vs. native relational queries?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#RelativePerformanceSparqlVsSql" id="link-id110914f8">answer</a>
</p>

<p>Q: Does Virtuoso support property tables?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PropertyTableSupport" id="link-id1581f8c8">answer</a>
</p>

<p>Q: What performance metrics does Virtuoso offer?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PerformanceMetricSupport" id="link-id14e92300">answer</a>
</p>

<p>Q: What support do you provide for concurrency/multithreading operation? Is your interface thread-safe?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ConcurrencyAndThreadSafety" id="link-id15964b80">answer</a>
</p>

<p>Q: What level of ACID properties are supported?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#AcidComplianceLevel" id="link-id11035ac0">answer</a>
</p>

<p>Q: Do you provide the ability to atomically add a set of triples, where either all are added or none are added?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#AtomicTripleInsertion" id="link-id15290e68">answer</a>
</p>

<p>Q: Do you provide the ability to add a set of triples, respecting the isolation property (so concurrent accessors either see none of the triple values, or all of them)?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#IsolationDuringInsertion" id="link-id15855df0">answer</a>
</p>

<p>Q: What is the time to start a database, create/open a graph?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StartupTimes" id="link-id14227f40">answer</a>
</p>

<p>Q: What sort of security features are built into Virtuoso?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#BuiltInSecurity" id="link-id11927810">answer</a>
</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-18#1506">
  <rss:title>Virtuoso 6 FAQ directory</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-18T15:46:18Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have received various inquiries on high-end metadata stores. I will here go through some salient questions. The requested features include: Scaling to trillions of triples Running on clusters of commodity servers Running in federated environments, possibly over wide area networks Built-in inference Transactions Security Support for extra triple level metadata, such as security attributes Q: What is the storage cost per triple? answer Q: What is the cost to insert a triple? answer Q: What is the cost to delete a triple? (For the deletion itself, as well as for updating any indices) answer Q: What is the cost to search on a given property? answer Q: What data types are supported? answer Q: What inferencing is supported? answer Q: Is the inferencing dynamic or is an extra step required before inferencing can be used? answer Q: Do you support full text search? answer Q: What programming interfaces are supported? Do you support standard SPARQL protocol? answer Q: How can data be partitioned across multiple servers? answer Q: How many triples can a single server handle? answer Q: What is the performance impact of going from the billion to the trillion triples? answer Q: Do you support additional metadata for triples, such as timestamps, security tags, etc.? answer Q: Should we use RDF for our large metadata store? What are the alternatives? answer Q: How multithreaded is Virtuoso? answer Q: Can multiple servers run off a single shared disk database? answer Q: Can Virtuoso run on a SAN? answer Q: How does Virtuoso join across partitions? answer Q: Does Virtuoso support federated triple stores? If there are multiple SPARQL end points, can Virtuoso be used to do queries joining between these? answer Q: How many servers can a cluster contain? answer Q: How do I reconfigure a cluster, adding and removing machines, etc.? answer Q: How will Virtuoso handle regional clusters? 
answer Q: Is there a mechanism for terminating long running queries? answer Q: Can the user be asynchronously notified when a long running query terminates? answer Q: How many concurrent queries can Virtuoso handle? answer Q: What is the relative performance of SPARQL queries vs. native relational queries? answer Q: Does Virtuoso support property tables? answer Q: What performance metrics does Virtuoso offer? answer Q: What support do you provide for concurrency/multithreading operation? Is your interface thread-safe? answer Q: What level of ACID properties are supported? answer Q: Do you provide the ability to atomically add a set of triples, where either all are added or none are added? answer Q: Do you provide the ability to add a set of triples, respecting the isolation property (so concurrent accessors either see none of the triple values, or all of them)? answer Q: What is the time to start a database, create/open a graph? answer Q: What sort of security features are built into Virtuoso? answer</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have received various inquiries on high-end metadata stores.   I will here go through some salient questions.  The requested features include:</p>

<ul>
<li>Scaling to trillions of triples</li>
<li>Running on clusters of commodity servers</li>
<li>Running in federated environments, possibly over wide area networks</li>
<li>Built-in inference</li>
<li>Transactions</li>
<li>Security</li>
<li>Support for extra triple level metadata, such as security attributes</li>
</ul>


<p>Q: What is the storage cost per triple? <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StorageCostPerTriple" id="link-id147f61e8">answer</a>
</p>

<p>Q: What is the cost to insert a triple?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TripleInsertionCost" id="link-id112e2488">answer</a>
</p>

<p>Q: What is the cost to delete a triple? (For the deletion itself, as well as for updating any indices)  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TripleDeletionCost" id="link-id11728528">answer</a>
</p>

<p>Q: What is the cost to search on a given property?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PropertySearchCost" id="link-id1586e360">answer</a>
</p>

<p>Q: What <a href="http://dbpedia.org/resource/Data" id="link-id14688e38">data</a> types are supported?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SupportedDataTypes" id="link-id1593dbf0">answer</a>
</p>

<p>Q: What inferencing is supported?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SupportedInferencing" id="link-id112f3248">answer</a>
</p>

<p>Q: Is the inferencing dynamic or is an extra step required before inferencing can be used?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#InferencingDynamism" id="link-id1477e2e0">answer</a>
</p>

<p>Q: Do you support <a href="http://dbpedia.org/resource/Full_text_search" id="link-id1177b198">full text search</a>?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#FullTextSearchSupport" id="link-id1543b170">answer</a>
</p>

<p>Q: What programming interfaces are supported?  Do you support standard <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id14bb69c0">SPARQL protocol</a>?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SupportedProgrammingInterfaces" id="link-id14d4eb18">answer</a>
</p>

<p>Q: How can data be partitioned across multiple servers?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#MultipleServerDataPartitioning" id="link-id13722e00">answer</a>
</p>

<p>Q: How many triples can a single server handle?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#SingleServerTripleLimits" id="link-id14046e58">answer</a>
</p>

<p>Q: What is the performance impact of going from a billion to a trillion triples?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PerformanceImpactBillionToTrillion" id="link-id113cfc10">answer</a>
</p>

<p>Q: Do you support additional metadata for triples, such as timestamps, security tags, etc.?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TripleMetadataSupport" id="link-id14c75fa8">answer</a>
</p>

<p>Q: Should we use <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id11342010">RDF</a> for our large metadata store?  What are the alternatives?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#LargeMetadataStoreFormat" id="link-id1478db38">answer</a>
</p>

<p>Q: How multithreaded is <a href="http://virtuoso.openlinksw.com" id="link-id1651d028">Virtuoso</a>?   <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#VirtuosoMultiThreading" id="link-id152ad310">answer</a> </p>

<p>Q: Can multiple servers run off a single shared disk database?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#MultipleServersOneDiskDatabase" id="link-id14d9d528">answer</a>
</p>

<p>Q: Can Virtuoso run on a SAN?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#VirtuosoOnSAN" id="link-id111b55d0">answer</a>
</p>

<p>Q: How does Virtuoso join across partitions?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#CrossPartitionJoins" id="link-id11094db8">answer</a>
</p>

<p>Q: Does Virtuoso support federated triple stores?  If there are multiple <a href="http://dbpedia.org/resource/SPARQL" id="link-id19156b48">SPARQL</a> end points, can Virtuoso be used to do queries joining between these?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#FederatedTripleStoresAndQueries" id="link-id15447ef8">answer</a>
</p>

<p>Q: How many servers can a cluster contain?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ClusterServerLimit" id="link-id125fe0d0">answer</a>
</p>

<p>Q: How do I reconfigure a cluster, adding and removing machines, etc?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ClusterReconfiguration" id="link-id1150c448">answer</a>
</p>

<p>Q: How will Virtuoso handle regional clusters?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#RegionalClustering" id="link-id1596ca48">answer</a>
</p>

<p>Q: Is there a mechanism for terminating long running queries?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#TerminatingLongRunningQueries" id="link-id116bbd60">answer</a>
</p>

<p>Q: Can the user be asynchronously notified when a long running query terminates?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#AsynchNotificationOfQueryTermination" id="link-id15a59a50">answer</a>
</p>

<p>Q: How many concurrent queries can Virtuoso handle?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ConcurrentQueryLimits" id="link-id110a8c00">answer</a>
</p>

<p>Q: What is the relative performance of SPARQL queries vs. native relational queries?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#RelativePerformanceSparqlVsSql" id="link-id110914f8">answer</a>
</p>

<p>Q: Does Virtuoso support property tables?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PropertyTableSupport" id="link-id1581f8c8">answer</a>
</p>

<p>Q: What performance metrics does Virtuoso offer?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#PerformanceMetricSupport" id="link-id14e92300">answer</a>
</p>

<p>Q: What support do you provide for concurrent/multithreaded operation? Is your interface thread-safe?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#ConcurrencyAndThreadSafety" id="link-id15964b80">answer</a>
</p>

<p>Q: What level of ACID properties are supported?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#AcidComplianceLevel" id="link-id11035ac0">answer</a>
</p>

<p>Q: Do you provide the ability to atomically add a set of triples, where either all are added or none are added?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#AtomicTripleInsertion" id="link-id15290e68">answer</a>
</p>

<p>Q: Do you provide the ability to add a set of triples, respecting the isolation property (so concurrent accessors either see none of the triple values, or all of them)?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#IsolationDuringInsertion" id="link-id15855df0">answer</a>
</p>

<p>Q: What is the time to start a database, create/open a graph?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StartupTimes" id="link-id14227f40">answer</a>
</p>

<p>Q: What sort of security features are built into Virtuoso?  <a href="http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#BuiltInSecurity" id="link-id11927810">answer</a>
</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1505">
  <rss:title>Virtuoso RDF:  A Getting Started Guide for the Developer</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-17T12:31:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated. The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces. You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files. To get started, cd into binsrc/tests/lubm. To verify that this works, you can do ./test_server.sh virtuoso-t This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step. Loading the Data The file lubm-load.sql contains the commands for loading the LUBM single university qualification database. The data files themselves are in lubm_8000, 15 files in RDFXML. There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark. So now let&#39;s go through this procedure. Start the server: $ virtuoso-t -f &amp; This starts the server in foreground mode, and puts it in the background of the shell. Now we connect to it with the isql utility. $ isql 1111 dba dba This gives a SQL&gt; prompt. The default username and password are both dba. When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword. The lubm-load.sql file is quite self-explanatory. 
It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory. Next it calls this function for the lubm_8000 directory under the server&#39;s working directory. sparql CLEAR GRAPH &lt;lubm&gt;; sparql CLEAR GRAPH &lt;inf&gt;; load_lubm ( server_root() || &#39;/lubm_8000/&#39; ); Then it verifies that the right number of triples is found in the &lt;lubm&gt; graph. sparql SELECT COUNT(*) FROM &lt;lubm&gt; WHERE { ?x ?y ?z } ; The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now. Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database. sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; INSERT INTO GRAPH &lt;lubm&gt; { ?x ub:subOrganizationOf ?z } FROM &lt;lubm&gt; WHERE { ?x ub:subOrganizationOf ?y . ?y ub:subOrganizationOf ?z . }; Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph. DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, &#39;inf&#39; ) ; sparql SELECT COUNT(*) FROM &lt;inf&gt; WHERE { ?x ?y ?z } ; Then we declare that the triples in the &lt;inf&gt; graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the &#39;inft&#39; rule set. Otherwise this has no effect. rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;); This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time. checkpoint; Now we are ready for querying. Querying the Data The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. 
The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the &lt;inf&gt; graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the &lt;lubm&gt; graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file. If you wish to run all the commands in a SQL file, you can type load &lt;filename&gt;; (e.g., load lubm-cp.sql;) at the SQL&gt; prompt. If you wish to try individual statements, you can paste them to the command line. For example: SQL&gt; sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; SELECT * FROM &lt;lubm&gt; WHERE { ?x a ub:Publication . ?x ub:publicationAuthor &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; }; VARCHAR _______________________________________________________________________ http://www.Department0.University0.edu/AssistantProfessor0/Publication0 http://www.Department0.University0.edu/AssistantProfessor0/Publication1 http://www.Department0.University0.edu/AssistantProfessor0/Publication2 http://www.Department0.University0.edu/AssistantProfessor0/Publication3 http://www.Department0.University0.edu/AssistantProfessor0/Publication4 http://www.Department0.University0.edu/AssistantProfessor0/Publication5 6 Rows. -- 4 msec. To stop the server, simply type shutdown; at the SQL&gt; prompt. If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like [HTTPServer] ServerPort = 8421 ServerRoot = . ServerThreads = 2 to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL&gt; prompt and then virtuoso-t -f &amp; at the shell prompt). Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL. 
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface. To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF. To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>It is a long-standing promise of mine to dispel the false impression that using <a href="http://virtuoso.openlinksw.com/" id="link-id113506d0">Virtuoso</a> to work with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id115d9528">RDF</a> is complicated.</p>

<p>The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.</p>

<p>You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under Emacs (<code>M-x shell</code>) works best, because it makes it convenient to cut and paste between the shell and files, but the open source <code>isql</code> utility should also have command-line editing.</p>

<p>To get started, cd into <code>binsrc/tests/lubm</code>.</p>

<p>To verify that the installation works, run:</p>

<blockquote>
<pre>./test_server.sh virtuoso-t</pre></blockquote>

<p>This tests the server with the LUBM queries and should report 45 tests passed.  After this, we will go through the tests step by step.</p>

<h2>Loading the <a href="http://dbpedia.org/resource/Data" id="link-id10f7bd90">Data</a>
</h2> 

<p>The file <code>lubm-load.sql</code> contains the commands for loading the LUBM single university qualification database.</p>

<p>The data files themselves are in <code>lubm_8000</code>, 15 files in RDF/XML.</p>

<p>There is also a little ontology called <code>inf.nt</code>.  This declares the subclass and subproperty relations used in the benchmark.</p>

<p>So now let&#39;s go through this procedure.</p>

<p>Start the server:</p>

<blockquote>
<pre>$ virtuoso-t -f &amp;
</pre></blockquote>

<p>The <code>-f</code> option starts the server in foreground mode, so that it does not detach as a daemon, and the trailing <code>&amp;</code> puts it in the background of the shell.</p>

<p>Now we connect to it with the isql utility.</p>

<blockquote>
<pre>$ isql 1111 dba dba 
</pre></blockquote>

<p>This gives a <code>SQL&gt;</code> prompt.  The default username and password are both <code>dba</code>.</p>

<p>When a command is <a href="http://dbpedia.org/resource/SQL" id="link-id1176ce70">SQL</a>, it is entered directly.  If it is <a href="http://dbpedia.org/resource/SPARQL" id="link-id156df468">SPARQL</a>, it is prefixed with the keyword <code>sparql</code>.  This is how all the SQL clients work.  Any SQL client, such as any <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id152d0a00">ODBC</a> or <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id157ad6a0">JDBC</a> application, can use SPARQL if the SQL string starts with this keyword.</p>
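
<p>As a minimal sketch of the <code>sparql</code> prefix in use, the statement below can be typed at the <code>SQL&gt;</code> prompt.  The query itself is an illustrative assumption, not part of the LUBM scripts; it simply asks for up to five arbitrary triples from whatever graphs the server currently holds.</p>

<blockquote>
<pre>SQL&gt; sparql 
   SELECT * 
    WHERE { ?s ?p ?o } 
    LIMIT 5;
</pre></blockquote>

<p>The same string, starting with the <code>sparql</code> keyword, could be passed as the statement text from an ODBC or JDBC client.</p>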

<p>The <code>lubm-load.sql</code> file is quite self-explanatory. It begins by defining an SQL procedure, <code>load_lubm</code>, that calls the RDF/XML load function, <code>DB..RDF_LOAD_RDFXML</code>, for each file in a directory.</p>

<p>Next it clears the <code>&lt;lubm&gt;</code> and <code>&lt;inf&gt;</code> graphs and calls this procedure for the <code>lubm_8000</code> directory under the server&#39;s working directory.</p>

<blockquote>
<pre>sparql 
   CLEAR GRAPH &lt;lubm&gt;;

sparql 
   CLEAR GRAPH &lt;inf&gt;;

load_lubm ( server_root() || &#39;/lubm_8000/&#39; );
</pre></blockquote>

<p>Then it verifies that the right number of triples is found in the <code>&lt;lubm&gt;</code> graph.</p>

<blockquote>
<pre>sparql 
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.</p>

<p>Then it adds some implied <code>subOrganizationOf</code> triples.  This is part of setting up the LUBM test database.</p>

<blockquote>
<pre>sparql 
   PREFIX  ub:  &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   INSERT 
      INTO GRAPH &lt;lubm&gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &lt;lubm&gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
</pre></blockquote>

<p>Then it loads the ontology file, <code>inf.nt</code>, using the Turtle load function, <code>DB.DBA.TTLP</code>.  The arguments of the function are the text to load, the base URI used to resolve relative URIs in that text, and the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id15835550">URI</a> of the target graph.</p>

<blockquote>
<pre>DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), 
              &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, 
              &#39;inf&#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &lt;inf&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>Then we declare that the triples in the <code>&lt;inf&gt;</code> graph can be used for inference at run time.  This declaration has no effect by itself: a SPARQL query must explicitly declare that it uses the <code>&#39;inft&#39;</code> rule set.</p>

<blockquote>
<pre>rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;);
</pre></blockquote>

<p>Finally, the script issues a checkpoint to finalize the work and truncate the transaction log.  The server would also eventually do this in its own time.</p>

<blockquote>
<pre>checkpoint;
</pre></blockquote>

<p>Now we are ready for querying.</p>

<h2>Querying the Data</h2> 

<p>The queries are given in three versions. The first file, <code>lubm.sql</code>, has the queries with most inference open-coded as <code>UNION</code>s. The second file, <code>lubm-inf.sql</code>, has the inference performed at run time using the ontology <a href="http://dbpedia.org/resource/Information" id="link-id1109faf0">information</a> in the <code>&lt;inf&gt;</code> graph we just loaded.  The last, <code>lubm-phys.sql</code>, relies on having the entailed triples physically present in the <code>&lt;lubm&gt;</code> graph.  These entailed triples are inserted by the SPARUL commands in the <code>lubm-cp.sql</code> file.</p>
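
<p>As an illustration of the run-time inference variant, a query opts into a rule set with Virtuoso&#39;s <code>define input:inference</code> pragma.  The sketch below assumes the <code>&#39;inft&#39;</code> rule set declared during loading; the choice of <code>ub:Professor</code> as the class is an assumption made for illustration and is not copied from <code>lubm-inf.sql</code>.</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &#39;inft&#39;
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x  a  ub:Professor } ;
</pre></blockquote>

<p>Without the pragma, the same query matches only instances typed directly as <code>ub:Professor</code>; with it, instances of classes declared as subclasses in the <code>&lt;inf&gt;</code> graph should be counted as well.</p>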

<p>If you wish to run all the commands in a SQL file, you can type <code>load &lt;filename&gt;;</code> (e.g., <code>load lubm-cp.sql;</code>) at the <code>SQL&gt;</code> prompt. If you wish to try individual statements, you can paste them to the command line.</p>

<p>For example: </p>

<blockquote>
<pre>SQL&gt; sparql 
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT * 
     FROM &lt;lubm&gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
</pre></blockquote>


<p>To stop the server, simply type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt.</p>

<p>If you wish to use a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id11384668">SPARQL protocol</a> end point, just enable the HTTP listener.  This is done by adding a stanza like the following to the end of the <code>virtuoso.ini</code> file in the <code>lubm</code> directory:</p>

<blockquote>
<pre>[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
</pre></blockquote>

<p>Then shut down and restart the server (type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt and then <code>virtuoso-t -f &amp;</code> at the shell prompt).</p>

<p>Now you can connect to the end point with a web browser.  The <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id113d02d8">URL</a> is <code>http://localhost:8421/sparql</code>. Without parameters, this will show a human readable form.  With parameters, this will execute SPARQL.</p>

<p>We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, <a href="http://dbpedia.org/resource/PHP" id="link-id142d0ba0">PHP</a>, using the PHP ODBC interface.</p>

<p>To see how to use <a href="http://jena.sourceforge.net/" id="link-id117074f0">Jena</a> or <a href="http://sourceforge.net/projects/sesame/" id="link-id1103c9b0">Sesame</a> with Virtuoso, look at <a href="http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html" id="link-id15488ce8">Native RDF Storage Providers</a>. To see how RDF data types are supported, see <a href="http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf" id="link-id15784a40">Extension datatype for RDF</a>.
</p>

<p>To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do <code>log_enable (2);</code> before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See the <a href="http://docs.openlinksw.com/virtuoso/" id="link-id111410f0">documentation</a> for more.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504">
  <rss:title>Virtuoso RDF:  A Getting Started Guide for the Developer</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-17T12:31:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated. The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces. You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files. To get started, cd into binsrc/tests/lubm. To verify that this works, you can do ./test_server.sh virtuoso-t This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step. Loading the Data The file lubm-load.sql contains the commands for loading the LUBM single university qualification database. The data files themselves are in lubm_8000, 15 files in RDFXML. There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark. So now let&#39;s go through this procedure. Start the server: $ virtuoso-t -f &amp; This starts the server in foreground mode, and puts it in the background of the shell. Now we connect to it with the isql utility. $ isql 1111 dba dba This gives a SQL&gt; prompt. The default username and password are both dba. When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword. The lubm-load.sql file is quite self-explanatory. 
It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory. Next it calls this function for the lubm_8000 directory under the server&#39;s working directory. sparql CLEAR GRAPH &lt;lubm&gt;; sparql CLEAR GRAPH &lt;inf&gt;; load_lubm ( server_root() || &#39;/lubm_8000/&#39; ); Then it verifies that the right number of triples is found in the &lt;lubm&gt; graph. sparql SELECT COUNT(*) FROM &lt;lubm&gt; WHERE { ?x ?y ?z } ; The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now. Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database. sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; INSERT INTO GRAPH &lt;lubm&gt; { ?x ub:subOrganizationOf ?z } FROM &lt;lubm&gt; WHERE { ?x ub:subOrganizationOf ?y . ?y ub:subOrganizationOf ?z . }; Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph. DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, &#39;inf&#39; ) ; sparql SELECT COUNT(*) FROM &lt;inf&gt; WHERE { ?x ?y ?z } ; Then we declare that the triples in the &lt;inf&gt; graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the &#39;inft&#39; rule set. Otherwise this has no effect. rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;); This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time. checkpoint; Now we are ready for querying. Querying the Data The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. 
The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the &lt;inf&gt; graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the &lt;lubm&gt; graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file. If you wish to run all the commands in a SQL file, you can type load &lt;filename&gt;; (e.g., load lubm-cp.sql;) at the SQL&gt; prompt. If you wish to try individual statements, you can paste them to the command line. For example: SQL&gt; sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; SELECT * FROM &lt;lubm&gt; WHERE { ?x a ub:Publication . ?x ub:publicationAuthor &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; }; VARCHAR _______________________________________________________________________ http://www.Department0.University0.edu/AssistantProfessor0/Publication0 http://www.Department0.University0.edu/AssistantProfessor0/Publication1 http://www.Department0.University0.edu/AssistantProfessor0/Publication2 http://www.Department0.University0.edu/AssistantProfessor0/Publication3 http://www.Department0.University0.edu/AssistantProfessor0/Publication4 http://www.Department0.University0.edu/AssistantProfessor0/Publication5 6 Rows. -- 4 msec. To stop the server, simply type shutdown; at the SQL&gt; prompt. If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like [HTTPServer] ServerPort = 8421 ServerRoot = . ServerThreads = 2 to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL&gt; prompt and then virtuoso-t -f &amp; at the shell prompt). Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL. 
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface. To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF. To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>It is a long-standing promise of mine to dispel the false impression that using <a href="http://virtuoso.openlinksw.com/" id="link-id113506d0">Virtuoso</a> to work with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id115d9528">RDF</a> is complicated.</p>

<p>The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.</p>

<p>You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under Emacs (<code>M-x shell</code>) works best, because it makes it convenient to cut and paste between the shell and files, but the open source <code>isql</code> utility should also have command-line editing.</p>

<p>To get started, cd into <code>binsrc/tests/lubm</code>.</p>

<p>To verify that the installation works, run:</p>

<blockquote>
<pre>./test_server.sh virtuoso-t</pre></blockquote>

<p>This tests the server with the LUBM queries and should report 45 tests passed.  After this, we will go through the tests step by step.</p>

<h2>Loading the <a href="http://dbpedia.org/resource/Data" id="link-id10f7bd90">Data</a>
</h2> 

<p>The file <code>lubm-load.sql</code> contains the commands for loading the LUBM single university qualification database.</p>

<p>The data files themselves are in <code>lubm_8000</code>, 15 files in RDF/XML.</p>

<p>There is also a little ontology called <code>inf.nt</code>.  This declares the subclass and subproperty relations used in the benchmark.</p>

<p>So now let&#39;s go through this procedure.</p>

<p>Start the server:</p>

<blockquote>
<pre>$ virtuoso-t -f &amp;
</pre></blockquote>

<p>The <code>-f</code> option starts the server in foreground mode, so that it does not detach as a daemon, and the trailing <code>&amp;</code> puts it in the background of the shell.</p>

<p>Now we connect to it with the isql utility.</p>

<blockquote>
<pre>$ isql 1111 dba dba 
</pre></blockquote>

<p>This gives a <code>SQL&gt;</code> prompt.  The default username and password are both <code>dba</code>.</p>

<p>When a command is <a href="http://dbpedia.org/resource/SQL" id="link-id1176ce70">SQL</a>, it is entered directly.  If it is <a href="http://dbpedia.org/resource/SPARQL" id="link-id156df468">SPARQL</a>, it is prefixed with the keyword <code>sparql</code>.  This is how all the SQL clients work.  Any SQL client, such as any <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id152d0a00">ODBC</a> or <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id157ad6a0">JDBC</a> application, can use SPARQL if the SQL string starts with this keyword.</p>

<p>The <code>lubm-load.sql</code> file is quite self-explanatory. It begins by defining an SQL procedure, <code>load_lubm</code>, that calls the RDF/XML load function, <code>DB..RDF_LOAD_RDFXML</code>, for each file in a directory.</p>

<p>Next it clears the <code>&lt;lubm&gt;</code> and <code>&lt;inf&gt;</code> graphs and calls this procedure for the <code>lubm_8000</code> directory under the server&#39;s working directory.</p>

<blockquote>
<pre>sparql 
   CLEAR GRAPH &lt;lubm&gt;;

sparql 
   CLEAR GRAPH &lt;inf&gt;;

load_lubm ( server_root() || &#39;/lubm_8000/&#39; );
</pre></blockquote>

<p>Then it verifies that the right number of triples is found in the <code>&lt;lubm&gt;</code> graph.</p>

<blockquote>
<pre>sparql 
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.</p>

<p>Then it adds some implied <code>subOrganizationOf</code> triples.  This is part of setting up the LUBM test database.</p>

<blockquote>
<pre>sparql 
   PREFIX  ub:  &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   INSERT 
      INTO GRAPH &lt;lubm&gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &lt;lubm&gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
</pre></blockquote>

<p>Then it loads the ontology file, <code>inf.nt</code>, using the Turtle load function, <code>DB.DBA.TTLP</code>.  The arguments of the function are the text to load, the base URI used to resolve relative URIs, and the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id15835550">URI</a> of the target graph.  The count query that follows checks how many triples ended up in the <code>&lt;inf&gt;</code> graph.</p>

<blockquote>
<pre>DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), 
              &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, 
              &#39;inf&#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &lt;inf&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>Then we declare that the triples in the <code>&lt;inf&gt;</code> graph can be used for inference at run time.  The declaration by itself changes nothing: inference is applied only to SPARQL queries that explicitly declare that they use the <code>&#39;inft&#39;</code> rule set.</p>

<blockquote>
<pre>rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;);
</pre></blockquote>
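
<p>A query opts in to the rule set by naming it in a <code>DEFINE input:inference</code> pragma placed before the query text.  The following is a minimal sketch rather than one of the LUBM queries; it assumes that <code>inf.nt</code> declares the LUBM class hierarchy, so that, for example, an <code>ub:AssistantProfessor</code> is also counted as an <code>ub:Professor</code>.</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &quot;inft&quot;
   PREFIX  ub:  &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x  a  ub:Professor } ;
</pre></blockquote>

<p>Running the same query without the <code>DEFINE</code> line should match only subjects explicitly typed as <code>ub:Professor</code>.</p>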

<p>Finally, the script issues a checkpoint, which finalizes the work and truncates the transaction log.  The server would eventually do this on its own.</p>

<blockquote>
<pre>checkpoint;
</pre></blockquote>

<p>Now we are ready for querying.</p>

<h2>Querying the Data</h2> 

<p>The queries are given in three versions. The first file, <code>lubm.sql</code>, has the queries with most inference open-coded as <code>UNIONs</code>. The second file, <code>lubm-inf.sql</code>, has the inference performed at run time using the ontology <a href="http://dbpedia.org/resource/Information" id="link-id1109faf0">information</a> in the <code>&lt;inf&gt;</code> graph we just loaded.  The last, <code>lubm-phys.sql</code>, relies on having the entailed triples physically present in the <code>&lt;lubm&gt;</code> graph.  These entailed triples are inserted by the SPARUL commands in the <code>lubm-cp.sql</code> file.</p>
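
<p>To make the first style concrete, here is a sketch of what open-coding inference as a <code>UNION</code> looks like.  It is not copied from <code>lubm.sql</code>; it assumes the standard LUBM ontology, in which <code>ub:FullProfessor</code>, <code>ub:AssociateProfessor</code>, and <code>ub:AssistantProfessor</code> are among the subclasses of <code>ub:Professor</code>, and it lists only those three for brevity.</p>

<blockquote>
<pre>sparql 
   PREFIX  ub:  &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { { ?x  a  ub:FullProfessor      } 
            UNION 
            { ?x  a  ub:AssociateProfessor } 
            UNION 
            { ?x  a  ub:AssistantProfessor } 
          };
</pre></blockquote>

<p>With run-time inference or with physically present entailed triples, the whole <code>UNION</code> would shrink to the single pattern <code>?x a ub:Professor</code>.</p>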

<p>If you wish to run all the commands in a SQL file, you can type <code>load &lt;filename&gt;;</code> (e.g., <code>load lubm-cp.sql;</code>) at the <code>SQL&gt;</code> prompt. If you wish to try individual statements, you can paste them to the command line.</p>

<p>For example: </p>

<blockquote>
<pre>SQL&gt; sparql 
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT * 
     FROM &lt;lubm&gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
</pre></blockquote>


<p>To stop the server, simply type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt.</p>

<p>If you wish to use a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id11384668">SPARQL protocol</a> end point, just enable the HTTP listener.  This is done by adding a stanza like the following</p>

<blockquote>
<pre>[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
</pre></blockquote>

<p>â to the end of the <code>virtuoso.ini</code> file in the <code>lubm</code> directory.  Then shutdown and restart (type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt and then <code>virtuoso-t -f &amp;</code> at the shell prompt).</p>

<p>Now you can connect to the end point with a web browser.  The <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id113d02d8">URL</a> is <code>http://localhost:8421/sparql</code>. Without parameters, this shows a human-readable query form.  With parameters, it executes the SPARQL query passed in the request.</p>

<p>We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, <a href="http://dbpedia.org/resource/PHP" id="link-id142d0ba0">PHP</a>, using the PHP ODBC interface.</p>

<p>To see how to use <a href="http://jena.sourceforge.net/" id="link-id117074f0">Jena</a> or <a href="http://sourceforge.net/projects/sesame/" id="link-id1103c9b0">Sesame</a> with Virtuoso, look at <a href="http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html" id="link-id15488ce8">Native RDF Storage Providers</a>. To see how RDF data types are supported, see <a href="http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf" id="link-id15784a40">Extension datatype for RDF</a>.
</p>

<p>To work with large volumes of data, you must allocate more memory in the configuration file and use the row-autocommit mode, i.e., do <code>log_enable (2);</code> before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See the <a href="http://docs.openlinksw.com/virtuoso/" id="link-id111410f0">documentation</a> for more.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1503">
  <rss:title>See the Lite:  Embeddable/Background Virtuoso starts at 25MB</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-17T09:34:12Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have received many requests for an embeddable-scale Virtuoso. In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings. With 2MB of disk cache buffers (ini file setting, NumberOfBuffers = 256), the process size stays under 30MB on 32-bit Linux. The value of this is that one can now have RDF and full text indexing on the desktop without running a Java VM or any other memory-intensive software. And of course, all of SQL (transactions, stored procedures, etc.) is in the same embeddably-sized container. The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file. The executable size is about 10MB for 32-bit Linux. A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting &quot;off&quot;; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc. Lite mode offers full SQL and SPARQL/SPARUL (via SPASQL), but disables all HTTP-based services (WebDAV, application hosting, etc.). Clients can still use all typical database access mechanisms (i.e., ODBC, JDBC, OLE-DB, ADO.NET, and XMLA) to connect, including the Jena and Sesame frameworks for RDF. ODBC now offers full support of RDF data types for C-based clients. A Redland-compatible API also exists, for use with Redland v1.0.8 and later. Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost. Shipping an embedded Virtuoso is easy. It just takes one executable and one configuration file. Performance is generally comparable to &quot;normal&quot; mode, except that Lite will be somewhat less scalable on multicore systems. 
The Lite mode will be included in the next Virtuoso 5 Open Source release.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have received many requests for an embeddable-scale <a href="http://virtuoso.openlinksw.com" id="link-id0x1cd69650">Virtuoso</a>.  In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings.  With 2MB of disk cache buffers (ini file setting, <code>NumberOfBuffers = 256</code>), the process size stays under 30MB on 32-bit Linux.</p>

<p>The value of this is that one can now have <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1ce89340">RDF</a> and full text indexing on the desktop without running a Java VM or any other memory-intensive software.  And of course, all of <a href="http://dbpedia.org/resource/SQL" id="link-id0x1cfc9288">SQL</a> (transactions, stored procedures, etc.) is in the same embeddably-sized container.</p>

<p>The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file.  The executable size is about 10MB for 32-bit Linux.  A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting &quot;off&quot;; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc.</p>

<p>Lite mode offers full SQL and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1c511da8">SPARQL</a>/SPARUL (via SPASQL), but disables all <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1dac1950">HTTP</a>-based services (WebDAV, application hosting, etc.).  Clients can still use all typical database access mechanisms (i.e., <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0xb19a488">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1d93ee40">JDBC</a>, OLE-DB, <a href="http://dbpedia.org/resource/ADO.NET" id="link-id0x1ce391c0">ADO</a>.<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0xacf1168">NET</a>, and XMLA) to connect, including the <a href="http://jena.sourceforge.net/" id="link-id0xaaf5b58">Jena</a> and <a href="http://sourceforge.net/projects/sesame/" id="link-id0x1b1e4328">Sesame</a> frameworks for RDF.  ODBC now offers full support of RDF <a href="http://dbpedia.org/resource/Data" id="link-id0x1cfc9f78">data</a> types for <a href="http://dbpedia.org/resource/C%2B%2B" id="link-id0xa6059d8">C</a>-based clients.  A Redland-compatible API also exists, for use with Redland v1.0.8 and later. </p>

<p>Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost.</p>

<p>Shipping an embedded Virtuoso is easy.  It just takes one executable and one configuration file.  Performance is generally comparable to &quot;normal&quot; mode, except that Lite will be somewhat less scalable on multicore systems.</p>

<p>The Lite mode will be included in the next Virtuoso 5 Open Source release.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1502">
  <rss:title>See the Lite:  Embeddable/Background Virtuoso starts at 25MB</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-17T09:34:12Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have received many requests for an embeddable-scale Virtuoso. In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings. With 2MB of disk cache buffers (ini file setting, NumberOfBuffers = 256), the process size stays under 30MB on 32-bit Linux. The value of this is that one can now have RDF and full text indexing on the desktop without running a Java VM or any other memory-intensive software. And of course, all of SQL (transactions, stored procedures, etc.) is in the same embeddably-sized container. The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file. The executable size is about 10MB for 32-bit Linux. A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting &quot;off&quot;; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc. Lite mode offers full SQL and SPARQL/SPARUL (via SPASQL), but disables all HTTP-based services (WebDAV, application hosting, etc.). Clients can still use all typical database access mechanisms (i.e., ODBC, JDBC, OLE-DB, ADO.NET, and XMLA) to connect, including the Jena and Sesame frameworks for RDF. ODBC now offers full support of RDF data types for C-based clients. A Redland-compatible API also exists, for use with Redland v1.0.8 and later. Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost. Shipping an embedded Virtuoso is easy. It just takes one executable and one configuration file. Performance is generally comparable to &quot;normal&quot; mode, except that Lite will be somewhat less scalable on multicore systems. 
The Lite mode will be included in the next Virtuoso 5 Open Source release.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have received many requests for an embeddable-scale <a href="http://virtuoso.openlinksw.com" id="link-id0xa5aa1b38">Virtuoso</a>.  In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings.  With 2MB of disk cache buffers (ini file setting, <code>NumberOfBuffers = 256</code>), the process size stays under 30MB on 32-bit Linux.</p>

<p>The value of this is that one can now have <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1db79ac8">RDF</a> and full text indexing on the desktop without running a Java VM or any other memory-intensive software.  And of course, all of <a href="http://dbpedia.org/resource/SQL" id="link-id0xa923298">SQL</a> (transactions, stored procedures, etc.) is in the same embeddably-sized container.</p>

<p>The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file.  The executable size is about 10MB for 32-bit Linux.  A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting &quot;off&quot;; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc.</p>

<p>Lite mode offers full SQL and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1b388830">SPARQL</a>/SPARUL (via SPASQL), but disables all <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1d56b618">HTTP</a>-based services (WebDAV, application hosting, etc.).  Clients can still use all typical database access mechanisms (i.e., <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x1c5abc38">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1dade1f8">JDBC</a>, OLE-DB, <a href="http://dbpedia.org/resource/ADO.NET" id="link-id0x25d8e0f0">ADO</a>.<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x1d7a1a28">NET</a>, and XMLA) to connect, including the <a href="http://jena.sourceforge.net/" id="link-id0x1d929b98">Jena</a> and <a href="http://sourceforge.net/projects/sesame/" id="link-id0x1b7a9088">Sesame</a> frameworks for RDF.  ODBC now offers full support of RDF <a href="http://dbpedia.org/resource/Data" id="link-id0xaf62aa0">data</a> types for <a href="http://dbpedia.org/resource/C%2B%2B" id="link-id0xa8784b0">C</a>-based clients.  A Redland-compatible API also exists, for use with Redland v1.0.8 and later. </p>

<p>Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost.</p>

<p>Shipping an embedded Virtuoso is easy.  It just takes one executable and one configuration file.  Performance is generally comparable to &quot;normal&quot; mode, except that Lite will be somewhat less scalable on multicore systems.</p>

<p>The Lite mode will be included in the next Virtuoso 5 Open Source release.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-16#1499">
  <rss:title>&quot;E Pluribus Unum&quot;, or &quot;Inversely Functional Identity&quot;, or &quot;Smooshing Without the Stickiness&quot; (re-updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-16T14:14:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable. This is complex, so I will begin with the point and the interested may read on for the details and implications. Starting with soon to be released version 6, Virtuoso allows you to say that two things, if they share a uniquely identifying property, are the same. Examples of uniquely identifying properties would be a book&#39;s ISBN number, or a person&#39;s social security plus full name. In relational language this is a unique key, and in RDF parlance, an inverse functional property. In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut. In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set. On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems. To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. 
This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness. Our ontology is defined like this: -- Populate a named graph with the triples you want to use in query time inferencing ttlp ( &#39; @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; . @prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; . foaf:mbox_sha1sum a owl:InverseFunctionalProperty . foaf:name a owl:InverseFunctionalProperty . &#39;, &#39;xx&#39;, &#39;b3sifp&#39; ); -- Declare that the graph contains an ontology for use in query time inferencing rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;, &#39;b3sifp&#39; ); Then use it: sparql DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; SELECT DISTINCT ?k ?f1 ?f2 WHERE { ?k foaf:name ?n . ?n bif:contains &quot;&#39;Kjetil Kjernsmo&#39;&quot; . ?k foaf:knows ?f1 . 
?f1 foaf:knows ?f2 }; VARCHAR VARCHAR VARCHAR ______________________________________ _______________________________________________ ______________________________ http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin .... Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same. Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both. If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT. 
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances. The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others. The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty. Enough talk. Time for an experiment. 
We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them: Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person. Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think. For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant. The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse. We count the triples that enter into the smoosh: -- the name is an existence because else we&#39;d get several times more due to -- the names occurring in many graphs sparql SELECT COUNT(*) WHERE { { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } . FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) . ?person ?p ?o }; -- We get 3284674 We make a few tables for intermediate results. 
-- For each distinct name, gather the properties and objects from -- all subjects with this name CREATE TABLE name_prop ( np_name ANY, np_p IRI_ID_8, np_o ANY, PRIMARY KEY ( np_name, np_p, np_o ) ); ALTER INDEX name_prop ON name_prop PARTITION ( np_name VARCHAR (-1, 0hexffff) ); -- Map from name to canonical IRI used for the name CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY, ni_s IRI_ID_8 ); ALTER INDEX name_iri ON name_iri PARTITION ( ni_name VARCHAR (-1, 0hexffff) ); -- Map from person IRI to canonical person IRI CREATE TABLE pref_iri ( i IRI_ID_8, pref IRI_ID_8, PRIMARY KEY ( i ) ); ALTER INDEX pref_iri ON pref_iri PARTITION ( i INT (0hexffff00) ); -- a table for the materialization where all aliases get all properties of every other CREATE TABLE smoosh_ct ( s IRI_ID_8, p IRI_ID_8, o ANY, PRIMARY KEY ( s, p, o ) ); ALTER INDEX smoosh_ct ON smoosh_ct PARTITION ( s INT (0hexffff00) ); -- disable transaction log and enable row auto-commit. This is necessary, otherwise -- bulk operations are done transactionally and they will run out of rollback space. LOG_ENABLE (2); -- Gather all the properties of all persons with a name under that name. -- INSERT SOFT means that duplicates are ignored INSERT SOFT name_prop SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; FROM ( sparql DEFINE output:valmode &quot;LONG&quot; SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . 
?x ?p ?o } ) xx ; -- Now choose for each name the canonical IRI INSERT INTO name_iri SELECT np_name, ( SELECT MIN (s) FROM rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ) AS mini FROM name_prop WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- For each person IRI, map to the canonical IRI of that person INSERT SOFT pref_iri (i, pref) SELECT s, ni_s FROM name_iri, rdf_quad WHERE o = ni_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- Make a graph where all persons have one iri with all the properties of all aliases -- and where person-to-person refs are canonicalized INSERT SOFT rdf_quad (g,s,p,o) SELECT IRI_TO_ID (&#39;psmoosh&#39;), ni_s, np_p, COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o ) FROM name_prop, name_iri WHERE ni_name = np_name OPTION ( loop, quietcast ) ; -- A little explanation: The properties of names are copied into rdf_quad with the name -- replaced with its canonical IRI. If the object has a canonical IRI, this is used as -- the object, else the object is unmodified. This is the COALESCE with the sub-query. -- This takes a little time. To check on the progress, take another connection to the -- server and do STATUS (&#39;cluster&#39;); -- It will return something like -- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i -- buffers 549481 253929 d 8 w 0 pfs -- Now finalize the state; this makes it permanent. Else the work will be lost on server -- failure, since there was no transaction log CL_EXEC (&#39;checkpoint&#39;); -- See what we got sparql SELECT COUNT (*) FROM &lt;psmoosh&gt; WHERE {?s ?p ?o}; -- This is 2253102 -- Now make the copy where all have the properties of all synonyms. This takes so much -- space we do not insert it as RDF quads, but make a special table for it so that we can -- run some statistics. This saves time. 
INSERT SOFT smoosh_ct (s, p, o) SELECT s, np_p, np_o FROM name_prop, rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- as above, INSERT SOFT so as to ignore duplicates SELECT COUNT (*) FROM smoosh_ct; -- This is 167360324 -- Find out where the bloat comes from SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC; The results are: 54728777 http://www.w3.org/2002/07/owl#sameAs 48543153 http://xmlns.com/foaf/0.1/knows 13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso 12268512 http://xmlns.com/foaf/0.1/interest 11415867 http://xmlns.com/foaf/0.1/nick 6683963 http://xmlns.com/foaf/0.1/weblog 6650093 http://xmlns.com/foaf/0.1/depiction 4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum 4129629 http://xmlns.com/foaf/0.1/homepage 1776555 http://xmlns.com/foaf/0.1/holdsAccount 1219525 http://xmlns.com/foaf/0.1/based_near 305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 274965 http://xmlns.com/foaf/0.1/name 155131 http://xmlns.com/foaf/0.1/dateOfBirth 153001 http://xmlns.com/foaf/0.1/img 111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR 52930 http://xmlns.com/foaf/0.1/gender 48517 http://www.w3.org/2004/02/skos/core#subject 45697 http://www.w3.org/2000/01/rdf-schema#label 44860 http://purl.org/vocab/bio/0.1/olb Now compare with the predicate distribution of the smoosh with identities canonicalized sparql SELECT COUNT (*) ?p FROM &lt;psmoosh&gt; WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY 1 DESC LIMIT 20; Results are: 748311 http://xmlns.com/foaf/0.1/knows 548391 http://xmlns.com/foaf/0.1/interest 140531 http://www.w3.org/2000/01/rdf-schema#seeAlso 105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 78497 http://xmlns.com/foaf/0.1/name 48099 http://www.w3.org/2004/02/skos/core#subject 45179 http://xmlns.com/foaf/0.1/depiction 40229 http://www.w3.org/2000/01/rdf-schema#comment 38272 http://www.w3.org/2000/01/rdf-schema#label 37378 http://xmlns.com/foaf/0.1/nick 37186 http://dbpedia.org/property/abstract 
34003 http://xmlns.com/foaf/0.1/img 26182 http://xmlns.com/foaf/0.1/homepage 23795 http://www.w3.org/2002/07/owl#sameAs 17651 http://xmlns.com/foaf/0.1/mbox_sha1sum 17430 http://xmlns.com/foaf/0.1/dateOfBirth 15586 http://xmlns.com/foaf/0.1/page 12869 http://dbpedia.org/property/reference 12497 http://xmlns.com/foaf/0.1/weblog 12329 http://blogs.yandex.ru/schema/foaf/school We can drop the owl:sameAs triples from the count, so the bloat is a bit less by that but it still is tens of times larger than the canonicalized copy or the initial state. Now, when we try using the psmoosh graph, we still get different results from the results with the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. The exist: sparql SELECT COUNT (*) WHERE { ?s foaf:knows ?thing . FILTER ( !bif:exists ( SELECT (1) WHERE { ?thing foaf:name ?nn } ) ) }; -- 1393940 So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here. Finally, we calculate the smoosh blow up factors. We do not include owl:sameAs triples in the counts. select (167360324 - 54728777) / 3284674.0; 34.290022997716059 select 2229307 / 3284674.0; = 0.678699621332284 So, to get a smoosh that is not really the equivalent of the original, either multiply the original triple count by 34 or 0.68, depending on whether synonyms are collapsed or not. Making the smooshes does not take very long, some minutes for the small one. Inserting the big one would be longer, a couple of hours maybe. It was 33 minutes for filling the smoosh_ct table. The metrics were not with optimal tuning so the performance numbers just serve to show that smooshing takes time. Probably more time than allowable in an interactive situation, no matter how the process is optimized.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>What a terrible word, smooshing...  I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other.  This smooshes them together, makes them interchangeable.</p>

<p>This is complex, so I will begin with the point, and the interested may read on for the details and implications.  Starting with the soon-to-be-released version 6, <a href="http://virtuoso.openlinksw.com" id="link-id15718cb8">Virtuoso</a> allows you to say that two things are the same if they share a uniquely identifying property.  Examples of uniquely identifying properties would be a book&#39;s ISBN, or a person&#39;s social security number plus full name.  In relational language this is a <i>unique key</i>, and in <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id145ed998">RDF</a> parlance, an <i>inverse functional property</i>.</p>

<p>In most systems, such problems are dealt with as a preprocessing step before querying.  For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules.  This is good if the rules are clear and understood.  This is so in closed situations, where things tend to have standard identifiers to begin with.  But on the open web this is not so clear cut.</p>

<p>In this post, we show how to do these things <i>ad hoc</i>, without materializing anything.  At the end, we also show how to materialize identity and what the consequences of this are with open web <a href="http://dbpedia.org/resource/Data" id="link-id11726358">data</a>.  We use real live web crawls from the <a href="http://challenge.semanticweb.org/" id="link-id14f40448">Billion Triples Challenge</a> data set.</p>

<p>On the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id156e2b10">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id1106ce08">web</a>, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated.  But this is only the beginning of the problems.</p>

<p>To address these, we have added the option of specifying that some property will be considered inversely functional in a query.  This is done at run time and the property does not really have to be inversely functional in the pure sense.  <code>foaf:name</code> will do for an example.  This simply means that for purposes of the query concerned, two subjects which have at least one <code>foaf:name</code> in common are considered the same. In this way, we can join between FOAF files.  With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness.</p>

<p>Our ontology is defined like this:</p>

<blockquote>
<pre>-- Populate a named graph with the triples you want to use in query time inferencing<br />
ttlp ( &#39;
        @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
        @prefix owl:  &lt;http://www.w3.org/2002/07/owl#&gt; .
        foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty  .
        foaf:name          a  owl:InverseFunctionalProperty  .
       &#39;,
       &#39;xx&#39;,
       &#39;b3sifp&#39;
     );<br />
-- Declare that the graph contains an ontology for use in query time inferencing <br />
rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;,
                &#39;b3sifp&#39;
              );
</pre></blockquote>

<p>Then use it:</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot;  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };<br />
VARCHAR                                  VARCHAR                                           VARCHAR
______________________________________   _______________________________________________   ______________________________<br />
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
</pre></blockquote>

<p>Without the inference, we get no matches.  This is because the data in question has one graph per FOAF file, and blank nodes for persons.  No graph references any person outside the ones in the graph.  So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there.  The declaration in the context named <code>b3sifp</code> just means that all things with a matching <code>foaf:name</code> or <code>foaf:mbox_sha1sum</code> are the same.</p>

<p>Sameness means that two are the same for purposes of <code>DISTINCT</code> or <code>GROUP BY</code>, and if two are the same, then both have the <code>UNION</code> of all of the properties of both.</p>

<p>If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for <code>DISTINCT</code>.</p>
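<p>The difference between sameness and a naive smoosh can be illustrated outside the database.  The following is a minimal Python sketch, not Virtuoso&#39;s implementation: the toy triples, the function names, and the choice of the smallest subject as representative are all assumptions made for illustration.  Subjects that share a value of the inverse functional property fall into one identity class, so a distinct count sees one person, and that person carries the union of the properties of all aliases.</p>

```python
from collections import defaultdict

# Hypothetical toy triples: two blank nodes from different FOAF files share a name.
TRIPLES = [
    ("_:a1", "foaf:name", "Alice"),
    ("_:a1", "foaf:knows", "_:b1"),
    ("_:a2", "foaf:name", "Alice"),
    ("_:a2", "foaf:interest", "RDF"),
    ("_:b1", "foaf:name", "Bob"),
]


def same_as_classes(triples, ifp="foaf:name"):
    """Map each subject to a representative: the smallest subject in its class."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_value = defaultdict(list)
    for s, p, o in triples:
        find(s)  # register every subject, even those without the property
        if p == ifp:
            by_value[o].append(s)
    for subjects in by_value.values():
        for other in subjects[1:]:
            a, b = sorted((find(subjects[0]), find(other)))
            parent[b] = a  # the smaller root stays the representative
    return {s: find(s) for s in list(parent)}


def merged_properties(triples, classes):
    """Union of (predicate, object) pairs per identity class."""
    merged = defaultdict(set)
    for s, p, o in triples:
        merged[classes[s]].add((p, o))
    return dict(merged)


classes = same_as_classes(TRIPLES)
distinct_people = set(classes.values())
print(sorted(distinct_people))
print(sorted(merged_properties(TRIPLES, classes)["_:a1"]))
```

<p>In this sketch the two Alice nodes count as one individual.  A naive smoosh would instead copy the merged properties onto both nodes and still report two distinct subjects.</p>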

<p>If we have complex application rules for determining whether individuals are the same, then we can materialize <code>owl:sameAs</code> triples and keep them in a separate graph.  In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.</p>

<p>The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up.  Best and worst cases will always depend on the data.  In rough terms, the more <i>ad hoc</i> the use, the less desirable the materialization.  If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best.  We can do that too, but so can others.</p>

<p>The principal point is about agility as concerns the inference.  Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble.  When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal.  Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion <a href="http://dbpedia.org/resource/Blog" id="link-id1156a550">blog</a> posts or something of the sort.  So queries will typically be selective.  Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. </p>

<p>The exception is corpus-wide statistics queries.  But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound.  And when things are disk bound, size matters.  Reading extra entailment on the way is just a performance penalty.</p>

<p>Enough talk. Time for an experiment.  We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:</p>

<ol>
<li>Resolve identity at insert time.  We remove duplicate person URIs, and give the single <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id11317008">URI</a> all the properties of all the duplicate URIs.  We expect these properties to be mostly repeats.  If a person references another person, we normalize this reference to go to the single URI of the referenced person.</li>

<li>Give every duplicate URI of a person all the properties of all the duplicates.  If these are the same value, the data should not get much bigger, or so we think.</li>
</ol>

<p>For the experiment, we will consider two people the same if they have the same <code>foaf:name</code> and are both instances of <code>foaf:Person</code>.  This produces some false matches, but they should not be statistically significant.</p>
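<p>Before the real script, the two materializations can be compared on a toy data set.  This is a hedged Python sketch rather than the method used in the experiment: the data and helper names are hypothetical, every subject is assumed to carry exactly one name, and the canonical subject is simply the smallest identifier per name.</p>

```python
from collections import defaultdict

NAME = "foaf:name"

# Hypothetical toy data: three aliases of one person, each knowing someone different.
TRIPLES = [
    ("a1", NAME, "Alice"), ("a1", "foaf:knows", "b1"),
    ("a2", NAME, "Alice"), ("a2", "foaf:knows", "b2"),
    ("a3", NAME, "Alice"), ("a3", "foaf:knows", "c1"),
    ("b1", NAME, "Bob"),
    ("b2", NAME, "Bob"),
    ("c1", NAME, "Carol"),
]


def canonical_map(triples):
    """Map every named subject to the smallest subject carrying the same name."""
    by_name = defaultdict(set)
    for s, p, o in triples:
        if p == NAME:
            by_name[o].add(s)
    return {s: min(group) for group in by_name.values() for s in group}


def smoosh_canonical(triples):
    """Option 1: one subject per person, references rewritten to the canonical one."""
    pref = canonical_map(triples)
    return {(pref[s], p, pref.get(o, o)) for s, p, o in triples if s in pref}


def smoosh_all_aliases(triples):
    """Option 2: every alias receives the properties of all of its synonyms."""
    pref = canonical_map(triples)
    aliases = defaultdict(set)
    for s, canon in pref.items():
        aliases[canon].add(s)
    return {(alias, p, o)
            for s, p, o in triples if s in pref
            for alias in aliases[pref[s]]}


print(len(TRIPLES), len(smoosh_canonical(TRIPLES)), len(smoosh_all_aliases(TRIPLES)))
# prints: 9 5 15
```

<p>Even in this tiny example, collapsing the aliases shrinks the data while giving every alias everything grows it, because the aliases do not all make the same statements.</p>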

<p>The following is a commented <a href="http://dbpedia.org/resource/SQL" id="link-id110945b0">SQL</a> script performing the smoosh.  We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone.  We use SPARQL where possible for readability.  As the documentation states, <code>iri_to_id</code> converts from the qualified name of an IRI to its ID and <code>id_to_iri</code> does the reverse.</p>

<p>We count the triples that enter into the smoosh:</p>

<blockquote>
<pre>-- The name test is an existence check; a plain join would count several times more 
-- rows, because the same name occurs in many graphs <br />
sparql 
   SELECT COUNT(*) 
    WHERE { { SELECT DISTINCT ?person 
               WHERE { ?person a foaf:Person }
            } . 
            FILTER ( bif:exists ( SELECT (1) 
                                   WHERE { ?person foaf:name ?nn } 
                                )
                       ) . 
            ?person ?p ?o
          };<br />
-- We get 3284674
</pre></blockquote>

<p>We make a few tables for intermediate results.</p>

<blockquote>
<pre>-- For each distinct name, gather the properties and objects from 
-- all subjects with this name <br />
CREATE TABLE name_prop 
   ( np_name  ANY, 
     np_p     IRI_ID_8, 
     np_o     ANY, 
     PRIMARY KEY ( np_name, 
                   np_p, 
                   np_o
                 )
   );
ALTER INDEX name_prop 
   ON name_prop 
   PARTITION ( np_name VARCHAR (-1, 0hexffff) );<br />
-- Map from name to canonical IRI used for the name <br />
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8
                      );
ALTER INDEX name_iri 
   ON name_iri 
   PARTITION ( ni_name VARCHAR (-1, 0hexffff) );<br />
-- Map from person IRI to canonical person IRI<br />
CREATE TABLE pref_iri 
   ( i     IRI_ID_8, 
     pref  IRI_ID_8, 
     PRIMARY KEY ( i )
   );
ALTER INDEX pref_iri 
   ON pref_iri 
   PARTITION ( i INT (0hexffff00) );<br />
-- a table for the materialization where all aliases get all properties of every other <br />
CREATE TABLE smoosh_ct 
   ( s  IRI_ID_8, 
     p  IRI_ID_8, 
     o  ANY, 
     PRIMARY KEY ( s, 
                   p, 
                   o
                 ) 
   );
ALTER INDEX smoosh_ct 
   ON smoosh_ct 
   PARTITION ( s INT (0hexffff00) );<br />
-- disable transaction log and enable row auto-commit.  This is necessary, otherwise 
-- bulk operations are done transactionally and they will run out of rollback space.<br />
LOG_ENABLE (2);<br />
-- Gather all the properties of all persons with a name under that name.  
-- INSERT SOFT means that duplicates are ignored <br />
INSERT SOFT name_prop 
   SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; 
   FROM ( sparql 
          DEFINE output:valmode &quot;LONG&quot; 
          SELECT ?n ?p ?o 
          WHERE { ?x a foaf:Person . 
                 ?x foaf:name ?n . 
                 ?x ?p ?o
               }
        ) xx ;<br />
-- Now choose for each name the canonical IRI <br />
INSERT INTO name_iri 
   SELECT np_name, 
          ( SELECT MIN (s) 
              FROM rdf_quad 
             WHERE o = np_name 
                   AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;)
          ) AS mini 
     FROM name_prop 
    WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- For each person IRI, map to the canonical IRI of that person <br />
INSERT SOFT pref_iri (i, pref) 
   SELECT s, 
          ni_s 
     FROM name_iri, 
          rdf_quad 
    WHERE o = ni_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- Make a graph where all persons have one iri with all the properties of all aliases 
-- and where person-to-person refs are canonicalized<br />
INSERT SOFT rdf_quad (g,s,p,o) 
   SELECT IRI_TO_ID (&#39;psmoosh&#39;), 
          ni_s, 
          np_p, 
          COALESCE ( ( SELECT pref 
                         FROM pref_iri 
                        WHERE i = np_o
                     ), 
                     np_o 
                   )
     FROM name_prop, 
          name_iri 
    WHERE ni_name = np_name 
   OPTION ( loop, quietcast ) ;<br />
-- A little explanation:  The properties of names are copied into rdf_quad with the name 
-- replaced with its canonical IRI.  If the object has a canonical IRI, this is used as 
-- the object, else the object is unmodified.  This is the COALESCE with the sub-query.<br />
-- This takes a little time.  To check on the progress, take another connection to the 
-- server and do <br />
STATUS (&#39;cluster&#39;);<br />
-- It will return something like 
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s  75% cpu 186%  read 12% clw threads 5r 0w 0i 
-- buffers 549481 253929 d 8 w 0 pfs<br />
-- Now finalize the state; this makes it permanent.  Else the work will be lost on server 
-- failure, since there was no transaction log <br />
CL_EXEC (&#39;checkpoint&#39;);<br />
-- See what we got<br />
sparql 
   SELECT COUNT (*) 
     FROM &lt;psmoosh&gt; 
     WHERE {?s ?p ?o};<br />
-- This is 2253102<br />
-- Now make the copy where all have the properties of all synonyms.  This takes so much 
-- space we do not insert it as RDF quads, but make a special table for it so that we can 
-- run some statistics.  This saves time.<br />
INSERT SOFT smoosh_ct (s, p, o)  
   SELECT s, np_p, np_o 
     FROM name_prop, 
          rdf_quad 
    WHERE o = np_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- as above, INSERT SOFT so as to ignore duplicates <br />
SELECT COUNT (*) 
   FROM smoosh_ct;<br />
-- This is  167360324<br />
-- Find out where the bloat comes from <br />
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (p) 
   FROM smoosh_ct 
   GROUP BY p 
   ORDER BY 1 DESC;
</pre></blockquote>
<p>The results are:</p>

<blockquote>
<pre>54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb
</pre></blockquote>

<p>Now compare with the predicate distribution of the smoosh with identities canonicalized:</p>

<blockquote>
<pre>sparql 
     SELECT COUNT (*) ?p 
       FROM &lt;psmoosh&gt; 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;</pre></blockquote>

<p>Results are:</p>
<blockquote>
<pre>748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school
</pre></blockquote>

<p>We can drop the <code>owl:sameAs</code> triples from the count, which reduces the bloat somewhat, but the result is still tens of times larger than either the canonicalized copy or the initial state.</p>
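<p>One way to see why the copy with all synonyms grows so fast is a back-of-the-envelope calculation.  The Python sketch below is an illustration under a stated worst-case assumption, namely that no two aliases of a name repeat a statement; the function name and parameters are hypothetical.  Under that assumption the number of rows for one name grows with the square of the number of aliases.</p>

```python
def smoosh_sizes(aliases, pairs_per_alias):
    """Row counts for one name whose aliases each contribute distinct (p, o) pairs."""
    original = aliases * pairs_per_alias
    union = original  # worst case: no two aliases repeat a statement
    every_alias_gets_all = aliases * union
    return original, union, every_alias_gets_all


for k in (1, 2, 10, 100):
    original, canonical, full = smoosh_sizes(k, 5)
    # columns: aliases, original rows, canonical rows (upper bound), full rows, growth
    print(k, original, canonical, full, full // original)
```

<p>With 100 subjects sharing a name, the full copy is 100 times the original for that name, while the canonical copy is never larger than the original.</p>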

<p>Now, when we try using the <code>psmoosh</code> graph, we still get results that differ from those with the original data.  This is because <code>foaf:knows</code> relations to things with no <code>foaf:name</code> are not represented in the smoosh.  They exist:</p>

<blockquote>
<pre>sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };<br />
-- 1393940
</pre></blockquote>

<p>So the smoosh graph is not an accurate rendition of the social network.  It would have to be smooshed further to be that, since the data in the sample is quite irregular.  But we do not go that far here.</p>
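<p>The check above can be mirrored on a toy data set.  This is a small Python sketch with hypothetical data and names, intended only to show what the count measures: <code>foaf:knows</code> edges whose target never appears as the subject of a <code>foaf:name</code>.</p>

```python
KNOWS = "foaf:knows"
NAME = "foaf:name"

# Hypothetical toy data; x9 is referenced but has no foaf:name anywhere.
TRIPLES = [
    ("a1", NAME, "Alice"),
    ("a1", KNOWS, "b1"),
    ("a1", KNOWS, "x9"),
    ("b1", NAME, "Bob"),
    ("x9", KNOWS, "a1"),
]


def knows_without_named_target(triples):
    """Return the foaf:knows edges whose object has no foaf:name."""
    named = {s for s, p, _ in triples if p == NAME}
    return [(s, o) for s, p, o in triples if p == KNOWS and o not in named]


dangling = knows_without_named_target(TRIPLES)
print(len(dangling), dangling)
# prints: 1 [('a1', 'x9')]
```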

<p>Finally, we calculate the smoosh blow up factors.  We do not include <code>owl:sameAs</code> triples in the counts.</p>

<blockquote>
<pre>select (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059<br />
select 2229307 / 3284674.0;
-- 0.678699621332284
</pre></blockquote>

<p>So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 34 if every synonym keeps all the properties of the others, or by 0.68 if synonyms are collapsed into one canonical URI.</p>
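<p>For readers who want to retrace the arithmetic, the following Python sketch recomputes both factors from the counts reported earlier in this post.  The constant names are mine; the numbers are the ones printed above.  Note that 2229307 is the <code>psmoosh</code> count of 2253102 minus its 23795 <code>owl:sameAs</code> triples.</p>

```python
ORIGINAL = 3284674          # triples entering the smoosh
FULL_SMOOSH = 167360324     # rows in smoosh_ct
FULL_SAME_AS = 54728777     # owl:sameAs rows in smoosh_ct
CANONICAL = 2253102         # triples in the psmoosh graph
CANONICAL_SAME_AS = 23795   # owl:sameAs triples in the psmoosh graph

# Blow-up factors with owl:sameAs triples left out of the smooshed counts.
full_factor = (FULL_SMOOSH - FULL_SAME_AS) / ORIGINAL
canonical_factor = (CANONICAL - CANONICAL_SAME_AS) / ORIGINAL

print(round(full_factor, 2), round(canonical_factor, 2))
# prints: 34.29 0.68
```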

<p>Making the smooshes does not take very long, some minutes for the small one.  Inserting the big one as RDF quads would take longer, a couple of hours maybe.  It was 33 minutes for filling the <code>smoosh_ct</code> table.  The measurements were not made with optimal tuning, so the performance numbers just serve to show that smooshing takes time, probably more time than is allowable in an interactive situation, no matter how the process is optimized.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498">
  <rss:title>&quot;E Pluribus Unum&quot;, or &quot;Inversely Functional Identity&quot;, or &quot;Smooshing Without the Stickiness&quot; (re-updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-16T14:14:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable. This is complex, so I will begin with the point and the interested may read on for the details and implications. Starting with soon to be released version 6, Virtuoso allows you to say that two things, if they share a uniquely identifying property, are the same. Examples of uniquely identifying properties would be a book&#39;s ISBN number, or a person&#39;s social security plus full name. In relational language this is a unique key, and in RDF parlance, an inverse functional property. In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut. In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set. On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems. To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. 
This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness. Our ontology is defined like this: -- Populate a named graph with the triples you want to use in query time inferencing ttlp ( &#39; @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; . @prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; . foaf:mbox_sha1sum a owl:InverseFunctionalProperty . foaf:name a owl:InverseFunctionalProperty . &#39;, &#39;xx&#39;, &#39;b3sifp&#39; ); -- Declare that the graph contains an ontology for use in query time inferencing rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;, &#39;b3sifp&#39; ); Then use it: sparql DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; SELECT DISTINCT ?k ?f1 ?f2 WHERE { ?k foaf:name ?n . ?n bif:contains &quot;&#39;Kjetil Kjernsmo&#39;&quot; . ?k foaf:knows ?f1 . 
?f1 foaf:knows ?f2 }; VARCHAR VARCHAR VARCHAR ______________________________________ _______________________________________________ ______________________________ http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin .... Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same. Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both. If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT. 
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances. The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others. The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty. Enough talk. Time for an experiment. 
We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them: Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person. Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think. For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant. The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse. We count the triples that enter into the smoosh: -- the name is an existence because else we&#39;d get several times more due to -- the names occurring in many graphs sparql SELECT COUNT(*) WHERE { { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } . FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) . ?person ?p ?o }; -- We get 3284674 We make a few tables for intermediate results. 
-- For each distinct name, gather the properties and objects from -- all subjects with this name CREATE TABLE name_prop ( np_name ANY, np_p IRI_ID_8, np_o ANY, PRIMARY KEY ( np_name, np_p, np_o ) ); ALTER INDEX name_prop ON name_prop PARTITION ( np_name VARCHAR (-1, 0hexffff) ); -- Map from name to canonical IRI used for the name CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY, ni_s IRI_ID_8 ); ALTER INDEX name_iri ON name_iri PARTITION ( ni_name VARCHAR (-1, 0hexffff) ); -- Map from person IRI to canonical person IRI CREATE TABLE pref_iri ( i IRI_ID_8, pref IRI_ID_8, PRIMARY KEY ( i ) ); ALTER INDEX pref_iri ON pref_iri PARTITION ( i INT (0hexffff00) ); -- a table for the materialization where all aliases get all properties of every other CREATE TABLE smoosh_ct ( s IRI_ID_8, p IRI_ID_8, o ANY, PRIMARY KEY ( s, p, o ) ); ALTER INDEX smoosh_ct ON smoosh_ct PARTITION ( s INT (0hexffff00) ); -- disable transaction log and enable row auto-commit. This is necessary, otherwise -- bulk operations are done transactionally and they will run out of rollback space. LOG_ENABLE (2); -- Gather all the properties of all persons with a name under that name. -- INSERT SOFT means that duplicates are ignored INSERT SOFT name_prop SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; FROM ( sparql DEFINE output:valmode &quot;LONG&quot; SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . 
?x ?p ?o } ) xx ; -- Now choose for each name the canonical IRI INSERT INTO name_iri SELECT np_name, ( SELECT MIN (s) FROM rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ) AS mini FROM name_prop WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- For each person IRI, map to the canonical IRI of that person INSERT SOFT pref_iri (i, pref) SELECT s, ni_s FROM name_iri, rdf_quad WHERE o = ni_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- Make a graph where all persons have one iri with all the properties of all aliases -- and where person-to-person refs are canonicalized INSERT SOFT rdf_quad (g,s,p,o) SELECT IRI_TO_ID (&#39;psmoosh&#39;), ni_s, np_p, COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o ) FROM name_prop, name_iri WHERE ni_name = np_name OPTION ( loop, quietcast ) ; -- A little explanation: The properties of names are copied into rdf_quad with the name -- replaced with its canonical IRI. If the object has a canonical IRI, this is used as -- the object, else the object is unmodified. This is the COALESCE with the sub-query. -- This takes a little time. To check on the progress, take another connection to the -- server and do STATUS (&#39;cluster&#39;); -- It will return something like -- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i -- buffers 549481 253929 d 8 w 0 pfs -- Now finalize the state; this makes it permanent. Else the work will be lost on server -- failure, since there was no transaction log CL_EXEC (&#39;checkpoint&#39;); -- See what we got sparql SELECT COUNT (*) FROM &lt;psmoosh&gt; WHERE {?s ?p ?o}; -- This is 2253102 -- Now make the copy where all have the properties of all synonyms. This takes so much -- space we do not insert it as RDF quads, but make a special table for it so that we can -- run some statistics. This saves time. 
INSERT SOFT smoosh_ct (s, p, o) SELECT s, np_p, np_o FROM name_prop, rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- as above, INSERT SOFT so as to ignore duplicates SELECT COUNT (*) FROM smoosh_ct; -- This is 167360324 -- Find out where the bloat comes from SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC; The results are: 54728777 http://www.w3.org/2002/07/owl#sameAs 48543153 http://xmlns.com/foaf/0.1/knows 13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso 12268512 http://xmlns.com/foaf/0.1/interest 11415867 http://xmlns.com/foaf/0.1/nick 6683963 http://xmlns.com/foaf/0.1/weblog 6650093 http://xmlns.com/foaf/0.1/depiction 4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum 4129629 http://xmlns.com/foaf/0.1/homepage 1776555 http://xmlns.com/foaf/0.1/holdsAccount 1219525 http://xmlns.com/foaf/0.1/based_near 305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 274965 http://xmlns.com/foaf/0.1/name 155131 http://xmlns.com/foaf/0.1/dateOfBirth 153001 http://xmlns.com/foaf/0.1/img 111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR 52930 http://xmlns.com/foaf/0.1/gender 48517 http://www.w3.org/2004/02/skos/core#subject 45697 http://www.w3.org/2000/01/rdf-schema#label 44860 http://purl.org/vocab/bio/0.1/olb Now compare with the predicate distribution of the smoosh with identities canonicalized sparql SELECT COUNT (*) ?p FROM &lt;psmoosh&gt; WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY 1 DESC LIMIT 20; Results are: 748311 http://xmlns.com/foaf/0.1/knows 548391 http://xmlns.com/foaf/0.1/interest 140531 http://www.w3.org/2000/01/rdf-schema#seeAlso 105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 78497 http://xmlns.com/foaf/0.1/name 48099 http://www.w3.org/2004/02/skos/core#subject 45179 http://xmlns.com/foaf/0.1/depiction 40229 http://www.w3.org/2000/01/rdf-schema#comment 38272 http://www.w3.org/2000/01/rdf-schema#label 37378 http://xmlns.com/foaf/0.1/nick 37186 http://dbpedia.org/property/abstract 
34003 http://xmlns.com/foaf/0.1/img 26182 http://xmlns.com/foaf/0.1/homepage 23795 http://www.w3.org/2002/07/owl#sameAs 17651 http://xmlns.com/foaf/0.1/mbox_sha1sum 17430 http://xmlns.com/foaf/0.1/dateOfBirth 15586 http://xmlns.com/foaf/0.1/page 12869 http://dbpedia.org/property/reference 12497 http://xmlns.com/foaf/0.1/weblog 12329 http://blogs.yandex.ru/schema/foaf/school We can drop the owl:sameAs triples from the count. That reduces the bloat somewhat, but the result is still tens of times larger than the canonicalized copy or the initial state. Now, when we try using the psmoosh graph, we still get results that differ from those obtained with the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. They exist: sparql SELECT COUNT (*) WHERE { ?s foaf:knows ?thing . FILTER ( !bif:exists ( SELECT (1) WHERE { ?thing foaf:name ?nn } ) ) }; -- 1393940 So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here. Finally, we calculate the smoosh blow-up factors. We do not include owl:sameAs triples in the counts. select (167360324 - 54728777) / 3284674.0; 34.290022997716059 select 2229307 / 3284674.0; = 0.678699621332284 So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 0.68 if synonyms are collapsed into one canonical IRI, or by 34 if every synonym gets all the properties. Making the smooshes does not take very long: some minutes for the small one. Inserting the big one as RDF quads would take longer, a couple of hours maybe; filling the smoosh_ct table instead took 33 minutes. These measurements were not taken with optimal tuning, so the performance numbers serve only to show that smooshing takes time, probably more than is allowable in an interactive situation, no matter how the process is optimized.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>What a terrible word, smooshing...  I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other.  This smooshes them together, makes them interchangeable.</p>

<p>This is complex, so I will begin with the point and the interested may read on for the details and implications.  Starting with the soon-to-be-released version 6, <a href="http://virtuoso.openlinksw.com" id="link-id15718cb8">Virtuoso</a> allows you to say that two things, if they share a uniquely identifying property, are the same.  Examples of uniquely identifying properties would be a book&#39;s ISBN, or a person&#39;s social security number plus full name.  In relational language this is a <i>unique key</i>, and in <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id145ed998">RDF</a> parlance, an <i>inverse functional property</i>.</p>

<p>In most systems, such problems are dealt with as a preprocessing step before querying.  For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules.  This is good if the rules are clear and understood.  This is so in closed situations, where things tend to have standard identifiers to begin with.  But on the open web this is not so clear cut.</p>

<p>In this post, we show how to do these things <i>ad hoc</i>, without materializing anything.  At the end, we also show how to materialize identity and what the consequences of this are with open web <a href="http://dbpedia.org/resource/Data" id="link-id11726358">data</a>.  We use real live web crawls from the <a href="http://challenge.semanticweb.org/" id="link-id14f40448">Billion Triples Challenge</a> data set.</p>

<p>On the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id156e2b10">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id1106ce08">web</a>, descriptions of the same thing arise independently, and with them the need to smoosh if these are to be integrated.  But this is only the beginning of the problems.</p>

<p>To address these, we have added the option of specifying that some property will be considered inversely functional in a query.  This is done at run time and the property does not really have to be inversely functional in the pure sense.  <code>foaf:name</code> will do for an example.  This simply means that for purposes of the query concerned, two subjects which have at least one <code>foaf:name</code> in common are considered the same. In this way, we can join between FOAF files.  With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness.</p>

<p>Our ontology is defined like this:</p>

<blockquote>
<pre>-- Populate a named graph with the triples you want to use in query time inferencing<br />
ttlp ( &#39;
        @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
        @prefix owl:  &lt;http://www.w3.org/2002/07/owl#&gt; .
        foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty  .
        foaf:name          a  owl:InverseFunctionalProperty  .
       &#39;,
       &#39;xx&#39;,
       &#39;b3sifp&#39;
     );<br />
-- Declare that the graph contains an ontology for use in query time inferencing <br />
rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;,
                &#39;b3sifp&#39;
              );
</pre></blockquote>

<p>Then use it:</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot;  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };<br />
VARCHAR                                  VARCHAR                                           VARCHAR
______________________________________   _______________________________________________   ______________________________<br />
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
</pre></blockquote>

<p>Without the inference, we get no matches.  This is because the data in question has one graph per FOAF file, and blank nodes for persons.  No graph references any person outside the ones in the graph.  So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there.  The declaration in the context named <code>b3sifp</code> just means that all things with a matching <code>foaf:name</code> or <code>foaf:mbox_sha1sum</code> are the same.</p>

<p>Sameness means that two are the same for purposes of <code>DISTINCT</code> or <code>GROUP BY</code>, and if two are the same, then both have the <code>UNION</code> of all of the properties of both.</p>

<p>If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for <code>DISTINCT</code>.</p>
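
<p>As a sketch of what this means in practice, the same lookup can be run with and without the rule set.  This reuses the <code>b3sifp</code> rule set declared above; going by the description of sameness, aliases should collapse into one row under <code>DISTINCT</code> only when the inference is enabled.  No result counts are shown, since they depend on the data loaded.</p>

<blockquote>
<pre>-- With the rule set: subjects sharing a foaf:name or foaf:mbox_sha1sum 
-- are treated as one for DISTINCT <br />
sparql 
   DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; 
   SELECT DISTINCT ?k 
   WHERE { ?k  foaf:name     ?n                   . 
           ?n  bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot; 
         };<br />
-- Without the DEFINE line: each alias is a separate subject and a separate row <br />
sparql 
   SELECT DISTINCT ?k 
   WHERE { ?k  foaf:name     ?n                   . 
           ?n  bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot; 
         };
</pre></blockquote>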

<p>If we have complex application rules for determining whether individuals are the same, then one can materialize <code>owl:sameAs</code> triples and keep them in a separate graph.  In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.</p>
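
<p>A minimal sketch of that arrangement follows.  The graph name and the two person IRIs are hypothetical, and it assumes a Virtuoso version that supports the <code>input:same-as</code> pragma for following <code>owl:sameAs</code> links at query time.</p>

<blockquote>
<pre>-- Keep the identity statement in a graph of its own (hypothetical IRIs) <br />
sparql 
   INSERT INTO GRAPH &lt;http://example.com/sameas&gt; 
      { &lt;http://example.com/alice#me&gt; 
           &lt;http://www.w3.org/2002/07/owl#sameAs&gt; 
              &lt;http://example.org/people/alice&gt; 
      };<br />
-- Ask for the properties of one alias, following owl:sameAs links at run time <br />
sparql 
   DEFINE input:same-as &quot;yes&quot; 
   SELECT ?p ?o 
   WHERE { &lt;http://example.com/alice#me&gt; ?p ?o };<br />
-- If the application rules change, drop the links; the original graphs are untouched <br />
sparql 
   CLEAR GRAPH &lt;http://example.com/sameas&gt;;
</pre></blockquote>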

<p>The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up.  Best and worst cases will always depend on the data.  In rough terms, the more <i>ad hoc</i> the use, the less desirable the materialization.  If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best.  We can do that too, but so can others.</p>

<p>The principal point is about agility as concerns the inference.  Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble.  When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal.  Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion <a href="http://dbpedia.org/resource/Blog" id="link-id1156a550">blog</a> posts or something of the sort.  So queries will typically be selective.  Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. </p>

<p>The exception is corpus-wide statistics queries.  But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound.  And when things are disk bound, size matters.  Reading extra entailment on the way is just a performance penalty.</p>

<p>Enough talk. Time for an experiment.  We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:</p>

<ol>
<li>Resolve identity at insert time.  We remove duplicate person URIs, and give the single <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id11317008">URI</a> all the properties of all the duplicate URIs.  We expect these to be most often repeats.  If a person references another person, we normalize this reference to go to the single URI of the referenced person.</li>

<li>Give every duplicate URI of a person all the properties of all the duplicates.  If these are the same value, the data should not get much bigger, or so we think.</li>
</ol>

<p>For the experiment, we will consider two people the same if they have the same <code>foaf:name</code> and are both instances of <code>foaf:Person</code>.  This merges some distinct people who happen to share a name, but these false matches should not be statistically significant.</p>
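
<p>To gauge how many subjects such a rule would merge, one could first list the most widely shared names.  The query below is only a sketch in the style of the other queries in this post and was not part of the original run.  Note that it counts name triples across all graphs, so a person described in several graphs is counted once per graph.</p>

<blockquote>
<pre>-- Names carried by the most foaf:Person subjects; each row is a candidate merge group <br />
sparql 
     SELECT COUNT (*) ?n 
      WHERE { ?x a foaf:Person . 
              ?x foaf:name ?n 
            } 
   GROUP BY ?n 
   ORDER BY 1 DESC 
      LIMIT 20;
</pre></blockquote>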

<p>The following is a commented <a href="http://dbpedia.org/resource/SQL" id="link-id110945b0">SQL</a> script performing the smoosh.  We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone.  We use SPARQL where possible for readability.  As the documentation states, <code>iri_to_id</code> converts from the qualified name of an IRI to its ID and <code>id_to_iri</code> does the reverse.</p>

<p>We count the triples that enter into the smoosh:</p>

<blockquote>
<pre>-- The name is tested with an existence check; otherwise each person would be counted 
-- several times over, because the names occur in many graphs <br />
sparql 
   SELECT COUNT(*) 
    WHERE { { SELECT DISTINCT ?person 
               WHERE { ?person a foaf:Person }
            } . 
            FILTER ( bif:exists ( SELECT (1) 
                                   WHERE { ?person foaf:name ?nn } 
                                )
                       ) . 
            ?person ?p ?o
          };<br />
-- We get 3284674
</pre></blockquote>

<p>We make a few tables for intermediate results.</p>

<blockquote>
<pre>-- For each distinct name, gather the properties and objects from 
-- all subjects with this name <br />
CREATE TABLE name_prop 
   ( np_name  ANY, 
     np_p     IRI_ID_8, 
     np_o     ANY, 
     PRIMARY KEY ( np_name, 
                   np_p, 
                   np_o
                 )
   );
ALTER INDEX name_prop 
   ON name_prop 
   PARTITION ( np_name VARCHAR (-1, 0hexffff) );<br />
-- Map from name to canonical IRI used for the name <br />
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8
                      );
ALTER INDEX name_iri 
   ON name_iri 
   PARTITION ( ni_name VARCHAR (-1, 0hexffff) );<br />
-- Map from person IRI to canonical person IRI<br />
CREATE TABLE pref_iri 
   ( i     IRI_ID_8, 
     pref  IRI_ID_8, 
     PRIMARY KEY ( i )
   );
ALTER INDEX pref_iri 
   ON pref_iri 
   PARTITION ( i INT (0hexffff00) );<br />
-- a table for the materialization where all aliases get all properties of every other <br />
CREATE TABLE smoosh_ct 
   ( s  IRI_ID_8, 
     p  IRI_ID_8, 
     o  ANY, 
     PRIMARY KEY ( s, 
                   p, 
                   o
                 ) 
   );
ALTER INDEX smoosh_ct 
   ON smoosh_ct 
   PARTITION ( s INT (0hexffff00) );<br />
-- disable transaction log and enable row auto-commit.  This is necessary, otherwise 
-- bulk operations are done transactionally and they will run out of rollback space.<br />
LOG_ENABLE (2);<br />
-- Gather all the properties of all persons with a name under that name.  
-- INSERT SOFT means that duplicates are ignored <br />
INSERT SOFT name_prop 
   SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; 
   FROM ( sparql 
          DEFINE output:valmode &quot;LONG&quot; 
          SELECT ?n ?p ?o 
          WHERE { ?x a foaf:Person . 
                 ?x foaf:name ?n . 
                 ?x ?p ?o
               }
        ) xx ;<br />
-- Now choose for each name the canonical IRI <br />
INSERT INTO name_iri 
   SELECT np_name, 
          ( SELECT MIN (s) 
              FROM rdf_quad 
             WHERE o = np_name 
                   AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;)
          ) AS mini 
     FROM name_prop 
    WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- For each person IRI, map to the canonical IRI of that person <br />
INSERT SOFT pref_iri (i, pref) 
   SELECT s, 
          ni_s 
     FROM name_iri, 
          rdf_quad 
    WHERE o = ni_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- Make a graph where all persons have one iri with all the properties of all aliases 
-- and where person-to-person refs are canonicalized<br />
INSERT SOFT rdf_quad (g,s,p,o) 
   SELECT IRI_TO_ID (&#39;psmoosh&#39;), 
          ni_s, 
          np_p, 
          COALESCE ( ( SELECT pref 
                         FROM pref_iri 
                        WHERE i = np_o
                     ), 
                     np_o 
                   )
     FROM name_prop, 
          name_iri 
    WHERE ni_name = np_name 
   OPTION ( loop, quietcast ) ;<br />
-- A little explanation:  The properties of names are copied into rdf_quad with the name 
-- replaced with its canonical IRI.  If the object has a canonical IRI, this is used as 
-- the object, else the object is unmodified.  This is the COALESCE with the sub-query.<br />
-- This takes a little time.  To check on the progress, take another connection to the 
-- server and do <br />
STATUS (&#39;cluster&#39;);<br />
-- It will return something like 
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s  75% cpu 186%  read 12% clw threads 5r 0w 0i 
-- buffers 549481 253929 d 8 w 0 pfs<br />
-- Now finalize the state; this makes it permanent.  Else the work will be lost on server 
-- failure, since there was no transaction log <br />
CL_EXEC (&#39;checkpoint&#39;);<br />
-- See what we got<br />
sparql 
   SELECT COUNT (*) 
     FROM &lt;psmoosh&gt; 
     WHERE {?s ?p ?o};<br />
-- This is 2253102<br />
-- Now make the copy where all have the properties of all synonyms.  This takes so much 
-- space we do not insert it as RDF quads, but make a special table for it so that we can 
-- run some statistics.  This saves time.<br />
INSERT SOFT smoosh_ct (s, p, o)  
   SELECT s, np_p, np_o 
     FROM name_prop, 
          rdf_quad 
    WHERE o = np_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- as above, INSERT SOFT so as to ignore duplicates <br />
SELECT COUNT (*) 
   FROM smoosh_ct;<br />
-- This is  167360324<br />
-- Find out where the bloat comes from <br />
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (p) 
   FROM smoosh_ct 
   GROUP BY p 
   ORDER BY 1 DESC;
</pre></blockquote>
<p>The results are:</p>

<blockquote>
<pre>54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb
</pre></blockquote>

<p>Now compare with the predicate distribution of the smoosh with identities canonicalized:</p>

<blockquote>
<pre>sparql 
     SELECT COUNT (*) ?p 
       FROM &lt;psmoosh&gt; 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;</pre></blockquote>

<p>Results are:</p>
<blockquote>
<pre>748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school
</pre></blockquote>

<p>We can drop the <code>owl:sameAs</code> triples from the count.  That reduces the bloat somewhat, but the result is still tens of times larger than the canonicalized copy or the initial state.</p>
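
<p>To see which subjects account for the bloat within a single predicate, a drill-down along the following lines could be used.  It is a sketch over the <code>smoosh_ct</code> table defined above; <code>foaf:knows</code> is picked because it is the largest predicate after <code>owl:sameAs</code>.  No output is shown, since this query was not part of the original run.</p>

<blockquote>
<pre>-- Subjects with the most foaf:knows rows after every alias got every property <br />
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (s) 
   FROM smoosh_ct 
   WHERE p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/knows&#39;) 
   GROUP BY s 
   ORDER BY 1 DESC;
</pre></blockquote>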

<p>Now, when we try using the psmoosh graph, we still get results that differ from those obtained with the original data.  This is because <code>foaf:knows</code> relations to things with no <code>foaf:name</code> are not represented in the smoosh.  They exist:</p>

<blockquote>
<pre>sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };<br />
-- 1393940
</pre></blockquote>

<p>So the smoosh graph is not an accurate rendition of the social network.  It would have to be smooshed further to be that, since the data in the sample is quite irregular.  But we do not go that far here.</p>

<p>Finally, we calculate the smoosh blow-up factors.  We do not include <code>owl:sameAs</code> triples in the counts.</p>

<blockquote>
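<pre>-- every alias gets all properties: smoosh_ct rows minus owl:sameAs, over the original count
select (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059<br />
-- canonicalized psmoosh graph minus owl:sameAs (2253102 - 23795 = 2229307), over the original count
select 2229307 / 3284674.0;
-- 0.678699621332284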
</pre></blockquote>

<p>So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 0.68 if synonyms are collapsed into one canonical IRI, or by 34 if every synonym gets all the properties.</p>

<p>Making the smooshes does not take very long: some minutes for the small one.  Inserting the big one as RDF quads would take longer, a couple of hours maybe; filling the <code>smoosh_ct</code> table instead took 33 minutes.  These measurements were not taken with optimal tuning, so the performance numbers serve only to show that smooshing takes time, probably more than is allowable in an interactive situation, no matter how the process is optimized.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-11#1495">
  <rss:title>Virtuoso Anytime:  No Query Is Too Complex (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-11T16:13:10Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL end points, should the linked data web gain serious mass and traffic. As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set. How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries? Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done. Here we are looking for the top 10 people whom people claim to know without being known in return, like this: SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/BarackObama 252 http://twitter.com/brianshaler 183 http://twitter.com/newmediajim 101 http://twitter.com/HenryRollins 95 http://twitter.com/wilw 81 http://twitter.com/stevegarfield 78 http://twitter.com/cote 66 mailto:adam.westerski@deri.org 66 mailto:michal.zaremba@deri.org 66 http://twitter.com/dsifry 65 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. 
Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/JasonCalacanis 496 http://twitter.com/Twitterrific 466 http://twitter.com/ev 442 http://twitter.com/BarackObama 356 http://twitter.com/laughingsquid 317 http://twitter.com/gruber 294 http://twitter.com/chrispirillo 259 http://twitter.com/ambermacarthur 224 http://twitter.com/t 219 http://twitter.com/johnedwards 188 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better. But the response time was the same. If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation. One can also use this as a basis for paid services. 
The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set. This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>A persistent argument against the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id1199d5f8">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id116f2730">web</a> has been the cost, scalability, and vulnerability of <a href="http://dbpedia.org/resource/SPARQL" id="link-id14e423c0">SPARQL</a> endpoints, should the linked data web gain serious mass and traffic.</p>

<p>As we are on the brink of hosting the whole <a href="http://dbpedia.org/resource/DBpedia" id="link-id1376a8b0">DBpedia</a> <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id113c8d20">Linked Open Data</a> cloud in <a href="http://virtuoso.openlinksw.com" id="link-id11425a78">Virtuoso</a> Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set.</p>

<p>How can we encourage clever use of <a href="http://dbpedia.org/resource/Data" id="link-id116f1210">data</a>, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?</p>

<p>Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute.  Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness.  So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.</p>
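
<p>As an illustration of setting such a quota from a SQL client: the statement below is an assumption, based on how the Anytime feature is described for Virtuoso 6 (a per-connection <code>result_timeout</code> in milliseconds, with 0 restoring unlimited run time).  Check the documentation of your version before relying on it.</p>

<blockquote>
<pre>-- Assumption: per-connection quota, in milliseconds <br />
SET RESULT_TIMEOUT = 1000;<br />
-- The count-everything query is then expected to return what it reached within 
-- the quota, with the S1TAT message, rather than running to completion <br />
sparql 
   SELECT COUNT (*) 
   WHERE { ?s ?p ?o };<br />
-- Assumption: 0 switches the quota off again <br />
SET RESULT_TIMEOUT = 0;
</pre></blockquote>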

<p>Here we are looking for the top 10 people whom people claim to know without being known in return, like this:</p>

<blockquote>
<pre>SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/BarackObama             252
http://twitter.com/brianshaler             183
http://twitter.com/newmediajim             101
http://twitter.com/HenryRollins            95
http://twitter.com/wilw                    81
http://twitter.com/stevegarfield           78
http://twitter.com/cote                    66
mailto:adam.westerski@deri.org             66
mailto:michal.zaremba@deri.org             66
http://twitter.com/dsifry                  65<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:      1R rnd      0R seq      0P disk  1.346KB /      3 messages<br />
SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/JasonCalacanis          496
http://twitter.com/Twitterrific            466
http://twitter.com/ev                      442
http://twitter.com/BarackObama             356
http://twitter.com/laughingsquid           317
http://twitter.com/gruber                  294
http://twitter.com/chrispirillo            259
http://twitter.com/ambermacarthur          224
http://twitter.com/t                       219
http://twitter.com/johnedwards             188<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:    329R rnd   44.6KR seq    342P disk  638.4KB /     46 messages</pre></blockquote>

<p>The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better.  But the response time was the same.</p>

<p>If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple.  But such queries are not very interesting.  To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks.  The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the <code>GROUP BY</code> to the <code>ORDER BY</code>.  If this again times out, we continue with the next outer layer.  This guarantees that results are delivered if there were any results found for which the query pattern is true.  False results are not produced, except in cases where there is a comparison with a count, since a partial count is smaller than it would be with the full evaluation.</p>

<p>One can also use this as a basis for paid services.  The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.</p>

<p>This system will be deployed on our <a href="http://challenge.semanticweb.org/" id="link-id11500a58">Billion Triples Challenge</a> <a href="http://b3s.openlinksw.com/" id="link-id11683120">demo instance</a> in a few days, after some more testing.  When Virtuoso 6 ships, all <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id1157a500">LOD</a> Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default.  (AMI users will be able to disable the feature, if desired.)  The feature works with Virtuoso 6 in both single server and cluster deployment.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-11#1494">
  <rss:title>Virtuoso Anytime:  No Query Is Too Complex (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-11T16:13:10Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL end points, should the linked data web gain serious mass and traffic. As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set. How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries? Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done. Here we are looking for the top 10 people whom people claim to know without being known in return, like this: SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/BarackObama 252 http://twitter.com/brianshaler 183 http://twitter.com/newmediajim 101 http://twitter.com/HenryRollins 95 http://twitter.com/wilw 81 http://twitter.com/stevegarfield 78 http://twitter.com/cote 66 mailto:adam.westerski@deri.org 66 mailto:michal.zaremba@deri.org 66 http://twitter.com/dsifry 65 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. 
Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/JasonCalacanis 496 http://twitter.com/Twitterrific 466 http://twitter.com/ev 442 http://twitter.com/BarackObama 356 http://twitter.com/laughingsquid 317 http://twitter.com/gruber 294 http://twitter.com/chrispirillo 259 http://twitter.com/ambermacarthur 224 http://twitter.com/t 219 http://twitter.com/johnedwards 188 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better. But the response time was the same. If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation. One can also use this as a basis for paid services. 
The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set. This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>A persistent argument against the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id1199d5f8">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id116f2730">web</a> has been the cost, scalability, and vulnerability of <a href="http://dbpedia.org/resource/SPARQL" id="link-id14e423c0">SPARQL</a> endpoints, should the linked data web gain serious mass and traffic.</p>

<p>As we are on the brink of hosting the whole <a href="http://dbpedia.org/resource/DBpedia" id="link-id1376a8b0">DBpedia</a> <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id113c8d20">Linked Open Data</a> cloud in <a href="http://virtuoso.openlinksw.com" id="link-id11425a78">Virtuoso</a> Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set.</p>

<p>How can we encourage clever use of <a href="http://dbpedia.org/resource/Data" id="link-id116f1210">data</a>, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?</p>

<p>Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute.  Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness.  So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.</p>

<p>Here we are looking for the top 10 people whom people claim to know without being known in return, like this:</p>

<blockquote>
<pre>SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/BarackObama             252
http://twitter.com/brianshaler             183
http://twitter.com/newmediajim             101
http://twitter.com/HenryRollins            95
http://twitter.com/wilw                    81
http://twitter.com/stevegarfield           78
http://twitter.com/cote                    66
mailto:adam.westerski@deri.org             66
mailto:michal.zaremba@deri.org             66
http://twitter.com/dsifry                  65<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:      1R rnd      0R seq      0P disk  1.346KB /      3 messages<br />
SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/JasonCalacanis          496
http://twitter.com/Twitterrific            466
http://twitter.com/ev                      442
http://twitter.com/BarackObama             356
http://twitter.com/laughingsquid           317
http://twitter.com/gruber                  294
http://twitter.com/chrispirillo            259
http://twitter.com/ambermacarthur          224
http://twitter.com/t                       219
http://twitter.com/johnedwards             188<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:    329R rnd   44.6KR seq    342P disk  638.4KB /     46 messages</pre></blockquote>

<p>The first query read all data from disk; the second run started with the working set left by the first and could read some more before time ran out, hence its results were better.  The response time, however, was the same for both.</p>

<p>If a query just loops over consecutive joins, as in basic SPARQL, interrupting the processing after a set time period is simple.  But such queries are not very interesting.  Giving meaningful partial answers with nested aggregation and sub-queries requires some more tricks.  The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and then extend the timeout a bit so that the accumulated results get fed to the next aggregation, for example from the <code>GROUP BY</code> to the <code>ORDER BY</code>.  If this times out again, we continue with the next outer layer.  This guarantees that results are delivered if any results were found for which the query pattern is true.  False results are not produced, except where a result depends on a comparison with a count and the partial count is smaller than it would be with the full evaluation.</p>
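
<p>The layered cutoff can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not Virtuoso's implementation: the <code>Budget</code> class, the function name, and the choice to count the budget in rows processed rather than in seconds (which keeps the example deterministic) are all hypothetical.  The inner layer does the scan, the existence test, and the grouping; when it is cut off, the budget is extended a little so that the next layer out can still rank and return what was accumulated.</p>

```python
from collections import Counter


class Budget:
    """Hypothetical work-unit budget; one unit is spent per row processed."""

    def __init__(self, units):
        self.left = units

    def spend(self):
        """Consume one unit; return False once the budget is used up."""
        if self.left == 0:
            return False
        self.left -= 1
        return True

    def extend(self, units):
        """Grace allowance handed to the next outer layer after a cutoff."""
        self.left += units


def anytime_top_k(knows, k, scan_units, grace_units):
    """Top-k ?celeb by number of unreciprocated foaf:knows claims.

    `knows` is a list of (claimant, celeb) pairs. Returns (rows, complete),
    where `complete` is False if any layer was cut off.
    """
    present = set(knows)  # stands in for the index behind the exists test
    budget = Budget(scan_units)
    counts = Counter()
    complete = True

    # Innermost layer: scan + FILTER + GROUP BY. Stop at the first cutoff.
    for claimant, celeb in knows:
        if not budget.spend():
            complete = False
            break
        if (celeb, claimant) not in present:
            counts[celeb] += 1

    # Next layer out: ORDER BY ... LIMIT over the groups gathered so far,
    # run on a slightly extended budget so partial groups are delivered.
    budget.extend(grace_units)
    ranked = []
    for celeb, n in counts.items():
        if not budget.spend():
            complete = False
            break
        ranked.append((celeb, n))
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked[:k], complete


if __name__ == "__main__":
    data = [("a", "x"), ("b", "x"), ("x", "a"), ("c", "y"), ("d", "x")]
    print(anytime_top_k(data, 10, scan_units=10, grace_units=10))
    print(anytime_top_k(data, 10, scan_units=2, grace_units=10))
```

<p>With the small budget, the sketch still returns a row, flagged as incomplete, and its count is lower than in the full run.  This is the one way in which a partial answer can mislead: a count that has not finished growing.</p>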

<p>One can also use this as a basis for paid services.  The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.</p>

<p>This system will be deployed on our <a href="http://challenge.semanticweb.org/" id="link-id11500a58">Billion Triples Challenge</a> <a href="http://b3s.openlinksw.com/" id="link-id11683120">demo instance</a> in a few days, after some more testing.  When Virtuoso 6 ships, all <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id1157a500">LOD</a> Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default.  (AMI users will be able to disable the feature, if desired.)  The feature works with Virtuoso 6 in both single server and cluster deployment.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-27#1488">
  <rss:title>An Example of RDF Scalability</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-27T11:23:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We hear it to exhaustion, where is RDF scalability? We have been suggesting for a while that this is a solved question. I will here give some concrete numbers to back this. The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task. Below we show how we do 3.3 million random triple lookups per second on two 8 core commodity servers producing complete results, joining across partitions. On a single 4 core server, the figure is about 1 million lookups per second. With a single thread, it is about 250K lookups per second. This is the good case. But even our worse case is quite decent. We took a simple SPARQL query, counting how many people say they reciprocally know each other. In the Billion Triples Challenge data set, there are 25M foaf:knows quads of which 92K are reciprocal. Reciprocal here means that when x knows y in some graph, y knows x in the same or any other graph. SELECT COUNT (*) WHERE { ?p1 foaf:knows ?p2 . ?p2 foaf:knows ?p1 } There is no guarantee that the triple of x knows y is in the same partition as the triple y knows x. Thus the join is randomly distributed, n partitions to n partitions. We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking. Since then, we have corrected this. If run on a single thread, this query would be a loop over all the quads with a predicate of foaf:knows, and an inner loop looking for a quad with 3 of 4 fields given (SPO). If we have a partitioned situation, we have a loop over all the foaf:knows quads in each partition, and an inner lookup looking for the reciprocal foaf:knows quad in whatever partition it may be found. 
We have implemented this with two different message patterns: Centralized: One process reads all the foaf:knows quads from all processes. Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad. Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end. Symmetrical: Each process reads the foaf:knows quads in its partition, and sends a batch of checks to each process that could have the reciprocal foaf:knows quad every 50K quads. At the end, the counts are gathered from all partitions. There is some additional control traffic but we do not go into its details here. Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 Virtuoso instances. The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache. Centralized: 35,543 msec, 728,634 sequential + random lookups per second Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s 316% cpu ... Symmetrical: 7706 msec, 3,360,740 sequential + random lookups per second Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s 1137% cpu ... The second line is the summary from the cluster status report for the duration of the query. The interesting numbers are the KB/s and the %CPU. The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%. The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate. Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation. These figures represent the best and worst cases of distributed JOINing. 
If we have a straight sequence of JOINs, with single pattern optionals and existences and the order in which results are produced is not significant (i.e., there is aggregation, existence test, or ORDER BY), the symmetrical pattern is applicable. On the other hand, if there are multiple triple pattern optionals, complex sub-queries, DISTINCTs in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time. Also, if we must make transitive closures, which can be thought of as an extension of a DISTINCT in a subquery, we must pass the data through a single point before moving the bindings to the next JOIN in the sequence. This happens for example in resolving owl:sameAs at run time. However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, DISTINCTs, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens-of-thousands of variable bindings into a single message exchange. And if there are not that many intermediate results, then single query execution time is not a problem anyhow. For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the foaf:knows relations and then running the foaf:knows relations through the hash. If the hash size is right, a hash lookup is somewhat better than an index lookup. The problem is that when the hash join is not the right solution, it is an expensive mistake: the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing. One problem of hash joins is that they make temporary data structures which, if large, will skew the working set. One must be quite sure of the cardinality before it is safe to try a hash join. 
So we do not do hash joins with RDF, but we do use them sometimes with relational data. These same methods apply to relational data just as well. This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We hear it to exhaustion: where is <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x14e828d8">RDF</a> scalability?  We have been suggesting for a while that this is a solved question.  Here I will give some concrete numbers to back this up.</p>

<p>The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task.</p>

<p>Below we show how we do 3.3 million random triple lookups per second on two 8-core commodity servers producing complete results, joining across partitions. On a single 4-core server, the figure is about 1 million lookups per second.  With a single thread, it is about 250K lookups per second.  This is the good case.  But even our worst case is quite decent.</p>

<p>We took a simple <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x14fef850">SPARQL</a> query, counting how many people say they reciprocally know each other.  In the <a href="http://challenge.semanticweb.org/" id="link-id0x1bca04d0">Billion Triples Challenge</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x1be84e88">data</a> set, there are 25M <code>foaf:knows</code> quads of which 92K are reciprocal. <i>Reciprocal</i> here means that when x knows y in some graph, y knows x in the same or any other graph.</p>

<pre>SELECT COUNT (*) 
WHERE { 
         ?p1  foaf:knows  ?p2  . 
         ?p2  foaf:knows  ?p1 
      }</pre>

<p>There is no guarantee that the triple <code>x knows y</code> is in the same partition as the triple <code>y knows x</code>.  Thus the join is randomly distributed, n partitions to n partitions.</p>

<p>We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking.  Since then, we have corrected this.</p>

<p>If run on a single thread, this query would be a loop over all the quads with a predicate of <code>foaf:knows</code>, and an inner loop looking for a quad with 3 of 4 fields given (<code>SPO</code>). If we have a partitioned situation, we have a loop over all the <code>foaf:knows</code> quads in each partition, and an inner lookup looking for the reciprocal <code>foaf:knows</code> quad in whatever partition it may be found.</p>
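
<p>The single-threaded plan can be sketched as follows.  The sketch rests on simplifying assumptions: graphs are ignored, so each quad is reduced to a (subject, object) pair with the predicate fixed to <code>foaf:knows</code>, and a sorted Python list searched with <code>bisect</code> stands in for the ordered index.  The function name is hypothetical.</p>

```python
from bisect import bisect_left


def count_reciprocal(knows):
    """Count (s, o) pairs for which (o, s) is also present.

    `knows` is an iterable of (subject, object) pairs, all understood to
    carry the predicate foaf:knows; graphs are ignored in this sketch.
    """
    index = sorted(set(knows))  # ordered index on (S, O), P being fixed
    found = 0
    for s, o in index:  # outer loop over every foaf:knows pair
        probe = (o, s)  # inner lookup: S, P and O are all given
        pos = bisect_left(index, probe)
        if pos != len(index) and index[pos] == probe:
            found += 1
    return found


if __name__ == "__main__":
    sample = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
    print(count_reciprocal(sample))
```

<p>As in the SPARQL query, a mutual acquaintance is counted once per direction, so one reciprocal pair contributes two to the total.</p>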

<p>We have implemented this with two different message patterns: </p>

<ol>
 <li>
  <p>
    <b>Centralized:</b> One process reads all the <code>foaf:knows</code> quads from all processes.  Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad.  Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end.</p>
 </li>

<li>
  <p>
    <b>Symmetrical:</b> Each process reads the <code>foaf:knows</code> quads in its partition and, every 50K quads, sends a batch of checks to each process that could hold the reciprocal <code>foaf:knows</code> quad.  At the end, the counts are gathered from all partitions.  There is some additional control traffic, but we do not go into its details here.</p>
</li>
</ol>
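
<p>The two patterns can be compared with a small in-process simulation.  Everything here is an assumption made for illustration: partitions are plain Python sets, quads are placed by a CRC of the subject, graphs are ignored, and the batch size is tiny where the real system uses 50K.  The helper names are hypothetical and the control traffic is left out.</p>

```python
import zlib
from collections import defaultdict


def owner(subject, n):
    """Partition holding pairs with this subject (hypothetical placement rule)."""
    return zlib.crc32(subject.encode("utf-8")) % n


def load(knows, n):
    """Distribute (s, o) pairs over n partitions by subject."""
    parts = [set() for _ in range(n)]
    for s, o in knows:
        parts[owner(s, n)].add((s, o))
    return parts


def send_checks(parts, pairs, batch, stats):
    """Route reciprocal checks for `pairs` to their owning partitions in batches."""
    n = len(parts)
    outbox = defaultdict(list)
    found = 0

    def flush():
        nonlocal found
        for dest, checks in outbox.items():
            stats["messages"] += 1
            # The destination partition counts the hits it finds locally.
            found += sum(1 for check in checks if check in parts[dest])
        outbox.clear()

    for i, (s, o) in enumerate(pairs, 1):
        outbox[owner(o, n)].append((o, s))
        if i % batch == 0:
            flush()
    flush()
    return found


def centralized(parts, batch):
    """One coordinator first reads every pair, then issues all the checks."""
    stats = {"messages": 0, "read_by_coordinator": 0}
    everything = []
    for part in parts:
        everything.extend(sorted(part))
        stats["read_by_coordinator"] += len(part)
    return send_checks(parts, everything, batch, stats), stats


def symmetrical(parts, batch):
    """Each partition scans only its own pairs and issues its own checks."""
    stats = {"messages": 0, "read_by_coordinator": 0}
    total = 0
    for part in parts:
        total += send_checks(parts, sorted(part), batch, stats)
    return total, stats


if __name__ == "__main__":
    parts = load([("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")], 3)
    print(centralized(parts, batch=2))
    print(symmetrical(parts, batch=2))
```

<p>Both functions return the same count.  The difference lies in the data flow: the centralized version pulls every pair through one process before any check is sent, while the symmetrical version never moves the scanned pairs at all.</p>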

<p>Below are the results measured on 2 machines, each with 2 x Xeon 5345 (quad core; total 8 cores) and 16G RAM, and each running 6 <a href="http://virtuoso.openlinksw.com" id="link-id0x16642a90">Virtuoso</a> instances.  The interconnect is dual 1-Gbit Ethernet. Numbers are with warm cache.</p>

<blockquote>
<code>Centralized:  35,543 msec,  728,634 sequential + random lookups per second <br />
Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s  316% cpu ...
 <br /> <br />
Symmetrical:  7706 msec, 3,360,740 sequential + random lookups per second  <br />
Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s  1137% cpu ...</code>
</blockquote>

<p>The second line of each result is the summary from the cluster status report for the duration of the query.  The interesting numbers are the KB/s and the %CPU.  The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%.  The point to note is that the symmetrical approach takes more than 4x less real time (7.7 s against 35.5 s) with under half the data transfer rate.  Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation.</p>
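
<p>The ratios can be checked directly from the figures in the two result blocks.  The short Python calculation below uses only those figures.  The total transfer volume it derives assumes the reported KB/s held steady over the measured run time, so treat that one as an approximation.</p>

```python
# Figures copied from the two result blocks above.
RUNS = {
    "centralized": {"msec": 35543, "lookups_per_s": 728634,
                    "kb_per_s": 39085, "cpu_pct": 316},
    "symmetrical": {"msec": 7706, "lookups_per_s": 3360740,
                    "kb_per_s": 16983, "cpu_pct": 1137},
}


def derived(run):
    """Totals implied by the reported rates and the run time."""
    seconds = run["msec"] / 1000.0
    return {
        "total_lookups": run["lookups_per_s"] * seconds,
        "total_kb": run["kb_per_s"] * seconds,  # assumes a steady rate
        "busy_cores": run["cpu_pct"] / 100.0,   # 100% = one busy core
    }


cen = derived(RUNS["centralized"])
sym = derived(RUNS["symmetrical"])
speedup = RUNS["centralized"]["msec"] / RUNS["symmetrical"]["msec"]
rate_ratio = RUNS["symmetrical"]["kb_per_s"] / RUNS["centralized"]["kb_per_s"]
volume_ratio = sym["total_kb"] / cen["total_kb"]

if __name__ == "__main__":
    print(f"speedup in real time:    {speedup:.2f}x")
    print(f"transfer rate ratio:     {rate_ratio:.2f}")
    print(f"transfer volume ratio:   {volume_ratio:.2f}")
    print(f"lookups, centralized:    {cen['total_lookups']:,.0f}")
    print(f"lookups, symmetrical:    {sym['total_lookups']:,.0f}")
    print(f"busy cores (cen / sym):  {cen['busy_cores']} / {sym['busy_cores']}")
```

<p>The speedup comes out at roughly 4.6x and the transfer rate ratio at about 0.43.  Both runs work out to about 25.9 million lookups in total, which is a useful sanity check that the two patterns did the same amount of work.  Because the symmetrical run is also much shorter, its total transfer volume is on the order of a tenth of the centralized one, under the steady-rate assumption.</p>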

<p>These figures represent the best and worst cases of distributed <code>JOIN</code>ing.  If we have a straight sequence of <code>JOIN</code>s with single-pattern optionals and existence tests, and the order in which results are produced is not significant (i.e., there is an aggregation, an existence test, or an <code>ORDER BY</code>), the symmetrical pattern is applicable.  On the other hand, if there are multiple-triple-pattern optionals, complex sub-queries, <code>DISTINCT</code>s in the middle of the query, or results that have to be produced in the order of an index, then the centralized approach must be used at least part of the time.</p>

<p>Also, if we must make transitive closures, which can be thought of as an extension of a <code>DISTINCT</code> in a subquery, we must pass the data through a single point before moving the bindings to the next <code>JOIN</code> in the sequence. This happens, for example, when resolving <code><a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x14e1a160">owl</a>:sameAs</code> at run time.  However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, <code>DISTINCT</code>s, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens of thousands of variable bindings into a single message exchange.  And if there are not that many intermediate results, then single query execution time is not a problem anyhow.</p>

<p>For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the <code>foaf:knows</code> relations and then running the <code>foaf:knows</code> relations through the hash.  If the hash size is right, a hash lookup is somewhat better than an index lookup.  The problem is that when the hash join is not the right solution, it is an expensive mistake:  the best case is good; the worst case is very bad. If there is no index, though, a hash join is better than nothing.  One problem with hash joins is that they make temporary data structures which, if large, will skew the working set.  One must be quite sure of the cardinality before it is safe to try a hash join.  So we do not do hash joins with RDF, but we do use them sometimes with relational data. </p>
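
<p>For comparison, a partitioned hash join for the same query could look like the sketch below.  As noted above, this is not what we run for RDF; the function names, the CRC-based partitioning, and the use of Python sets as hash tables are assumptions made for illustration, and graphs are again ignored.</p>

```python
import zlib


def partition_of(key, n):
    """Partition for a (subject, object) join key (hypothetical rule)."""
    return zlib.crc32(repr(key).encode("utf-8")) % n


def partitioned_hash_join_count(knows, n_partitions=4):
    """Count (s, o) pairs whose reverse (o, s) is also present."""
    pairs = list(knows)

    # Build phase: one temporary hash table per partition, keyed on (s, o).
    # These tables are the structures that can skew the working set.
    tables = [set() for _ in range(n_partitions)]
    for s, o in pairs:
        tables[partition_of((s, o), n_partitions)].add((s, o))

    # Probe phase: send each reversed pair to the partition that would
    # hold it and test for membership there.
    found = 0
    for s, o in pairs:
        key = (o, s)
        if key in tables[partition_of(key, n_partitions)]:
            found += 1
    return found


if __name__ == "__main__":
    sample = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
    print(partitioned_hash_join_count(sample))
```

<p>The build phase makes the cost explicit: every <code>foaf:knows</code> pair is copied into a temporary table before the first probe, which pays off only when the cardinality is known well enough to size the tables.</p>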

<p>These same methods apply to relational data just as well.  This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-27#1487">
  <rss:title>An Example of RDF Scalability</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-27T11:23:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We hear it to exhaustion, where is RDF scalability? We have been suggesting for a while that this is a solved question. I will here give some concrete numbers to back this. The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task. Below we show how we do 3.3 million random triple lookups per second on two 8 core commodity servers producing complete results, joining across partitions. On a single 4 core server, the figure is about 1 million lookups per second. With a single thread, it is about 250K lookups per second. This is the good case. But even our worse case is quite decent. We took a simple SPARQL query, counting how many people say they reciprocally know each other. In the Billion Triples Challenge data set, there are 25M foaf:knows quads of which 92K are reciprocal. Reciprocal here means that when x knows y in some graph, y knows x in the same or any other graph. SELECT COUNT (*) WHERE { ?p1 foaf:knows ?p2 . ?p2 foaf:knows ?p1 } There is no guarantee that the triple of x knows y is in the same partition as the triple y knows x. Thus the join is randomly distributed, n partitions to n partitions. We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking. Since then, we have corrected this. If run on a single thread, this query would be a loop over all the quads with a predicate of foaf:knows, and an inner loop looking for a quad with 3 of 4 fields given (SPO). If we have a partitioned situation, we have a loop over all the foaf:knows quads in each partition, and an inner lookup looking for the reciprocal foaf:knows quad in whatever partition it may be found. 
We have implemented this with two different message patterns: Centralized: One process reads all the foaf:knows quads from all processes. Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad. Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end. Symmetrical: Each process reads the foaf:knows quads in its partition, and sends a batch of checks to each process that could have the reciprocal foaf:knows quad every 50K quads. At the end, the counts are gathered from all partitions. There is some additional control traffic but we do not go into its details here. Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 Virtuoso instances. The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache. Centralized: 35,543 msec, 728,634 sequential + random lookups per second Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s 316% cpu ... Symmetrical: 7706 msec, 3,360,740 sequential + random lookups per second Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s 1137% cpu ... The second line is the summary from the cluster status report for the duration of the query. The interesting numbers are the KB/s and the %CPU. The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%. The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate. Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation. These figures represent the best and worst cases of distributed JOINing. 
If we have a straight sequence of JOINs, with single pattern optionals and existences and the order in which results are produced is not significant (i.e., there is aggregation, existence test, or ORDER BY), the symmetrical pattern is applicable. On the other hand, if there are multiple triple pattern optionals, complex sub-queries, DISTINCTs in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time. Also, if we must make transitive closures, which can be thought of as an extension of a DISTINCT in a subquery, we must pass the data through a single point before moving the bindings to the next JOIN in the sequence. This happens for example in resolving owl:sameAs at run time. However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, DISTINCTs, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens-of-thousands of variable bindings into a single message exchange. And if there are not that many intermediate results, then single query execution time is not a problem anyhow. For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the foaf:knows relations and then running the foaf:knows relations through the hash. If the hash size is right, a hash lookup is somewhat better than an index lookup. The problem is that when the hash join is not the right solution, it is an expensive mistake: the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing. One problem of hash joins is that they make temporary data structures which, if large, will skew the working set. One must be quite sure of the cardinality before it is safe to try a hash join. 
So we do not do hash joins with RDF, but we do use them sometimes with relational data. These same methods apply to relational data just as well. This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We hear it to exhaustion: where is <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1eab4128">RDF</a> scalability?  We have been suggesting for a while that this is a solved question.  Here I will give some concrete numbers to back this up.</p>

<p>The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task.</p>

<p>Below we show how we do 3.3 million random triple lookups per second on two 8-core commodity servers producing complete results, joining across partitions. On a single 4-core server, the figure is about 1 million lookups per second.  With a single thread, it is about 250K lookups per second.  This is the good case.  But even our worst case is quite decent.</p>

<p>We took a simple <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x15cb3da8">SPARQL</a> query, counting how many people say they reciprocally know each other.  In the <a href="http://challenge.semanticweb.org/" id="link-id0x1bfb7a00">Billion Triples Challenge</a> <a href="http://dbpedia.org/resource/Data" id="link-id0xa57187d8">data</a> set, there are 25M <code>foaf:knows</code> quads of which 92K are reciprocal. <i>Reciprocal</i> here means that when x knows y in some graph, y knows x in the same or any other graph.</p>

<pre>SELECT COUNT (*) 
WHERE { 
         ?p1  foaf:knows  ?p2  . 
         ?p2  foaf:knows  ?p1 
      }</pre>

<p>There is no guarantee that the triple <code>x knows y</code> is in the same partition as the triple <code>y knows x</code>.  Thus the join is randomly distributed, n partitions to n partitions.</p>

<p>We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking.  Since then, we have corrected this.</p>

<p>If run on a single thread, this query would be a loop over all the quads with a predicate of <code>foaf:knows</code>, and an inner loop looking for a quad with 3 of 4 fields given (<code>SPO</code>). If we have a partitioned situation, we have a loop over all the <code>foaf:knows</code> quads in each partition, and an inner lookup looking for the reciprocal <code>foaf:knows</code> quad in whatever partition it may be found.</p>
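<p>As a minimal sketch of the single-threaded plan just described, the Python below loops over all <code>foaf:knows</code> quads and probes a set that stands in for an index with subject, predicate, and object given. The function name, the <code>(s, p, o, g)</code> tuple layout, and the sample data are illustrative assumptions, not Virtuoso internals.</p>

<pre>FOAF_KNOWS = "foaf:knows"

def count_reciprocal(quads):
    """Count foaf:knows quads whose reverse edge exists in any graph."""
    # Stand-in for an index keyed on (S, P, O); the graph is left open.
    spo_index = {(s, p, o) for (s, p, o, g) in quads}
    count = 0
    for (s, p, o, g) in quads:          # outer loop: scan the quads
        if p != FOAF_KNOWS:
            continue
        if (o, p, s) in spo_index:      # inner lookup: 3 of 4 fields given
            count += 1
    return count

# Hypothetical sample data: one reciprocal pair spread over two graphs.
quads = [
    ("alice", FOAF_KNOWS, "bob", "g1"),
    ("bob", FOAF_KNOWS, "alice", "g2"),
    ("alice", FOAF_KNOWS, "carol", "g1"),
]
print(count_reciprocal(quads))          # prints 2</pre>

<p>Both directions of the alice/bob pair are counted, which mirrors how the query counts one solution per matching <code>?p1</code>, <code>?p2</code> binding.</p>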

<p>We have implemented this with two different message patterns: </p>

<ol>
 <li>
  <p>
    <b>Centralized:</b> One process reads all the <code>foaf:knows</code> quads from all processes.  Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad.  Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end.</p>
 </li>

<li>
  <p>
    <b>Symmetrical:</b> Each process reads the <code>foaf:knows</code> quads in its own partition and, every 50K quads, sends a batch of checks to each process that could hold the reciprocal <code>foaf:knows</code> quad.  At the end, the counts are gathered from all partitions.  There is some additional control traffic but we do not go into its details here.</p>
</li>
</ol>
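<p>The symmetrical pattern can be sketched in a single process as follows. This is a simulation under stated assumptions, not the actual cluster code: partitions are plain Python sets, placement is a CRC32 hash of the subject, and the names <code>partition_of</code>, <code>send_batches</code>, and <code>symmetrical_count</code> are invented for illustration. Only the batch size of 50K comes from the description above.</p>

<pre>import zlib
from collections import defaultdict

def partition_of(subject, n_partitions):
    # Assumption: a quad lives in the partition chosen by hashing its subject.
    return zlib.crc32(subject.encode("utf-8")) % n_partitions

def send_batches(outbox, local, found):
    """Deliver one message per target partition; return the messages sent."""
    sent = 0
    for target, checks in outbox.items():
        # The target partition counts the reciprocal quads it holds.
        found[target] += sum(1 for check in checks if check in local[target])
        sent += 1
    outbox.clear()
    return sent

def symmetrical_count(edges, n_partitions, batch_size=50_000):
    """edges: iterable of (x, y) pairs meaning 'x foaf:knows y'."""
    local = [set() for _ in range(n_partitions)]
    for x, y in edges:
        local[partition_of(x, n_partitions)].add((x, y))

    found = [0] * n_partitions
    messages = 0
    for source in range(n_partitions):      # every process scans its own quads
        outbox = defaultdict(list)
        scanned = 0
        for x, y in local[source]:
            # The reciprocal quad (y, x) can only be where y hashes to.
            outbox[partition_of(y, n_partitions)].append((y, x))
            scanned += 1
            if scanned % batch_size == 0:
                messages += send_batches(outbox, local, found)
        messages += send_batches(outbox, local, found)  # final partial batch
    return sum(found), messages             # counts gathered at the end

edges = [("alice", "bob"), ("bob", "alice"), ("alice", "carol")]
total, messages = symmetrical_count(edges, n_partitions=4)
print(total, messages)</pre>

<p>The centralized variant would differ only in that a single coordinator performs the scan and fills the outbox, so all check traffic leaves through one process.</p>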

<p>Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 <a href="http://virtuoso.openlinksw.com" id="link-id0x1c0c94a8">Virtuoso</a> instances.  The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache.</p>

<blockquote>
<code>Centralized:  35,543 msec,  728,634 sequential + random lookups per second <br />
Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s  316% cpu ...
 <br /> <br />
Symmetrical:  7706 msec, 3,360,740 sequential + random lookups per second  <br />
Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s  1137% cpu ...</code>
</blockquote>

<p>The second line is the summary from the cluster status report for the duration of the query.  The interesting numbers are the KB/s and the %CPU.  The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%.  The point to note is that the symmetrical approach takes about 4.6x less real time (7.7 s against 35.5 s) with under half the data transfer rate.  Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation.</p>
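<p>The ratios can be checked directly from the reported figures. The short calculation below uses only the numbers in the status output above; the variable names are ours. It also confirms that both runs performed the same amount of work, about 25.9 million lookups.</p>

<pre>centralized = {"msec": 35_543, "rate": 728_634, "kb_per_s": 39_085, "cpu_pct": 316}
symmetrical = {"msec": 7_706, "rate": 3_360_740, "kb_per_s": 16_983, "cpu_pct": 1137}

speedup = centralized["msec"] / symmetrical["msec"]
transfer_ratio = symmetrical["kb_per_s"] / centralized["kb_per_s"]
busy_cores = symmetrical["cpu_pct"] / 100    # 100% equals one busy core

# Total lookups implied by each reported rate and elapsed time.
lookups_central = centralized["rate"] * centralized["msec"] / 1000
lookups_symmetric = symmetrical["rate"] * symmetrical["msec"] / 1000

print(f"speedup: {speedup:.2f}x")
print(f"transfer rate ratio: {transfer_ratio:.2f}")
print(f"busy cores, symmetrical: {busy_cores:.1f} of 16")
print(f"lookups: {lookups_central:,.0f} vs {lookups_symmetric:,.0f}")</pre>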

<p>These figures represent the best and worst cases of distributed <code>JOIN</code>ing.  If we have a straight sequence of <code>JOIN</code>s, with single pattern optionals and existences and the order in which results are produced is not significant (i.e., there is aggregation, existence test, or <code>ORDER BY</code>), the symmetrical pattern is applicable.  On the other hand, if there are multiple triple pattern optionals, complex sub-queries, <code>DISTINCT</code>s in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time.</p>

<p>Also, if we must make transitive closures, which can be thought of as an extension of a <code>DISTINCT</code> in a subquery, we must pass the data through a single point before moving the bindings to the next <code>JOIN</code> in the sequence. This happens for example in resolving <code><a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x28005280">owl</a>:sameAs</code> at run time.  However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, <code>DISTINCT</code>s, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens-of-thousands of variable bindings into a single message exchange.  And if there are not that many intermediate results, then single query execution time is not a problem anyhow.</p>

<p>For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the <code>foaf:knows</code> relations and then running the <code>foaf:knows</code> relations through the hash.  If the hash size is right, a hash lookup is somewhat better than an index lookup.  The problem is that when the hash join is not the right solution, it is an expensive mistake:  the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing.  One problem of hash joins is that they make temporary data structures which, if large, will skew the working set.  One must be quite sure of the cardinality before it is safe to try a hash join.  So we do not do hash joins with RDF, but we do use them sometimes with relational data. </p>
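<p>For illustration only, here is what the hash join alternative looks like in its simplest, unpartitioned, in-memory form: a build phase fills a hash from the <code>foaf:knows</code> relation and a probe phase runs the same relation through it. The function name and sample data are assumptions. Note that the build table holds every edge, which is exactly the temporary structure that can skew the working set when cardinality is misjudged.</p>

<pre>def hash_join_reciprocal(edges):
    """edges: list of (x, y) pairs meaning 'x foaf:knows y'."""
    # Build phase: fill the hash from the foaf:knows relation.
    build = {}
    for x, y in edges:
        build.setdefault(x, set()).add(y)
    # Probe phase: run the same relation through the hash.
    count = 0
    for x, y in edges:
        if x in build.get(y, ()):       # does y know x?
            count += 1
    return count

print(hash_join_reciprocal([("alice", "bob"), ("bob", "alice"), ("alice", "carol")]))</pre>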

<p>These same methods apply to relational data just as well.  This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-20#1485">
  <rss:title>Virtuoso Vs. MySQL:  Setting the Berlin Record Straight (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-20T11:06:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of the Berlin SPARQL Benchmark, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the BSBM paper had 3 seconds for MySQL and 100 for Virtuoso, basically through ignoring cache effects. So we decided to do it ourselves. The score is (updated with revised innodb_buffer_pool_size setting, based on advice noted down below): n-clients Virtuoso MySQL (with increased buffer pool size) MySQL (with default buffer poll size) 1 41,161.33 27,023.11 12,171.41 4 127,918.30 (pending) 37,566.82 8 218,162.29 105,524.23 51,104.39 16 214,763.58 98,852.42 47,589.18 The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here. The benchmark is pure SQL, nothing to do with SPARQL or RDF. The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux. The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran an analyze table on each MySQL table but noticed that this had no effect. Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests. The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine -- Virtuoso configuration file MySQL configuration file Table definitions &amp; RDF views Indexes on MySQL tables MySQL ought to do better. 
We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such. Update: Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache. On the first run, we noticed that the Innodb warm up time was somewhere well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run larger scale system for half an hour first and then measure and then measure again. If the second measurement is the same as the first then it is good. As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa322b58">Berlin SPARQL Benchmark</a>, I have repeatedly written about measurement procedures and steady state.  The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup.  Thus it came to pass that one cut of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x9524730">BSBM</a> paper had 3 seconds for <a href="http://dbpedia.org/resource/MySQL" id="link-id0x2ba8db0">MySQL</a> and 100 for <a href="http://virtuoso.openlinksw.com" id="link-id0xa9137d0">Virtuoso</a>, basically through ignoring cache effects.</p>

<p>So we decided to do it ourselves.</p>

<p>The score is (updated with revised <code>innodb_buffer_pool_size</code> setting, based on advice noted down below):</p>

<table border="1" cellspacing="2" cellpadding="5">
<tr>
    <th>n-clients</th>
    <th>Virtuoso</th>
    <th>MySQL <br /> (with increased buffer pool size)</th>
    <th>MySQL <br /> (with default buffer pool size)</th>
  </tr>
<tr align="right">
    <td>1</td>
    <td> 41,161.33</td>
    <td> 27,023.11 </td>
    <td> 12,171.41</td>
  </tr>
<tr align="right">
    <td>4</td>
    <td> 127,918.30</td>
    <td> (pending) </td>
    <td>  37,566.82</td>
  </tr>
<tr align="right">
    <td>8</td>
    <td> 218,162.29 </td>
    <td> 105,524.23 </td>
    <td>  51,104.39 </td>
  </tr>
<tr align="right">
    <td>16</td>
    <td> 214,763.58 </td>
    <td>  98,852.42 </td>
    <td>  47,589.18 </td>
  </tr>
</table>


<p>The metric is the query mixes per hour from the BSBM test driver output.  For the interested, the complete output is <a href="http://www.openlinksw.com/weblog/oerling/texts/bsbmres.txt" id="link-id1119f770">here</a>.</p>
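<p>To read the table as relative throughput, one can divide the Virtuoso column by each MySQL column. The snippet below does this with the figures from the table; the dictionary keys and the <code>speed_ratio</code> helper are our own naming, and the pending cell is kept as missing rather than guessed.</p>

<pre># Query mixes per hour from the table above; None marks the pending cell.
RESULTS = {
    1: {"virtuoso": 41_161.33, "mysql_tuned": 27_023.11, "mysql_default": 12_171.41},
    4: {"virtuoso": 127_918.30, "mysql_tuned": None, "mysql_default": 37_566.82},
    8: {"virtuoso": 218_162.29, "mysql_tuned": 105_524.23, "mysql_default": 51_104.39},
    16: {"virtuoso": 214_763.58, "mysql_tuned": 98_852.42, "mysql_default": 47_589.18},
}

def speed_ratio(clients, baseline):
    """Virtuoso throughput divided by the chosen MySQL column, or None."""
    row = RESULTS[clients]
    if row[baseline] is None:
        return None
    return row["virtuoso"] / row[baseline]

for clients in sorted(RESULTS):
    tuned = speed_ratio(clients, "mysql_tuned")
    default = speed_ratio(clients, "mysql_default")
    tuned_text = "pending" if tuned is None else f"{tuned:.2f}x"
    print(f"{clients:2d} clients: {tuned_text} vs tuned, {default:.2f}x vs default")</pre>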

<p>The benchmark is pure <a href="http://dbpedia.org/resource/SQL" id="link-id0x2b61c88">SQL</a>, nothing to do with <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x17a6d408">SPARQL</a> or <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x9a0a968">RDF</a>.</p>

<p>The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM.  The OS is 64-bit Debian Linux.</p>

<p>The benchmark was run at a scale of 200,000.  Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like.  Both databases were configured to use 8G for disk cache.  The test effectively runs from memory.  We ran <code>ANALYZE TABLE</code> on each MySQL table but noticed that this had no effect.  Virtuoso does the stats sampling on the go; possibly MySQL does too, since the explicit stats did not make any difference.  The MySQL tables were served by the InnoDB engine.  MySQL appears to cache results of queries in some cases.  This was not apparent in the tests.</p>

<p>The versions are 5.09 for Virtuoso and 5.1.29 for MySQL.  You can download and examine:</p>
<ul> 
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/virtuoso.ini" id="link-id14fe17f0">Virtuoso configuration file</a>
</li>
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/my.cnf" id="link-id116fe490">MySQL configuration file</a>
</li>
<li>
    <a href="http://www.openlinksw.com/weblog/oerling/texts/create_tables_and_rdf_view.sql" id="link-id14ce9268">Table definitions &amp; RDF views</a> 
</li>
<li> <a href="http://www.openlinksw.com/weblog/oerling/texts/mysqlinx.sql" id="link-id1535e298">Indexes on MySQL tables</a>
</li>
</ul>

<p>
<strike>MySQL ought to do better.  We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL.  It is possible there is a config parameter that affects this.  The public is invited to tell us about such.</strike>
</p>

<p>
<b>Update:</b>
</p>

<p>Andreas Schultz of the BSBM team advised us to increase the <code>innodb_buffer_pool_size</code> setting in the MySQL config.  We did and it produced some improvement.  Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better.  We ran these things a few times so as to have warm cache.</p>

<p>On the first run, we noticed that the InnoDB warm-up time was somewhere well in excess of 2000 query mixes.  Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso.  We recently made a greedy prefetch hack that should give us some mileage there.  For the next BSBM, all we can advise is to run a larger-scale system for half an hour first, then measure, and then measure again.  If the second measurement is the same as the first, then it is good.</p>
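<p>The "measure, then measure again" advice can be turned into a simple check. The sketch below is one possible reading: two consecutive throughput readings count as steady state when they agree within a relative tolerance. The 5% default and the sample readings are assumptions made for illustration; they are not part of BSBM or of our test runs.</p>

<pre>import math

def is_steady_state(first_qmph, second_qmph, rel_tol=0.05):
    """True when two consecutive query-mixes-per-hour readings agree."""
    return math.isclose(first_qmph, second_qmph, rel_tol=rel_tol)

# Hypothetical readings: a warmed-up system, then one still warming up.
print(is_steady_state(100_000.0, 98_500.0))
print(is_steady_state(40_000.0, 98_500.0))</pre>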

<p>As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x17a20498">TPC-H</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484">
  <rss:title>Virtuoso Vs. MySQL:  Setting the Berlin Record Straight (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-20T11:06:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of the Berlin SPARQL Benchmark, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the BSBM paper had 3 seconds for MySQL and 100 for Virtuoso, basically through ignoring cache effects. So we decided to do it ourselves. The score is (updated with revised innodb_buffer_pool_size setting, based on advice noted down below): n-clients Virtuoso MySQL (with increased buffer pool size) MySQL (with default buffer poll size) 1 41,161.33 27,023.11 12,171.41 4 127,918.30 (pending) 37,566.82 8 218,162.29 105,524.23 51,104.39 16 214,763.58 98,852.42 47,589.18 The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here. The benchmark is pure SQL, nothing to do with SPARQL or RDF. The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux. The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran an analyze table on each MySQL table but noticed that this had no effect. Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests. The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine -- Virtuoso configuration file MySQL configuration file Table definitions &amp; RDF views Indexes on MySQL tables MySQL ought to do better. 
We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such. Update: Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache. On the first run, we noticed that the Innodb warm up time was somewhere well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run larger scale system for half an hour first and then measure and then measure again. If the second measurement is the same as the first then it is good. As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa5314d8">Berlin SPARQL Benchmark</a>, I have repeatedly written about measurement procedures and steady state.  The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup.  Thus it came to pass that one cut of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x18482c20">BSBM</a> paper had 3 seconds for <a href="http://dbpedia.org/resource/MySQL" id="link-id0xb8c54de8">MySQL</a> and 100 for <a href="http://virtuoso.openlinksw.com" id="link-id0x189b2210">Virtuoso</a>, basically through ignoring cache effects.</p>

<p>So we decided to do it ourselves.</p>

<p>The score is (updated with revised <code>innodb_buffer_pool_size</code> setting, based on advice noted down below):</p>

<table border="1" cellspacing="2" cellpadding="5">
<tr>
    <th>n-clients</th>
    <th>Virtuoso</th>
    <th>MySQL <br /> (with increased buffer pool size)</th>
    <th>MySQL <br /> (with default buffer pool size)</th>
  </tr>
<tr align="right">
    <td>1</td>
    <td> 41,161.33</td>
    <td> 27,023.11 </td>
    <td> 12,171.41</td>
  </tr>
<tr align="right">
    <td>4</td>
    <td> 127,918.30</td>
    <td> (pending) </td>
    <td>  37,566.82</td>
  </tr>
<tr align="right">
    <td>8</td>
    <td> 218,162.29 </td>
    <td> 105,524.23 </td>
    <td>  51,104.39 </td>
  </tr>
<tr align="right">
    <td>16</td>
    <td> 214,763.58 </td>
    <td>  98,852.42 </td>
    <td>  47,589.18 </td>
  </tr>
</table>


<p>The metric is the query mixes per hour from the BSBM test driver output.  For the interested, the complete output is <a href="http://www.openlinksw.com/weblog/oerling/texts/bsbmres.txt" id="link-id1119f770">here</a>.</p>

<p>The benchmark is pure <a href="http://dbpedia.org/resource/SQL" id="link-id0x5257718">SQL</a>, nothing to do with <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb8c463e0">SPARQL</a> or <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x16e68d50">RDF</a>.</p>

<p>The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM.  The OS is 64-bit Debian Linux.</p>

<p>The benchmark was run at a scale of 200,000.  Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like.  Both databases were configured to use 8G for disk cache.  The test effectively runs from memory.  We ran <code>ANALYZE TABLE</code> on each MySQL table but noticed that this had no effect.  Virtuoso does the stats sampling on the go; possibly MySQL does too, since the explicit stats did not make any difference.  The MySQL tables were served by the InnoDB engine.  MySQL appears to cache results of queries in some cases.  This was not apparent in the tests.</p>

<p>The versions are 5.09 for Virtuoso and 5.1.29 for MySQL.  You can download and examine:</p>
<ul> 
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/virtuoso.ini" id="link-id14fe17f0">Virtuoso configuration file</a>
</li>
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/my.cnf" id="link-id116fe490">MySQL configuration file</a>
</li>
<li>
    <a href="http://www.openlinksw.com/weblog/oerling/texts/create_tables_and_rdf_view.sql" id="link-id14ce9268">Table definitions &amp; RDF views</a> 
</li>
<li> <a href="http://www.openlinksw.com/weblog/oerling/texts/mysqlinx.sql" id="link-id1535e298">Indexes on MySQL tables</a>
</li>
</ul>

<p>
<strike>MySQL ought to do better.  We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL.  It is possible there is a config parameter that affects this.  The public is invited to tell us about such.</strike>
</p>

<p>
<b>Update:</b>
</p>

<p>Andreas Schultz of the BSBM team advised us to increase the <code>innodb_buffer_pool_size</code> setting in the MySQL config.  We did and it produced some improvement.  Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better.  We ran these things a few times so as to have warm cache.</p>

<p>On the first run, we noticed that the InnoDB warm-up time was somewhere well in excess of 2000 query mixes.  Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso.  We recently made a greedy prefetch hack that should give us some mileage there.  For the next BSBM, all we can advise is to run a larger-scale system for half an hour first, then measure, and then measure again.  If the second measurement is the same as the first, then it is good.</p>

<p>As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x122eaa00">TPC-H</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1481">
  <rss:title>ISWC 2008: Some Questions</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T15:54:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Inference: Is it always forward chaining? We got a number of questions about Virtuoso&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment. Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine. For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed. We had a word with Ian Horrocks on the question. He noted that it is often naive on behalf of the community to tend to equate description of semantics with description of algorithm. The data need not always be blown up. The advantage of not always materializing is that the working set stays better. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. 
This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailment subclasses and sub-properties are materialized rather than done at run time. Genetic Algorithms for SPARQL? Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al, for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us. Schema Last? It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge. There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware. It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Inference: Is it always forward chaining?</h2>

<p>We got a number of questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x13c64b60">Virtuoso</a>&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.</p>

<p>Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x14d17778">SPARQL</a>, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x128e55d0">owl</a>:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.</p>

<p>For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our <a href="http://dbpedia.org/resource/SQL" id="link-id0x12614770">SQL</a>/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.</p>

<p>We had a word with <a href="http://web.comlab.ox.ac.uk/people/Ian.Horrocks/" id="link-id117c99d0">Ian Horrocks</a> on the question. He noted that it is often naive on the part of the community to equate a description of semantics with a description of an algorithm. The <a href="http://dbpedia.org/resource/Data" id="link-id0x145b2980">data</a> need not always be blown up.</p>

<p>The advantage of not always materializing is that the working set stays smaller. Once the working set no longer fits in memory, response times jump disproportionately. Also, if the data changes, is retracted, or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious <code>sameAs</code> statement. This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if subclass and sub-property entailment is materialized rather than resolved at run time.</p>
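<p>The trade-off can be made concrete with a toy example. The sketch below answers a class-membership question in two ways: by walking the subclass hierarchy at query time (backward chaining) and by storing every entailed type statement up front (materialization). It is plain Python over hypothetical data, with invented class and function names, and says nothing about how the engine implements either approach.</p>

<pre>def superclasses(cls, subclass_of):
    """All classes reachable from cls via subclass links, including cls."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def instances_at_run_time(target, types, subclass_of):
    """Backward chaining: expand the hierarchy while answering the query."""
    return {s for s, cls in types if target in superclasses(cls, subclass_of)}

def materialize(types, subclass_of):
    """Forward chaining: store every entailed type statement."""
    return {(s, sup) for s, cls in types for sup in superclasses(cls, subclass_of)}

# Hypothetical schema and data.
subclass_of = {"GraduateStudent": ["Student"], "Student": ["Person"]}
types = {("ann", "GraduateStudent"), ("bo", "Student")}

run_time = instances_at_run_time("Person", types, subclass_of)
stored = materialize(types, subclass_of)
print(sorted(run_time))
print(len(types), len(stored))</pre>

<p>Both routes give the same answer, but the materialized set holds five statements where the asserted data holds two, and every one of them would have to be found and removed again if a subclass link were retracted.</p>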

<h2>Genetic Algorithms for SPARQL?</h2>

<p>Our compliments for the wildest idea of the conference go to <a href="http://www.eyaloren.org/" id="link-id1a203af8">Eyal Oren</a>, <a href="http://www.few.vu.nl/~cgueret/" id="link-id16208758">Christophe Guéret</a>, and <a href="http://www.few.vu.nl/~schlobac/" id="link-id111923e0">Stefan Schlobach</a>, <i>et al</i>, for their <a href="http://www.informatik.uni-trier.de/~ley/db/conf/semweb/iswc2008.html#OrenGS08" id="link-id11793540">paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated</a>. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us.</p>

<h2>Schema Last?</h2>

<p>It is interesting to see how the industry comes to the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x12b57e90">semantic web</a> conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making logics that perform more predictably and are friendlier to databases. Thus the extremes converge.</p>

<p>There is a point to schema last. <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x12a8ff48">RDF</a> is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high-traffic online application would always benefit from having some application-specific data structures. These could also save considerably on hardware.</p>

<p>There is no sharp divide between RDF and a relational, application-oriented representation. We have the capabilities in our RDB-to-RDF mapping. We just need to show this and have SPARUL and data loading</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479">
  <rss:title>ISWC 2008: Some Questions</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T15:54:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Inference: Is it always forward chaining? We got a number of questions about Virtuoso&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment. Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine. For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed. We had a word with Ian Horrocks on the question. He noted that it is often naive on behalf of the community to tend to equate description of semantics with description of algorithm. The data need not always be blown up. The advantage of not always materializing is that the working set stays better. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. 
This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailment subclasses and sub-properties are materialized rather than done at run time. Genetic Algorithms for SPARQL? Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al, for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us. Schema Last? It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge. There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware. It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Inference: Is it always forward chaining?</h2>

<p>We got a number of questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x131604a8">Virtuoso</a>&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.</p>

<p>Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1422f910">SPARQL</a>, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x145894a8">owl</a>:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.</p>
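
<p>To make the contrast concrete, here is a minimal Python sketch of the two approaches for subclass entailment. It is only an illustration and not Virtuoso's implementation: the class names, the facts, and the function names are all invented for this example, and the class hierarchy is assumed to be acyclic with a single parent per class.</p>

<pre><code class="language-python"># Toy data; every name below is hypothetical and chosen for illustration.
SUBCLASS_OF = {
    "GraduateStudent": "Student",
    "Student": "Person",
}
TYPE_FACTS = {("alice", "GraduateStudent"), ("bob", "Student")}


def superclasses(cls):
    """Yield cls and every class reachable through subclass-of links."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def materialize(facts):
    """Forward chaining: store every entailed type statement up front."""
    return {(s, sup) for (s, c) in facts for sup in superclasses(c)}


def instances_at_query_time(facts, wanted):
    """Backward chaining: leave the stored facts alone and walk the
    class hierarchy while answering the query."""
    return {s for (s, c) in facts if wanted in superclasses(c)}


if __name__ == "__main__":
    print(sorted(materialize(TYPE_FACTS)))
    print(sorted(instances_at_query_time(TYPE_FACTS, "Person")))
</code></pre>

<p>Both routes give the same answer to &quot;who is a Person?&quot;. The difference is that the materialized set grows with the depth of the hierarchy, while the run-time version stores only the asserted facts and pays for the hierarchy walk on each query.</p>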

<p>For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our <a href="http://dbpedia.org/resource/SQL" id="link-id0x1458a288">SQL</a>/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.</p>

<p>We had a word with <a href="http://web.comlab.ox.ac.uk/people/Ian.Horrocks/" id="link-id117c99d0">Ian Horrocks</a> on the question. He noted that the community often naively equates a description of semantics with a description of an algorithm. The <a href="http://dbpedia.org/resource/Data" id="link-id0x14cf0b18">data</a> need not always be blown up.</p>

<p>The advantage of not always materializing is that the working set stays smaller. Once the working set no longer fits in memory, response times jump disproportionately. Also, if the data changes, is retracted, or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of derived statements that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if subclass and sub-property entailment is materialized rather than computed at run time.</p>

<h2>Genetic Algorithms for SPARQL?</h2>

<p>Our compliments for the wildest idea of the conference go to <a href="http://www.eyaloren.org/" id="link-id1a203af8">Eyal Oren</a>, <a href="http://www.few.vu.nl/~cgueret/" id="link-id16208758">Christophe Guéret</a>, and <a href="http://www.few.vu.nl/~schlobac/" id="link-id111923e0">Stefan Schlobach</a>, <i>et al</i>, for their <a href="http://www.informatik.uni-trier.de/~ley/db/conf/semweb/iswc2008.html#OrenGS08" id="link-id11793540">paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated</a>. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us.</p>

<h2>Schema Last?</h2>

<p>It is interesting to see how the industry comes to the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1154c1b0">semantic web</a> conferences talking about schema last, while at the same time the traditional semantic web people stress enforcing schema constraints and devising logics that perform more predictably and are friendlier to databases. Thus the extremes converge.</p>

<p>There is a point to schema last. <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x14c6a930">RDF</a> is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high-traffic online application would always benefit from having some application-specific data structures. These could also save considerably on hardware.</p>

<p>There is no sharp divide between RDF and a relational, application-oriented representation. We have the capabilities in our RDB-to-RDF mapping. We just need to show this and have SPARUL and data loading</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1477">
  <rss:title>ISWC 2008: RDB2RDF Face-to-Face</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T13:26:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The W3C&#39;s RDB-to-RDF mapping incubator group (RDB2RDF XG) met in Karlsruhe after ISWC 2008. The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time. Whose Identifiers? There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM&#39;s reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer. Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with &quot;same as&quot; assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems. XML or RDF? 
We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this. Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential. Interoperability We agreed with Sören Auer that we could make Virtuoso&#39;s mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults. There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. 
For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition. Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go. Related Web resources OpenLink Virtuoso: Open-Source Edition: Mapping SQL Data to RDF Virtuoso RDF Views – Getting Started Guide (PDF)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The W3C&#39;s RDB-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x153bdcf8">RDF</a> mapping incubator group (<a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x13e3e6b8">RDB2RDF XG</a>) met in <a href="http://dbpedia.org/resource/Karlsruhe" id="link-id0x15236b08">Karlsruhe</a> after <a href="http://iswc2008.semanticweb.org/" id="link-id0x2450fba8">ISWC 2008</a>.</p>

<p>The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x14c84338">SPARQL</a> to <a href="http://dbpedia.org/resource/SQL" id="link-id0x146db368">SQL</a>. There was a lot of agreement and the meeting even finished ahead of the allotted time.</p>

<h2>Whose Identifiers?</h2>

<p>There was discussion concerning using the <a href="http://dbpedia.org/resource/Entity" id="link-id0x12c15e58">Entity</a> Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM&#39;s reference cannot depend on any registration. Also, even when there is a stable <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x12b7b5c0">URI</a> for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.</p>

<p>Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with &quot;same as&quot; assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.</p>

<h2>
<a href="http://dbpedia.org/resource/XML" id="link-id0x14bf7da0">XML</a> or RDF?</h2>

<p>We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x1493ffc0">OWL</a> ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has <a href="http://dbpedia.org/resource/XPath" id="link-id0x1400fe98">XPath</a> in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the <a href="http://dbpedia.org/resource/Information" id="link-id0x14c846d8">information</a> schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.</p>

<p>Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of <a href="http://dbpedia.org/resource/Data" id="link-id0x14f6f128">data</a>, such a description of the exposed data is essential.</p>

<h2>Interoperability</h2>

<p>We agreed with <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x1e776730">Sören Auer</a> that we could make <a href="http://virtuoso.openlinksw.com" id="link-id0x1477ad18">Virtuoso</a>&#39;s mapping language compatible with <a href="http://triplify.org/" id="link-id0x15514388">Triplify</a>. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.</p>

<p>There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.</p>
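
<p>As a rough illustration of why the two styles are fairly equivalent, the Python sketch below states the same mapping both ways over a toy table. Nothing here is actual Virtuoso or Triplify syntax: the table, the example.org URIs, and the convention that the first selected column is the subject key are all assumptions made for this example.</p>

<pre><code class="language-python"># Hypothetical rows standing in for a relational table "customer".
ROWS = [
    {"id": 1, "name": "Acme", "city": "Karlsruhe"},
    {"id": 2, "name": "Globex", "city": "Berlin"},
]

BASE = "http://example.org/customer/"    # made-up subject prefix
SCHEMA = "http://example.org/schema/"    # made-up predicate prefix

# Style 1, triple first: each entry names a predicate and says which
# column of the table supplies the object.
TRIPLE_FIRST = [
    (SCHEMA + "name", "name"),
    (SCHEMA + "city", "city"),
]


def triples_from_pattern(rows):
    """Expand the triple-first declarations against every row."""
    return {
        (BASE + str(row["id"]), pred, row[col])
        for row in rows
        for (pred, col) in TRIPLE_FIRST
    }


# Style 2, query first: start from a query result and derive triples
# from its shape. The function below stands in for
#   SELECT id, name, city FROM customer
def query_first(rows):
    return [(row["id"], {"name": row["name"], "city": row["city"]}) for row in rows]


def triples_from_query(rows):
    """First column is the subject key; other column names become predicates."""
    return {
        (BASE + str(key), SCHEMA + col, value)
        for (key, cols) in query_first(rows)
        for (col, value) in cols.items()
    }


if __name__ == "__main__":
    print(triples_from_pattern(ROWS) == triples_from_query(ROWS))
</code></pre>

<p>In the query-first style the predicate names fall out of the column names, which is why some extra declarations with reasonable defaults would cover the rest. In the triple-first style each predicate is declared once, regardless of how many queries would touch it.</p>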

<p>Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.</p>

<h2>Related Web resources</h2>
<ul>
 <li>
  <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSSQL2RDF" id="link-id14e27040">OpenLink Virtuoso: Open-Source Edition: Mapping SQL Data to RDF</a>
 </li>
<li>
  <a href="http://virtuoso.openlinksw.com/Whitepapers/pdf/Virtuoso_SQL_to_RDF_Mapping.pdf" id="link-id1baad3a8">Virtuoso RDF Views – Getting Started Guide (PDF)</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1476">
  <rss:title>ISWC 2008: RDB2RDF Face-to-Face</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T13:26:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The W3C&#39;s RDB-to-RDF mapping incubator group (RDB2RDF XG) met in Karlsruhe after ISWC 2008. The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time. Whose Identifiers? There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM&#39;s reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer. Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with &quot;same as&quot; assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems. XML or RDF? 
We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this. Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential. Interoperability We agreed with Sören Auer that we could make Virtuoso&#39;s mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults. There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. 
For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition. Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go. Related Web resources OpenLink Virtuoso: Open-Source Edition: Mapping SQL Data to RDF Virtuoso RDF Views – Getting Started Guide (PDF)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The W3C&#39;s RDB-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x141f0470">RDF</a> mapping incubator group (<a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x13b8d018">RDB2RDF XG</a>) met in <a href="http://dbpedia.org/resource/Karlsruhe" id="link-id0x1e748060">Karlsruhe</a> after <a href="http://iswc2008.semanticweb.org/" id="link-id0x1eba8468">ISWC 2008</a>.</p>

<p>The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1e5abe10">SPARQL</a> to <a href="http://dbpedia.org/resource/SQL" id="link-id0x13930368">SQL</a>. There was a lot of agreement and the meeting even finished ahead of the allotted time.</p>

<h2>Whose Identifiers?</h2>

<p>There was discussion concerning using the <a href="http://dbpedia.org/resource/Entity" id="link-id0x15587978">Entity</a> Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM&#39;s reference cannot depend on any registration. Also, even when there is a stable <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x144660f8">URI</a> for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.</p>

<p>Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with &quot;same as&quot; assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.</p>

<h2>
<a href="http://dbpedia.org/resource/XML" id="link-id0x1edb8170">XML</a> or RDF?</h2>

<p>We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x2450fba8">OWL</a> ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has <a href="http://dbpedia.org/resource/XPath" id="link-id0x234e5478">XPath</a> in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the <a href="http://dbpedia.org/resource/Information" id="link-id0x22e129f8">information</a> schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.</p>

<p>Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of <a href="http://dbpedia.org/resource/Data" id="link-id0x155139c0">data</a>, such a description of the exposed data is essential.</p>

<h2>Interoperability</h2>

<p>We agreed with <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x132a64e0">Sören Auer</a> that we could make <a href="http://virtuoso.openlinksw.com" id="link-id0x1272c988">Virtuoso</a>&#39;s mapping language compatible with <a href="http://triplify.org/" id="link-id0x12622738">Triplify</a>. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.</p>

<p>There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.</p>

<p>Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.</p>

<h2>Related Web resources</h2>
<ul>
 <li>
  <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSSQL2RDF" id="link-id14e27040">OpenLink Virtuoso: Open-Source Edition: Mapping SQL Data to RDF</a>
 </li>
<li>
  <a href="http://virtuoso.openlinksw.com/Whitepapers/pdf/Virtuoso_SQL_to_RDF_Mapping.pdf" id="link-id1baad3a8">Virtuoso RDF Views – Getting Started Guide (PDF)</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-03#1473">
  <rss:title>ISWC 2008: The Scalable Knowledge Systems Workshop</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-03T13:16:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Mike Dean of BBN Technologies opened the Scalable Knowledge Systems Workshop with an invited talk. He reminded us of the facts of nature concerning the cost of distributed computing and running out of space for the working set. Developers in the semantic web field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can&#39;t join across partitions. I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible. Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files. Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that this will have, especially with a memory-mapped file, probably a better best-case than a B-tree but that this also will be less predictable with fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation. Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages. So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation. 
Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases. Regardless of steady state, BSBM has two generally valid conclusions: mapping relational to RDF, where possible, is faster than triple storage; and the equivalent relational solution can be some 10x faster than the pure triples representation. Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged; take the Billion Triples Challenge data set for one. Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL since the difference there is compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let&#39;s make a benchmark that shows mapping closer to native SQL. Bridging the 10x Gap? When we run Virtuoso relational against Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column. 
Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower. This is just a matter of the count of executed instructions. A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store&#39;s row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store. There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables. Scale, Where Next? Concerning the future direction of the workshop, there were a few directions suggested. 
One of the more interesting ones was Mike Dean&#39;s suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Mike Dean of <a href="http://dbpedia.org/resource/BBN_Technologies" id="link-id0x25699878">BBN Technologies</a> opened the Scalable <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1ed01750">Knowledge</a> Systems Workshop with an invited talk.  He reminded us of the facts of nature as concern the cost of distributed computing and running out of space for the working set. Developers in the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x21fbb9a8">semantic web</a> field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can&#39;t join across partitions.</p>

<p>I gave a talk about the <a href="http://virtuoso.openlinksw.com" id="link-id0x20b6e020">Virtuoso</a> Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing.  To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.</p>

<p>Further, Mike Dean talked about <a href="http://www.asio.bbn.com/" id="link-id0x222252f0">ASIO</a>, the BBN suite of semantic web tools.  His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files. </p>

<p>Will the <a href="http://dbpedia.org/resource/CODASYL" id="link-id0x222d8730">CODASYL</a> days come back, and will the linked list on disk be the way to store triples/quads?  I would say that this will have, especially with a memory-mapped file, probably a better best-case than a B-tree but that this also will be less predictable with fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries.  With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation.  Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages.  So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation.</p>

<p>Chris Bizer presented the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x22e41c40">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1c909960">BSBM</a>), which has already been discussed here in some detail.  He did acknowledge that the next round of the race must have a real steady-state rule.  This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.</p>

<p>Regardless of steady state, BSBM has two generally valid conclusions:
</p>
<ol>
<li>mapping relational to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x21d01890">RDF</a>, where possible, is faster than triple storage; and </li>
<li>the equivalent relational solution can be some 10x faster than the pure triples representation.</li>
</ol>

<p>Mike Dean asked whether BSBM was a case of a setup to have triple stores fail.  Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies.  Therefore it must have a workload where mapping makes sense.  Of course there are workloads where triples are unchallenged; take the <a href="http://challenge.semanticweb.org/" id="link-id0x1feb9250">Billion Triples Challenge</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x1fe12b60">data</a> set for one.</p>

<p>Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data.  Also, even if the scale is large, the working set is not nearly the size of the database.  This in fact penalizes mapping technologies against native <a href="http://dbpedia.org/resource/SQL" id="link-id0x1e275c88">SQL</a> since the difference there is compiling the query, especially since parameters are not used.  So, Chris, since we both like to map, let&#39;s make a benchmark that shows mapping closer to native SQL.</p>


<h2>Bridging the 10x Gap?</h2>

<p>When we run Virtuoso relational against Virtuoso triple store with the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x22046d88">TPC-H</a> workload, we see that the relational case is significantly faster.  These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times.  Why is this?  The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column.  Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower.  This is just a matter of the count of executed instructions.</p>

<p>A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, the needed lookup can often use the previous location as a starting point.  A triple store can do the same if queries have high locality.  We do this in some SQL situations and can try this with triples also.  The RDF workload is typically more random in its access pattern, though.  The other factor is the length of control path.  A column store has a simpler control flow if it knows that the column will have exactly one value per row.  With RDF, this is not a given. Also, the column store&#39;s row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.</p>

<p>There was some discussion on how much closer a triple store could come to a relational one.  Some gains are undoubtedly possible.  We will see.  For the ideal row store workload, the <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x22f837c0">RDBMS</a> will continue to have some edge.  Large online systems typically have a large part of the workload that is simple and repetitive.  There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere.  Some degree of application-specific data structure does make sense.  We just need to show how this is done.  In this way, we have a continuum and not an either/or choice of triples vs. tables.</p>
 
<h2>Scale, Where Next?</h2>

<p>Concerning the future direction of the workshop, there were a few directions suggested.  One of the more interesting ones was Mike Dean&#39;s suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical.  Of course, there is the question of scale.  This time, we were the only ones focusing on a parallel database with no restrictions on joining.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-03#1471">
  <rss:title>ISWC 2008: The Scalable Knowledge Systems Workshop</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-03T13:16:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Mike Dean of BBN Technologies opened the Scalable Knowledge Systems Workshop with an invited talk. He reminded us of the facts of nature as concern the cost of distributed computing and running out of space for the working set. Developers in the semantic web field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can&#39;t join across partitions. I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible. Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files. Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that this will have, especially with a memory-mapped file, probably a better best-case than a B-tree but that this also will be less predictable with fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation. Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages. So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation. 
Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases. Regardless of steady state, BSBM has two generally valid conclusions: mapping relational to RDF, where possible, is faster than triple storage; and the equivalent relational solution can be some 10x faster than the pure triples representation. Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged; take the Billion Triples Challenge data set for one. Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL since the difference there is compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let&#39;s make a benchmark that shows mapping closer to native SQL. Bridging the 10x Gap? When we run Virtuoso relational against Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column. 
Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower. This is just a matter of the count of executed instructions. A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store&#39;s row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store. There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables. Scale, Where Next? Concerning the future direction of the workshop, there were a few directions suggested. 
One of the more interesting ones was Mike Dean&#39;s suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Mike Dean of <a href="http://dbpedia.org/resource/BBN_Technologies" id="link-id0x21d04768">BBN Technologies</a> opened the Scalable <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x22348c58">Knowledge</a> Systems Workshop with an invited talk.  He reminded us of the facts of nature as concern the cost of distributed computing and running out of space for the working set. Developers in the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x22570328">semantic web</a> field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can&#39;t join across partitions.</p>

<p>I gave a talk about the <a href="http://virtuoso.openlinksw.com" id="link-id0x23f313f0">Virtuoso</a> Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing.  To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.</p>

<p>Further, Mike Dean talked about <a href="http://www.asio.bbn.com/" id="link-id0x1d74c108">ASIO</a>, the BBN suite of semantic web tools.  His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files. </p>

<p>Will the <a href="http://dbpedia.org/resource/CODASYL" id="link-id0x1f8ee860">CODASYL</a> days come back, and will the linked list on disk be the way to store triples/quads?  I would say that this will have, especially with a memory-mapped file, probably a better best-case than a B-tree but that this also will be less predictable with fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries.  With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation.  Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages.  So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation.</p>

<p>Chris Bizer presented the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1d670da0">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x21928808">BSBM</a>), which has already been discussed here in some detail.  He did acknowledge that the next round of the race must have a real steady-state rule.  This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.</p>

<p>Regardless of steady state, BSBM has two generally valid conclusions:
</p>
<ol>
<li>mapping relational to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xab811020">RDF</a>, where possible, is faster than triple storage; and </li>
<li>the equivalent relational solution can be some 10x faster than the pure triples representation.</li>
</ol>

<p>Mike Dean asked whether BSBM was a case of a setup to have triple stores fail.  Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies.  Therefore it must have a workload where mapping makes sense.  Of course there are workloads where triples are unchallenged; take the <a href="http://challenge.semanticweb.org/" id="link-id0x2538c3b8">Billion Triples Challenge</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x1d673760">data</a> set for one.</p>

<p>Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data.  Also, even if the scale is large, the working set is not nearly the size of the database.  This in fact penalizes mapping technologies against native <a href="http://dbpedia.org/resource/SQL" id="link-id0xac16cc10">SQL</a> since the difference there is compiling the query, especially since parameters are not used.  So, Chris, since we both like to map, let&#39;s make a benchmark that shows mapping closer to native SQL.</p>


<h2>Bridging the 10x Gap?</h2>

<p>When we run Virtuoso relational against Virtuoso triple store with the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1d7dc518">TPC-H</a> workload, we see that the relational case is significantly faster.  These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times.  Why is this?  The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column.  Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower.  This is just a matter of the count of executed instructions.</p>

<p>A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, the needed lookup can often use the previous location as a starting point.  A triple store can do the same if queries have high locality.  We do this in some SQL situations and can try this with triples also.  The RDF workload is typically more random in its access pattern, though.  The other factor is the length of control path.  A column store has a simpler control flow if it knows that the column will have exactly one value per row.  With RDF, this is not a given. Also, the column store&#39;s row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.</p>

<p>There was some discussion on how much closer a triple store could come to a relational one.  Some gains are undoubtedly possible.  We will see.  For the ideal row store workload, the <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x22e5b6f8">RDBMS</a> will continue to have some edge.  Large online systems typically have a large part of the workload that is simple and repetitive.  There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere.  Some degree of application-specific data structure does make sense.  We just need to show how this is done.  In this way, we have a continuum and not an either/or choice of triples vs. tables.</p>
 
<h2>Scale, Where Next?</h2>

<p>Concerning the future direction of the workshop, there were a few directions suggested.  One of the more interesting ones was Mike Dean&#39;s suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical.  Of course, there is the question of scale.  This time, we were the only ones focusing on a parallel database with no restrictions on joining.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1467">
  <rss:title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:15:35Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect? I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times. I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer). &quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said Sören Auer. Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept. This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store. We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there. The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing. 
There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish). Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering. The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted. This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story, for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam? The world out there is complex. 
But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform? &quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot; We should answer in multiple parts. For general collateral, like web sites and documentation: The web site gives a confused product image. For the Virtuoso product, we should divide at the top into Data web and RDF - Host linked data, expose relational assets as linked data; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration; Web Services - access all the above over standard protocols, dynamic web pages, web hosting. For each point, one simple statement. We all know what the above things mean? Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this many more requests per second. This too is clear. As far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this. Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. 
About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either. Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves. I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to. &quot;Virtuoso is very big and very difficult&quot; One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files. This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required. Now for the difficult side: Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to, be it PHP or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI. 
Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered. We should put this into a couple of use case oriented how-tos. Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly. &quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot; Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed. If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply. We are talking about such things with different parties at present. &quot;How webby are you? 
What is webby?&quot; &quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot; We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open. Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud. &quot;How does OpenLink see the next five years unfolding?&quot; Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. 
Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL. The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood. Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point. Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these. With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy. For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. 
The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have. This makes some things possible that were hard thus far. On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth. Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community. Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more. Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1b4a25f0">semantic web</a> ought to be, but is our own perfect?</p>

<p>I will here engage in some critical introspection as well as amplify on some answers given to <a href="http://virtuoso.openlinksw.com" id="link-id0x1e4f9928">Virtuoso</a>-related questions in recent times.</p>

<p>I use some conversations from the <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1e6c0ca8">Vienna</a> <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1e56df88">Linked Data</a> Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x1e680440">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x1e140068">OpenLink Data Spaces</a>) applications line, <a href="http://oat.openlinksw.com/" id="link-id0x1f4ba630">OAT</a> (<a href="http://oat.openlinksw.com/" id="link-id0x1ba4bac8">OpenLink Ajax Toolkit</a>), or <a href="http://ode.openlinksw.com/" id="link-id0x1d4159b0">ODE</a> (<a href="http://ode.openlinksw.com/" id="link-id0x1e973c80">OpenLink Data Explorer</a>).</p>

<h3>&quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x1f8bafe0">Sören Auer</a>.</h3>

<p>Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.</p>

<p>This is why we put a lot of emphasis on Linked Data and the <a href="http://dbpedia.org/resource/Data" id="link-id0x200bd1f0">Data</a> Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1fb03528">RDF</a> store.</p>

<p>We can do this because we own our database and <a href="http://dbpedia.org/resource/SQL" id="link-id0x1e7dcc70">SQL</a> and data access middleware and have a history of connecting to any <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1e9baf18">RDBMS</a> out there.</p>
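
<p>To make the idea of exposing relational data as linked data concrete, here is a minimal sketch in Python. It is only an illustration of the principle, not the Virtuoso mapping mechanism: the base URI, the URI layout, and the function name are assumptions made up for this example, and it materializes triples from one row, whereas a real mapping would answer queries against the live tables.</p>

```python
from urllib.parse import quote

# Hypothetical URI namespace, chosen only for this illustration.
BASE = "http://example.org/tpch"

def row_to_triples(table, key_column, row):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    # The primary key value becomes part of the subject URI.
    subject = f"{BASE}/{table}/{quote(str(row[key_column]))}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject URI; NULLs yield no triple
        predicate = f"{BASE}/schema/{table}#{column}"
        triples.append((subject, predicate, value))
    return triples

# One row shaped like the TPC-H customer table.
customer = {"c_custkey": 42, "c_name": "Customer#000000042", "c_acctbal": 1234.5}
for triple in row_to_triples("customer", "c_custkey", customer):
    print(triple)
```

<p>Each non-key column becomes one triple whose subject is a URI minted from the table name and the key, so every row gets a name that other data sets can link to.</p>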

<p>The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.</p>

<p>There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1f5f6b78">SPARQL</a> options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced <a href="http://sourceforge.net/projects/sesame/" id="link-id0x206818c8">Sesame</a>- and <a href="http://jena.sourceforge.net/" id="link-id0x202b3348">Jena</a>-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).</p>

<p>Now, this message could be better reflected in our material on the web. This <a href="http://dbpedia.org/resource/Blog" id="link-id0x1c82e508">blog</a> is a rather informal step in this direction; more is to come. For now we concentrate on delivering.</p>

<p>The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.</p>

<p>This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x20832510">TPC-H</a> database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?</p>

<p>The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?</p>

<h3>&quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot;</h3>

<p>We should answer in multiple  parts.</p>

<p>For general collateral, like web sites and documentation:</p>

<p>The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into</p>

<ul>  
<li> Data web and RDF - Host linked data, expose relational assets as linked data;</li>
<li> Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;</li>
<li> Web Services - access all the above over standard protocols, dynamic web pages, web hosting.</li>
</ul>

<p>For each point, one simple statement.  We all know what the above things mean?</p>

<p>Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this many more requests per second. This too is clear.</p>

<p>As far as I am concerned, hosting Java or .<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x20283a88">NET</a> does not have to be on the front page. Also, we have no great interest in going against <a href="http://dbpedia.org/resource/Apache" id="link-id0x2024a068">Apache</a> when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.</p>

<p>Then for documentation and training materials: The documentation should be better. Specifically, it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.</p>

<p>Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.</p>

<p>I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.</p>

<h3>&quot;Virtuoso is very big and very difficult&quot;</h3>

<p>One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.</p>

<p>This gives you SQL and SPARQL out of the box.  Adding <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x1ee61058">ODBC</a> and <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1b8c31c0">JDBC</a> clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.</p>

<p>Now for the difficult side:</p>

<p>Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: <a href="http://dbpedia.org/resource/PHP" id="link-id0x20a13c00">PHP</a> or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI.</p>

<p>Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered.</p>

<p>We should put this into a couple of use case oriented how-tos.</p>

<p>Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.</p>

<h3>&quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot;</h3>

<p>Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.</p>

<p>If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.</p>

<p>We are talking about such things with different parties at present.</p>

<h3>&quot;How webby are you?  What is webby?&quot;</h3>

<p>&quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot;</p>

<p>We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant <i>when</i> it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.</p>

<p>Google-style crawling of everything becomes less practical if one needs to run complex <i>ad hoc</i> queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.</p>

<h3>&quot;How does OpenLink see the next five years unfolding?&quot;</h3>

<p>Personally, I think we have the basics for the birth of a new inflection in the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1fb9ae58">knowledge</a> economy. The <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x1f07c648">URI</a> is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more <a href="http://dbpedia.org/resource/Information" id="link-id0x1f007d60">information</a> can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.</p>

<p>The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is <i>context</i>. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.</p>

<p>Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.</p>

<p>Ever since the .com era, the <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id0x2048e670">URL</a> has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.</p>

<p>With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.</p>
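
<p>As a rough sketch of what a canned query product could look like, the Python below keeps a fixed query text under a name and turns it into a request URL. The end point URL, the query name, and the query itself are made up for illustration; the only thing taken from the standard is that the SPARQL protocol passes the query text in a parameter called query.</p>

```python
from urllib.parse import urlencode

# Hypothetical end point URL, used only for this illustration.
ENDPOINT = "http://example.org/sparql"

# A canned query: fixed text with one numeric parameter filled in per request.
CANNED_QUERIES = {
    "most_described": (
        "SELECT ?s (COUNT(?p) AS ?facts) "
        "WHERE { ?s ?p ?o } "
        "GROUP BY ?s ORDER BY DESC(?facts) LIMIT %d"
    ),
}

def canned_query_url(name, limit=10):
    """Build a SPARQL protocol GET URL for one of the canned queries."""
    # int() keeps anything but a number out of the query text.
    query = CANNED_QUERIES[name] % int(limit)
    return ENDPOINT + "?" + urlencode({"query": query})

print(canned_query_url("most_described", limit=5))
```

<p>The point of the design is that the publisher controls the query text and the reader only picks a name and a parameter, so the product is a URL that anyone can link to or embed.</p>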

<p>For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1c696170">Linked Open Data</a> cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.</p>

<p>This makes some things possible that were hard thus far.</p>
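
<p>Why a fixed partition count is a limitation can be seen with a small Python experiment. This is a generic hash partitioning sketch and makes no claim about how Virtuoso places data; the key format and function names are assumptions for the example. With plain modulo placement, changing the number of partitions sends most keys to a different partition (about four in five when going from 4 to 5 with a uniform hash), which is why repartitioning is costly when done by hand.</p>

```python
import hashlib

def partition_of(key, partition_count):
    """Map a key to a partition with a stable hash (not Python's randomized hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

def moved_fraction(keys, old_count, new_count):
    """Fraction of keys that land in a different partition after changing the count."""
    moved = sum(
        1 for k in keys
        if partition_of(k, old_count) != partition_of(k, new_count)
    )
    return moved / len(keys)

# Hypothetical resource names standing in for the keys of a data set.
keys = [f"http://example.org/resource/{i}" for i in range(10000)]
print(moved_fraction(keys, 4, 5))
```

<p>Automatic repartitioning schemes generally aim to move only the share of data that the new partitions need, instead of reshuffling nearly everything as the naive modulo does here.</p>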

<p>On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.</p>

<p>Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x206ab780">Sponger</a>, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.</p>
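
<p>The similarity between federation and clustering is easiest to see in the scatter and gather pattern both rely on. The Python sketch below sends one query to several end points in parallel and merges the distinct rows. The end points are stand-in functions invented for this example so that it runs without a network; a real federation would also split the query by what each end point can answer, which this sketch does not attempt.</p>

```python
from concurrent.futures import ThreadPoolExecutor

def federated_select(endpoints, query):
    """Send the same query to every end point and merge the distinct result rows."""
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # pool.map keeps the results in the same order as the end points.
        partial_results = list(pool.map(lambda ask: ask(query), endpoints))
    merged = []
    seen = set()
    for rows in partial_results:
        for row in rows:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged

# Stand-ins for remote SPARQL end points: each callable takes a query string
# and returns a list of result rows (tuples). Real code would do HTTP here.
def endpoint_a(query):
    return [("http://example.org/a/1", "Alice"), ("http://example.org/shared", "Bob")]

def endpoint_b(query):
    return [("http://example.org/shared", "Bob"), ("http://example.org/b/2", "Carol")]

rows = federated_select([endpoint_a, endpoint_b], "SELECT ?s ?name WHERE { ?s ?p ?name }")
print(rows)
```

<p>In a cluster the same pattern runs between processes on a fast local network; between independent end points each call is a web request, hence the much longer time scales.</p>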

<p>Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.</p>

<p>Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event.  Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465">
  <rss:title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:15:35Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect? I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times. I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer). &quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said Sören Auer. Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept. This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store. We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there. The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing. 
There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish). Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering. The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted. This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam? The world out there is complex. 
But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform? &quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot; We should answer in multiple parts. For general collateral, like web sites and documentation: The web site gives a confused product image. For the Virtuoso product, we should divide at the top into Data web and RDF - Host linked data, expose relational assets as linked data; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration; Web Services - access all the above over standard protocols, dynamic web pages, web hosting. For each point, one simple statement. We all know what the above things mean? Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this many more requests per second. This too is clear. As far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this. Then for documentation and training materials: The documentation should be better. Specifically, it should have more of a how-to dimension since nobody reads the whole thing anyhow. 
About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either. Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves. I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to. &quot;Virtuoso is very big and very difficult&quot; One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files. This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required. Now for the difficult side: Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: PHP or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI. 
Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered. We should put this into a couple of use case oriented how-tos. Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly. &quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot; Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed. If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply. We are talking about such things with different parties at present. &quot;How webby are you? 
What is webby?&quot; &quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot; We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open. Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud. &quot;How does OpenLink see the next five years unfolding?&quot; Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. 
Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL. The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood. Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point. Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these. With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertizing, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy. For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. 
The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have. This makes some things possible that were hard thus far. On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth. Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community. Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more. Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1fa3da18">semantic web</a> ought to be, but is our own perfect?</p>

<p>I will here engage in some critical introspection as well as amplify on some answers given to <a href="http://virtuoso.openlinksw.com" id="link-id0x1e1eecf0">Virtuoso</a>-related questions in recent times.</p>

<p>I use some conversations from the <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1ec0b2e0">Vienna</a> <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2045ac10">Linked Data</a> Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x2045ac38">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x14f63c58">OpenLink Data Spaces</a>) applications line, <a href="http://oat.openlinksw.com/" id="link-id0x14f63c80">OAT</a> (<a href="http://oat.openlinksw.com/" id="link-id0x1e536928">OpenLink Ajax Toolkit</a>), or <a href="http://ode.openlinksw.com/" id="link-id0x1eaed7f8">ODE</a> (<a href="http://ode.openlinksw.com/" id="link-id0x1edfff88">OpenLink Data Explorer</a>).</p>

<h3>&quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x1b8a9580">Sören Auer</a>.</h3>

<p>Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.</p>

<p>This is why we put a lot of emphasis on Linked Data and the <a href="http://dbpedia.org/resource/Data" id="link-id0x1b85fa38">Data</a> Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f5b4468">RDF</a> store.</p>

<p>We can do this because we own our database and <a href="http://dbpedia.org/resource/SQL" id="link-id0x20076468">SQL</a> and data access middleware and have a history of connecting to any <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1ffd6f98">RDBMS</a> out there.</p>

<p>The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.</p>

<p>There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1316f720">SPARQL</a> options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced <a href="http://sourceforge.net/projects/sesame/" id="link-id0x20756a68">Sesame</a>- and <a href="http://jena.sourceforge.net/" id="link-id0x1ec01ac0">Jena</a>-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).</p>

<p>Now, this message could be better reflected in our material on the web. This <a href="http://dbpedia.org/resource/Blog" id="link-id0x2027b410">blog</a> is a rather informal step in this direction; more is to come. For now we concentrate on delivering.</p>

<p>The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.</p>

<p>This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story, for example, turning the whole world into a database. This is why our examples start with things like turning the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x2051ff98">TPC-H</a> database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?</p>

<p>The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?</p>

<h3>&quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot;</h3>

<p>We should answer in multiple  parts.</p>

<p>For general collateral, like web sites and documentation:</p>

<p>The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into</p>

<ul>  
<li> Data web and RDF - Host linked data, expose relational assets as linked data;</li>
<li> Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;</li>
<li> Web Services - access all the above over standard protocols, dynamic web pages, web hosting.</li>
</ul>

<p>For each point, one simple statement.  We all know what the above things mean?</p>

<p>Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means that much more data or, in some cases, that many more requests per second. This too is clear.</p>

<p>As far as I am concerned, hosting Java or .<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x1f297540">NET</a> does not have to be on the front page. Also, we have no great interest in going against <a href="http://dbpedia.org/resource/Apache" id="link-id0x1ea29578">Apache</a> when it comes to a web-server-only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.</p>

<p>Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.</p>

<p>Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.</p>

<p>I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.</p>

<h3>&quot;Virtuoso is very big and very difficult&quot;</h3>

<p>One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.</p>

<p>This gives you SQL and SPARQL out of the box.  Adding <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x20a2e7d0">ODBC</a> and <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1e4cceb8">JDBC</a> clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.</p>
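
<p>As a rough illustration of what "SPARQL out of the box" means from the client side, the following minimal sketch builds a standard SPARQL protocol request using only the Python standard library. The endpoint URL (a local server on port 8890 with a <code>/sparql</code> path) and the helper name <code>build_sparql_request</code> are assumptions made for the example; adjust them to the actual installation.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local server exposing a SPARQL endpoint at this URL.
ENDPOINT = "http://localhost:8890/sparql"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build an HTTP GET request that passes the query text in the
    'query' parameter and asks for JSON-formatted results."""
    params = urlencode({"query": query})
    return Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )


if __name__ == "__main__":
    req = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
    print(req.full_url)
    # Sending the request needs a running server, so it is left commented out:
    #   import json, urllib.request
    #   with urllib.request.urlopen(req) as resp:
    #       rows = json.load(resp)["results"]["bindings"]
```

<p>Nothing here is product-specific: any client that can issue an HTTP request can talk to the endpoint, which is the point of staying with standard interfaces.</p>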

<p>Now for the difficult side:</p>

<p>Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: <a href="http://dbpedia.org/resource/PHP" id="link-id0x20b03f08">PHP</a> or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI.</p>

<p>Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered.</p>

<p>We should put this into a couple of use case oriented how-tos.</p>

<p>Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.</p>

<h3>&quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot;</h3>

<p>Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.</p>

<p>If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.</p>

<p>We are talking about such things with different parties at present.</p>

<h3>&quot;How webby are you?  What is webby?&quot;</h3>

<p>&quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot;</p>

<p>We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant <i>when</i> it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.</p>

<p>Google-style crawling of everything becomes less practical if one needs to run complex <i>ad hoc</i> queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.</p>

<h3>&quot;How does OpenLink see the next five years unfolding?&quot;</h3>

<p>Personally, I think we have the basics for the birth of a new inflection in the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x2018bd98">knowledge</a> economy. The <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x1ec110d8">URI</a> is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more <a href="http://dbpedia.org/resource/Information" id="link-id0x1ecfba08">information</a> can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.</p>

<p>The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is <i>context</i>. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.</p>

<p>Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.</p>

<p>Ever since the .com era, the <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id0x1c4c9248">URL</a> has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.</p>

<p>With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.</p>
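
<p>To make the notion of a "canned SPARQL query" concrete, here is a minimal sketch of a query template with one slot for a resource URI and one for a result limit. The query text, the FOAF properties chosen, and the names <code>CANNED</code> and <code>render</code> are purely illustrative assumptions, not part of any product.</p>

```python
from string import Template

# Hypothetical canned query: vocabulary and variable names are illustrative.
CANNED = Template("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?page
WHERE {
  <$person> foaf:name ?name .
  OPTIONAL { <$person> foaf:homepage ?page }
}
LIMIT $limit
""")

# Characters that may not appear inside an IRI reference in SPARQL.
FORBIDDEN = '<>"{}|^`\\ \n\t'


def render(person_uri, limit=10):
    """Fill the template, rejecting input that could break out of the
    angle-bracketed IRI and change the meaning of the query."""
    if any(c in person_uri for c in FORBIDDEN):
        raise ValueError("not a plain IRI: %r" % person_uri)
    return CANNED.substitute(person=person_uri, limit=int(limit))


if __name__ == "__main__":
    print(render("http://dbpedia.org/resource/Vienna", limit=5))
```

<p>The value of such a product lies in the choice of data sources and the insight of the query, not in the plumbing, which is deliberately trivial.</p>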

<p>For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x20791bf0">Linked Open Data</a> cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.</p>

<p>This makes some things possible that were hard thus far.</p>
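
<p>Why a fixed partition count is a limitation can be seen from a generic hash-partitioning sketch. This is not a description of how Virtuoso partitions data; it is a textbook scheme, with the function names and the CRC32 hash chosen as assumptions for illustration. It shows that, with naive modulo placement, changing the number of partitions relocates most keys, which is why repartitioning benefits from being automated.</p>

```python
import zlib


def partition_of(key, n_partitions):
    """Place a key by hashing it and taking the remainder (naive scheme)."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def moved_fraction(keys, old_n, new_n):
    """Fraction of keys whose partition changes when the count changes."""
    moved = sum(
        1 for k in keys if partition_of(k, old_n) != partition_of(k, new_n)
    )
    return moved / len(keys)


if __name__ == "__main__":
    # Hypothetical resource URIs used only as sample keys.
    keys = ["http://example.org/resource/%d" % i for i in range(1000)]
    print("moved when going from 4 to 5 partitions:",
          moved_fraction(keys, 4, 5))
```

<p>Schemes that reduce this movement, such as many small logical partitions mapped onto fewer physical hosts, trade simplicity for elasticity.</p>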

<p>On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.</p>

<p>Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1f27e6d8">Sponger</a>, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.</p>

<p>Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.</p>

<p>Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event.  Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1466">
  <rss:title>State of the Semantic Web, Part 2 - The Technical Questions (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:02:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Here I will talk about some more technical questions that came up. This is mostly general; Virtuoso specific questions and answers are separate. &quot;How to Bootstrap? Where will the triples come from?&quot; There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern. The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources. &quot;How will one develop applications?&quot; I&#39;d say one will develop them much the same way as thus far. In PHP, for example. Whether one&#39;s query language is SPARQL or SQL does not make a large difference in how basic web UI is made. A SPARQL end-point is no more an end-user item than a SQL command-line is. A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables. The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. 
Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic. What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good. The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall. Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important. For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization. 
For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces. The OpenLink Data Explorer idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this. &quot;What of the proliferation of standards? Does this not look too tangled, no clear identity? How would one know where to begin?&quot; When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure. Just the right amount of structure for the application, less rigid and easier to change than relational. Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure: HTTP, URIs, and XML where desired. It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired. &quot;Can I make a social network application in RDF only, with no RDBMS?&quot; Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students and anyway they do not hang online browsing the course catalogue. 
On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with a RDBMS. If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently. In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Here I will talk about some more technical questions that came up.  This is mostly general; <a href="http://virtuoso.openlinksw.com" id="link-id0x205901a0">Virtuoso</a> specific questions and answers are separate.
</p>

<h3>&quot;How to Bootstrap?  Where will the triples come from?&quot;</h3>

<p>There are already wrappers producing <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x13519ac8">RDF</a> from many applications. Since any structured or semi-structured <a href="http://dbpedia.org/resource/Data" id="link-id0x1c93b418">data</a> can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data <i>per se</i> is not the concern.</p>

<p>The triples may come from any application or database, but they will not come from the end user directly.  There was a good talk about photograph annotation in <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1ea9d150">Vienna</a>, describing many ways of deriving metadata for photos.  The essential wisdom is annotating on the spot and wherever possible doing so automatically.  The consumer is very unlikely to go annotate  photos after the fact.  Further, one can infer that photos made with the same camera around the same time are from the same location.  There are other such heuristics.  In this use case, the end user does not need to see triples.  There is some benefit though in using commonly used geographical terminology for linking to other data sources.</p>

<h3>&quot;How will one develop applications?&quot;</h3>

<p>I&#39;d say one will develop them much the same way as thus far.  In <a href="http://dbpedia.org/resource/PHP" id="link-id0x207fca00">PHP</a>, for example.  Whether one&#39;s query language is <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x20a5fde0">SPARQL</a> or <a href="http://dbpedia.org/resource/SQL" id="link-id0x1a0bb5e0">SQL</a> does not make a large difference in how basic web UI is made.</p>

<p>A SPARQL end-point is no more an end-user item than a SQL command-line is.</p>

<p>A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure.  The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.</p>

<p>The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly.  Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic.  What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting.  Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms.  For this, Ajax technologies like <a href="http://oat.openlinksw.com/" id="link-id0x1e91d118">OAT</a> (the <a href="http://oat.openlinksw.com/" id="link-id0x174b7950">OpenLink AJAX Toolkit</a>) will be good.</p>

<p>The UI ought not to completely hide the URIs of the data from the user.  It should offer a drill down to faceted views of the triples for example.  Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall.</p>

<p>Since then, we have vacillated between modal and non-modal interaction models.  Repetitive workflows like order entry go best modally and are anyway being replaced by web services.  Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example.  But enabling the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1ea14610">knowledge</a> worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs.  Face it: the world is not getting simpler.  It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.</p>

<p>For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested.  For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x207bcd20">linked data</a>.  It makes for more flexible querying.  Linked data does not <i>per se</i> mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.</p>

<p>For automating tasks that involve the same data and queries every time, RDF has no intrinsic superiority.  Thus the user interfaces in places where RDF will have a real edge must be more capable of <i>ad hoc</i> viewing and navigation than regular real-time or line of business user interfaces.</p>

<p>The <a href="http://ode.openlinksw.com/" id="link-id0x2083a6f0">OpenLink Data Explorer</a> idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view.  There are and will be separate clarifications and demos about this.</p>

<h3>&quot;What of the proliferation of standards?  Does this not look too tangled, no clear identity?  How would one know where to begin?&quot;</h3>

<p>When <a href="http://www.w3.org/2001/sw/sweo/" id="link-id0x1e8eac68">SWEO</a> was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure.  Just the right amount of structure for the application, less rigid and easier to change than relational.</p>

<p>Do not even mention the different serialization formats.  Just say that it fits on top of the accepted web infrastructure: <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1e3806b8">HTTP</a>, URIs, and <a href="http://dbpedia.org/resource/XML" id="link-id0x1f547288">XML</a> where desired.</p>

<p>It is misleading to say inference is a box at some specific place in the diagram.  Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step.  Since there is structure and semantics, inference is possible if desired.</p>

<h3>&quot;Can I make a social network application in RDF only, with no <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x20553ee0">RDBMS</a>?&quot;</h3>

<p>Yes, in principle, but what do you have in mind?  The answer is very context dependent.  The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc.  In such a case, RDF is a great match, especially since the user count will not be in the millions.  No university has that many students and anyway they do not hang online browsing the course catalogue.</p>

<p>On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS.</p>

<p>If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical.  This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style.  This is a matter of cost more than of feasibility.  Ten servers cost less than a hundred and have failures ten times less frequently.</p>

<p>In the near term we will see the birth of an application paradigm for the data web. The data will be an open, exposed, first-class citizen; yet the user experience will not have to be a 1:1 image of the data.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1464">
  <rss:title>State of the Semantic Web, Part 2 - The Technical Questions (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:02:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Here I will talk about some more technical questions that came up. This is mostly general; Virtuoso specific questions and answers are separate. &quot;How to Bootstrap? Where will the triples come from?&quot; There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern. The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources. &quot;How will one develop applications?&quot; I&#39;d say one will develop them much the same way as thus far. In PHP, for example. Whether one&#39;s query language is SPARQL or SQL does not make a large difference in how basic web UI is made. A SPARQL end-point is no more an end-user item than a SQL command-line is. A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables. The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. 
Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic. What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good. The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall. Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important. For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization. 
For automating tasks that involve the same data and queries every time, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have a real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces. The OpenLink Data Explorer idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this. &quot;What of the proliferation of standards? Does this not look too tangled, no clear identity? How would one know where to begin?&quot; When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure. Just the right amount of structure for the application, less rigid and easier to change than relational. Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure: HTTP, URIs, and XML where desired. It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired. &quot;Can I make a social network application in RDF only, with no RDBMS?&quot; Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students and anyway they do not hang online browsing the course catalogue. 
On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS. If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently. In the near term we will see the birth of an application paradigm for the data web. The data will be an open, exposed, first-class citizen; yet the user experience will not have to be a 1:1 image of the data.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Here I will talk about some more technical questions that came up.  This is mostly general; <a href="http://virtuoso.openlinksw.com" id="link-id0x1f53d1a0">Virtuoso</a> specific questions and answers are separate.
</p>

<h3>&quot;How to Bootstrap?  Where will the triples come from?&quot;</h3>

<p>There are already wrappers producing <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1beda278">RDF</a> from many applications. Since any structured or semi-structured <a href="http://dbpedia.org/resource/Data" id="link-id0x1e57c648">data</a> can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data <i>per se</i> is not the concern.</p>

<p>The triples may come from any application or database, but they will not come from the end user directly.  There was a good talk about photograph annotation in <a href="http://dbpedia.org/resource/Vienna" id="link-id0x2028b7e8">Vienna</a>, describing many ways of deriving metadata for photos.  The essential wisdom is annotating on the spot and wherever possible doing so automatically.  The consumer is very unlikely to go annotate  photos after the fact.  Further, one can infer that photos made with the same camera around the same time are from the same location.  There are other such heuristics.  In this use case, the end user does not need to see triples.  There is some benefit though in using commonly used geographical terminology for linking to other data sources.</p>
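
<p>The same-camera, same-time heuristic can be sketched in a few lines. This is only an illustration in Python, not anything shown in the Vienna talk: the <code>Photo</code> record, the <code>infer_places</code> helper, the 30 minute window, and the file names are all assumptions made up for the example.</p>

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Photo:
    name: str
    camera: str                  # hypothetical camera identifier
    taken: datetime
    place: Optional[str] = None  # a shared geographic URI, when known


def infer_places(photos, window=timedelta(minutes=30)):
    """Map photo name to place; an unlabelled photo borrows the place of the
    nearest labelled photo from the same camera taken within the window."""
    located = [p for p in photos if p.place is not None]
    result = {}
    for p in photos:
        if p.place is not None:
            result[p.name] = p.place
            continue
        near = [q for q in located
                if q.camera == p.camera and abs(q.taken - p.taken) <= window]
        if near:
            closest = min(near, key=lambda q: abs(q.taken - p.taken))
            result[p.name] = closest.place
    return result


VIENNA = "http://dbpedia.org/resource/Vienna"
photos = [
    Photo("a.jpg", "cam-1", datetime(2008, 10, 22, 10, 0), VIENNA),
    Photo("b.jpg", "cam-1", datetime(2008, 10, 22, 10, 10)),  # same camera, close in time
    Photo("c.jpg", "cam-1", datetime(2008, 10, 22, 15, 0)),   # same camera, hours later
    Photo("d.jpg", "cam-2", datetime(2008, 10, 22, 10, 5)),   # a different camera
]
places = infer_places(photos)
```

<p>Only <code>b.jpg</code> inherits a place here. Because the label is a commonly used geographical URI rather than free text, the result can be linked to other data sources without the user ever seeing a triple.</p>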

<h3>&quot;How will one develop applications?&quot;</h3>

<p>I&#39;d say one will develop them much the same way as thus far.  In <a href="http://dbpedia.org/resource/PHP" id="link-id0x1eff1748">PHP</a>, for example.  Whether one&#39;s query language is <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1d83dff8">SPARQL</a> or <a href="http://dbpedia.org/resource/SQL" id="link-id0x1e9f4e88">SQL</a> does not make a large difference in how basic web UI is made.</p>

<p>A SPARQL end-point is no more an end-user item than a SQL command-line is.</p>

<p>A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure.  The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.</p>

<p>The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly.  Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic.  What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting.  Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms.  For this, Ajax technologies like <a href="http://oat.openlinksw.com/" id="link-id0x1d780520">OAT</a> (the <a href="http://oat.openlinksw.com/" id="link-id0x20943788">OpenLink AJAX Toolkit</a>) will be good.</p>

<p>The UI ought not to completely hide the URIs of the data from the user.  It should offer a drill down to faceted views of the triples for example.  Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall.</p>

<p>Since then, we have vacillated between modal and non-modal interaction models.  Repetitive workflows like order entry go best modally and are anyway being replaced by web services.  Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example.  But enabling the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1e14eb88">knowledge</a> worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs.  Face it: the world is not getting simpler.  It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.</p>

<p>For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested.  For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1e91d118">linked data</a>.  It makes for more flexible querying.  Linked data does not <i>per se</i> mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.</p>

<p>For automating tasks that involve the same data and queries every time, RDF has no intrinsic superiority.  Thus the user interfaces in places where RDF will have a real edge must be more capable of <i>ad hoc</i> viewing and navigation than regular real-time or line of business user interfaces.</p>

<p>The <a href="http://ode.openlinksw.com/" id="link-id0x1c7f8ee0">OpenLink Data Explorer</a> idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view.  There are and will be separate clarifications and demos about this.</p>

<h3>&quot;What of the proliferation of standards?  Does this not look too tangled, no clear identity?  How would one know where to begin?&quot;</h3>

<p>When <a href="http://www.w3.org/2001/sw/sweo/" id="link-id0x1d73c268">SWEO</a> was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure.  Just the right amount of structure for the application, less rigid and easier to change than relational.</p>

<p>Do not even mention the different serialization formats.  Just say that it fits on top of the accepted web infrastructure: <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1efefed0">HTTP</a>, URIs, and <a href="http://dbpedia.org/resource/XML" id="link-id0x1af89b18">XML</a> where desired.</p>

<p>It is misleading to say inference is a box at some specific place in the diagram.  Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step.  Since there is structure and semantics, inference is possible if desired.</p>

<h3>&quot;Can I make a social network application in RDF only, with no <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1cb62cd8">RDBMS</a>?&quot;</h3>

<p>Yes, in principle, but what do you have in mind?  The answer is very context dependent.  The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc.  In such a case, RDF is a great match, especially since the user count will not be in the millions.  No university has that many students and anyway they do not hang online browsing the course catalogue.</p>

<p>On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS.</p>
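
<p>A rough sketch of why the login counter is awkward, assuming a deliberately naive triple representation: the row-oriented version touches one record in place, while the triple version has to find, remove, and re-add one statement per property. The table layout, the in-memory set of triples, and the helper names are invented for this illustration; it shows the shape of the work and is not a benchmark of any real store.</p>

```python
import sqlite3

# Relational style: one row per user, updated in place.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id TEXT PRIMARY KEY, logins INTEGER, last_login TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 0, NULL)")


def login_sql(user, when):
    # A single statement changes both values of the row.
    db.execute(
        "UPDATE users SET logins = logins + 1, last_login = ? WHERE id = ?",
        (when, user),
    )


# Triple style: every property value is a separate statement.
triples = {("alice", "logins", 0)}


def get_value(s, p, default=None):
    return next((o for (s2, p2, o) in triples if s2 == s and p2 == p), default)


def set_value(s, p, o):
    # Replacing a value means deleting the old statement and adding a new one.
    for old in [t for t in triples if t[0] == s and t[1] == p]:
        triples.discard(old)
    triples.add((s, p, o))


def login_triples(user, when):
    set_value(user, "logins", get_value(user, "logins", 0) + 1)
    set_value(user, "last_login", when)


for when in ("2008-10-26T10:00:00Z", "2008-10-26T11:00:00Z"):
    login_sql("alice", when)
    login_triples("alice", when)
```

<p>Both variants end up with the same answer; the difference is how many lookups and writes each login costs, which is the kind of overhead a workload-specific layout avoids.</p>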

<p>If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical.  This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style.  This is a matter of cost more than of feasibility.  Ten servers cost less than a hundred and have failures ten times less frequently.</p>

<p>In the near term we will see the birth of an application paradigm for the data web. The data will be an open, exposed, first-class citizen; yet the user experience will not have to be a 1:1 image of the data.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-24#1460">
  <rss:title>State of the Semantic Web, Part 1 - Sociology, Business, and Messaging (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-24T10:19:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was in Vienna for the Linked Data Practitioners gathering this week. Danny Ayers asked me if I would blog about the State of the Semantic Web or write the This Week&#39;s Semantic Web column. I don&#39;t have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to Virtuoso will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft. Sören Auer asked me to say a few things about relational to RDF mapping. I will cite some highlights from this, as they pertain to the general scene. There was an &quot;open hacking&quot; session Wednesday night featuring lightning talks. I will use some of these too as a starting point. The messaging? The SWEO (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became &quot;Data Unleashed.&quot; Why not? Let&#39;s give this some context. So, if we are holding a Semantic Web 101 session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world. Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with myopenlink.net and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our Billion Triples Challenge demo. We will make a webcast about this to make it all clearer. 
Behold: The Data Web is about the world becoming a database; writing SPARQL queries or triples is incidental. You will write FOAF files by hand just as little as you now write SQL insert statements for filling in your account information on Myspace. Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the RDBMS was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more ad hoc. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced. Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity? A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain. However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic. What is the business value of this? The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. 
Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent media because business loved it for its own sake, it became prevalent because business could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web. Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data. Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with linked open data. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question. We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. 
If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times. But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global information driven economy? The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information. But let&#39;s get back to more specific issues. At the lightning talks in Vienna, one participant said, &quot;Man&#39;s enemy is not the lion that eats men, it&#39;s his own brother. Semantic Web&#39;s enemy is the XML Web services stack that ate its lunch.&quot; There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. 
To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze. SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the ad hoc component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit. &quot;I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it&#39;s too complex, the developer who says it is not what he&#39;s used to, and so on?&quot; This is an education question. One of the findings of SWEO&#39;s enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF based technologies offer to make these more open standards based. Open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: Experts can&#39;t be found even when they are in the next department, or reports and analyses get buried in wikis, spreadsheets, and emails. Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise knowledge bases from HR, CRM, office automation, wikis, etc. 
The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was in <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1f18a540">Vienna</a> for the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1ec788a0">Linked Data</a> Practitioners gathering this week. Danny Ayers asked me if I would <a href="http://dbpedia.org/resource/Blog" id="link-id0x20838238">blog</a> about the State of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x20694ed8">Semantic Web</a> or write the <i>This Week&#39;s Semantic Web</i> column. I don&#39;t have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to <a href="http://virtuoso.openlinksw.com" id="link-id0x20b1cd38">Virtuoso</a> will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft.</p>
<p>
<a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x1ff31b30">Sören Auer</a> asked me to say a few things about relational to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f8118e0">RDF</a> mapping. I will cite some highlights from this, as they pertain to the general scene. There was an &quot;open hacking&quot; session Wednesday night featuring lightning talks. I will use some of these too as a starting point.</p>
<h3>The messaging?</h3>
<p>The <a href="http://www.w3.org/2001/sw/sweo/" id="link-id0x1dc39210">SWEO</a> (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became &quot;<a href="http://dbpedia.org/resource/Data" id="link-id0x1f24dd98">Data</a> Unleashed.&quot; Why not? Let&#39;s give this some context.</p>
<p>So, if we are holding a <i>Semantic Web 101</i> session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.</p>
<p>Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with <a href="http://myopenlink.net/" id="link-id0x1f5e0198">myopenlink</a>.<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x1ec49a00">net</a> and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our <a href="http://challenge.semanticweb.org/" id="link-id0x1ee35f70">Billion Triples Challenge</a> demo. We will make a webcast about this to make it all clearer.</p>
<p>Behold: The Data Web is about the world becoming a database; writing <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x20644808">SPARQL</a> queries or triples is incidental. You will write FOAF files by hand just as little as you now write <a href="http://dbpedia.org/resource/SQL" id="link-id0x1fd9fbc0">SQL</a> insert statements for filling in your account <a href="http://dbpedia.org/resource/Information" id="link-id0x1dfd3540">information</a> on Myspace.</p>
<p>Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1e920868">RDBMS</a> was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more <i>ad hoc</i>. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.</p>
<p>Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?</p>
<p>A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.</p>
<p>However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.</p>
<h3>What is the business value of this?</h3>
<p>The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because business loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web.</p>
<p>Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data.</p>
<p>Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1e715138">linked open data</a>. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.</p>
<p>We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.</p>
<p>But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global information-driven economy?</p>
<p>The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information.</p>
<h3>But let&#39;s get back to more specific issues</h3>
<p>At the lightning talks in Vienna, one participant said, &quot;Man&#39;s enemy is not the lion that eats men, it&#39;s his own brother. Semantic Web&#39;s enemy is the <a href="http://dbpedia.org/resource/XML" id="link-id0x1aeb61b8">XML</a> Web services stack that ate its lunch.&quot; There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.</p>
<p>SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the <i>ad hoc</i> component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit.</p>
<p>&quot;I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it&#39;s too complex, the developer who says it is not what he&#39;s used to, and so on?&quot;</p>
<p>This is an education question. One of the findings of SWEO&#39;s enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF-based technologies offer to make these more open-standards-based. Open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: Experts can&#39;t be found even when they are in the next department, or reports and analyses get buried in wikis, spreadsheets, and emails.</p>
<p>Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1fbd3398">knowledge</a> bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-24#1459">
  <rss:title>State of the Semantic Web, Part 1 - Sociology, Business, and Messaging (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-24T10:19:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was in Vienna for the Linked Data Practitioners gathering this week. Danny Ayers asked me if I would blog about the State of the Semantic Web or write the This Week&#39;s Semantic Web column. I don&#39;t have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to Virtuoso will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft. Sören Auer asked me to say a few things about relational to RDF mapping. I will cite some highlights from this, as they pertain to the general scene. There was an &quot;open hacking&quot; session Wednesday night featuring lightning talks. I will use some of these too as a starting point. The messaging? The SWEO (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became &quot;Data Unleashed.&quot; Why not? Let&#39;s give this some context. So, if we are holding a Semantic Web 101 session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world. Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with myopenlink.net and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our Billion Triples Challenge demo. We will make a webcast about this to make it all clearer. 
Behold: The Data Web is about the world becoming a database; writing SPARQL queries or triples is incidental. You will write FOAF files by hand just as little as you now write SQL insert statements for filling in your account information on Myspace. Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the RDBMS was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more ad hoc. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced. Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity? A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain. However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic. What is the business value of this? The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. 
Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because business loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web. Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data. Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with linked open data. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question. We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. 
If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times. But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global information-driven economy? The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information. But let&#39;s get back to more specific issues. At the lightning talks in Vienna, one participant said, &quot;Man&#39;s enemy is not the lion that eats men, it&#39;s his own brother. Semantic Web&#39;s enemy is the XML Web services stack that ate its lunch.&quot; There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. 
To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze. SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the ad hoc component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit. &quot;I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it&#39;s too complex, the developer who says it is not what he&#39;s used to, and so on?&quot; This is an education question. One of the findings of SWEO&#39;s enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF-based technologies offer to make these more open-standards-based. Open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: Experts can&#39;t be found even when they are in the next department, or reports and analyses get buried in wikis, spreadsheets, and emails. Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise knowledge bases from HR, CRM, office automation, wikis, etc. 
The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was in <a href="http://dbpedia.org/resource/Vienna" id="link-id0x28471870">Vienna</a> for the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x26f0ec28">Linked Data</a> Practitioners gathering this week. Danny Ayers asked me if I would <a href="http://dbpedia.org/resource/Blog" id="link-id0x26cf7678">blog</a> about the State of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x273087e0">Semantic Web</a> or write the <i>This Week&#39;s Semantic Web</i> column. I don&#39;t have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to <a href="http://virtuoso.openlinksw.com" id="link-id0x264e11b8">Virtuoso</a> will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft.</p>
<p>
<a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x2787de70">Sören Auer</a> asked me to say a few things about relational to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x280b12f8">RDF</a> mapping. I will cite some highlights from this, as they pertain to the general scene. There was an &quot;open hacking&quot; session Wednesday night featuring lightning talks. I will use some of these too as a starting point.</p>
<h3>The messaging?</h3>
<p>The <a href="http://www.w3.org/2001/sw/sweo/" id="link-id0x28078030">SWEO</a> (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became &quot;<a href="http://dbpedia.org/resource/Data" id="link-id0x290a48c0">Data</a> Unleashed.&quot; Why not? Let&#39;s give this some context.</p>
<p>So, if we are holding a <i>Semantic Web 101</i> session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.</p>
<p>Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with <a href="http://myopenlink.net/" id="link-id0x272ed6d0">myopenlink</a>.<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x277dbbd0">net</a> and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our <a href="http://challenge.semanticweb.org/" id="link-id0x25fda5c8">Billion Triples Challenge</a> demo. We will make a webcast about this to make it all clearer.</p>
<p>Behold: The Data Web is about the world becoming a database; writing <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x278c3878">SPARQL</a> queries or triples is incidental. You will write FOAF files by hand just as little as you now write <a href="http://dbpedia.org/resource/SQL" id="link-id0x27e6be18">SQL</a> insert statements for filling in your account <a href="http://dbpedia.org/resource/Information" id="link-id0x2727a278">information</a> on Myspace.</p>
<p>Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x26020128">RDBMS</a> was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more <i>ad hoc</i>. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.</p>
<p>Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?</p>
<p>A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.</p>
<p>However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.</p>
<h3>What is the business value of this?</h3>
<p>The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because business loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web.</p>
<p>Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data.</p>
<p>Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x261c7bc8">linked open data</a>. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.</p>
<p>We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.</p>
<p>But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global information-driven economy?</p>
<p>The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information.</p>
<h3>But let&#39;s get back to more specific issues</h3>
<p>At the lightning talks in Vienna, one participant said, &quot;Man&#39;s enemy is not the lion that eats men, it&#39;s his own brother. Semantic Web&#39;s enemy is the <a href="http://dbpedia.org/resource/XML" id="link-id0x26273118">XML</a> Web services stack that ate its lunch.&quot; There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.</p>
<p>SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the <i>ad hoc</i> component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit.</p>
<p>&quot;I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it&#39;s too complex, the developer who says it is not what he&#39;s used to, and so on?&quot;</p>
<p>This is an education question. One of the findings of SWEO&#39;s enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF-based technologies offer to put these on a more open, standards-based footing, and open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: experts can&#39;t be found even when they are in the next department, and reports and analyses get buried in wikis, spreadsheets, and emails.</p>
<p>Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x260172a8">knowledge</a> bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1451">
  <rss:title>Virtuoso Cluster Paper Update</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T10:02:33Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">An updated version of the paper about Virtuoso Cluster is available at 2008webscale_rdf.pdf</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>An updated version of the paper about <a href="http://virtuoso.openlinksw.com" id="link-id0xc0abc50">Virtuoso</a> Cluster is available at <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16459248">2008webscale_rdf.pdf</a>
</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1450">
  <rss:title>Virtuoso Update, Billion Triples and Outlook</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T10:02:32Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will say a few things about what we have been doing and where we can go. Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post. There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe. Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work. We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. Along the way, we also ran BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future, which will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this mapping for real-time lookups or for ETL. RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol. In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web. 
I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post. Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even run the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time. Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, a relational table plus mapping is the way to go. With Virtuoso, such tables can coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later. The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also, SPARQL, once it has the necessary extensions for aggregation and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run-time cost, and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility. We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. 
There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case. We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface. We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person. Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible. The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity. Still, one must know what to ask. 
In this respect, the self-describing nature of RDF is unmatched. A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable. Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience. The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports. Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso Update, Billion Triples and Outlook</div>
<p>I will say a few things about what we have been doing and where we can go.</p>

<p>Firstly, we have a fairly scalable platform with <a href="http://virtuoso.openlinksw.com" id="link-id0x1aa82dc0">Virtuoso</a> 6 Cluster. It was most recently tested with the workload discussed in the previous <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id1638a5b8">Billion Triples post</a>.</p>

<p>There is an updated version of <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16280a68">the paper about this</a>. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.</p>

<p>Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more <a href="http://dbpedia.org/resource/SQL" id="link-id0x1abd3f38">SQL</a> optimizations specific to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1adbe410">RDF</a>. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.</p>

<p>We spent a lot of time around the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1aaa0e78">Berlin SPARQL Benchmark</a> story, so we got to the more advanced stuff like the <a href="http://challenge.semanticweb.org/" id="link-id0x1a860a50">Billion Triples Challenge</a> rather late. Along the way, we also ran <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1a27f2a8">BSBM</a> with an <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1ad5c918">Oracle</a> back-end, with Virtuoso mapping <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1cf0e4a0">SPARQL</a> to SQL. This merits its own analysis in the near future, which will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this mapping for real-time lookups or for ETL.</p>
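<p>To make the idea of mapping an OLTP table to RDF concrete, here is a minimal Python sketch that turns relational rows into triples on the fly. The <code>product</code> table, its columns, and the <code>http://example.org/</code> URI scheme are hypothetical, and this is not Virtuoso&#39;s mapping language, only an illustration of the principle.</p>

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical URI prefix, not from the post


def rows_to_triples(conn, table, key):
    """Yield (subject, predicate, object) for every non-key column of every row."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    for row in cur:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key]}"
        for column, value in record.items():
            # The key column names the subject; NULL columns produce no triple.
            if column != key and value is not None:
                yield (subject, f"{BASE}{table}#{column}", value)


# Hypothetical sample table standing in for an OLTP system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT, price REAL)")
conn.execute("INSERT INTO product VALUES (1, 'widget', 9.5)")
triples = list(rows_to_triples(conn, "product", "id"))
```

<p>Generating the triples per request corresponds to the real-time lookup case; writing them out once to a triple store corresponds to ETL.</p>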

<p>RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1ab96bb0">RDB2RDF XG</a>. Examples of complex warehouses include the <a href="http://neurocommons.org/page/Main_Page" id="link-id0x1adb2db0">Neurocommons</a> database, the Billion Triples Challenge, and the <a href="http://www.garlik.com/" id="link-id0x1925c7b0">Garlik DataPatrol</a>.</p>

<p>In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1c6d1480">Linked Data</a> forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1a937400">data</a> web becomes as indispensable as presence on the HTML web.</p>

<p>I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.</p>

<p>Now, all the things shown in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id14de1d18">Billion Triples post</a> can be done with a relational system specially built for each purpose. Since we are a general purpose <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1a457c70">RDBMS</a>, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even run the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.</p>

<p>Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, a relational table plus mapping is the way to go. With Virtuoso, such tables can coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.</p>
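<p>As an illustration of such a purpose-built structure, the sketch below keeps tag co-occurrence counts in a single relational table, one row per unordered tag pair. The sample posts and the <code>tag_pair</code> table are made up for the example, and SQLite stands in for the RDBMS.</p>

```python
import sqlite3
from itertools import combinations

# Hypothetical sample data: each post with its tags.
posts = {
    "post1": ["rdf", "sparql", "sql"],
    "post2": ["rdf", "sparql"],
    "post3": ["rdf", "sql"],
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tag_pair (tag_a TEXT, tag_b TEXT, n INTEGER, "
    "PRIMARY KEY (tag_a, tag_b))"
)
for tags in posts.values():
    # Store each unordered pair once, with tag_a sorting before tag_b.
    for a, b in combinations(sorted(set(tags)), 2):
        conn.execute("INSERT OR IGNORE INTO tag_pair VALUES (?, ?, 0)", (a, b))
        conn.execute(
            "UPDATE tag_pair SET n = n + 1 WHERE tag_a = ? AND tag_b = ?", (a, b)
        )

# The one question this table exists to answer: which tags occur together most?
pairs = conn.execute(
    "SELECT tag_a, tag_b, n FROM tag_pair ORDER BY n DESC, tag_a, tag_b"
).fetchall()
```

<p>Each pair costs one compact row here, whereas a triple representation would need a node per pair plus several triples to carry the two tags and the count.</p>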

<p>The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also, SPARQL, once it has the necessary extensions for aggregation and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run-time cost, and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.</p>
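<p>The load-time alternative can be sketched as follows: compute the transitive closure of the sub-class relation once, so that a frequently run report only needs a set lookup. The class hierarchy and the instance data are hypothetical, and the function names are ours.</p>

```python
def superclass_closure(subclass_of):
    """Map each class to all of its direct and indirect superclasses."""

    def ancestors(cls, seen):
        result = set()
        for parent in subclass_of.get(cls, ()):
            if parent not in seen:  # guard against cycles in the hierarchy
                result.add(parent)
                result |= ancestors(parent, seen | {parent})
        return result

    return {cls: ancestors(cls, {cls}) for cls in subclass_of}


# Hypothetical class hierarchy: Blogger is a Person, Person is an Agent.
subclass_of = {"Blogger": ["Person"], "Person": ["Agent"], "Agent": []}
closure = superclass_closure(subclass_of)  # computed once, at load time


def instances_of(target, types):
    """Entities whose asserted type is target or any sub-class of it."""
    return sorted(
        entity
        for entity, cls in types.items()
        if cls == target or target in closure.get(cls, ())
    )


types = {"alice": "Blogger", "bob": "Person", "acme": "Agent"}
people = instances_of("Person", types)
```

<p>Doing the same expansion inside every query is the run-time cost mentioned above; precomputing it trades flexibility for speed on reports that never change.</p>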

<p>We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of <a href="http://umbel.org/about/" id="link-id0x1aa5ea18">UMBEL</a> and <a href="http://dbpedia.org/resource/Cyc" id="link-id0x1a631a20">OpenCyc</a>. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.</p>
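<p>As a back-of-the-envelope sketch of using these figures for costing, the snippet below scales them linearly with the number of triples. Linear scaling is our simplifying assumption, not a measured result; only the two dollar amounts per billion triples come from the text above.</p>

```python
# Dollars of hardware per billion triples, as quoted in the post.
COST_PER_BILLION = {
    "memory": 32_000,  # whole working set served from memory
    "mixed": 8_000,    # acceptable to wait for disk part of the time
}


def estimate_cost(triples, mode):
    """Rough hardware cost in dollars, assuming cost grows linearly with size."""
    return COST_PER_BILLION[mode] * triples / 1_000_000_000


five_billion_in_memory = estimate_cost(5_000_000_000, "memory")
five_billion_mixed = estimate_cost(5_000_000_000, "mixed")
```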

<p>We expect to be able to combine geography, social proximity, subject matter, and <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0x1aebdcc8">named entities</a>, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.</p>

<p>We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.</p>
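<p>The hop limit is what keeps such queries bounded. A minimal sketch of the idea, using a breadth-first walk over a hypothetical <code>knows</code> graph held in a Python dictionary (in a database this would be a chain of joins instead):</p>

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: distance} for nodes reachable in at most max_hops steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical social graph: who knows whom.
knows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "dave": ["erin"],
    "erin": ["frank"],
}
reach = within_hops(knows, "alice", 2)
```

<p>With a limit of 2, the walk from <code>alice</code> stops at <code>dave</code> and never touches <code>erin</code> or <code>frank</code>, however large the rest of the graph is.</p>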

<p>Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.</p>

<p>The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of <a href="http://dbpedia.org/resource/Information" id="link-id0x1ab88490">information</a> and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.</p>

<p>Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like <i>list the top 10 attributes with the most distinct values for all persons</i> cannot be done in SQL. SQL simply does not allow the columns to be variable.</p>
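<p>Because attributes are themselves data in RDF, this question reduces to an ordinary grouping over predicates. The sketch below does it in plain Python over a hypothetical list of triples; the sample data and the function name are ours, and a real system would of course run this inside the database.</p>

```python
from collections import defaultdict


def top_attributes(triples, person_class, limit=10):
    """Rank predicates by number of distinct values over subjects of person_class."""
    persons = {s for s, p, o in triples if p == "rdf:type" and o == person_class}
    values = defaultdict(set)
    for s, p, o in triples:
        if s in persons and p != "rdf:type":
            values[p].add(o)
    # Most distinct values first; ties broken by predicate name.
    ranked = sorted(values.items(), key=lambda item: (-len(item[1]), item[0]))
    top = ranked[:limit]
    return [(p, len(v)) for p, v in top]


# Hypothetical sample data.
triples = [
    ("alice", "rdf:type", "foaf:Person"),
    ("bob", "rdf:type", "foaf:Person"),
    ("alice", "foaf:name", "Alice"),
    ("bob", "foaf:name", "Bob"),
    ("alice", "foaf:interest", "rdf"),
    ("bob", "foaf:interest", "rdf"),
    ("acme", "rdf:type", "foaf:Organization"),
    ("acme", "foaf:name", "Acme"),
]
top = top_attributes(triples, "foaf:Person")
```

<p>Nothing in the function names a specific attribute, which is exactly what a fixed set of columns does not allow.</p>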

<p>Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.</p>

<p>The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.</p>
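<p>A canned report can be as simple as fixed query text with a few escaped parameters. The sketch below uses Python&#39;s <code>string.Template</code>; the query, the <code>ex:keyword</code> property, and the function names are hypothetical, and prefix declarations are assumed to be prepended elsewhere.</p>

```python
from string import Template

# Hypothetical report: the query text is fixed, only the parameters vary.
POSTS_BY_KEYWORD = Template(
    'SELECT ?post WHERE { ?post ex:keyword "$keyword" } LIMIT $limit'
)


def escape_literal(text):
    """Escape characters that would break out of a SPARQL string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def render_report(keyword, limit=20):
    """Fill the fixed query text with escaped, validated parameters."""
    return POSTS_BY_KEYWORD.substitute(
        keyword=escape_literal(keyword),
        limit=int(limit),  # int() rejects anything that is not a number
    )


query = render_report('semantic "web"', limit=5)
```

<p>The user only ever supplies the parameter values; the SPARQL itself stays under the control of whoever wrote the report.</p>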

<p>Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1449">
  <rss:title>Virtuoso Cluster Paper Update</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T09:38:14Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">An updated version of the paper about Virtuoso Cluster is available at 2008webscale_rdf.pdf</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>An updated version of the paper about <a href="http://virtuoso.openlinksw.com" id="link-id0x17b3d2c8">Virtuoso</a> Cluster is available at <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16459248">2008webscale_rdf.pdf</a>
</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448">
  <rss:title>Virtuoso Update, Billion Triples and Outlook</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T09:31:17Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will say a few things about what we have been doing and where we can go. Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post. There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe. Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work. We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. Along the way, we also ran BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future, which will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this mapping for real-time lookups or for ETL. RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol. In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web. I will now talk about the complex warehouse/web-harvesting side. 
I will come to the mapping in another post. Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even run the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time. Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, a relational table plus mapping is the way to go. With Virtuoso, such tables can coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later. The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also, SPARQL, once it has the necessary extensions for aggregation and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run-time cost, and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility. We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. 
Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case. We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface. We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person. Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible. The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity. Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. 
A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable. Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience. The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports. Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will say a few things about what we have been doing and where we can go.</p>

<p>Firstly, we have a fairly scalable platform with <a href="http://virtuoso.openlinksw.com" id="link-id0xa412e450">Virtuoso</a> 6 Cluster. It was most recently tested with the workload discussed in the previous <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id1638a5b8">Billion Triples post</a>.</p>

<p>There is an updated version of <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16280a68">the paper about this</a>. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.</p>

<p>Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more <a href="http://dbpedia.org/resource/SQL" id="link-id0x1c1c5f48">SQL</a> optimizations specific to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1bcb6108">RDF</a>. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.</p>

<p>We spent a lot of time around the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x3a4e17c8">Berlin SPARQL Benchmark</a> story, so we got to the more advanced stuff like the <a href="http://challenge.semanticweb.org/" id="link-id0x1a66c568">Billion Triples Challenge</a> rather late. Along the way, we also ran <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x188c2608">BSBM</a> with an <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1aa97f98">Oracle</a> back-end, with Virtuoso mapping <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1abd87a0">SPARQL</a> to SQL. This merits its own analysis in the near future, which will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this mapping for real-time lookups or for ETL.</p>

<p>RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1a941678">RDB2RDF XG</a>. Examples of complex warehouses include the <a href="http://neurocommons.org/page/Main_Page" id="link-id0x1aa5a9f8">Neurocommons</a> database, the Billion Triples Challenge, and the <a href="http://www.garlik.com/" id="link-id0x372df7b0">Garlik DataPatrol</a>.</p>

<p>In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1a671cf0">Linked Data</a> forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1ab83dd0">data</a> web becomes as indispensable as presence on the HTML web.</p>

<p>I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.</p>

<p>Now, all the things shown in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id14de1d18">Billion Triples post</a> can be done with a relational system specially built for each purpose. Since we are a general purpose <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x340d3470">RDBMS</a>, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even run the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.</p>

<p>Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, a relational table plus mapping is the way to go. With Virtuoso, such tables can coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.</p>

<p>The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also, SPARQL, once it has the necessary extensions for aggregation and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run-time cost, and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.</p>

<p>We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of <a href="http://umbel.org/about/" id="link-id0x19cabf38">UMBEL</a> and <a href="http://dbpedia.org/resource/Cyc" id="link-id0x19cecd10">OpenCyc</a>. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.</p>

<p>We expect to be able to combine geography, social proximity, subject matter, and <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0x1a8202e8">named entities</a>, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.</p>

<p>We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.</p>

<p>Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.</p>

<p>The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of <a href="http://dbpedia.org/resource/Information" id="link-id0x1ab0a918">information</a> and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.</p>

<p>Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like <i>list the top 10 attributes with the most distinct values for all persons</i> cannot be written as a single generic SQL query against a conventional relational schema. SQL simply does not allow the columns to be variable.</p>
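
<p>A minimal sketch of such a query, in the Virtuoso SPARQL dialect with aggregates, could look like the following. It assumes that persons are typed as foaf:Person and that the foaf: prefix is predeclared on the server; the variable names are only illustrative.</p>

<pre>sparql 
select ?p count (distinct ?o) 
where 
  { 
    ?s a foaf:Person . 
    ?s ?p ?o 
  } 
group by ?p 
order by desc 2 
limit 10
;
</pre>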

<p>Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.</p>

<p>The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.</p>

<p>Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-30#1446">
  <rss:title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-30T16:24:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Introduction We use Virtuoso 6 Cluster Edition to demonstrate the following: Text and structured information based lookups Analytics queries Analysis of co-occurrence of features like interests and tags. Dealing with identity of multiple IRI&#39;s (owl:sameAs) The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension. The demo queries can also be run directly against the SPARQL end point. The demo is being worked on at the time of submission and may be shown online by appointment. Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008 the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and Open CYC ontologies. Also examples involving owl:sameAs are being added, likewise with similarity metrics and search hit scores. The Data The database consists of the billion triples data sets and some additions like Umbel. Also the Freebase extract is newer than the challenge original. The triple count is 1115 million. In the case of web harvested resources, the data is loaded in one graph per resource. In the case of larger data sets like Dbpedia or the US census, all triples of the provenance share a data set specific graph. All string literals are additionally indexed in a full text index. No stop words are used. Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices. The Queries The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist on one hand of well known SQL features like aggregation with grouping and existence and value subqueries and on the other of RDF specific features. 
The latter include run time RDFS and OWL inferencing support and backward chaining subclasses and transitivity. Simple Lookups sparql select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) where { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) } limit 10 ; This looks up triples with semantic web in the object and makes a search hit summary of the literal, highlighting the search terms. sparql select ?tp count(*) where { ?s ?p2 ?o2 . ?o2 a ?tp . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?tp order by desc 2 limit 40 ; This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt. What are these things called? sparql select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 rdfs:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; Many of these things do not have a rdfs:label. Let us use a more general concept of lable which groups dc:title, foaf:name and other name-like properties together. The subproperties are resolved at run time, there is no materialization. sparql define input:inference &#39;b3s&#39; select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 b3s:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing. sparql select ?g count(*) where { graph ?g { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) } } group by ?g order by desc 2 ; Now some web 2.0 tagging of search results. The tag cloud of &quot;computer&quot; sparql select ?lbl count (*) where { ?s ?p ?o . ?o bif:contains &quot;computer&quot; . ?s sioc:topic ?tg . optional { ?tg rdfs:label ?lbl } } group by ?lbl order by desc 2 limit 40 ; This query will find the posters who talk the most about sex. 
sparql select ?auth count (*) where { ?d dc:creator ?auth . ?d ?p ?o filter (bif:contains (?o, &quot;sex&quot;)) } group by ?auth order by desc 2 ; Analytics We look for people who are joined by having relatively uncommon interests but do not know each other. sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 where { { select ?i count (*) as ?cnt where { ?p foaf:interest ?i } group by ?i } filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) . ?p1 foaf:interest ?i . ?p2 foaf:interest ?i . filter (?p1 != ?p2 &amp;&amp; !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) . ?p1 foaf:nick ?n1 . ?p2 foaf:nick ?n2 . } order by ?cnt limit 50 ; The query takes a fairly long time, mostly spent counting the interested in 25M interest triples. It then takes people that share the interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query can be written more efficently but is here just to show that database-wide scans of the population are possible ad hoc. Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, specially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF. 
create table tag_count (tcn_tag iri_id_8, tcn_count int, primary key (tcn_tag)); alter index tag_count on tag_count partition (tcn_tag int (0hexffff00)); create table tag_coincidence (tc_t1 iri_id_8, tc_t2 iri_id_8, tc_count int, tc_t1_count int, tc_t2_count int, primary key (tc_t1, tc_t2)) alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00)); create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00)); How many times each topic is mentioned? insert into tag_count select * from (sparql define output:valmode &quot;LONG&quot; select ?t count (*) as ?cnt where { ?s sioc:topic ?t } group by ?t) xx option (quietcast); Take all t1, t2 where t1 and t2 are tags of the same subject, store only the permutation where the internal id of t1 &lt; that of t2. insert into tag_coincidence (tc_t1, tc_t2, tc_count) select &quot;t1&quot;, &quot;t2&quot;, cnt from (select &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt from (sparql define output:valmode &quot;LONG&quot; select ?t1 ?t2 where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags where &quot;t1&quot; &lt; &quot;t2&quot; group by &quot;t1&quot;, &quot;t2&quot;) xx where isiri_id (&quot;t1&quot;) and isiri_id (&quot;t2&quot;) option (quietcast); Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster. update tag_coincidence set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1), tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2); Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing. 
To show the URI&#39;s of the tags: select top 10 id_to_iri (tc_T1), id_to_iri (tc_t2), tc_count from tag_coincidence order by tc_count desc; Social Networks We look at what interests people have sparql select ?o ?cnt where { { select ?o count (*) as ?cnt where { ?s foaf:interest ?o } group by ?o } filter (?cnt &gt; 100) } order by desc 2 limit 100 ; Now the same for the Harry Potter fans sparql select ?i2 count (*) where { ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; . ?p foaf:interest ?i2 } group by ?i2 order by desc 2 limit 20 ; We see whether knows relations are symmmetrical. We return the top n people that others claim to know without being reciprocally known. sparql select ?celeb, count (*) where { ?claimant foaf:knows ?celeb . filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant }))) } group by ?celeb order by desc 2 limit 10 ; We look for a well connected person to start from. sparql select ?p count (*) where { ?p foaf:knows ?k } group by ?p order by desc 2 limit 50 ; We look for the most connected of the many online identities of Stefan Decker. sparql select ?sd count (distinct ?xx) where { ?sd a foaf:Person . ?sd ?name ?ns . filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . ?sd foaf:knows ?xx } group by ?sd order by desc 2 ; We count the transitive closure of Stefan Decker&#39;s connections sparql select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Now we do the same while following owl:sameAs links. sparql define input:same-as &quot;yes&quot; select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Demo System The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. 
The system demonstrated hosts these 12 servers on 2 machines, each with 2 xXeon 5345 and 16GB memory and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. The timing difference between first and second run is easily an order of magnitude.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<h2>Introduction</h2> 

<p>We use <a href="http://virtuoso.openlinksw.com" id="link-id0xb03e418">Virtuoso</a> 6 Cluster Edition to demonstrate the following:</p>
<ul>
<li>Text and structured <a href="http://dbpedia.org/resource/Information" id="link-id0xbd9dae8">information</a> based lookups</li>
<li>Analytics queries</li>
<li>Analysis of co-occurrence of features like interests and tags</li>
<li>Dealing with multiple IRIs that identify the same thing (<a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0xb383dd8">owl</a>:sameAs)</li>
</ul>

<p>The demo is based on a set of canned <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xbda6298">SPARQL</a> queries that can be invoked using the <a href="http://ode.openlinksw.com/" id="link-id0xbb292f0">OpenLink Data Explorer</a> (<a href="http://ode.openlinksw.com/" id="link-id0xc263528">ODE</a>) Firefox extension.</p>
<p>The demo queries can also be run directly against the SPARQL endpoint.</p>

<p>The demo is being worked on at the time of submission and may be shown online by appointment.</p>

<p>Automatic annotation of the <a href="http://dbpedia.org/resource/Data" id="link-id0xa173378">data</a> based on <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xbdda558">named entity extraction</a> is
being worked on at the time of this submission.  By the time of ISWC
2008, the set of sample queries will be extended with queries based on
extracted <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xa66fbe0">named entities</a> and their relationships in the <a href="http://umbel.org/about/" id="link-id0xa06e2c8">UMBEL</a> and
OpenCyc ontologies.
</p>

<p>Examples involving owl:sameAs are also being added, as are examples with similarity metrics and search hit scores.</p>

<h2>The Data</h2>

<p>The database consists of the billion triples data sets and some additions like UMBEL.  The Freebase extract is also newer than the one in the challenge original.</p>
<p>The triple count is 1115 million.</p>
<p>In the case of web harvested resources, the data is loaded in one graph per resource.</p>
<p>In the case of larger data sets like <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xc2bf770">DBpedia</a> or the US census, all triples from the same source share a data-set-specific graph.</p>
<p>All string literals are additionally indexed in a full text index.  No stop words are used.</p>

<p>Most queries do not specify a graph.  Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS.  All indices ending in S are bitmap indices.
</p>

<h2>The Queries </h2>


<p>The demo uses Virtuoso SPARQL extensions in most queries.  These
extensions consist, on the one hand, of well-known <a href="http://dbpedia.org/resource/SQL" id="link-id0xaf8cb40">SQL</a> features like
aggregation with grouping, and existence and value subqueries, and, on
the other, of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xafdceb8">RDF</a>-specific features.
The latter include run-time RDFS and OWL inferencing support, with backward
chaining over subclasses, subproperties, and transitivity.
</p>


<h3>Simple Lookups</h3> 

<pre>sparql 
select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) 
  } 
limit 10
;
</pre>

<p>This looks up triples with the phrase &quot;semantic web&quot; in the object and makes a search hit summary of the literal, 
highlighting the search terms.
</p>

<pre>sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?tp
order by desc 2
limit 40
;
</pre>

<p>This looks at what types of things are referenced by the properties of the subject whose foaf:nick is plaid_skirt.</p>
<p>What are these things called?</p>

<pre>sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>

<p>Many of these things do not have an rdfs:label.  Let us use a more general concept of label 
which groups dc:title, foaf:name, and other name-like properties together.  The subproperties are 
resolved at run time; there is no materialization.
</p>

<pre>sparql 
define input:inference &#39;b3s&#39;
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>
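
<p>The rule set named b3s has to be declared before a query can refer to it.  A hedged sketch of how this might be done 
with the Virtuoso functions ttlp and rdfs_rule_set follows.  The schema graph IRI and the b3s: namespace IRI are 
placeholders invented for illustration, not the ones used in the demo, and only the subproperties named above are listed.
</p>

<pre>-- Hypothetical schema graph and b3s: namespace; substitute the real ones.
-- Load the subproperty declarations into a schema graph.
ttlp (&#39;
  @prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
  @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
  @prefix dc:   &lt;http://purl.org/dc/elements/1.1/&gt; .
  @prefix b3s:  &lt;http://example.org/b3s#&gt; .
  rdfs:label rdfs:subPropertyOf b3s:label .
  dc:title   rdfs:subPropertyOf b3s:label .
  foaf:name  rdfs:subPropertyOf b3s:label .
  &#39;, &#39;&#39;, &#39;http://example.org/b3s-schema&#39;);

-- Name the rule set so that queries can say: define input:inference &#39;b3s&#39;
rdfs_rule_set (&#39;b3s&#39;, &#39;http://example.org/b3s-schema&#39;);
</pre>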

<p>We can list sources by the topics they contain.  
Below we look for graphs that mention terrorist bombing.
</p>

<pre>sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) 
      }
  } 
group by ?g 
order by desc 2
;
</pre>

<p>Now some web 2.0 tagging of search results.  The <a href="http://dbpedia.org/resource/Tag" id="link-id0xa8b89f8">tag</a> cloud of &quot;computer&quot;:</p>

<pre>sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains &quot;computer&quot; . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;
</pre>

<p>This query will find the posters who talk the most about sex.</p>

<pre>sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, &quot;sex&quot;)) 
  } 
group by ?auth
order by desc 2
;
</pre>

<h3>Analytics </h3>

<p>We look for people who are joined by having relatively uncommon interests but do not know each other.</p>

<pre>sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 &amp;&amp; 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;
</pre>

<p>The query takes a fairly long time, mostly spent counting the interests across 25M interest triples.  
It then takes people that share the interest and checks that neither claims to know the other.  
It then sorts the results, rarest interest first.  The query can be written more efficiently but is 
here just to show that database-wide scans of the population are possible ad hoc.
</p>

<p>Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style
related tags line at the bottom of a search result page.  This showcases the use of SQL together 
with SPARQL.  The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is 
much more efficiently done in SQL, especially since it gets updated as the data changes.  
This is an example of materialized intermediate results based on warehoused RDF.
</p>

<pre>create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key  (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
</pre>

<p>How many times is each topic mentioned?</p>

<pre>
insert into tag_count 
  select * 
    from (sparql define output:valmode &quot;LONG&quot; 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);
</pre>

<p>Take all t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 &lt; that of t2.</p>

<pre>insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select &quot;t1&quot;, &quot;t2&quot;, cnt 
    from 
      (select  &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt 
         from 
           (sparql define output:valmode &quot;LONG&quot;
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where &quot;t1&quot; &lt; &quot;t2&quot; 
         group by &quot;t1&quot;, &quot;t2&quot;) xx
    where isiri_id (&quot;t1&quot;) and 
          isiri_id (&quot;t2&quot;) 
    option (quietcast); 
</pre>

<p>Now put the individual occurrence counts into the same table with the co-occurrence.  This 
denormalization makes the related tags lookup faster.
</p>


<pre>update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
</pre>

<p>Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.  
A single select will return a Technorati-style related tags listing.
</p>

<p>To show the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x9d4bc60">URI</a>s of the tags:
</p>

<pre>select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;
</pre>
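
<p>For a single tag, the related tags lookup could be sketched as below.  Because only the half-matrix is stored, 
the tag has to be looked up on both sides of the pair; the tc2 index serves the second branch.  The tag IRI is a 
made-up placeholder, and iri_to_id is called with a second argument of 0 on the assumption that this avoids 
creating an id for an unknown IRI.
</p>

<pre>select top 10 id_to_iri (related), cnt 
  from 
    (select tc_t2 as related, tc_count as cnt 
       from tag_coincidence 
       where tc_t1 = iri_to_id (&#39;http://example.org/tag/computer&#39;, 0) 
     union all 
     select tc_t1 as related, tc_count as cnt 
       from tag_coincidence 
       where tc_t2 = iri_to_id (&#39;http://example.org/tag/computer&#39;, 0)) xx 
  order by cnt desc;
</pre>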

<h3>Social Networks </h3>

<p>We look at what interests people have.</p>

<pre>sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt &gt; 100) 
  } 
order by desc 2 
limit 100
;
</pre>

<p>Now the same for Harry Potter fans: which other interests do they have?</p>

<pre>sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;
</pre>

<p>We see whether foaf:knows relations are symmetrical.  We return the top n people that others claim to know without being reciprocally known.</p>

<pre>sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;
</pre>

<p>We look for a well connected person to start from.</p>

<pre>sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;
</pre>

<p>We look for the most connected of the many online identities of Stefan Decker.</p>

<pre>sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;
</pre>

<p>We count the transitive closure of Stefan Decker&#39;s connections.</p>

<pre>sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>

<p>Now we do the same while following owl:sameAs links.</p>

<pre>sparql 
define input:same-as &quot;yes&quot;
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>

<h2>Demo System</h2> 

<p>The system runs on Virtuoso 6 Cluster Edition.  The database is partitioned into 12 partitions, 
each served by a distinct server process. The demonstrated system hosts these 12 servers on 2 
machines, each with 2 x Xeon 5345, 16GB memory, and 4 SATA disks. For scaling, the processes 
and corresponding partitions can be spread over a larger number of machines.  If each process ran on its 
own server with 16GB RAM, the whole data set could be served from memory. This is desirable for 
search engine or fast analytics applications. Most of the demonstrated queries run in memory on 
second invocation. The timing difference between first and second run is easily an order of 
magnitude.
</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-09-30#1445">
  <rss:title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-30T15:39:26Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Introduction We use Virtuoso 6 Cluster Edition to demonstrate the following: Text and structured information based lookups Analytics queries Analysis of co-occurrence of features like interests and tags. Dealing with identity of multiple IRI&#39;s (owl:sameAs) The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension. The demo queries can also be run directly against the SPARQL end point. The demo is being worked on at the time of submission and may be shown online by appointment. Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008 the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and Open CYC ontologies. Also examples involving owl:sameAs are being added, likewise with similarity metrics and search hit scores. The Data The database consists of the billion triples data sets and some additions like Umbel. Also the Freebase extract is newer than the challenge original. The triple count is 1115 million. In the case of web harvested resources, the data is loaded in one graph per resource. In the case of larger data sets like Dbpedia or the US census, all triples of the provenance share a data set specific graph. All string literals are additionally indexed in a full text index. No stop words are used. Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices. The Queries The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist on one hand of well known SQL features like aggregation with grouping and existence and value subqueries and on the other of RDF specific features. 
The latter include run time RDFS and OWL inferencing support and backward chaining subclasses and transitivity. Simple Lookups sparql select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) where { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) } limit 10 ; This looks up triples with semantic web in the object and makes a search hit summary of the literal, highlighting the search terms. sparql select ?tp count(*) where { ?s ?p2 ?o2 . ?o2 a ?tp . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?tp order by desc 2 limit 40 ; This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt. What are these things called? sparql select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 rdfs:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; Many of these things do not have a rdfs:label. Let us use a more general concept of lable which groups dc:title, foaf:name and other name-like properties together. The subproperties are resolved at run time, there is no materialization. sparql define input:inference &#39;b3s&#39; select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 b3s:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing. sparql select ?g count(*) where { graph ?g { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) } } group by ?g order by desc 2 ; Now some web 2.0 tagging of search results. The tag cloud of &quot;computer&quot; sparql select ?lbl count (*) where { ?s ?p ?o . ?o bif:contains &quot;computer&quot; . ?s sioc:topic ?tg . optional { ?tg rdfs:label ?lbl } } group by ?lbl order by desc 2 limit 40 ; This query will find the posters who talk the most about sex. 
sparql select ?auth count (*) where { ?d dc:creator ?auth . ?d ?p ?o filter (bif:contains (?o, &quot;sex&quot;)) } group by ?auth order by desc 2 ; Analytics We look for people who are joined by having relatively uncommon interests but do not know each other. sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 where { { select ?i count (*) as ?cnt where { ?p foaf:interest ?i } group by ?i } filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) . ?p1 foaf:interest ?i . ?p2 foaf:interest ?i . filter (?p1 != ?p2 &amp;&amp; !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) . ?p1 foaf:nick ?n1 . ?p2 foaf:nick ?n2 . } order by ?cnt limit 50 ; The query takes a fairly long time, mostly spent counting the interested in 25M interest triples. It then takes people that share the interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query can be written more efficently but is here just to show that database-wide scans of the population are possible ad hoc. Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, specially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF. 
create table tag_count (tcn_tag iri_id_8, tcn_count int, primary key (tcn_tag)); alter index tag_count on tag_count partition (tcn_tag int (0hexffff00)); create table tag_coincidence (tc_t1 iri_id_8, tc_t2 iri_id_8, tc_count int, tc_t1_count int, tc_t2_count int, primary key (tc_t1, tc_t2)) alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00)); create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00)); How many times each topic is mentioned? insert into tag_count select * from (sparql define output:valmode &quot;LONG&quot; select ?t count (*) as ?cnt where { ?s sioc:topic ?t } group by ?t) xx option (quietcast); Take all t1, t2 where t1 and t2 are tags of the same subject, store only the permutation where the internal id of t1 &lt; that of t2. insert into tag_coincidence (tc_t1, tc_t2, tc_count) select &quot;t1&quot;, &quot;t2&quot;, cnt from (select &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt from (sparql define output:valmode &quot;LONG&quot; select ?t1 ?t2 where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags where &quot;t1&quot; &lt; &quot;t2&quot; group by &quot;t1&quot;, &quot;t2&quot;) xx where isiri_id (&quot;t1&quot;) and isiri_id (&quot;t2&quot;) option (quietcast); Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster. update tag_coincidence set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1), tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2); Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing. 
To show the URIs of the tags: select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count from tag_coincidence order by tc_count desc; Social Networks We look at what interests people have. sparql select ?o ?cnt where { { select ?o count (*) as ?cnt where { ?s foaf:interest ?o } group by ?o } filter (?cnt &gt; 100) } order by desc 2 limit 100 ; Now the same for the Harry Potter fans. sparql select ?i2 count (*) where { ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; . ?p foaf:interest ?i2 } group by ?i2 order by desc 2 limit 20 ; We check whether foaf:knows relations are symmetrical. We return the top 10 people that others claim to know without being reciprocally known. sparql select ?celeb count (*) where { ?claimant foaf:knows ?celeb . filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant }))) } group by ?celeb order by desc 2 limit 10 ; We look for a well-connected person to start from. sparql select ?p count (*) where { ?p foaf:knows ?k } group by ?p order by desc 2 limit 50 ; We look for the most connected of the many online identities of Stefan Decker. sparql select ?sd count (distinct ?xx) where { ?sd a foaf:Person . ?sd ?name ?ns . filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . ?sd foaf:knows ?xx } group by ?sd order by desc 2 ; We count the transitive closure of Stefan Decker&#39;s connections. sparql select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Now we do the same while following owl:sameAs links. sparql define input:same-as &quot;yes&quot; select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Demo System The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. 
The system demonstrated hosts these 12 servers on 2 machines, each with 2 x Xeon 5345 and 16GB memory and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each process ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. The timing difference between first and second run is easily an order of magnitude.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<h2>Introduction</h2> 

<p>We use <a href="http://virtuoso.openlinksw.com" id="link-id0xa278560">Virtuoso</a> 6 Cluster Edition to demonstrate the following:</p>
<ul>
<li>Text and structured <a href="http://dbpedia.org/resource/Information" id="link-id0xb3a4490">information</a> based lookups</li>
<li>Analytics queries</li>
<li>Analysis of co-occurrence of features like interests and tags</li>
<li>Dealing with multiple IRIs that identify the same thing (<a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0xa904bd8">owl</a>:sameAs)</li>
</ul>

<p>The demo is based on a set of canned <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xac185d0">SPARQL</a> queries that can be invoked using the <a href="http://ode.openlinksw.com/" id="link-id0xb8efe28">OpenLink Data Explorer</a> (<a href="http://ode.openlinksw.com/" id="link-id0xb341808">ODE</a>) Firefox extension.</p>
<p>The demo queries can also be run directly against the SPARQL end point.</p>

<p>The demo is being worked on at the time of submission and may be shown online by appointment.</p>

<p>Automatic annotation of the <a href="http://dbpedia.org/resource/Data" id="link-id0xa2fcc88">data</a> based on <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xc085440">named entity extraction</a> is
being worked on at the time of this submission.  By the time of ISWC
2008 the set of sample queries will be enhanced with queries based on
extracted <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xa92b3e0">named entities</a> and their relationships in the <a href="http://umbel.org/about/" id="link-id0xa1c7c38">UMBEL</a> and
OpenCyc ontologies.
</p>

<p>Examples involving owl:sameAs are also being added, as are examples using similarity metrics and search hit scores.</p>

<h2>The Data</h2>

<p>The database consists of the Billion Triples Challenge data sets plus some additions such as UMBEL. The Freebase extract is also newer than the one in the original challenge data.</p>
<p>The triple count is 1115 million.</p>
<p>For web-harvested resources, the data is loaded into one graph per resource.</p>
<p>For larger data sets like <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xa949850">DBpedia</a> or the US census, all triples from the same source share a data-set-specific graph.</p>
<p>All string literals are additionally indexed in a full text index.  No stop words are used.</p>

<p>Most queries do not specify a graph.  Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS.  All indices ending in S are bitmap indices.
</p>

<h2>The Queries </h2>


<p>The demo uses Virtuoso SPARQL extensions in most queries.  These
extensions consist, on the one hand, of well-known <a href="http://dbpedia.org/resource/SQL" id="link-id0xc116190">SQL</a> features such as
aggregation with grouping and existence and value subqueries and, on
the other, of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xa9047f0">RDF</a>-specific features.
The latter include run-time RDFS and OWL inferencing support, with backward
chaining of subclasses and transitivity.
</p>


<h3>Simple Lookups</h3> 

<pre>sparql 
select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) 
  } 
limit 10
;
</pre>

<p>This looks up triples with the phrase &quot;semantic web&quot; in the object and produces a search hit summary of the literal, 
highlighting the search terms.
</p>

<pre>sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?tp
order by desc 2
limit 40
;
</pre>

<p>This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.</p>
<p>What are these things called?</p>

<pre>sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>

<p>Many of these things do not have an rdfs:label.  Let us use a more general concept of label 
which groups dc:title, foaf:name and other name-like properties together.  The subproperties are 
resolved at run time; there is no materialization.
</p>

<pre>sparql 
define input:inference &#39;b3s&#39;
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>

<p>We can list sources by the topics they contain.  
Below we look for graphs that mention terrorist bombing.
</p>

<pre>sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) 
      }
  } 
group by ?g 
order by desc 2
;
</pre>

<p>Now some Web 2.0 style tagging of search results: the <a href="http://dbpedia.org/resource/Tag" id="link-id0xa366510">tag</a> cloud of &quot;computer&quot;.</p>

<pre>sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains &quot;computer&quot; . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;
</pre>

<p>This query will find the posters who talk the most about sex.</p>

<pre>sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, &quot;sex&quot;)) 
  } 
group by ?auth
order by desc 2
;
</pre>

<h3>Analytics </h3>

<p>We look for people who are joined by having relatively uncommon interests but do not know each other.</p>

<pre>sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 &amp;&amp; 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;
</pre>

<p>The query takes a fairly long time, mostly spent counting the people interested in each topic across 25M interest triples.  
It then takes people that share the interest and checks that neither claims to know the other.  
It then sorts the results rarest interest first.  The query can be written more efficiently but is 
here just to show that database-wide scans of the population are possible ad hoc.
</p>

<p>Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style
related tags line at the bottom of a search result page.  This showcases the use of SQL together 
with SPARQL.  The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is 
much more efficiently done in SQL, especially since it gets updated as the data changes.  
This is an example of materialized intermediate results based on warehoused RDF.
</p>

<pre>create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key  (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
</pre>

<p>How many times is each topic mentioned?</p>

<pre>
insert into tag_count 
  select * 
    from (sparql define output:valmode &quot;LONG&quot; 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);
</pre>

<p>Take all pairs t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal ID of t1 is less than that of t2.</p>

<pre>insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select &quot;t1&quot;, &quot;t2&quot;, cnt 
    from 
      (select  &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt 
         from 
           (sparql define output:valmode &quot;LONG&quot;
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where &quot;t1&quot; &lt; &quot;t2&quot; 
         group by &quot;t1&quot;, &quot;t2&quot;) xx
    where isiri_id (&quot;t1&quot;) and 
          isiri_id (&quot;t2&quot;) 
    option (quietcast); 
</pre>

<p>Now put the individual occurrence counts into the same table with the co-occurrence.  This 
denormalization makes the related tags lookup faster.
</p>


<pre>update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
</pre>

<p>Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.  
A single select will return a Technorati-style related tags listing.
</p>

<p>To show the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0xaf355c8">URI</a>s of the tags:
</p>

<pre>select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;
</pre>
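
<p>As a rough sketch (not part of the demo scripts), the related tags of one given tag could be fetched as follows.  The tag IRI is a made-up placeholder, and we assume that the built-in iri_to_id, the inverse of the id_to_iri used above, can be called with a single argument.  Since only the permutation with t1 &lt; t2 is stored, the tag may appear in either column, so the two cases are combined with a UNION ALL; the first branch can use the primary key and the second the tc2 index.
</p>

<pre>-- Hypothetical tag IRI; replace with a tag that exists in the data.
select top 10 id_to_iri (rel.related) as related_tag, 
       rel.tc_count, 
       rel.related_count 
  from (select tc_t2 as related, tc_count, tc_t2_count as related_count 
          from tag_coincidence 
          where tc_t1 = iri_to_id (&#39;http://example.org/tag/semantic_web&#39;) 
        union all 
        select tc_t1 as related, tc_count, tc_t1_count as related_count 
          from tag_coincidence 
          where tc_t2 = iri_to_id (&#39;http://example.org/tag/semantic_web&#39;)) rel 
  order by rel.tc_count desc;
</pre>

<p>Ordering by the ratio of tc_count to related_count instead would favor tags that are specific to the given tag over tags that are simply frequent everywhere.
</p>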

<h3>Social Networks </h3>

<p>We look at what interests people have.</p>

<pre>sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt &gt; 100) 
  } 
order by desc 2 
limit 100
;
</pre>

<p>Now the same for Harry Potter fans: which other interests do they have?</p>

<pre>sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;
</pre>

<p>We check whether foaf:knows relations are symmetrical.  We return the top 10 people that others claim to know without being reciprocally known.</p>

<pre>sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;
</pre>

<p>We look for a well-connected person to start from.</p>

<pre>sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;
</pre>

<p>We look for the most connected of the many online identities of Stefan Decker.</p>

<pre>sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;
</pre>

<p>We count the transitive closure of Stefan Decker&#39;s connections.</p>

<pre>sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>

<p>Now we do the same while following owl:sameAs links.</p>

<pre>sparql 
define input:same-as &quot;yes&quot;
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>

<h2>Demo System</h2> 

<p>The system runs on Virtuoso 6 Cluster Edition.  The database is partitioned into 12 partitions, 
each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 
machines, each with 2 x Xeon 5345 and 16GB memory and 4 SATA disks. For scaling, the processes 
and corresponding partitions can be spread over a larger number of machines.  If each process ran on its 
own server with 16GB RAM, the whole data set could be served from memory. This is desirable for 
search engine or fast analytics applications. Most of the demonstrated queries run in memory on 
second invocation. The timing difference between first and second run is easily an order of 
magnitude.
</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1436">
  <rss:title>Requirements for Relational-to-RDF Mapping</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:41:25Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Requirements for Relational-to-RDF Mapping Many of you will know about the W3C relational-to-RDF mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping. To this effect, I recently summarized the group discussions and some of our own experiences around the topic at &lt;http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling&gt;. I will here discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis-à-vis the following points, even if any working group will unavoidably encounter these issues on the way. A blog post on the other hand can be more specific. I gave a talk to the RDB2RDF XG this spring, with these slides. The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided. At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. There is almost no comparison between the complexity of doing non-trivial mappings on-the-fly versus mapping as ETL. Some of this complexity spills over into the requirements for a mapping language. Eliminating JOINs We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are JOINing using such triples, we end up with UNIONs of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario. 
To make the point clearer, suppose a query like &quot;list the organizations whose representatives have published about xx.&quot; Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the laboratories that have published with articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author&#39;s affiliation in the same database. However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another: A paper in database X will usually not have an author in database Y. The IDs in database Y, even if perchance equal to the IDs in X, do not mean the same thing, and there is no point joining across databases by them. This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint. This is critical. Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted SQL over the same data sources. Expectations and Limitations on Queries SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are literals in the query. 
Virtuoso has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso&#39;s SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third-party SQL, no such tricks can be used. The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run-time cost and complexity. But we must not lose the baby with the bath-water. Aside from offering global identifiers, RDF&#39;s attractions include subclasses and sub-predicates. In relational terms, these translate to UNIONs and do involve some added cost. A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive. Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings. ETL Ou Ne Pas ETL? Whether to warehouse or not? If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Kashiup Vipul gave a position paper at last year&#39;s RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets. The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples. Our take is that if something is a large or very large relational store, then map; else, ETL. 
With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations. Conclusions If you map on demand, watch out for an explosion of UNIONs when integrating sources that talk of similar things. If you integrate lots of sources, some ETL is likely unavoidable. Look for ways of dealing with part ETL, part mapping. ETLing everything is not always best or even possible. If you map a single fairly-clean RDB to RDF, mapping will work well, potentially much faster than triple storage. Higher storage density and more data per index lookup on the relational side. If you map on demand, some restrictions to SPARQL may be practically necessary. These have to do with variables in predicate position, variables in class position, etc. Individual implementations may support these, but standardization will likely have to put limits on them. This was a quick summary, by no means comprehensive, on what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Requirements for Relational-to-RDF Mapping</div>
<p>Many of you will know about the W3C relational-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e1be0a8">RDF</a> mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping.</p>

<p>To this effect, I recently summarized the group discussions and some of our own experiences around the topic at &lt;<a href="http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling" id="link-id146030e8">http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling</a>&gt;.</p>

<p>I will here discuss this less formally and more in the light of our own experience.  A working group goal statement must be neutral vis-à-vis the following points, even if any working group will unavoidably encounter these issues on the way.  A <a href="http://dbpedia.org/resource/Blog" id="link-id0x1e6b3950">blog</a> post on the other hand can be more specific.</p>

<p>I gave a talk to the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0xa0932c68">RDB2RDF XG</a> this spring, with these <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt" id="link-id14572540">slides</a>.</p>

<p>The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided.</p>

<p>At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. In practice, however, there is almost no comparison between the complexity of doing non-trivial mappings on-the-fly versus mapping as ETL. Some of this complexity spills over into the requirements for a mapping language. </p>

<h2>Eliminating JOINs</h2> 

<p>We expect to have a situation where one virtual triple can have many possible sources.  The mapping is a union of mapped databases.  Any integration scenario will have this feature. In such a situation, if we are <code>JOIN</code>ing using such triples, we end up with <code>UNION</code>s of all databases that could produce the triples in question.   This is generally not desired.  Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario.</p>

<p>To make the point clearer, consider a query like &quot;list the organizations whose representatives have published about <i>xx</i>.&quot;  Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the laboratories that have published articles with <a href="http://dbpedia.org/resource/Tag" id="link-id0xa0977bf0">tag</a> <i>XX</i>.  It is a matter of common sense in this scenario that a publication will have the author and the author&#39;s affiliation in the same database.  However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table.  To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another:  A paper in database <i>X</i> will usually not have an author in database <i>Y</i>.  The IDs in database <i>Y</i>, even if perchance equal to the IDs in <i>X</i>, do not mean the same thing, and there is no point joining across databases by them.</p>

<p>This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping.  If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint.</p>

<p>This is critical.  Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted <a href="http://dbpedia.org/resource/SQL" id="link-id0xa09490f8">SQL</a> over the same <a href="http://dbpedia.org/resource/Data" id="link-id0xa095efd0">data</a> sources.</p>
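
<p>The difference can be sketched in plain SQL.  Everything in the example is hypothetical: two mapped databases instead of three, the schemas db1 and db2, the tables pub (id, author_id) and pub_tag (pub_id, tag), and the IRI patterns.  It only illustrates the shape of the problem and is not the output of any particular mapping system.</p>

<pre>-- Naive translation: each triple pattern becomes a UNION over every mapped 
-- source, and the patterns are joined on the generated IRI strings.
select a.author_iri 
  from (select &#39;http://db1.example.org/pub/&#39; || cast (pub_id as varchar) as pub_iri 
          from db1.pub_tag where tag = &#39;xx&#39; 
        union all 
        select &#39;http://db2.example.org/pub/&#39; || cast (pub_id as varchar) 
          from db2.pub_tag where tag = &#39;xx&#39;) t 
  join (select &#39;http://db1.example.org/pub/&#39; || cast (id as varchar) as pub_iri, 
               &#39;http://db1.example.org/person/&#39; || cast (author_id as varchar) as author_iri 
          from db1.pub 
        union all 
        select &#39;http://db2.example.org/pub/&#39; || cast (id as varchar), 
               &#39;http://db2.example.org/person/&#39; || cast (author_id as varchar) 
          from db2.pub) a 
    on a.pub_iri = t.pub_iri;

-- Disjointness-aware translation: the db1 and db2 patterns can never yield 
-- equal IRIs, so the cross-database combinations are dropped and each 
-- remaining branch joins on the native keys.
select &#39;http://db1.example.org/person/&#39; || cast (p.author_id as varchar) as author_iri 
  from db1.pub_tag t join db1.pub p on p.id = t.pub_id 
  where t.tag = &#39;xx&#39; 
union all 
select &#39;http://db2.example.org/person/&#39; || cast (p.author_id as varchar) 
  from db2.pub_tag t join db2.pub p on p.id = t.pub_id 
  where t.tag = &#39;xx&#39;;
</pre>

<p>The second form is what one would write by hand.  A mapping system can only arrive at it if the IRI patterns make the disjointness of the two sources inferable.</p>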

<h2>Expectations and Limitations on Queries</h2>

<p>
  <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1e360230">SPARQL</a> queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are literals in the query.</p>

<p>
  <a href="http://virtuoso.openlinksw.com" id="link-id0x1f5edb30">Virtuoso</a> has some SQL extensions for dealing with breaking a wide table into a row per column.  This facilitates dealing with predicates that are not known at query compile time.  If the table in question is not managed by Virtuoso, Virtuoso&#39;s SQL virtualization/federation takes care of the matter.  If a mapping system goes directly to third-party SQL, no such tricks can be used.</p>

<p>The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined.  For example, one will probably have to require that all predicates be literals, i.e., constants rather than variables.  The alternative is prohibitive run-time cost and complexity.</p>
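
<p>To illustrate the cost in generic SQL, without any Virtuoso extensions: assume a hypothetical table person (id, name, email, phone) whose three text columns are mapped to one predicate each.  The table, the IRI pattern, and the choice of predicates are all made up for the sketch.</p>

<pre>-- Predicate given as a constant, e.g. { ?s foaf:name ?o }: 
-- one column, one scan.
select &#39;http://example.org/person/&#39; || cast (id as varchar) as s, 
       name as o 
  from person 
  where name is not null;

-- Predicate given as a variable, e.g. { ?s ?p ?o }: 
-- every mapped column becomes one UNION branch, 
-- i.e., the wide table is broken into a row per column.
select &#39;http://example.org/person/&#39; || cast (id as varchar) as s, 
       &#39;http://xmlns.com/foaf/0.1/name&#39; as p, 
       name as o 
  from person where name is not null 
union all 
select &#39;http://example.org/person/&#39; || cast (id as varchar), 
       &#39;http://xmlns.com/foaf/0.1/mbox&#39;, 
       email 
  from person where email is not null 
union all 
select &#39;http://example.org/person/&#39; || cast (id as varchar), 
       &#39;http://xmlns.com/foaf/0.1/phone&#39;, 
       phone 
  from person where phone is not null;
</pre>

<p>With many tables and many mapped columns, the number of branches grows accordingly, which is why a standard may need to restrict such queries.</p>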

<p>But we must not throw the baby out with the bathwater. Aside from offering global identifiers, RDF&#39;s attractions include subclasses and sub-predicates.  In relational terms, these translate to <code>UNION</code>s and do involve some added cost.  A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive.  Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings.</p>

<h2>ETL Ou Ne Pas ETL?</h2>

<p>Whether to warehouse or not?  If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Kashiup Vipul gave a position paper at last year&#39;s RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets.  The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples.</p>

<p>Our take is that if something is a large or very large relational store, then map; else, ETL.  With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations.</p>

<h2>Conclusions</h2> 

<ul>
<li>If you map on demand, watch out for an explosion of <code>UNION</code>s when integrating sources that talk of similar things.</li>
<li>If you integrate lots of sources, some ETL is likely unavoidable.  Look for ways of dealing with part ETL, part mapping.  ETLing everything is not always best or even possible.</li>
<li>If you map a single fairly clean RDB to RDF, mapping will work well, potentially much faster than triple storage.  The relational side offers higher storage density and more data per index lookup.</li>
<li>If you map on demand, some restrictions to SPARQL may be practically necessary.  These have to do with variables in predicate position, variables in class position, etc.  Individual implementations may support these, but standardization will likely have to put limits on them.</li>
</ul>

<p>This was a quick summary, by no means comprehensive, of what an eventual RDB2RDF working group would come across.  This is a sort of addendum to the requirements I outlined on the ESW wiki.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1435">
  <rss:title>Transitivity and Graphs for SQL</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:41:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Transitivity and Graphs for SQL Background I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language. The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead. It is now time to apply this principle to graph traversal. The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers. The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step. This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc. Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying. The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows definition of recursive views. 
The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children. The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL&#39;s non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach. Take a question like &quot;list the parts of products of category C which have materials that are classified as toxic.&quot; Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure. Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth: regular cost based optimization. Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF. In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. 
Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation. SQL and Transitivity We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., SELECT in another SELECT&#39;s FROM clause, with a TRANSITIVE clause. Consider the data: CREATE TABLE &quot;knows&quot; (&quot;p1&quot; INT, &quot;p2&quot; INT, PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;) ); ALTER INDEX &quot;knows&quot; ON &quot;knows&quot; PARTITION (&quot;p1&quot; INT); CREATE INDEX &quot;knows2&quot; ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) PARTITION (&quot;p2&quot; INT); We represent a social network with the many-to-many relation &quot;knows&quot;. The persons are identified by integers. INSERT INTO &quot;knows&quot; VALUES (1, 2); INSERT INTO &quot;knows&quot; VALUES (1, 3); INSERT INTO &quot;knows&quot; VALUES (2, 4); SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p1&quot; = 1; We obtain the result: p1 p2 1 3 1 2 1 4 The operation is reversible: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1. 
To see what would happen if x knowing y also meant y knowing x, one could write: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM (SELECT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; UNION ALL SELECT &quot;p2&quot;, &quot;p1&quot; FROM &quot;knows&quot; ) &quot;k2&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 3 4 Now, since we know that 1 and 4 are related, we can ask how they are related. SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot;, T_STEP (1) AS &quot;via&quot;, T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, T_STEP (&#39;path_id&#39;) AS &quot;path&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 AND &quot;p2&quot; = 4; p1 p2 via step path 1 4 1 0 0 1 4 2 1 0 1 4 4 2 0 The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other. For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as: SELECT p2, dist, (SELECT COUNT (*) FROM &quot;knows&quot; &quot;c&quot; WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot; ) FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct &quot;p1&quot;, &quot;p2&quot;, t_step (&#39;step_no&#39;) AS &quot;dist&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 ORDER BY &quot;dist&quot;, 3 DESC; p2 dist aggregate 2 1 1 3 1 0 4 2 0 How? The queries shown above work on Virtuoso v6. 
When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets. Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes. Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS. Next In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Transitivity and Graphs for SQL</div>
<h2>Background</h2> 

<p>I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the <a href="http://dbpedia.org/resource/SQL" id="link-id0xa1a18c58">SQL</a> query language.</p>

<p>The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example.  It is all about extracting the common features of applications and making these the features of a platform instead.</p>

<p>It is now time to apply this principle to graph traversal.</p>

<p>The rationale is that graph operations are tedious to write in a parallelizable, latency-tolerant manner. Writing them as one would for memory-based <a href="http://dbpedia.org/resource/Data" id="link-id0xaf8c730">data</a> structures is easier, but does not scale as soon as any latency is involved, e.g., disk reads or messages between cluster peers.</p>

<p>The ad-hoc nature and very large volume of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xae41ef0">RDF</a> data make this a timely question.  Up until now, the answer to this question has been to materialize any implied facts in RDF stores.  If <i>a</i> is part of <i>b</i>, and <i>b</i> is part of <i>c</i>, the implied fact that <i>a</i> is part of <i>c</i> is inserted explicitly into the database as a pre-query step.</p>
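
<p>As a minimal sketch of such a pre-query materialization step, the implied part-of facts can be derived by repeating a self-join until nothing new appears.  The function name and the sample facts below are hypothetical, chosen only for illustration; this is plain Python, not anything an RDF store ships.</p>

<blockquote>
 <pre><code>def materialize_part_of(facts):
    """Return the given (part, whole) pairs plus every implied pair."""
    closure = set(facts)
    while True:
        # a part-of b and b part-of c together imply a part-of c.
        implied = {(a, c) for (a, b) in closure for (b2, c) in closure if b == b2}
        if implied.issubset(closure):
            return closure
        closure |= implied


facts = {("a", "b"), ("b", "c")}
print(sorted(materialize_part_of(facts)))</code>
 </pre></blockquote>

<p>The loop has to be rerun whenever the underlying facts change, which is the cost that inference on demand avoids.</p>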

<p>This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query.  The activity becomes less ad-hoc.</p>

<p>Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to move in and out of the set being analyzed.  This is why with <a href="http://virtuoso.openlinksw.com" id="link-id0xb68f9d0">Virtuoso</a> we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying.</p>

<p>The SQL world has taken steps towards dealing with recursion with the <code>WITH - UNION</code> construct which allows definition of recursive views.  The idea there is to define, for example, a tree walk as a <code>UNION</code> of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children.</p>
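
<p>For comparison, here is a rough sketch of that style of recursive view, run against SQLite through Python&#39;s standard <code>sqlite3</code> module.  The <code>tree</code> table, its columns, and the sample rows are assumptions made up for this example.</p>

<blockquote>
 <pre><code>import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (parent INT, child INT)")
conn.executemany("INSERT INTO tree VALUES (?, ?)", [(1, 2), (1, 3), (2, 4)])

# The walk is the starting node plus, recursively, the children of
# every node that is already part of the walk.
WALK = """
WITH RECURSIVE walk (node) AS (
    SELECT ?
    UNION
    SELECT tree.child FROM tree JOIN walk ON tree.parent = walk.node
)
SELECT node FROM walk ORDER BY node
"""


def descendants(root):
    """Return the root and all nodes below it, in ascending order."""
    return [row[0] for row in conn.execute(WALK, (root,))]


print(descendants(1))
print(descendants(2))</code>
 </pre></blockquote>

<p>The direction of the walk is fixed by the join condition inside the view; walking from a leaf toward the root needs a differently written view.</p>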

<p>The main problem with this is that I do not see how a SQL optimizer could effectively rearrange queries involving <code>JOIN</code>s between such recursive views.  This model of recursion seems to lose SQL&#39;s non-procedural nature.  One can no longer easily rearrange <code>JOIN</code>s based on what data is given and what is to be retrieved.  If the recursion is written from root to leaf, it is not obvious how to evaluate it from leaf to root.  At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.</p>

<p>Take a question like &quot;list the parts of products of category <i>C</i> which have materials that are classified as toxic.&quot;  Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure.</p>

<p>Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category.  One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth: regular cost-based optimization.</p>

<p>Especially with RDF, there are many problems of this type.  In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.</p>

<p>In Virtuoso, we see <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb3bdcc0">SPARQL</a> as reducing to SQL.  Any RDF-oriented database-engine or query-optimization feature is accessed via SQL.  Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, <i>ipso facto</i>, an SQL feature.  Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.</p>

<h2> SQL and Transitivity </h2>

<p>We will here look at some simple social network queries.  A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., <code>SELECT</code> in another <code>SELECT</code>&#39;s <code>FROM</code> clause, with a <code>TRANSITIVE</code> clause.</p>

<p>Consider the data:</p>

<blockquote>
 <pre><code>CREATE TABLE &quot;knows&quot; 
   (&quot;p1&quot; INT, 
    &quot;p2&quot; INT, 
    PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;)
   );
ALTER INDEX &quot;knows&quot; 
   ON &quot;knows&quot; 
   PARTITION (&quot;p1&quot; INT);
CREATE INDEX &quot;knows2&quot; 
   ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) 
   PARTITION (&quot;p2&quot; INT);
</code>
 </pre></blockquote>

<p>We represent a social network with the many-to-many relation &quot;knows&quot;.  The persons are identified by integers.</p>

<blockquote>
 <pre><code>INSERT INTO &quot;knows&quot; VALUES (1, 2);
INSERT INTO &quot;knows&quot; VALUES (1, 3);
INSERT INTO &quot;knows&quot; VALUES (2, 4);</code>
 </pre>

<pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p1&quot; = 1;</code></pre></blockquote>

<p>We obtain the result:</p>

<blockquote>
<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">3</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">2</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>The operation is reversible:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;
</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>Since we now give <i>p2</i>, we traverse from <i>p2</i> towards <i>p1</i>. The result set states that 4 is known by 2 directly and, since 2 is known by 1, transitively by 1.</p>

<p>To see what would happen if <i>x</i> knowing <i>y</i> also meant <i>y</i> knowing <i>x</i>, one could write:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM (SELECT 
                  &quot;p1&quot;, 
                  &quot;p2&quot; 
               FROM &quot;knows&quot; 
               UNION ALL 
                  SELECT 
                     &quot;p2&quot;, 
                     &quot;p1&quot; 
                  FROM &quot;knows&quot;
              ) &quot;k2&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>


<p>Now, since we know that 1 and 4 are related, we can ask how they are related.</p>
<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (1) AS &quot;via&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, 
            T_STEP (&#39;path_id&#39;) AS &quot;path&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
      AND &quot;p2&quot; = 4;</code>
 </pre>

<table width="250">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">via</th>
    <th align="center" width="50">step</th>
    <th align="center" width="50">path</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">1</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>


<p>The first two columns are the ends of the path.  The next column is the person that is a step on the path.  The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., <i>p1</i>, has number 0.  Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.</p>
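
<p>To make the column layout concrete, the following Python sketch produces rows of the same shape for a single path found by a breadth-first search over the sample data.  It only illustrates what the columns mean; the function name is hypothetical and this is not a description of how Virtuoso evaluates the query.</p>

<blockquote>
 <pre><code>from collections import deque

KNOWS = [(1, 2), (1, 3), (2, 4)]


def path_rows(edges, start, end):
    """Return (p1, p2, via, step, path) rows for one path from start to end."""
    successors = {}
    for p1, p2 in edges:
        successors.setdefault(p1, []).append(p2)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for nxt in successors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    if end not in previous:
        return []
    # Walk back from the end to the start, then number the steps from 0.
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = previous[node]
    path.reverse()
    return [(start, end, via, step, 0) for step, via in enumerate(path)]


for row in path_rows(KNOWS, 1, 4):
    print(row)</code>
 </pre></blockquote>

<p>Only one path exists in the sample data, so every row carries path number 0.</p>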

<p>For LinkedIn users, the query that lists friends ordered by distance and descending friend count, which is the basis of most LinkedIn search result views, can be written as: </p>

<blockquote>
 <pre><code>SELECT &quot;p2&quot;, 
      &quot;dist&quot;, 
      (SELECT 
          COUNT (*) 
          FROM &quot;knows&quot; &quot;c&quot; 
          WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot;
      ) 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;dist&quot;
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
   ORDER BY &quot;dist&quot;, 3 DESC;</code>
 </pre>


<table width="150">
<tr>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">dist</th>
    <th align="center" width="50">aggregate</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">1</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>


<h2>How?</h2>

<p>The queries shown above work on Virtuoso v6.  When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant.  By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.</p>

<p>Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection.  Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.</p>
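
<p>The arithmetic and the meet-in-the-middle idea can be sketched in a few lines of Python.  This is an illustration under simplifying assumptions (a uniform fan-out, a symmetric relation held in an in-memory dictionary, hypothetical function names), not the engine&#39;s actual algorithm.</p>

<blockquote>
 <pre><code>def nodes_reached(fanout, depth):
    """Nodes at the outermost level of an expansion with a uniform fan-out."""
    return fanout ** depth


one_ended = nodes_reached(30, 6)
two_ended = 2 * nodes_reached(30, 3)
print(one_ended, two_ended, one_ended // two_ended)


def meet_in_the_middle(neighbors, a, b, max_levels):
    """Expand level by level from both ends and stop at the first intersection.

    Returns the length of the path found, or None if the two expansions
    do not meet within max_levels levels per side.
    """
    if a == b:
        return 0
    seen_a, seen_b = {a: 0}, {b: 0}
    frontier_a, frontier_b = [a], [b]
    for _ in range(max_levels):
        sides = ((frontier_a, seen_a, seen_b), (frontier_b, seen_b, seen_a))
        for frontier, seen, other in sides:
            next_frontier = []
            for node in frontier:
                for nxt in neighbors.get(node, ()):
                    if nxt in other:
                        return seen[node] + 1 + other[nxt]
                    if nxt not in seen:
                        seen[nxt] = seen[node] + 1
                        next_frontier.append(nxt)
            frontier[:] = next_frontier
    return None


# A chain 1 - 2 - ... - 7 in which everybody knows both neighbours.
chain = {n: [m for m in (n - 1, n + 1) if m in range(1, 8)] for n in range(1, 8)}
print(meet_in_the_middle(chain, 1, 7, 3))</code>
 </pre></blockquote>

<p>With a fan-out of 30, the two-ended expansion touches 13,500 times fewer nodes at the outermost level than the one-ended expansion.</p>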

<p>Writing a generic database-driven graph traversal framework on the application side, say in Java over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xa8a9ef8">JDBC</a>, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query.  Besides, the traversal order in such a case could not be optimized by the DBMS.</p>

<h2>Next</h2> 

<p>In a future <a href="http://dbpedia.org/resource/Blog" id="link-id0xb526a40">blog</a> post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc.  There are lots of switches for controlling different parameters of the traversal.  This is just the beginning.  I will also give examples of the use of this in SPARQL.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-09-08#1434">
  <rss:title>Requirements for Relational-to-RDF Mapping</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:40:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Many of you will know about the W3C relational-to-RDF mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping. To this effect, I recently summarized the group discussions and some of our own experiences around the topic at &lt;http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling&gt;. I will here discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis-à-vis the following points, even if any working group will unavoidably encounter these issues on the way. A blog post on the other hand can be more specific. I gave a talk to the RDB2RDF XG this spring, with these slides. The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided. At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. There is almost no comparison between the complexity of doing non-trivial mappings on-the-fly versus mapping as ETL. Some of this complexity spills over into the requirements for a mapping language. Eliminating JOINs We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are JOINing using such triples, we end up with UNIONs of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario. 
To make the point clearer, suppose a query like &quot;list the organizations whose representatives have published about xx.&quot; Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the laboratories that have published with articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author&#39;s affiliation in the same database. However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another: A paper in database X will usually not have an author in database Y. The IDs in database Y, even if perchance equal to the IDs in X, do not mean the same thing, and there is no point joining across databases by them. This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint. This is critical. Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted SQL over the same data sources. Expectations and Limitations on Queries SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are literals in the query. 
Virtuoso has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso&#39;s SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third-party SQL, no such tricks can be used. The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run-time cost and complexity. But we must not lose the baby with the bath-water. Aside from offering global identifiers, RDF&#39;s attractions include subclasses and sub-predicates. In relational terms, these translate to UNIONs and do involve some added cost. A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive. Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings. ETL Ou Ne Pas ETL? Whether to warehouse or not? If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Kashiup Vipul gave a position paper at last year&#39;s RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets. The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples. Our take is that if something is a large or very large relational store, then map; else, ETL. 
With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations. Conclusions If you map on demand, watch out for an explosion of UNIONs when integrating sources that talk of similar things. If you integrate lots of sources, some ETL is likely unavoidable. Look for ways of dealing with part ETL, part mapping. ETLing everything is not always best or even possible. If you map a single fairly-clean RDB to RDF, mapping will work well, potentially much faster than triple storage. Higher storage density and more data per index lookup on the relational side. If you map on demand, some restrictions to SPARQL may be practically necessary. These have to do with variables in predicate position, variables in class position, etc. Individual implementations may support these, but standardization will likely have to put limits on them. This was a quick summary, by no means comprehensive, on what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Many of you will know about the W3C relational-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1d61a948">RDF</a> mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping.</p>

<p>To this effect, I recently summarized the group discussions and some of our own experiences around the topic at &lt;<a href="http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling" id="link-id146030e8">http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling</a>&gt;.</p>

<p>I will here discuss this less formally and more in the light of our own experience.  A working group goal statement must be neutral vis-à-vis the following points, even if any working group will unavoidably encounter these issues on the way.  A <a href="http://dbpedia.org/resource/Blog" id="link-id0x21334340">blog</a> post on the other hand can be more specific.</p>

<p>I gave a talk to the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1ef85f58">RDB2RDF XG</a> this spring, with these <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt" id="link-id14572540">slides</a>.</p>

<p>The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided.</p>

<p>At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. In practice, there is almost no comparison between the complexity of doing non-trivial mappings on-the-fly and that of mapping as ETL. Some of this complexity spills over into the requirements for a mapping language. </p>

<h2>Eliminating JOINs</h2> 

<p>We expect to have a situation where one virtual triple can have many possible sources.  The mapping is a union of mapped databases.  Any integration scenario will have this feature. In such a situation, if we are <code>JOIN</code>ing using such triples, we end up with <code>UNION</code>s of all databases that could produce the triples in question.   This is generally not desired.  Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario.</p>

<p>To make the point clearer, suppose a query like &quot;list the organizations whose representatives have published about <i>xx</i>.&quot;  Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the organizations that have published articles with <a href="http://dbpedia.org/resource/Tag" id="link-id0x1d8270d0">tag</a> <i>xx</i>.  It is a matter of common sense in this scenario that a publication will have the author and the author&#39;s affiliation in the same database.  However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table.  To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another:  A paper in database <i>X</i> will usually not have an author in database <i>Y</i>.  The IDs in database <i>Y</i>, even if perchance equal to the IDs in <i>X</i>, do not mean the same thing, and there is no point joining across databases by them.</p>

<p>This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping.  If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint.</p>

<p>This is critical.  Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted <a href="http://dbpedia.org/resource/SQL" id="link-id0x1d8b0b40">SQL</a> over the same <a href="http://dbpedia.org/resource/Data" id="link-id0x1ee63530">data</a> sources.</p>
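
<p>A rough Python sketch of the kind of inference meant here follows.  It assumes, purely for illustration, that each mapping builds IRIs from a pattern with a single <code>{}</code> placeholder for the key; the pattern syntax, the example hosts, and the function names are hypothetical.  The check is conservative: it answers true only when the fixed text of the two patterns rules out a common IRI.</p>

<blockquote>
 <pre><code>from itertools import product


def fixed_parts(pattern):
    """Split a pattern with one "{}" key placeholder into fixed prefix and suffix."""
    prefix, _, suffix = pattern.partition("{}")
    return prefix, suffix


def provably_disjoint(pattern_x, pattern_y):
    """True only when no key values can make the two patterns yield the same IRI."""
    prefix_x, suffix_x = fixed_parts(pattern_x)
    prefix_y, suffix_y = fixed_parts(pattern_y)
    prefixes_clash = not (prefix_x.startswith(prefix_y) or prefix_y.startswith(prefix_x))
    suffixes_clash = not (suffix_x.endswith(suffix_y) or suffix_y.endswith(suffix_x))
    return prefixes_clash or suffixes_clash


print(provably_disjoint("http://x.example/paper/{}", "http://y.example/paper/{}"))
print(provably_disjoint("http://x.example/paper/{}", "http://x.example/{}"))

# Four joined triple patterns (organization, person, publication, tag),
# each of which could come from any of three mapped databases.
sources = ["X", "Y", "Z"]
all_combinations = list(product(sources, repeat=4))
same_source = [c for c in all_combinations if len(set(c)) == 1]
print(len(all_combinations), len(same_source))</code>
 </pre></blockquote>

<p>Without the disjointness inference, the join expands into 81 combinations of sources; with it, only the 3 single-database combinations remain.</p>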

<h2>Expectations and Limitations on Queries</h2>

<p>
<a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1dcfd7b0">SPARQL</a> queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are constants, not variables, in the query.</p>

<p>
<a href="http://virtuoso.openlinksw.com" id="link-id0x2006b798">Virtuoso</a> has some SQL extensions for breaking a wide table into one row per column.  This facilitates dealing with predicates that are not known at query compile time.  If the table in question is not managed by Virtuoso, Virtuoso&#39;s SQL virtualization/federation takes care of the matter.  If a mapping system goes directly to third-party SQL, no such tricks can be used.</p>

<p>The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined.  For example, one will probably have to require that all predicates be literals.  The alternative is prohibitive run-time cost and complexity.</p>

<p>But we must not throw the baby out with the bathwater. Aside from offering global identifiers, RDF&#39;s attractions include subclasses and sub-predicates.  In relational terms, these translate to <code>UNION</code>s and do involve some added cost.  A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive.  Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings.</p>
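
<p>To make the cost of subclasses concrete, here is a minimal Python sketch that expands a class into one <code>SELECT</code> per subclass table and glues the branches together with <code>UNION</code>.  The class hierarchy, the table names, and the <code>id</code> column are invented for the example; no particular mapping language is implied.</p>

```python
# Hypothetical class hierarchy and class-to-table mapping.
SUBCLASSES = {
    "Organization": ["Laboratory", "Company"],
    "Laboratory": [],
    "Company": [],
}
TABLE_FOR_CLASS = {
    "Organization": "organization",
    "Laboratory": "laboratory",
    "Company": "company",
}


def class_closure(cls):
    """The class itself followed by all its direct and indirect subclasses."""
    result = [cls]
    for sub in SUBCLASSES.get(cls, []):
        result.extend(class_closure(sub))
    return result


def select_instances(cls):
    """One SELECT branch per class in the closure, combined with UNION."""
    branches = ["SELECT id FROM " + TABLE_FOR_CLASS[c] for c in class_closure(cls)]
    return " UNION ".join(branches)


sql = select_instances("Organization")
print(sql)
```

<p>The number of branches grows with the size of the class closure, and with several such patterns in one query the branches multiply, which is the cost a mapping system has to recognize and bound.</p>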

<h2>ETL Ou Ne Pas ETL?</h2>

<p>Whether to warehouse or not?  If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Vipul Kashyap presented a position paper at last year&#39;s RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets.  The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples.</p>

<p>Our take is that if something is a large or very large relational store, then map; else, ETL.  With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations.</p>

<h2>Conclusions</h2> 

<ul>
<li>If you map on demand, watch out for an explosion of <code>UNION</code>s when integrating sources that talk of similar things.</li>
<li>If you integrate lots of sources, some ETL is likely unavoidable.  Look for ways of dealing with part ETL, part mapping.  ETLing everything is not always best or even possible.</li>
<li>If you map a single fairly-clean RDB to RDF, mapping will work well and will potentially be much faster than triple storage, thanks to higher storage density and more data per index lookup on the relational side.</li>
<li>If you map on demand, some restrictions to SPARQL may be practically necessary.  These have to do with variables in predicate position, variables in class position, etc.  Individual implementations may support these, but standardization will likely have to put limits on them.</li>
</ul>

<p>This was a quick summary, by no means comprehensive, on what an eventual RDB2RDF working group would come across.  This is a sort of addendum to the requirements I outlined on the ESW wiki.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-09-08#1433">
  <rss:title>Transitivity and Graphs for SQL</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:20:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Background I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language. The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead. It is now time to apply this principle to graph traversal. The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers. The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step. This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc. Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying. The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows definition of recursive views. The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children. 
The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL&#39;s non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach. Take a question like &quot;list the parts of products of category C which have materials that are classified as toxic.&quot; Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure. Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth: regular cost-based optimization. Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF. In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. 
Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation. SQL and Transitivity We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., SELECT in another SELECT&#39;s FROM clause, with a TRANSITIVE clause. Consider the data: CREATE TABLE &quot;knows&quot; (&quot;p1&quot; INT, &quot;p2&quot; INT, PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;) ); ALTER INDEX &quot;knows&quot; ON &quot;knows&quot; PARTITION (&quot;p1&quot; INT); CREATE INDEX &quot;knows2&quot; ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) PARTITION (&quot;p2&quot; INT); We represent a social network with the many-to-many relation &quot;knows&quot;. The persons are identified by integers. INSERT INTO &quot;knows&quot; VALUES (1, 2); INSERT INTO &quot;knows&quot; VALUES (1, 3); INSERT INTO &quot;knows&quot; VALUES (2, 4); SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p1&quot; = 1; We obtain the result: p1 p2 1 3 1 2 1 4 The operation is reversible: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1. 
To see what would happen if x knowing y also meant y knowing x, one could write: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM (SELECT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; UNION ALL SELECT &quot;p2&quot;, &quot;p1&quot; FROM &quot;knows&quot; ) &quot;k2&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 3 4 Now, since we know that 1 and 4 are related, we can ask how they are related. SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot;, T_STEP (1) AS &quot;via&quot;, T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, T_STEP (&#39;path_id&#39;) AS &quot;path&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 AND &quot;p2&quot; = 4; p1 p2 via step path 1 4 1 0 0 1 4 2 1 0 1 4 4 2 0 The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other. For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as: SELECT p2, dist, (SELECT COUNT (*) FROM &quot;knows&quot; &quot;c&quot; WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot; ) FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct &quot;p1&quot;, &quot;p2&quot;, t_step (&#39;step_no&#39;) AS &quot;dist&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 ORDER BY &quot;dist&quot;, 3 DESC; p2 dist aggregate 2 1 1 3 1 0 4 2 0 How? The queries shown above work on Virtuoso v6. 
When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets. Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes. Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS. Next In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Background</h2> 

<p>I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the <a href="http://dbpedia.org/resource/SQL" id="link-id0xb1fe830">SQL</a> query language.</p>

<p>The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example.  It is all about extracting the common features of applications and making these the features of a platform instead.</p>

<p>It is now time to apply this principle to graph traversal.</p>

<p>The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency-tolerant manner. Writing them as one would for memory-based <a href="http://dbpedia.org/resource/Data" id="link-id0x1cb37218">data</a> structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.</p>

<p>The ad-hoc nature and very large volume of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e1850a0">RDF</a> data make this a timely question.  Up until now, the answer to this question has been to materialize any implied facts in RDF stores.  If <i>a</i> was part of <i>b</i>, and <i>b</i> part of <i>c</i>, the implied fact that <i>a</i> is part of <i>c</i> would be inserted explicitly into the database as a pre-query step.</p>

<p>This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query.  The activity becomes less ad-hoc.</p>

<p>Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be intermittently included in or excluded from the set being analyzed.  This is why with <a href="http://virtuoso.openlinksw.com" id="link-id0xa51bd10">Virtuoso</a> we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying.</p>

<p>The SQL world has taken steps towards dealing with recursion with the <code>WITH - UNION</code> construct which allows definition of recursive views.  The idea there is to define, for example, a tree walk as a <code>UNION</code> of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children.</p>

<p>The main problem with this is that I do not see very well how a SQL optimizer could effectively rearrange queries involving <code>JOIN</code>s between such recursive views.  This model of recursion seems to lose SQL&#39;s non-procedural nature.  One can no longer easily rearrange <code>JOIN</code>s based on what data is given and what is to be retrieved.  If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root.  At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.</p>
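
<p>For readers who have not used recursive views, the sketch below shows the general shape of such a query.  It runs a standard <code>WITH RECURSIVE</code> statement through Python&#39;s built-in <code>sqlite3</code> module, purely as a stand-in that can be tried anywhere; it is not Virtuoso syntax, and the <code>part_of</code> table and its rows are made up for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE part_of (child TEXT, parent TEXT);
    INSERT INTO part_of VALUES ('a', 'b');
    INSERT INTO part_of VALUES ('b', 'c');
    INSERT INTO part_of VALUES ('d', 'c');
    """
)

# Walk from the root 'c' down to every direct and indirect part:
# the seed row set is the immediate children, the recursive branch
# adds the children of whatever has been found so far.
rows = conn.execute(
    """
    WITH RECURSIVE parts(name) AS (
        SELECT child FROM part_of WHERE parent = 'c'
        UNION
        SELECT part_of.child
          FROM part_of JOIN parts ON part_of.parent = parts.name
    )
    SELECT name FROM parts ORDER BY name
    """
).fetchall()
all_parts = [name for (name,) in rows]
print(all_parts)
```

<p>Note that the direction of the walk is fixed in the text of the view: the seed is the root and the recursive branch goes from parent to child.  Asking the reverse question, &quot;what is <i>a</i> ultimately a part of?&quot;, needs a differently written view, which illustrates the loss of declarativeness discussed above.</p>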

<p>Take a question like &quot;list the parts of products of category <i>C</i> which have materials that are classified as toxic.&quot;  Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure.</p>

<p>Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category.  One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth: regular cost-based optimization.</p>

<p>Especially with RDF, there are many problems of this type.  In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.</p>

<p>In Virtuoso, we see <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb4b3ce8">SPARQL</a> as reducing to SQL.  Any RDF-oriented database-engine or query-optimization feature is accessed via SQL.  Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, <i>ipso facto</i>, an SQL feature.  Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.</p>

<h2>SQL and Transitivity</h2>

<p>Here we will look at some simple social network queries.  A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., <code>SELECT</code> in another <code>SELECT</code>&#39;s <code>FROM</code> clause, with a <code>TRANSITIVE</code> clause.</p>

<p>Consider the data:</p>

<blockquote>
 <pre><code>CREATE TABLE &quot;knows&quot; 
   (&quot;p1&quot; INT, 
    &quot;p2&quot; INT, 
    PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;)
   );
ALTER INDEX &quot;knows&quot; 
   ON &quot;knows&quot; 
   PARTITION (&quot;p1&quot; INT);
CREATE INDEX &quot;knows2&quot; 
   ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) 
   PARTITION (&quot;p2&quot; INT);
</code>
 </pre></blockquote>

<p>We represent a social network with the many-to-many relation &quot;knows&quot;.  The persons are identified by integers.</p>

<blockquote>
 <pre><code>INSERT INTO &quot;knows&quot; VALUES (1, 2);
INSERT INTO &quot;knows&quot; VALUES (1, 3);
INSERT INTO &quot;knows&quot; VALUES (2, 4);</code>
 </pre>

<pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p1&quot; = 1;</code></pre></blockquote>

<p>We obtain the result:</p>

<blockquote>
<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">3</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">2</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>The operation is reversible:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;
</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>Since we now give <i>p2</i>, we traverse from <i>p2</i> towards <i>p1</i>. The result set states that 4 is known by 2 directly and, since 2 is known by 1, transitively by 1.</p>

<p>To see what would happen if <i>x</i> knowing <i>y</i> also meant <i>y</i> knowing <i>x</i>, one could write:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM (SELECT 
                  &quot;p1&quot;, 
                  &quot;p2&quot; 
               FROM &quot;knows&quot; 
               UNION ALL 
                  SELECT 
                     &quot;p2&quot;, 
                     &quot;p1&quot; 
                  FROM &quot;knows&quot;
              ) &quot;k2&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>
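
<p>The three result sets above can be cross-checked without a database.  The following Python sketch emulates the traversal with a plain breadth-first walk over the same three rows.  It is only a model of the semantics as described here: in particular, leaving the starting person out of the result is an assumption made to match the tables, and the function name is invented.</p>

```python
from collections import deque

KNOWS = [(1, 2), (1, 3), (2, 4)]  # the rows inserted into "knows" above


def reachable(start, edges, reverse=False):
    """Breadth-first list of persons reachable from start.

    With reverse=True the edges are followed from p2 back to p1,
    which corresponds to giving p2 instead of p1 in the WHERE clause.
    """
    neighbours = {}
    for p1, p2 in edges:
        src, dst = (p2, p1) if reverse else (p1, p2)
        neighbours.setdefault(src, []).append(dst)
    seen = []
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for nxt in neighbours.get(person, []):
            if nxt != start and nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


# x knows y also means y knows x: add every edge in both directions.
symmetric = KNOWS + [(p2, p1) for p1, p2 in KNOWS]

known_by_1 = reachable(1, KNOWS)                          # p1 = 1
knowers_of_4 = reachable(4, KNOWS, reverse=True)          # p2 = 4
symmetric_knowers_of_4 = reachable(4, symmetric, reverse=True)
print(known_by_1, knowers_of_4, symmetric_knowers_of_4)
```

<p>The sets of persons agree with the three tables; the row order of an actual query result is up to the engine and is not modeled here.</p>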


<p>Now, since we know that 1 and 4 are related, we can ask how they are related.</p>
<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (1) AS &quot;via&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, 
            T_STEP (&#39;path_id&#39;) AS &quot;path&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
      AND &quot;p2&quot; = 4;</code>
 </pre>

<table width="250">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">via</th>
    <th align="center" width="50">step</th>
    <th align="center" width="50">path</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">1</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>


<p>The first two columns are the ends of the path.  The next column is the person that is a step on the path.  The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., <i>p1</i>, has number 0.  Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.</p>

<p>For LinkedIn users: the query listing friends ordered by distance and then by descending friend count, which is the basis of most LinkedIn search result views, can be written as:</p>

<blockquote>
 <pre><code>SELECT &quot;p2&quot;, 
      &quot;dist&quot;, 
      (SELECT 
          COUNT (*) 
          FROM &quot;knows&quot; &quot;c&quot; 
          WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot;
      ) 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;dist&quot;
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
   ORDER BY &quot;dist&quot;, 3 DESC;</code>
 </pre>


<table width="150">
<tr>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">dist</th>
    <th align="center" width="50">aggregate</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">1</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>
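
<p>As a sanity check of the result above, the same ranking can be computed in a few lines of Python over the three sample rows.  This is a sketch of what the query asks for, not of how Virtuoso evaluates it; the function name is made up.</p>

```python
from collections import deque

KNOWS = [(1, 2), (1, 3), (2, 4)]  # the rows inserted into "knows" above


def friends_by_distance(start, edges):
    """Rows of (person, distance, friend count), ordered by distance
    and then by descending friend count."""
    out = {}
    for p1, p2 in edges:
        out.setdefault(p1, []).append(p2)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for nxt in out.get(person, []):
            if nxt not in dist:
                dist[nxt] = dist[person] + 1
                queue.append(nxt)
    rows = [(p, d, len(out.get(p, []))) for p, d in dist.items() if p != start]
    # ORDER BY dist, friend count DESC
    return sorted(rows, key=lambda row: (row[1], -row[2]))


ranking = friends_by_distance(1, KNOWS)
for row in ranking:
    print(row)
```

<p>The friend count is the number of rows in which the person appears as <code>p1</code>, exactly as in the scalar subquery of the SQL version.</p>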


<h2>How?</h2>

<p>The queries shown above work on Virtuoso v6.  When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant.  By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.</p>

<p>Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection.  Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.</p>
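
<p>The arithmetic and the meet-in-the-middle idea can be sketched as follows.  This is a simplified, single-threaded Python illustration under stated assumptions: the graph is undirected and held in memory as an adjacency dictionary, the two sides are expanded alternately one level at a time, and the helper names and the chain-shaped test graph are invented.  It says nothing about how the Virtuoso optimizer actually schedules the two traversals.</p>

```python
def frontier_size(fanout, depth):
    """Nodes on the outermost level of a walk with a fixed fan-out."""
    return fanout ** depth


one_sided = frontier_size(30, 6)      # 30^6
two_sided = 2 * frontier_size(30, 3)  # 2 * 30^3


def expand(frontier, seen, neighbours):
    """Advance one level; record depths in seen and return the new frontier."""
    nxt = []
    for node in frontier:
        for peer in neighbours.get(node, ()):
            if peer not in seen:
                seen[peer] = seen[node] + 1
                nxt.append(peer)
    return nxt


def meet_in_the_middle(neighbours, source, target, levels=3):
    """Expand both ends up to `levels` levels each and return the path
    length at the first intersection, or None if the ends do not meet."""
    if source == target:
        return 0
    seen_a, seen_b = {source: 0}, {target: 0}
    front_a, front_b = [source], [target]
    for _ in range(levels):
        front_a = expand(front_a, seen_a, neighbours)
        hits = [seen_a[n] + seen_b[n] for n in front_a if n in seen_b]
        if hits:
            return min(hits)
        front_b = expand(front_b, seen_b, neighbours)
        hits = [seen_a[n] + seen_b[n] for n in front_b if n in seen_a]
        if hits:
            return min(hits)
    return None


def chain(length):
    """Undirected chain 0 - 1 - ... - length, a stand-in for a real network."""
    neighbours = {}
    for node in range(length):
        neighbours.setdefault(node, []).append(node + 1)
        neighbours.setdefault(node + 1, []).append(node)
    return neighbours


print(one_sided, two_sided)
print(meet_in_the_middle(chain(6), 0, 6))
```

<p>With three levels per side, two users six steps apart are connected, while users seven steps apart are not found, which mirrors the &quot;no more than 6 steps apart&quot; bound in the text.</p>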

<p>Writing a generic database-driven graph traversal framework on the application side, say in Java over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xb595050">JDBC</a>, would easily take over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query.  Besides, the traversal order in such a case could not be optimized by the DBMS.</p>

<h2>Next</h2> 

<p>In a future <a href="http://dbpedia.org/resource/Blog" id="link-id0x1e4d4f18">blog</a> post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc.  There are lots of switches for controlling different parameters of the traversal.  This is just the beginning.  I will also give examples of the use of this in SPARQL.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-05#1432">
  <rss:title>Epistemology of the Sponger, or How Virtuoso Drives a Web Query</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-05T09:20:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Epistemology of the Sponger, or How Virtuoso Drives a Web Query Virtuoso has an extensive collection of RDF-izers called Sponger Cartridges. These take a web resource in one of 30+ formats (so far) and extract RDF from it. The Virtuoso Sponger is a device which evaluates a query and along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query, until either nothing new is found or some limit is reached. We could call this query-driven crawling. The idea is intuitive: what one looks for determines what one finds. This does however raise certain questions pertaining to the nature and ultimate possibility of knowledge, i.e., epistemology. The process of querying could be said to go from the few to the many, just like the process of harvesting data from the web, the way any search engine does. One follows links or makes joins and thereby increases one&#39;s reach. The difference is that a query has no a priori direction. If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all. Closed world, as it is said. Never mind that the friends would have had a &quot;see also&quot; link to a retrievable document that did have a phone number. The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution. What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way. Where query and crawl appeared to have a similarity, in fact they have two opposite goals. The user generally has no idea of the execution plan. In the general case, the user cannot have an idea of this plan. There are valid, over 40 year old reasons for leaving the query planning to the database. 
In exceptional situations the user can read or direct these, but this is really quite tedious and requires understanding that is basically never present. So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything? This is certainly a desirable goal, and all in the open world, distributed spirit of the web. Let us limit ourselves to queries that have some literals in the object or subject positions. A SPARQL query is basically a graph. Its vertices are variables and literals, and its edges are triple patterns. An edge is labeled by a predicate. For now, we will consider the predicate to always be a literal. From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal. Each tree is not always a spanning tree of the graph, but all the trees collectively span the graph. Consider the query { &lt;john&gt; knows ?x . &lt;mary&gt; knows ?x . ?x label ?l }. The starting points are the literals john and mary. The john tree has one child, ?x, which has the children mary and ?l. One could notate it as { &lt;john&gt; knows ?x . {{ &lt;mary&gt; knows ?x} UNION {?x label ?l}}} That is, the head first, and if it has more than one child, a union listing them, recursively. If one composed such queries for each literal in the original pattern and evaluated each as a breadth first walk of the tree, no query optimization tricks, and for each binding of each variable, recorded whether there was something to dereference, one would in a finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred. The check for dereferenceable data applied to each IRI-valued binding formed in the above evaluation, would consist of looking for &quot;see also&quot;, &quot;same as&quot;, and other such properties of the IRI. It could also consult text based search engines. 
Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources. We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough. We have here shown a way of transforming SPARQL queries in such a way as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans. The present Sponger does not work exactly in this manner but it will be developed in this direction. Fortunately, the algorithms outlined above are nothing complicated.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Epistemology of the Sponger, or How Virtuoso Drives a Web Query</div>
<p>
  <a href="http://virtuoso.openlinksw.com" id="link-id0x1ed6cf28">Virtuoso</a> has an extensive collection of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f8d1f78">RDF</a>-izers called Sponger Cartridges.  These take a web resource in one of 30+ formats (so far) and extract RDF from it.  The Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1edc90e8">Sponger</a> is a device which evaluates a query and, along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query until either nothing new is found or some limit is reached.</p>

<p>We could call this <i>query-driven crawling</i>.  The idea is intuitive: what one looks for determines what one finds.</p>

<p>This does however raise certain questions pertaining to the nature and ultimate possibility of <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1f836b68">knowledge</a>, i.e., epistemology.</p>

<p>The process of querying could be said to go from the few to the many, just like the process of harvesting <a href="http://dbpedia.org/resource/Data" id="link-id0x1edb1648">data</a> from the web, the way any search engine does.  One follows links or makes joins and thereby increases one&#39;s reach.</p>

<p>The difference is that a query has no <i>a priori</i> direction.  If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all.  <a href="http://dbpedia.org/resource/Closed_world_assumption" id="link-id0x1edf1f30">Closed world</a>, as it is said. Never mind that the friends would have had a &quot;see also&quot; link to a retrievable document that did have a phone number.</p>

<p>The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution.  What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way.  Where query and crawl appeared to have a similarity, in fact they have two opposite goals.</p>

<p>The user generally has no idea of the execution plan.  In the general case, the user <i>cannot</i> have an idea of this plan.  There are valid reasons, over 40 years old, for leaving query planning to the database.  In exceptional situations the user can read or direct the plan, but this is quite tedious and requires understanding that is basically never present.</p>

<p>So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything?  This is certainly a desirable goal, and all in the <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id0x1eb46548">open world</a>, distributed spirit of the web.</p>

<p>Let us limit ourselves to queries that have some literals (here meaning constants, such as the IRIs in the example that follows) in the object or subject positions. A <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1ed293f8">SPARQL</a> query is basically a graph.  Its vertices are variables and literals, and its edges are triple patterns.  An edge is labeled by a predicate.  For now, we will consider the predicate to always be a literal.  From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal.  A single tree is not always a spanning tree of the graph, but together the trees span it.</p>

<p>Consider the query </p>
<blockquote>
<code>{ &lt;john&gt; knows ?x . &lt;mary&gt; knows ?x . ?x label ?l }.</code>
</blockquote>
<p>The starting points are the literals <code>john</code> and <code>mary</code>.  The <code>john</code> tree has one child, <code>?x</code>, which has the children <code>mary</code> and <code>?l</code>.  One could notate it as</p>
<blockquote>
<code>{ &lt;john&gt; knows ?x . {{ &lt;mary&gt; knows ?x} UNION {?x label ?l}}}</code>
</blockquote>
<p>That is, the head first, and if it has more than one child, a union listing them, recursively.</p>

<p>Suppose one composed such a query for each literal in the original pattern and evaluated each as a breadth first walk of its tree, with no query optimization tricks, recording for each binding of each variable whether there was something to dereference.  One would then, in finite time, have reached all the directly reachable data.  After that, one could evaluate the original query using whatever plan was preferred.</p>
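<p>The tree construction can be sketched in a few lines of Python.  This is only an illustration of the idea, not the Sponger's code: the tuple representation of triple patterns and the names <code>build_tree</code> and <code>notate</code> are assumptions made for this example, and the angle brackets around the IRIs are left out of the notation.</p>

```python
from collections import deque

# Triple patterns of the example query. Terms starting with "?" are
# variables; everything else is a constant (a "literal" in the text).
PATTERNS = [
    ("john", "knows", "?x"),
    ("mary", "knows", "?x"),
    ("?x", "label", "?l"),
]


def is_var(term):
    return term.startswith("?")


def build_tree(root, patterns):
    """Walk the query graph breadth first from a constant.

    Each triple pattern is used as an edge at most once, and the walk
    does not descend past a constant other than the root.
    """
    tree = {"term": root, "pattern": None, "children": []}
    used = set()
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if node is not tree and not is_var(node["term"]):
            continue  # reached another constant: stop descending here
        for i, (s, p, o) in enumerate(patterns):
            if i in used or node["term"] not in (s, o):
                continue
            used.add(i)
            other = o if s == node["term"] else s
            child = {"term": other, "pattern": (s, p, o), "children": []}
            node["children"].append(child)
            queue.append(child)
    return tree


def notate(node):
    """Head pattern first; several children become a UNION, recursively."""
    groups = []
    for child in node["children"]:
        text = " ".join(child["pattern"])
        rest = notate(child)
        groups.append(text + (" . " + rest if rest else ""))
    if not groups:
        return ""
    if len(groups) == 1:
        return "{ " + groups[0] + " }"
    return "{ " + " UNION ".join("{ " + g + " }" for g in groups) + " }"


if __name__ == "__main__":
    for constant in ("john", "mary"):
        print(notate(build_tree(constant, PATTERNS)))
```

<p>For <code>john</code> this yields the same nesting as the notation given earlier; the <code>mary</code> tree is the mirror image, with <code>john</code> as a leaf.</p>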

<p>The check for dereferenceable data, applied to each IRI-valued binding formed in the above evaluation, would consist of looking for &quot;see also&quot;, &quot;same as&quot;, and other such properties of the IRI.  It could also consult text-based search engines.  Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources.  We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough.</p>
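<p>As a rough sketch of that check, assuming the already loaded data is available as a plain list of triples, the following Python collects the link targets worth fetching.  The function name, the in-memory representation, and the <code>example.org</code> data are all made up for illustration; the two predicates are the standard <code>rdfs:seeAlso</code> and <code>owl:sameAs</code> IRIs.</p>

```python
SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
LINK_PREDICATES = {SEE_ALSO, SAME_AS}

# Hypothetical sample data, standing in for what the store already holds.
EXAMPLE_TRIPLES = [
    ("http://example.org/alice", SEE_ALSO, "http://example.org/alice.rdf"),
    ("http://example.org/alice", "http://example.org/knows", "http://example.org/bob"),
    ("http://example.org/bob", SAME_AS, "http://example.net/people/bob"),
]


def links_to_dereference(bindings, triples, seen):
    """Return link targets of the bound IRIs that were not fetched before.

    `bindings` is the set of IRIs bound during the tree walk, and `seen`
    is updated in place so that repeated rounds do not refetch anything.
    """
    todo = []
    for s, p, o in triples:
        if s in bindings and p in LINK_PREDICATES and o not in seen:
            seen.add(o)
            todo.append(o)
    return todo


if __name__ == "__main__":
    bound = {"http://example.org/alice", "http://example.org/bob"}
    print(links_to_dereference(bound, EXAMPLE_TRIPLES, set()))
```

<p>Each returned IRI would become an independent fetch task, which is where the parallelism mentioned above comes from.</p>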
 
<p>We have here shown a way of transforming SPARQL queries so as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans.</p>

<p>The present Sponger does not work exactly in this manner but it will be developed in this direction.  Fortunately, the algorithms outlined above are nothing complicated.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-09-05#1431">
  <rss:title>Epistemology of the Sponger, or How Virtuoso Drives a Web Query</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-05T09:16:20Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso has an extensive collection of RDF-izers called Sponger Cartridges. These take a web resource in one of 30+ formats (so far) and extract RDF from it. The Virtuoso Sponger is a device which evaluates a query and along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query, until either nothing new is found or some limit is reached. We could call this query-driven crawling. The idea is intuitive: what one looks for determines what one finds. This does however raise certain questions pertaining to the nature and ultimate possibility of knowledge, i.e., epistemology. The process of querying could be said to go from the few to the many, just like the process of harvesting data from the web, the way any search engine does. One follows links or makes joins and thereby increases one&#39;s reach. The difference is that a query has no a priori direction. If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all. Closed world, as it is said. Never mind that the friends would have had a &quot;see also&quot; link to a retrievable document that did have a phone number. The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution. What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way. Where query and crawl appeared to have a similarity, in fact they have two opposite goals. The user generally has no idea of the execution plan. In the general case, the user cannot have an idea of this plan. There are valid, over 40 year old reasons for leaving the query planning to the database. 
In exceptional situations the user can read or direct these, but this is really quite tedious and requires understanding that is basically never present. So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything? This is certainly a desirable goal, and all in the open world, distributed spirit of the web. Let us limit ourselves to queries that have some literals in the object or subject positions. A SPARQL query is basically a graph. Its vertices are variables and literals, and its edges are triple patterns. An edge is labeled by a predicate. For now, we will consider the predicate to always be a literal. From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal. Each tree is not always a spanning tree of the graph, but all the trees collectively span the graph. Consider the query { &lt;john&gt; knows ?x . &lt;mary&gt; knows ?x . ?x label ?l }. The starting points are the literals john and mary. The john tree has one child, ?x, which has the children mary and ?l. One could notate it as { &lt;john&gt; knows ?x . {{ &lt;mary&gt; knows ?x} UNION {?x label ?l}}} That is, the head first, and if it has more than one child, a union listing them, recursively. If one composed such queries for each literal in the original pattern and evaluated each as a breadth first walk of the tree, no query optimization tricks, and for each binding of each variable, recorded whether there was something to dereference, one would in a finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred. The check for dereferenceable data applied to each IRI-valued binding formed in the above evaluation, would consist of looking for &quot;see also&quot;, &quot;same as&quot;, and other such properties of the IRI. It could also consult text based search engines. 
Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources. We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough. We have here shown a way of transforming SPARQL queries in such a way as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans. The present Sponger does not work exactly in this manner but it will be developed in this direction. Fortunately, the algorithms outlined above are nothing complicated.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://virtuoso.openlinksw.com" id="link-id0x1fabb368">Virtuoso</a> has an extensive collection of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f8625b8">RDF</a>-izers called Sponger Cartridges.  These take a web resource in one of 30+ formats (so far) and extract RDF from it.  The Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1f841060">Sponger</a> is a device which evaluates a query and along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query, until either nothing new is found or some limit is reached.</p>

<p>We could call this <i>query-driven crawling</i>.  The idea is intuitive: what one looks for determines what one finds.</p>

<p>This does however raise certain questions pertaining to the nature and ultimate possibility of <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1f895b08">knowledge</a>, i.e., epistemology.</p>

<p>The process of querying could be said to go from the few to the many, just like the process of harvesting <a href="http://dbpedia.org/resource/Data" id="link-id0x1f879410">data</a> from the web, the way any search engine does.  One follows links or makes joins and thereby increases one&#39;s reach.</p>

<p>The difference is that a query has no <i>a priori</i> direction.  If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all.  <a href="http://dbpedia.org/resource/Closed_world_assumption" id="link-id0x1f8b0658">Closed world</a>, as it is said. Never mind that the friends would have had a &quot;see also&quot; link to a retrievable document that did have a phone number.</p>

<p>The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution.  What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way.  Where query and crawl appeared to have a similarity, in fact they have two opposite goals.</p>

<p>The user generally has no idea of the execution plan.  In the general case, the user <i>cannot</i> have an idea of this plan.  There are valid reasons, over 40 years old, for leaving query planning to the database.  In exceptional situations the user can read or direct the plan, but this is quite tedious and requires understanding that is basically never present.</p>

<p>So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything?  This is certainly a desirable goal, and all in the <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id0x1f845188">open world</a>, distributed spirit of the web.</p>

<p>Let us limit ourselves to queries that have some literals (here meaning constants, such as the IRIs in the example that follows) in the object or subject positions. A <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1f84f348">SPARQL</a> query is basically a graph.  Its vertices are variables and literals, and its edges are triple patterns.  An edge is labeled by a predicate.  For now, we will consider the predicate to always be a literal.  From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal.  A single tree is not always a spanning tree of the graph, but together the trees span it.</p>

<p>Consider the query </p>
<blockquote>
<code>{ &lt;john&gt; knows ?x . &lt;mary&gt; knows ?x . ?x label ?l }.</code>
</blockquote>
<p>The starting points are the literals <code>john</code> and <code>mary</code>.  The <code>john</code> tree has one child, <code>?x</code>, which has the children <code>mary</code> and <code>?l</code>.  One could notate it as</p>
<blockquote>
<code>{ &lt;john&gt; knows ?x . {{ &lt;mary&gt; knows ?x} UNION {?x label ?l}}}</code>
</blockquote>
<p>That is, the head first, and if it has more than one child, a union listing them, recursively.</p>

<p>Suppose one composed such a query for each literal in the original pattern and evaluated each as a breadth first walk of its tree, with no query optimization tricks, recording for each binding of each variable whether there was something to dereference.  One would then, in finite time, have reached all the directly reachable data.  After that, one could evaluate the original query using whatever plan was preferred.</p>

<p>The check for dereferenceable data, applied to each IRI-valued binding formed in the above evaluation, would consist of looking for &quot;see also&quot;, &quot;same as&quot;, and other such properties of the IRI.  It could also consult text-based search engines.  Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources.  We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough.</p>
 
<p>We have here shown a way of transforming SPARQL queries so as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans.</p>

<p>The present Sponger does not work exactly in this manner but it will be developed in this direction.  Fortunately, the algorithms outlined above are nothing complicated.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-08-27#1423">
  <rss:title>A quick look at SP2B, the SPARQL Performance Benchmark</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-27T16:03:40Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A quick look at SP2B, the SPARQL Performance Benchmark I finally got around to running the SP2B SPARQL Performance Benchmark on the current Virtuoso Open Source Edition, v5.0.8. I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers. I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut. I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound. The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit FROM clause added; the client was the command line Interactive SQL (iSQL). If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the SPARQL protocol is not practical. I will say something more about SP2B when I get to have a closer look.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">A quick look at SP2B, the SPARQL Performance Benchmark</div>
<p>I finally got around to running the <a href="http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B" id="link-id17bac628">SP<sup>2</sup>B SPARQL Performance Benchmark</a> on the current <a href="http://virtuoso.openlinksw.com" id="link-id0x1dcaaa48">Virtuoso</a> Open Source Edition, v5.0.8.</p>
<p>I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.</p>
<p>I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds.  This is better than the 800 or so seconds that the authors had measured.  Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.</p>
<p>I also tried it with a scale of 25M, but this became I/O bound and took a bit longer.  I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.</p>
<p>The machine was a 2GHz Xeon with 8G RAM.  The query text was the one from the authors, with an explicit <code>FROM</code> clause added; the client was the command line Interactive <a href="http://dbpedia.org/resource/SQL" id="link-id0x1be2c808">SQL</a> (iSQL).</p>
<p>If one does the test with the default index layout without specifying a graph, things will not work very well.  Also, returning the million-row results of these queries over the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1d7ac018">SPARQL protocol</a> is not practical.</p>
<p>I will say something more about SP<sup>2</sup>B when I get to have a closer look.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-08-27#1422">
  <rss:title>A quick look at SP2B, the SPARQL Performance Benchmark</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-27T16:00:07Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I finally got around to running the SP2B SPARQL Performance Benchmark on the current Virtuoso Open Source Edition, v5.0.8. I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers. I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut. I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound. The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit FROM clause added; the client was the command line Interactive SQL (iSQL). If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the SPARQL protocol is not practical. I will say something more about SP2B when I get to have a closer look.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I finally got around to running the <a href="http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B" id="link-id17bac628">SP<sup>2</sup>B SPARQL Performance Benchmark</a> on the current <a href="http://virtuoso.openlinksw.com" id="link-id0x1d2a6838">Virtuoso</a> Open Source Edition, v5.0.8.</p>
<p>I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.</p>
<p>I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds.  This is better than the 800 or so seconds that the authors had measured.  Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.</p>
<p>I also tried it with a scale of 25M, but this became I/O bound and took a bit longer.  I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.</p>
<p>The machine was a 2GHz Xeon with 8G RAM.  The query text was the one from the authors, with an explicit <code>FROM</code> clause added; the client was the command line Interactive <a href="http://dbpedia.org/resource/SQL" id="link-id0x19e74ce0">SQL</a> (iSQL).</p>
<p>If one does the test with the default index layout without specifying a graph, things will not work very well.  Also, returning the million-row results of these queries over the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1c4231a0">SPARQL protocol</a> is not practical.</p>
<p>I will say something more about SP<sup>2</sup>B when I get to have a closer look.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-08-25#1419">
  <rss:title>Configuring Virtuoso for Benchmarking</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-25T14:06:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Configuring Virtuoso for Benchmarking I will here summarize what should be known about running benchmarks with Virtuoso. Physical Memory For 8G RAM, in the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... NumberOfBuffers = 550000 For 16G RAM, double this: [Parameters] ... NumberOfBuffers = 1100000 Transaction Isolation For most cases, certainly all RDF cases, Read Committed should be the default transaction isolation. In the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... DefaultIsolation = 2 Multiuser Workload If ODBC, JDBC, or similarly connected client applications are used, there must be more ServerThreads available than there will be client connections. In the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... ServerThreads = 100 With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer ServerThreads than there are concurrent clients. The MaxKeepAlives should be the maximum number of expected web clients. This can be more than the ServerThreads count. In the [HTTPServer] stanza of virtuoso.ini, set: [HTTPServer] ... ServerThreads = 100 MaxKeepAlives = 1000 KeepAliveTimeout = 10 Note: The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, should not exceed the licensed thread count. Disk Use The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed). For the above described example, in the [Database] stanza of virtuoso.ini, set: [Database] ... 
Striping = 1 MaxCheckpointRemap = 2000000 Then, in the [Striping] stanza, on one line per SegmentName, set: [Striping] ... Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6 As can be seen here, each file gets a background IO thread (the = qxxx clause). It should be noted that all files on the same physical device should have the same qxxx value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue. SQL Optimization If queries have lots of joins but access little data, as with the Berlin SPARQL Benchmark, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. Thus, in the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... StopCompilerWhenXOverRunTime = 1</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Configuring Virtuoso for Benchmarking</div>
<p>I will here summarize what should be known about running benchmarks with <a href="http://virtuoso.openlinksw.com" id="link-id0xc152cf0">Virtuoso</a>.</p>

<h2>Physical Memory</h2>

<p>For 8G RAM, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 550000
</code>
</blockquote> 
<p>For 16G RAM, double this:</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 1100000
</code>
</blockquote> 
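<p>The two figures above imply a simple ratio.  The Python sketch below scales it to other memory sizes; that the setting keeps scaling linearly beyond the 8G and 16G cases is an assumption of this example, as is the function name.</p>

```python
# 550000 buffers for 8G RAM, as given above: 68750 buffers per gigabyte.
BUFFERS_PER_GB = 550000 // 8


def number_of_buffers(ram_gb):
    """Scale the 8G figure linearly (an assumption beyond 8G and 16G)."""
    return BUFFERS_PER_GB * ram_gb


if __name__ == "__main__":
    for ram_gb in (8, 16):
        print("NumberOfBuffers =", number_of_buffers(ram_gb))
```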

<h2>Transaction Isolation</h2>
<p>For most cases, certainly all <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xb7ba270">RDF</a> cases, <i>Read Committed</i> should be the default transaction isolation.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code>
[Parameters]<br />
...<br />
DefaultIsolation = 2 
</code>
</blockquote> 

<h2>Multiuser Workload</h2>

<p>If <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x1a40f308">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1e003cf8">JDBC</a>, or similarly connected client applications are used, there must be more <code>ServerThreads</code> available than there will be client connections.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code> 
[Parameters]<br />
...<br />
ServerThreads = 100
</code>
</blockquote> 

<p>With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer <code>ServerThreads</code> than there are concurrent clients.  The <code>MaxKeepAlives</code> should be the maximum number of expected web clients.  This can be more than the <code>ServerThreads</code> count.  In the <code>[HTTPServer]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code> 
[HTTPServer]<br />
...<br />
ServerThreads    = 100 <br />
MaxKeepAlives    = 1000 <br />
KeepAliveTimeout = 10
</code>
</blockquote> 

<p>
<i><b>Note:</b> The <code>[HTTPServer] ServerThreads</code> are taken from the total pool made available by the <code>[Parameters] ServerThreads</code>.  Thus, the <code>[Parameters] ServerThreads</code> should always be at least as large as (and is best set greater than) the <code>[HTTPServer] ServerThreads</code>, and if using the closed-source Commercial Version, should not exceed the licensed thread count.</i>
</p> 
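<p>The relation between the two <code>ServerThreads</code> settings can be checked mechanically.  The sketch below uses Python's standard <code>configparser</code> on an inline sample; that a complete, real <code>virtuoso.ini</code> parses cleanly this way is an assumption, and the function name and warning texts are made up for this example.</p>

```python
import configparser

# Inline sample with the values used in this post.
SAMPLE_INI = """
[Parameters]
ServerThreads = 100

[HTTPServer]
ServerThreads = 100
MaxKeepAlives = 1000
KeepAliveTimeout = 10
"""


def check_thread_settings(ini_text):
    """Return warnings about the two ServerThreads settings."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    total = cfg.getint("Parameters", "ServerThreads")
    http = cfg.getint("HTTPServer", "ServerThreads")
    warnings = []
    if http > total:
        warnings.append("HTTPServer ServerThreads exceeds Parameters ServerThreads")
    elif http == total:
        warnings.append("HTTP threads may use the whole pool; consider raising Parameters ServerThreads")
    return warnings


if __name__ == "__main__":
    for warning in check_thread_settings(SAMPLE_INI):
        print("warning:", warning)
```

<p>With both values at 100, as in the stanzas above, the check reports the second, milder warning, in line with the advice that the <code>[Parameters]</code> value is best set greater.</p>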

<h2>Disk Use</h2>

<p>The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID.  For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).  </p>

<p>For the above described example, in the <code>[Database]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code>
[Database]<br />
...<br />
Striping = 1<br />
MaxCheckpointRemap 	= 2000000 
</code>
</blockquote> 

<p>Then, in the <code>[Striping]</code> stanza, on one line per <code>SegmentName</code>, set:</p>
<blockquote>
<code>
[Striping]<br />
...<br />
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6</code>
</blockquote> 

<p>As can be seen here, each file gets a background IO thread (the <code>= q<i>xxx</i></code> clause).  It should be noted that all files on the same physical device should have the same <code>q<i>xxx</i></code> value.  This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.</p>
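<p>Writing such a line by hand is error prone, so here is a small Python sketch that generates it from a list of devices, giving every file on the same device the same queue.  The helper name and the list-of-lists input format are assumptions of this example; the paths are the ones used above.</p>

```python
# One inner list per physical device; all files in it share one IO queue.
DEVICES = [
    ["/virtdev/db/virt-seg1.db"],
    ["/data1/db/virt-seg1-str2.db"],
    ["/data2/db/virt-seg1-str3.db"],
    ["/data3/db/virt-seg1-str4.db"],
    ["/data4/db/virt-seg1-str5.db"],
    ["/data5/db/virt-seg1-str6.db"],
]


def striping_segment(name, initial_pages, files_by_device):
    """Build one line of the [Striping] stanza."""
    parts = [str(initial_pages)]
    for queue_no, files in enumerate(files_by_device, start=1):
        for path in files:
            parts.append(path + " = q" + str(queue_no))
    return name + " = " + " , ".join(parts)


if __name__ == "__main__":
    print(striping_segment("Segment1", 60000, DEVICES))
```

<p>If a device held two files, both would be listed with the same <code>q</code> number, which is the rule stated above.</p>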

<h2>
<a href="http://dbpedia.org/resource/SQL" id="link-id0xc8b97c0">SQL</a> Optimization</h2>

<p>If queries have lots of joins but access little <a href="http://dbpedia.org/resource/Data" id="link-id0x193b2fa8">data</a>, as with the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1b283ca0">Berlin SPARQL Benchmark</a>, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far.  Thus, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code>
[Parameters]<br />
...<br />
StopCompilerWhenXOverRunTime = 1
</code>
</blockquote> 
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-08-25#1418">
  <rss:title>Configuring Virtuoso for Benchmarking</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-25T14:05:46Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will here summarize what should be known about running benchmarks with Virtuoso. Physical Memory For 8G RAM, in the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... NumberOfBuffers = 550000 For 16G RAM, double this: [Parameters] ... NumberOfBuffers = 1100000 Transaction Isolation For most cases, certainly all RDF cases, Read Committed should be the default transaction isolation. In the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... DefaultIsolation = 2 Multiuser Workload If ODBC, JDBC, or similarly connected client applications are used, there must be more ServerThreads available than there will be client connections. In the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... ServerThreads = 100 With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer ServerThreads than there are concurrent clients. The MaxKeepAlives should be the maximum number of expected web clients. This can be more than the ServerThreads count. In the [HTTPServer] stanza of virtuoso.ini, set: [HTTPServer] ... ServerThreads = 100 MaxKeepAlives = 1000 KeepAliveTimeout = 10 Note: The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, should not exceed the licensed thread count. Disk Use The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed). For the above described example, in the [Database] stanza of virtuoso.ini, set: [Database] ... 
Striping = 1 MaxCheckpointRemap = 2000000 Then, in the [Striping] stanza, on one line per SegmentName, set: [Striping] ... Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6 As can be seen here, each file gets a background IO thread (the = qxxx clause). It should be noted that all files on the same physical device should have the same qxxx value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue. SQL Optimization If queries have lots of joins but access little data, as with the Berlin SPARQL Benchmark, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. Thus, in the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... StopCompilerWhenXOverRunTime = 1</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This post summarizes what you should know about running benchmarks with <a href="http://virtuoso.openlinksw.com" id="link-id0xc53af18">Virtuoso</a>.</p>

<h2>Physical Memory</h2>

<p>For 8G RAM, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 550000
</code>
</blockquote> 
<p>For 16G RAM, double this:</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 1100000
</code>
</blockquote> 
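<p>The two settings above differ by a factor of two, in step with the RAM. As a minimal sketch, the following extends that pattern to other machine sizes. The linear scaling beyond the 8G and 16G cases is an assumption made for illustration, and <code>number_of_buffers</code> is a hypothetical helper, not part of Virtuoso.</p>

```python
# Reference point taken from the text: 8G RAM -> NumberOfBuffers = 550000.
REFERENCE_RAM_GB = 8
REFERENCE_BUFFERS = 550000


def number_of_buffers(ram_gb):
    """Scale the 8G setting linearly with RAM.

    Only the 8G and 16G values come from the text; other sizes are an
    extrapolation and should be checked against the actual workload.
    """
    return REFERENCE_BUFFERS * ram_gb // REFERENCE_RAM_GB


for ram_gb in (8, 16):
    print(f"{ram_gb}G RAM: NumberOfBuffers = {number_of_buffers(ram_gb)}")
```

<p>For 8 and 16 this reproduces the two values given above.</p>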

<h2>Transaction Isolation</h2>
<p>For most cases, certainly all <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xc2f07a0">RDF</a> cases, <i>Read Committed</i> should be the default transaction isolation.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p> 
<blockquote>
<code>
[Parameters]<br />
...<br />
DefaultIsolation = 2 
</code>
</blockquote> 

<h2>Multiuser Workload</h2>

<p>If <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0xc1c7178">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xd16fb40">JDBC</a>, or similarly connected client applications are used, there must be more <code>ServerThreads</code> available than there will be client connections.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p> 
<blockquote>
<code> 
[Parameters]<br />
...<br />
ServerThreads = 100
</code>
</blockquote> 

<p>With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer <code>ServerThreads</code> than there are concurrent clients.  <code>MaxKeepAlives</code> should be set to the maximum number of expected web clients; this can be more than the <code>ServerThreads</code> count.  In the <code>[HTTPServer]</code> stanza of <code>virtuoso.ini</code>, set:</p> 
<blockquote>
<code> 
[HTTPServer]<br />
...<br />
ServerThreads    = 100 <br />
MaxKeepAlives    = 1000 <br />
KeepAliveTimeout = 10
</code>
</blockquote> 

<p>
<i><b>Note:</b> The <code>[HTTPServer] ServerThreads</code> are taken from the total pool made available by the <code>[Parameters] ServerThreads</code>.  Thus, the <code>[Parameters] ServerThreads</code> should always be at least as large as (and is best set greater than) the <code>[HTTPServer] ServerThreads</code>, and if using the closed-source Commercial Version, should not exceed the licensed thread count.</i>
</p> 
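<p>The relationship between the two <code>ServerThreads</code> settings is easy to get wrong when editing by hand. Here is a minimal sketch of a sanity check using Python's standard <code>configparser</code>. The sample text, the <code>check_thread_settings</code> helper, and the <code>licensed_threads</code> argument are assumptions made for illustration. A real <code>virtuoso.ini</code> may need extra handling.</p>

```python
import configparser

# Sample settings copied from the examples above.
SAMPLE_INI = """
[Parameters]
ServerThreads = 100

[HTTPServer]
ServerThreads = 100
MaxKeepAlives = 1000
KeepAliveTimeout = 10
"""


def check_thread_settings(ini_text, licensed_threads=None):
    """Return a list of warnings; an empty list means the settings agree."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    total = config.getint("Parameters", "ServerThreads")
    http = config.getint("HTTPServer", "ServerThreads")
    warnings = []
    # The HTTP threads are taken from the total pool, so the pool must cover them.
    if total < http:
        warnings.append("[Parameters] ServerThreads is smaller than [HTTPServer] ServerThreads")
    # Only relevant for the commercial version; the limit is supplied by the caller.
    if licensed_threads is not None and total > licensed_threads:
        warnings.append("[Parameters] ServerThreads exceeds the licensed thread count")
    return warnings


print(check_thread_settings(SAMPLE_INI))
```

<p>The sample passes with the two values equal, which is the minimum the note allows; the note recommends making the <code>[Parameters]</code> value larger.</p>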

<h2>Disk Use</h2>

<p>The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID.  For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).  </p>

<p>For the example described above, in the <code>[Database]</code> stanza of <code>virtuoso.ini</code>, set:</p> 
<blockquote>
<code>
[Database]<br />
...<br />
Striping = 1<br />
MaxCheckpointRemap 	= 2000000 
</code>
</blockquote> 

<p>and in the <code>[Striping]</code> stanza, on one line per <code>SegmentName</code>, set:</p> 
<blockquote>
<code>
[Striping]<br />
...<br />
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6</code>
</blockquote> 

<p>As can be seen here, each file gets a background IO thread (the <code>= q<i>xxx</i></code> clause).  It should be noted that all files on the same physical device should have the same <code>q<i>xxx</i></code> value.  This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.</p>
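<p>Writing the segment line by hand gets error-prone as the number of files grows. The sketch below builds such a line and gives files on the same physical device the same IO queue, as described above. The <code>striping_segment</code> helper and the device labels (<code>disk0</code> and so on) are hypothetical; only the paths and the line format come from the example.</p>

```python
def striping_segment(name, initial_pages, files):
    """Build one [Striping] line from (device, path) pairs.

    Files that share a device label share an IO queue (q1, q2, ...),
    numbered in order of first appearance.
    """
    queues = {}
    parts = [str(initial_pages)]
    for device, path in files:
        queue = queues.setdefault(device, f"q{len(queues) + 1}")
        parts.append(f"{path} = {queue}")
    return f"{name} = " + " , ".join(parts)


# One file per physical device, as in the example above.
files = [
    ("disk0", "/virtdev/db/virt-seg1.db"),
    ("disk1", "/data1/db/virt-seg1-str2.db"),
    ("disk2", "/data2/db/virt-seg1-str3.db"),
    ("disk3", "/data3/db/virt-seg1-str4.db"),
    ("disk4", "/data4/db/virt-seg1-str5.db"),
    ("disk5", "/data5/db/virt-seg1-str6.db"),
]
print(striping_segment("Segment1", 60000, files))
```

<p>With one file per device this reproduces the <code>Segment1</code> line shown above; two files given the same device label would receive the same <code>q</code> value.</p>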

<h2>
<a href="http://dbpedia.org/resource/SQL" id="link-id0xc9fa298">SQL</a> Optimization</h2>

<p>If queries have lots of joins but access little <a href="http://dbpedia.org/resource/Data" id="link-id0xb4e0aa0">data</a>, as with the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xb2de990">Berlin SPARQL Benchmark</a>, the SQL compiler should be told to stop looking for better plans once the best plan found so far is expected to run in less time than has already been spent compiling.  Thus, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p> 
<blockquote>
<code>
[Parameters]<br />
...<br />
StopCompilerWhenXOverRunTime = 1
</code>
</blockquote> 
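<p>The stopping rule can be pictured with a toy model. This is not Virtuoso's optimizer. It is a sketch that assumes each candidate plan has a known estimated run time and a fixed cost to consider, and every name and number in it is made up for illustration.</p>

```python
def choose_plan(candidates):
    """Pick a plan from (name, estimated_run_time, time_to_consider) tuples.

    All times are in seconds. The search stops as soon as the best plan
    found so far would run in less time than has been spent compiling.
    """
    best_name, best_run_time = None, float("inf")
    compile_time = 0.0
    for name, run_time, consider_time in candidates:
        compile_time += consider_time
        if run_time < best_run_time:
            best_name, best_run_time = name, run_time
        # Stopping rule: further search would cost more than it can save.
        if best_run_time < compile_time:
            break
    return best_name, compile_time


# Hypothetical candidates; plan C is the fastest but is never reached.
candidates = [
    ("plan A", 0.050, 0.004),
    ("plan B", 0.006, 0.004),
    ("plan C", 0.002, 0.004),
]
print(choose_plan(candidates))
```

<p>Here the search settles on plan B after 8 ms of compilation: plan C would save at most 4 ms of run time, and considering it would cost another 4 ms.</p>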
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-08-06#1410">
  <rss:title>BSBM With Triples and Mapped Relational Data</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-06T19:41:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">BSBM With Triples and Mapped Relational Data The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF. Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts. The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008. But why an OLTP workload with RDF to begin with? We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case. Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here. What We Got First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit. But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. 
And when there is a clear mapping, there is no reason the SPARQL could not be directly translated. If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor! We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in &quot;query mixes per hour&quot;. (An update or follow-on to this post will provide elapsed times for each test run.) With the unmodified benchmark we got: Physical Triples: 1297 qmph; Mapped Triples: 3144 qmph. In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.) The following were measured on the second run of a 100 query mix series, single test driver, warm cache. Physical Triples: 5746 qmph; Mapped Triples: 7525 qmph. We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time. Physical Triples: 19459 qmph; Mapped Triples: 24531 qmph. The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization. The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough. 
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so. Suggestions for BSBM Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC. Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload. Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let&#39;s use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot; For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC. Add transactions.Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server. Comments on Query Mix The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. 
Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales. Next We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">BSBM With Triples and Mapped Relational Data</div>
<p>The special contribution of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id10039db0">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id106b2538">BSBM</a>) to the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id101a75f8">RDF</a> world is to raise the question of doing OLTP with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xae54170">RDF</a>.</p>

<p>Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1e847b08">BSBM</a> also specifies a relational schema and can generate the <a href="http://dbpedia.org/resource/Data" id="link-id1206c378">data</a> as either triples or <a href="http://dbpedia.org/resource/SQL" id="link-id1667f040">SQL</a> inserts.</p>

<p>The benchmark effectively simulates the case of exposing an existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id10a93518">RDBMS</a> as RDF.  <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id13e46d80">OpenLink Software</a> calls this <i>RDF Views</i>.  <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id12027578">Oracle</a> is beginning to call this <i>semantic covers</i>.  The <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id161dc678">RDB2RDF XG</a>, a W3C incubator group, has been active in this area since Spring, 2008.</p>

<h3>But why an OLTP workload with RDF to begin with?</h3>

<p>We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS.  If <a href="http://dbpedia.org/resource/Data" id="link-id0x1e7119d8">data</a> is online for human consumption, it may be online via a <a href="http://dbpedia.org/resource/SPARQL" id="link-id106a8908">SPARQL</a> end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.</p>

<p>Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here.</p>

<h3>What We Got </h3>

<p>First, we found that <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400" id="link-id150ea748">making the query plan took much too long</a> in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.</p>

<p>But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is all but unbeatable with this query mix.  And when there is a clear mapping, there is no reason the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xae5aff0">SPARQL</a> could not be directly translated.</p>

<p>If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!</p>

<p>We filled two <a href="http://virtuoso.openlinksw.com" id="link-id12dbdc70">Virtuoso</a> instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &quot;query mixes per hour&quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)</p>

<p>With the unmodified benchmark we got:</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td>1297 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td><b>3144 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)</p>

<p>The following were measured on the second run of a 100 query mix series, single test driver, warm cache.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td> 5746 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td> <b>7525 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 (4 drivers of 100 query mixes each) divided by the longest run time in hours.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td> 19459 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td> <b>24531 qmph</b>
   </td>
  </tr>
</table>
</blockquote>

<p>The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.  The concurrent throughputs are about 3.3 to 3.4 times the single-driver throughput, short of the ideal 4 times, which is normal for SMP due to memory contention.  The numbers do not evidence significant overhead from thread synchronization.</p>
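<p>For readers who want to check the arithmetic, here is a small sketch of the qmph metric and of the scaling from one test driver to four. The throughput figures are the ones reported above. The 74-second run time is a hypothetical value chosen to show the formula, not a measured result.</p>

```python
def qmph(query_mixes, longest_run_seconds):
    """Query mixes per hour: total mixes divided by the longest run time in hours."""
    return query_mixes * 3600 / longest_run_seconds


# Figures reported above for the runs that use the text index.
single_driver = {"physical": 5746, "mapped": 7525}
four_drivers = {"physical": 19459, "mapped": 24531}

for store in single_driver:
    speedup = four_drivers[store] / single_driver[store]
    print(f"{store}: {speedup:.2f}x the single-driver throughput")

# Hypothetical run: 400 query mixes where the slowest driver takes 74 seconds.
print(round(qmph(400, 74.0)))
```

<p>The ratios come out at about 3.39 for physical triples and 3.26 for mapped triples.</p>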

<p>The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher.  We used the <code>StopCompilerWhenXOverRunTime = 1</code> option here to cut needless compiler overhead, the queries being straightforward enough.</p>

<p>We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.</p>

<h3>Suggestions for BSBM</h3>

<ul>
 <li>
  <p>
    <b>Reporting Rules.</b> The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.</p>
 </li>

<li>
  <p>
    <b>Multiuser operation.</b>  The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.</p>
</li>

<li>
  <p>
    <b>Add business intelligence.</b>  SPARQL has aggregates now, at least with <a href="http://jena.sourceforge.net/" id="link-id11a25ac0">Jena</a> and <a href="http://virtuoso.openlinksw.com" id="link-id0xb003180">Virtuoso</a>, so let&#39;s use these.  The BSBM business intelligence metric should be a separate metric off the same data.  Adding synthetic sales figures would make more interesting queries possible.  For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot;</p>
</li>

<li>
  <p>
    <b>For the SPARQL community</b>, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id109e2448">SPARQL protocol</a> extension; the SPARUL syntax should also have a way of calling a procedure.  Something like <code>select proc (??, ??)</code> would be enough, where <code>??</code> is a parameter marker, like <code>?</code> in <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id13febf48">ODBC</a>/<a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id120416a8">JDBC</a>.</p>
</li>

<li>
  <p>
    <b>Add transactions.</b>  Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store.  This could use stored procedures or logic in an app server.</p>
</li>
</ul>
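<p>To make the parameter-marker suggestion concrete, here is a sketch of the <code>?</code> style from the SQL side, using Python's built-in <code>sqlite3</code> module, which also uses <code>?</code> as its marker. It illustrates parameterized queries in general, not SPARQL or Virtuoso. The <code>product</code> table and its contents are made up.</p>

```python
import sqlite3

# In-memory database with a hypothetical product table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO product (id, label) VALUES (?, ?)",
    [(1, "red widget"), (2, "blue gadget")],
)

# The query text stays constant; only the bound value changes per call.
QUERY = "SELECT label FROM product WHERE id = ?"


def label_for(product_id):
    """Run the fixed query text with one bound parameter."""
    row = conn.execute(QUERY, (product_id,)).fetchone()
    return row[0] if row else None


print(label_for(2))
```

<p>Because the query text never changes, a server can reuse a compiled plan across calls, which is the saving the suggestion is after.</p>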

<h3>Comments on Query Mix</h3>

<p>The time of most queries is less than linear to the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.</p>

<h2>Next</h2>

<p>We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  This also includes all the query optimization work done for BSBM.  This will be available in the coming days.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409">
  <rss:title>BSBM With Triples and Mapped Relational Data</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-06T19:35:27Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF. Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts. The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008. But why an OLTP workload with RDF to begin with? We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case. Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here. What We Got First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit. But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. 
And when there is a clear mapping, there is no reason the SPARQL could not be directly translated. If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor! We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in &quot;query mixes per hour&quot;. (An update or follow-on to this post will provide elapsed times for each test run.) With the unmodified benchmark we got: Physical Triples: 1297 qmph; Mapped Triples: 3144 qmph. In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.) The following were measured on the second run of a 100 query mix series, single test driver, warm cache. Physical Triples: 5746 qmph; Mapped Triples: 7525 qmph. We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time. Physical Triples: 19459 qmph; Mapped Triples: 24531 qmph. The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization. The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough. 
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so. Suggestions for BSBM Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC. Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload. Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let&#39;s use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot; For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC. Add transactions.Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server. Comments on Query Mix The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. 
Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales. Next We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The special contribution of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id10039db0">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id106b2538">BSBM</a>) to the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id101a75f8">RDF</a> world is to raise the question of doing OLTP with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xb230eb0">RDF</a>.</p>

<p>Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa832da8">BSBM</a> also specifies a relational schema and can generate the <a href="http://dbpedia.org/resource/Data" id="link-id1206c378">data</a> as either triples or <a href="http://dbpedia.org/resource/SQL" id="link-id1667f040">SQL</a> inserts.</p>

<p>The benchmark effectively simulates the case of exposing an existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id10a93518">RDBMS</a> as RDF.  <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id13e46d80">OpenLink Software</a> calls this <i>RDF Views</i>.  <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id12027578">Oracle</a> is beginning to call this <i>semantic covers</i>.  The <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id161dc678">RDB2RDF XG</a>, a W3C incubator group, has been active in this area since Spring, 2008.</p>

<h3>But why an OLTP workload with RDF to begin with?</h3>

<p>We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS.  If <a href="http://dbpedia.org/resource/Data" id="link-id0xabe48a0">data</a> is online for human consumption, it may be online via a <a href="http://dbpedia.org/resource/SPARQL" id="link-id106a8908">SPARQL</a> end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.</p>

<p>Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here.</p>

<h3>What We Got </h3>

<p>First, we found that <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400" id="link-id150ea748">making the query plan took much too long</a> in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.</p>

<p>But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is all but unbeatable with this query mix.  And when there is a clear mapping, there is no reason the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x96bb5e0">SPARQL</a> could not be directly translated.</p>

<p>If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!</p>

<p>We filled two <a href="http://virtuoso.openlinksw.com" id="link-id12dbdc70">Virtuoso</a> instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &quot;query mixes per hour&quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)</p>

<p>With the unmodified benchmark we got:</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td>1297 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td><b>3144 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)</p>

<p>The following were measured on the second run of a 100 query mix series, single test driver, warm cache.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td> 5746 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td> <b>7525 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 (4 drivers of 100 query mixes each) divided by the longest run time in hours.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td> 19459 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td> <b>24531 qmph</b>
   </td>
  </tr>
</table>
</blockquote>

<p>The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8 GB RAM.  The concurrent throughputs are a little under 4 times the single-driver throughput, which is normal for SMP due to memory contention.  The numbers show no evidence of significant overhead from thread synchronization.</p>
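<p>As a quick check on the arithmetic, the sketch below recomputes the scaling from the published figures and backs out the longest run time that each 4-driver figure implies.  The function names are ours for illustration only; nothing here is part of the BSBM test driver.</p>

```python
# Published throughputs from the tables above, in query mixes per hour.
SINGLE = {"physical": 5746, "mapped": 7525}
CONCURRENT = {"physical": 19459, "mapped": 24531}


def qmph(total_mixes, longest_run_seconds):
    """Query mixes per hour, where the slowest driver sets the run time."""
    return total_mixes * 3600.0 / longest_run_seconds


def implied_run_seconds(total_mixes, mixes_per_hour):
    """Invert qmph(): the longest run time implied by a published figure."""
    return total_mixes * 3600.0 / mixes_per_hour


for store in ("physical", "mapped"):
    # 4 drivers of 100 query mixes each give 400 mixes in total.
    seconds = implied_run_seconds(400, CONCURRENT[store])
    speedup = CONCURRENT[store] / SINGLE[store]
    print(f"{store}: about {seconds:.0f} s for 400 mixes, "
          f"{speedup:.2f}x the single-driver rate")
```

<p>The run times printed here are derived from the qmph figures, not measured separately.</p>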

<p>Query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be correspondingly higher.  We used the <code>StopCompilerWhenXOverRunTime = 1</code> option here to cut needless compiler overhead, the queries being straightforward enough.</p>
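<p>To illustrate what parameterization buys, here is a minimal sketch that uses Python's built-in <code>sqlite3</code> module as a stand-in for the system under test.  The table, columns, and data are hypothetical and have nothing to do with the BSBM schema or with Virtuoso.  The point is only that the statement text stays constant while the bound values change, so a prepared statement can be reused instead of being compiled again for every call.</p>

```python
import sqlite3

# SQLite stands in for the SUT; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO product (id, label, price) VALUES (?, ?, ?)",
    [
        (1, "alpha widget", 10.0),
        (2, "beta widget", 25.0),
        (3, "gamma gadget", 40.0),
    ],
)

# The statement text never changes, so its prepared form can be reused;
# only the value bound to the ? marker differs between calls.
QUERY = "SELECT label FROM product WHERE price >= ? ORDER BY id"


def labels_at_or_above(min_price):
    """Run the fixed query text with a different bound parameter each time."""
    return [row[0] for row in conn.execute(QUERY, (min_price,))]


print(labels_at_or_above(20.0))
print(labels_at_or_above(40.0))
```

<p>A literal value pasted into the query text would instead produce a new statement each time, which is the situation the benchmark driver creates today.</p>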

<p>We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect that, in the end, mapping will lead RDF warehousing by a factor of 4 or so.</p>

<h3>Suggestions for BSBM</h3>

<ul>
 <li>
  <p>
    <b>Reporting Rules.</b> The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.</p>
 </li>

<li>
  <p>
    <b>Multiuser operation.</b>  The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.</p>
</li>

<li>
  <p>
    <b>Add business intelligence.</b>  SPARQL has aggregates now, at least with <a href="http://jena.sourceforge.net/" id="link-id11a25ac0">Jena</a> and <a href="http://virtuoso.openlinksw.com" id="link-id0xa83f490">Virtuoso</a>, so let&#39;s use these.  The BSBM business intelligence metric should be a separate metric off the same data.  Adding synthetic sales figures would make more interesting queries possible, for example producing recommendations like &quot;customers who bought this also bought xxx.&quot;</p>
</li>

<li>
  <p>
    <b>For the SPARQL community</b>, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id109e2448">SPARQL protocol</a> extension; the SPARUL syntax should also have a way of calling a procedure.  Something like <code>select proc (??, ??)</code> would be enough, where <code>??</code> is a parameter marker, like <code>?</code> in <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id13febf48">ODBC</a>/<a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id120416a8">JDBC</a>.</p>
</li>

<li>
  <p>
    <b>Add transactions.</b>  Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry, and the system under test (SUT) could implement these as updates to the triples or to a mapped relational store.  This could use stored procedures or logic in an app server.</p>
</li>
</ul>
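<p>The multiuser suggestion above could look something like the following sketch, in which the stream number seeds a private random generator so that each client draws its own reproducible sequence.  The query names, the base seed, and the assumption that parameters are product numbers in the range of the scale factor are all placeholders of ours, not the actual test driver logic.</p>

```python
import random

QUERY_NAMES = ["Q1", "Q2", "Q3"]  # placeholder subset of the query mix
PRODUCT_COUNT = 200000            # assumes the scale factor counts products


def query_sequence(stream, length=6, base_seed=12345):
    """Deterministic per-stream list of (query name, product number) pairs."""
    # A distinct seed per stream gives each client a different sequence,
    # while rerunning the same stream reproduces its sequence exactly.
    rng = random.Random(base_seed * 1000 + stream)
    return [
        (rng.choice(QUERY_NAMES), rng.randint(1, PRODUCT_COUNT))
        for _ in range(length)
    ]


for stream in (1, 2):
    print(stream, query_sequence(stream))
```

<p>Keeping the sequence a pure function of the stream number also helps with disclosure, since a run can be repeated from the reported parameters alone.</p>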

<h3>Comments on Query Mix</h3>

<p>The time of most queries grows less than linearly with the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.</p>
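<p>The difference can be sketched in a few lines.  This is a toy word index over hypothetical product labels, not Virtuoso's text index and not the exact Q6 predicate; it only shows why a scan has to touch every label while an index lookup touches only the entries for the words asked for.</p>

```python
from collections import defaultdict


def scan_labels(labels, words):
    """Check every label: the work grows with the number of products."""
    wanted = set(words)
    return sorted(
        pid
        for pid, label in labels.items()
        if wanted.intersection(label.split())
    )


def build_text_index(labels):
    """Map each word to the set of product ids whose label contains it."""
    index = defaultdict(set)
    for pid, label in labels.items():
        for word in label.split():
            index[word].add(pid)
    return index


def lookup_labels(index, words):
    """Union the entries for the wanted words: the work grows with the hits."""
    hits = set()
    for word in words:
        hits |= index.get(word, set())
    return sorted(hits)


# Hypothetical labels, for illustration only.
labels = {1: "red steel bolt", 2: "blue nylon washer", 3: "green steel nut"}
index = build_text_index(labels)
print(scan_labels(labels, ["steel", "nylon"]))
print(lookup_labels(index, ["steel", "nylon"]))
```

<p>Both functions return the same products; only the amount of data visited differs, and that gap widens as the scale factor goes up.</p>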

<h2>Next</h2>

<p>We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  This also includes all the query optimization work done for BSBM.  This will be available in the coming days.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-07-30#1401">
  <rss:title>Virtuoso Optimizations f