The specification of the exercise referred to in the previous post may be found below.
Questions on the exercise can be sent to the email specified in the previous post. I may schedule a phone call to answer questions based on the initial email contact.
We seek to have all applicants complete the exercise before October 1.
The exercise consists of implementing a part of the TPC-C workload in memory, in C or C++. TPC-C is the long-time industry standard benchmark for transaction processing performance. We use it as a starting point for assessing a developer's skill in writing heavily multithreaded, performance-critical code.
The application performs a series of transactions against an in-memory database, encountering lock contention and occasional deadlocks. The application needs to provide atomicity, consistency, and isolation for transactions. The task consists of writing the low-level data structures for storing the memory-resident database and for managing concurrency, including lock queueing, deadlock detection, and commit/rollback. The solutions are evaluated based on their actual measured multithreaded performance on commodity servers, e.g., 8- or 12-core Intel Xeon machines.
OpenLink provides the code for data generation and driving the test. This is part of the TPC-C kit in Virtuoso Open Source. The task is to replace the SQL API calls with equivalent in-process function calls against the in-memory database developed as part of the exercise.
We are aware that the best solution to the problem may be running transactions single-threaded against in-memory hash tables without any concurrency control. The application data may be partitioned so that a single transaction can be in most cases assigned to a partition, which it will get for itself for the few microseconds it takes to do its job. For this exercise, this solution is explicitly ruled out. The application must demonstrate shared access to data, with a transaction holding multiple concurrent locks and being liable to deadlock.
TPC-C can be written so as to avoid deadlocks by always locking in a certain order. This too is expressly prohibited; specifically, the stock rows of a new order transaction must be locked in the order they are specified in the invocation. In application terms this makes no sense, but for purposes of the exercise it serves as a natural source of deadlocks.
The application needs to offer an interactive or scripted interface (command line is OK) which provides the following operations:
Clear and initialize a database of n warehouses.
Run n threads, each doing m new order transactions. Each thread has a home warehouse and occasionally accesses other warehouses' data. The run reports the real time elapsed and the number of retries arising from deadlocks.
Check the consistency between the stock, orders, and order_line data structures.
Report system status such as clocks spent waiting for specific mutexes. This is supplied as part of the OpenLink library used by the data generator.
The transactions are written as C functions. The data is represented as C structs, and tree indices or hash tables are used for value-based access to the structures by key. The application has no persistent storage. The structures reference each other by key values, as in the database, not by direct pointers. The key values are translated into pointers with a hash table or other index-like structure.
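To illustrate the key-to-pointer translation, here is a minimal sketch of a chained hash table mapping a composite (warehouse, district) key to a struct pointer. The names (`district_t`, `dist_hash_get`, the bucket count) are illustrative, not taken from the kit, and the table shown is not yet thread-safe:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct district_s {
    int32_t d_w_id;          /* warehouse key */
    int32_t d_id;            /* district key */
    int32_t d_next_o_id;     /* next order id for this district */
    struct district_s *next; /* hash chain */
} district_t;

#define DIST_BUCKETS 1024
static district_t *dist_hash[DIST_BUCKETS];

static unsigned dist_hash_no (int32_t w_id, int32_t d_id)
{
    return ((unsigned)(w_id * 31 + d_id)) % DIST_BUCKETS;
}

void dist_hash_insert (district_t *d)
{
    unsigned h = dist_hash_no (d->d_w_id, d->d_id);
    d->next = dist_hash[h];
    dist_hash[h] = d;
}

district_t *dist_hash_get (int32_t w_id, int32_t d_id)
{
    district_t *d = dist_hash[dist_hash_no (w_id, d_id)];
    for (; d; d = d->next)
        if (d->d_w_id == w_id && d->d_id == d_id)
            return d;
    return NULL;
}
```

Rows elsewhere would store `(w_id, d_id)` and call `dist_hash_get` to reach the struct, rather than holding a `district_t *` directly.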
The application must be thread-safe, and transactions must be able to roll back. Transactions will sometimes wait for each other in updating shared resources such as stock or district or warehouse balances. The application must be written so as to implement fine-grained locking, and each transaction must be able to hold multiple locks. The application must be able to detect deadlocks. For deadlock recovery, it is acceptable to abort the transaction that detects the deadlock.
C++ template libraries may be used but one must pay attention to their efficiency.
The new order transaction is the only required transaction.
All numbers can be represented as integers. This holds equally for key columns as for monetary amounts.
All index structures (e.g., hash tables) in the application must be thread-safe, so that an insert is safe under concurrent access or concurrent inserts. This holds also for index structures for tables which do not get inserts in the test (e.g., item, customer, stock, etc.).
A sequence object must not be used for assigning new values to the O_ID column of ORDERS. These values must come from the D_NEXT_O_ID column of the DISTRICT table. If a new order transaction rolls back, its update of D_NEXT_O_ID is also rolled back. This causes O_ID values to always be consecutive within a district.
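The D_NEXT_O_ID rule amounts to keeping a before image of the counter in the transaction's undo state. A minimal sketch, with illustrative names and the district assumed to be locked by the caller:

```c
#include <stdint.h>

typedef struct {
    int32_t d_next_o_id;      /* DISTRICT.D_NEXT_O_ID */
} district_t;

typedef struct {
    district_t *dist;         /* district this trx updated */
    int32_t old_next_o_id;    /* before image for rollback */
} trx_t;

/* draw the O_ID from the district counter; caller holds the district lock */
int32_t new_order_id (trx_t *trx, district_t *d)
{
    trx->dist = d;
    trx->old_next_o_id = d->d_next_o_id;  /* log the before image */
    return d->d_next_o_id++;
}

/* undo the counter update so O_IDs stay consecutive per district */
void trx_rollback (trx_t *trx)
{
    trx->dist->d_next_o_id = trx->old_next_o_id;
}
```

A real implementation would keep a list of such undo records, one per updated row, but the principle is the same.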
The application must implement the TPC-C new order transaction in full. The implementation must not avoid deadlocks by ordering the locking of stock rows. See the rules section.
The transaction must have the semantics specified in TPC-C, except for durability.
The test driver calling the transaction procedures is in tpccodbc.c. This can be reused so as to call the transaction procedure in process instead of the ODBC exec.
The user interface may be a command line menu with run options for different numbers of transactions with different thread counts and an option for integrity check.
The integrity check consists of verifying s_cnt_order against the orders and checking that max (O_ID) and D_NEXT_O_ID match within each district.
Running the application should give different statistics such as CPU%, cumulative time spent waiting for locks, etc. The rdtsc instruction can be used for getting clock counts for timing.
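A minimal sketch of reading the cycle counter with gcc inline assembly; this is x86-specific (`rdtsc` loads the time-stamp counter into `edx:eax`):

```c
#include <stdint.h>

/* read the x86 time-stamp counter; gcc/clang inline asm */
static inline uint64_t rdtsc (void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
```

Two reads around a critical section give its cost in clocks; accumulating these per mutex yields the wait statistics mentioned above.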
This section summarizes some of the design patterns and coding tricks we expect to see in a solution to the exercise. These may seem self-evident to some, but experience indicates that this is not universally so.
The TPC-C transaction profile for new order specifies the semantics of the operation. The order of locking is left to the implementation as long as the semantics are preserved. The application will be tested with many clients on the same warehouse, running as fast as they can, so lock contention is expected. Therefore, the transaction should be written so as to acquire the locks with the greatest contention as late as possible. No locks need be acquired for the item table since none of the transactions will update it.
For implementing locks, using a mutex to serialize access to application resources is not enough. Many locks will be acquired by each transaction, in an unpredictable order. Unless explicit queueing for locks is implemented with deadlock detection, the application will not work.
If waiting for a mutex causes the operating system to stop a thread, even when there are cores free, the latency is multiple microseconds, even if the mutex is released by its owner on the next cycle after the waiting thread is suspended. This will destroy any benefit from parallelism unless one is very careful. Programmers do not seem to instinctively know this.
Therefore any structure to which access must be serialized (e.g. hash tables, locks, etc.) needs to be protected by a mutex but must be partitioned so that there are tens or hundreds of mutexes depending on which section of the structure one is accessing.
Submissions that protect a hash table or other index-like structure for a whole application table with a single mutex or rw lock will be discarded off the bat.
Even while using many mutexes, one must hold each for the minimum time. When accessing a hash table, do the invariant parts first; acquire the mutex after that. For example, if you calculate the hash number after acquiring the mutex for the hash table, the submission will be rejected.
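The two rules above, many stripes and invariant work outside the critical section, can be sketched as follows. The stripe count and names are illustrative:

```c
#include <pthread.h>
#include <stdint.h>

#define N_STRIPES 64

static pthread_mutex_t stripes[N_STRIPES];

void stripes_init (void)
{
    for (int i = 0; i < N_STRIPES; i++)
        pthread_mutex_init (&stripes[i], NULL);
}

/* invariant work: the stripe number depends only on the key */
static unsigned stripe_of (uint64_t key)
{
    return (unsigned)((key * 0x9E3779B97F4A7C15ull) >> 58);  /* top 6 bits: 0..63 */
}

void table_enter (uint64_t key)
{
    unsigned s = stripe_of (key);        /* computed before any mutex is held */
    pthread_mutex_lock (&stripes[s]);    /* serializes only one stripe */
}

void table_leave (uint64_t key)
{
    pthread_mutex_unlock (&stripes[stripe_of (key)]);
}
```

Two threads touching different stripes never contend; a single table-wide mutex would serialize them all.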
The TPC-C application has some local and some scattered access. Orders are local, and stock and item lines are scattered. When doing scattered memory accesses, the program should be written so that the CPU will, from a single thread, have multiple concurrent cache misses in flight at all times. So, when accessing 10 stock lines, calculate the hash numbers first; then access the memory, deferring any branches based on the accessed values. In this way, out of order execution will miss the CPU cache for many independent addresses in parallel. One can use the gcc __builtin_prefetch primitive, or simply write the program so as to have mutually data-independent memory accesses in close proximity.
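A sketch of the two-pass pattern for the stock lines: resolve all addresses and prefetch first, branch on the data only afterwards. The array-backed `stock_get` here is a stand-in for the real hash lookup, and the names are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define N_STOCK 100

typedef struct { int64_t s_quantity; } stock_t;

static stock_t stock_rows[N_STOCK];

static stock_t *stock_get (int32_t s_i_id)  /* stand-in for the hash lookup */
{
    return (s_i_id >= 0 && s_i_id < N_STOCK) ? &stock_rows[s_i_id] : NULL;
}

int64_t order_stock_total (const int32_t *i_ids, int n)  /* n <= 16 */
{
    stock_t *rows[16];
    /* pass 1: independent lookups and prefetches, no branches on the data,
       so out of order execution can have many misses in flight */
    for (int i = 0; i < n; i++) {
        rows[i] = stock_get (i_ids[i]);
        if (rows[i])
            __builtin_prefetch (rows[i]);
    }
    /* pass 2: the data-dependent arithmetic and branches */
    int64_t total = 0;
    for (int i = 0; i < n; i++)
        if (rows[i])
            total += rows[i]->s_quantity;
    return total;
}
```

With a real hash table, pass 1 would compute all the hash numbers, then touch all the bucket heads, keeping consecutive accesses data-independent.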
For detecting deadlocks, a global transaction wait graph may have to be maintained. This will need to be maintained in a serialized manner. If many threads access this, the accesses must be serialized on a global mutex. This may be very bad if the deadlock detection takes a long time. Alternately, the wait graph may be maintained on another thread. That thread will get notices of waits and transaction terminations from worker threads with some delay. Having spotted a cycle, it may kill one or another party. This will require some inter-thread communication. The submission may address this matter in any number of ways.
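One simple in-line variant: with exclusive locks a transaction waits for at most one holder at a time, so the wait graph is a set of chains, and checking for a cycle before adding a new edge is just chasing `waits_for`. This assumes the graph is acyclic on entry, which holds if every edge was checked this way; the names are illustrative:

```c
#include <stddef.h>

typedef struct trx_s {
    struct trx_s *waits_for;  /* holder of the lock this trx waits on, or NULL */
} trx_t;

/* returns 1 if making waiter wait for holder would close a cycle */
int would_deadlock (trx_t *waiter, trx_t *holder)
{
    for (trx_t *t = holder; t; t = t->waits_for)
        if (t == waiter)
            return 1;
    return 0;
}
```

The chase itself must of course be serialized against edge updates; that is where the global mutex or the monitor thread of the previous paragraph comes in.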
However, just acquiring a lock without wait must not involve getting a global mutex. Going to wait will have to, if only for queueing a notice to a monitor thread. Using a socket-to-self might appear to circumvent this, but the communication stack has mutexes inside, so this is no better.
The exercise will be evaluated based on the run time performance, especially multicore scalability of the result.
Extra points are not given for implementing interfaces or for being object oriented. Interfaces, templates, and objects are not forbidden as such, but their cost must not exceed the difference between getting an address from a virtual table and calling a function directly.
The locking implementation must be correct. It can be limited to exclusive locks and need not support isolation other than repeatable read. Running the application must demonstrate deadlocks and working recovery from these.
The TPC-C data generator and test driver are in the Virtuoso Open Source distribution, in the files binsrc/tests/tpcc*.c and files included from these. You can make the exercise in the same directory and just alter the files or make script. The application is standalone and has no other relation to the Virtuoso code. The libsrc/Thread threading wrappers may be used. If not using these, make a wrapper similar to mutex_enter when MTX_METER is defined so that it counts the waits and clocks spent during wait. Also have a report like that in mutex_stat() for the mutex wait frequency and duration.
About this entry:
Author: Orri Erling
Published: 08/16/2012 15:26 GMT