6. kvlayer — database abstraction for key/value stores

Many popular large-scale databases export a simple key/value abstraction: the database is simply a list of cells with some (possibly structured) key and a value for each key. This allows the database system itself to partition the database in some way, and in a distributed system it allows the correct system hosting a specific key to be found easily. This model is also simple enough that it can be used with in-memory storage or more traditional SQL-based databases.
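As an illustration of why this model distributes well, the partitioning step can be as simple as a stable hash of the key choosing the hosting server. This is a sketch of the idea only, not kvlayer's or any backend's actual scheme, and `partition` is a hypothetical helper:

```python
import hashlib

# Illustrative sketch of key-based partitioning; not kvlayer's actual scheme.
def partition(key, n_servers):
    # A stable hash of the serialized key picks the hosting server,
    # so any client can locate the right server without coordination.
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], 'big') % n_servers

server = partition(('users', '42'), 4)  # always the same server for this key
```

Because the hash is deterministic, every client computes the same server for a given key without consulting a central directory.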

This module provides a simple abstraction around these key/value-oriented databases. It works with yakonfig to hold basic configuration settings, so the top-level YAML configuration must have a kvlayer block. This will typically look something like

kvlayer:
  storage_type: redis
  storage_addresses: [redis.example.com:6379]
  app_name: app
  namespace: namespace

These four parameters are always required. storage_type gives one of the database backends described below. storage_addresses is a list of backend-specific database locations. app_name and namespace combine to form a container for the virtual tables stored in kvlayer.

Backend-specific configuration may also be broken into a separate section, whose name matches the storage_type:

kvlayer:
  storage_type: redis
  redis:
    storage_addresses: [redis.example.com:6379]
    app_name: app
    namespace: namespace

This is useful if experimenting with different backends, or with the split_s3 hybrid backend. Any configuration value may be in either the backend-specific configuration or the top-level kvlayer configuration, with deeper values taking precedence.
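The overlay behaves like a dictionary merge in which the backend-specific block wins. The following is a sketch of the precedence rule described above, not kvlayer's actual merge code:

```python
# Sketch of config precedence: backend-specific (deeper) values win.
config = {
    'storage_type': 'redis',
    'app_name': 'app',
    'namespace': 'namespace',
    'redis': {'storage_addresses': ['redis.example.com:6379']},
}
merged = dict(config)
# overlay the section named after storage_type, if present
merged.update(config.get(config['storage_type'], {}))
# merged now holds storage_addresses alongside the top-level settings
```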

6.1. Backends

6.1.1. local

This is intended only for testing. Values are stored in a Python dictionary and not persisted. storage_addresses are not required.

kvlayer:
  storage_type: local

6.1.2. filestorage

This is intended only for testing. Values are stored in a local file using the shelve module. This does not require storage_addresses, but it does require:

kvlayer:
  storage_type: filestorage

  # Name of the file to use for storage
  filename: /tmp/kvlayer.bin

  # If set, actually work on a copy of "filename" at this location.
  copy_to_filename: /tmp/kvlayer-copy.bin

6.1.3. redis

Uses the Redis in-memory database.

kvlayer:
  storage_type: redis

  # host:port locations of Redis servers; only the first is used
  storage_addresses: [redis.example.com:6379]

  # Redis database number (default: 0)
  redis_db_num: 1

6.1.4. accumulo

Uses the Apache Accumulo distributed database. Your installation must be running the Accumulo proxy, and the configuration points at that proxy.

kvlayer:
  storage_type: accumulo

  # host:port location of the proxy; only the first is used
  storage_addresses: [accumulo.example.com:50096]
  username: root
  password: secret

  # all of the following parameters are default values and are optional
  accumulo_max_memory: 1000000
  accumulo_timeout_ms: 30000
  accumulo_threads: 10
  accumulo_latency_ms: 10
  thrift_framed_transport_size_in_mb: 15

Each kvlayer table is instantiated as an Accumulo table named appname_namespace_table.

6.1.5. postgrest

Uses PostgreSQL for storage. This backend is only available if the psycopg2 module is installed. The storage_addresses may be a single PostgreSQL connection string, or may be a host:port format with additional configuration. The app_name and namespace can only consist of alphanumeric characters, underscores, or $, and must begin with a letter or underscore.

This backend is newer than, less proven than, and incompatible with the postgres backend described below.

kvlayer:
  storage_type: postgrest
  storage_addresses: ['postgres.example.com:5432']
  username: test
  password: test
  dbname: test
  # Equivalently, pack this all into a single SQL connection string
  # storage_addresses:
  # - 'host=postgres.example.com port=5432 user=test dbname=test password=test'

  # all of the following parameters are default values and are optional
  # keep this many connections alive
  min_connections: 2
  # never create more than this many connections
  max_connections: 16

The backend assumes the user is able to run SQL CREATE TABLE and DROP TABLE statements. Each kvlayer table is instantiated as an SQL table named appname_namespace_table.

The min_connections and max_connections settings apply per client object. If min_connections is set to 0 then the connection pool never keeps a connection alive, which typically adds the performance cost of reconnecting on each operation.

6.1.6. postgres

Uses PostgreSQL for storage. This backend is only available if the psycopg2 module is installed. The single entry in storage_addresses is a PostgreSQL connection string. The app_name and namespace can only consist of alphanumeric characters, underscores, or $, and must begin with a letter or underscore.

kvlayer:
  storage_type: postgres
  storage_addresses:
  - 'host=postgres.example.com port=5432 user=test dbname=test password=test'

  # all of the following parameters are default values and are optional
  # keep this many connections alive
  min_connections: 2
  # never create more than this many connections
  max_connections: 16
  # break large scans (using SQL) into chunks of this many
  scan_inner_limit: 1000

The backend assumes the user is able to run SQL CREATE TABLE and DROP TABLE statements. Each kvlayer namespace is instantiated as an SQL table named kv_appname_namespace; kvlayer tables are collections of rows within the namespace table sharing a common field.

The min_connections and max_connections settings apply per client object. If min_connections is set to 0 then the connection pool never keeps a connection alive, which typically adds the performance cost of reconnecting on each operation.

6.1.7. riak

Uses Riak for storage. This backend is only available if the corresponding riak client library is installed. Multiple storage_addresses are actively encouraged for this backend. Each may be a simple string, or a dictionary containing keys host, http_port, and pb_port if your setup is using non-standard port numbers. A typical setup will look like:

kvlayer:
  storage_type: riak
  storage_addresses: [riak01, riak02, riak03, riak04, riak05]
  # optional settings with their default values
  protocol: pbc # or http or https
  scan_limit: 100

The setup from the Riak “Five-Minute Install” runs five separate Riak nodes all on localhost; in that case each entry in storage_addresses is a dictionary giving the shared host along with each node's distinct HTTP and protocol buffer ports.
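Such a five-node localhost configuration might look like the following; the port numbers shown are illustrative only, and the real values should be taken from each dev node's configuration:

```yaml
kvlayer:
  storage_type: riak
  storage_addresses:
    # port numbers are illustrative; use your dev cluster's actual ports
    - {host: 127.0.0.1, http_port: 10018, pb_port: 10017}
    - {host: 127.0.0.1, http_port: 10028, pb_port: 10027}
    - {host: 127.0.0.1, http_port: 10038, pb_port: 10037}
    - {host: 127.0.0.1, http_port: 10048, pb_port: 10047}
    - {host: 127.0.0.1, http_port: 10058, pb_port: 10057}
```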

One kvlayer namespace corresponds to one Riak bucket.

The protocol setting selects whether the HTTP or the protocol buffer API is used by default. While Riak itself generally defaults to http, the pbc API works equally well and is much faster, so the kvlayer backend defaults to pbc.

The scan_limit setting determines how many results will be returned from each secondary index search. A higher setting results in fewer network round-trips to get search results, but also higher latency for each response. This affects calls to the kvlayer scan API as well as calls that delete kvlayer tables, which are also Riak key scans.

Your Riak cluster must be configured with secondary indexing enabled, and correspondingly, must be using the LevelDB backend. The default bucket settings, and in particular setting allow_mult to false, are correct for kvlayer.

6.1.8. split_s3

Hybrid backend that stores values for specific tables in Amazon S3, and pointers to those values and all other tables in some other kvlayer backend. This backend is only available if the boto AWS access library is installed. Using this effectively requires knowing the internal details of what table names the end application will use. A typical all-Amazon setup would store bulk objects in S3, and store all other objects in an Amazon RDS PostgreSQL instance.

kvlayer:
  storage_type: split_s3
  app_name: kvlayer
  namespace: namespace
  split_s3:
    tables: [stream_items]
    aws_access_key_id: Axxxxxxxxxxxxxxxxxxx
    aws_secret_access_key: a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0
    bucket: mybucket
    path_prefix:
    kvlayer_prefix: True
    retries: 5
    retry_interval: 0.1
    kvlayer:
      storage_type: postgres
      storage_addresses:
        - >-
            host=stuff.rds.amazonaws.com port=5432 user=db
            dbname=db password=db

An AWS key must be passed to the backend. Either the access key ID and secret access key can be embedded directly in the configuration, as shown here, or aws_access_key_id_path and aws_secret_access_key_path can name local files containing those credentials.

Objects are stored in the named bucket, which must already exist. Each object’s key is constructed by concatenating: path_prefix (if any); app_name/namespace/ if kvlayer_prefix is true (the default); table_name/; and the SHA-256 hash of the serialized object key, with the first two bytes split off to form a two-level directory hierarchy. If your table’s schema is (str,) and you wrote keys a, b, and c with the above configuration to table table, these would be written to mybucket in respective S3 object paths

kvlayer/namespace/table/ca/97/8112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
kvlayer/namespace/table/3e/23/e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d
kvlayer/namespace/table/2e/7d/2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6
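The path construction can be sketched with a hypothetical helper. The function name is illustrative, and for a (str,) schema the serialized key is assumed to be the raw string itself, which reproduces the paths above:

```python
import hashlib

# Hypothetical helper reproducing split_s3's object paths; for a (str,)
# schema the serialized key is assumed to be the raw string.
def s3_object_path(app_name, namespace, table_name, serialized_key,
                   path_prefix='', kvlayer_prefix=True):
    digest = hashlib.sha256(serialized_key).hexdigest()
    parts = []
    if path_prefix:
        parts.append(path_prefix)
    if kvlayer_prefix:
        parts += [app_name, namespace]
    # the first two bytes (four hex digits) become a directory hierarchy
    parts += [table_name, digest[0:2], digest[2:4], digest[4:]]
    return '/'.join(parts)

print(s3_object_path('kvlayer', 'namespace', 'table', b'a'))
# kvlayer/namespace/table/ca/97/8112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
```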

The kvlayer keys are written through to the underlying backend, but no data is written there. Only tables listed in the required tables parameter have their data stored in S3; all other tables are processed normally. scan_keys() does not talk to S3 at all; it simply relays the scan from the underlying backend. Bulk-delete operations such as clear_table() and delete_namespace() must scan the underlying table to find the keys to delete. S3 reads can lag behind writes: particularly if data is read back immediately after being written, a read (get or scan) operation may fail to find the just-written object. A failed read for an object expected to exist is retried retries times (default 5), waiting retry_interval seconds (default 0.1) between attempts.

The underlying backend is separately configured with its own kvlayer section inside the backend configuration.

6.1.9. cassandra

Uses the Apache Cassandra distributed database. Note that this backend requires keys to be limited to tuples of UUIDs.

kvlayer:
  storage_type: cassandra
  storage_addresses: ['cassandra.example.com:9160']
  username: root
  password: secret

  connection_pool_size: 2
  max_consistency_delay: 120
  replication_factor: 1
  thrift_framed_transport_size_in_mb: 15

6.2. API

Having set up the global configuration, it is enough to call kvlayer.client() to get a storage client object.

The API works in terms of “tables”, though these are slightly different from traditional database tables. Each table has keys which are tuples of a fixed length.

kvlayer.client(config=None, storage_type=None, *args, **kwargs)[source]

Create a kvlayer client object.

With no arguments, gets the global kvlayer configuration from yakonfig and uses that. A config dictionary, if provided, is used in place of the yakonfig configuration. storage_type overrides the corresponding field in the configuration, but it must be supplied in one place or the other. Any additional parameters are passed to the corresponding backend’s constructor.

>>> local_storage = kvlayer.client(config={}, storage_type='local',
...                                app_name='app', namespace='ns')

If there is additional configuration under the value of storage_type, that is overlaid over config and passed to the storage implementation.

Parameters:
  • config (dict) – kvlayer configuration dictionary
  • storage_type (str) – name of storage implementation
Raises kvlayer._exceptions.ConfigurationError: if storage_type is not provided or is invalid

class kvlayer._abstract_storage.AbstractStorage(config, app_name=None, namespace=None)[source]

Bases: object

Base class for all low-level storage implementations.

All of the table-like structures we use are set up like this:

namespace = {
    'table_name': {(UUID, UUID, ...): val},
    ...
}

where the number of UUIDs in the key is a configurable parameter of each table, and the “val” is always binary and might be a trivial value, like 1.

check_put_key_value(key, value, table_name, key_spec=None, value_type=None)[source]

Check that a key/value pair are consistent with the schema.

Parameters:
  • key (tuple) – key to put
  • value – value to put (ignored)
  • table_name (str) – kvlayer table name (for errors only)
  • key_spec (tuple) – definition of the table key
  • value_type – type of the table value
Raises kvlayer._exceptions.BadKey: if key doesn’t match key_spec

value_to_str(value, value_type)[source]
str_to_value(value, value_type)[source]
setup_namespace(table_names, value_types=None)[source]

Create tables in the namespace.

Can be run multiple times with different table_names in order to expand the set of tables in the namespace. This generally needs to be called by every client, even if only reading data.

Tables are specified by the form of their keys. A key must be a tuple of a set number and type of parts. Currently the types uuid.UUID, int, long, and str are well supported; anything else is serialized by str(). Historically, a kvlayer key had to be a tuple of some number of UUIDs. table_names is a dictionary mapping a table name to a tuple of types. The dictionary values may also be integers, in which case the tuple is that many UUIDs.

value_types specifies the type of the values for a given table. Tables default to having a value type of str. int and float are also permitted. Value types may also be COUNTER or ACCUMULATOR; see increment() for details on these types. You must pass the corresponding type as the value parameter to put(), and that type will be returned as the value part of get() and scan().

Parameters:
  • table_names (dict) – Mapping from table name to key type tuple
  • value_types (dict) – Mapping from table name to value type
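As an illustration of the mapping formats described above (the table names here are hypothetical), a table_names dictionary can mix explicit type tuples with the integer shorthand:

```python
import uuid

# Hypothetical table definitions; the names are examples only.
table_names = {
    'documents': (str,),               # key is a single string
    'edges': (uuid.UUID, uuid.UUID),   # key is a pair of UUIDs
    'legacy': 2,                       # shorthand for a tuple of two UUIDs
}
value_types = {'documents': str}       # unlisted tables default to str

# a kvlayer client would then register these with:
# client.setup_namespace(table_names, value_types)
```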
delete_namespace()[source]

Deletes all data from namespace.

clear_table(table_name)[source]

Delete all data from one table.

put(table_name, *keys_and_values, **kwargs)[source]

Save values for keys in table_name.

Each key must be a tuple of length and types as specified for table_name in setup_namespace().

log_put(table_name, start_time, end_time, num_keys, keys_size, num_values, values_size)[source]
scan(table_name, *key_ranges, **kwargs)[source]

Yield tuples of (key, value) from querying table_name for items with keys within the specified ranges. If no key_ranges are provided, then yield all (key, value) pairs in table. This may return nothing if the table is empty or there are no matching keys in any of the specified ranges.

Each of the key_ranges is a pair of a start and end tuple to scan. To scan from the beginning of the table or to its end (conceptually a -Inf or +Inf bound), use an empty tuple as the start or end key of a range.
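The range semantics can be modeled over an in-memory table. This is an illustrative sketch assuming inclusive bounds, not any backend's actual scan code:

```python
# Illustrative model of scan() range semantics; assumes inclusive bounds.
def scan(table, *key_ranges):
    if not key_ranges:
        key_ranges = [((), ())]        # no ranges: scan everything
    for start, end in key_ranges:
        for key in sorted(table):
            if start and key < start:  # empty start tuple means -Inf
                continue
            if end and key > end:      # empty end tuple means +Inf
                continue
            yield key, table[key]

table = {('a',): b'1', ('b',): b'2', ('c',): b'3'}
results = list(scan(table, (('b',), ())))  # from ('b',) to the end
# results == [(('b',), b'2'), (('c',), b'3')]
```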

log_scan(table_name, start_time, end_time, num_keys, keys_size, num_values, values_size)[source]
scan_keys(table_name, *key_ranges, **kwargs)[source]

Scan only the keys from a table.

Yields key tuples from querying table_name for keys within the specified ranges. If no key_ranges are provided, then yield all key tuples in the table. This may yield nothing if the table is empty or there are no matching keys in any of the specified ranges.

Each of the key_ranges is a pair of a start and end tuple to scan. To scan from the beginning of the table or to its end (conceptually a -Inf or +Inf bound), use an empty tuple as the start or end key of a range.

log_scan_keys(table_name, start_time, end_time, num_keys, keys_size)[source]
get(table_name, *keys, **kwargs)[source]

Yield tuples of (key, value) from querying table_name for items with keys. If any of the key tuples are not in the table, those key tuples will be yielded with value None.

log_get(table_name, start_time, end_time, num_keys, keys_size, num_values, values_size)[source]
delete(table_name, *keys, **kwargs)[source]

Delete all (key, value) pairs with the specified keys.

log_delete(table_name, start_time, end_time, num_keys, keys_size)[source]
close()[source]

Close connections and end use of this storage client.

increment(table_name, *keys_and_values)[source]

Add values to a counter-type table.

The keys_and_values parameters are (key, delta) pairs. The deltas must be int if table_name is a COUNTER table, or float if it is an ACCUMULATOR table. For each key, the current value is fetched from the storage, the delta is added to it, and the resulting value is written back into the storage.

This method is not guaranteed to be atomic, either on a specific key or across all keys, but specific backends may have better guarantees. The behavior is unspecified if the same key is included multiple times in the parameter list.

To use this, you must have passed table_name to setup_namespace() in its value_types parameter, setting the value type to COUNTER or ACCUMULATOR. When you do this, you pass and receive the corresponding numeric values from all methods in this class for that table. put() directly sets the values of counter keys. Counter values default to 0; if you change a counter value to 0 then it will still be “present” for purposes of get(), scan(), and scan_keys().

Parameters:
  • table_name (str) – name of table to update
  • keys_and_values – additional parameters are pairs of key tuple and numeric delta value
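The read-modify-write behavior described above can be sketched against a plain dictionary. This is an illustrative model only; it is not atomic, and as noted, specific backends may offer stronger guarantees:

```python
# Illustrative read-modify-write model of increment(); not atomic.
def increment(table, *keys_and_values):
    for key, delta in keys_and_values:
        # counters default to 0, so a missing key starts from zero
        table[key] = table.get(key, 0) + delta

counters = {}
increment(counters, (('page',), 2), (('page',), 3))
# counters[('page',)] is now 5
```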
class kvlayer.DatabaseEmpty[source]
class kvlayer.BadKey[source]

A key value passed to a kvlayer function was not of the correct form.

Keys must be tuples of a fixed length and with specific types. The length of the tuple is specified in the initial call to kvlayer._abstract_storage.AbstractStorage.setup_namespace().