Diffstat (limited to 'libs/libmdbx/src/mdbx.h')
-rw-r--r-- libs/libmdbx/src/mdbx.h 3107
1 files changed, 2328 insertions, 779 deletions
diff --git a/libs/libmdbx/src/mdbx.h b/libs/libmdbx/src/mdbx.h
index 67b725139a..dcbe608b29 100644
--- a/libs/libmdbx/src/mdbx.h
+++ b/libs/libmdbx/src/mdbx.h
@@ -1,4 +1,465 @@
-/* LICENSE AND COPYRUSTING *****************************************************
+/**** BRIEFLY ******************************************************************
+ *
+ * libmdbx is superior to LMDB (https://bit.ly/26ts7tL) in terms of features
+ * and reliability, and is not inferior in performance. Compared to LMDB,
+ * libmdbx makes many things just work, rather than break down silently and
+ * catastrophically. libmdbx supports Linux, Windows, MacOS, FreeBSD,
+ * DragonFly, Solaris, OpenSolaris, OpenIndiana, NetBSD, OpenBSD and other
+ * systems compliant with POSIX.1-2008.
+ *
+ * See below for the API description; for other information (build, embedding
+ * and amalgamation, improvements over LMDB, benchmarking, etc) please refer to
+ * README.md at https://abf.io/erthink/libmdbx.
+ *
+ * ---
+ *
+ * The next version is under active non-public development and will be released
+ * as MithrilDB and libmithrildb for libraries & packages. The admittedly
+ * mythical Mithril resembles silver but is stronger and lighter than steel,
+ * so MithrilDB is a fitting name.
+ *
+ * MithrilDB will differ radically from libmdbx: a new database format, an API
+ * based on C++17, and the Apache 2.0 License. The goal of this revolution is
+ * to provide a clearer and more robust API, and to add more features and new
+ * valuable database properties.
+ *
+ * The Future will (be) Positive. Everything will be fine.
+ *
+ *
+ **** INTRODUCTION *************************************************************
+ *
+ * // For the most part, this section is a copy of the corresponding text
+ * // from the LMDB description, with some edits reflecting the improvements
+ * // and enhancements that were made in MDBX.
+ *
+ * MDBX is a Btree-based database management library modeled loosely on the
+ * BerkeleyDB API, but much simplified. The entire database (aka "environment")
+ * is exposed in a memory map, and all data fetches return data directly from
+ * the mapped memory, so no malloc's or memcpy's occur during data fetches.
+ * As such, the library is extremely simple because it requires no page caching
+ * layer of its own, and it is extremely high performance and memory-efficient.
+ * It is also fully transactional with full ACID semantics, and when the memory
+ * map is read-only, the database integrity cannot be corrupted by stray pointer
+ * writes from application code.
+ *
+ * The library is fully thread-aware and supports concurrent read/write access
+ * from multiple processes and threads. Data pages use a copy-on-write strategy
+ * so no active data pages are ever overwritten, which also provides resistance
+ * to corruption and eliminates the need for any special recovery procedures
+ * after a system crash. Writes are fully serialized; only one write transaction
+ * may be active at a time, which guarantees that writers can never deadlock.
+ * The database structure is multi-versioned so readers run with no locks;
+ * writers cannot block readers, and readers don't block writers.
+ *
+ * Unlike other well-known database mechanisms which use either write-ahead
+ * transaction logs or append-only data writes, MDBX requires no maintenance
+ * during operation. Both write-ahead loggers and append-only databases require
+ * periodic checkpointing and/or compaction of their log or database files
+ * otherwise they grow without bound. MDBX tracks free pages within the database
+ * and re-uses them for new write operations, so the database size does not grow
+ * without bound in normal use. It is worth noting that the "next" version
+ * of libmdbx (MithrilDB) will solve this problem.
+ *
+ * The memory map can be used as a read-only or read-write map. It is read-only
+ * by default as this provides total immunity to corruption. Using read-write
+ * mode offers much higher write performance, but adds the possibility for stray
+ * application writes through pointers to silently corrupt the database.
+ * Of course if your application code is known to be bug-free (...) then this is
+ * not an issue.
+ *
+ * If this is your first time using a transactional embedded key-value store,
+ * you may find the "GETTING STARTED" section below to be helpful.
+ *
+ *
+ **** GETTING STARTED **********************************************************
+ *
+ * // This section is based on Bert Hubert's intro "LMDB Semantics", with edits
+ * // reflecting the improvements and enhancements that were made in MDBX.
+ * // See https://bit.ly/2maejGY for Bert Hubert's original.
+ *
+ * Everything starts with an environment, created by mdbx_env_create().
+ * Once created, this environment must also be opened with mdbx_env_open(),
+ * and after use be closed by mdbx_env_close(). A non-zero value of the last
+ * argument, "mode", tells MDBX to create the database and directory if they
+ * do not exist. In that case "mode" also specifies the file mode bits applied
+ * when new files are created by the open() function.
+ *
+ * Within that directory, a lock file (aka LCK-file) and a storage file (aka
+ * DXB-file) will be generated. If you don't want to use a directory, you can
+ * pass the MDBX_NOSUBDIR option, in which case the path you provided is used
+ * directly as the DXB-file, and another file with a "-lck" suffix added
+ * will be used for the LCK-file.
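The MDBX_NOSUBDIR naming rule above can be sketched in plain C. This is only an illustration: `nosubdir_lck_path` is a hypothetical helper, not an MDBX function, and it assumes nothing beyond the "-lck" suffix stated in the text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the MDBX API: derive the LCK-file path
 * from the DXB-file path for the MDBX_NOSUBDIR layout, i.e. the same path
 * with a "-lck" suffix appended.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int nosubdir_lck_path(const char *dxb_path, char *out,
                             size_t out_size) {
  int n = snprintf(out, out_size, "%s-lck", dxb_path);
  return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Such a helper is only useful for tooling around the database (for example, cleaning up test files); MDBX derives the LCK-file name on its own.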
+ *
+ * Once the environment is open, a transaction can be created within it using
+ * mdbx_txn_begin(). Transactions may be read-write or read-only, and read-write
+ * transactions may be nested. A transaction must only be used by one thread at
+ * a time. Transactions are always required, even for read-only access. The
+ * transaction provides a consistent view of the data.
+ *
+ * Once a transaction has been created, a database (i.e. key-value space inside
+ * the environment) can be opened within it using mdbx_dbi_open(). If only one
+ * database will ever be used in the environment, a NULL can be passed as the
+ * database name. For named databases, the MDBX_CREATE flag must be used to
+ * create the database if it doesn't already exist. Also, mdbx_env_set_maxdbs()
+ * must be called after mdbx_env_create() and before mdbx_env_open() to set the
+ * maximum number of named databases you want to support.
+ *
+ * NOTE: a single transaction can open multiple databases. Generally databases
+ * should only be opened once, by the first transaction in the process.
+ *
+ * Within a transaction, mdbx_get() and mdbx_put() can store single key-value
+ * pairs if that is all you need to do (but see CURSORS below if you want to do
+ * more).
+ *
+ * A key-value pair is expressed as two MDBX_val structures. This struct is
+ * exactly like POSIX's struct iovec and has two fields, iov_len and
+ * iov_base. The data is a void pointer to an array of iov_len bytes.
+ * (!) A notable difference between MDBX and LMDB is that MDBX supports
+ * zero-length keys.
+ *
+ * Because MDBX is very efficient (and usually zero-copy), the data returned in
+ * an MDBX_val structure may be memory-mapped straight from disk. In other words
+ * look but do not touch (or free() for that matter). Once a transaction is
+ * closed, the values can no longer be used, so make a copy if you need to keep
+ * them after that.
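A minimal sketch of the "make a copy" advice, assuming only what the text states: that MDBX_val has the same two fields as POSIX struct iovec. To stay self-contained it uses struct iovec directly; `value_dup` is a hypothetical helper name, not an MDBX function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h> /* struct iovec (POSIX) */

/* Hypothetical helper, not part of the MDBX API: make a private heap copy of
 * a value so it stays valid after the transaction ends. struct iovec stands
 * in for MDBX_val here (an assumption based on the description above).
 * Returns 0 on success, -1 on allocation failure; the caller frees
 * dst->iov_base. */
static int value_dup(const struct iovec *src, struct iovec *dst) {
  /* Allocate at least one byte so zero-length values still get a valid
   * pointer. */
  void *copy = malloc(src->iov_len ? src->iov_len : 1);
  if (!copy)
    return -1;
  if (src->iov_len)
    memcpy(copy, src->iov_base, src->iov_len);
  dst->iov_base = copy;
  dst->iov_len = src->iov_len;
  return 0;
}
```

In real code the source value would come from a read inside an open transaction, and the copy would be taken before that transaction is committed, aborted or reset.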
+ *
+ *
+ * CURSORS -- To do more powerful things, we must use a cursor.
+ *
+ * Within the transaction, a cursor can be created with mdbx_cursor_open().
+ * With this cursor we can store/retrieve/delete (multiple) values using
+ * mdbx_cursor_get(), mdbx_cursor_put(), and mdbx_cursor_del().
+ *
+ * mdbx_cursor_get() positions itself depending on the cursor operation
+ * requested, and for some operations, on the supplied key. For example, to list
+ * all key-value pairs in a database, use operation MDBX_FIRST for the first
+ * call to mdbx_cursor_get(), and MDBX_NEXT on subsequent calls, until the end
+ * is hit.
+ *
+ * To retrieve all keys starting from a specified key value, use MDBX_SET. For
+ * more cursor operations, see the API description below.
+ *
+ * When using mdbx_cursor_put(), either the function will position the cursor
+ * for you based on the key, or you can use operation MDBX_CURRENT to use the
+ * current position of the cursor. NOTE that key must then match the current
+ * position's key.
+ *
+ *
+ * SUMMARIZING THE OPENING
+ *
+ * So we have a cursor in a transaction which opened a database in an
+ * environment which is opened from a filesystem after it was separately
+ * created.
+ *
+ * Or, we create an environment, open it from a filesystem, create a transaction
+ * within it, open a database within that transaction, and create a cursor
+ * within all of the above.
+ *
+ * Got it?
+ *
+ *
+ * THREADS AND PROCESSES
+ *
+ * Do not open the same database twice in the same process at the same time;
+ * MDBX tracks and prevents this. Instead, share the MDBX environment that has
+ * opened the file across all threads. The reasons for this are:
+ * - When the "Open file description" locks (aka OFD-locks) are not available,
+ * MDBX uses POSIX locks on files, and these locks have issues if one process
+ * opens a file multiple times.
+ * - If a single process opens the same environment multiple times, closing it
+ * once will remove all the locks held on it, and the other instances will be
+ * vulnerable to corruption from other processes.
+ * + For compatibility with LMDB, which allows multi-opening, MDBX can be
+ * configured at runtime by mdbx_setup_debug(MDBX_DBG_LEGACY_MULTIOPEN, ...)
+ * prior to calling other MDBX functions. In this mode MDBX tracks database
+ * openings, detects multi-opening cases and then recovers POSIX file
+ * locks as necessary. However, lock recovery can cause unexpected pauses,
+ * such as when another process opened the database in exclusive mode before
+ * the lock was restored: we have to wait until such a process releases the
+ * database, and so on.
+ *
+ * Do not use opened MDBX environment(s) after fork() in child process(es);
+ * MDBX will check and prevent this at critical points. Instead, ensure there
+ * are no open MDBX instances during fork(), or at least close them immediately
+ * after fork() in the child process and reopen if required, for instance by
+ * using pthread_atfork(). The reasons for this are:
+ * - For concurrent consistent reading, MDBX assigns a slot in the shared
+ * table for each process that interacts with the database. This slot is
+ * populated with process attributes, including the PID.
+ * - After fork(), in order to remain connected to a database, the child
+ * process must have its own such "slot", which can't be assigned in any
+ * simple and robust way other than the regular one.
+ * - A write transaction from a parent process cannot continue in a child
+ * process for obvious reasons.
+ * - Moreover, in a multithreaded process at the moment of fork() any number
+ * of threads could be running in critical and/or intermediate sections of
+ * MDBX code, interacting and/or racing with threads from other
+ * process(es). For instance: shrinking a database or copying it to a pipe,
+ * opening or closing an environment, beginning or finishing a transaction,
+ * and so on.
+ * = Therefore, any solution other than simply closing the database (and
+ * reopening it if necessary) in a child process would be both extremely
+ * complicated and fragile.
+ *
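The PID-based reasoning above can be illustrated with a small application-side guard. This is a sketch under stated assumptions, not MDBX's own check: `struct owned_handle` and both functions are hypothetical names, and the only real calls used are POSIX getpid() and the standard headers.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper, not an MDBX type: remembers which process opened
 * the environment. */
struct owned_handle {
  pid_t owner_pid; /* PID recorded when the handle was opened */
  int is_open;
};

static void owned_handle_open(struct owned_handle *h) {
  h->owner_pid = getpid();
  h->is_open = 1;
}

/* Returns 1 only in the process that opened the handle. After fork() the
 * child sees a different getpid(), so it must close and reopen instead of
 * using the inherited handle. */
static int owned_handle_usable(const struct owned_handle *h) {
  return h->is_open && h->owner_pid == getpid();
}
```

An application could call such a check before touching the environment, or close the environment in a child handler registered with pthread_atfork(), as the text suggests.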
+ * Also note that a transaction is tied to one thread by default using Thread
+ * Local Storage. If you want to pass read-only transactions across threads,
+ * you can use the MDBX_NOTLS option on the environment. Nevertheless, a write
+ * transaction should be used entirely in one thread, from start to finish.
+ * MDBX checks this in a reasonable manner and returns the
+ * MDBX_THREAD_MISMATCH error if the rule is violated.
+ *
+ *
+ * TRANSACTIONS, ROLLBACKS, etc.
+ *
+ * To actually get anything done, a transaction must be committed using
+ * mdbx_txn_commit(). Alternatively, all of a transaction's operations
+ * can be discarded using mdbx_txn_abort().
+ *
+ * (!) An important difference between MDBX and LMDB is that in MDBX any
+ * opened cursor can be reused and must be freed explicitly, regardless of
+ * whether it was opened in a read-only or a write transaction. The REASON for
+ * this is to eliminate ambiguity, which helps to avoid errors such as
+ * use-after-free and double-free, i.e. memory corruption and segfaults.
+ *
+ * For read-only transactions, obviously there is nothing to commit to storage.
+ * (!) Another notable difference between MDBX and LMDB is that MDBX makes
+ * handles opened for existing databases immediately available to other
+ * transactions, regardless of whether this transaction is aborted or reset.
+ * The REASON for this is to avoid the need to open the same handles multiple
+ * times in concurrent read transactions, and to track such open but hidden
+ * handles until the completion of the read transactions which opened them.
+ *
+ * In addition, as long as a transaction is open, a consistent view of the
+ * database is kept alive, which requires storage. A read-only transaction that
+ * no longer requires this consistent view should be terminated (committed or
+ * aborted) when the view is no longer needed (but see below for an
+ * optimization).
+ *
+ * There can be multiple simultaneously active read-only transactions but only
+ * one that can write. Once a single read-write transaction is opened, all
+ * further attempts to begin one will block until the first one is committed or
+ * aborted. This has no effect on read-only transactions, however, and they may
+ * continue to be opened at any time.
+ *
+ *
+ * DUPLICATE KEYS
+ *
+ * mdbx_get() and mdbx_put() respectively have no and only some support for
+ * multiple key-value pairs with identical keys. If there are multiple values
+ * for a key, mdbx_get() will only return the first value.
+ *
+ * When multiple values for one key are required, pass the MDBX_DUPSORT flag to
+ * mdbx_dbi_open(). In an MDBX_DUPSORT database, by default mdbx_put() will not
+ * replace the value for a key if the key existed already. Instead it will add
+ * the new value to the key. In addition, mdbx_del() will pay attention to the
+ * value field too, allowing for specific values of a key to be deleted.
+ *
+ * Finally, additional cursor operations become available for traversing through
+ * and retrieving duplicate values.
+ *
+ *
+ * SOME OPTIMIZATION
+ *
+ * If you frequently begin and abort read-only transactions, as an optimization,
+ * it is possible to only reset and renew a transaction.
+ *
+ * mdbx_txn_reset() releases any old copies of data kept around for a read-only
+ * transaction. To reuse this reset transaction, call mdbx_txn_renew() on it.
+ * Any cursors in this transaction can also be renewed using mdbx_cursor_renew()
+ * or freed by mdbx_cursor_close().
+ *
+ * To permanently free a transaction, reset or not, use mdbx_txn_abort().
+ *
+ *
+ * CLEANING UP
+ *
+ * Any created cursors must be closed using mdbx_cursor_close(). It is worth
+ * repeating:
+ * (!) An important difference between MDBX and LMDB is that in MDBX any
+ * opened cursor can be reused and must be freed explicitly, regardless of
+ * whether it was opened in a read-only or a write transaction. The REASON for
+ * this is to eliminate ambiguity, which helps to avoid errors such as
+ * use-after-free and double-free, i.e. memory corruption and segfaults.
+ *
+ * It is very rarely necessary to close a database handle, and in general they
+ * should just be left open. When you close a handle, it immediately becomes
+ * unavailable for all transactions in the environment. Therefore, you should
+ * avoid closing the handle while at least one transaction is using it.
+ *
+ *
+ * THE FULL API
+ *
+ * The full MDBX documentation lists further details below,
+ * like how to:
+ *
+ * - configure database size and automatic size management
+ * - drop and clean a database
+ * - detect and report errors
+ * - optimize (bulk) loading speed
+ * - (temporarily) reduce robustness to gain even more speed
+ * - gather statistics about the database
+ * - define custom sort orders
+ * - estimate size of range query result
+ * - double performance by LIFO reclaiming on storages with write-back
+ * - use sequences and canary markers
+ * - use lack-of-space callback (aka OOM-KICK)
+ * - use exclusive mode
+ *
+ *
+ **** RESTRICTIONS & CAVEATS ***************************************************
+ * in addition to those listed for some functions.
+ *
+ * - Troubleshooting the LCK-file.
+ * 1. A broken LCK-file can cause sync issues, including the appearance of
+ * wrong/inconsistent data for readers. When a database is opened in the
+ * cooperative read-write mode, the LCK-file must be mapped to memory
+ * with read-write access. In this case a stray or malfunctioning
+ * application can always write through pointers and silently corrupt
+ * the LCK-file.
+ *
+ * Unfortunately, there is no portable way to prevent such
+ * corruption, since the LCK-file is updated concurrently by
+ * multiple processes in a lock-free manner and any locking is
+ * unwise due to a large overhead.
+ *
+ * The "next" version of libmdbx (MithrilDB) will solve this issue.
+ *
+ * Workaround: Just make all programs using the database close it;
+ * the LCK-file is always reset on first open.
+ *
+ * 2. Stale reader transactions left behind by an aborted program cause
+ * further writes to grow the database quickly, and stale locks can
+ * block further operation.
+ * MDBX checks for stale readers while opening the environment and
+ * before growing the database. But in some cases, this may not be enough.
+ *
+ * Workaround: Check for stale readers periodically, using the
+ * mdbx_reader_check() function or the mdbx_stat tool.
+ *
+ * 3. Stale writers will be cleared automatically by MDBX on supported
+ * platforms. But this is platform-specific and depends especially on the
+ * implementation of shared POSIX mutexes and support for robust
+ * mutexes. For instance there are no known issues on Linux, OSX,
+ * Windows and FreeBSD.
+ *
+ * Workaround: Otherwise just make all programs using the database
+ * close it; the LCK-file is always reset on first open
+ * of the environment.
+ *
+ * - Do not use MDBX databases on remote filesystems, even between processes
+ * on the same host. This breaks file locks on some platforms, possibly
+ * memory map sync, and certainly sync between programs on different hosts.
+ *
+ * On the other hand, MDBX supports exclusive database operation over
+ * a network, and cooperative read-only access to a database placed on
+ * a read-only network share.
+ *
+ * - Do not use opened MDBX_env instance(s) in child processes after fork().
+ * It would be insane to call fork() and any MDBX functions simultaneously
+ * from multiple threads. The best way is to prevent the presence of open
+ * MDBX-instances during fork().
+ *
+ * The MDBX_TXN_CHECKPID build-time option, which is ON by default on
+ * non-Windows platforms (i.e. where fork() is available), enables PID
+ * checking at a few critical points. This does not give any guarantees;
+ * it only allows you to detect such errors a little sooner. Depending on
+ * the platform, you should expect an application crash and/or database
+ * corruption in such cases.
+ *
+ * On the other hand, MDBX allows calling mdbx_env_close() in such cases to
+ * release resources, but nothing more, and in general this is the wrong way.
+ *
+ * - There is no pure read-only mode in the usual explicit sense, since
+ * readers need write access to the LCK-file to be visible to the writer.
+ * MDBX always tries to open/create the LCK-file for read-write, but switches
+ * to without-LCK mode on appropriate errors (EROFS, EACCES, EPERM)
+ * if the read-only mode was requested by the MDBX_RDONLY flag which is
+ * described below.
+ *
+ * The "next" version of libmdbx (MithrilDB) will solve this issue.
+ *
+ * - A thread can only use one transaction at a time, plus any nested
+ * read-write transactions in the non-writemap mode. Each transaction
+ * belongs to one thread. The MDBX_NOTLS flag changes this for read-only
+ * transactions. See below.
+ *
+ * - Do not open the same MDBX database twice in the same process at the same
+ * time. By default MDBX prevents this in most cases by tracking database
+ * openings and returning MDBX_BUSY if the LCK-file is already open.
+ *
+ * The reason for this is that when the "Open file description" locks (aka
+ * OFD-locks) are not available, MDBX uses POSIX locks on files, and these
+ * locks have issues if one process opens a file multiple times. If a single
+ * process opens the same environment multiple times, closing it once will
+ * remove all the locks held on it, and the other instances will be
+ * vulnerable to corruption from other processes.
+ *
+ * For compatibility with LMDB, which allows multi-opening, MDBX can be
+ * configured at runtime by mdbx_setup_debug(MDBX_DBG_LEGACY_MULTIOPEN, ...)
+ * prior to calling other MDBX functions. In this mode MDBX tracks database
+ * openings, detects multi-opening cases and then recovers POSIX file
+ * locks as necessary. However, lock recovery can cause unexpected pauses,
+ * such as when another process opened the database in exclusive mode before
+ * the lock was restored: we have to wait until such a process releases the
+ * database, and so on.
+ *
+ * - Avoid long-lived transactions, especially in the scenarios with a high
+ * rate of write transactions. Read transactions prevent reuse of pages
+ * freed by newer write transactions, thus the database can grow quickly.
+ * Write transactions prevent other write transactions, since writes are
+ * serialized.
+ *
+ * Understanding the problem of long-lived read transactions requires some
+ * explanation, which can be difficult to grasp quickly. So it is
+ * reasonable to simplify this as follows:
+ * 1. Garbage collection problem exists in all databases one way or
+ * another, e.g. VACUUM in PostgreSQL. But in _libmdbx_ it's even more
+ * discernible because of high transaction rate and intentional
+ * internals simplification in favor of performance.
+ *
+ * 2. MDBX employs multiversion concurrency control on a copy-on-write
+ * basis, which allows multiple readers to run in parallel with a write
+ * transaction without blocking. Each write transaction needs free
+ * pages for the changed data; those pages are placed in the new
+ * b-tree snapshot at commit. MDBX efficiently recycles pages from
+ * previously created unused snapshots, BUT this is impossible if any
+ * read transaction is still using such a snapshot.
+ *
+ * 3. Thus massive altering of data during a parallel long read operation
+ * will increase the process's working set and may exhaust the entire
+ * free database space.
+ *
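The recycling rule in the numbered points above can be sketched as a toy model. This is a deliberate simplification and not MDBX's actual reclaiming code: it assumes snapshots are numbered by increasing transaction id, and `pages_reusable` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, not MDBX internals: can pages retired by the write transaction
 * that produced snapshot `retired_at` be recycled, given the snapshot ids
 * that active readers are pinned to? */
static int pages_reusable(uint64_t retired_at,
                          const uint64_t *reader_snapshots,
                          size_t reader_count) {
  for (size_t i = 0; i < reader_count; ++i) {
    /* A reader on an older snapshot may still reference the old pages. */
    if (reader_snapshots[i] < retired_at)
      return 0;
  }
  return 1;
}
```

In this model a single reader left on an old snapshot blocks recycling for every later write transaction, which is why the database keeps growing while that reader stays open.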
+ * A good example of long readers is a hot backup to a slow destination
+ * or debugging of a client application while retaining an active read
+ * transaction. In LMDB this results in a MAP_FULL error and subsequent
+ * write performance degradation.
+ *
+ * MDBX mostly solves the "long-lived" readers issue by the lack-of-space
+ * callback, which allows aborting long readers, and by the MDBX_LIFORECLAIM
+ * mode, which addresses the subsequent performance degradation.
+ * The "next" version of libmdbx (MithrilDB) will completely solve this.
+ *
+ * - Avoid suspending a process with active transactions. These would then be
+ * "long-lived" as above.
+ *
+ * The "next" version of libmdbx (MithrilDB) will solve this issue.
+ *
+ * - Avoid aborting a process with an active read-only transaction in scenarios
+ * with a high rate of write transactions. The transaction becomes "long-lived"
+ * as above until a check for stale readers is performed or the LCK-file is
+ * reset, since the process may not remove it from the lockfile. This does
+ * not apply to write transactions if the system clears stale writers, see
+ * above.
+ *
+ * - An MDBX database configuration will often reserve considerable unused
+ * memory address space and maybe file size for future growth. This does
+ * not use actual memory or disk space, but users may need to understand
+ * the difference so they won't be scared off.
+ *
+ * - The Write Amplification Factor.
+ * TBD.
+ *
+ **** LICENSE AND COPYRUSTING **************************************************
*
* Copyright 2015-2019 Leonid Yuriev <leo@yuriev.ru>
* and other libmdbx authors: please see AUTHORS file.
@@ -12,13 +473,13 @@
* top-level directory of the distribution or, alternatively, at
* <http://www.OpenLDAP.org/license.html>.
*
- * ---
+ * ---
*
* This code is derived from "LMDB engine" written by
* Howard Chu (Symas Corporation), which itself derived from btree.c
* written by Martin Hedenfalk.
*
- * ---
+ * ---
*
* Portions Copyright 2011-2015 Howard Chu, Symas Corp. All rights reserved.
*
@@ -30,7 +491,7 @@
* top-level directory of the distribution or, alternatively, at
* <http://www.OpenLDAP.org/license.html>.
*
- * ---
+ * ---
*
* Portions Copyright (c) 2009, 2010 Martin Hedenfalk <martin@bzero.se>
*
@@ -44,28 +505,22 @@
* ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */
-
-/* ACKNOWLEDGEMENTS ************************************************************
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ *
+ **** ACKNOWLEDGEMENTS *********************************************************
*
* Howard Chu (Symas Corporation) - the author of LMDB,
* from which originated the MDBX in 2015.
*
* Martin Hedenfalk <martin@bzero.se> - the author of `btree.c` code,
- * which was used for begin development of LMDB. */
+ * which was used for begin development of LMDB.
+ *
+ ******************************************************************************/
#pragma once
#ifndef LIBMDBX_H
#define LIBMDBX_H
-/* IMPENDING CHANGES WARNING ***************************************************
- *
- * MDBX is under active non-public development, database format and API
- * will be refined. New version won't be backwards compatible. Main focus
- * of the rework is to provide clear and robust API and new features.
- *
- ******************************************************************************/
-
#ifdef _MSC_VER
#pragma warning(push, 1)
#pragma warning(disable : 4548) /* expression before comma has no effect; \
@@ -106,6 +561,7 @@ typedef DWORD mdbx_tid_t;
#define MDBX_EPERM ERROR_INVALID_FUNCTION
#define MDBX_EINTR ERROR_CANCELLED
#define MDBX_ENOFILE ERROR_FILE_NOT_FOUND
+#define MDBX_EREMOTE ERROR_REMOTE_STORAGE_MEDIA_ERROR
#else
@@ -131,6 +587,7 @@ typedef pthread_t mdbx_tid_t;
#define MDBX_EPERM EPERM
#define MDBX_EINTR EINTR
#define MDBX_ENOFILE ENOENT
+#define MDBX_EREMOTE ENOTBLK
#endif
@@ -138,11 +595,21 @@ typedef pthread_t mdbx_tid_t;
#pragma warning(pop)
#endif
-/*--------------------------------------------------------------------------*/
+/*----------------------------------------------------------------------------*/
#ifndef __has_attribute
#define __has_attribute(x) (0)
+#endif /* __has_attribute */
+
+#ifndef __deprecated
+#if defined(__GNUC__) || __has_attribute(__deprecated__)
+#define __deprecated __attribute__((__deprecated__))
+#elif defined(_MSC_VER)
+#define __deprecated __declspec(deprecated)
+#else
+#define __deprecated
#endif
+#endif /* __deprecated */
#ifndef __dll_export
#if defined(_WIN32) || defined(__CYGWIN__)
@@ -174,78 +641,113 @@ typedef pthread_t mdbx_tid_t;
#endif
#endif /* __dll_import */
-/*--------------------------------------------------------------------------*/
+/*----------------------------------------------------------------------------*/
#define MDBX_VERSION_MAJOR 0
-#define MDBX_VERSION_MINOR 3
+#define MDBX_VERSION_MINOR 4
+#ifndef LIBMDBX_API
#if defined(LIBMDBX_EXPORTS)
#define LIBMDBX_API __dll_export
#elif defined(LIBMDBX_IMPORTS)
#define LIBMDBX_API __dll_import
#else
#define LIBMDBX_API
+#endif
#endif /* LIBMDBX_API */
#ifdef __cplusplus
extern "C" {
#endif
+/**** MDBX version information ************************************************/
+
+#if defined(LIBMDBX_IMPORTS)
+#define LIBMDBX_VERINFO_API __dll_import
+#else
+#define LIBMDBX_VERINFO_API __dll_export
+#endif /* LIBMDBX_VERINFO_API */
+
typedef struct mdbx_version_info {
uint8_t major;
uint8_t minor;
uint16_t release;
uint32_t revision;
- struct {
- const char *datetime;
- const char *tree;
- const char *commit;
- const char *describe;
+ struct /* source info from git */ {
+ const char *datetime /* committer date, strict ISO-8601 format */;
+ const char *tree /* tree hash, i.e. digest of the source code */;
+ const char *commit /* commit hash (hexadecimal digits) */;
+ const char *describe /* git-describe string */;
} git;
+ const char *sourcery /* sourcery anchor for pinning */;
} mdbx_version_info;
+extern LIBMDBX_VERINFO_API const mdbx_version_info mdbx_version;
+/* MDBX build information.
+ * WARNING: Some strings could be NULL in case no corresponding information was
+ * provided at build time (i.e. flags). */
typedef struct mdbx_build_info {
- const char *datetime;
- const char *target;
- const char *options;
- const char *compiler;
- const char *flags;
+ const char *datetime /* build timestamp (ISO-8601 or __DATE__ __TIME__) */;
+ const char *target /* cpu/arch-system-config triplet */;
+ const char *options /* mdbx-related options */;
+ const char *compiler /* compiler */;
+ const char *flags /* CFLAGS */;
} mdbx_build_info;
-
-extern LIBMDBX_API const mdbx_version_info mdbx_version;
-extern LIBMDBX_API const mdbx_build_info mdbx_build;
+extern LIBMDBX_VERINFO_API const mdbx_build_info mdbx_build;
#if defined(_WIN32) || defined(_WIN64)
-#ifndef MDBX_BUILD_DLL
-
-/* Dll initialization callback for ability to dynamically load MDBX DLL by
- * LoadLibrary() on Windows versions before Windows Vista. This function MUST be
- * called once from DllMain() for each reason (DLL_PROCESS_ATTACH,
- * DLL_PROCESS_DETACH, DLL_THREAD_ATTACH and DLL_THREAD_DETACH). Do this
- * carefully and ONLY when actual Windows version don't support initialization
- * via "TLS Directory" (e.g .CRT$XL[A-Z] sections in executable or dll file). */
+#if !MDBX_BUILD_SHARED_LIBRARY
+
+/* MDBX internally uses global and thread local storage destructors for
+ * automatic (de)initialization, releasing reader lock table slots
+ * and so on.
+ *
+ * If MDBX is built as a DLL this is done out-of-the-box by the DllEntry()
+ * function, which is called automatically by the Windows core with the
+ * corresponding reason argument.
+ *
+ * Otherwise, if MDBX is not built as a DLL, some black magic
+ * may be required depending on the Windows version:
+ * - Modern Windows versions, including Windows Vista and later, provide
+ * support for "TLS Directory" (e.g .CRT$XL[A-Z] sections in executable
+ * or dll file). In this case, MDBX is capable of doing everything
+ * automatically, and you do not need to call mdbx_dll_handler().
+ * - Obsolete versions of Windows, prior to Windows Vista, REQUIRE calling
+ * mdbx_dll_handler() manually from the corresponding DllMain() or WinMain()
+ * of your DLL or application.
+ * - This behavior is controlled by the MDBX_CONFIG_MANUAL_TLS_CALLBACK
+ * option, which is determined by default according to the target version
+ * of Windows at build time.
+ * But you may override MDBX_CONFIG_MANUAL_TLS_CALLBACK in special cases.
+ *
+ * Therefore, building MDBX as a DLL is recommended for all versions of
+ * Windows. So, if in doubt, just build MDBX as a separate DLL and don't
+ * worry. */
#ifndef MDBX_CONFIG_MANUAL_TLS_CALLBACK
+#if defined(_WIN32_WINNT_VISTA) && WINVER >= _WIN32_WINNT_VISTA
+/* As described above mdbx_dll_handler() is NOT needed for Windows Vista
+ * and later. */
#define MDBX_CONFIG_MANUAL_TLS_CALLBACK 0
+#else
+/* As described above mdbx_dll_handler() IS REQUIRED for Windows versions
+ * prior to Windows Vista. */
+#define MDBX_CONFIG_MANUAL_TLS_CALLBACK 1
#endif
+#endif /* MDBX_CONFIG_MANUAL_TLS_CALLBACK */
+
#if MDBX_CONFIG_MANUAL_TLS_CALLBACK
-void LIBMDBX_API NTAPI mdbx_dll_callback(PVOID module, DWORD reason,
- PVOID reserved);
+void LIBMDBX_API NTAPI mdbx_dll_handler(PVOID module, DWORD reason,
+ PVOID reserved);
#endif /* MDBX_CONFIG_MANUAL_TLS_CALLBACK */
-#endif /* MDBX_BUILD_DLL */
+#endif /* !MDBX_BUILD_SHARED_LIBRARY */
#endif /* Windows */
-/* The name of the lock file in the DB environment */
-#define MDBX_LOCKNAME "/mdbx.lck"
-/* The name of the data file in the DB environment */
-#define MDBX_DATANAME "/mdbx.dat"
-/* The suffix of the lock file when no subdir is used */
-#define MDBX_LOCK_SUFFIX "-lck"
+/**** OPAQUE STRUCTURES *******************************************************/
/* Opaque structure for a database environment.
*
- * A DB environment supports multiple databases, all residing in the same
- * shared-memory map. */
+ * An environment supports multiple key-value databases (aka key-value spaces
+ * or tables), all residing in the same shared-memory map. */
typedef struct MDBX_env MDBX_env;
/* Opaque structure for a transaction handle.
@@ -254,87 +756,552 @@ typedef struct MDBX_env MDBX_env;
* read-only or read-write. */
typedef struct MDBX_txn MDBX_txn;
-/* A handle for an individual database in the DB environment. */
+/* A handle for an individual database (key-value space) in the environment.
+ * Zero handle is used internally (hidden Garbage Collection DB).
+ * So, any valid DBI-handle is greater than 0 and less than or equal to
+ * MDBX_MAX_DBI. */
typedef uint32_t MDBX_dbi;
+#define MDBX_MAX_DBI UINT32_C(32765)
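The range rule stated in the comment above can be sketched as a small check. This is only an illustration: `example_dbi_in_range()` is a hypothetical helper, not a libmdbx function, and it tests the numeric range only, not whether the handle is actually open in a given environment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Definitions copied from the header text above. */
typedef uint32_t MDBX_dbi;
#define MDBX_MAX_DBI UINT32_C(32765)

/* Hypothetical helper (an assumption, not part of libmdbx): true when `dbi`
 * lies in the range the comment describes, i.e. greater than 0 (the zero
 * handle is reserved for the hidden Garbage Collection DB) and less than or
 * equal to MDBX_MAX_DBI. */
bool example_dbi_in_range(MDBX_dbi dbi) {
  return dbi > 0 && dbi <= MDBX_MAX_DBI;
}
```

A caller could use such a check to reject obviously invalid handles early, before passing them to the library.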
/* Opaque structure for navigating through a database */
typedef struct MDBX_cursor MDBX_cursor;
-/* Generic structure used for passing keys and data in and out
- * of the database.
+/* Generic structure used for passing keys and data in and out of the database.
*
* Values returned from the database are valid only until a subsequent
* update operation, or the end of the transaction. Do not modify or
* free them, they commonly point into the database itself.
*
- * Key sizes must be between 1 and mdbx_env_get_maxkeysize() inclusive.
+ * Key sizes must be between 0 and mdbx_env_get_maxkeysize() inclusive.
* The same applies to data sizes in databases with the MDBX_DUPSORT flag.
- * Other data items can in theory be from 0 to 0xffffffff bytes long. */
+ * Other data items can in theory be from 0 to 0x7fffffff bytes long.
+ *
+ * (!) The notable difference between MDBX and LMDB is that MDBX supports
+ * zero-length keys. */
#ifndef HAVE_STRUCT_IOVEC
struct iovec {
- void *iov_base;
- size_t iov_len;
+ void *iov_base /* pointer to some data */;
+ size_t iov_len /* the length of data in bytes */;
};
#define HAVE_STRUCT_IOVEC
#endif /* HAVE_STRUCT_IOVEC */
+#if defined(__sun) || defined(__SVR4) || defined(__svr4__)
+/* The `iov_len` is signed on Sun/Solaris.
+ * So define custom MDBX_val to avoid a lot of warnings. */
+typedef struct MDBX_val {
+ void *iov_base /* pointer to some data */;
+ size_t iov_len /* the length of data in bytes */;
+} MDBX_val;
+#else
typedef struct iovec MDBX_val;
+#endif
/* The maximum size of a data item.
 * MDBX only stores a 32-bit value for node sizes. */
#define MDBX_MAXDATASIZE INT32_MAX
-/* A callback function used to compare two keys in a database */
-typedef int(MDBX_cmp_func)(const MDBX_val *a, const MDBX_val *b);
+/**** DEBUG & LOGGING **********************************************************
+ * Logging and runtime debug flags.
+ *
+ * NOTE: Most debug features are enabled only when libmdbx is built with
+ * the MDBX_DEBUG option.
+ */
+
+/* Log level (requires building libmdbx with MDBX_DEBUG) */
+#define MDBX_LOG_FATAL 0 /* critical conditions, i.e. assertion failures */
+#define MDBX_LOG_ERROR 1 /* error conditions */
+#define MDBX_LOG_WARN 2 /* warning conditions */
+#define MDBX_LOG_NOTICE 3 /* normal but significant condition */
+#define MDBX_LOG_VERBOSE 4 /* verbose informational */
+#define MDBX_LOG_DEBUG 5 /* debug-level messages */
+#define MDBX_LOG_TRACE 6 /* trace debug-level messages */
+#define MDBX_LOG_EXTRA 7 /* extra debug-level messages (dump pgno lists) */
+
+/* Runtime debug flags.
+ *
+ * MDBX_DBG_DUMP and MDBX_DBG_LEGACY_MULTIOPEN always have an effect,
+ * but MDBX_DBG_ASSERT, MDBX_DBG_AUDIT and MDBX_DBG_JITTER take effect only
+ * if libmdbx is built with MDBX_DEBUG. */
+
+#define MDBX_DBG_ASSERT 1 /* Enable assertion checks */
+#define MDBX_DBG_AUDIT 2 /* Enable pages usage audit at commit transactions */
+#define MDBX_DBG_JITTER 4 /* Enable small random delays in critical points */
+#define MDBX_DBG_DUMP 8 /* Include or not database(s) in coredump files */
+#define MDBX_DBG_LEGACY_MULTIOPEN 16 /* Enable multi-opening environment(s) */
+
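+/* A debug-logger callback function,
+ * called to deliver each debug/log message.
+ *
+ * [in] loglevel The severity of the message, one of the MDBX_LOG_* values.
+ * [in] function The name of the function that emitted the message.
+ * [in] line The source line number that emitted the message.
+ * [in] msg The message text, formatted together with `args`.
+ * [in] args The arguments for the message. */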
+typedef void MDBX_debug_func(int loglevel, const char *function, int line,
+ const char *msg, va_list args);
+
+/* FIXME: Complete description */
+LIBMDBX_API int mdbx_setup_debug(int loglevel, int flags,
+ MDBX_debug_func *logger);
+
+/* A callback function for most MDBX assert() failures,
+ * called before printing the message and aborting.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] msg The assertion message, not including newline. */
+typedef void MDBX_assert_func(const MDBX_env *env, const char *msg,
+ const char *function, unsigned line);
-/* Environment Flags */
-/* no environment directory */
+/* Set or reset the assert() callback of the environment.
+ *
+ * Does nothing if libmdbx was built with MDBX_DEBUG=0 or with NDEBUG,
+ * and will return MDBX_ENOSYS in such case.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] func An MDBX_assert_func function, or 0.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_env_set_assert(MDBX_env *env, MDBX_assert_func *func);
+
+/* FIXME: Complete description */
+LIBMDBX_API const char *mdbx_dump_val(const MDBX_val *key, char *const buf,
+ const size_t bufsize);
+
+/**** THE FILES ****************************************************************
+ * At the file system level, the environment corresponds to a pair of files. */
+
+/* The name of the lock file in the environment */
+#define MDBX_LOCKNAME "/mdbx.lck"
+/* The name of the data file in the environment */
+#define MDBX_DATANAME "/mdbx.dat"
+
+/* The suffix of the lock file when MDBX_NOSUBDIR is used */
+#define MDBX_LOCK_SUFFIX "-lck"
+
+/**** ENVIRONMENT FLAGS *******************************************************/
+
+/* MDBX_NOSUBDIR = no environment directory.
+ *
+ * By default, MDBX creates its environment in a directory whose pathname is
+ * given in path, and creates its data and lock files under that directory.
+ * With this option, path is used as-is for the database main data file.
+ * The database lock file is the path with "-lck" appended.
+ *
+ * - with MDBX_NOSUBDIR = the filesystem holds a pair of MDBX files whose
+ * names are derived from the given pathname by appending predefined
+ * suffixes.
+ *
+ * - without MDBX_NOSUBDIR = the filesystem holds an MDBX directory with the
+ * given pathname, containing a pair of MDBX files with predefined names.
+ *
+ * This flag takes effect only when opening the environment and can't be
+ * changed afterwards. */
#define MDBX_NOSUBDIR 0x4000u
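The two file layouts described above can be sketched as a path-derivation routine. This is a minimal illustration built only from the MDBX_LOCKNAME, MDBX_DATANAME and MDBX_LOCK_SUFFIX values in this header; `example_env_paths()` is a hypothetical helper, and the way libmdbx itself builds these names internally is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Values copied from the header text above. */
#define MDBX_LOCKNAME "/mdbx.lck"
#define MDBX_DATANAME "/mdbx.dat"
#define MDBX_LOCK_SUFFIX "-lck"

/* Hypothetical helper (an assumption, not part of libmdbx): writes the data
 * and lock file names implied by `path` into `dat` and `lck`, each a buffer
 * of `size` bytes. Returns 0 on success, -1 if either name does not fit. */
int example_env_paths(const char *path, int nosubdir, char *dat, char *lck,
                      size_t size) {
  assert(path != NULL && dat != NULL && lck != NULL);
  int dat_len, lck_len;
  if (nosubdir) {
    /* MDBX_NOSUBDIR: path is used as-is for the data file,
     * and the lock file is the path with "-lck" appended. */
    dat_len = snprintf(dat, size, "%s", path);
    lck_len = snprintf(lck, size, "%s%s", path, MDBX_LOCK_SUFFIX);
  } else {
    /* Default: path names a directory holding both files. */
    dat_len = snprintf(dat, size, "%s%s", path, MDBX_DATANAME);
    lck_len = snprintf(lck, size, "%s%s", path, MDBX_LOCKNAME);
  }
  if (dat_len < 0 || lck_len < 0)
    return -1;
  return ((size_t)dat_len < size && (size_t)lck_len < size) ? 0 : -1;
}
```

For a path such as "/tmp/db", the sketch yields "/tmp/db/mdbx.dat" and "/tmp/db/mdbx.lck" by default, and "/tmp/db" plus "/tmp/db-lck" with MDBX_NOSUBDIR.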
-/* don't fsync after commit */
-#define MDBX_NOSYNC 0x10000u
-/* read only */
+
+/* MDBX_RDONLY = read only mode.
+ *
+ * Open the environment in read-only mode. No write operations will be allowed.
+ * MDBX will still modify the lock file - except on read-only filesystems,
+ * where MDBX does not use locks.
+ *
+ * - with MDBX_RDONLY = open environment in read-only mode.
+ * MDBX supports pure read-only mode (i.e. without opening LCK-file) only
+ * when environment directory and/or both files are not writable (and the
+ * LCK-file may be missing). This allows the file(s) to be placed
+ * on a read-only network share.
+ *
+ * - without MDBX_RDONLY = open environment in read-write mode.
+ *
+ * This flag takes effect only when opening the environment and can't be
+ * changed afterwards. */
#define MDBX_RDONLY 0x20000u
-/* don't fsync metapage after commit */
-#define MDBX_NOMETASYNC 0x40000u
-/* use writable mmap */
+
+/* MDBX_EXCLUSIVE = open environment in exclusive/monopolistic mode.
+ *
+ * MDBX_EXCLUSIVE flag can be used as a replacement for MDB_NOLOCK, which isn't
+ * supported by MDBX. In this way, you can get the minimal overhead, but with
+ * the correct multi-process and multi-thread locking.
+ *
+ * - with MDBX_EXCLUSIVE = open environment in exclusive/monopolistic mode
+ * or return MDBX_BUSY if environment is already used by another process.
+ * The main feature of the exclusive mode is the ability to open the
+ * environment placed on a network share.
+ *
+ * - without MDBX_EXCLUSIVE = open environment in cooperative mode,
+ * i.e. for multi-process access/interaction/cooperation.
+ * The main requirements of the cooperative mode are:
+ * 1. data files MUST be placed in the LOCAL file system,
+ * but NOT on a network share.
+ * 2. environment MUST be opened only by LOCAL processes,
+ * but NOT over a network.
+ * 3. OS kernel (i.e. file system and memory mapping implementation) and
+ * all processes that open the given environment MUST be running
+ * in the physically single RAM with cache-coherency. (The only
+ * exception to the cache-coherency requirement is Linux on MIPS
+ * architecture, but this case has not been tested for a long time.)
+ *
+ * This flag takes effect only when opening the environment and can't be
+ * changed afterwards. */
+#define MDBX_EXCLUSIVE 0x400000u
+
+/* MDBX_ACCEDE = using database which already opened by another process(es).
+ *
+ * The MDBX_ACCEDE flag avoids the MDBX_INCOMPATIBLE error when opening a
+ * database that is already used by another process(es) whose environment
+ * mode/flags aren't compatible. In such cases, when using the MDBX_ACCEDE
+ * flag, instead of the specified incompatible options, the mode in which the
+ * database is already opened by other processes will be used, including
+ * MDBX_LIFORECLAIM, MDBX_COALESCE and MDBX_NORDAHEAD. The MDBX_ACCEDE flag is
+ * useful for opening a database that is already used by another process(es)
+ * when the mode/flags in use aren't known.
+ *
+ * MDBX_ACCEDE has no effect if the current process is the only one either
+ * opening the DB in read-only mode or other process(es) uses the DB in
+ * read-only mode. */
+#define MDBX_ACCEDE 0x40000000u
+
+/* MDBX_WRITEMAP = map data into memory with write permission.
+ *
+ * Use a writeable memory map unless MDBX_RDONLY is set. This uses fewer mallocs
+ * and requires much less work for tracking database pages, but loses protection
+ * from application bugs like wild pointer writes and other bad updates into the
+ * database. This may be slightly faster for DBs that fit entirely in RAM, but
+ * is slower for DBs larger than RAM. Also adds the possibility for stray
+ * application writes through pointers to silently corrupt the database.
+ * Incompatible with nested transactions.
+ *
+ * NOTE: The MDBX_WRITEMAP mode is incompatible with nested transactions, since
+ * combining them would be unreasonable: nested transactions require
+ * allocating database pages with malloc and more work for tracking
+ * them, which negates the performance boost of the MDBX_WRITEMAP mode.
+ *
+ * NOTE: MDBX doesn't allow mixing processes with and without MDBX_WRITEMAP on
+ * the same environment. In such a case MDBX_INCOMPATIBLE will be returned.
+ *
+ * - with MDBX_WRITEMAP = all data will be mapped into memory in the read-write
+ * mode. This offers a significant performance benefit, since the data will
+ * be modified directly in mapped memory and then flushed to disk by a
+ * single system call, without any memory management or copying.
+ * (!) On the other hand, MDBX_WRITEMAP adds the possibility for stray
+ * application writes through pointers to silently corrupt the database.
+ * Moreover, MDBX_WRITEMAP disallows nested write transactions.
+ *
+ * - without MDBX_WRITEMAP = data will be mapped into memory in the read-only
+ * mode. This requires keeping all modified database pages in memory and
+ * then writing them to disk through file operations.
+ *
+ * This flag takes effect only when opening the environment and can't be
+ * changed afterwards. */
#define MDBX_WRITEMAP 0x80000u
-/* use asynchronous msync when MDBX_WRITEMAP is used */
-#define MDBX_MAPASYNC 0x100000u
-/* tie reader locktable slots to MDBX_txn objects instead of to threads */
+
+/* MDBX_NOTLS = tie reader locktable slots to read-only transactions instead
+ * of to threads.
+ *
+ * Don't use Thread-Local Storage; instead tie reader locktable slots to
+ * MDBX_txn objects rather than to threads. So, mdbx_txn_reset() keeps the slot
+ * reserved for the MDBX_txn object. A thread may use parallel read-only
+ * transactions. And a read-only transaction may span threads if you
+ * synchronize its use.
+ *
+ * Applications that multiplex many user threads over individual OS threads need
+ * this option. Such an application must also serialize the write transactions
+ * in an OS thread, since MDBX's write locking is unaware of the user threads.
+ *
+ * NOTE: Regardless of the MDBX_NOTLS flag, a write transaction should
+ * always be used entirely in one thread from start to finish. MDBX checks
+ * this in a reasonable manner and returns the MDBX_THREAD_MISMATCH error
+ * when the rule is violated.
+ *
+ * This flag takes effect only when opening the environment and can't be
+ * changed afterwards. */
#define MDBX_NOTLS 0x200000u
-/* open DB in exclusive/monopolistic mode. */
-#define MDBX_EXCLUSIVE 0x400000u
-/* don't do readahead */
+
+/* MDBX_NORDAHEAD = don't do readahead.
+ *
+ * Turn off readahead. Most operating systems perform readahead on read requests
+ * by default. This option turns it off if the OS supports it. Turning it off
+ * may help random read performance when the DB is larger than RAM and system
+ * RAM is full.
+ *
+ * By default libmdbx dynamically enables/disables readahead depending on the
+ * actual database size and currently available memory. On the other hand, such
+ * automation has a limitation: it can be performed only when the DB size
+ * changes, and can't track and react to changes in free RAM availability,
+ * since that changes independently and asynchronously.
+ *
+ * NOTE: The mdbx_is_readahead_reasonable() function allows you to quickly find
+ * out whether to use readahead or not based on the size of the data and
+ * the amount of available memory.
+ *
+ * This flag takes effect only when opening the environment and can't be
+ * changed afterwards. */
#define MDBX_NORDAHEAD 0x800000u
-/* don't initialize malloc'd memory before writing to datafile */
+
+/* MDBX_NOMEMINIT = don't initialize malloc'd memory before writing to datafile.
+ *
+ * Don't initialize malloc'd memory before writing to unused spaces in the data
+ * file. By default, memory for pages written to the data file is obtained using
+ * malloc. While these pages may be reused in subsequent transactions, freshly
+ * malloc'd pages will be initialized to zeroes before use. This avoids
+ * persisting leftover data from other code (that used the heap and subsequently
+ * freed the memory) into the data file.
+ *
+ * Note that many other system libraries may allocate and free memory from the
+ * heap for arbitrary uses. E.g., stdio may use the heap for file I/O buffers.
+ * This initialization step has a modest performance cost so some applications
+ * may want to disable it using this flag. This option can be a problem for
+ * applications which handle sensitive data like passwords, and it makes memory
+ * checkers like Valgrind noisy. This flag is not needed with MDBX_WRITEMAP,
+ * which writes directly to the mmap instead of using malloc for pages. The
+ * initialization is also skipped if MDBX_RESERVE is used; the caller is
+ * expected to overwrite all of the memory that was reserved in that case.
+ *
+ * This flag may be changed at any time using mdbx_env_set_flags(). */
#define MDBX_NOMEMINIT 0x1000000u
-/* aim to coalesce FreeDB records */
+
+/* MDBX_COALESCE = aim to coalesce Garbage Collection items.
+ *
+ * With the MDBX_COALESCE flag MDBX will aim to coalesce items while recycling
+ * Garbage Collection. Technically, when possible, short lists of pages will
+ * be combined into longer ones, as long as they still fit on one database
+ * page. As a result, there will be fewer items in Garbage Collection and the
+ * page lists will be longer, which slightly increases the likelihood of
+ * returning pages to Unallocated space and reducing the database file.
+ *
+ * This flag may be changed at any time using mdbx_env_set_flags(). */
#define MDBX_COALESCE 0x2000000u
-/* LIFO policy for reclaiming FreeDB records */
+
+/* MDBX_LIFORECLAIM = LIFO policy for recycling Garbage Collection items.
+ *
+ * MDBX_LIFORECLAIM flag turns on LIFO policy for recycling Garbage
+ * Collection items, instead of FIFO by default. On systems with a disk
+ * write-back cache, this can significantly increase write performance, up to
+ * several times in a best case scenario.
+ *
+ * LIFO recycling policy means that the pages taken for reuse are those that
+ * became unused most recently (i.e. just now). Therefore the loop of
+ * database pages circulation becomes as short as possible. In other words, the
+ * number of pages that are overwritten in memory and on disk during a series
+ * of write transactions will be as small as possible. This creates ideal
+ * conditions for the efficient operation of the disk write-back cache.
+ *
+ * MDBX_LIFORECLAIM is compatible with all no-sync flags (i.e. MDBX_NOMETASYNC,
+ * MDBX_NOSYNC, MDBX_UTTERLY_NOSYNC, MDBX_MAPASYNC), but gives no noticeable
+ * impact in combination with MDBX_NOSYNC and MDBX_MAPASYNC, because MDBX will
+ * not reuse pages from the last "steady" MVCC-snapshot and later, i.e. the
+ * loop length of database pages circulation will be mostly defined by frequency
+ * of calling mdbx_env_sync() rather than LIFO and FIFO difference.
+ *
+ * This flag may be changed at any time using mdbx_env_set_flags(). */
#define MDBX_LIFORECLAIM 0x4000000u
-/* make a steady-sync only on close and explicit env-sync */
-#define MDBX_UTTERLY_NOSYNC (MDBX_NOSYNC | MDBX_MAPASYNC)
-/* debuging option, fill/perturb released pages */
+
+/* Debugging option, fill/perturb released pages. */
#define MDBX_PAGEPERTURB 0x8000000u
-/* Database Flags */
-/* use reverse string keys */
+/**** SYNC MODES ***************************************************************
+ * (!!!) Using any combination of MDBX_NOSYNC, MDBX_NOMETASYNC, MDBX_MAPASYNC
+ * and especially MDBX_UTTERLY_NOSYNC always trades durability
+ * for write performance. You must know exactly what you are doing and
+ * what risks you are taking!
+ *
+ * NOTE for LMDB users: MDBX_NOSYNC is NOT similar to LMDB's MDB_NOSYNC, but
+ * MDBX_UTTERLY_NOSYNC exactly matches LMDB's MDB_NOSYNC.
+ * See details below.
+ *
+ * THE SCENE:
+ * - The DAT-file contains several MVCC-snapshots of B-tree at the same time,
+ * each of those B-trees has its own root page.
+ * - Each of meta pages at the beginning of the DAT file contains a pointer
+ * to the root page of B-tree which is the result of the particular
+ * transaction, and a number of this transaction.
+ * - For data durability, MDBX must first write all MVCC-snapshot data pages
+ * and ensure that they are written to the disk, then update a meta page
+ * with the new transaction number and a pointer to the corresponding new
+ * root page, and flush any buffers yet again.
+ * - Thus during commit the I/O buffers should be flushed to the disk twice;
+ * i.e. fdatasync(), FlushFileBuffers() or similar syscall should be called
+ * twice for each commit. This is very expensive for performance, but
+ * guarantees durability even on unexpected system failure or power outage.
+ * Of course, provided that the operating system and the underlying hardware
+ * (e.g. disk) work correctly.
+ *
+ * TRADE-OFF: By skipping some stages described above, you can significantly
+ * benefit in speed, while partially or completely losing the guarantee of
+ * data durability and/or consistency in the event of system or power failure.
+ * Moreover, if for any reason disk write order is not preserved, then at the
+ * moment of a system crash, a meta-page with a pointer to the new B-tree may
+ * be written to disk, while the B-tree itself is not yet. In that case, the
+ * database will be corrupted!
+ *
+ *
+ * MDBX_NOMETASYNC = don't sync the meta-page after commit.
+ *
+ * Flush system buffers to disk only once per transaction, omit the
+ * metadata flush. Defer that until the system flushes files to disk,
+ * or next non-MDBX_RDONLY commit or mdbx_env_sync(). Depending on the
+ * platform and hardware, with MDBX_NOMETASYNC you may get a doubling of
+ * write performance.
+ *
+ * This trade-off maintains database integrity, but a system crash may
+ * undo the last committed transaction. I.e. it preserves the ACI
+ * (atomicity, consistency, isolation) but not D (durability) database
+ * property.
+ *
+ * MDBX_NOMETASYNC flag may be changed at any time using
+ * mdbx_env_set_flags() or by passing it to mdbx_txn_begin() for a
+ * particular write transaction.
+ *
+ *
+ * MDBX_UTTERLY_NOSYNC = don't sync anything and wipe previous steady commits.
+ *
+ * Don't flush system buffers to disk when committing a transaction. This
+ * optimization means a system crash can corrupt the database, if buffers
+ * are not yet flushed to disk. Depending on the platform and hardware,
+ * with MDBX_UTTERLY_NOSYNC you may get a manifold increase in write
+ * performance, even 100 times or more.
+ *
+ * If the filesystem preserves write order (which is rare and never
+ * provided unless explicitly noted) and the MDBX_WRITEMAP and
+ * MDBX_LIFORECLAIM flags are not used, then a system crash can't corrupt
+ * the database, but you can lose the last transactions, if at least one
+ * buffer is not yet flushed to disk. The risk is governed by how often the
+ * system flushes dirty buffers to disk and how often mdbx_env_sync() is
+ * called. So, transactions exhibit ACI (atomicity, consistency, isolation)
+ * properties and only lose D (durability). I.e. database integrity is
+ * maintained, but a system crash may undo the final transactions.
+ *
+ * Otherwise, if the filesystem does not preserve write order (which is
+ * typical) or the MDBX_WRITEMAP or MDBX_LIFORECLAIM flags are used, you
+ * should expect a corrupted database after a system crash.
+ *
+ * So, the most important things about MDBX_UTTERLY_NOSYNC:
+ * - a system crash immediately after committing a write transaction
+ * is highly likely to lead to database corruption.
+ * - successful completion of mdbx_env_sync(force = true) after one or
+ * more committed transactions guarantees consistency and durability.
+ * - BUT by committing two or more transactions you put the database back
+ * into a weak state, in which a system crash may lead to database
+ * corruption! In the case of a single transaction after mdbx_env_sync(),
+ * you may lose the transaction itself, but not the whole database.
+ *
+ * Nevertheless, MDBX_UTTERLY_NOSYNC provides ACID in case of an application
+ * crash, and therefore may be very useful in scenarios where data
+ * durability is not required over a system failure (e.g. for short-lived
+ * data), or if you can ignore such risk.
+ *
+ * MDBX_UTTERLY_NOSYNC flag may be changed at any time using
+ * mdbx_env_set_flags(), but has no effect if passed to mdbx_txn_begin()
+ * for a particular write transaction.
+ *
+ *
+ * MDBX_NOSYNC = don't sync anything but keep previous steady commits.
+ *
+ * Like MDBX_UTTERLY_NOSYNC, the MDBX_NOSYNC flag disables flushing
+ * system buffers to disk when committing a transaction. But there is a
+ * huge difference in how the MVCC snapshots corresponding to previous
+ * "steady" transactions are recycled (see below).
+ *
+ * Depending on the platform and hardware, with MDBX_NOSYNC you may get
+ * a manifold increase in write performance, even 10 times or more.
+ * NOTE that (MDBX_NOSYNC | MDBX_WRITEMAP) leaves the system with no hint
+ * for when to write transactions to disk. Therefore the (MDBX_MAPASYNC |
+ * MDBX_WRITEMAP) may be preferable, but without MDBX_NOSYNC because
+ * the (MDBX_MAPASYNC | MDBX_NOSYNC) actually gives MDBX_UTTERLY_NOSYNC.
+ *
+ * In contrast to MDBX_UTTERLY_NOSYNC mode, with the MDBX_NOSYNC flag MDBX
+ * keeps untouched the pages within the B-tree of the last "steady"
+ * transaction, i.e. the one which was synced to disk completely. This has
+ * big implications for both data durability and (unfortunately)
+ * performance:
+ * - a system crash can't corrupt the database, but you will lose the
+ * last transactions, because MDBX will roll back to the last steady
+ * commit since it is kept explicitly.
+ * - the last steady transaction has an effect similar to a "long-lived"
+ * read transaction (see above in the "RESTRICTIONS & CAVEATS" section)
+ * since it prevents reuse of pages freed by newer write transactions,
+ * thus any data changes will be placed in newly allocated pages.
+ * - to avoid rapid database growth, the system will sync data and issue
+ * a steady commit-point to resume reusing pages, each time there is
+ * insufficient space and before increasing the size of the file on
+ * disk.
+ *
+ * In other words, with the MDBX_NOSYNC flag MDBX protects you from
+ * corruption of the whole database, at the cost of increasing the database
+ * size and/or number of disk IOPS. So, the MDBX_NOSYNC flag could be used
+ * with mdbx_env_sync() as an alternative to batch committing or nested
+ * transactions (in some cases). As well, the auto-sync feature exposed by
+ * the mdbx_env_set_syncbytes() and mdbx_env_set_syncperiod() functions
+ * could be very useful with the MDBX_NOSYNC flag.
+ *
+ * The number and volume of disk IOPS with the MDBX_NOSYNC flag will be
+ * exactly the same as without any no-sync flags. However, you should expect
+ * a larger process working set (https://bit.ly/2kA2tFX) and significantly
+ * worse locality of reference (https://bit.ly/2mbYq2J), due to the
+ * more intensive allocation of previously unused pages and the increased
+ * size of the database.
+ *
+ * MDBX_NOSYNC flag may be changed at any time using
+ * mdbx_env_set_flags() or by passing it to mdbx_txn_begin() for a
+ * particular write transaction.
+ *
+ *
+ * MDBX_MAPASYNC = use asynchronous msync when MDBX_WRITEMAP is used.
+ *
+ * MDBX_MAPASYNC is meaningful and takes effect only in conjunction
+ * with MDBX_WRITEMAP or MDBX_NOSYNC:
+ * - with MDBX_NOSYNC it actually gives MDBX_UTTERLY_NOSYNC, which
+ * wipes previous steady commits to reuse pages as described above.
+ * - with MDBX_WRITEMAP but without MDBX_NOSYNC it instructs MDBX to use
+ * asynchronous mmap-flushes to disk as described below.
+ * - with both MDBX_WRITEMAP and MDBX_NOSYNC you get both effects.
+ *
+ * Asynchronous mmap-flushes mean that all writes will actually be scheduled
+ * and performed by the operating system in its own manner, i.e. unordered.
+ * MDBX itself just notifies the operating system that it would be nice to
+ * write data to disk, but no more.
+ *
+ * With the MDBX_MAPASYNC flag, but without MDBX_UTTERLY_NOSYNC (i.e. without
+ * OR'ing with MDBX_NOSYNC), MDBX keeps untouched the pages within the B-tree
+ * of the last "steady" transaction, which was synced to disk completely.
+ * So, this has exactly the same "long-lived" impact and the same
+ * consequences as described above for the MDBX_NOSYNC flag.
+ *
+ * Depending on the platform and hardware, with the combination of
+ * MDBX_WRITEMAP and MDBX_MAPASYNC you may get a manifold increase in write
+ * performance, even 25 times or more. MDBX_MAPASYNC flag may be changed at
+ * any time using mdbx_env_set_flags() or by passing it to mdbx_txn_begin()
+ * for a particular write transaction.
+ */
+
+/* Don't sync meta-page after commit,
+ * see description in the "SYNC MODES" section above. */
+#define MDBX_NOMETASYNC 0x40000u
+
+/* Don't sync anything but keep previous steady commits,
+ * see description in the "SYNC MODES" section above.
+ *
+ * (!) don't combine this flag with MDBX_MAPASYNC
+ * since you will get MDBX_UTTERLY_NOSYNC in that way (see below) */
+#define MDBX_NOSYNC 0x10000u
+
+/* Use asynchronous msync when MDBX_WRITEMAP is used,
+ * see description in the "SYNC MODES" section above.
+ *
+ * (!) don't combine this flag with MDBX_NOSYNC
+ * since you will get MDBX_UTTERLY_NOSYNC in that way (see below) */
+#define MDBX_MAPASYNC 0x100000u
+
+/* Don't sync anything and wipe previous steady commits,
+ * see description in the "SYNC MODES" section above. */
+#define MDBX_UTTERLY_NOSYNC (MDBX_NOSYNC | MDBX_MAPASYNC)
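How the sync flags combine can be sketched with plain bit tests. The flag values below are copied from this header; the two `example_*` functions are hypothetical helpers written for illustration and are not part of the libmdbx API.

```c
#include <assert.h>

/* Flag values copied from the header text above. */
#define MDBX_NOSYNC 0x10000u
#define MDBX_NOMETASYNC 0x40000u
#define MDBX_WRITEMAP 0x80000u
#define MDBX_MAPASYNC 0x100000u
#define MDBX_UTTERLY_NOSYNC (MDBX_NOSYNC | MDBX_MAPASYNC)

/* Compile-time check that the combined flag is the OR of its two parts. */
static_assert(MDBX_UTTERLY_NOSYNC == 0x110000u,
              "MDBX_UTTERLY_NOSYNC must equal MDBX_NOSYNC | MDBX_MAPASYNC");

/* Hypothetical helper: non-zero when `flags` contains both MDBX_NOSYNC and
 * MDBX_MAPASYNC, which the text above says amounts to MDBX_UTTERLY_NOSYNC. */
int example_is_utterly_nosync(unsigned flags) {
  return (flags & MDBX_UTTERLY_NOSYNC) == MDBX_UTTERLY_NOSYNC;
}

/* Hypothetical helper: non-zero when MDBX_MAPASYNC is set together with
 * MDBX_WRITEMAP or MDBX_NOSYNC, the only combinations in which the text
 * above describes it as meaningful. */
int example_mapasync_is_meaningful(unsigned flags) {
  return (flags & MDBX_MAPASYNC) != 0 &&
         (flags & (MDBX_WRITEMAP | MDBX_NOSYNC)) != 0;
}
```

A check like `example_is_utterly_nosync()` could be used to warn when an application requests MDBX_NOSYNC and MDBX_MAPASYNC separately and ends up in the weakest durability mode without intending to.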
+
+/**** DATABASE FLAGS **********************************************************/
+/* Use reverse string keys */
#define MDBX_REVERSEKEY 0x02u
-/* use sorted duplicates */
+/* Use sorted duplicates */
#define MDBX_DUPSORT 0x04u
-/* numeric keys in native byte order, either uint32_t or uint64_t.
+/* Numeric keys in native byte order, either uint32_t or uint64_t.
* The keys must all be of the same size. */
#define MDBX_INTEGERKEY 0x08u
-/* with MDBX_DUPSORT, sorted dup items have fixed size */
+/* With MDBX_DUPSORT, sorted dup items have fixed size */
#define MDBX_DUPFIXED 0x10u
-/* with MDBX_DUPSORT, dups are MDBX_INTEGERKEY-style integers */
+/* With MDBX_DUPSORT, dups are MDBX_INTEGERKEY-style integers */
#define MDBX_INTEGERDUP 0x20u
-/* with MDBX_DUPSORT, use reverse string dups */
+/* With MDBX_DUPSORT, use reverse string dups */
#define MDBX_REVERSEDUP 0x40u
-/* create DB if not already existing */
+/* Create DB if not already existing */
#define MDBX_CREATE 0x40000u
-/* Write Flags */
+/**** DATA UPDATE FLAGS *******************************************************/
/* For put: Don't write if the key already exists. */
#define MDBX_NOOVERWRITE 0x10u
/* Only for MDBX_DUPSORT
@@ -355,16 +1322,15 @@ typedef int(MDBX_cmp_func)(const MDBX_val *a, const MDBX_val *b);
/* Store multiple data items in one call. Only for MDBX_DUPFIXED. */
#define MDBX_MULTIPLE 0x80000u
-/* Transaction Flags */
+/**** TRANSACTION FLAGS *******************************************************/
/* Do not block when starting a write transaction */
#define MDBX_TRYTXN 0x10000000u
-/* Copy Flags */
-/* Compacting copy: Omit free space from copy, and renumber all
- * pages sequentially. */
+/**** ENVIRONMENT COPY FLAGS **************************************************/
+/* Compacting: Omit free space from copy, and renumber all pages sequentially */
#define MDBX_CP_COMPACT 1u
-/* Cursor Get operations.
+/**** CURSOR OPERATIONS ********************************************************
*
* This is the set of all operations for retrieving data
* using a cursor. */
@@ -384,8 +1350,8 @@ typedef enum MDBX_cursor_op {
MDBX_NEXT, /* Position at next data item */
MDBX_NEXT_DUP, /* MDBX_DUPSORT-only: Position at next data item
* of current key. */
- MDBX_NEXT_MULTIPLE, /* MDBX_DUPFIXED-only: Return up to a page of duplicate
- * data items from next cursor position.
+ MDBX_NEXT_MULTIPLE, /* MDBX_DUPFIXED-only: Return up to a page of
+ * duplicate data items from next cursor position.
* Move cursor to prepare for MDBX_NEXT_MULTIPLE. */
MDBX_NEXT_NODUP, /* Position at first data item of next key */
MDBX_PREV, /* Position at previous data item */
@@ -400,12 +1366,13 @@ typedef enum MDBX_cursor_op {
* return up to a page of duplicate data items. */
} MDBX_cursor_op;
-/* Return Codes
+/**** ERRORS & RETURN CODES ****************************************************
* BerkeleyDB uses -30800 to -30999, we'll go under them */
/* Successful result */
#define MDBX_SUCCESS 0
#define MDBX_RESULT_FALSE MDBX_SUCCESS
+/* Successful result with special meaning or a flag */
#define MDBX_RESULT_TRUE (-1)
/* key/data pair already exists */
@@ -414,9 +1381,9 @@ typedef enum MDBX_cursor_op {
#define MDBX_NOTFOUND (-30798)
/* Requested page not found - this usually indicates corruption */
#define MDBX_PAGE_NOTFOUND (-30797)
-/* Located page was wrong type */
+/* Database is corrupted (page was wrong type and so on) */
#define MDBX_CORRUPTED (-30796)
-/* Update of meta page failed or environment had fatal error */
+/* Environment had a fatal error (e.g. update of meta page failed and so on) */
#define MDBX_PANIC (-30795)
/* DB file version mismatch with libmdbx */
#define MDBX_VERSION_MISMATCH (-30794)
@@ -452,13 +1419,14 @@ typedef enum MDBX_cursor_op {
#define MDBX_BAD_DBI (-30780)
/* Unexpected problem - txn should abort */
#define MDBX_PROBLEM (-30779)
-/* Another write transaction is running */
+/* Another write transaction is running or environment is already used while
+ * opening with MDBX_EXCLUSIVE flag */
#define MDBX_BUSY (-30778)
/* The last defined error code */
#define MDBX_LAST_ERRCODE MDBX_BUSY
/* The mdbx_put() or mdbx_replace() was called for key,
- that has more that one associated value. */
+ * that has more than one associated value. */
#define MDBX_EMULTIVAL (-30421)
/* Bad signature of a runtime object(s), this can mean:
@@ -482,40 +1450,7 @@ typedef enum MDBX_cursor_op {
* e.g. a transaction that started by another thread. */
#define MDBX_THREAD_MISMATCH (-30416)
-/* Statistics for a database in the environment */
-typedef struct MDBX_stat {
- uint32_t ms_psize; /* Size of a database page.
- * This is currently the same for all databases. */
- uint32_t ms_depth; /* Depth (height) of the B-tree */
- uint64_t ms_branch_pages; /* Number of internal (non-leaf) pages */
- uint64_t ms_leaf_pages; /* Number of leaf pages */
- uint64_t ms_overflow_pages; /* Number of overflow pages */
- uint64_t ms_entries; /* Number of data items */
-} MDBX_stat;
-
-/* Information about the environment */
-typedef struct MDBX_envinfo {
- struct {
- uint64_t lower; /* lower limit for datafile size */
- uint64_t upper; /* upper limit for datafile size */
- uint64_t current; /* current datafile size */
- uint64_t shrink; /* shrink threshold for datafile */
- uint64_t grow; /* growth step for datafile */
- } mi_geo;
- uint64_t mi_mapsize; /* Size of the data memory map */
- uint64_t mi_last_pgno; /* ID of the last used page */
- uint64_t mi_recent_txnid; /* ID of the last committed transaction */
- uint64_t mi_latter_reader_txnid; /* ID of the last reader transaction */
- uint64_t mi_self_latter_reader_txnid; /* ID of the last reader transaction of
- caller process */
- uint64_t mi_meta0_txnid, mi_meta0_sign;
- uint64_t mi_meta1_txnid, mi_meta1_sign;
- uint64_t mi_meta2_txnid, mi_meta2_sign;
- uint32_t mi_maxreaders; /* max reader slots in the environment */
- uint32_t mi_numreaders; /* max reader slots used in the environment */
- uint32_t mi_dxb_pagesize; /* database pagesize */
- uint32_t mi_sys_pagesize; /* system pagesize */
-} MDBX_envinfo;
+/**** FUNCTIONS & RELATED STRUCTURES ******************************************/
/* Return a string describing a given error code.
*
@@ -525,164 +1460,111 @@ typedef struct MDBX_envinfo {
* is less than 0, an error string corresponding to the MDBX library error is
* returned. See errors for a list of MDBX-specific error codes.
*
- * [in] err The error code
+ * mdbx_strerror() - is NOT thread-safe because it may share a common internal
+ * buffer for system messages. The returned string must
+ * NOT be modified by the application, but MAY be modified
+ * by a subsequent call to mdbx_strerror(), strerror() and
+ * other related functions.
+ *
+ * mdbx_strerror_r() - is thread-safe since it uses a user-supplied buffer where
+ * appropriate. The returned string must NOT be modified
+ * by the application, since it may be a pointer to an internal
+ * constant string. However, there is no restriction if the
+ * returned string points to the supplied buffer.
+ *
+ * [in] err The error code.
*
- * Returns "error message" The description of the error */
+ * Returns "error message" The description of the error. */
LIBMDBX_API const char *mdbx_strerror(int errnum);
LIBMDBX_API const char *mdbx_strerror_r(int errnum, char *buf, size_t buflen);
-/* Create an MDBX environment handle.
+#if defined(_WIN32) || defined(_WIN64)
+/* Bit of Windows' madness. Similar functions, but they return Windows
+ * error messages in the OEM encoding for console utilities. */
+LIBMDBX_API const char *mdbx_strerror_ANSI2OEM(int errnum);
+LIBMDBX_API const char *mdbx_strerror_r_ANSI2OEM(int errnum, char *buf,
+ size_t buflen);
+#endif /* Bit of Windows' madness */
+
+/* Create an MDBX environment instance.
*
* This function allocates memory for a MDBX_env structure. To release
* the allocated memory and discard the handle, call mdbx_env_close().
* Before the handle may be used, it must be opened using mdbx_env_open().
+ *
* Various other options may also need to be set before opening the handle,
- * e.g. mdbx_env_set_mapsize(), mdbx_env_set_maxreaders(),
+ * e.g. mdbx_env_set_geometry(), mdbx_env_set_maxreaders(),
* mdbx_env_set_maxdbs(), depending on usage requirements.
*
- * [out] env The address where the new handle will be stored
+ * [out] env The address where the new handle will be stored.
*
- * Returns A non-zero error value on failure and 0 on success. */
+ * Returns a non-zero error value on failure and 0 on success. */
LIBMDBX_API int mdbx_env_create(MDBX_env **penv);
-/* Open an environment handle.
+/* Open an environment instance.
*
- * If this function fails, mdbx_env_close() must be called to discard
- * the MDBX_env handle.
+ * Whether this function fails or not, mdbx_env_close() must be
+ * called later to discard the MDBX_env handle and release associated resources.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] path The directory in which the database files reside.
- * This directory must already exist and be writable.
- * [in] flags Special options for this environment. This parameter
- * must be set to 0 or by bitwise OR'ing together one
- * or more of the values described here.
+ * [in] env An environment handle returned by mdbx_env_create()
+ * [in] pathname The directory in which the database files reside.
+ * This directory must already exist and be writable.
+ * [in] flags Special options for this environment. This parameter
+ * must be set to 0 or by bitwise OR'ing together one
+ * or more of the values described above in the
+ * "ENVIRONMENT FLAGS" and "SYNC MODES" sections.
*
* Flags set by mdbx_env_set_flags() are also used:
- * - MDBX_NOSUBDIR
- * By default, MDBX creates its environment in a directory whose
- * pathname is given in path, and creates its data and lock files
- * under that directory. With this option, path is used as-is for
- * the database main data file. The database lock file is the path
- * with "-lock" appended.
+ * - MDBX_NOSUBDIR, MDBX_RDONLY, MDBX_EXCLUSIVE, MDBX_WRITEMAP, MDBX_NOTLS,
+ * MDBX_NORDAHEAD, MDBX_NOMEMINIT, MDBX_COALESCE, MDBX_LIFORECLAIM.
+ * See "ENVIRONMENT FLAGS" section above.
*
- * - MDBX_RDONLY
- * Open the environment in read-only mode. No write operations will
- * be allowed. MDBX will still modify the lock file - except on
- * read-only filesystems, where MDBX does not use locks.
- *
- * - MDBX_WRITEMAP
- * Use a writeable memory map unless MDBX_RDONLY is set. This uses fewer
- * mallocs but loses protection from application bugs like wild pointer
- * writes and other bad updates into the database.
- * This may be slightly faster for DBs that fit entirely in RAM,
- * but is slower for DBs larger than RAM.
- * Incompatible with nested transactions.
- * Do not mix processes with and without MDBX_WRITEMAP on the same
- * environment. This can defeat durability (mdbx_env_sync etc).
- *
- * - MDBX_NOMETASYNC
- * Flush system buffers to disk only once per transaction, omit the
- * metadata flush. Defer that until the system flushes files to disk,
- * or next non-MDBX_RDONLY commit or mdbx_env_sync(). This optimization
- * maintains database integrity, but a system crash may undo the last
- * committed transaction. I.e. it preserves the ACI (atomicity,
- * consistency, isolation) but not D (durability) database property.
- * This flag may be changed at any time using mdbx_env_set_flags().
- *
- * - MDBX_NOSYNC
- * Don't flush system buffers to disk when committing a transaction.
- * This optimization means a system crash can corrupt the database or
- * lose the last transactions if buffers are not yet flushed to disk.
- * The risk is governed by how often the system flushes dirty buffers
- * to disk and how often mdbx_env_sync() is called. However, if the
- * filesystem preserves write order and the MDBX_WRITEMAP and/or
- * MDBX_LIFORECLAIM flags are not used, transactions exhibit ACI
- * (atomicity, consistency, isolation) properties and only lose D
- * (durability). I.e. database integrity is maintained, but a system
- * crash may undo the final transactions.
- *
- * Note that (MDBX_NOSYNC | MDBX_WRITEMAP) leaves the system with no
- * hint for when to write transactions to disk.
- * Therefore the (MDBX_MAPASYNC | MDBX_WRITEMAP) may be preferable.
- * This flag may be changed at any time using mdbx_env_set_flags().
- *
- * - MDBX_UTTERLY_NOSYNC (internally MDBX_NOSYNC | MDBX_MAPASYNC)
- * FIXME: TODO
- *
- * - MDBX_MAPASYNC
- * When using MDBX_WRITEMAP, use asynchronous flushes to disk. As with
- * MDBX_NOSYNC, a system crash can then corrupt the database or lose
- * the last transactions. Calling mdbx_env_sync() ensures on-disk
- * database integrity until next commit. This flag may be changed at
- * any time using mdbx_env_set_flags().
- *
- * - MDBX_NOTLS
- * Don't use Thread-Local Storage. Tie reader locktable slots to
- * MDBX_txn objects instead of to threads. I.e. mdbx_txn_reset() keeps
- * the slot reserved for the MDBX_txn object. A thread may use parallel
- * read-only transactions. A read-only transaction may span threads if
- * the user synchronizes its use. Applications that multiplex many
- * user threads over individual OS threads need this option. Such an
- * application must also serialize the write transactions in an OS
- * thread, since MDBX's write locking is unaware of the user threads.
- *
- * - MDBX_NOLOCK (don't supported by MDBX)
- * Don't do any locking. If concurrent access is anticipated, the
- * caller must manage all concurrency itself. For proper operation
- * the caller must enforce single-writer semantics, and must ensure
- * that no readers are using old transactions while a writer is
- * active. The simplest approach is to use an exclusive lock so that
- * no readers may be active at all when a writer begins.
- *
- * - MDBX_NORDAHEAD
- * Turn off readahead. Most operating systems perform readahead on
- * read requests by default. This option turns it off if the OS
- * supports it. Turning it off may help random read performance
- * when the DB is larger than RAM and system RAM is full.
- *
- * - MDBX_NOMEMINIT
- * Don't initialize malloc'd memory before writing to unused spaces
- * in the data file. By default, memory for pages written to the data
- * file is obtained using malloc. While these pages may be reused in
- * subsequent transactions, freshly malloc'd pages will be initialized
- * to zeroes before use. This avoids persisting leftover data from other
- * code (that used the heap and subsequently freed the memory) into the
- * data file. Note that many other system libraries may allocate and free
- * memory from the heap for arbitrary uses. E.g., stdio may use the heap
- * for file I/O buffers. This initialization step has a modest performance
- * cost so some applications may want to disable it using this flag. This
- * option can be a problem for applications which handle sensitive data
- * like passwords, and it makes memory checkers like Valgrind noisy. This
- * flag is not needed with MDBX_WRITEMAP, which writes directly to the
- * mmap instead of using malloc for pages. The initialization is also
- * skipped if MDBX_RESERVE is used; the caller is expected to overwrite
- * all of the memory that was reserved in that case. This flag may be
- * changed at any time using mdbx_env_set_flags().
- *
- * - MDBX_COALESCE
- * Aim to coalesce records while reclaiming FreeDB. This flag may be
- * changed at any time using mdbx_env_set_flags().
- * FIXME: TODO
- *
- * - MDBX_LIFORECLAIM
- * LIFO policy for reclaiming FreeDB records. This significantly reduce
- * write IPOs in case MDBX_NOSYNC with periodically checkpoints.
- * FIXME: TODO
- *
- * [in] mode The UNIX permissions to set on created files.
+ * - MDBX_NOMETASYNC, MDBX_NOSYNC, MDBX_UTTERLY_NOSYNC, MDBX_MAPASYNC.
+ * See "SYNC MODES" section above.
+ *
+ * NOTE: The MDB_NOLOCK flag is not supported by MDBX;
+ * try using MDBX_EXCLUSIVE as a replacement.
+ *
+ * NOTE: MDBX doesn't allow mixing processes with different MDBX_WRITEMAP,
+ * MDBX_NOSYNC, MDBX_NOMETASYNC, MDBX_MAPASYNC flags on the same
+ * environment. In such a case MDBX_INCOMPATIBLE will be returned.
+ *
+ * If the database already exists and the parameters specified earlier by
+ * mdbx_env_set_geometry() are incompatible (for instance, a different page
+ * size), then mdbx_env_open() will return the MDBX_INCOMPATIBLE error.
+ *
+ * [in] mode The UNIX permissions to set on created files. A zero value means
+ * open an existing environment, but do not create one.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_VERSION_MISMATCH - the version of the MDBX library doesn't match the
+ * - MDBX_VERSION_MISMATCH = the version of the MDBX library doesn't match the
* version that created the database environment.
- * - MDBX_INVALID - the environment file headers are corrupted.
- * - MDBX_ENOENT - the directory specified by the path parameter
- * doesn't exist.
- * - MDBX_EACCES - the user didn't have permission to access
- * the environment files.
- * - MDBX_EAGAIN - the environment was locked by another process. */
-LIBMDBX_API int mdbx_env_open(MDBX_env *env, const char *path, unsigned flags,
- mode_t mode);
+ * - MDBX_INVALID = the environment file headers are corrupted.
+ * - MDBX_ENOENT = the directory specified by the pathname parameter
+ * doesn't exist.
+ * - MDBX_EACCES = the user didn't have permission to access
+ * the environment files.
+ * - MDBX_EAGAIN = the environment was locked by another process.
+ * - MDBX_BUSY = MDBX_EXCLUSIVE flag was specified and the
+ * environment is in use by another process,
+ * or the current process tries to open environment
+ * more than once.
+ * - MDBX_INCOMPATIBLE = Environment is already opened by another process,
+ * but with a different set of MDBX_WRITEMAP,
+ * MDBX_NOSYNC, MDBX_NOMETASYNC, MDBX_MAPASYNC
+ * flags.
+ * Or the database already exists and the
+ * parameters specified earlier by
+ * mdbx_env_set_geometry() are incompatible
+ * (for instance, a different page size).
+ * - MDBX_WANNA_RECOVERY = MDBX_RDONLY flag was specified but read-write
+ * access is required to rollback inconsistent state
+ * after a system crash.
+ * - MDBX_TOO_LARGE = Database is too large for this process, i.e.
+ * 32-bit process tries to open >4Gb database. */
+LIBMDBX_API int mdbx_env_open(MDBX_env *env, const char *pathname,
+ unsigned flags, mode_t mode);
/* Copy an MDBX environment to the specified path, with options.
*
@@ -694,7 +1576,7 @@ LIBMDBX_API int mdbx_env_open(MDBX_env *env, const char *path, unsigned flags,
*
* [in] env An environment handle returned by mdbx_env_create(). It must
* have already been opened successfully.
- * [in] path The directory in which the copy will reside. This directory
+ * [in] dest The directory in which the copy will reside. This directory
* must already exist and be writable but must otherwise be empty.
* [in] flags Special options for this operation. This parameter must be set
* to 0 or by bitwise OR'ing together one or more of the values
@@ -706,11 +1588,8 @@ LIBMDBX_API int mdbx_env_open(MDBX_env *env, const char *path, unsigned flags,
* CPU for processing, but may run more quickly than the default, on
* account of skipping free pages.
*
- * NOTE: Currently it fails if the environment has suffered a page leak.
- *
* Returns A non-zero error value on failure and 0 on success. */
-LIBMDBX_API int mdbx_env_copy(MDBX_env *env, const char *dest_path,
- unsigned flags);
+LIBMDBX_API int mdbx_env_copy(MDBX_env *env, const char *dest, unsigned flags);
/* Copy an MDBX environment to the specified file descriptor,
* with options.
@@ -720,8 +1599,11 @@ LIBMDBX_API int mdbx_env_copy(MDBX_env *env, const char *dest_path,
* mdbx_env_copy() for further details.
*
* NOTE: This call can trigger significant file size growth if run in
- * parallel with write transactions, because it employs a read-only
- * transaction. See long-lived transactions under "Caveats" section.
+ * parallel with write transactions, because it employs a read-only
+ * transaction. See long-lived transactions under "Caveats" section.
+ *
+ * NOTE: Fails if the environment has suffered a page leak and the destination
+ * file descriptor is associated with a pipe, socket, or FIFO.
*
* [in] env An environment handle returned by mdbx_env_create(). It must
* have already been opened successfully.
@@ -734,43 +1616,199 @@ LIBMDBX_API int mdbx_env_copy(MDBX_env *env, const char *dest_path,
LIBMDBX_API int mdbx_env_copy2fd(MDBX_env *env, mdbx_filehandle_t fd,
unsigned flags);
+/* Statistics for a database in the environment */
+typedef struct MDBX_stat {
+ uint32_t ms_psize; /* Size of a database page.
+ * This is the same for all databases. */
+ uint32_t ms_depth; /* Depth (height) of the B-tree */
+ uint64_t ms_branch_pages; /* Number of internal (non-leaf) pages */
+ uint64_t ms_leaf_pages; /* Number of leaf pages */
+ uint64_t ms_overflow_pages; /* Number of overflow pages */
+ uint64_t ms_entries; /* Number of data items */
+ uint64_t ms_mod_txnid; /* Transaction ID of the last committed modification */
+} MDBX_stat;
+
/* Return statistics about the MDBX environment.
*
+ * At least one of the env or txn arguments must be non-null. If txn is
+ * non-null, then stat will be filled according to the given transaction.
+ * Otherwise, if txn is null, then stat will be populated by a snapshot from the
+ * last committed write transaction, so a later call may return different
+ * information.
+ *
+ * The legacy mdbx_env_stat() corresponds to calling mdbx_env_stat_ex() with a
+ * null txn argument.
+ *
* [in] env An environment handle returned by mdbx_env_create()
+ * [in] txn A transaction handle returned by mdbx_txn_begin()
* [out] stat The address of an MDBX_stat structure where the statistics
- * will be copied */
-LIBMDBX_API int mdbx_env_stat(MDBX_env *env, MDBX_stat *stat, size_t bytes);
-LIBMDBX_API int mdbx_env_stat2(const MDBX_env *env, const MDBX_txn *txn,
- MDBX_stat *stat, size_t bytes);
+ * will be copied
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_env_stat_ex(const MDBX_env *env, const MDBX_txn *txn,
+ MDBX_stat *stat, size_t bytes);
+__deprecated LIBMDBX_API int mdbx_env_stat(MDBX_env *env, MDBX_stat *stat,
+ size_t bytes);
+
+/* Information about the environment */
+typedef struct MDBX_envinfo {
+ struct {
+ uint64_t lower; /* lower limit for datafile size */
+ uint64_t upper; /* upper limit for datafile size */
+ uint64_t current; /* current datafile size */
+ uint64_t shrink; /* shrink threshold for datafile */
+ uint64_t grow; /* growth step for datafile */
+ } mi_geo;
+ uint64_t mi_mapsize; /* Size of the data memory map */
+ uint64_t mi_last_pgno; /* ID of the last used page */
+ uint64_t mi_recent_txnid; /* ID of the last committed transaction */
+ uint64_t mi_latter_reader_txnid; /* ID of the last reader transaction */
+ uint64_t mi_self_latter_reader_txnid; /* ID of the last reader transaction of
+ caller process */
+ uint64_t mi_meta0_txnid, mi_meta0_sign;
+ uint64_t mi_meta1_txnid, mi_meta1_sign;
+ uint64_t mi_meta2_txnid, mi_meta2_sign;
+ uint32_t mi_maxreaders; /* max reader slots in the environment */
+ uint32_t mi_numreaders; /* max reader slots used in the environment */
+ uint32_t mi_dxb_pagesize; /* database pagesize */
+ uint32_t mi_sys_pagesize; /* system pagesize */
+
+ struct {
+ /* A mostly unique ID that is regenerated on each boot. As such it can be
+ used to identify the local machine's current boot. MDBX uses it when
+ opening the database to determine whether a rollback to the last
+ steady sync point is required. I.e. if the current bootid differs from the
+ value within the database, then the system was rebooted and all changes
+ since the last steady sync must be reverted for data integrity. Zeros mean
+ that no relevant information is available from the system. */
+ struct {
+ uint64_t l, h;
+ } current, meta0, meta1, meta2;
+ } mi_bootid;
+
+ uint64_t mi_unsync_volume; /* bytes not explicitly synchronized to disk */
+ uint64_t mi_autosync_threshold; /* current auto-sync threshold, see
+ mdbx_env_set_syncbytes(). */
+ uint32_t mi_since_sync_seconds16dot16; /* time since the last steady sync in
+ 1/65536 of second */
+ uint32_t mi_autosync_period_seconds16dot16 /* current auto-sync period in
+ 1/65536 of second, see
+ mdbx_env_set_syncperiod(). */
+ ;
+ uint32_t mi_since_reader_check_seconds16dot16; /* time since the last readers
+ check in 1/65536 of second,
+ see mdbx_reader_check(). */
+ uint32_t mi_mode; /* current environment mode, the same as
+ mdbx_env_get_flags() returns. */
+} MDBX_envinfo;
/* Return information about the MDBX environment.
*
+ * At least one of the env or txn arguments must be non-null. If txn is
+ * non-null, then info will be filled according to the given transaction.
+ * Otherwise, if txn is null, then info will be populated by a snapshot from the
+ * last committed write transaction, so a later call may return different
+ * information.
+ *
+ * The legacy mdbx_env_info() corresponds to calling mdbx_env_info_ex() with a
+ * null txn argument.
+ *
* [in] env An environment handle returned by mdbx_env_create()
+ * [in] txn A transaction handle returned by mdbx_txn_begin()
* [out] info The address of an MDBX_envinfo structure
- * where the information will be copied */
-LIBMDBX_API int mdbx_env_info(MDBX_env *env, MDBX_envinfo *info, size_t bytes);
-LIBMDBX_API int mdbx_env_info2(const MDBX_env *env, const MDBX_txn *txn,
- MDBX_envinfo *info, size_t bytes);
+ * where the information will be copied
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_env_info_ex(const MDBX_env *env, const MDBX_txn *txn,
+ MDBX_envinfo *info, size_t bytes);
+__deprecated LIBMDBX_API int mdbx_env_info(MDBX_env *env, MDBX_envinfo *info,
+ size_t bytes);
-/* Flush the data buffers to disk.
+/* Flush the environment data buffers to disk.
*
- * Data is always written to disk when mdbx_txn_commit() is called,
- * but the operating system may keep it buffered. MDBX always flushes
- * the OS buffers upon commit as well, unless the environment was
- * opened with MDBX_NOSYNC or in part MDBX_NOMETASYNC. This call is
- * not valid if the environment was opened with MDBX_RDONLY.
+ * Unless the environment was opened with no-sync flags (MDBX_NOMETASYNC,
+ * MDBX_NOSYNC, MDBX_UTTERLY_NOSYNC and MDBX_MAPASYNC), data is always
+ * written and flushed to disk when mdbx_txn_commit() is called. Otherwise
+ * mdbx_env_sync() may be called to manually write and flush unsynced data to
+ * disk.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] force If non-zero, force a synchronous flush. Otherwise if the
- * environment has the MDBX_NOSYNC flag set the flushes will be
- * omitted, and with MDBX_MAPASYNC they will be asynchronous.
+ * Besides, mdbx_env_sync_ex() with argument force=false may be used to
+ * provide polling mode for lazy/asynchronous sync in conjunction with
+ * mdbx_env_set_syncbytes() and/or mdbx_env_set_syncperiod().
*
- * Returns A non-zero error value on failure and 0 on success, some
- * possible errors are:
- * - MDBX_EACCES - the environment is read-only.
- * - MDBX_EINVAL - an invalid parameter was specified.
- * - MDBX_EIO - an error occurred during synchronization. */
-LIBMDBX_API int mdbx_env_sync(MDBX_env *env, int force);
+ * The mdbx_env_sync() is a shortcut for calling mdbx_env_sync_ex() with
+ * the force=true and nonblock=false arguments.
+ *
+ * The mdbx_env_sync_poll() is a shortcut for calling mdbx_env_sync_ex() with
+ * the force=false and nonblock=true arguments.
+ *
+ * NOTE: This call is not valid if the environment was opened with MDBX_RDONLY.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] force If non-zero, force a flush. Otherwise, if force is zero, then
+ * will run in polling mode, i.e. it will check the thresholds
+ * that were set by mdbx_env_set_syncbytes() and/or
+ * mdbx_env_set_syncperiod() and perform a flush if at least one
+ * of the thresholds is reached.
+ * [in] nonblock Don't wait if a write transaction is being run by another thread.
+ *
+ * Returns A non-zero error value on failure and MDBX_RESULT_TRUE or 0 on
+ * success. MDBX_RESULT_TRUE means there is no data pending for flush to disk,
+ * and 0 otherwise. Some possible errors are:
+ * - MDBX_EACCES = the environment is read-only.
+ * - MDBX_BUSY = the environment is used by another thread and nonblock=true.
+ * - MDBX_EINVAL = an invalid parameter was specified.
+ * - MDBX_EIO = an error occurred during synchronization. */
+LIBMDBX_API int mdbx_env_sync_ex(MDBX_env *env, int force, int nonblock);
+LIBMDBX_API int mdbx_env_sync(MDBX_env *env);
+LIBMDBX_API int mdbx_env_sync_poll(MDBX_env *env);
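The return convention described above has two distinct success values, which is easy to overlook when checking only for zero. A minimal sketch of telling them apart, assuming the values of MDBX_SUCCESS and MDBX_RESULT_TRUE defined earlier in this header; the enum and the helper classify_sync_result() are hypothetical and not part of libmdbx:

```c
/* Values copied from the definitions earlier in this header. */
#define MDBX_SUCCESS 0
#define MDBX_RESULT_TRUE (-1)

/* Hypothetical classification of an mdbx_env_sync_ex() return value. */
enum sync_outcome {
  SYNC_OK,              /* 0: success, the "otherwise" case in the comment */
  SYNC_NOTHING_PENDING, /* MDBX_RESULT_TRUE: no data pending for flush */
  SYNC_FAILED           /* any other value is treated as an error code */
};

/* Hypothetical helper (not a libmdbx function). */
static enum sync_outcome classify_sync_result(int rc) {
  if (rc == MDBX_SUCCESS)
    return SYNC_OK;
  if (rc == MDBX_RESULT_TRUE)
    return SYNC_NOTHING_PENDING;
  return SYNC_FAILED;
}
```

A caller would pass the result of mdbx_env_sync_ex(), mdbx_env_sync() or mdbx_env_sync_poll() to such a helper instead of testing `rc != 0`, since MDBX_RESULT_TRUE is non-zero but is not a failure.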
+
+/* Sets the threshold to force a flush of the data buffers to disk, even with
+ * the MDBX_NOSYNC, MDBX_NOMETASYNC and MDBX_MAPASYNC flags in the environment.
+ * The threshold value affects all processes which operate on the given
+ * environment until the last process closes it or a new value is set.
+ *
+ * Data is always written to disk when mdbx_txn_commit() is called, but the
+ * operating system may keep it buffered. MDBX always flushes the OS buffers
+ * upon commit as well, unless the environment was opened with MDBX_NOSYNC,
+ * MDBX_MAPASYNC or in part MDBX_NOMETASYNC.
+ *
+ * The default is 0, which means no threshold is checked and no additional
+ * flush will be made.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] threshold The total size of changes, in bytes, at which a synchronous
+ * flush will be made.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_env_set_syncbytes(MDBX_env *env, size_t threshold);
+
+/* Sets the relative period since the last unsteady commit to force a flush of
+ * the data buffers to disk, even with the MDBX_NOSYNC, MDBX_NOMETASYNC and
+ * MDBX_MAPASYNC flags in the environment. The relative period value affects all
+ * processes which operate on the given environment until the last process
+ * closes it or a new value is set.
+ *
+ * Data is always written to disk when mdbx_txn_commit() is called, but the
+ * operating system may keep it buffered. MDBX always flushes the OS buffers
+ * upon commit as well, unless the environment was opened with MDBX_NOSYNC,
+ * MDBX_MAPASYNC or in part MDBX_NOMETASYNC.
+ *
+ * The set period is not checked asynchronously, but only by the
+ * mdbx_txn_commit() and mdbx_env_sync() functions. Therefore, in cases where
+ * transactions are committed infrequently and/or irregularly, polling by
+ * mdbx_env_sync() may be a reasonable solution to timeout enforcement.
+ *
+ * The default is 0, which means no timeout is checked and no additional
+ * flush will be made.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] seconds_16dot16 The period, in 1/65536 of a second, after the last
+ * unsteady commit at which a synchronous flush will be made.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_env_set_syncperiod(MDBX_env *env,
+ unsigned seconds_16dot16);
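The seconds_16dot16 argument, like the mi_*_seconds16dot16 fields of MDBX_envinfo, is a fixed-point count of 1/65536 of a second. A minimal sketch of the conversion in both directions; the helper names seconds_to_16dot16() and seconds_from_16dot16() are hypothetical and are not provided by libmdbx:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of libmdbx): convert seconds to 16.16
 * fixed-point units, i.e. a count of 1/65536 of a second. */
static uint32_t seconds_to_16dot16(double seconds) {
  /* The integer part must fit in the upper 16 bits of the result. */
  assert(seconds >= 0.0 && seconds <= 65535.0);
  return (uint32_t)(seconds * 65536.0 + 0.5); /* round to the nearest unit */
}

/* Hypothetical helper: convert 16.16 fixed-point units back to seconds. */
static double seconds_from_16dot16(uint32_t units) {
  return (double)units / 65536.0;
}
```

Under these assumptions a 1.5 second period would be passed as `mdbx_env_set_syncperiod(env, seconds_to_16dot16(1.5))`, and a field such as mi_since_sync_seconds16dot16 could be printed with seconds_from_16dot16().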
/* Close the environment and release the memory map.
*
@@ -780,107 +1818,310 @@ LIBMDBX_API int mdbx_env_sync(MDBX_env *env, int force);
* The environment handle will be freed and must not be used again after this
* call.
*
- * [in] env An environment handle returned by mdbx_env_create()
+ * The legacy mdbx_env_close() corresponds to calling mdbx_env_close_ex() with
+ * the argument dont_sync=false.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
* [in] dont_sync A don't-sync flag; if non-zero, the last checkpoint (meta-page
* update) will be kept "as is" and may still be "weak" in the
* NOSYNC/MAPASYNC modes. Such a "weak" checkpoint will be
* ignored on the next open, and transactions since the
* last non-weak checkpoint (meta-page update) will be rolled back
- * for consistency guarantee. */
+ * for consistency guarantee.
+ *
+ * Returns A non-zero error value on failure and 0 on success.
+ * Some possible errors are:
+ * - MDBX_BUSY = A write transaction is being run by another thread; in such
+ * a case the MDBX_env instance has NOT been destroyed or released!
+ * NOTE: if any OTHER error code was returned, then the given
+ * MDBX_env instance has been destroyed and released.
+ * - MDBX_PANIC = mdbx_env_close_ex() was called in the child process
+ * after fork(). In this case MDBX_PANIC is expected,
+ * i.e. the MDBX_env instance was freed in the proper manner.
+ * - MDBX_EIO = an error occurred during synchronization. */
+LIBMDBX_API int mdbx_env_close_ex(MDBX_env *env, int dont_sync);
LIBMDBX_API int mdbx_env_close(MDBX_env *env);
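The error list above implies an unusual ownership rule: the handle stays alive only when MDBX_BUSY is returned, and is gone after success or any other error. A minimal sketch of that rule, assuming the MDBX_BUSY value from the error codes listed in this header; the helper env_handle_survives_close() is hypothetical and not part of libmdbx:

```c
#include <stdbool.h>

/* Value copied from the error codes defined earlier in this header. */
#define MDBX_BUSY (-30778)

/* Hypothetical helper (not a libmdbx function): given the value returned by
 * mdbx_env_close_ex(), report whether the MDBX_env handle is still valid.
 * Per the comment above, only MDBX_BUSY leaves the instance intact; success
 * and every other error code mean it has been destroyed and released. */
static bool env_handle_survives_close(int rc) {
  return rc == MDBX_BUSY;
}
```

A caller could use this to decide whether to retry the close later or to clear its own pointer to the environment immediately.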
/* Set environment flags.
*
* This may be used to set some flags in addition to those from
- * mdbx_env_open(), or to unset these flags. If several threads
- * change the flags at the same time, the result is undefined.
+ * mdbx_env_open(), or to unset these flags.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] flags The flags to change, bitwise OR'ed together
+ * NOTE: In contrast to LMDB, MDBX serializes threads via a mutex while
+ * changing the flags. Therefore this function will block while a write
+ * transaction is being run by another thread, or MDBX_BUSY will be returned if
+ * the function is called within a write transaction.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] flags The flags to change, bitwise OR'ed together.
* [in] onoff A non-zero value sets the flags, zero clears them.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_env_set_flags(MDBX_env *env, unsigned flags, int onoff);
/* Get environment flags.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [out] flags The address of an integer to store the flags
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [out] flags The address of an integer to store the flags.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_env_get_flags(MDBX_env *env, unsigned *flags);
/* Return the path that was used in mdbx_env_open().
*
* [in] env An environment handle returned by mdbx_env_create()
- * [out] path Address of a string pointer to contain the path.
+ * [out] dest Address of a string pointer to contain the path.
* This is the actual string in the environment, not a copy.
* It should not be altered in any way.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
-LIBMDBX_API int mdbx_env_get_path(MDBX_env *env, const char **path);
+ * - MDBX_EINVAL = an invalid parameter was specified. */
+LIBMDBX_API int mdbx_env_get_path(MDBX_env *env, const char **dest);
/* Return the file descriptor for the given environment.
*
* NOTE: All MDBX file descriptors have FD_CLOEXEC and
* couldn't be used after exec() and/or fork().
*
- * [in] env An environment handle returned by mdbx_env_create()
+ * [in] env An environment handle returned by mdbx_env_create().
* [out] fd Address of an int to contain the descriptor.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_env_get_fd(MDBX_env *env, mdbx_filehandle_t *fd);
-/* Set the size of the memory map to use for this environment.
- *
- * The size should be a multiple of the OS page size. The default is
- * 10485760 bytes. The size of the memory map is also the maximum size
- * of the database. The value should be chosen as large as possible,
- * to accommodate future growth of the database.
- * This function should be called after mdbx_env_create() and before
- * mdbx_env_open(). It may be called at later times if no transactions
- * are active in this process. Note that the library does not check for
- * this condition, the caller must ensure it explicitly.
- *
- * The new size takes effect immediately for the current process but
- * will not be persisted to any others until a write transaction has been
- * committed by the current process. Also, only mapsize increases are
- * persisted into the environment.
- *
- * If the mapsize is increased by another process, and data has grown
- * beyond the range of the current mapsize, mdbx_txn_begin() will
- * return MDBX_MAP_RESIZED. This function may be called with a size
- * of zero to adopt the new size.
- *
- * Any attempt to set a size smaller than the space already consumed by the
- * environment will be silently changed to the current size of the used space.
+/* Set all size-related parameters of environment, including page size and the
+ * min/max size of the memory map.
+ *
+ * In contrast to LMDB, MDBX provides automatic size management of a
+ * database according to the given parameters, including shrinking and
+ * resizing on the fly. From the user's point of view all of this just works.
+ * Nevertheless, it is reasonable to know some details in order to make optimal
+ * decisions when choosing parameters.
+ *
+ * Both mdbx_env_set_geometry() and legacy mdbx_env_set_mapsize() are
+ * inapplicable to a read-only opened environment.
+ *
+ * Both mdbx_env_set_geometry() and legacy mdbx_env_set_mapsize() could be
+ * called either before or after mdbx_env_open(), either within the write
+ * transaction running by the current thread or not:
+ *
+ * - In case mdbx_env_set_geometry() or legacy mdbx_env_set_mapsize() was
+ * called BEFORE mdbx_env_open(), i.e. for a closed environment, then the
+ * specified parameters will be used for new database creation, or will be
+ * applied during opening if the database exists and no other process is
+ * using it.
+ *
+ * If the database already exists, is opened with MDBX_EXCLUSIVE or is not
+ * used by any other process, and the parameters specified by
+ * mdbx_env_set_geometry() are incompatible (for instance, a different page
+ * size), then mdbx_env_open() will return the MDBX_INCOMPATIBLE error.
+ *
+ * Otherwise, if the database is opened read-only or is in use by another
+ * process when mdbx_env_open() is called, the specified parameters will be
+ * silently discarded (open the database with the MDBX_EXCLUSIVE flag to
+ * avoid this).
+ *
+ * - In case mdbx_env_set_geometry() or legacy mdbx_env_set_mapsize() was
+ * called after mdbx_env_open() WITHIN the write transaction running by the
+ * current thread, then the specified parameters will be applied as a part of
+ * the write transaction, i.e. will not be visible to any other processes
+ * until the current write transaction has been committed by the current
+ * process. However, if the transaction is aborted, then the database file
+ * will be reverted to the previous size not immediately, but when the next
+ * transaction is committed or when the database is opened next time.
+ *
+ * - In case mdbx_env_set_geometry() or legacy mdbx_env_set_mapsize() was
+ * called after mdbx_env_open() but OUTSIDE a write transaction, then MDBX
+ * will execute an internal pseudo-transaction to apply the new parameters
+ * (but only if anything has been changed), and the changes will be visible
+ * to any other processes immediately after successful completion of the
+ * function.
+ *
+ * Essentially the concept of "automatic size management" is simple and useful:
+ * - There are lower and upper bounds of the database file size.
+ * - There is a growth step by which the database file will be increased
+ * in case of lack of space.
+ * - There is a threshold for unused space, beyond which the database file
+ * will be shrunk.
+ * - The size of the memory map is also the maximum size of the database.
+ * - MDBX will automatically manage both the size of the database and the size
+ * of the memory map, according to the given parameters.
+ *
+ * So, there are some considerations about choosing these parameters:
+ * - The lower bound allows you to prevent database shrinking below some
+ * rational size to avoid unnecessary resizing costs.
+ * - The upper bound allows you to prevent database growth above some rational
+ * size. Besides, the upper bound defines the linear address space
+ * reservation in each process that opens the database. Therefore changing
+ * the upper bound is costly and may require reopening the environment in
+ * case of MDBX_MAP_RESIZED errors, and so on. Therefore, this value should
+ * be chosen as large as reasonably possible, to accommodate future growth
+ * of the database.
+ * - The growth step must be greater than zero to allow the database to grow,
+ * but also reasonably not too small, since increasing the size by little
+ * steps will result in a large overhead.
+ * - The shrink threshold must be greater than zero to allow the database
+ * to shrink, but also reasonably not too small (to avoid extra overhead) and
+ * not less than the growth step to avoid up-and-down flouncing.
+ * - The current size (i.e. size_now argument) is an auxiliary parameter for
+ * emulating legacy mdbx_env_set_mapsize() and as a workaround for Windows
+ * issues (see below).
+ *
+ * Unfortunately, Windows has several issues
+ * with resizing of a memory-mapped file:
+ * - Windows is unable to shrink a memory-mapped file (i.e. memory-mapped
+ * section) in any way except by unmapping the file entirely and then mapping
+ * it again. Moreover, it is impossible in any way if a memory-mapped file is
+ * used by more than one process.
+ * - Windows does not provide the usual API to augment a memory-mapped file
+ * (that is, a memory-mapped section); this is possible only by using the
+ * "Native API" in an undocumented way.
+ * MDBX bypasses all Windows issues, but at a cost:
+ * - The ability to resize the database on the fly requires an additional
+ * acquire and release of a SlimReadWriteLock during each read-only
+ * transaction.
+ * - During resize all in-process threads should be paused and then resumed.
+ * - Shrinking of the database file is performed only when it is used by a
+ * single process, i.e. when the database is closed by the last process or
+ * opened by the first.
+ * = Therefore, the size_now argument may be useful to set the database size
+ * by the first process which opens a database, and thus avoid expensive
+ * remapping later.
+ *
+ * To create a new database with particular parameters, including the page
+ * size, mdbx_env_set_geometry() should be called after mdbx_env_create() and
+ * before mdbx_env_open(). Once the database is created, the page size cannot be
+ * changed. If you do not specify all or some of the parameters, the
+ * corresponding default values will be used. For instance, the default for
+ * database size is 10485760 bytes.
+ *
+ * If the mapsize is increased by another process, MDBX silently and
+ * transparently adopts these changes at the next transaction start. However,
+ * mdbx_txn_begin() will return MDBX_MAP_RESIZED if the new mapping size could
+ * not be applied for the current process (for instance if address space is
+ * busy). Therefore, in the case of an MDBX_MAP_RESIZED error you need to close
+ * and reopen the environment to resolve the error.
+ *
+ * NOTE: Actual values may differ from what you have specified because of
+ * rounding to the specified database page size, the system page size and/or
+ * the size of the system virtual memory management unit. You can get the
+ * actual values by mdbx_env_info_ex() or see them by using the tool "mdbx_chk"
+ * with the "-v" option.
+ *
+ * Legacy mdbx_env_set_mapsize() corresponds to calling mdbx_env_set_geometry()
+ * with the arguments size_lower, size_now, size_upper equal to the size
+ * and -1 (i.e. default) for all other parameters.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] size The size in bytes
+ * [in] env An environment handle returned by mdbx_env_create()
*
- * Returns A non-zero error value on failure and 0 on success, some
- * possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified,
- * or the environment has an active write transaction. */
-LIBMDBX_API int mdbx_env_set_mapsize(MDBX_env *env, size_t size);
+ * [in] size_lower The lower bound of database size in bytes.
+ * Zero value means "minimal acceptable",
+ * and negative means "keep current or use default".
+ *
+ * [in] size_now The size in bytes to setup the database size for now.
+ * Zero value means "minimal acceptable",
+ * and negative means "keep current or use default".
+ * So, it is recommended to always pass -1 in this argument
+ * except in some special cases.
+ *
+ * [in] size_upper The upper bound of database size in bytes.
+ * Zero value means "minimal acceptable",
+ * and negative means "keep current or use default".
+ * It is recommended to avoid changing the upper bound while
+ * the database is used by other processes or threads
+ * (i.e. just pass -1 in this argument unless absolutely
+ * necessary). Otherwise you must be ready for
+ * MDBX_MAP_RESIZED error(s), unexpected pauses during
+ * remapping and/or system errors like "address busy",
+ * and so on. In other words, there is no way to handle
+ * a growth of the upper bound robustly because there may
+ * be a lack of appropriate system resources (which are
+ * extremely volatile in a multi-process multi-threaded
+ * environment).
+ *
+ * [in] growth_step The growth step in bytes, must be greater than zero
+ * to allow the database to grow.
+ * Negative value means "keep current or use default".
+ *
+ * [in] shrink_threshold The shrink threshold in bytes, must be greater than
+ * zero to allow the database to shrink.
+ * Negative value means "keep current or use default".
+ *
+ * [in] pagesize The database page size for new database creation
+ * or -1 otherwise. Must be power of 2 in the range
+ * between MDBX_MIN_PAGESIZE and MDBX_MAX_PAGESIZE.
+ * Zero value means "minimal acceptable",
+ * and negative means "keep current or use default".
+ *
+ * Returns A non-zero error value on failure and 0 on success,
+ * some possible errors are:
+ * - MDBX_EINVAL = An invalid parameter was specified,
+ * or the environment has an active write transaction.
+ * - MDBX_EPERM = specific for Windows: Shrinking was disabled before and
+ * is now being enabled, but there are reading threads
+ * that don't use the additional SRWL (that is required to
+ * avoid Windows issues).
+ * - MDBX_EACCESS = The environment is opened in read-only mode.
+ * - MDBX_MAP_FULL = Specified size is smaller than the space already
+ * consumed by the environment.
+ * - MDBX_TOO_LARGE = Specified size is too large, i.e. too many pages for
+ * given size, or a 32-bit process requests too many bytes
+ * for the 32-bit address space. */
LIBMDBX_API int mdbx_env_set_geometry(MDBX_env *env, intptr_t size_lower,
intptr_t size_now, intptr_t size_upper,
intptr_t growth_step,
intptr_t shrink_threshold,
intptr_t pagesize);
+__deprecated LIBMDBX_API int mdbx_env_set_mapsize(MDBX_env *env, size_t size);
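The "automatic size management" concept documented for mdbx_env_set_geometry() can be illustrated with a minimal, self-contained sketch. This is not MDBX's actual algorithm: the geo_model type and geo_next_size() function are hypothetical names introduced only for illustration, and the shrink policy (shrink straight down to the used size, clamped to the lower bound) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the geometry parameters; NOT an MDBX type. */
typedef struct geo_model {
  int64_t lower;  /* lower bound of the file size, bytes */
  int64_t upper;  /* upper bound of the file size, bytes */
  int64_t growth; /* growth step, bytes; 0 disables growth */
  int64_t shrink; /* shrink threshold, bytes; 0 disables shrinking */
} geo_model;

/* Returns the file size this model would choose for `used` bytes of data,
 * starting from the current size `now`, or -1 if the data cannot fit
 * (growth is disabled or the upper bound is hit). */
int64_t geo_next_size(const geo_model *g, int64_t now, int64_t used) {
  assert(g->lower <= g->upper && now >= g->lower && now <= g->upper);
  if (used > now) {
    if (g->growth <= 0)
      return -1; /* growth step of zero: the database may not grow */
    /* Grow by whole steps, but never beyond the upper bound. */
    int64_t steps = (used - now + g->growth - 1) / g->growth;
    int64_t grown = now + steps * g->growth;
    if (grown > g->upper)
      grown = g->upper;
    return (used <= grown) ? grown : -1;
  }
  if (g->shrink > 0 && now - used > g->shrink) {
    /* Unused space exceeds the threshold: shrink, but not below lower.
     * Assumption: shrink directly to the used size. */
    return (used < g->lower) ? g->lower : used;
  }
  return now; /* within bounds, nothing to do */
}
```

With a lower bound of 1 MiB, an upper bound of 64 MiB, a growth step of 4 MiB and a shrink threshold of 8 MiB, this model grows a 1 MiB file to 5 MiB when 2 MiB are needed, and refuses a request for 70 MiB. The shrink threshold exceeding the growth step mirrors the advice above about avoiding up-and-down flouncing.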
+
+/* Find out whether to use readahead or not, based on the given database size
+ * and the amount of available memory.
+ *
+ * [in] volume The expected database size in bytes.
+ * [in] redundancy Additional reserve or overload in case of negative value.
+ *
+ * Returns:
+ * - MDBX_RESULT_TRUE = readahead is reasonable.
+ * - MDBX_RESULT_FALSE = readahead is NOT reasonable, i.e. it is useful to
+ * pass MDBX_NORDAHEAD when opening the environment by mdbx_env_open().
+ * - Otherwise the error code. */
+LIBMDBX_API int mdbx_is_readahead_reasonable(size_t volume,
+ intptr_t redundancy);
+
+/* The minimal database page size in bytes. */
+#define MDBX_MIN_PAGESIZE 256
+__inline intptr_t mdbx_limits_pgsize_min(void) { return MDBX_MIN_PAGESIZE; }
+
+/* The maximal database page size in bytes. */
+#define MDBX_MAX_PAGESIZE 65536
+__inline intptr_t mdbx_limits_pgsize_max(void) { return MDBX_MAX_PAGESIZE; }
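The page size constraint (a power of 2 between MDBX_MIN_PAGESIZE and MDBX_MAX_PAGESIZE) and the rounding of sizes to whole pages can be sketched without linking against the library. The SKETCH_* macros and sketch_* functions below are hypothetical names, not part of the MDBX API, and rounding up (rather than down) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MIN_PAGESIZE 256   /* mirrors MDBX_MIN_PAGESIZE above */
#define SKETCH_MAX_PAGESIZE 65536 /* mirrors MDBX_MAX_PAGESIZE above */

/* Nonzero if pagesize is a power of two within the documented range. */
int sketch_pagesize_is_valid(intptr_t pagesize) {
  return pagesize >= SKETCH_MIN_PAGESIZE && pagesize <= SKETCH_MAX_PAGESIZE &&
         (pagesize & (pagesize - 1)) == 0;
}

/* Rounds size up to a whole number of pages, or returns -1 for bad input.
 * Overflow for sizes close to INT64_MAX is not handled in this sketch. */
int64_t sketch_round_up_to_page(int64_t size, intptr_t pagesize) {
  if (size < 0 || !sketch_pagesize_is_valid(pagesize))
    return -1;
  return (size + pagesize - 1) / pagesize * pagesize;
}
```

For example, the documented default database size of 10485760 bytes is already a multiple of 4096, so it is returned unchanged, while 10000 bytes rounds up to 12288.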
+
+/* Returns minimal database size in bytes for given page size,
+ * or -1 if pagesize is invalid. */
+LIBMDBX_API intptr_t mdbx_limits_dbsize_min(intptr_t pagesize);
+
+/* Returns maximal database size in bytes for given page size,
+ * or -1 if pagesize is invalid. */
+LIBMDBX_API intptr_t mdbx_limits_dbsize_max(intptr_t pagesize);
+
+/* Returns maximal key and data size in bytes for given page size
+ * and database flags (see mdbx_dbi_open_ex() description),
+ * or -1 if pagesize is invalid. */
+LIBMDBX_API intptr_t mdbx_limits_keysize_max(intptr_t pagesize, unsigned flags);
+LIBMDBX_API intptr_t mdbx_limits_valsize_max(intptr_t pagesize, unsigned flags);
+
+/* Returns maximal write transaction size (i.e. limit for summary volume of
+ * dirty pages) in bytes for given page size, or -1 if pagesize is invalid. */
+LIBMDBX_API intptr_t mdbx_limits_txnsize_max(intptr_t pagesize);
/* Set the maximum number of threads/reader slots for the environment.
*
* This defines the number of slots in the lock table that is used to track
- * readers in the the environment. The default is 61.
+ * readers in the environment. The default is 119 for 4K system page size.
* Starting a read-only transaction normally ties a lock table slot to the
* current thread until the environment closes or the thread exits. If
* MDBX_NOTLS is in use, mdbx_txn_begin() instead ties the slot to the
@@ -888,23 +2129,23 @@ LIBMDBX_API int mdbx_env_set_geometry(MDBX_env *env, intptr_t size_lower,
* This function may only be called after mdbx_env_create() and before
* mdbx_env_open().
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] readers The maximum number of reader lock table slots
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] readers The maximum number of reader lock table slots.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified,
- * or the environment is already open. */
+ * - MDBX_EINVAL = an invalid parameter was specified.
+ * - MDBX_EPERM = the environment is already open. */
LIBMDBX_API int mdbx_env_set_maxreaders(MDBX_env *env, unsigned readers);
/* Get the maximum number of threads/reader slots for the environment.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [out] readers Address of an integer to store the number of readers
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [out] readers Address of an integer to store the number of readers.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_env_get_maxreaders(MDBX_env *env, unsigned *readers);
/* Set the maximum number of named databases for the environment.
@@ -919,25 +2160,30 @@ LIBMDBX_API int mdbx_env_get_maxreaders(MDBX_env *env, unsigned *readers);
* expensive: 7-120 words per transaction, and every mdbx_dbi_open()
* does a linear search of the opened slots.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] dbs The maximum number of databases
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] dbs The maximum number of databases.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified,
- * or the environment is already open. */
+ * - MDBX_EINVAL = an invalid parameter was specified.
+ * - MDBX_EPERM = the environment is already open. */
LIBMDBX_API int mdbx_env_set_maxdbs(MDBX_env *env, MDBX_dbi dbs);
-/* Get the maximum size of keys and MDBX_DUPSORT data we can write.
+/* Get the maximum size of keys and data we can write.
*
- * [in] env An environment handle returned by mdbx_env_create()
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] flags Database options (MDBX_DUPSORT, MDBX_INTEGERKEY and so on),
+ * see mdbx_dbi_open_ex() description.
*
- * Returns The maximum size of a key we can write. */
-LIBMDBX_API int mdbx_env_get_maxkeysize(MDBX_env *env);
+ * Returns The maximum size of a key we can write,
+ * or -1 if something is wrong. */
+LIBMDBX_API int mdbx_env_get_maxkeysize_ex(MDBX_env *env, unsigned flags);
+LIBMDBX_API int mdbx_env_get_maxvalsize_ex(MDBX_env *env, unsigned flags);
+__deprecated LIBMDBX_API int mdbx_env_get_maxkeysize(MDBX_env *env);
/* Set application information associated with the MDBX_env.
*
- * [in] env An environment handle returned by mdbx_env_create()
+ * [in] env An environment handle returned by mdbx_env_create().
* [in] ctx An arbitrary pointer for whatever the application needs.
*
* Returns A non-zero error value on failure and 0 on success. */
@@ -949,32 +2195,15 @@ LIBMDBX_API int mdbx_env_set_userctx(MDBX_env *env, void *ctx);
* Returns The pointer set by mdbx_env_set_userctx(). */
LIBMDBX_API void *mdbx_env_get_userctx(MDBX_env *env);
-/* A callback function for most MDBX assert() failures,
- * called before printing the message and aborting.
- *
- * [in] env An environment handle returned by mdbx_env_create().
- * [in] msg The assertion message, not including newline. */
-typedef void MDBX_assert_func(const MDBX_env *env, const char *msg,
- const char *function, unsigned line);
-
-/* Set or reset the assert() callback of the environment.
- *
- * Disabled if libmdbx is buillt with MDBX_DEBUG=0.
- * NOTE: This hack should become obsolete as mdbx's error handling matures.
- *
- * [in] env An environment handle returned by mdbx_env_create().
- * [in] func An MDBX_assert_func function, or 0.
- *
- * Returns A non-zero error value on failure and 0 on success. */
-LIBMDBX_API int mdbx_env_set_assert(MDBX_env *env, MDBX_assert_func *func);
-
/* Create a transaction for use with the environment.
*
* The transaction handle may be discarded using mdbx_txn_abort()
* or mdbx_txn_commit().
- * NOTE: A transaction and its cursors must only be used by a single
- * thread, and a thread may only have a single transaction at a time.
- * If MDBX_NOTLS is in use, this does not apply to read-only transactions.
+ *
+ * NOTE: A transaction and its cursors must only be used by a single thread,
+ * and a thread may only have a single transaction at a time. If MDBX_NOTLS is
+ * in use, this does not apply to read-only transactions.
+ *
* NOTE: Cursors may not span transactions.
*
* [in] env An environment handle returned by mdbx_env_create()
@@ -992,25 +2221,93 @@ LIBMDBX_API int mdbx_env_set_assert(MDBX_env *env, MDBX_assert_func *func);
* This transaction will not perform any write operations.
*
* - MDBX_TRYTXN
- * Do not block when starting a write transaction
+ * Do not block when starting a write transaction.
+ *
+ * - MDBX_NOSYNC, MDBX_NOMETASYNC or MDBX_MAPASYNC
+ * Do not sync data to disk, as described for MDBX_NOMETASYNC
+ * or MDBX_NOSYNC (see above).
*
* [out] txn Address where the new MDBX_txn handle will be stored
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_PANIC - a fatal error occurred earlier and the environment
+ * - MDBX_PANIC = a fatal error occurred earlier and the environment
* must be shut down.
- * - MDBX_MAP_RESIZED - another process wrote data beyond this MDBX_env's
+ * - MDBX_MAP_RESIZED = another process wrote data beyond this MDBX_env's
* mapsize and this environment's map must be resized
* as well. See mdbx_env_set_mapsize().
- * - MDBX_READERS_FULL - a read-only transaction was requested and the reader
+ * - MDBX_READERS_FULL = a read-only transaction was requested and the reader
* lock table is full. See mdbx_env_set_maxreaders().
- * - MDBX_ENOMEM - out of memory.
- * - MDBX_BUSY - a write transaction is already started. */
+ * - MDBX_ENOMEM = out of memory.
+ * - MDBX_BUSY = the write transaction is already started by the
+ * current thread. */
LIBMDBX_API int mdbx_txn_begin(MDBX_env *env, MDBX_txn *parent, unsigned flags,
MDBX_txn **txn);
-/* Returns the transaction's MDBX_env
+/* Information about the transaction */
+typedef struct MDBX_txn_info {
+ uint64_t txn_id; /* The ID of the transaction. For a READ-ONLY transaction,
+ this corresponds to the snapshot being read. */
+
+ uint64_t
+ txn_reader_lag; /* For READ-ONLY transaction: the lag from a recent
+ MVCC-snapshot, i.e. the number of transactions
+ committed since the read transaction started.
+ For WRITE transaction (provided if scan_rlt=true): the
+ lag of the oldest reader from current transaction (i.e.
+ at least 1 if any reader is running). */
+
+ uint64_t txn_space_used; /* Used space by this transaction, i.e. corresponding
+ to the last used database page. */
+
+ uint64_t txn_space_limit_soft; /* Current size of database file. */
+
+ uint64_t
+ txn_space_limit_hard; /* Upper bound for the size of the database file,
+ i.e. the value of the "size_upper" argument of the
+ appropriate call of mdbx_env_set_geometry(). */
+
+ uint64_t txn_space_retired; /* For READ-ONLY transaction: The total size of
+ the database pages that were retired by
+ committed write transactions after the reader's
+ MVCC-snapshot, i.e. the space which would be
+ freed for reuse once the reader releases the
+ MVCC-snapshot by completing the read
+ transaction.
+ For WRITE transaction: The summarized size of
+ the database pages that were retired for now
+ due to Copy-On-Write during this transaction. */
+
+ uint64_t
+ txn_space_leftover; /* For READ-ONLY transaction: the space available for
+ writer(s), which must be exhausted before there is a
+ reason to call the OOM-killer for this read transaction.
+ For WRITE transaction: the space left inside the
+ transaction before an MDBX_TXN_FULL error. */
+
+ uint64_t txn_space_dirty; /* For READ-ONLY transaction (provided if
+ scan_rlt=true): The space that would actually become
+ available for reuse if only this transaction
+ were finished.
+ For WRITE transaction: The summarized size of the
+ dirty database pages that were generated during this
+ transaction. */
+} MDBX_txn_info;
+
+/* Return information about the MDBX transaction.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [out] info The address of an MDBX_txn_info structure
+ * where the information will be copied.
+ * [in] scan_rlt The boolean flag that controls the scan of the read lock table
+ * to provide complete information. Such a scan is relatively
+ * expensive and you can avoid it if the corresponding fields are
+ * not needed (see description of MDBX_txn_info above).
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_txn_info(MDBX_txn *txn, MDBX_txn_info *info, int scan_rlt);
+
+/* Returns the transaction's MDBX_env.
*
* [in] txn A transaction handle returned by mdbx_txn_begin() */
LIBMDBX_API MDBX_env *mdbx_txn_env(MDBX_txn *txn);
@@ -1019,71 +2316,76 @@ LIBMDBX_API MDBX_env *mdbx_txn_env(MDBX_txn *txn);
*
* This returns the flags associated with this transaction.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
*
- * Returns A transaction flags, valid if input is an active transaction. */
+ * Returns The transaction flags, valid if input is a valid transaction,
+ * otherwise -1. */
LIBMDBX_API int mdbx_txn_flags(MDBX_txn *txn);
/* Return the transaction's ID.
*
- * This returns the identifier associated with this transaction. For a
- * read-only transaction, this corresponds to the snapshot being read;
- * concurrent readers will frequently have the same transaction ID.
+ * This returns the identifier associated with this transaction. For a read-only
+ * transaction, this corresponds to the snapshot being read; concurrent readers
+ * will frequently have the same transaction ID.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
*
- * Returns A transaction ID, valid if input is an active transaction. */
+ * Returns A transaction ID, valid if input is an active transaction,
+ * otherwise 0. */
LIBMDBX_API uint64_t mdbx_txn_id(MDBX_txn *txn);
/* Commit all the operations of a transaction into the database.
*
- * The transaction handle is freed. It and its cursors must not be used
- * again after this call, except with mdbx_cursor_renew().
+ * The transaction handle is freed. It and its cursors must not be used again
+ * after this call, except with mdbx_cursor_renew() and mdbx_cursor_close().
*
- * A cursor must be closed explicitly always, before
- * or after its transaction ends. It can be reused with
- * mdbx_cursor_renew() before finally closing it.
+ * A cursor must be closed explicitly always, before or after its transaction
+ * ends. It can be reused with mdbx_cursor_renew() before finally closing it.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified.
- * - MDBX_ENOSPC - no more disk space.
- * - MDBX_EIO - a low-level I/O error occurred while writing.
- * - MDBX_ENOMEM - out of memory. */
+ * - MDBX_EINVAL = an invalid parameter was specified.
+ * - MDBX_ENOSPC = no more disk space.
+ * - MDBX_EIO = a low-level I/O error occurred while writing.
+ * - MDBX_ENOMEM = out of memory. */
LIBMDBX_API int mdbx_txn_commit(MDBX_txn *txn);
/* Abandon all the operations of the transaction instead of saving them.
*
- * The transaction handle is freed. It and its cursors must not be used
- * again after this call, except with mdbx_cursor_renew().
+ * The transaction handle is freed. It and its cursors must not be used again
+ * after this call, except with mdbx_cursor_renew() and mdbx_cursor_close().
*
* A cursor must be closed explicitly always, before or after its transaction
* ends. It can be reused with mdbx_cursor_renew() before finally closing it.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin(). */
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
LIBMDBX_API int mdbx_txn_abort(MDBX_txn *txn);
/* Reset a read-only transaction.
*
- * Abort the transaction like mdbx_txn_abort(), but keep the transaction
- * handle. Therefore mdbx_txn_renew() may reuse the handle. This saves
- * allocation overhead if the process will start a new read-only transaction
- * soon, and also locking overhead if MDBX_NOTLS is in use. The reader table
- * lock is released, but the table slot stays tied to its thread or
- * MDBX_txn. Use mdbx_txn_abort() to discard a reset handle, and to free
- * its lock table slot if MDBX_NOTLS is in use.
+ * Abort the read-only transaction like mdbx_txn_abort(), but keep the
+ * transaction handle. Therefore mdbx_txn_renew() may reuse the handle. This
+ * saves allocation overhead if the process will start a new read-only
+ * transaction soon, and also locking overhead if MDBX_NOTLS is in use. The
+ * reader table lock is released, but the table slot stays tied to its thread or
+ * MDBX_txn. Use mdbx_txn_abort() to discard a reset handle, and to free its
+ * lock table slot if MDBX_NOTLS is in use.
*
- * Cursors opened within the transaction must not be used
- * again after this call, except with mdbx_cursor_renew().
+ * Cursors opened within the transaction must not be used again after this call,
+ * except with mdbx_cursor_renew() and mdbx_cursor_close().
*
* Reader locks generally don't interfere with writers, but they keep old
- * versions of database pages allocated. Thus they prevent the old pages
- * from being reused when writers commit new data, and so under heavy load
- * the database size may grow much more rapidly than otherwise.
+ * versions of database pages allocated. Thus they prevent the old pages from
+ * being reused when writers commit new data, and so under heavy load the
+ * database size may grow much more rapidly than otherwise.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin() */
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
LIBMDBX_API int mdbx_txn_reset(MDBX_txn *txn);
/* Renew a read-only transaction.
@@ -1092,41 +2394,90 @@ LIBMDBX_API int mdbx_txn_reset(MDBX_txn *txn);
* released by mdbx_txn_reset(). It must be called before a reset transaction
* may be used again.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_PANIC - a fatal error occurred earlier and the environment
+ * - MDBX_PANIC = a fatal error occurred earlier and the environment
* must be shut down.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_txn_renew(MDBX_txn *txn);
-/* Open a table in the environment.
+/* The four integer markers (aka "canary") associated with the environment.
+ *
+ * The `x`, `y` and `z` values could be set by mdbx_canary_put(), while the `v`
+ * will always be set to the transaction number. Updated values become visible
+ * outside the current transaction only after it was committed. Current values
+ * could be retrieved by mdbx_canary_get(). */
+typedef struct mdbx_canary {
+ uint64_t x, y, z, v;
+} mdbx_canary;
+
+/* Set the integer markers (aka "canary") associated with the environment.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin()
+ * [in] canary An optional pointer to an mdbx_canary structure to take the
+ * `x`, `y` and `z` values from.
+ * - If canary is NOT NULL then the `x`, `y` and `z` values will be
+ * updated from the given canary argument, and the `v` will be set
+ * to the current transaction number only if at least one of the `x`,
+ * `y` or `z` values has changed (i.e. if `x`, `y` and `z` have the
+ * same values as currently present then nothing will be changed or
+ * updated).
+ * - If canary is NULL then the `v` value will be explicitly updated
+ * to the current transaction number without changing `x`, `y` or
+ * `z`.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_canary_put(MDBX_txn *txn, const mdbx_canary *canary);
+
+/* Returns the four integer markers (aka "canary") associated with the
+ * environment.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [out] canary The address of an mdbx_canary structure where the information
+ * will be copied.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_canary_get(MDBX_txn *txn, mdbx_canary *canary);
+
+/* A callback function used to compare two keys in a database */
+typedef int(MDBX_cmp_func)(const MDBX_val *a, const MDBX_val *b);
+
+/* Open a database in the environment.
+ *
+ * A database handle denotes the name and parameters of a database,
+ * independently of whether such a database exists. The database handle may be
+ * discarded by calling mdbx_dbi_close(). The old database handle is returned if
+ * the database was already open. The handle may only be closed once.
+ *
+ * (!) A notable difference between MDBX and LMDB is that MDBX makes handles
+ * opened for existing databases immediately available to other transactions,
+ * regardless of whether this transaction is aborted or reset. The REASON is
+ * to avoid both the need to open the same handle repeatedly in concurrent
+ * read transactions and the tracking of such open but hidden handles until
+ * the completion of the read transactions which opened them.
*
- * A table handle denotes the name and parameters of a table, independently
- * of whether such a table exists. The table handle may be discarded by
- * calling mdbx_dbi_close(). The old table handle is returned if the table
- * was already open. The handle may only be closed once.
+ * Nevertheless, the handle for a NEWLY CREATED database will be invisible to
+ * other transactions until this write transaction is successfully
+ * committed. If the write transaction is aborted the handle will be closed
+ * automatically. After a successful commit such a handle will reside in the
+ * shared environment, and may be used by other transactions.
*
- * The table handle will be private to the current transaction until
- * the transaction is successfully committed. If the transaction is
- * aborted the handle will be closed automatically.
- * After a successful commit the handle will reside in the shared
- * environment, and may be used by other transactions.
+ * In contrast to LMDB, MDBX allows this function to be called from multiple
+ * concurrent transactions or threads in the same process.
*
- * This function must not be called from multiple concurrent
- * transactions in the same process. A transaction that uses
- * this function must finish (either commit or abort) before
- * any other transaction in the process may use this function.
+ * Legacy mdbx_dbi_open() corresponds to calling mdbx_dbi_open_ex() with NULL
+ * keycmp and datacmp arguments.
*
- * To use named table (with name != NULL), mdbx_env_set_maxdbs()
+ * To use a named database (with name != NULL), mdbx_env_set_maxdbs()
* must be called before opening the environment. Table names are
- * keys in the internal unnamed table, and may be read but not written.
+ * keys in the internal unnamed database, and may be read but not written.
*
- * [in] txn transaction handle returned by mdbx_txn_begin()
- * [in] name The name of the table to open. If only a single
- * table is needed in the environment, this value may be NULL.
- * [in] flags Special options for this table. This parameter must be set
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] name The name of the database to open. If only a single
+ * database is needed in the environment, this value may be NULL.
+ * [in] flags Special options for this database. This parameter must be set
* to 0 or by bitwise OR'ing together one or more of the values
* described here:
* - MDBX_REVERSEKEY
@@ -1134,7 +2485,7 @@ LIBMDBX_API int mdbx_txn_renew(MDBX_txn *txn);
* of the strings to the beginning. By default, Keys are treated as
* strings and compared from beginning to end.
* - MDBX_DUPSORT
- * Duplicate keys may be used in the table. Or, from another point of
+ * Duplicate keys may be used in the database. Or, from another point of
* view, keys may have multiple data items, stored in sorted order. By
* default keys must be unique and may have only a single data item.
* - MDBX_INTEGERKEY
@@ -1159,14 +2510,21 @@ LIBMDBX_API int mdbx_txn_renew(MDBX_txn *txn);
* Create the named database if it doesn't exist. This option is not
* allowed in a read-only transaction or a read-only environment.
*
- * [out] dbi Address where the new MDBX_dbi handle will be stored
+ * [in] keycmp Optional custom key comparison function for a database.
+ * [in] datacmp Optional custom data comparison function for a database, takes
+ * effect only if the database was opened with the MDBX_DUPSORT flag.
+ * [out] dbi Address where the new MDBX_dbi handle will be stored.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_NOTFOUND - the specified database doesn't exist in the
- * environment and MDBX_CREATE was not specified.
- * - MDBX_DBS_FULL - too many databases have been opened.
- * See mdbx_env_set_maxdbs(). */
+ * - MDBX_NOTFOUND = the specified database doesn't exist in the
+ * environment and MDBX_CREATE was not specified.
+ * - MDBX_DBS_FULL = too many databases have been opened.
+ * See mdbx_env_set_maxdbs().
+ * - MDBX_INCOMPATIBLE = the database is incompatible with the given flags,
+ * i.e. the passed flags differ from those with which the
+ * database was created, or the database was already
+ * opened with different comparison function(s). */
LIBMDBX_API int mdbx_dbi_open_ex(MDBX_txn *txn, const char *name,
unsigned flags, MDBX_dbi *dbi,
MDBX_cmp_func *keycmp, MDBX_cmp_func *datacmp);
@@ -1175,24 +2533,27 @@ LIBMDBX_API int mdbx_dbi_open(MDBX_txn *txn, const char *name, unsigned flags,
/* Retrieve statistics for a database.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
* [out] stat The address of an MDBX_stat structure where the statistics
- * will be copied
+ * will be copied.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_dbi_stat(MDBX_txn *txn, MDBX_dbi dbi, MDBX_stat *stat,
size_t bytes);
-/* Retrieve the DB flags for a database handle.
+/* Retrieve the DB flags and status for a database handle.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
* [out] flags Address where the flags will be returned.
* [out] state Address where the state will be returned.
*
+ * Legacy mdbx_dbi_flags() corresponds to calling mdbx_dbi_flags_ex() and
+ * discarding the result from the last argument.
+ *
* Returns A non-zero error value on failure and 0 on success. */
#define MDBX_TBL_DIRTY 0x01 /* DB was written in this txn */
#define MDBX_TBL_STALE 0x02 /* Named-DB record is older than txnID */
@@ -1204,29 +2565,33 @@ LIBMDBX_API int mdbx_dbi_flags(MDBX_txn *txn, MDBX_dbi dbi, unsigned *flags);
/* Close a database handle. Normally unnecessary.
*
- * Use with care:
- * FIXME: This call is not mutex protected. Handles should only be closed by
- * a single thread, and only if no other threads are going to reference
- * the database handle or one of its cursors any further. Do not close
- * a handle if an existing transaction has modified its database.
- * Doing so can cause misbehavior from database corruption to errors
- * like MDBX_BAD_VALSIZE (since the DB name is gone).
+ * NOTE: Use with care.
+ * This call is synchronized via mutex with mdbx_dbi_close(), but NOT with
+ * other transactions running in other threads. The "next" version of libmdbx
+ * (MithrilDB) will solve this issue.
*
- * Closing a database handle is not necessary, but lets mdbx_dbi_open()
- * reuse the handle value. Usually it's better to set a bigger
- * mdbx_env_set_maxdbs(), unless that value would be large.
+ * Handles should only be closed if no other threads are going to reference
+ * the database handle or one of its cursors any further. Do not close a handle
+ * if an existing transaction has modified its database. Doing so can cause
+ * misbehavior from database corruption to errors like MDBX_BAD_VALSIZE (since
+ * the DB name is gone).
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- */
+ * Closing a database handle is not necessary, but lets mdbx_dbi_open() reuse
+ * the handle value. Usually it's better to set a bigger mdbx_env_set_maxdbs(),
+ * unless that value would be large.
+ *
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
LIBMDBX_API int mdbx_dbi_close(MDBX_env *env, MDBX_dbi dbi);
-/* Empty or delete+close a database.
+/* Empty or delete and close a database.
*
* See mdbx_dbi_close() for restrictions about closing the DB handle.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
* [in] del 0 to empty the DB, 1 to delete it from the environment
* and close the DB handle.
*
@@ -1250,19 +2615,68 @@ LIBMDBX_API int mdbx_drop(MDBX_txn *txn, MDBX_dbi dbi, int del);
* NOTE: Values returned from the database are valid only until a
* subsequent update operation, or the end of the transaction.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- * [in] key The key to search for in the database
- * [in,out] data The data corresponding to the key
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] key The key to search for in the database.
+ * [in,out] data The data corresponding to the key.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_NOTFOUND - the key was not in the database.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_NOTFOUND = the key was not in the database.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_get(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
MDBX_val *data);
-LIBMDBX_API int mdbx_get2(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
- MDBX_val *data);
+
+/* Get items from a database and optionally a count of data items for a key.
+ *
+ * Briefly this function does the same as mdbx_get() with a few differences:
+ * 1. If values_count is NOT NULL, it returns the count
+ * of multi-values/duplicates for a given key.
+ * 2. Updates BOTH the key and the data to point to the actual key-value
+ * pair inside the database.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in,out] key The key to search for in the database.
+ * [in,out] data The data corresponding to the key.
+ * [out] values_count An optional address for returning the number of values
+ * associated with the given key, i.e.
+ * = 0 - in case of the MDBX_NOTFOUND error;
+ * = 1 - exactly one for databases WITHOUT MDBX_DUPSORT;
+ * >= 1 - for databases WITH MDBX_DUPSORT.
+ *
+ * Returns A non-zero error value on failure and 0 on success, some
+ * possible errors are:
+ * - MDBX_NOTFOUND = the key was not in the database.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
+LIBMDBX_API int mdbx_get_ex(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
+ MDBX_val *data, size_t *values_count);
+
+/* Get nearest items from a database.
+ *
+ * Briefly this function does the same as mdbx_get() with a few differences:
+ * 1. Returns the nearest key-value pair (i.e. equal or greater according to
+ * the comparison function), not only one exactly matching the key.
+ * 2. On success returns MDBX_SUCCESS if the key is found exactly,
+ * and MDBX_RESULT_TRUE otherwise. Moreover, for databases with the
+ * MDBX_DUPSORT flag the data argument will also be used to match over
+ * multi-value/duplicates, and MDBX_SUCCESS will be returned only when BOTH
+ * the key and the data match exactly.
+ * 3. Updates BOTH the key and the data to point to the actual key-value
+ * pair inside the database.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in,out] key The key to search for in the database.
+ * [in,out] data The data corresponding to the key.
+ *
+ * Returns A non-zero error value on failure and MDBX_SUCCESS (0) or
+ * MDBX_RESULT_TRUE on success (as described above).
+ * Some possible errors are:
+ * - MDBX_NOTFOUND = the key was not in the database.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
+LIBMDBX_API int mdbx_get_nearest(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
+ MDBX_val *data);
/* Store items into a database.
*
@@ -1271,13 +2685,13 @@ LIBMDBX_API int mdbx_get2(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
* if duplicates are disallowed, or adding a duplicate data item if
* duplicates are allowed (MDBX_DUPSORT).
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- * [in] key The key to store in the database
- * [in,out] data The data to store
- * [in] flags Special options for this operation. This parameter must be
- * set to 0 or by bitwise OR'ing together one or more of the
- * values described here.
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] key The key to store in the database.
+ * [in,out] data The data to store.
+ * [in] flags Special options for this operation. This parameter must be
+ * set to 0 or by bitwise OR'ing together one or more of the
+ * values described here.
*
* - MDBX_NODUPDATA
* Enter the new key/data pair only if it does not already appear
@@ -1318,90 +2732,143 @@ LIBMDBX_API int mdbx_get2(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
* - MDBX_KEYEXIST
- * - MDBX_MAP_FULL - the database is full, see mdbx_env_set_mapsize().
- * - MDBX_TXN_FULL - the transaction has too many dirty pages.
- * - MDBX_EACCES - an attempt was made to write in a read-only transaction.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_MAP_FULL = the database is full, see mdbx_env_set_mapsize().
+ * - MDBX_TXN_FULL = the transaction has too many dirty pages.
+ * - MDBX_EACCES = an attempt was made to write in a read-only transaction.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_put(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
MDBX_val *data, unsigned flags);
+/* Replace items in a database.
+ *
+ * This function allows updating or deleting an existing value while
+ * retrieving the previous value at the same time. If the new_data argument is
+ * NULL, the removal is performed, otherwise the update/insert.
+ *
+ * The current value may be in an already changed (aka dirty) page. In this
+ * case, the page will be overwritten during the update, and the old value will
+ * be lost. Therefore, a buffer must initially be passed via the old_data
+ * argument so the old value can be copied. If the buffer passed in is too
+ * small, the function will return MDBX_RESULT_TRUE (-1), setting the iov_len
+ * field pointed to by the old_data argument to the appropriate value, without
+ * performing any changes.
+ *
+ * For databases with non-unique keys (i.e. with the MDBX_DUPSORT flag), another
+ * use case is also possible, in which the old_data argument selects a specific
+ * item among multi-value/duplicates with the same key for deletion or update.
+ * To select this scenario, flags should specify both MDBX_CURRENT and
+ * MDBX_NOOVERWRITE. This combination is chosen because it otherwise makes no
+ * sense, and thus unambiguously identifies a request for this scenario.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] key The key to store in the database.
+ * [in,out] new_data The data to store; if NULL then a deletion will be
+ * performed.
+ * [in,out] old_data The buffer for retrieving the previous value, as
+ * described above.
+ * [in] flags Special options for this operation. This parameter must
+ * be set to 0 or by bitwise OR'ing together one or more of
+ * the values described in mdbx_put() description above,
+ * and additionally the (MDBX_CURRENT | MDBX_NOOVERWRITE)
+ * combination for selecting a particular item from
+ * multi-value/duplicates.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_replace(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
+ MDBX_val *new_data, MDBX_val *old_data,
+ unsigned flags);
+
/* Delete items from a database.
*
* This function removes key/data pairs from the database.
*
- * The data parameter is NOT ignored regardless the database does
+ * NOTE: The data parameter is NOT ignored, no matter whether the database does
* support sorted duplicate data items or not. If the data parameter
* is non-NULL only the matching data item will be deleted.
*
* This function will return MDBX_NOTFOUND if the specified key/data
* pair is not in the database.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- * [in] key The key to delete from the database
- * [in] data The data to delete
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] key The key to delete from the database.
+ * [in] data The data to delete.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EACCES - an attempt was made to write in a read-only transaction.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EACCES = an attempt was made to write in a read-only transaction.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_del(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
MDBX_val *data);
/* Create a cursor handle.
*
- * A cursor is associated with a specific transaction and database.
- * A cursor cannot be used when its database handle is closed. Nor
- * when its transaction has ended, except with mdbx_cursor_renew().
- * It can be discarded with mdbx_cursor_close().
+ * A cursor is associated with a specific transaction and database. A cursor
+ * cannot be used when its database handle is closed. Nor when its transaction
+ * has ended, except with mdbx_cursor_renew(). Also it can be discarded with
+ * mdbx_cursor_close().
*
- * A cursor must be closed explicitly always, before
- * or after its transaction ends. It can be reused with
- * mdbx_cursor_renew() before finally closing it.
+ * A cursor must be closed explicitly always, before or after its transaction
+ * ends. It can be reused with mdbx_cursor_renew() before finally closing it.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- * [out] cursor Address where the new MDBX_cursor handle will be stored
+ * NOTE: In contrast to LMDB, MDBX requires that any opened cursor can be
+ * reused and must be freed explicitly, regardless of whether it was opened in
+ * a read-only or write transaction. The REASON for this is to eliminate
+ * ambiguity, which helps to avoid errors such as use-after-free and
+ * double-free, i.e. memory corruption and segfaults.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [out] cursor Address where the new MDBX_cursor handle will be stored.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_cursor_open(MDBX_txn *txn, MDBX_dbi dbi,
MDBX_cursor **cursor);
/* Close a cursor handle.
*
- * The cursor handle will be freed and must not be used again after this call.
- * Its transaction must still be live if it is a write-transaction.
+ * The cursor handle will be freed and must not be used again after this call,
+ * but its transaction may still be live.
+ *
+ * NOTE: In contrast to LMDB, MDBX requires that any opened cursor can be
+ * reused and must be freed explicitly, regardless of whether it was opened in
+ * a read-only or write transaction. The REASON for this is to eliminate
+ * ambiguity, which helps to avoid errors such as use-after-free and
+ * double-free, i.e. memory corruption and segfaults.
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open() */
+ * [in] cursor A cursor handle returned by mdbx_cursor_open(). */
LIBMDBX_API void mdbx_cursor_close(MDBX_cursor *cursor);
/* Renew a cursor handle.
*
- * A cursor is associated with a specific transaction and database.
- * Cursors that are only used in read-only transactions may be re-used,
- * to avoid unnecessary malloc/free overhead. The cursor may be associated
- * with a new read-only transaction, and referencing the same database handle
- * as it was created with.
+ * A cursor is associated with a specific transaction and database. The cursor
+ * may be associated with a new transaction, and referencing the same database
+ * handle as it was created with. This may be done whether the previous
+ * transaction is live or dead.
*
- * This may be done whether the previous transaction is live or dead.
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] cursor A cursor handle returned by mdbx_cursor_open()
+ * NOTE: In contrast to LMDB, MDBX allows any cursor to be re-used via
+ * mdbx_cursor_renew(), to avoid unnecessary malloc/free overhead, until it is
+ * freed by mdbx_cursor_close().
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_cursor_renew(MDBX_txn *txn, MDBX_cursor *cursor);
/* Return the cursor's transaction handle.
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open() */
+ * [in] cursor A cursor handle returned by mdbx_cursor_open(). */
LIBMDBX_API MDBX_txn *mdbx_cursor_txn(MDBX_cursor *cursor);
/* Return the cursor's database handle.
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open() */
+ * [in] cursor A cursor handle returned by mdbx_cursor_open(). */
LIBMDBX_API MDBX_dbi mdbx_cursor_dbi(MDBX_cursor *cursor);
/* Retrieve by cursor.
@@ -1412,15 +2879,15 @@ LIBMDBX_API MDBX_dbi mdbx_cursor_dbi(MDBX_cursor *cursor);
* and the address and length of the data are returned in the object to which
* data refers. See mdbx_get() for restrictions on using the output values.
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open()
- * [in,out] key The key for a retrieved item
- * [in,out] data The data of a retrieved item
- * [in] op A cursor operation MDBX_cursor_op
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
+ * [in,out] key The key for a retrieved item.
+ * [in,out] data The data of a retrieved item.
+ * [in] op A cursor operation MDBX_cursor_op.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_NOTFOUND - no matching key found.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_NOTFOUND = no matching key found.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_cursor_get(MDBX_cursor *cursor, MDBX_val *key,
MDBX_val *data, MDBX_cursor_op op);
@@ -1429,7 +2896,7 @@ LIBMDBX_API int mdbx_cursor_get(MDBX_cursor *cursor, MDBX_val *key,
* This function stores key/data pairs into the database. The cursor is
* positioned at the new item, or on failure usually near it.
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open()
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
* [in] key The key operated on.
* [in] data The data operated on.
* [in] flags Options for this operation. This parameter
@@ -1486,21 +2953,21 @@ LIBMDBX_API int mdbx_cursor_get(MDBX_cursor *cursor, MDBX_val *key,
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
* - MDBX_EKEYMISMATCH
- * - MDBX_MAP_FULL - the database is full, see mdbx_env_set_mapsize().
- * - MDBX_TXN_FULL - the transaction has too many dirty pages.
- * - MDBX_EACCES - an attempt was made to write in a read-only transaction.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_MAP_FULL = the database is full, see mdbx_env_set_mapsize().
+ * - MDBX_TXN_FULL = the transaction has too many dirty pages.
+ * - MDBX_EACCES = an attempt was made to write in a read-only transaction.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_cursor_put(MDBX_cursor *cursor, MDBX_val *key,
MDBX_val *data, unsigned flags);
-/* Delete current key/data pair
+/* Delete current key/data pair.
*
- * This function deletes the key/data pair to which the cursor refers.
- * This does not invalidate the cursor, so operations such as MDBX_NEXT
- * can still be used on it. Both MDBX_NEXT and MDBX_GET_CURRENT will return
- * the same record after this operation.
+ * This function deletes the key/data pair to which the cursor refers. This does
+ * not invalidate the cursor, so operations such as MDBX_NEXT can still be used
+ * on it. Both MDBX_NEXT and MDBX_GET_CURRENT will return the same record after
+ * this operation.
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open()
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
* [in] flags Options for this operation. This parameter must be set to 0
* or one of the values described here.
*
@@ -1510,192 +2977,361 @@ LIBMDBX_API int mdbx_cursor_put(MDBX_cursor *cursor, MDBX_val *key,
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EACCES - an attempt was made to write in a read-only transaction.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_EACCES = an attempt was made to write in a read-only transaction.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_cursor_del(MDBX_cursor *cursor, unsigned flags);
/* Return count of duplicates for current key.
*
- * This call is only valid on databases that support sorted duplicate data
- * items MDBX_DUPSORT.
+ * This call is valid for all databases, but is meaningful only for those that
+ * support sorted duplicate data items (MDBX_DUPSORT).
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open()
- * [out] countp Address where the count will be stored
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
+ * [out] countp Address where the count will be stored.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_EINVAL - cursor is not initialized, or an invalid parameter
+ * - MDBX_EINVAL = cursor is not initialized, or an invalid parameter
* was specified. */
LIBMDBX_API int mdbx_cursor_count(MDBX_cursor *cursor, size_t *countp);
-/* Compare two data items according to a particular database.
+/* Determines whether the cursor points to a key-value pair or not,
+ * i.e. was not positioned or points to the end of data.
*
- * This returns a comparison as if the two data items were keys in the
- * specified database.
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- * [in] a The first item to compare
- * [in] b The second item to compare
+ * Returns:
+ * - MDBX_RESULT_TRUE = no more data available or cursor not positioned;
+ * - MDBX_RESULT_FALSE = data available;
+ * - Otherwise the error code. */
+LIBMDBX_API int mdbx_cursor_eof(MDBX_cursor *mc);
+
+/* Determines whether the cursor points to the first key-value pair or not.
*
- * Returns < 0 if a < b, 0 if a == b, > 0 if a > b */
-LIBMDBX_API int mdbx_cmp(MDBX_txn *txn, MDBX_dbi dbi, const MDBX_val *a,
- const MDBX_val *b);
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
+ *
+ * Returns:
+ * - MDBX_RESULT_TRUE = cursor positioned to the first key-value pair.
+ * - MDBX_RESULT_FALSE = cursor NOT positioned to the first key-value pair.
+ * - Otherwise the error code. */
+LIBMDBX_API int mdbx_cursor_on_first(MDBX_cursor *mc);
-/* Compare two data items according to a particular database.
+/* Determines whether the cursor points to the last key-value pair or not.
*
- * This returns a comparison as if the two items were data items of
- * the specified database. The database must have the MDBX_DUPSORT flag.
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- * [in] a The first item to compare
- * [in] b The second item to compare
+ * Returns:
+ * - MDBX_RESULT_TRUE = cursor positioned to the last key-value pair.
+ * - MDBX_RESULT_FALSE = cursor NOT positioned to the last key-value pair.
+ * - Otherwise the error code. */
+LIBMDBX_API int mdbx_cursor_on_last(MDBX_cursor *mc);
+
+/* Estimates the distance between cursors as a number of elements. The results
+ * of such estimation can be used to build and/or optimize query execution
+ * plans.
+ *
+ * This function performs a rough estimate based only on b-tree pages that are
+ * common to both cursors' stacks.
+ *
+ * NOTE: The result varies greatly depending on the filling of specific pages
+ * and the overall balance of the b-tree:
+ *
+ * 1. The number of items is estimated by analyzing the height and fullness of
+ * the b-tree. The accuracy of the result directly depends on the balance of the
+ * b-tree, which in turn is determined by the history of previous insert/delete
+ * operations and the nature of the data (i.e. variability of keys length and so
+ * on). Therefore, the accuracy of the estimation can vary greatly in a
+ * particular situation.
+ *
+ * 2. To understand the potential spread of results, you should consider the
+ * possible situations based on the general criteria for splitting and merging
+ * b-tree pages:
+ * - the page is split into two when there is no space for added data;
+ * - two pages merge if the result fits in half a page;
+ * - thus, the b-tree can consist of an arbitrary combination of pages filled
+ * both completely and only to 1/4. Therefore, in the worst case, the result
+ * can diverge by a factor of 4 for each level of the b-tree except the
+ * first and the last.
+ *
+ * 3. In practice, the probability of extreme cases of the above situation is
+ * close to zero and in most cases the error does not exceed a few percent. On
+ * the other hand, it's just a chance you shouldn't overestimate.
+ *
+ * Both cursors must be initialized for the same database and the same
+ * transaction.
+ *
+ * [in] first The first cursor for estimation.
+ * [in] last The second cursor for estimation.
+ * [out] distance_items A pointer to store estimated distance value,
+ * i.e. *distance_items = distance(first, last).
*
- * Returns < 0 if a < b, 0 if a == b, > 0 if a > b */
-LIBMDBX_API int mdbx_dcmp(MDBX_txn *txn, MDBX_dbi dbi, const MDBX_val *a,
- const MDBX_val *b);
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_estimate_distance(const MDBX_cursor *first,
+ const MDBX_cursor *last,
+ ptrdiff_t *distance_items);
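The worst-case bound from note 2 of the mdbx_estimate_distance() description can be made concrete with a little arithmetic. The sketch below is illustrative only: `worst_case_spread` is a hypothetical helper, not part of libmdbx, and it assumes the factor of 4 applies to every level except the first and the last, exactly as the note states.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libmdbx: the worst-case ratio between
 * the largest and the smallest item counts that fit the same b-tree
 * height, assuming every level except the first and the last may be
 * filled anywhere between 1/4 and completely.
 * The result overflows for heights above 33, far beyond any real tree. */
static uint64_t worst_case_spread(unsigned height) {
  uint64_t spread = 1;
  /* Levels 2 .. height-1 each contribute a factor of 4. */
  for (unsigned level = 2; level < height; ++level)
    spread *= 4;
  return spread;
}
```

For a tree of height 5 this gives a possible spread of 64, which is why the estimate should be treated as a planning hint rather than a count.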
-/* A callback function used to print a message from the library.
+/* Estimates the move distance, i.e. between the current cursor position and
+ * next position after the specified move-operation with given key and data.
+ * The results of such estimation can be used to build and/or optimize query
+ * execution plans. Current cursor position and state are preserved.
*
- * [in] msg The string to be printed.
- * [in] ctx An arbitrary context pointer for the callback.
+ * Please see notes on accuracy of the result in mdbx_estimate_distance()
+ * description above.
*
- * Returns < 0 on failure, >= 0 on success. */
-typedef int(MDBX_msg_func)(const char *msg, void *ctx);
+ * [in] cursor Cursor for estimation.
+ * [in,out] key The key for a retrieved item.
+ * [in,out] data The data of a retrieved item.
+ * [in] move_op A cursor operation MDBX_cursor_op.
+ * [out] distance_items A pointer to store estimated move distance
+ * as the number of elements.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+LIBMDBX_API int mdbx_estimate_move(const MDBX_cursor *cursor, MDBX_val *key,
+ MDBX_val *data, MDBX_cursor_op move_op,
+ ptrdiff_t *distance_items);
-/* Dump the entries in the reader lock table.
+/* Estimates the size of a range as a number of elements. The results
+ * of such estimation can be used to build and/or optimize query execution
+ * plans.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] func A MDBX_msg_func function
- * [in] ctx Anything the message function needs
+ * Please see notes on accuracy of the result in mdbx_estimate_distance()
+ * description above.
*
- * Returns < 0 on failure, >= 0 on success. */
-LIBMDBX_API int mdbx_reader_list(MDBX_env *env, MDBX_msg_func *func, void *ctx);
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] begin_key The key of range beginning or NULL for explicit FIRST.
+ * [in] begin_data Optional additional data for seeking among sorted
+ * duplicates. Only for MDBX_DUPSORT, NULL otherwise.
+ * [in] end_key The key of range ending or NULL for explicit LAST.
+ * [in] end_data Optional additional data for seeking among sorted
+ * duplicates. Only for MDBX_DUPSORT, NULL otherwise.
+ * [out] size_items A pointer to store range estimation result.
+ *
+ * Returns A non-zero error value on failure and 0 on success. */
+#define MDBX_EPSILON ((MDBX_val *)((ptrdiff_t)-1))
+LIBMDBX_API int mdbx_estimate_range(MDBX_txn *txn, MDBX_dbi dbi,
+ MDBX_val *begin_key, MDBX_val *begin_data,
+ MDBX_val *end_key, MDBX_val *end_data,
+ ptrdiff_t *size_items);
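The estimation comments mention query execution plans; one simple use is choosing which of two candidate indexes to scan. The sketch below is a hypothetical planner helper, not part of libmdbx. It takes two values of the kind stored through `size_items` and assumes that only the magnitude of an estimate matters for cost.

```c
#include <stddef.h>

/* Magnitude of an estimate without signed overflow (safe for PTRDIFF_MIN). */
static size_t ex_magnitude(ptrdiff_t estimate) {
  return estimate < 0 ? (size_t)0 - (size_t)estimate : (size_t)estimate;
}

/* Hypothetical planner helper, not part of libmdbx.
 * Returns 0 if the first index looks cheaper (or equal) to scan,
 * 1 if the second one does.
 * ASSUMPTION: only the magnitude of an estimate matters for cost. */
static int ex_pick_cheaper_index(ptrdiff_t estimate_a, ptrdiff_t estimate_b) {
  return ex_magnitude(estimate_b) < ex_magnitude(estimate_a) ? 1 : 0;
}
```

Because the estimates are rough, a real planner would likely apply a tolerance before preferring one index over another.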
-/* Check for stale entries in the reader lock table.
+/* Determines whether the given address is on a dirty database page of the
+ * transaction or not. Ultimately, this allows avoiding copying data from
+ * non-dirty pages.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [out] dead Number of stale slots that were cleared
+ * "Dirty" pages are those that have already been changed during a write
+ * transaction. Accordingly, any further changes may result in such pages being
+ * overwritten. Therefore, libmdbx functions that modify the database should
+ * NOT be given arguments that point to data in those pages. In turn,
+ * "not dirty" pages will be copied before modification.
*
- * Returns 0 on success, non-zero on failure. */
-LIBMDBX_API int mdbx_reader_check(MDBX_env *env, int *dead);
-
-LIBMDBX_API char *mdbx_dkey(const MDBX_val *key, char *const buf,
- const size_t bufsize);
+ * In other words, data from dirty pages must either be copied before being
+ * passed as arguments for further processing or rejected at the argument
+ * validation stage. Thus, mdbx_is_dirty() allows you to get rid of unnecessary
+ * copying, and perform a more complete check of the arguments.
+ *
+ * NOTE: The address passed must point to the beginning of the data. This is the
+ * only way to ensure that the actual page header is physically located in the
+ * same memory page, including for multi-pages with long data.
+ *
+ * NOTE: In rare cases the function may return a false positive answer
+ * (MDBX_RESULT_TRUE when data is NOT on a dirty page), but never a false
+ * negative if the arguments are correct.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] ptr The address of data to check.
+ *
+ * Returns:
+ * - MDBX_RESULT_TRUE = given address is on a dirty page.
+ * - MDBX_RESULT_FALSE = given address is NOT on a dirty page.
+ * - Otherwise the error code. */
+LIBMDBX_API int mdbx_is_dirty(const MDBX_txn *txn, const void *ptr);
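The three-way result of mdbx_is_dirty() maps naturally onto three caller actions. The sketch below is a standalone illustration, not libmdbx code: the `EX_` names are hypothetical stand-ins, (-1) for "true" follows the value this header quotes for MDBX_RESULT_TRUE, and 0 for "false" is an assumption.

```c
/* Stand-ins for the libmdbx result codes so the sketch compiles alone.
 * ASSUMPTION: "false" is 0; "true" is (-1) as quoted in this header. */
enum { EX_RESULT_FALSE = 0, EX_RESULT_TRUE = -1 };

typedef enum { EX_USE_IN_PLACE, EX_COPY_FIRST, EX_FAIL } ex_action;

/* Hypothetical helper: maps the result of an mdbx_is_dirty()-style check
 * to what the caller should do with the pointer before an update. */
static ex_action ex_action_for(int is_dirty_result) {
  switch (is_dirty_result) {
  case EX_RESULT_FALSE:
    return EX_USE_IN_PLACE; /* page is not dirty: no copy needed */
  case EX_RESULT_TRUE:
    return EX_COPY_FIRST; /* page may be overwritten: copy the data */
  default:
    return EX_FAIL; /* anything else is an error code */
  }
}
```

Since false positives are possible, the copy path may occasionally run when it was not strictly needed; that costs time but not correctness.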
-LIBMDBX_API int mdbx_env_close_ex(MDBX_env *env, int dont_sync);
+/* Sequence generation for a database.
+ *
+ * The function allows creating a linear sequence of unique positive integers
+ * for each database. The function can also be called within a read transaction
+ * to retrieve the current sequence value, in which case the increment must be
+ * zero. Sequence changes become visible outside the current write transaction
+ * after it is committed, and are discarded on abort.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [out] result The optional address where the value of sequence before the
+ * change will be stored.
+ * [in] increment Value by which to increase the sequence,
+ * must be 0 for read-only transactions.
+ *
+ * Returns A non-zero error value on failure and 0 on success, some
+ * possible errors are:
+ * - MDBX_RESULT_TRUE = Increasing the sequence has resulted in an overflow
+ * and therefore cannot be executed. */
+LIBMDBX_API int mdbx_dbi_sequence(MDBX_txn *txn, MDBX_dbi dbi, uint64_t *result,
+ uint64_t increment);
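The sequence semantics described for mdbx_dbi_sequence() (the previous value is reported, and an overflowing increment is refused) can be modelled in a few lines. This is a toy in-memory model, not the libmdbx implementation: `ex_sequence` is a hypothetical name, -1 stands in for MDBX_RESULT_TRUE, and storing the previous value even when the increment is refused is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of the documented behaviour; NOT the libmdbx implementation.
 * Returns 0 on success, or -1 (standing in for MDBX_RESULT_TRUE) when the
 * increment would overflow the 64-bit sequence. */
static int ex_sequence(uint64_t *sequence, uint64_t *result,
                       uint64_t increment) {
  if (result)
    *result = *sequence; /* the value before the change (optional output) */
  if (increment > UINT64_MAX - *sequence)
    return -1; /* overflow: the sequence is left untouched */
  *sequence += increment;
  return 0;
}
```

Calling the model with an increment of 0 only reads the current value, which mirrors the rule for read-only transactions.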
-/* Sets threshold to force flush the data buffers to disk,
- * even of MDBX_NOSYNC, MDBX_NOMETASYNC and MDBX_MAPASYNC flags
- * in the environment. The value affects all processes which operates with given
- * DB until the last process close DB or a new value will be settled.
+/* Compare two data items according to a particular database.
*
- * Data is always written to disk when mdbx_txn_commit() is called,
- * but the operating system may keep it buffered. MDBX always flushes
- * the OS buffers upon commit as well, unless the environment was
- * opened with MDBX_NOSYNC, MDBX_MAPASYNC or in part MDBX_NOMETASYNC.
+ * This returns a comparison as if the two data items were keys in the
+ * specified database.
*
- * The default is 0, than mean no any threshold checked, and no additional
- * flush will be made.
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] a The first item to compare.
+ * [in] b The second item to compare.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] bytes The size in bytes of summary changes when a synchronous
- * flush would be made.
+ * Returns < 0 if a < b, 0 if a == b, > 0 if a > b */
+LIBMDBX_API int mdbx_cmp(MDBX_txn *txn, MDBX_dbi dbi, const MDBX_val *a,
+ const MDBX_val *b);
+
+/* Compare two data items according to a particular database.
*
- * Returns A non-zero error value on failure and 0 on success. */
-LIBMDBX_API int mdbx_env_set_syncbytes(MDBX_env *env, size_t bytes);
+ * This returns a comparison as if the two items were data items of the
+ * specified database. The database must have the MDBX_DUPSORT flag.
+ *
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] a The first item to compare.
+ * [in] b The second item to compare.
+ *
+ * Returns < 0 if a < b, 0 if a == b, > 0 if a > b */
+LIBMDBX_API int mdbx_dcmp(MDBX_txn *txn, MDBX_dbi dbi, const MDBX_val *a,
+ const MDBX_val *b);
-/* Sets relative period since the last unsteay commit to force flush the data
- * buffers to disk, even of MDBX_NOSYNC, MDBX_NOMETASYNC and MDBX_MAPASYNC flags
- * in the environment. The value affects all processes which operates with given
- * DB until the last process close DB or a new value will be settled.
+/* A callback function used to enumerate the reader lock table.
+ *
+ * [in] ctx An arbitrary context pointer for the callback.
+ * [in] num The serial number during enumeration, starting from 1.
+ * [in] slot The reader lock table slot number.
+ * [in] txnid The ID of the transaction being read,
+ * i.e. the MVCC-snapshot number.
+ * [in] lag The lag from a recent MVCC-snapshot, i.e. the number of
+ * committed transactions since the read transaction started.
+ * [in] pid The reader process ID.
+ * [in] thread The reader thread ID.
+ * [in] bytes_used The number of the last used page in the MVCC-snapshot
+ * being read, i.e. the database file can't be shrunk
+ * beyond this.
+ * [in] bytes_retained The total size of the database pages that were retired
+ * by committed write transactions after the reader's
+ * MVCC-snapshot, i.e. the space which would be freed
+ * after the reader releases the MVCC-snapshot for reuse
+ * by completing its read transaction.
*
- * Data is always written to disk when mdbx_txn_commit() is called,
- * but the operating system may keep it buffered. MDBX always flushes
- * the OS buffers upon commit as well, unless the environment was
- * opened with MDBX_NOSYNC, MDBX_MAPASYNC or in part MDBX_NOMETASYNC.
+ * Returns < 0 on failure, >= 0 on success. */
+typedef int(MDBX_reader_list_func)(void *ctx, int num, int slot, mdbx_pid_t pid,
+ mdbx_tid_t thread, uint64_t txnid,
+ uint64_t lag, size_t bytes_used,
+ size_t bytes_retained);
+
+/* Enumerate the entries in the reader lock table.
*
- * Settled period don't checked asynchronously, but only inside the functions.
- * mdbx_txn_commit() and mdbx_env_sync(). Therefore, in cases where transactions
- * are committed infrequently and/or irregularly, polling by mdbx_env_sync() may
- * be a reasonable solution to timeout enforcement.
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] func A MDBX_reader_list_func function.
+ * [in] ctx An arbitrary context pointer for the enumeration function.
*
- * The default is 0, than mean no any timeout checked, and no additional
- * flush will be made.
+ * Returns A non-zero error value on failure and 0 on success,
+ * or MDBX_RESULT_TRUE (-1) if the reader lock table is empty. */
+LIBMDBX_API int mdbx_reader_list(MDBX_env *env, MDBX_reader_list_func *func,
+ void *ctx);
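A callback with the MDBX_reader_list_func parameter list can accumulate whatever summary the caller needs through `ctx`. The sketch below finds the most lagging reader. It is standalone and hedged: `ex_pid_t`, `ex_tid_t`, `ex_lag_report` and `ex_track_lag` are hypothetical names, and the two typedefs merely stand in for the platform-specific mdbx_pid_t and mdbx_tid_t.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: stand-in typedefs so the sketch compiles without mdbx.h;
 * the real mdbx_pid_t and mdbx_tid_t are platform-specific. */
typedef int ex_pid_t;
typedef unsigned long ex_tid_t;

typedef struct {
  uint64_t max_lag;   /* largest lag seen so far */
  ex_pid_t worst_pid; /* process holding the most outdated snapshot */
  int readers;        /* number of entries visited */
} ex_lag_report;

/* Same parameter order as MDBX_reader_list_func in this header. */
static int ex_track_lag(void *ctx, int num, int slot, ex_pid_t pid,
                        ex_tid_t thread, uint64_t txnid, uint64_t lag,
                        size_t bytes_used, size_t bytes_retained) {
  ex_lag_report *report = (ex_lag_report *)ctx;
  (void)num, (void)slot, (void)thread, (void)txnid;
  (void)bytes_used, (void)bytes_retained;
  report->readers += 1;
  if (report->readers == 1 || lag > report->max_lag) {
    report->max_lag = lag;
    report->worst_pid = pid;
  }
  return 0; /* >= 0 reports success to the enumerator */
}
```

With the real header, a function of this shape would be passed to mdbx_reader_list() together with a pointer to a zero-initialized report.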
+
+/* Check for stale entries in the reader lock table.
*
- * [in] env An environment handle returned by mdbx_env_create()
- * [in] seconds_16dot16 The period in 1/65536 of second when a synchronous
- * flush would be made since the last unsteay commit.
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [out] dead Number of stale slots that were cleared.
*
- * Returns A non-zero error value on failure and 0 on success. */
-LIBMDBX_API int mdbx_env_set_syncperiod(MDBX_env *env,
- unsigned seconds_16dot16);
+ * Returns A non-zero error value on failure and 0 on success,
+ * or MDBX_RESULT_TRUE (-1) if dead reader(s) were found or the mutex was
+ * recovered. */
+LIBMDBX_API int mdbx_reader_check(MDBX_env *env, int *dead);
/* Returns a lag of the reading for the given transaction.
*
* Returns an information for estimate how much given read-only
* transaction is lagging relative the to actual head.
+ * This function is deprecated; use mdbx_txn_info() instead.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
* [out] percent Percentage of page allocation in the database.
*
* Returns Number of transactions committed after the given was started for
* read, or negative value on failure. */
-LIBMDBX_API int mdbx_txn_straggler(MDBX_txn *txn, int *percent);
+__deprecated LIBMDBX_API int mdbx_txn_straggler(MDBX_txn *txn, int *percent);
-/* A callback function for killing a laggard readers,
- * but also could waiting ones. Called in case of MDBX_MAP_FULL error.
+/* A lack-of-space callback function to resolve issues with laggard readers.
+ *
+ * Read transactions prevent reuse of pages freed by newer write transactions,
+ * thus the database can grow quickly. This callback will be called when there
+ * is not enough space in the database (i.e. before increasing the database size
+ * or before MDBX_MAP_FULL error) and thus can be used to resolve issues with
+ * "long-lived" read transactions.
+ *
+ * Depending on the arguments and needs, your implementation may wait, terminate
+ * a process or thread that is performing a long read, or perform some other
+ * action. In doing so it is important that the returned code always corresponds
+ * to the performed action.
*
* [in] env An environment handle returned by mdbx_env_create().
- * [in] pid pid of the reader process.
- * [in] tid thread_id of the reader thread.
- * [in] txn Transaction number on which stalled.
+ * [in] pid The pid of the reader process.
+ * [in] tid The thread_id of the reader thread.
+ * [in] txn The transaction number on which the reader is stalled.
* [in] gap A lag from the last commited txn.
- * [in] retry A retry number, less that zero for notify end of OOM-loop.
- *
- * Returns -1 on failure (reader is not killed),
- * 0 should wait or retry,
- * 1 drop reader txn-lock (reading-txn was aborted),
- * >1 drop reader registration (reader process was killed). */
-typedef int(MDBX_oom_func)(MDBX_env *env, int pid, mdbx_tid_t tid, uint64_t txn,
- unsigned gap, int retry);
+ * [in] space The space that would actually become available for reuse after
+ * this reader finishes. The callback function can take this value
+ * into account to evaluate the impact that a long-running
+ * transaction has.
+ * [in] retry A retry number starting from 0. If the callback has returned 0
+ * at least once, then at the end of the current OOM-handler loop
+ * the callback will be called once more with a negative value to
+ * notify it about the end of the loop. The callback function can
+ * use this value to implement timeout logic while waiting for
+ * readers.
+ *
+ * The RETURN CODE determines the further actions of libmdbx and must match the
+ * action which was executed by the callback:
+ *
+ * -2 or less = An error condition and the reader was not killed.
+ *
+ * -1 = The callback was unable to solve the problem and agreed
+ * on MDBX_MAP_FULL error, libmdbx should increase the
+ * database size or return MDBX_MAP_FULL error.
+ *
+ * 0 (zero) = The callback solved the problem or just waited for
+ * a while, libmdbx should rescan the reader lock table and
+ * retry. This also includes the situation when the
+ * corresponding transaction was terminated in the normal way
+ * by mdbx_txn_abort() or mdbx_txn_reset(), and may be
+ * restarted, i.e. the reader slot does not need to be
+ * cleaned from the transaction.
+ *
+ * 1 = The transaction was aborted asynchronously and the reader
+ * slot should be cleared immediately, i.e. the read
+ * transaction will not continue but mdbx_txn_abort() or
+ * mdbx_txn_reset() will be called later.
+ *
+ * 2 or greater = The reader process was terminated or killed, and libmdbx
+ * should entirely reset reader registration. */
+typedef int(MDBX_oom_func)(MDBX_env *env, mdbx_pid_t pid, mdbx_tid_t tid,
+ uint64_t txn, unsigned gap, size_t space, int retry);
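The return-code contract of MDBX_oom_func is easiest to see as a small decision function. The sketch below is a hypothetical policy, not part of libmdbx: it only waits and never terminates readers, so it never returns 1 or 2. `ex_oom_policy` and `EX_MAX_RETRIES` are illustrative names, and treating the result of the negative-retry notification call as unused is an assumption.

```c
#include <stddef.h>

enum { EX_MAX_RETRIES = 3 }; /* hypothetical patience limit */

/* Hypothetical policy: chooses the return code of an MDBX_oom_func-style
 * callback from its `space` and `retry` arguments. */
static int ex_oom_policy(size_t space, int retry) {
  if (retry < 0)
    return 0; /* end-of-loop notification; ASSUMPTION: result is unused */
  if (space == 0)
    return -1; /* waiting would free nothing: accept MDBX_MAP_FULL */
  if (retry < EX_MAX_RETRIES)
    return 0; /* a real callback would sleep briefly here, then retry */
  return -1; /* gave up: let libmdbx grow the database or fail */
}
```

A real callback would forward its `space` and `retry` arguments to such a policy and perform the wait itself before returning 0, so that the returned code still matches the action actually taken.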
/* Set the OOM callback.
*
- * Callback will be called only on out-of-pages case for killing
- * a laggard readers to allowing reclaiming of freeDB.
+ * The callback will only be triggered on lack of space, to resolve issues with
+ * lagging reader(s) (i.e. to kill them) so that reuse of pages from the
+ * garbage collector can resume.
*
- * [in] env An environment handle returned by mdbx_env_create().
- * [in] oomfunc A MDBX_oom_func function or NULL to disable.
+ * [in] env An environment handle returned by mdbx_env_create().
+ * [in] oom_func A MDBX_oom_func function or NULL to disable.
*
* Returns A non-zero error value on failure and 0 on success. */
LIBMDBX_API int mdbx_env_set_oomfunc(MDBX_env *env, MDBX_oom_func *oom_func);
/* Get the current oom_func callback.
*
- * Callback will be called only on out-of-pages case for killing
- * a laggard readers to allowing reclaiming of freeDB.
- *
* [in] env An environment handle returned by mdbx_env_create().
*
* Returns A MDBX_oom_func function or NULL if disabled. */
LIBMDBX_API MDBX_oom_func *mdbx_env_get_oomfunc(MDBX_env *env);
-#define MDBX_DBG_ASSERT 1
-#define MDBX_DBG_PRINT 2
-#define MDBX_DBG_TRACE 4
-#define MDBX_DBG_EXTRA 8
-#define MDBX_DBG_AUDIT 16
-#define MDBX_DBG_JITTER 32
-#define MDBX_DBG_DUMP 64
-#define MDBX_DBG_LEGACY_MULTIOPEN 128
-
-typedef void MDBX_debug_func(int type, const char *function, int line,
- const char *msg, va_list args);
-
-LIBMDBX_API int mdbx_setup_debug(int flags, MDBX_debug_func *logger);
+/**** B-tree Traversal *********************************************************
+ * This is an internal API for the mdbx_chk tool. You should avoid using it,
+ * except in some extreme special cases. */
+/* Page types for traversing the b-tree. */
typedef enum {
MDBX_page_void,
MDBX_page_meta,
@@ -1711,108 +3347,20 @@ typedef enum {
#define MDBX_PGWALK_GC ((const char *)((ptrdiff_t)-1))
#define MDBX_PGWALK_META ((const char *)((ptrdiff_t)-2))
+/* Callback function for traversing the b-tree. */
typedef int
MDBX_pgvisitor_func(const uint64_t pgno, const unsigned number, void *const ctx,
const int deep, const char *const dbi,
const size_t page_size, const MDBX_page_type_t type,
const size_t nentries, const size_t payload_bytes,
const size_t header_bytes, const size_t unused_bytes);
+
+/* B-tree traversal function. */
LIBMDBX_API int mdbx_env_pgwalk(MDBX_txn *txn, MDBX_pgvisitor_func *visitor,
void *ctx);
-typedef struct mdbx_canary {
- uint64_t x, y, z, v;
-} mdbx_canary;
-
-LIBMDBX_API int mdbx_canary_put(MDBX_txn *txn, const mdbx_canary *canary);
-LIBMDBX_API int mdbx_canary_get(MDBX_txn *txn, mdbx_canary *canary);
-
-/* Returns:
- * - MDBX_RESULT_TRUE
- * when no more data available or cursor not positioned;
- * - MDBX_RESULT_FALSE
- * when data available;
- * - Otherwise the error code. */
-LIBMDBX_API int mdbx_cursor_eof(MDBX_cursor *mc);
-
-/* Returns: MDBX_RESULT_TRUE, MDBX_RESULT_FALSE or Error code. */
-LIBMDBX_API int mdbx_cursor_on_first(MDBX_cursor *mc);
-
-/* Returns: MDBX_RESULT_TRUE, MDBX_RESULT_FALSE or Error code. */
-LIBMDBX_API int mdbx_cursor_on_last(MDBX_cursor *mc);
-
-/* Estimates the distance between cursors as the number of elements.
- * Both cursors must be initialized for the same DBI.
- *
- * [in] cursor_a The first cursor for estimation.
- * [in] cursor_b The second cursor for estimation.
- * [out] distance_items A pointer to store estimated distance value,
- * i.e. *distance_items = distance(a - b).
- *
- * Returns A non-zero error value on failure and 0 on success. */
-LIBMDBX_API int mdbx_estimate_distance(const MDBX_cursor *first,
- const MDBX_cursor *last,
- ptrdiff_t *distance_items);
-
-/* Estimates the move distance, i.e. between the current cursor position and
- * next position after the specified move-operation with given key and data.
- * Current cursor position and state are preserved.
- *
- * [in] cursor Cursor for estimation.
- * [in,out] key The key for a retrieved item.
- * [in,out] data The data of a retrieved item.
- * [in] op A cursor operation MDBX_cursor_op.
- * [out] distance_items A pointer to store estimated move distance
- * as the number of elements.
- *
- * Returns A non-zero error value on failure and 0 on success. */
-LIBMDBX_API int mdbx_estimate_move(const MDBX_cursor *cursor, MDBX_val *key,
- MDBX_val *data, MDBX_cursor_op move_op,
- ptrdiff_t *distance_items);
-
-/* Estimates the size of a range in the number of elements.
- *
- * [in] txn A transaction handle returned by mdbx_txn_begin().
- * [in] dbi A database handle returned by mdbx_dbi_open().
- * [in] begin_key The key of range beginning or NULL for explicit FIRST.
- * [in] begin_data Optional additional data to seeking among sorted
- * duplicates. Only for MDBX_DUPSORT, NULL otherwise.
- * [in] end_key The key of range ending or NULL for explicit LAST.
- * [in] end_data Optional additional data to seeking among sorted
- * duplicates. Only for MDBX_DUPSORT, NULL otherwise.
- * [out] distance_items A pointer to store range estimation result.
- *
- * Returns A non-zero error value on failure and 0 on success. */
-#define MDBX_EPSILON ((MDBX_val *)((ptrdiff_t)-1))
-LIBMDBX_API int mdbx_estimate_range(MDBX_txn *txn, MDBX_dbi dbi,
- MDBX_val *begin_key, MDBX_val *begin_data,
- MDBX_val *end_key, MDBX_val *end_data,
- ptrdiff_t *size_items);
-
-LIBMDBX_API int mdbx_replace(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
- MDBX_val *new_data, MDBX_val *old_data,
- unsigned flags);
-/* Same as mdbx_get(), but:
- * 1) if values_count is not NULL, then returns the count
- * of multi-values/duplicates for a given key.
- * 2) updates the key for pointing to the actual key's data inside DB. */
-LIBMDBX_API int mdbx_get_ex(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
- MDBX_val *data, size_t *values_count);
-
-LIBMDBX_API int mdbx_is_dirty(const MDBX_txn *txn, const void *ptr);
-
-LIBMDBX_API int mdbx_dbi_sequence(MDBX_txn *txn, MDBX_dbi dbi, uint64_t *result,
- uint64_t increment);
-
-LIBMDBX_API int mdbx_limits_pgsize_min(void);
-LIBMDBX_API int mdbx_limits_pgsize_max(void);
-LIBMDBX_API intptr_t mdbx_limits_dbsize_min(intptr_t pagesize);
-LIBMDBX_API intptr_t mdbx_limits_dbsize_max(intptr_t pagesize);
-LIBMDBX_API intptr_t mdbx_limits_keysize_max(intptr_t pagesize);
-LIBMDBX_API intptr_t mdbx_limits_txnsize_max(intptr_t pagesize);
-
-/*----------------------------------------------------------------------------*/
-/* attribute support functions for Nexenta */
+/**** Attribute support functions for Nexenta *********************************/
+#ifdef MDBX_NEXENTA_ATTRS
typedef uint_fast64_t mdbx_attr_t;
/* Store by cursor with attribute.
@@ -1844,10 +3392,10 @@ typedef uint_fast64_t mdbx_attr_t;
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
* - MDBX_EKEYMISMATCH
- * - MDBX_MAP_FULL - the database is full, see mdbx_env_set_mapsize().
- * - MDBX_TXN_FULL - the transaction has too many dirty pages.
- * - MDBX_EACCES - an attempt was made to write in a read-only transaction.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_MAP_FULL = the database is full, see mdbx_env_set_mapsize().
+ * - MDBX_TXN_FULL = the transaction has too many dirty pages.
+ * - MDBX_EACCES = an attempt was made to write in a read-only transaction.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_cursor_put_attr(MDBX_cursor *cursor, MDBX_val *key,
MDBX_val *data, mdbx_attr_t attr,
unsigned flags);
@@ -1890,10 +3438,10 @@ LIBMDBX_API int mdbx_cursor_put_attr(MDBX_cursor *cursor, MDBX_val *key,
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
* - MDBX_KEYEXIST
- * - MDBX_MAP_FULL - the database is full, see mdbx_env_set_mapsize().
- * - MDBX_TXN_FULL - the transaction has too many dirty pages.
- * - MDBX_EACCES - an attempt was made to write in a read-only transaction.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_MAP_FULL = the database is full, see mdbx_env_set_mapsize().
+ * - MDBX_TXN_FULL = the transaction has too many dirty pages.
+ * - MDBX_EACCES = an attempt was made to write in a read-only transaction.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_put_attr(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
MDBX_val *data, mdbx_attr_t attr, unsigned flags);
@@ -1912,8 +3460,8 @@ LIBMDBX_API int mdbx_put_attr(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_NOTFOUND - the key-value pair was not in the database.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_NOTFOUND = the key-value pair was not in the database.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_set_attr(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
MDBX_val *data, mdbx_attr_t attr);
@@ -1925,15 +3473,15 @@ LIBMDBX_API int mdbx_set_attr(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
* and the address and length of the data are returned in the object to which
* data refers. See mdbx_get() for restrictions on using the output values.
*
- * [in] cursor A cursor handle returned by mdbx_cursor_open()
- * [in,out] key The key for a retrieved item
- * [in,out] data The data of a retrieved item
- * [in] op A cursor operation MDBX_cursor_op
+ * [in] cursor A cursor handle returned by mdbx_cursor_open().
+ * [in,out] key The key for a retrieved item.
+ * [in,out] data The data of a retrieved item.
+ * [in] op A cursor operation MDBX_cursor_op.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_NOTFOUND - no matching key found.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_NOTFOUND = no matching key found.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_cursor_get_attr(MDBX_cursor *mc, MDBX_val *key,
MDBX_val *data, mdbx_attr_t *attrptr,
MDBX_cursor_op op);
@@ -1955,20 +3503,21 @@ LIBMDBX_API int mdbx_cursor_get_attr(MDBX_cursor *mc, MDBX_val *key,
* NOTE: Values returned from the database are valid only until a
* subsequent update operation, or the end of the transaction.
*
- * [in] txn A transaction handle returned by mdbx_txn_begin()
- * [in] dbi A database handle returned by mdbx_dbi_open()
- * [in] key The key to search for in the database
- * [in,out] data The data corresponding to the key
+ * [in] txn A transaction handle returned by mdbx_txn_begin().
+ * [in] dbi A database handle returned by mdbx_dbi_open().
+ * [in] key The key to search for in the database.
+ * [in,out] data The data corresponding to the key.
*
* Returns A non-zero error value on failure and 0 on success, some
* possible errors are:
- * - MDBX_NOTFOUND - the key was not in the database.
- * - MDBX_EINVAL - an invalid parameter was specified. */
+ * - MDBX_NOTFOUND = the key was not in the database.
+ * - MDBX_EINVAL = an invalid parameter was specified. */
LIBMDBX_API int mdbx_get_attr(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
MDBX_val *data, mdbx_attr_t *attrptr);
+#endif /* MDBX_NEXENTA_ATTRS */
-/*----------------------------------------------------------------------------*/
-/* LY: temporary workaround for Elbrus's memcmp() bug. */
+/*******************************************************************************
+ * LY: temporary workaround for Elbrus's memcmp() bug. */
#ifndef __GLIBC_PREREQ
#if defined(__GLIBC__) && defined(__GLIBC_MINOR__)
#define __GLIBC_PREREQ(maj, min) \