2. streamcorpus — single-document data representation

The core data representation for individual documents is a StreamItem. This includes both storage for the entire body contents in several forms and storage for tokenizers that parse the document. This is a pure in-memory form, but generated with Apache Thrift, so that it can be serialized to network storage or on-disk Chunk files.

In addition to the current version, there are two previous versions of the StreamItem structure, and files in the previous format are stored in places such as Amazon S3 public data sets.

2.1. Stream items

There are many parts to this, loosely organized as follows:

class streamcorpus.StreamItem(version=1, doc_id=None, abs_url=None, schost=None, original_url=None, source=None, body=None, source_metadata={}, stream_id=None, stream_time=None, other_content={}, ratings={}, external_ids={})

This is the primary interface to the corpus data. It is a snapshot of a single document, identified by its URL, at a single point in time. All corpora are stream corpora, even if they were not explicitly created as such.

version

Version of this stream item, always streamcorpus.Versions.v0_3_0.

doc_id

Identifier of the document, which may change over time, and of which this stream item is a snapshot. This is always an MD5 hash of abs_url.

abs_url

Normalized form of the original document URL.

schost

scheme://hostname part of abs_url.

original_url

Original URL of the document. This is only present if the original URL is not a valid canonicalized URL, in which case abs_url will be derived from this field.

source

String uniquely identifying this data set. Should start with a year string, such as “news” or “social”.

body

streamcorpus.ContentItem holding the primary content of the stream item.

source_metadata

Dictionary mapping strings to arbitrary binary data. The keys should be short, descriptive, and free of whitespace. In many cases the keys will be the same as the source of this stream item, and the values will be serialized JSON matching a metadata schema from <http://trec-kba.org/schemas/v1.0/>. http_headers is also an expected key.

stream_id

Unique identifier for a stream item, snapshotting a document at a point in time. This should be constructed as:

si.stream_id = '%d-%s' % (si.stream_time.epoch_ticks, si.doc_id)

stream_time

streamcorpus.StreamTime identifying the earliest time that this content was known to exist. Usually, body was saved at the time of that first observation.
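As a rough illustration of how doc_id, stream_time, and stream_id fit together, the following sketch builds the identifiers with only the Python standard library. It does not call streamcorpus itself: the helper name make_ids and the example URL are hypothetical, the URL is assumed to be already normalized, and using the hexadecimal digest form of the MD5 hash is an assumption.

```python
import hashlib


def make_ids(abs_url, epoch_ticks):
    """Return (doc_id, stream_id) for a normalized URL and a Unix time.

    Hypothetical helper, not part of streamcorpus. doc_id is assumed
    to be the MD5 hex digest of abs_url; stream_id joins the
    whole-second timestamp and doc_id with a hyphen.
    """
    doc_id = hashlib.md5(abs_url.encode('utf-8')).hexdigest()
    # %d truncates the fractional seconds of epoch_ticks.
    stream_id = '%d-%s' % (epoch_ticks, doc_id)
    return doc_id, stream_id


# 2000-01-01T12:34:00.000123Z expressed as fractional Unix seconds.
doc_id, stream_id = make_ids('http://example.com/news/story.html',
                             946730040.000123)
print(stream_id)
```

Because doc_id depends only on abs_url, two snapshots of the same document share a doc_id and differ only in the timestamp prefix of stream_id.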

other_content

Dictionary mapping strings to streamcorpus.ContentItem for additional data attached to this stream item. Typical keys are title, anchor, or extracted. anchor should map to the single anchor text of a hyperlink pointing to this document.

ratings

Dictionary mapping annotator ID strings to lists of streamcorpus.Rating judging the relationship of this entire document to some target entity.

external_ids

Two-level map from system identifier to document or stream ID to some external identifier.

class streamcorpus.Rating

Ratings are human-generated assertions about an entire document’s utility for a particular topic or entity in a reference knowledge base.

annotator

The Annotator that produced this rating.

target

The Target of the rating.

relevance

Numerical score assigned by annotator to “judge” or “rate” the utility of this stream item to addressing the target information need. The range and interpretation of relevance numbers depend on the annotator. This value can represent a rank ordering or an enumeration such as -1=Garbage, 0=Neutral, 1=Useful, 2=Vital.

contains_mention

Boolean indication of whether the document actually mentions the target entity. This is only partially correlated with relevance. For example, a document might mention the entity but only in “chrome” text, making it a Garbage-rated text for that entity.

comments

Any additional notes provided by the annotator.

mentions

Specific strings that correspond to the entity in text.

flags

List of enumeration flags from FlagType; for instance, streamcorpus.FlagType.PROFILE to indicate that this stream item is a profile for the target entity.

class streamcorpus.ContentItem

A complete representation of the item’s data. For instance, the streamcorpus.StreamItem.body is a content item.

raw

The original download as an unprocessed byte array.

encoding

Character encoding, either guessed or determined from protocol headers.

media_type

MIME type of the document, possibly from an HTTP or similar Content-Type: header.

clean_html

HTML-formatted version of raw. This has correct UTF-8 encoding and no broken tags. All HTML-escaped characters are converted to their UTF-8 equivalents, and <, >, and & are escaped.

clean_visible

Copy of clean_html with all HTML tags replaced with whitespace. Byte offsets in this and clean_html are identical. <, >, and & remain HTML-escaped. This text can be directly inserted into an HTML or XML document without further escaping.
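The property that byte offsets are identical in clean_html and clean_visible follows from replacing each tag with whitespace of the same length. The sketch below illustrates only that idea with a regular expression; it is not the converter streamcorpus uses, the helper name blank_out_tags is hypothetical, and a real converter would also have to deal with comments, scripts, and similar cases.

```python
import re

# Matches one complete HTML tag. Assumes the input has no broken
# tags, which is what clean_html promises.
TAG = re.compile(rb'<[^>]*>')


def blank_out_tags(clean_html):
    """Replace every tag with the same number of space bytes."""
    return TAG.sub(lambda m: b' ' * len(m.group(0)), clean_html)


clean_html = b'<p>Tom &amp; <b>Jerry</b></p>'
clean_visible = blank_out_tags(clean_html)

# Both forms have the same length, so an offset found in one
# can be used to slice the other.
start = clean_html.index(b'Jerry')
print(clean_visible[start:start + 5])
```

Note that the escaped ampersand is left untouched, which matches the statement that <, >, and & remain HTML-escaped.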

logs

List of string log messages produced from the processing pipeline.

taggings

A dictionary mapping string tagger IDs to Tagging objects. This is a set of auto-generated taggings, such as a one-word-per-line (OWPL) tokenization and sentence chunking with part-of-speech, lemmatization, and NER classification. The dictionary key should match the streamcorpus.Tagging.tagger_id, and also match the key in sentences and sentence_blobs, which come from transforming a streamcorpus.Tagging.raw_tagging into Sentence and Token instances.

Taggings are generated from clean_visible, and offsets of all types refer to the clean_visible or clean_html version of the data, not raw.

labels

A dictionary mapping string annotator IDs to lists of Label annotations on the entire content item.

sentences

A dictionary mapping tagger IDs to ordered lists of Sentence objects produced by that tagger.

sentence_blobs

A dictionary the same as sentences, but the dictionary values are Thrift-serialized binary strings that can be deserialized into the lists on demand.

language

The Language of the text.

relations

A dictionary mapping tagger IDs to lists of Relation identified by the tagger.

attributes

A dictionary mapping tagger IDs to lists of Attribute identified by the tagger.

external_ids

A dictionary mapping tagger IDs to dictionaries mapping numeric mention IDs to text. This allows external systems to associate record IDs with individual mentions or sets of mentions.

class streamcorpus.Annotator

Description of a human (or a set of humans) that generated the data in a Label or Rating.

annotator_id

Name of the annotator. This is also used as the source key in several maps, and it is important for annotator IDs to be both consistent across annotations and unique. We use the following conventions:

  • Avoid whitespace.
  • An email address is the best identifier.
  • Where a single email address is not appropriate, create a descriptive string, e.g. nist-trec-kba-2012-assessors.
  • author means the person who wrote the original text.
annotation_time

Approximate StreamTime when the annotation was provided by a human. This may be None if no time is available.

class streamcorpus.Attribute

Description of an attribute of an entity discovered by a tagger in the text.

attribute_type

The AttributeType of the attribute.

evidence

String presented by the tagger as evidence of the attribute.

value

Normalized, typed string value derived from evidence. The type of this is determined by attribute_type. If the type is streamcorpus.AttributeType.PER_GENDER, for instance, this value will be a string containing an integer value from the streamcorpus.Gender enumeration. If the type implies a date-time type, the value is a streamcorpus.StreamTime.zulu_timestamp.

sentence_id

Zero-based index into the sentences array for this tagger.

mention_id

Index into the mentions in this document. This identifies the mention to which the attribute applies.

class streamcorpus.Label

Labels are human-generated assertions about a portion of a document. For example, a human author might label their own text by inserting hyperlinks to Wikipedia, or a NIST assessor might record which tokens in a text mention a target entity.

Labels appear in lists as the value parts of streamcorpus.Token.labels, streamcorpus.Sentence.labels, and streamcorpus.ContentItem.labels.

annotator

The Annotator source of this label.

target

The Target entity of this label.

offsets

Map of OffsetType enum value to Offset describing what is labeled.

positive

Labels are usually positive assertions that the token mentions the target. It is sometimes useful to collect negative assertions that a token is not the target, which can be indicated by setting this field to False.

comments

Additional notes from the annotator about this label.

mentions

List of strings that are mentions of this target in the text.

relevance

Numerical score assigned by annotator to “judge” or “rate” the utility of this stream item to addressing the target information need. The range and interpretation of relevance numbers depend on the annotator. This value can represent a rank ordering or an enumeration such as -1=Garbage, 0=Neutral, 1=Useful, 2=Vital.

stream_id

streamcorpus.StreamItem.stream_id of the stream item containing this label.

flags

List of integer FlagType enumeration values further describing this label.

class streamcorpus.Target

An entity or topic being identified by a Label. These often come from a knowledge base such as Wikipedia.

target_id

Unique string identifier for the target. This can be a URL from a Wikipedia, Freebase, or other structured reference system about the target.

kb_id

Optional string identifying the knowledge base, if it is not obvious from target_id.

kb_snapshot_time

StreamTime when the knowledge base article was accessed.

class streamcorpus.FlagType

General-purpose flags. These are integer values.

PROFILE

This label points to a profile for the target in a knowledge base.

class streamcorpus.StreamTime

Time attached to a stream item.

epoch_ticks

Time, in fractional seconds, since midnight 1 Jan 1970 UTC (“Unix time”), the same time as returned by time.time().

zulu_timestamp

String-formatted version of the time, such as ‘2000-01-01T12:34:00.000123Z’.
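The two StreamTime fields describe the same instant in two forms. As a minimal sketch of converting between them with the standard library, assuming the zulu format is the one used in the example string ‘2000-01-01T12:34:00.000123Z’ (the helper names here are hypothetical and are not streamcorpus functions):

```python
from datetime import datetime, timezone

# Assumed layout of zulu_timestamp, taken from the example string.
ZULU_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def zulu_to_epoch_ticks(zulu_timestamp):
    """Parse a zulu timestamp string into fractional Unix seconds."""
    parsed = datetime.strptime(zulu_timestamp, ZULU_FORMAT)
    return (parsed.replace(tzinfo=timezone.utc) - EPOCH).total_seconds()


def epoch_ticks_to_zulu(epoch_ticks):
    """Format Unix seconds as a zulu timestamp string."""
    moment = datetime.fromtimestamp(epoch_ticks, tz=timezone.utc)
    return moment.strftime(ZULU_FORMAT)


ticks = zulu_to_epoch_ticks('2000-01-01T12:34:00.000123Z')
print(int(ticks))
print(epoch_ticks_to_zulu(946730040))
```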

class streamcorpus.Offset

A range of data within some ContentItem.

type

The OffsetType value that describes what units this offset uses; for instance, streamcorpus.OffsetType.CHARS.

first

First item in the range.

length

Length of the range.

data = content_item[offset.first:offset.first+offset.length]

xpath

An optional string giving an XPath query into an XML or XHTML document.

content_form

If a ContentItem has multiple forms, the name of the form, such as “raw”, “clean_html”, or “clean_visible”. “clean_visible” is the most common case.

value

The actual content of the range. Only present as a debugging aid and frequently empty.
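To make the first and length fields concrete, here is a small sketch that applies an offset to one form of a content item. The Offset namedtuple is a hypothetical stand-in carrying a subset of the documented fields, not the Thrift-generated class, and the sample text is invented. It slices a byte string, which is the natural reading for a BYTES offset; a CHARS offset would presumably be applied to the decoded text instead.

```python
from collections import namedtuple

# Hypothetical stand-in with a subset of the documented fields.
Offset = namedtuple('Offset', ['first', 'length', 'content_form'])


def extract(content, offset):
    """Return the slice of content that the offset describes."""
    return content[offset.first:offset.first + offset.length]


clean_visible = b'Bob moved to Chicago last year.'
offset = Offset(first=13, length=7, content_form='clean_visible')
print(extract(clean_visible, offset))
```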

class streamcorpus.OffsetType

Part of an Offset that describes what units the offset uses.

LINES

The offset is a range of line numbers.

BYTES

The offset is a range of bytes.

CHARS

The offset is a range of characters, typically in Unicode.

class streamcorpus.Tagging

Information about and output from a tagging tool.

tagger_id

Opaque string identifying the tagger.

raw_tagging

Raw output of the tagging tool.

tagger_config

Short, human-readable description of the configuration parameters.

tagger_version

Short, human-readable version string of the tagging tool.

generation_time

StreamTime the tagging was generated.

class streamcorpus.Relation

Description of a relation between two entities that a tagger discovered in the text.

If a tagger discovers that Bob is located in Chicago, then relation_type would be streamcorpus.RelationType.PHYS_Located, sentence_id_1 and mention_id_1 would refer to “Bob”, and sentence_id_2 and mention_id_2 would refer to “Chicago”.

relation_type

Type of the relation, one of the RelationType enumeration values.

sentence_id_1

Zero-based sentence index for the sentences for this tagger ID for the first entity.

mention_id_1

Index into the mentions in the document for the first entity.

sentence_id_2

Zero-based sentence index for the sentences for this tagger ID for the second entity.

mention_id_2

Index into the mentions in the document for the second entity.

class streamcorpus.Language

Description of a natural language used in the text.

code

Two-letter code for the language, such as “en”.

name

Full name of the language.

class streamcorpus.Sentence

A complete sentence.

tokens

List of Token in the sentence.

labels

Map of string annotator ID to list of Label for labels over the entire sentence.

class streamcorpus.Token

Textual tokens identified by an NLP pipeline and marked up with metadata from automatic taggers, and possibly also labels from humans.

token_num

Zero-based index into the stream of tokens from a document.

token

Actual token string; a UTF-8 encoded Python string.

offsets

Location of the token in the original document. This is a map from OffsetType enum value to Offset, allowing the token to have line number, byte position, and character position offsets.

sentence_pos

Zero-based index into the sentence, or -1 if unavailable.

lemma

Lemmatization of the token.

pos

Part-of-speech label. For possible values, see http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

entity_type

One of the EntityType enumeration values.

mention_id

Identifier for each mention in this tagger’s description of the document. This is unique for a single tagger within a single document. -1 means “no mention”.

The mention ID distinguishes multi-token mentions; tokens that correspond to the same mention will have the same mention ID. This is needed when other fields do not change between tokens that are part of separate mentions: “the senator is known to his friends as David, Davy, Zeus, and Mr. Elephant”.

The mention ID identifies mentions in Relation objects.
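The following sketch shows how mention_id can be used to reassemble multi-token mentions from a token stream. The Token namedtuple is a hypothetical stand-in with only two of the documented fields, and the mention IDs assigned to the example tokens are invented for illustration; a real tagger chooses its own.

```python
from collections import OrderedDict, namedtuple

# Hypothetical stand-in carrying only the two fields used here.
Token = namedtuple('Token', ['token', 'mention_id'])


def group_mentions(tokens):
    """Map each mention_id to its token strings joined by spaces."""
    mentions = OrderedDict()
    for tok in tokens:
        if tok.mention_id == -1:  # -1 means "no mention"
            continue
        mentions.setdefault(tok.mention_id, []).append(tok.token)
    return OrderedDict((mention_id, ' '.join(words))
                       for mention_id, words in mentions.items())


tokens = [
    Token('David', 0), Token(',', -1),
    Token('Davy', 1), Token(',', -1),
    Token('Zeus', 2), Token(',', -1), Token('and', -1),
    Token('Mr.', 3), Token('Elephant', 3),
]
print(list(group_mentions(tokens).values()))
```

Without distinct mention IDs, the four names would be indistinguishable from one long mention, since their other fields (such as entity type) are the same.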

equiv_id

Single-document coref chain ID. This is unique for a single tagger within a single document. Tokens that refer to the same entity will have the same equivalence ID. -1 means “none”.

parent_id

Parent sentence position in the dependency parse. -1 means “none”.

dependency_path

Grammatical relation label on the path to the parent in a dependency parse, defined by whatever tagger was used.

labels

Dictionary mapping annotator ID strings to lists of Label objects for labels on this token.

mention_type

One of the MentionType enumeration values.

custom_entity_type

If entity_type is streamcorpus.EntityType.CUSTOM_TYPE, a string describing what exactly the entity type is. This is useful when a specialized tagger has a large number of unique entity types, such as entity:artefact:weapon:blunt.

class streamcorpus.EntityType

Different tagging tools have different strings for labeling the various common entity types. To avoid ambiguity, we define a canonical list here, which we will surely have to expand over time as new taggers recognize new types of entities.

LOC: physical location

MISC: uncategorized named entities, e.g. Civil War for Stanford CoreNLP

PER = 0
ORG = 1
LOC = 2
TIME = 5
DATE = 6
MONEY = 7
PERCENT = 8
MISC = 9
GPE = 10
FAC = 11
VEH = 12
WEA = 13
phone = 14
email = 15
URL = 16
CUSTOM_TYPE = 17
LIST = 18
RELIGION = 19
NATIONALITY = 20
TITLE = 21
EVENT = 22
class streamcorpus.MentionType
NAME = 0
PRO = 1
NOM = 2
class streamcorpus.Gender
FEMALE = 0
MALE = 1
class streamcorpus.AttributeType

Attributes are based primarily on the TAC KBP slot descriptions (a copy is also saved in the source directory): http://surdeanu.info/kbp2013/TAC_2013_KBP_Slot_Descriptions_1.0.pdf

Only slots that are not resolvable to unique entities are listed here as attributes. Most slots are relations, so see RelationType.

PER_AGE = 0
PER_GENDER = 1
PER_ALTERNATE_NAMES = 3
PER_CAUSE_OF_DEATH = 4
PER_TITLE = 5
PER_CHARGES = 6
ORG_ALTERNATE_NAMES = 7
ORG_NUMBER_OF_EMPLOYEES_MEMBERS = 8
class streamcorpus.RelationType

RelationType is used in Relation to map relation “name” to type.

Relations 0 through 50 borrow from ACE with the string replacements s/-// and s/./_/; see http://projects.ldc.upenn.edu/ace/docs/English-Events-Guidelines_v5.4.3.pdf

Relations 51 and above borrow from KBP slot filling; see http://surdeanu.info/kbp2013/TAC_2013_KBP_Slot_Descriptions_1.0.pdf

Most entity slots are relations, so the PER_, ORG_, and FAC_ relations listed below are primary for slot filling.

Many of the KBP-based slots are redundant or overlapping with the ACE-based slots. The KBP-based slots are generally simpler and were developed to support knowledge base population rather than single-document extraction (as ACE was). Therefore, for KB-focused tasks, we recommend using relations 51 and above.
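The two string replacements applied to the ACE names can be sketched as follows. The ACE-style input labels are written from memory of the ACE guidelines and should be treated as assumptions; the helper name ace_to_enum_name is hypothetical. The replacements are applied to every occurrence, which is what the enumeration names listed below suggest.

```python
def ace_to_enum_name(ace_name):
    """Drop every hyphen, then turn every period into an underscore."""
    return ace_name.replace('-', '').replace('.', '_')


for ace_name in ('PHYS.Located',
                 'PART-WHOLE.Geographical',
                 'ART.User-Owner-Inventor-Manufacturer'):
    print(ace_name, '->', ace_to_enum_name(ace_name))
```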

PHYS_Located = 0
PHYS_Near = 1
PARTWHOLE_Geographical = 2
PARTWHOLE_Subsidiary = 3
PARTWHOLE_Artifact = 4
PERSOC_Business = 5
PERSOC_Family = 6
PERSOC_LastingPersonal = 7
ORGAFF_Employment = 8
ORGAFF_Ownership = 9
ORGAFF_Founder = 10
ORGAFF_StudentAlum = 11
ORGAFF_SportsAffiliation = 12
ORGAFF_InvestorShareholder = 13
ORGAFF_Membership = 14
ART_UserOwnerInventorManufacturer = 15
GENAFF_CitizenResidentReligionEthnicity = 16
GENAFF_OrgLocation = 17
Business_DeclareBankruptcy = 18
Business_EndOrg = 19
Business_MergeOrg = 20
Business_StartOrg = 21
Conflict_Attack = 22
Conflict_Demonstrate = 23
Contact_PhoneWrite = 24
Contact_Meet = 25
Justice_Acquit = 26
Justice_Appeal = 27
Justice_ArrestJail = 28
Justice_ChargeIndict = 29
Justice_Convict = 30
Justice_Execute = 31
Justice_Extradite = 32
Justice_Fine = 33
Justice_Pardon = 34
Justice_ReleaseParole = 35
Justice_Sentence = 36
Justice_Sue = 37
Justice_TrialHearing = 38
Life_BeBorn = 39
Life_Die = 40
Life_Divorce = 41
Life_Injure = 42
Life_Marry = 43
Movement_Transport = 44
Personnel_Elect = 45
Personnel_EndPosition = 46
Personnel_Nominate = 47
Personnel_StartPosition = 48
Transaction_TransferMoney = 49
Transaction_TransferOwnership = 50
PER_DATE_OF_BIRTH = 51
PER_COUNTRY_OF_BIRTH = 52
PER_STATEORPROVINCE_OF_BIRTH = 53
PER_CITY_OF_BIRTH = 54
PER_ORIGIN = 55
PER_DATE_OF_DEATH = 56
PER_COUNTRY_OF_DEATH = 57
PER_STATEORPROVINCE_OF_DEATH = 58
PER_CITY_OF_DEATH = 59
PER_COUNTRIES_OF_RESIDENCE = 60
PER_STATESORPROVINCES_OF_RESIDENCE = 61
PER_CITIES_OF_RESIDENCE = 62
PER_SCHOOLS_ATTENDED = 63
PER_EMPLOYEE_OR_MEMBER_OF = 64
PER_RELIGION = 65
PER_SPOUSE = 66
PER_CHILDREN = 67
PER_PARENTS = 68
PER_SIBLINGS = 69
PER_OTHER_FAMILY = 70
ORG_TOP_MEMBERS_EMPLOYEES = 71
ORG_MEMBERS = 72
ORG_MEMBER_OF = 73
ORG_SUBSIDIARIES = 74
ORG_PARENTS = 75
ORG_FOUNDED_BY = 76
ORG_DATE_FOUNDED = 77
ORG_DATE_DISSOLVED = 78
ORG_COUNTRY_OF_HEADQUARTERS = 79
ORG_STATEORPROVINCE_OF_HEADQUARTERS = 80
ORG_CITY_OF_HEADQUARTERS = 81
ORG_SHAREHOLDERS = 82
ORG_POLITICAL_OR_RELIGIOUS_AFFILIATION = 83
ORG_WEBSITE = 84
FAC_LOCATED = 85
FAC_VISITED_BY = 86
FAC_OWNER = 87
PER_WON_AWARD = 88
PER_MET_WITH = 89
PER_ATTENDED = 90
PER_VISITED = 91
ORG_ATTENDED = 92
ORG_VISITED = 93
PER_WEBSITE = 94
PER_NATIONALITY = 95

2.2. Storage

There are a number of helpers to assist with binary serialization and deserialization.

class streamcorpus.Chunk(*args, **kwargs)

Chunk, the default Chunk, is a Thrift Chunk. See also PickleChunk, JsonChunk, and CborChunk.

write_msg_impl(msg)

Add a message instance to the chunk.

flush()
close()
read_msg_impl()

Iterator over messages in the chunk.

streamcorpus.decrypt_and_uncompress(data, gpg_private=None, tmp_dir=None, compression='xz', detect_compression=True)

Given a data buffer of bytes, decrypt it using gnupg if gpg_private is provided, then uncompress it using the compression scheme, which defaults to “xz” and can also be “gz”, “sz”, or “”.

Returns: a tuple of (logs, data), where logs is an array of strings and data is a binary string

streamcorpus.compress_and_encrypt(data, gpg_public=None, gpg_recipient='trec-kba', tmp_dir=None, compression='xz')

Given a data buffer of bytes, compress it using the compression scheme; if gpg_public is provided, also encrypt the data using gnupg. Compression can be “xz”, “sz”, “gz”, or “”.
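To illustrate how the compression argument selects a scheme, here is a sketch of the compression step alone, using only standard-library codecs. It is not the streamcorpus implementation: the function names compress and uncompress are hypothetical, encryption is left out entirely, and “sz” is not handled because it would need a third-party codec.

```python
import gzip
import lzma


def compress(data, compression='xz'):
    """Compress bytes with the named scheme; '' means no compression."""
    if compression == 'xz':
        return lzma.compress(data)
    if compression == 'gz':
        return gzip.compress(data)
    if compression == '':
        return data
    raise ValueError('unsupported compression: %r' % compression)


def uncompress(data, compression='xz'):
    """Reverse compress() for the same scheme."""
    if compression == 'xz':
        return lzma.decompress(data)
    if compression == 'gz':
        return gzip.decompress(data)
    if compression == '':
        return data
    raise ValueError('unsupported compression: %r' % compression)


payload = b'serialized stream items ' * 100
for scheme in ('xz', 'gz', ''):
    # Each scheme must round-trip the payload unchanged.
    assert uncompress(compress(payload, scheme), scheme) == payload
```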

streamcorpus.compress_and_encrypt_path(path, gpg_public=None, gpg_recipient='trec-kba', compression='xz', tmp_dir='/tmp')

Given a path in the local file system, compress the file using compression, which defaults to “xz”; if gpg_public is provided, also encrypt the data using gnupg.

Parameters: compression – can be either “xz”, “sz”, or “”
Returns: path to a new file containing the encrypted, compressed data
Return type: str

streamcorpus.serialize(msg)

Generate a serialized binary blob for a single message.

streamcorpus.deserialize(blob, message=<class 'streamcorpus.ttypes.StreamItem'>)

Generate a message from a serialized binary blob for a single message.

class streamcorpus.VersionMismatchError

Bases: exceptions.Exception

The version of a stream item is not what was expected.

streamcorpus.get_date_hour(stream_thing)

Returns a date_hour string in the format ‘2000-01-01-12’.

Parameters: stream_thing – a StreamTime or StreamItem object
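The date_hour format can be reproduced from a Unix time with the standard library, as in this sketch. It works on a plain number, not on a StreamTime or StreamItem; the function name is hypothetical, and the use of UTC is an assumption.

```python
from datetime import datetime, timezone


def date_hour_from_epoch_ticks(epoch_ticks):
    """Format Unix seconds as a UTC 'YYYY-MM-DD-HH' string."""
    moment = datetime.fromtimestamp(epoch_ticks, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d-%H')


# 946730040 is 2000-01-01T12:34:00Z.
print(date_hour_from_epoch_ticks(946730040))
```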

streamcorpus.make_stream_time(zulu_timestamp=None, epoch_ticks=None)

Creates a StreamTime object from either a string or a Unix-time number. A string should be formatted like ‘2000-01-01T12:34:00.000123Z’. zulu_timestamp can be either a string or a number; epoch_ticks must be an int, long, or float. The type of zulu_timestamp is detected so that it can be passed through from the sole zulu_timestamp parameter of make_stream_item() below.

streamcorpus.make_stream_item(zulu_timestamp, abs_url, version=1)

Assemble a minimal StreamItem with internally consistent .stream_time.zulu_timestamp, .stream_time.epoch_ticks, .abs_url, .doc_id, and .stream_id.

zulu_timestamp may be either a Unix-time number or a string like ‘2000-01-01T12:34:00.000123Z’.

streamcorpus.add_annotation(data_item, *annotations)

Adds each item in annotations to data_item.labels or .ratings.

streamcorpus.get_entity_type(tok)

Returns the string name of the EntityType on this token, or None if it is not set. If Token.entity_type == CUSTOM_TYPE, then this returns the Token.custom_entity_type string instead of the name from streamcorpus.EntityType._VALUES_TO_NAMES.
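The lookup rule described for get_entity_type can be sketched with plain Python objects. This is an illustration of the described behavior, not the library code: the Token namedtuple and the function name entity_type_name are hypothetical, and the name table copies only a few values from the EntityType listing above.

```python
from collections import namedtuple

# A few EntityType values copied from the enumeration listing.
ENTITY_NAMES = {0: 'PER', 1: 'ORG', 2: 'LOC', 17: 'CUSTOM_TYPE'}
CUSTOM_TYPE = 17

# Hypothetical stand-in for the two Token fields involved.
Token = namedtuple('Token', ['entity_type', 'custom_entity_type'])


def entity_type_name(tok):
    """Return the entity type name, preferring the custom string."""
    if tok.entity_type is None:
        return None
    if tok.entity_type == CUSTOM_TYPE:
        return tok.custom_entity_type
    return ENTITY_NAMES.get(tok.entity_type)


print(entity_type_name(Token(0, None)))
print(entity_type_name(Token(17, 'entity:artefact:weapon:blunt')))
print(entity_type_name(Token(None, None)))
```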

2.3. streamcorpus_dump tool

streamcorpus_dump prints information on a streamcorpus.Chunk file. Basic usage is:

streamcorpus_dump --show-all input.sc