StrepHit
StrepHit is an intelligent reading agent that understands text and translates it into Wikidata statements.
More specifically, it is a Natural Language Processing pipeline that extracts facts from text and produces Wikidata statements with references. Its final objective is to enhance the data quality of Wikidata by suggesting references to validate statements.
StrepHit was born in January 2016 and is funded by a Wikimedia Foundation Individual Engagement Grant (IEG).
This page contains the technical documentation.
Source Code
The whole codebase can be found on GitHub: https://github.com/Wikidata/StrepHit
Features
- Web spiders to collect a biographical corpus from a list of reliable sources
- Corpus analysis to understand the most meaningful verbs
- Extraction of sentences and semi-structured data from a corpus
- Training of an automatic classifier through crowdsourcing
- Extraction of facts from text in two ways: with the supervised classifier trained on the crowdsourced annotations, or with the rule-based classifier
- Several utilities, ranging from NLP tasks like tokenization and part-of-speech tagging, to facilities for parallel processing, caching and logging
Pipeline
- Corpus Harvesting
- Corpus Analysis
- Sentence Extraction
- N-ary Relation Extraction
- Dataset Serialization
strephit.annotation package
strephit.annotation.create_crowdflower_input module
strephit.annotation.create_crowdflower_input.prepare_crowdflower_input(sentences, frame_data, filter_places)
strephit.annotation.create_crowdflower_input.write_input_spreadsheet(data_units, outfile)
strephit.annotation.generate_cml module
strephit.annotation.generate_cml.generate_crowdflower_interface_template(input_csv, output_html)
- Generate the CrowdFlower interface template based on the input data spreadsheet
- Parameters:
- input_csv (file) -- CSV file with the input data
- output_html (file) -- File in which to write the output
- Returns:
- 0 on success
strephit.annotation.parse_results module
strephit.annotation.parse_results.process_unit(unit_id, sentences)
strephit.annotation.post_job module
strephit.annotation.post_job.activate_gold(job_id)
- Activate gold units in the given job.
- Corresponds to the 'Convert Uploaded Test Questions' UI button.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- Returns:
- True on success
- Return type:
- boolean
strephit.annotation.post_job.config_job(job_id)
- Set up a given CrowdFlower job with default settings.
- See JOB_SETTINGS
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- Returns:
- the uploaded job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message
- Return type:
- dict
strephit.annotation.post_job.create_job(title, instructions, cml, custom_js)
- Create an empty CrowdFlower job with the specified title and instructions.
- Raise any HTTP error that may occur.
- Parameters:
- title (str) -- plain text title
- instructions (str) -- instructions, can contain HTML
- cml (str) -- worker interface CML template. See https://success.crowdflower.com/hc/en-us/articles/202817989-CML-CrowdFlower-Markup-Language-Overview
- custom_js (str) -- JavaScript code to be injected into the job
- Returns:
- the created job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message
- Return type:
- dict
strephit.annotation.post_job.tag_job(job_id, tags)
- Tag a given job.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- tags (list) -- list of tags
- Returns:
- True on success
- Return type:
- boolean
strephit.annotation.post_job.upload_units(job_id, csv_data)
- Upload the job data units to the given job.
- Raises any HTTP error that may occur.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- csv_data (file) -- file handle pointing to the data units CSV
- Returns:
- the uploaded job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message
- Return type:
- dict
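- Example workflow (an illustrative sketch, not part of the codebase): the functions above can be chained to post a complete annotation job. The file names are placeholders, and the assumption that the new job ID is exposed under the 'id' key of the response object should be checked against the CrowdFlower API documentation linked above.
from strephit.annotation import post_job

job = post_job.create_job('StrepHit frame annotation',
                          '<p>Read the sentence and answer the questions</p>',
                          open('interface.cml').read(),   # CML worker interface
                          open('interface.js').read())    # custom JavaScript
job_id = job['id']  # assumption: the response object exposes the new job ID under 'id'

post_job.config_job(job_id)                       # apply the default JOB_SETTINGS
with open('crowdflower_input.csv') as csv_data:   # spreadsheet built by create_crowdflower_input
    post_job.upload_units(job_id, csv_data)       # upload the data units
post_job.activate_gold(job_id)                    # 'Convert Uploaded Test Questions'
post_job.tag_job(job_id, ['strephit', 'frame-annotation'])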
strephit.annotation.pull_results module
strephit.annotation.pull_results.download_full_report(job_id)
- Download the full CSV report of the given job.
- See https://success.crowdflower.com/hc/en-us/articles/202703075-Guide-to-Reports-Page-and-Settings-Page#full_report
- Raises any HTTP error that may occur.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
strephit.annotation.pull_results.get_latest_job_id()
- Get the ID of the most recent job.
- Returns:
- the latest job ID
- Return type:
- str
strephit.classification package
strephit.classification.classify module
class strephit.classification.classify.SentenceClassifier(model, extractor, language, gazetteer)
- Supervised Sentence classifier
classify_sentences(sentences)
- Classify the given sentences
- Parameters:
- sentences (list) -- sentences to be classified. Each one should be a dict with a *text*, a source *url* and some *linked_entities*
- Returns:
- Classified sentences with the recognized *fes*
- Return type:
- generator of dicts
strephit.classification.feature_extractors module
class strephit.classification.feature_extractors.BaseFeatureExtractor
- Feature extractor template. Processes sentences one by one, accumulating their features, and finalizes them into the final training set.
- It should be used to extract features prior to classification, in which case the *fes* argument can be used to group tokens of the same entity into a single chunk while ignoring the actual frame element name, e.g. *fes = dict(enumerate(entities))*
get_features()
- Returns the final training set
- Returns:
- A matrix whose rows are samples and columns are features and a column vector with the sample label (i.e. the correct answer for the classifier)
- Return type:
- tuple
process_sentence(sentence, fes, add_unknown, gazetteer)
- Extracts and accumulates features for the given sentence
- Parameters:
- sentence (unicode) -- Text of the sentence
- fes (dict) -- Dictionary with FEs and corresponding chunks
- add_unknown (bool) -- Whether unknown tokens should be added to the index or treated as a special, unknown token. Set to True when building the training set and to False when building the features used to classify new sentences
- gazetteer (dict) -- Additional features to add when a given chunk is found in the sentence. Keys should be chunks and values should be list of features
- Returns:
- Nothing
start()
- Clears the features accumulated so far and starts over.
class strephit.classification.feature_extractors.FactExtractorFeatureExtractor(language, window_width=2)
- Bases: "strephit.classification.feature_extractors.BaseFeatureExtractor"
- Feature extractor inspired by the fact-extractor
extract_features(sentence, fes, add_unknown, gazetteer)
- Extracts the features for each token of the sentence
- Parameters:
- sentence (unicode) -- Text of the sentence
- fes (dict) -- mapping FE -> chunk
- gazetteer (dict) -- mapping chunk -> additional features
- Returns:
- List of features, each one as a sparse row (i.e. with the indexes of the relevant columns)
feature_for(term, type_, position, add_unknown)
- Returns the feature for the given token, i.e. the column of the feature in a sparse matrix
- Parameters:
- term (str) -- Actual term
- type (str) -- Type of the term, for example token, pos or lemma
- position (int) -- Relative position (used for context windows)
- add_unknown (bool) -- Whether to add previously unseen terms to the dictionary or use the UNK token instead
- Returns:
- Column of the corresponding feature
get_features()
process_sentence(sentence, fes, add_unknown, gazetteer)
sentence_to_tokens(sentence, fes)
- Transforms a sentence into a list of tokens
- Parameters:
- sentence (unicode) -- Text of the sentence
- fes (dict) -- mapping FE -> chunk
- Returns:
- List of tokens
start()
token_to_features(tokens, position, add_unknown, gazetteer)
- Extracts the features for the token in the given position
- Parameters:
- tokens (list) -- POS-tagged tokens of the sentence
- position (int) -- position of the token for which features are requested
- gazetteer (dict) -- mapping chunk -> additional features
- Returns:
- sparse set of features (i.e. numbers are indexes in a row of a sparse matrix)
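- Sample usage (a minimal sketch, not taken from the codebase): the extractor accumulates sentences one by one and is then finalized into a training set. The sentence, frame elements and empty gazetteer below are invented placeholders.
from strephit.classification.feature_extractors import FactExtractorFeatureExtractor

extractor = FactExtractorFeatureExtractor('en')
gazetteer = {}  # no extra chunk-level features

# accumulate features; fes maps each frame element to the chunk realizing it
extractor.process_sentence(u'Johann Sebastian Bach was born in Eisenach',
                           {'Child': u'Johann Sebastian Bach', 'Place': u'Eisenach'},
                           add_unknown=True, gazetteer=gazetteer)

# finalize into a sample/feature matrix and the corresponding label vector
features, labels = extractor.get_features()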
class strephit.classification.feature_extractors.SortedSet
- Very simple sorted unique collection which remembers the order of insertion of its items
index(item)
put(item)
reverse_map()
strephit.classification.train module
strephit.commons package
strephit.commons.cache module
strephit.commons.cache.cached(function)
- Decorator to cache function results based on its arguments
- Sample usage:
>>> from strephit.commons import cache
>>> @cache.cached
... def f(x):
...     print 'inside f'
...     return 2 * x
...
>>> f(10)
inside f
20
>>> f(10)
20
strephit.commons.cache.get(key, default=None)
- Retrieves an item from the cache
- Parameters:
- key -- Key of the item
- default -- Default value to return if the key is not in the cache
- Returns:
- The item associated with the given key or the default value
- Sample usage:
>>> from strephit.commons import cache
>>> cache.get('kk', 13)
13
>>> cache.get('kk', 0)
0
>>> cache.set('kk', 15)
>>> cache.get('kk', 0)
15
strephit.commons.cache.set(key, value, overwrite=True)
- Stores an item in the cache under the given key
- Parameters:
- key -- Unique key used to identify the item.
- value -- Value to store in the cache. Must be JSON-dumpable
- overwrite -- Whether to overwrite the previous value associated with the key (if any)
- Returns:
- Nothing
- Sample usage:
>>> from strephit.commons import cache
>>> cache.get('kk', 13)
13
>>> cache.get('kk', 0)
0
>>> cache.set('kk', 15)
>>> cache.get('kk', 0)
15
strephit.commons.classification module
strephit.commons.classification.apply_custom_classification_rules(classified, language, overwrite=False)
- Implements simple custom, classifier-agnostic rules for recognizing some frame elements
- Parameters:
- classified (dict) -- an item produced by the classifier
- language (str) -- Language of the sentence
- overwrite (bool) -- Whether the rules take priority when they assign a role to a chunk already recognized by the classifier
- Returns:
- The same item with augmented FEs
strephit.commons.classification.reverse_gazetteer(gazetteer)
- Reverses the gazetteer from feature -> chunks to chunk -> features
- Parameters:
- gazetteer (dict) -- Gazetteer associating features to chunks
- Returns:
- An equivalent gazetteer associating chunks to features
- Return type:
- dict
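- Sample usage (illustrative only; the feature and chunk names are invented):
from strephit.commons.classification import reverse_gazetteer

gazetteer = {'honorific': ['sir', 'dr.'], 'birth-marker': ['born in']}
reversed_gazetteer = reverse_gazetteer(gazetteer)
# expected shape: {'sir': ['honorific'], 'dr.': ['honorific'], 'born in': ['birth-marker']}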
strephit.commons.date_normalizer module
class strephit.commons.date_normalizer.DateNormalizer(language=None, specs=None)
- Bases: "object"
- Finds matches in text strings using regular expressions and transforms them according to a pattern transformation expression evaluated on the match.
- The specifications are given in YAML format and allow defining meta functions and meta variables as well as the pattern and transformation rules themselves.
- Meta variables are placed inside the patterns that use them, to make writing patterns easier; they are also available from inside the meta functions as a dictionary named meta_vars.
- A pattern transformation expression is an expression that is evaluated if the corresponding regular expression matches. It has access to all the defined meta functions and meta variables, and to a variable named 'match' containing the regex match found.
normalize_many(expression)
- Find all the matching entities in the given expression
- Parameters:
- expression (str) -- The expression in which to look for
- Returns:
- Generator of tuples (start, end), category, result
- Sample usage:
>>> from pprint import pprint
>>> from strephit.commons.date_normalizer import DateNormalizer
>>> pprint(list(DateNormalizer('en').normalize_many('I was born on April 18th, '
...                                                 'and today is April 18th, 2016!')))
[((14, 24), 'Time', {'day': 18, 'month': 4}),
 ((39, 55), 'Time', {'day': 18, 'month': 4, 'year': 2016})]
normalize_one(expression, conflict='longest')
- Find the matching part in the given expression
- Parameters:
- expression (str) -- The expression in which to search the match
- conflict (str) -- Whether to return the first match found, or to scan through all the provided regular expressions and return the longest or shortest part of the string matched. Note that the match will always be the first one found in the string; this parameter tells how to resolve conflicts when more than one regular expression returns a match. When several matches have the same length, the first one found counts. Allowed values are *first*, *longest* and *shortest*
- Returns:
- Tuple with (start, end), category, result
- Return type:
- tuple
- Sample usage:
>>> from strephit.commons.date_normalizer import DateNormalizer
>>> DateNormalizer('en').normalize_one('Today is the 1st of June, 2016')
((13, 30), 'Time', {'month': 6, 'day': 1, 'year': 2016})
strephit.commons.date_normalizer.normalize_numerical_fes(language, text)
- Normalize numerical FEs in a sentence
strephit.commons.datetime module
strephit.commons.datetime.parse(string)
- Try to parse a date expressed in natural language.
- Parameters:
- string (str) -- Date in natural language
- Returns:
- dictionary with year, month, day
- Type:
- dict
strephit.commons.entity_linking module
strephit.commons.entity_linking.extract_entities(response_json)
- Extract the list of entities from the Dandelion Entity Extraction API JSON response.
- Parameters:
- response_json (dict) -- JSON response returned by Dandelion
- Returns:
- The extracted entities, with the surface form, start and end indices, URI, and ontology types
- Return type:
- list
strephit.commons.io module
strephit.commons.io.dump_corpus(corpus, dump_file_handle)
- Dump a loaded corpus to a file with one JSON object per line.
strephit.commons.io.get_and_cache(url, use_cache=True, **kwargs)
- Perform an HTTP GET request to the given URL and optionally cache the result in the file system. The cached content will be used for subsequent requests.
- Raises all HTTP errors
- Parameters:
- url -- URL of the page to retrieve
- use_cache -- Whether to use the cache
- **kwargs -- keyword arguments to pass to *requests.get*
- Returns:
- The content of the page at the given URL, as unicode
strephit.commons.io.load_corpus(location, document_key, text_only=False)
- Load an input corpus from a directory with scraped items, in a memory-efficient way.
- Each input file must contain one JSON object per line.
- Parameters:
- document_key (str) -- a scraped item dictionary key holding textual documents
strephit.commons.io.load_dumped_corpus(dump_file_handle, document_key, text_only=False)
- Load a previously dumped corpus file, in a memory-efficient way.
strephit.commons.io.load_scraped_items(location)
- Loads all the items from a directory or file.
- Parameters:
- location -- Where the corpus is. If it is a directory, all files with extension jsonlines will be loaded. If it is a file, it can be either a jsonlines or a tar-compressed file.
strephit.commons.logging module
strephit.commons.logging.log_request_data(http_response, logger)
- Send a debug log message with basic information of the HTTP request that was sent for the given HTTP response.
- Parameters:
- http_response (requests.models.Response) -- HTTP response object
strephit.commons.logging.setLogLevel(module, level)
- Sets the log level used to log messages from the given module
strephit.commons.logging.setup()
strephit.commons.parallel module
strephit.commons.parallel.execute(processes=0, *specs)
- Execute the given functions in parallel
- Parameters:
- processes -- Number of functions to execute at the same time
- specs -- a sequence of functions, each followed by its arguments (arguments as a tuple or list)
- Returns:
- the results that the functions returned, in the same order as they were specified
- Return type:
- list
- Sample usage:
>>> from strephit.commons import parallel
>>> list(parallel.execute(4,
...     lambda x, y: x + y, (5, -5),
...     lambda *x: sum(x), range(5)
... ))
[0, 10]
strephit.commons.parallel.make_batches(iterable, size)
strephit.commons.parallel.map(function, iterable, processes=0, flatten=False, raise_exc=True, batch_size=0)
- Applies the given function to each element of the iterable in parallel.
- *None* values are not allowed in the iterable nor as return values; they will simply be discarded. Can be "safely" stopped with a keyboard interrupt.
- Parameters:
- function -- the function used to transform the elements of the iterable
- processes -- how many items to process in parallel. Use zero or a negative number to use all the available processors. No additional processes will be used if the value is 1.
- flatten -- If the mapping function returns an iterable, flatten the resulting iterables into a single one.
- raise_exc -- Only when *processes* equals 1, controls whether to propagate the exceptions raised by the mapping function to the caller or simply to log them and carry on with the computation. When *processes* is different from 1 this parameter is not used.
- batch_size -- If larger than 0, the input iterable will be grouped in groups of this size and the resulting list passed as argument to the worker.
- Returns:
- iterable with the results. Order is not guaranteed to be preserved
- Sample usage:
>>> from strephit.commons import parallel
>>> list(parallel.map(lambda x: 2*x, range(10)))
[0, 8, 10, 12, 14, 16, 18, 2, 4, 6]
strephit.commons.pos_tag module
class strephit.commons.pos_tag.NLTKPosTagger(language)
- Bases: "object"
- part-of-speech tagger implemented using the NLTK library
tag_many(documents, tagset=None, **kwargs)
- POS-Tag many documents.
tag_one(text, tagset, **kwargs)
- POS-Tags the given text
class strephit.commons.pos_tag.TTPosTagger(language, tt_home=None, **kwargs)
- Bases: "object"
- part-of-speech tagger implemented using TreeTagger and treetaggerwrapper
tag_many(items, document_key, pos_tag_key, batch_size=10000, **kwargs)
- POS-Tags many text documents of the given items. Use this for massive text tagging
- Parameters:
- items -- Iterable of items to tag. Generator preferred
- document_key -- Where to find the text to tag inside each item. Text must be unicode
- pos_tag_key -- Where to put pos tagged text
- Sample usage:
>>> from strephit.commons.pos_tag import TTPosTagger
>>> from pprint import pprint
>>> pprint(list(TTPosTagger('en').tag_many(
...     [{'text': 'Item one is in first position'}, {'text': 'In the second position is item two'}],
...     'text', 'tagged'
... )))
[{'tagged': [Tag(word='Item', pos='NN', lemma='item'),
             Tag(word='one', pos='CD', lemma='one'),
             Tag(word='is', pos='VBZ', lemma='be'),
             Tag(word='in', pos='IN', lemma='in'),
             Tag(word='first', pos='JJ', lemma='first'),
             Tag(word='position', pos='NN', lemma='position')],
  'text': 'Item one is in first position'},
 {'tagged': [Tag(word='In', pos='IN', lemma='in'),
             Tag(word='the', pos='DT', lemma='the'),
             Tag(word='second', pos='JJ', lemma='second'),
             Tag(word='position', pos='NN', lemma='position'),
             Tag(word='is', pos='VBZ', lemma='be'),
             Tag(word='item', pos='RB', lemma='item'),
             Tag(word='two', pos='CD', lemma='two')],
  'text': 'In the second position is item two'}]
tag_one(text, skip_unknown=True, **kwargs)
- POS-Tags the given text, optionally skipping unknown lemmas
- Parameters:
- text (unicode) -- Text to be tagged
- skip_unknown (bool) -- Automatically remove unrecognized tags from the result
- Sample usage:
>>> from strephit.commons.pos_tag import TTPosTagger
>>> from pprint import pprint
>>> pprint(TTPosTagger('en').tag_one('sample sentence to be tagged fycgvkuhbj'))
[Tag(word='sample', pos='NN', lemma='sample'),
 Tag(word='sentence', pos='NN', lemma='sentence'),
 Tag(word='to', pos='TO', lemma='to'),
 Tag(word='be', pos='VB', lemma='be'),
 Tag(word='tagged', pos='VVN', lemma='tag')]
tokenize(text)
- Splits a text into tokens
strephit.commons.pos_tag.get_pos_tagger(language, **kwargs)
- Returns an initialized instance of the preferred POS tagger for the given language
strephit.commons.scoring module
strephit.commons.scoring.compute_score(sentence, score, core_fes_weight)
- Computes the confidence score for a sentence based on FE scores
- Parameters:
- sentence (dict) -- Data of the sentence, containing FEs
- score (str) -- Type of score: arithmetic-mean, weighted-mean, f-score
- core_fes_weight (float) -- Weight of core FEs with respect to extra FEs
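- To illustrate the weighted-mean variant, the sketch below averages FE scores while giving core FEs the weight *core_fes_weight* and extra FEs a weight of 1. This is a simplified illustration, not the actual implementation; the FE scores and core FE set are invented.
def weighted_mean_confidence(fe_scores, core_fes, core_fes_weight):
    # fe_scores: mapping FE name -> score; core_fes: set of core FE names (placeholders)
    total, weights = 0.0, 0.0
    for fe, score in fe_scores.items():
        weight = core_fes_weight if fe in core_fes else 1.0
        total += weight * score
        weights += weight
    return total / weights if weights else 0.0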
strephit.commons.serialize module
class strephit.commons.serialize.ClassificationSerializer(language, frame_data, url_to_wid=None)
get_subjects(data)
- Finds all subjects of the frame assigned to the sentence
- Parameters:
- data (dict) -- classification results
- Returns:
- all subjects as tuples (chunk, wikidata id)
- Return type:
- generator of tuples
static map_fe_to_wid(frame_data)
serialize_numerical(subj, fe, url)
- Serializes a numerical FE found by the normalizer
to_statements(data, input_encoded=True)
- Converts the classification results into quick statements
- Parameters:
- data -- Data from the classifier. Can be either str or dict
- input_encoded (bool) -- Whether data is a str or a dict
- Returns:
- Tuples <success, item> where item is a statement if success is true, otherwise a named entity which could not be resolved
- Type:
- generator
strephit.commons.serialize.map_url_to_wid(semistructured)
- Read the quick statements generated from the semi-structured data and build a map associating URL to Wikidata ID
strephit.commons.split_sentences module
class strephit.commons.split_sentences.PunktSentenceSplitter(language)
- Bases: "object"
- Sentence splitting splits a natural language text into sentences
model_path = 'tokenizers/punkt/%s.pickle'
split(text)
- Split the given text into sentences.
- Leading and trailing spaces are stripped.
- Newline characters are first interpreted as sentence boundaries.
- Then, the sentence splitter is run.
- Parameters:
- text (str) -- Text to be split
- Returns:
- the sentences in the text
- Return type:
- generator
- Sample usage:
>>> from strephit.commons.split_sentences import PunktSentenceSplitter
>>> list(PunktSentenceSplitter('en').split(
...     "This is the first sentence. Mr. period doesn't always delimit sentences"
... ))
['This is the first sentence.', "Mr. period doesn't always delimit sentences"]
split_tokens(tokens)
- Splits the given text into sentences.
- Parameters:
- tokens (list) -- the tokens of the text
- Returns:
- the sentences in the text
- Return type:
- generator
- Sample usage:
>>> from strephit.commons.split_sentences import PunktSentenceSplitter
>>> list(PunktSentenceSplitter('en').split_tokens(
...     "This is the first sentence. Mr. period doesn't always delimit sentences".split()
... ))
[['This', 'is', 'the', 'first', 'sentence.'], ['Mr.', 'period', "doesn't", 'always', 'delimit', 'sentences']]
supported_models = {'el': 'tokenizers/punkt/greek.pickle', 'fr': 'tokenizers/punkt/french.pickle', 'en': 'tokenizers/punkt/english.pickle', 'nl': 'tokenizers/punkt/dutch.pickle', 'pt': 'tokenizers/punkt/portuguese.pickle', 'no': 'tokenizers/punkt/norwegian.pickle', 'sv': 'tokenizers/punkt/swedish.pickle', 'de': 'tokenizers/punkt/german.pickle', 'tr': 'tokenizers/punkt/turkish.pickle', 'it': 'tokenizers/punkt/italian.pickle', 'da': 'tokenizers/punkt/danish.pickle', 'cz': 'tokenizers/punkt/czech.pickle', 'es': 'tokenizers/punkt/spanish.pickle', 'fi': 'tokenizers/punkt/finnish.pickle', 'et': 'tokenizers/punkt/estonian.pickle', 'sl': 'tokenizers/punkt/slovene.pickle', 'pl': 'tokenizers/punkt/polish.pickle'}
strephit.commons.stopwords module
class strephit.commons.stopwords.StopWords
- Bases: "object"
- This module retrieves stop words for a given language
classmethod words(language)
- Returns a list of stop words for a specified language
- Parameters:
- language (str) -- the language whose stop words are required
- Returns:
- Stop words if the language is supported, else an empty list
- Return type:
- list
strephit.commons.text module
strephit.commons.text.clean(s, unicode=True)
strephit.commons.text.clean_extract(sel, path, path_type='xpath', limit_from=None, limit_to=None, sep='\n', unicode=True)
strephit.commons.text.extract_dict(response, keys_selector, values_selector, keys_extractor='.//text()', values_extractor='.//text()', **kwargs)
- Extracts a dictionary given the selectors for the keys and the values.
- The selectors should point to the elements containing the text and not the text itself.
- Parameters:
- response -- The response object. Its xpath or css methods are used
- keys_selector -- Selector pointing to the elements containing the keys, starting with the type *xpath:* or *css:* followed by the selector itself
- values_selector -- Selector pointing to the elements containing the values, starting with the type *xpath:* or *css:* followed by the selector itself
- keys_extractor -- Selector used to actually extract the value of the key from each key element. xpath only
- values_extractor -- Selector used to extract the actual value from each value element. xpath only
- **kwargs -- Other parameters to pass to *clean_extract*. Nothing good will come by passing *path_type='css'*, you have been warned.
strephit.commons.text.fix_name(name)
- Tries to normalize a name so that it can be searched with the Wikidata APIs
- Parameters:
- name -- The name to normalize
- Returns:
- a tuple with the normalized name and a list of honorifics
strephit.commons.text.parse_birth_death(string)
- Parses birth and death dates from a string.
- Parameters:
- string -- String with the dates. Can be 'd. <year>' to indicate the year of death, 'b. <year>' to indicate the year of birth, <year>-<year> to indicate both birth and death year. Can optionally include 'c.' or 'ca.' before years to indicate approximation (ignored by the return value). If only the century is specified, birth is the first year of the century and death is the last one, e.g. '19th century' will be parsed as *('1801', '1900')*
- Returns:
- tuple *(birth_year, death_year)*, both strings as appearing in the original string. If the string cannot be parsed *(None, None)* is returned.
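- Sample usage (a sketch: the first call is documented above, while the expected outputs of the others are reasonable assumptions, not guaranteed):
from strephit.commons.text import parse_birth_death

parse_birth_death('19th century')   # ('1801', '1900'), as documented above
parse_birth_death('1859-1930')      # expected: ('1859', '1930')
parse_birth_death('b. ca. 1820')    # expected: ('1820', None); the approximation marker is ignored
parse_birth_death('no dates here')  # expected: (None, None)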
strephit.commons.text.split_at(content, delimiters)
- Splits content using given delimiters following their order, for example
>>> [x for x in split_at(range(11), range(3,10,3))]
[(None, [1, 2]), (3, [4, 5]), (6, [7, 8]), (None, [9, 10])]
strephit.commons.text.strip_honorifics(name)
- Removes honorifics from the name
- Parameters:
- name -- The name
- Returns:
- a tuple with the name without honorifics and a list of honorifics
strephit.commons.tokenize module
class strephit.commons.tokenize.Tokenizer(language)
- Tokenization splits a natural language utterance into words (tokens)
tokenization_regexps = {'en': '[^\\p{L}\\p{N}]+', 'it': '[^\\p{L}\\p{N}]+'}
tokenize(sentence)
- Tokenize the given sentence.
- You can also pass a generic text, but you will lose the sentence segmentation.
- Parameters:
- sentence (str) -- a natural language sentence or text to be tokenized
- Returns:
- the list of tokens
- Return type:
- list
strephit.commons.wikidata module
strephit.commons.wikidata.call_api(action, cache=True, **kwargs)
- Invoke the given method of wikidata APIs with the given parameters
strephit.commons.wikidata.finalize_statement(subject, property, value, language, url=None, resolve_property=True, resolve_value=True, **kwargs)
- Given the components of a statement, convert it into a quick statement.
- Parameters:
- subject -- Subject of the statement (its Wikidata ID)
- property -- Property of the statement
- value -- Value of the statement (to be resolved)
- language -- Language used to resolve the value
- url -- Source of the statement (corresponds to S854)
- resolve_property -- Whether *property* is already a Wikidata ID or needs to be resolved
- resolve_value -- Whether *value* can be inserted into the statement as-is or needs to be resolved
- kwargs -- additional information used to resolve *value*
strephit.commons.wikidata.format_date(year=None, month=None, day=None)
- Formats a date according to Wikidata syntax. Assumes that the date is mostly correct. The allowed values of the parameters are shown in the following truth table:
  year  month  day  ok
   1      1     1    1
   1      1     0    1
   1      0     1    0
   1      0     0    1
   0      1     1    1
   0      1     0    0
   0      0     1    0
   0      0     0    0
- Parameters:
- year -- year of the date
- month -- month of the date. Only positive values allowed
- day -- day of the date. Only positive values allowed
strephit.commons.wikidata.get_entities(ids, batch)
- Retrieve Wikidata entities metadata.
- Parameters:
- ids (list) -- list of Wikidata entity IDs
- batch (int) -- number of IDs per call, to serve as paging for the API.
- Returns:
- dict of Wikidata entities with metadata
- Return type:
- dict
strephit.commons.wikidata.get_labels_and_aliases(entities, language_code)
- Extract language-specific label and aliases from a list of Wikidata entities metadata.
- Parameters:
- entities (list) -- list of Wikidata entities with metadata.
- language_code (str) -- 2-letter language code, e.g., *en* for English
- Returns:
- dict of entities, with label and aliases only
- Return type:
- dict
strephit.commons.wikidata.get_property_ids(batch)
- Get the full list of Wikidata property IDs (pids).
- Parameters:
- batch (int) -- number of pids per call, to serve as paging for the API.
- Returns:
- list of all pids
- Return type:
- list
strephit.commons.wikidata.honorifics_resolver(property, value, language, **kwargs)
- Resolves honorifics such as "mr.", "dr." etc
strephit.commons.wikidata.identity_resolver(property, value, language, **kwargs)
- Default resolver, converts to unicode and surrounds with double quotes
strephit.commons.wikidata.parse_date(date, precision=None)
- Tries to parse a date serialized according to the Wikidata format into its components year, month and day
- Returns:
- dict (year, month, day)
strephit.commons.wikidata.resolve(property, value, language, **kwargs)
- Tries to resolve the Wikidata ID of an object given its string representation
- Parameters:
- property -- Wikidata ID of the property to resolve
- value -- String value
- language -- Search only this language
- kwargs -- Additional info that might be useful to help the resolver
strephit.commons.wikidata.resolver(*properties)
- Decorator to register a function as resolver for the given properties.
strephit.commons.wikidata.resolver_with_hints(property, value, language, **kwargs)
- Resolves people names. Works better if generic biographic information, such as birth/death dates, is provided.
- Parameters:
- kwargs -- dictionary of wikidata property -> list of values
strephit.commons.wikidata.search(term, language, type_=None, label_exact=True, limit='15')
- Uses the Wikidata APIs to search for a term. Can optionally specify a type (corresponding to the 'instance of' P31 Wikidata property). If no type is specified, simply returns all the items containing *term* in *label*
- Parameters:
- term (str) -- The term to look for
- language (str) -- Search in this language
- type (iterable) -- Type of the entity to look for, as a Wikidata numeric id (i.e. without the starting Q). Can be an int or anything iterable
- label_exact (bool) -- Filter entities whose label matches exactly the search term
- limit (str) -- How many results to return at most
- Returns:
- List of dicts with details (which details depend on *type_*)
- Return type:
- list of dicts
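- Sample usage of the two lookup helpers above (an illustrative sketch: the property, values and language are examples, and the returned payloads depend on the live Wikidata APIs):
from strephit.commons import wikidata

# full-text search restricted to humans: 'instance of' (P31) value 5, i.e. Q5, passed as a numeric id
candidates = wikidata.search('Douglas Adams', 'en', type_=5)

# resolve the string value of a property to a Wikidata ID, e.g. place of birth (P19)
place_id = wikidata.resolve('P19', 'Cambridge', 'en')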
strephit.corpus_analysis package
strephit.corpus_analysis.compute_lu_distribution module
strephit.corpus_analysis.compute_lu_distribution.worker_with_sentences(bio)
- Produces a histogram counting the number of verbs for each sentence appearing in the biography
- Parameters:
- bio (str) -- The biography to analyze
- Returns:
- histogram of frequencies
- Type:
- dict
strephit.corpus_analysis.compute_lu_distribution.worker_with_sub_sentences(bio)
- Produces a histogram counting the number of verbs for each phrase appearing in the biography
- Parameters:
- bio (str) -- The biography to analyze
- Returns:
- histogram of frequencies
- Type:
- dict
strephit.corpus_analysis.extract_framenet_frames module
strephit.corpus_analysis.extract_framenet_frames.extract_top_corpus_tokens(enriched_lemmas, all_lemma_tokens)
- Extract the subset of corpus lemmas with tokens given the set of top lemmas
- Parameters:
- enriched_lemmas (dict) -- Dict returned by "intersect_lemmas_with_framenet()"
- all_lemma_tokens (dict) -- Dict of all corpus lemmas with tokens
- Returns:
- the top lemmas with tokens dict
- Return type:
- dict
strephit.corpus_analysis.extract_framenet_frames.get_top_n_lus(ranked_lus, n)
- Extract the top N Lexical Units (LUs) from a ranking.
- Parameters:
- ranked_lus (dict) -- LUs ranking, as returned by "compute_ranking()"
- n (int) -- Number of top LUs to return
- Returns:
- the top N LUs with their ranking scores
- Return type:
- dict
strephit.corpus_analysis.extract_framenet_frames.intersect_lemmas_with_framenet(corpus_lemmas, wikidata_properties)
- Intersect verb lemmas extracted from the input corpus with FrameNet Lexical Units (LUs).
- Parameters:
- corpus_lemmas (dict) -- dict of verb lemmas with their ranking scores
- wikidata_properties (dict) -- dict with all Wikidata properties
- Returns:
- a dictionary of corpus lemmas enriched with FrameNet LUs data (dicts)
- Return type:
- dict
strephit.corpus_analysis.rank_verbs module
class strephit.corpus_analysis.rank_verbs.PopularityRanking(corpus_path, pos_tag_key)
- Ranking based on the popularity of each verb. Simply counts the frequency of each lemma over the whole corpus
find_ranking(processes=0, bulk_size=10000, normalize=True)
static score_from_tokens(tokens)
class strephit.corpus_analysis.rank_verbs.TFIDFRanking(vectorizer, verbs, tfidf_matrix)
- Computes TF-IDF based rankings.
- The first ranking is based on the average TF-IDF score of each lemma over the whole corpus; the second ranking is based on the average standard deviation of TF-IDF scores of each lemma over the whole corpus
find_ranking(processes=0)
- Ranks the verbs
- Parameters:
- processes (int) -- How many processes to use for parallel ranking
- Returns:
- tuple with average tf-idf and average standard deviation ordered rankings
- Return type:
- tuple of (OrderedDict, OrderedDict)
score_lemma(lemma)
- Computes the TF-IDF based score of a single lemma
- Parameters:
- lemma (str) -- The lemma to score
- Returns:
- tuple with lemma, average tf-idf, average of tf-idf standard deviations
- Return type:
- tuple of (str, float, float)
strephit.corpus_analysis.rank_verbs.compute_tf_idf_matrix(corpus_path, document_key)
- Computes the TF-IDF matrix of the corpus
- Parameters:
- corpus_path (str) -- path of the corpus
- document_key (str) -- where the textual content is in the corpus
- Returns:
- a vectorizer and the computed matrix
- Return type:
- tuple
strephit.corpus_analysis.rank_verbs.get_similarity_scores(verb_token, vectorizer, tf_idf_matrix)
- Compute the cosine similarity score of a given verb token against the input corpus TF/IDF matrix.
- Parameters:
- verb_token (str) -- Surface form of a verb, e.g., born
- vectorizer (sklearn.feature_extraction.text.TfidfVectorizer) -- Vectorizer used to transform verbs into vectors
- Returns:
- cosine similarity score
- Return type:
- ndarray
strephit.corpus_analysis.rank_verbs.harmonic_ranking(*rankings)
- Combines individual rankings with a harmonic mean to obtain a final ranking
- Parameters:
- rankings -- dictionary of individual rankings
- Returns:
- the new, combined ranking
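- The combination is based on a per-lemma harmonic mean; a simplified sketch of the idea (not the actual implementation, with invented ranking values):
def harmonic_mean(scores):
    # harmonic mean of a list of positive scores
    return len(scores) / sum(1.0 / s for s in scores)

tf_idf_ranking = {'bear': 0.9, 'marry': 0.5}
popularity_ranking = {'bear': 0.7, 'marry': 0.8}
combined = {lemma: harmonic_mean([tf_idf_ranking[lemma], popularity_ranking[lemma]])
            for lemma in tf_idf_ranking}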
strephit.corpus_analysis.rank_verbs.produce_lemma_tokens(pos_tagged_path, pos_tag_key, language)
- Extracts a map from lemma to all its tokens
- Parameters:
- pos_tagged_path (str) -- path of the pos-tagged corpus
- pos_tag_key (str) -- where the pos tag data is in each item
- language -- language of the corpus
- Returns:
- mapping from lemma to tokens
- Return type:
- dict
strephit.corpus_analysis.test_pos_taggers module
strephit.corpus_analysis.test_pos_taggers.tag(text, tt_home)
strephit.extraction package
strephit.extraction.balanced_extract module
strephit.extraction.balanced_extract.extract_sentences(sentences, probabilities, processes=0, input_encoded=False, output_encoded=False)
- Extracts some sentences from the corpus following the given probabilities
- Parameters:
- sentences (iterable) -- Extracted sentences
- probabilities (dict) -- Conditional probabilities of extracting a sentence containing a specific LU given the source of the sentence. It is therefore a mapping source -> probabilities, where probabilities is itself a mapping LU -> probability
- processes (int) -- how many processes to use for parallel execution
- input_encoded (bool) -- whether the corpus is an iterable of dictionaries or an iterable of JSON-encoded documents. JSON-encoded documents are preferable over large size dictionaries for performance reasons
- output_encoded (bool) -- whether to return a generator of dictionaries or a generator of JSON-encoded documents. Prefer encoded output for performance reasons
- Returns:
- Generator of sentences
strephit.extraction.balanced_extract.lu_count(sentences, processes=0, input_encoded=False)
- Count how many sentences per LU there are for each source
- Parameters:
- sentences (iterable) -- Corpus with the POS-tagged sentences
- processes (int) -- how many processes to use for parallel execution
- input_encoded (bool) -- whether the corpus is an iterable of dictionaries or an iterable of JSON-encoded documents. JSON-encoded documents are preferable over large size dictionaries for performance reasons
- Returns:
- A dictionary source -> frequencies, where frequencies is another dictionary lemma -> count
- Type:
- dict
strephit.extraction.extract_sentences module
class strephit.extraction.extract_sentences.GrammarExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- Grammar-based extraction strategy: pick sentences that comply with a pre-defined grammar.
extract_from_item(item)
grammars = {'en': '\n NOPH: {<PDT>?<DT|PP.*|>?<CD>?<JJ.*|VVN>*<N.+|FW>+<CC>?}\n CHUNK: {<NOPH>+<MD>?<V.+>+<IN|TO>?<NOPH>+}\n ', 'it': '\n SN: {<PRO.*|DET.*|>?<ADJ>*<NUM>?<NOM|NPR>+<NUM>?<ADJ|VER:pper>*}\n CHUNK: {<SN><VER.*>+<SN>}\n '}
parser = None
setup_extractor()
splitter = None
class strephit.extraction.extract_sentences.ManyToManyExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- n2n extraction strategy: many sentences per many LUs
- N.B.: the same sentence is likely to appear multiple times
extract_from_item(item)
setup_extractor()
splitter = None
class strephit.extraction.extract_sentences.OneToOneExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- 121 extraction strategy: 1 sentence per 1 LU
- N.B.: the same sentence will appear only once; the sentence is assigned to a RANDOM LU
all_verb_tokens = None
extract_from_item(item)
setup_extractor()
splitter = None
token_to_lemma = None
class strephit.extraction.extract_sentences.SentenceExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Base class for sentence extractors.
extract(processes=0)
- Processes the corpus extracting sentences from each item and storing them in the item itself.
- Parameters:
- processes (int) -- how many processes to use for parallel tagging
- Returns:
- the extracted sentences
- Type:
- generator of dicts
extract_from_item(item)
- Extract sentences from an item. Relies on *setup_extractor* having been called
- Parameters:
- item (dict) -- Item from which to extract sentences
- Returns:
- The original item and list of extracted sentences
- Return type:
- tuple of dict, list
setup_extractor()
- Optional setup code, run before starting the extraction
teardown_extractor()
- Optional teardown code, run after the extraction
class strephit.extraction.extract_sentences.SyntacticExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- Tries to split sentences into sub-sentences so that each of them contains only one LU
all_verbs = None
extract_from_item(item)
find_sub_sentences(tree)
find_terminals(tree, label=None)
parser = None
setup_extractor()
splitter = None
token_to_lemma = None
strephit.extraction.extract_sentences.extract_sentences(corpus, sentences_key, document_key, language, lemma_to_tokens, strategy, match_base_form, processes=0)
- Extract sentences from the given corpus by matching tokens against a given set.
- Parameters:
- corpus -- Corpus as an iterable of documents
- sentences_key (str) -- dict key where to put extracted sentences
- document_key (str) -- dict key where the textual document is
- language (str) -- ISO 639-1 language code used for tokenization and sentence splitting
- lemma_to_tokens (dict) -- Dict with corpus lemmas as keys and tokens to be matched as values
- strategy (str) -- One of the 4 extraction strategies ['121', 'n2n', 'grammar', 'syntactic']
- match_base_form (bool) -- whether to match verbs base form
- processes (int) -- How many concurrent processes to use
- Returns:
- the corpus, updated with the extracted sentences and the number of extracted sentences
- Return type:
- generator of tuples
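- Sample usage (a sketch based on the parameter descriptions above; the corpus, keys and lemma mapping are invented, and the exact structure of the yielded tuples should be checked against the code):
from strephit.extraction.extract_sentences import extract_sentences

corpus = [{'url': 'http://example.org/bach', 'bio': 'Johann Sebastian Bach was born in Eisenach.'}]
lemma_to_tokens = {'bear': ['born', 'bear', 'bears']}

results = list(extract_sentences(corpus, 'sentences', 'bio', 'en',
                                 lemma_to_tokens, 'n2n',
                                 match_base_form=True, processes=0))
# each element pairs an updated document with the number of sentences extracted from it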
strephit.extraction.process_semistructured module
class strephit.extraction.process_semistructured.SemistructuredSerializer(language, sourced_only)
process_corpus(items, output_file, dump_unresolved_file=None, genealogics=None, processes=0)
resolve_genealogics_family(input_file, url_to_id)
- Performs a second pass on genealogics to resolve additional family members
serialize_item(item)
- Converts an item to quick statements.
- Parameters:
- item -- Scraped item, either str (json) or dict
- Returns:
- tuples <success, item> where item is an entity which could not be resolved if success is false, otherwise it is a <subject, property, object, source> tuple
- Return type:
- generator
strephit.extraction.source_id_mappings module
strephit.rule_based.resources package
strephit.rule_based.resources.frame_repo module
strephit.rule_based package
Subpackages
- strephit.rule_based.resources package
- Submodules
- strephit.rule_based.resources.frame_repo module
strephit.rule_based.classify module
class strephit.rule_based.classify.RuleBasedClassifier(frame_data, language)
- A simple rule-based classifier
- The frame is recognized solely based on the lexical unit, and frame elements are assigned to linked entities with a suitable type
assign_frame_elements(linked, frame)
- Try to assign a frame element to each of the linked entities based on their ontology type(s)
- Parameters:
- linked -- Entities found in the sentence
- frame -- Frame data
- Returns:
- List of assigned frames
label_sentence(sentence, normalize_numerical, score_type, core_weight)
- Labels a single sentence
- Parameters:
- sentence -- Sentence data to label
- normalize_numerical -- Automatically normalize numerical FEs
- score_type -- Which type of score (if any) to use to compute the classification confidence
- core_weight -- Weight of the core FEs (used in the scoring)
- Returns:
- Labeled data
label_sentences(sentences, normalize_numerical, score_type, core_weight, processes=0, input_encoded=False, output_encoded=False)
- Process all the given sentences with the rule-based classifier, optionally giving a confidence score
- Parameters:
- sentences -- List of sentence data
- normalize_numerical -- Whether to automatically normalize numerical expressions
- score_type -- Which type of score (if any) to use to compute the classification confidence
- core_weight -- Weight of the core FEs (used in the scoring)
- processes -- how many processes to use to concurrently label sentences
- input_encoded -- whether the corpus is an iterable of dictionaries or an iterable of JSON-encoded documents. JSON-encoded documents are preferable over large size dictionaries for performance reasons
- output_encoded -- whether to return a generator of dictionaries or a generator of JSON-encoded documents. Prefer encoded output for performance reasons
- Returns:
- Generator of labeled sentences
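- Sample usage (a hedged sketch: *frame_data.json*, the sentence-loading helper and the weights are placeholders; the score type value comes from strephit.commons.scoring above):
import json
from strephit.rule_based.classify import RuleBasedClassifier

with open('frame_data.json') as f:       # placeholder: lexical database with frames and FEs
    frame_data = json.load(f)

classifier = RuleBasedClassifier(frame_data, 'en')
sentences = load_extracted_sentences()   # hypothetical helper returning sentence data dicts

labeled = list(classifier.label_sentences(sentences,
                                          normalize_numerical=True,
                                          score_type='weighted-mean',
                                          core_weight=2.0))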
strephit.rule_based.cli module
strephit.side_projects package
strephit.side_projects.wlm module
strephit.side_projects.wlm.process_row(data)
strephit.side_projects.wlm.wlmid_resolver(property, value, language, **kwargs)
strephit.sphinx_wikisyntax package
sphinx_wikisyntax
Sphinx extension to generate documentation in wikisyntax format
strephit.sphinx_wikisyntax.setup(app)
strephit.sphinx_wikisyntax.builder module
sphinx_wikisyntax
Wikisyntax Sphinx builder.
class strephit.sphinx_wikisyntax.builder.WikisyntaxBuilder(app)
- Bases: "sphinx.builders.text.TextBuilder"
allow_parallel = True
format = 'wikisyntax'
name = 'wikisyntax'
out_suffix = '.wiki'
prepare_writing(docnames)
strephit.sphinx_wikisyntax.writer module
sphinx_wikisyntax
Custom docutils writer for wikisyntax
class strephit.sphinx_wikisyntax.writer.WikisyntaxTranslator(document, builder)
- Bases: "sphinx.writers.text.TextTranslator"
MAXWIDTH = 20000000000
STDINDENT = 1
depart_block_quote(node)
depart_centered(node)
depart_doctest_block(node)
depart_document(node)
depart_emphasis(node)
depart_list_item(node)
depart_literal_emphasis(node)
depart_literal_strong(node)
depart_strong(node)
depart_subscript(node)
depart_superscript(node)
depart_table(node)
depart_target(node)
depart_title(node)
- Called when the end of a section's title is encountered
end_state(wrap=False, end=[''], first=None)
visit_block_quote(node)
visit_centered(node)
visit_desc_parameterlist(node)
- Called when the parameter list of a function is encountered
visit_desc_signature(node)
- Called when the full name (incl. module) of a function is encountered
visit_doctest_block(node)
visit_emphasis(node)
visit_literal_emphasis(node)
visit_literal_strong(node)
visit_strong(node)
visit_subscript(node)
visit_superscript(node)
visit_target(node)
visit_transition(node)
class strephit.sphinx_wikisyntax.writer.WikisyntaxWriter(builder)
- Bases: "docutils.writers.Writer"
output = None
settings_defaults = {}
settings_spec = ('No options here.', '', ())
supported = ('text',)
translate()
strephit.web_sources_corpus.spiders package
strephit.web_sources_corpus.spiders.BaseSpider module
class strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- Generic base spider, to abstract most of the work.
- Specify the selectors to suit the website to scrape. The spider first uses a list of selectors to reach a page containing the list of items to scrape. Another selector is used to extract URLs pointing to detail pages, containing the details of the items to scrape. Finally, a third selector is used to extract the URL pointing to the next "list" page.
- *list_page_selectors* is a list of selectors used to reach the page containing the items to scrape. Each selector is applied to the page(s) fetched by extracting the url from the previous page using the preceding selector.
- *detail_page_selectors* extracts the URLs pointing to the detail pages. Can be a single selector or a list.
- *next_page_selectors* extracts the URL pointing to the next page.
- Selectors starting with *css:* are CSS selectors, those starting with *xpath:* are XPath selectors; all others should follow the syntax *method:selector*, where *method* is the name of a method of the spider and *selector* is another selector specified in the same way as above. The method is used to transform the result obtained by extracting the item pointed to by the selector, and should accept the response as first parameter and the result of extracting the data pointed to by the selector (only if specified).
- The spider provides a simple method to parse items. The item class is specified in *item_class* (must inherit from *scrapy.Item*) and item fields are specified in the dict *item_fields*, whose keys are field names and values are selectors following the syntax described above. They can also be lists or dicts, arbitrarily nested, eventually containing selectors.
- Each item can be processed and refined by the method *refine_item* (see the example spider sketched below).
clean(response, strings, unicode=True)
- Utility function to clean strings. Can be used within your selectors
detail_page_selectors = None
get_elements_from_selector(response, selector)
item_class = None
item_fields = {}
list_page_selectors = None
make_url_absolute(page_url, url)
next_page_selectors = None
parse(response)
- First stage of the spider with the goal of reaching the list page.
parse_detail(response)
- Third stage of the spider, parses the detail page to produce an item
parse_list(response)
- Second stage of the spider implementing pagination
refine_item(response, item)
- Applies any custom post-processing to the item, override if needed.
- Return None to discard the item
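- Example subclass (a sketch wired up with the selectors described above and modeled on the real spiders documented below; the website, selectors and the import path of WebSourcesCorpusItem are assumptions for illustration):
from strephit.web_sources_corpus.spiders.BaseSpider import BaseSpider
# assumed location of the shared item class; the real import path may differ
from strephit.web_sources_corpus.items import WebSourcesCorpusItem


class ExampleBioSpider(BaseSpider):
    name = 'example_bio'
    allowed_domains = ['biographies.example.org']
    start_urls = ('http://biographies.example.org/index/a.html',)

    list_page_selectors = None
    detail_page_selectors = 'xpath:.//ul[@class="people"]/li/a/@href'
    next_page_selectors = 'xpath:.//a[@class="next"]/@href'

    item_class = WebSourcesCorpusItem
    item_fields = {
        'name': 'clean:xpath:.//h1/text()',
        'bio': 'clean:xpath:.//div[@class="bio"]//p//text()',
    }

    def refine_item(self, response, item):
        # discard items without a biography
        if not item.get('bio'):
            return None
        return item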
strephit.web_sources_corpus.spiders.academia_net module
class strephit.web_sources_corpus.spiders.academia_net.AcademiaNetSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.academia-net.org']
detail_page_selectors = 'xpath:.//li[@class="profil"]/div[1]/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean:xpath:.//h1[contains(@class, "profilname")]/text()'}
list_page_selectors = None
name = 'academia_net'
next_page_selectors = 'xpath:.//div[@class="jumplist"]/a[last()]/@href'
refine_item(response, item)
strephit.web_sources_corpus.spiders.american_bio module
class strephit.web_sources_corpus.spiders.american_bio.AmericanBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[1]//tr[3]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[2]//ul[1]/li/a/@href'
name = 'american_bio'
next_page_selectors = None
strephit.web_sources_corpus.spiders.australasian_bio module
class strephit.web_sources_corpus.spiders.australasian_bio.AustralasianBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//tr[2]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'australasian_bio'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.australian_dictionary_of_biography module
class strephit.web_sources_corpus.spiders.australian_dictionary_of_biography.AustralianDictionaryOfBiographySpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- A spider for the Australian Dictionary of Biography website
allowed_domains = ['adb.anu.edu.au']
name = 'australian_dictionary_of_biography'
parse(response)
parse_person(response)
start_urls = ['http://adb.anu.edu.au/biographies/name/']
strephit.web_sources_corpus.spiders.bbc_co_uk module
class strephit.web_sources_corpus.spiders.bbc_co_uk.BbcCoUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.bbc.co.uk']
detail_page_selectors = 'xpath:.//a[@class="artist"]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="info"]/div[@id="bio"]//text()', 'other': {'read-more': 'clean:xpath:.//div[@id="info"]//div[@id="read-more"]//text()', 'short-desc': 'xpath:.//div[@id="info"]/ul[@id="short-desc"]/li//text()', 'oup': 'clean:xpath:.//div[@id="info"]/div[@id="oup"]/p[1]/text()', 'how-to-cite': 'clean:xpath:.//div[@id="how-to-cite"]//text()'}, 'name': 'clean:xpath:.//div[@id="info"]/h1/text()'}
list_page_selectors = None
name = 'bbc_co_uk'
next_page_selectors = 'xpath:.//div[@class="topPagination"]//li[@class="next"]//a/@href'
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.bio_english_lit module
class strephit.web_sources_corpus.spiders.bio_english_lit.BioEnglishLitSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li/a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'bio_english_lit'
next_page_selectors = None
strephit.web_sources_corpus.spiders.bishops module
class strephit.web_sources_corpus.spiders.bishops.BishopsSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.catholic-hierarchy.org']
clean_name(response, name)
detail_page_selectors = 'xpath:/html/body/ul/li/a[1]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean_name:clean:xpath:.//h1[@align="center"]//text()'}
list_page_selectors = 'xpath:.//a[starts-with(@href, "la")]/@href'
name = 'bishops'
next_page_selectors = None
parse_bio(response)
parse_microdata(response)
parse_other(response)
refine_item(response, item)
start_urls = ('http://www.catholic-hierarchy.org/bishop/la.html',)
strephit.web_sources_corpus.spiders.brown_edu module
class strephit.web_sources_corpus.spiders.brown_edu.BrownEduSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.brown.edu']
custom_settings = {'DOWNLOAD_DELAY': 0.5, 'RETRY_HTTP_CODES': ['403']}
detail_page_selectors = 'xpath:.//div[@class="index"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="index"]//text()', 'other': {'credit': 'clean:xpath:.//div[@class="credit"]//text()'}, 'name': 'clean:xpath:.//p[@class="head"]/following-sibling::p[1]/strong/text()'}
list_page_selectors = None
name = 'brown_edu'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.catholic_encyclopedia module
class strephit.web_sources_corpus.spiders.catholic_encyclopedia.CatholicEncyclopediaSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[1]//tr[4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[1]//a/@href'
name = 'catholic_encyclopedia'
next_page_selectors = None
start_urls = ('https://en.wikisource.org/wiki/Catholic_Encyclopedia_%281913%29',)
strephit.web_sources_corpus.spiders.cesar_org_uk module
class strephit.web_sources_corpus.spiders.cesar_org_uk.CesarOrgUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['cesar.org.uk']
detail_page_selectors = 'xpath:.//td[@id="keywordColumn"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
list_page_selector = None
name = 'cesar_org_uk'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.chinese_bio module
class strephit.web_sources_corpus.spiders.chinese_bio.ChineseBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@class="poem"]//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p//text()', 'name': 'clean:xpath://div[@id="headerContainer"]/following-sibling::div[1]//p/b[1]/text()'}
list_page_selectors = None
name = 'chinese_bio'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/A_Chinese_Biographical_Dictionary',)
strephit.web_sources_corpus.spiders.christian_bio module
class strephit.web_sources_corpus.spiders.christian_bio.ChristianBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'christian_bio'
next_page_selectors = None
start_requests()
strephit.web_sources_corpus.spiders.cooperhewitt_org module
class strephit.web_sources_corpus.spiders.cooperhewitt_org.CooperhewittOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['collection.cooperhewitt.org']
detail_page_selectors = 'get_detail_page:xpath:.//div[@class="row"]/div[2]/ul[@class="list-o-things"]//h1/a/@href'
get_detail_page(response, urls)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[contains(@class, "person-bio")]/p//text()', 'name': 'clean:xpath:.//div[@class="page-header"]/h1/a/text()'}
list_page_selectors = None
name = 'cooperhewitt_org'
next_page_selectors = 'xpath:.//ul[@class="pagination"]/li[last()]/a/@href'
refine_item(response, item)
start_urls = ('http://collection.cooperhewitt.org/people/page1',)
strephit.web_sources_corpus.spiders.design_and_art_australia_online module
class strephit.web_sources_corpus.spiders.design_and_art_australia_online.DesignAndArtAustraliaOnlineSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- A spider for the Design & Art Australia Online website
allowed_domains = ['www.daao.org.au']
name = 'design_and_art_australia_online'
parse(response)
parse_bio(response)
parse_person(response)
strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org module
class strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org.DictionaryofarthistoriansOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['dictionaryofarthistorians.org']
detail_page_selectors = 'xpath:.//div[@class="navigation-by-letter"]/following-sibling::p/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="arthist-publish-profile__body"]/p//text()', 'death': 'clean:xpath:.//div[@class="arthist-publish-profile__deathdate"]/p//text()', 'name': 'clean:xpath:.//h1[@class="arthist-publish-profile__name"]//text()', 'birth': 'clean:xpath:.//div[@class="arthist-publish-profile__birthdate"]/p//text()'}
list_page_selectors = None
name = 'dictionaryofarthistorians_org'
next_page_selectors = None
start_requests()
strephit.web_sources_corpus.spiders.dnb module
class strephit.web_sources_corpus.spiders.dnb.DictionaryOfNationalBiographySpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
- A spider for the Dictionary of National Biography on Wikisource
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//table//li/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div//p//text()'}
list_page_selectors = 'xpath:.//dd/a/@href'
name = 'dnb'
next_page_selectors = 'xpath:.//span[@id="headernext"]/a/@href'
refine_item(response, item)
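The DNB spider uses all three navigation attributes: list_page_selectors reaches the per-letter index pages, detail_page_selectors reaches individual biography pages, and next_page_selectors handles pagination. The sketch below illustrates the traversal these selectors imply; it is not the actual BaseSpider code, and parse_detail is a hypothetical callback name.

    import scrapy

    def parse(self, response):
        # Hedged sketch of the list -> detail -> next-page traversal implied by
        # the selectors above; the real BaseSpider implementation may differ.
        for href in response.xpath('.//dd/a/@href').extract():           # list_page_selectors
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
        for href in response.xpath('.//table//li/a/@href').extract():    # detail_page_selectors
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)
        next_page = response.xpath('.//span[@id="headernext"]/a/@href').extract_first()
        if next_page:                                                     # next_page_selectors
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)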
strephit.web_sources_corpus.spiders.dsi module
class strephit.web_sources_corpus.spiders.dsi.DsiSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.uni-stuttgart.de']
detail_page_selectors = 'xpath:.//a[contains(., "Detail page of this illustrator")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
list_page_selectors = None
name = 'dsi'
next_page_selectors = 'xpath:.//a[contains(., ">")]/@href'
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.english_artists module
class strephit.web_sources_corpus.spiders.english_artists.EnglishArtistsSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
finalize(item)
name = 'english_artists'
parse(response)
parse_detail(response)
text_from_node(node)
strephit.web_sources_corpus.spiders.freethinkers module
class strephit.web_sources_corpus.spiders.freethinkers.FreethinkersSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
name = 'freethinkers'
parse(response)
strephit.web_sources_corpus.spiders.gameo_org module
class strephit.web_sources_corpus.spiders.gameo_org.GameoOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['gameo.org']
detail_page_selectors = 'xpath:.//table[@class="mw-allpages-table-chunk"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]/h1[1]/preceding-sibling::*//text()'}
list_page_selectors = None
name = 'gameo_org'
next_page_selectors = 'xpath:.//td[@class="mw-allpages-nav"]/a[3]/@href'
parse_title(title)
refine_item(response, item)
strephit.web_sources_corpus.spiders.genealogics module
class strephit.web_sources_corpus.spiders.genealogics.GenealogicsSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- A spider for Leo's Genealogics website
allowed_domains = ['www.genealogics.org']
name = 'genealogics'
parse(response)
parse_person(response)
start_urls = ['http://www.genealogics.org/search.php?mybool=AND&nr=200']
strephit.web_sources_corpus.spiders.greek_roman_bio_myth module
class strephit.web_sources_corpus.spiders.greek_roman_bio_myth.GreekRomanBioMythSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li/a[not(@class="new")]/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]/p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]/text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li[position()>2]/a/@href'
name = 'greek_roman_bio_myth'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.indian_bio module
class strephit.web_sources_corpus.spiders.indian_bio.IndianBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[position()>4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'indian_bio'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.irish_officers module
class strephit.web_sources_corpus.spiders.irish_officers.IrishOfficersSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
name = 'irish_officers'
parse(response)
parse_detail(response)
refine_item(response, item)
strephit.web_sources_corpus.spiders.medical_bio module
class strephit.web_sources_corpus.spiders.medical_bio.MedicalBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[position()>1]//text()', 'other': {'born_died': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[1]/text()'}, 'name': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[1]/b/text()'}
list_page_selectors = 'xpath:(.//div[@id="mw-content-text"]//ol)[2]//a/@href'
name = 'medical_bio'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/American_Medical_Biographies',)
strephit.web_sources_corpus.spiders.men_at_the_bar module
class strephit.web_sources_corpus.spiders.men_at_the_bar.MenAtTheBarSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'men_at_the_bar'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.men_of_time module
class strephit.web_sources_corpus.spiders.men_of_time.MenOfTimeSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//text()', 'name': 'clean:xpath:.//span[@id="header_section_text"]//text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//ul//a[not(@class="new")]/@href'
name = 'men_of_time'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/Men_of_the_Time,_eleventh_edition',)
strephit.web_sources_corpus.spiders.metal_archives_com module
[edit]
class strephit.web_sources_corpus.spiders.metal_archives_com.MetalArchivesComSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['www.metal-archives.com']
name = 'metal_archives_com'
parse(response)
parse_detail(response)
parse_extern(response)
strephit.web_sources_corpus.spiders.modern_english_bio module
class strephit.web_sources_corpus.spiders.modern_english_bio.ModernEnglishBioSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
name = 'modern_english_bio'
parse(response)
parse_detail(response)
start_urls = ('https://en.wikisource.org/wiki/Modern_English_Biography',)
strephit.web_sources_corpus.spiders.munksroll module
class strephit.web_sources_corpus.spiders.munksroll.MunksrollSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['munksroll.rcplondon.ac.uk']
detail_page_selectors = 'xpath:.//div[@id="maincontent"]/table//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="prose"]//text()', 'name': 'clean:xpath:.//h2[@class="PageTitle"]/text()'}
list_page_selectors = None
name = 'munksroll'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.museothyssen_org module
class strephit.web_sources_corpus.spiders.museothyssen_org.MuseothyssenOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.museothyssen.org']
detail_page_selectors = 'xpath:.//ul[@id="autoresAZ"]/li/ul/li/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//span[@id="contReader1"]//text()', 'other': {'born': 'clean:xpath:.//dl[@class="datosAutor"]/dt[contains(., "Born/Dead:")]/following-sibling::dd[1]//text()'}, 'name': 'clean:xpath:.//dl[@class="datosAutor"]/dt[contains(., "Author:")]/following-sibling::dd[1]//text()'}
list_page_selectors = None
name = 'museothyssen_org'
next_page_selectors = None
refine_item(response, item)
start_urls = ('http://www.museothyssen.org/en/thyssen/artistas',)
strephit.web_sources_corpus.spiders.musicians module
class strephit.web_sources_corpus.spiders.musicians.MusiciansSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//table[@id="multicol"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
list_page_selectors = ['xpath:.//span[@class="mw-headline"]/parent::h2/following-sibling::ul//a/@href', 'xpath:.//span[.="Articles"]/parent::h2/following-sibling::ul//a/@href']
name = 'musicians'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/A_Dictionary_of_Music_and_Musicians',)
strephit.web_sources_corpus.spiders.national_bio module
class strephit.web_sources_corpus.spiders.national_bio.NationalBioSpider(year)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//table[@class="prettytable"]//tr[4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]/text()'}
list_page_selectors = None
name = 'national_bio'
next_page_selectors = None
strephit.web_sources_corpus.spiders.naval_bio module
class strephit.web_sources_corpus.spiders.naval_bio.NavalBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[position()>4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p[position()>1]//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'naval_bio'
next_page_selectors = None
start_urls = ('https://en.wikisource.org/wiki/A_Naval_Biographical_Dictionary',)
strephit.web_sources_corpus.spiders.newulsterbiography_co_uk module
class strephit.web_sources_corpus.spiders.newulsterbiography_co_uk.NewulsterbiographyCoUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.newulsterbiography.co.uk']
detail_page_selectors = 'xpath:.//div[@id="search_results"]/p/a/@href'
get_bio(response, values)
get_name(response, values)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'other': {'profession': 'xpath:.//span[@class="person_heading_profession"]//text()'}, 'bio': 'get_bio:xpath:.//div[@id="person_details"]/div/br[1]/preceding-sibling::*//text()', 'death': 'clean:xpath:.//div[@id="person_details"]/div/table[2]//tr[2]/td[2]/text()', 'name': 'get_name:xpath:.//h1[@class="person_heading"]/br/preceding-sibling::text()', 'birth': 'clean:xpath:.//div[@id="person_details"]/div/table[2]//tr[1]/td[2]/text()'}
list_page_selectors = None
name = 'newulsterbiography_co_uk'
next_page_selectors = None
start_urls = ('http://www.newulsterbiography.co.uk/index.php/home/browse/all',)
strephit.web_sources_corpus.spiders.nndb_com module
class strephit.web_sources_corpus.spiders.nndb_com.NndbComSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.nndb.com']
detail_page_selectors = 'xpath:.//a[contains(@href, "http://www.nndb.com/people/")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean:xpath:.//td/font/b/text()'}
list_page_selectors = 'xpath:.//a[@class="newslink"]/@href'
name = 'nndb_com'
refine_item(response, item)
start_urls = ('http://www.nndb.com/',)
strephit.web_sources_corpus.spiders.parliament_uk module
class strephit.web_sources_corpus.spiders.parliament_uk.ParliamentUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.parliament.uk']
clean_name(response, name)
detail_page_selectors = 'xpath:.//table//tr/td/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean_name:clean:xpath:.//div[@id="commons-biography-header"]/h1//text()'}
list_page_selectors = None
name = 'parliament_uk'
next_page_selectors = None
refine_item(response, item)
start_urls = ('http://www.parliament.uk/mps-lords-and-offices/mps/',)
strephit.web_sources_corpus.spiders.portraits_and_sketches module
class strephit.web_sources_corpus.spiders.portraits_and_sketches.PortraitsAndSketchesSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//text()', 'name': 'clean:xpath:(.//div[@class="tiInherit"]/p/span)[1]//text()'}
list_page_selectors = None
name = 'portraits_and_sketches'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.rkd_nl module
class strephit.web_sources_corpus.spiders.rkd_nl.RKDArtistsSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
- A spider for the RKD Netherlands Institute for Art History website
allowed_domains = ['rkd.nl']
detail_page_selectors = 'xpath:.//div[@class="header"]/a/@href'
extract_dl_key_value(dl_pairs, item)
- Feed the item with key-value pairs extracted from <dl> tags (a sketch follows this listing)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'url': 'make_url:xpath:.//div[@class="record-id"]//text()', 'name': 'clean:xpath:.//h2/text()'}
list_page_selectors = None
make_url(response, artist_id)
name = 'rkd_nl'
next_page_selectors = 'xpath:.//a[@title="Next page"]/@href'
refine_item(response, item)
start_urls = ['https://rkd.nl/en/explore/artists']
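extract_dl_key_value is described above as feeding the item with key-value pairs taken from <dl> tags. A possible shape for such a method is sketched below; the XPath details and the use of the 'other' field are assumptions, not the actual implementation.

    def extract_dl_key_value(self, dl_pairs, item):
        # Hedged sketch: walk the <dt>/<dd> pairs of each definition list and
        # copy them into the item's 'other' dict; the real method may differ.
        other = item.setdefault('other', {})
        for dl in dl_pairs:
            keys = dl.xpath('./dt//text()').extract()
            values = dl.xpath('./dd//text()').extract()
            for key, value in zip(keys, values):
                other[key.strip().rstrip(':')] = value.strip()
        return item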
strephit.web_sources_corpus.spiders.royalsociety_org module
class strephit.web_sources_corpus.spiders.royalsociety_org.RoyalsocietyOrgSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['royalsociety.org']
name = 'royalsociety_org'
parse(response)
parse_fellow(response)
start_requests()
start_urls = ('http://www.royalsociety.org/',)
strephit.web_sources_corpus.spiders.sculpture_uk module
class strephit.web_sources_corpus.spiders.sculpture_uk.SculptureUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['sculpture.gla.ac.uk']
detail_page_selectors = 'xpath:.//div[@class="featured"]/table//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="featured"]/p[child::b][last()]/following-sibling::p//text()', 'death': 'clean:xpath:.//b[.="Died"]/following-sibling::text()[1]', 'name': 'clean:xpath:.//div[@class="featured"]/h1//text()', 'birth': 'clean:xpath:.//b[.="Born"]/following-sibling::text()[1]'}
list_page_selectors = 'xpath:.//div[@class="featuredpeople"]//a/@href'
name = 'sculpture_uk'
next_page_selectors = None
refine_item(response, item)
start_urls = ('http://sculpture.gla.ac.uk/browse/index.php',)
strephit.web_sources_corpus.spiders.structurae_net module
class strephit.web_sources_corpus.spiders.structurae_net.StructuraeNetSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['structurae.net']
detail_page_selectors = 'xpath:.//ol[@class="searchlist"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'other': {'bibliography': 'xpath:.//div[@id="person-bibliography"]//li/a/@href', 'publications': 'xpath:.//div[@id="person-literature"]//li//a/@href', 'websites': 'xpath:.//div[@id="person-websites"]//li/a/@href', 'participated_in': 'xpath:.//div[@id="person-references"]//a/@href'}, 'name': 'clean:xpath:.//h1/span[@itemprop="name"]//text()'}
list_page_selectors = 'xpath:.//ol[@class="commalist"]//a/@href'
name = 'structurae_net'
next_page_selectors = 'xpath:(.//div[@class="nextPageNav"])[1]//a[1]/@href'
refine_item(response, item)
start_urls = ('http://structurae.net/persons/',)
strephit.web_sources_corpus.spiders.vocab_getty_edu module
class strephit.web_sources_corpus.spiders.vocab_getty_edu.VocabGettyEduSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['vocab.getty.edu']
completed_queries = set([])
db_connection = <sqlite3.Connection object>
finalize_data(table)
- This method is called after *table* has been populated. Once all tables have been populated, it joins them and yields the polished items (a sketch of this approach follows the listing).
load_into_db(table)
name = 'vocab_getty_edu'
queries = [('name', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fname%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++gvp%3AprefLabelGVP+%3Flabel.%0D%0A%3Flabel+gvp%3Aterm+%3Fname%0D%0A%7D&_implicit=false&_equivalent=false&_form=%2Fsparql'), ('bio', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbio2%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++skos%3AscopeNote+%3Fnote.%0D%0A+%3Fnote+rdf%3Avalue+%3Fbio2.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('bio2', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FshortBio%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Adescription+%3FshortBio.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('nationality', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fnationality%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AnationalityPreferred+%3Fny.%0D%0A+%3Fny+gvp%3AprefLabelGVP+%3FlblNationality.%0D%0A+%3FlblNationality+gvp%3Aterm+%3Fnationality.+%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('birth_year', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbirth%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestStart+%3Fbirth.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('birth_place', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FdeathPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AdeathPlace+%3Fdpf.%0D%0A+%3Fdp+foaf%3Afocus+%3Fdpf%3B%0D%0A++++++gvp%3AparentString+%3FdeathPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'), ('death_year', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fdeath%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestEnd+%3Fdeath%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('death_place', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FbirthPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AbirthPlace+%3Fbpf.%0D%0A+%3Fbp+foaf%3Afocus+%3Fbpf%3B%0D%0A++++++gvp%3AparentString+%3FbirthPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'), ('gender', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fgender%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Agender+%3Fgender%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql')]
row_to_item(row)
- Converts a single row, the result of the join between all tables, into a finished item
start_requests()
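This spider downloads the CSV results of the SPARQL queries listed above, loads each result set into its own SQLite table (load_into_db), and, once every query has completed, joins the tables on the person URI and turns the rows into items (finalize_data, row_to_item). Below is a hedged sketch of that approach with simplified table and column names; the real schema, join logic, and helper names are assumptions.

    import csv
    import io
    import sqlite3

    def load_into_db(connection, table, csv_text):
        # Hedged sketch: one two-column table per SPARQL query, keyed by the
        # person URI; the real schema and column names may differ.
        cursor = connection.cursor()
        cursor.execute('CREATE TABLE IF NOT EXISTS %s (person TEXT, value TEXT)' % table)
        reader = csv.reader(io.StringIO(csv_text))
        next(reader, None)  # skip the CSV header row
        for row in reader:
            cursor.execute('INSERT INTO %s VALUES (?, ?)' % table, row[:2])
        connection.commit()

    def join_tables(connection):
        # join the mandatory 'name' table with the optional 'bio' table
        query = ('SELECT name.person, name.value, bio.value FROM name '
                 'LEFT JOIN bio ON name.person = bio.person')
        for person, name, bio in connection.execute(query):
            yield {'url': person, 'name': name, 'bio': bio}

    connection = sqlite3.connect(':memory:')  # the spider keeps a similar handle in db_connection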
strephit.web_sources_corpus.spiders.wga_hu module
class strephit.web_sources_corpus.spiders.wga_hu.WgaHuSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.wga.hu']
detail_page_selectors = ['xpath:.//table//td[@class="ARTISTLIST"]//a/@href', 'xpath:.//a[starts-with(@href, "/bio/")]/@href']
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//h3[.="Biography"]/following-sibling::p/text()', 'other': {'born-died': 'clean:xpath:.//div[@class="INDEX3"]//text()'}, 'name': 'clean:xpath:.//div[@class="INDEX2"]/text()'}
list_page_selectors = None
name = 'wga_hu'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.who_is_who_america module
class strephit.web_sources_corpus.spiders.who_is_who_america.WhoIsWhoAmericaSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div//p[2]//text()', 'name': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div//p/b/a/text()'}
list_page_selectors = 'xpath:.//table[@class="headertemplate"]//tr[3]//a[not(@class="new")]/@href'
name = 'who_is_who_america'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.who_is_who_in_china module
class strephit.web_sources_corpus.spiders.who_is_who_in_china.WhoIsWhoInChinaSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="tiInherit"]/following-sibling::p//text()', 'name': 'clean:xpath:(.//p/b)[2]/text()'}
list_page_selectors = None
name = 'who_is_who_in_china'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.yba_llgc_org_uk module
class strephit.web_sources_corpus.spiders.yba_llgc_org_uk.YbaLlgcOrgUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['yba.llgc.org.uk']
clean_nu(response, strings)
detail_page_selectors = 'xpath:.//div[@id="text"]/p/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean_nu:xpath:.//div[@id="text"]//text()', 'other': {'sources': 'clean_nu:xpath:.//div[@id="text"]/div[@class="biog"]/ul/li[@class="bib_item"]//text()', 'contributer': 'clean_nu:xpath:.//div[@id="text"]/p[@class="contributer"]//text()', 'surname': 'clean_nu:xpath:.//div[@id="text"]/span[@class="article_header"]/b/span[@class="surname"]/text()', 'forename': 'clean_nu:xpath:.//div[@id="text"]/span[@class="article_header"]/b/span[@class="forename"]/text()'}}
list_page_selectors = None
name = 'yba_llgc_org_uk'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus package
Subpackages
- strephit.web_sources_corpus.spiders package
- Submodules
- strephit.web_sources_corpus.spiders.BaseSpider module
- strephit.web_sources_corpus.spiders.academia_net module
- strephit.web_sources_corpus.spiders.american_bio module
- strephit.web_sources_corpus.spiders.australasian_bio module
- strephit.web_sources_corpus.spiders.australian_dictionary_of_biography module
- strephit.web_sources_corpus.spiders.bbc_co_uk module
- strephit.web_sources_corpus.spiders.bio_english_lit module
- strephit.web_sources_corpus.spiders.bishops module
- strephit.web_sources_corpus.spiders.brown_edu module
- strephit.web_sources_corpus.spiders.catholic_encyclopedia module
- strephit.web_sources_corpus.spiders.cesar_org_uk module
- strephit.web_sources_corpus.spiders.chinese_bio module
- strephit.web_sources_corpus.spiders.christian_bio module
- strephit.web_sources_corpus.spiders.cooperhewitt_org module
- strephit.web_sources_corpus.spiders.design_and_art_australia_online module
- strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org module
- strephit.web_sources_corpus.spiders.dnb module
- strephit.web_sources_corpus.spiders.dsi module
- strephit.web_sources_corpus.spiders.english_artists module
- strephit.web_sources_corpus.spiders.freethinkers module
- strephit.web_sources_corpus.spiders.gameo_org module
- strephit.web_sources_corpus.spiders.genealogics module
- strephit.web_sources_corpus.spiders.greek_roman_bio_myth module
- strephit.web_sources_corpus.spiders.indian_bio module
- strephit.web_sources_corpus.spiders.irish_officers module
- strephit.web_sources_corpus.spiders.medical_bio module
- strephit.web_sources_corpus.spiders.men_at_the_bar module
- strephit.web_sources_corpus.spiders.men_of_time module
- strephit.web_sources_corpus.spiders.metal_archives_com module
- strephit.web_sources_corpus.spiders.modern_english_bio module
- strephit.web_sources_corpus.spiders.munksroll module
- strephit.web_sources_corpus.spiders.museothyssen_org module
- strephit.web_sources_corpus.spiders.musicians module
- strephit.web_sources_corpus.spiders.national_bio module
- strephit.web_sources_corpus.spiders.naval_bio module
- strephit.web_sources_corpus.spiders.newulsterbiography_co_uk module
- strephit.web_sources_corpus.spiders.nndb_com module
- strephit.web_sources_corpus.spiders.parliament_uk module
- strephit.web_sources_corpus.spiders.portraits_and_sketches module
- strephit.web_sources_corpus.spiders.rkd_nl module
- strephit.web_sources_corpus.spiders.royalsociety_org module
- strephit.web_sources_corpus.spiders.sculpture_uk module
- strephit.web_sources_corpus.spiders.structurae_net module
- strephit.web_sources_corpus.spiders.vocab_getty_edu module
- strephit.web_sources_corpus.spiders.wga_hu module
- strephit.web_sources_corpus.spiders.who_is_who_america module
- strephit.web_sources_corpus.spiders.who_is_who_in_china module
- strephit.web_sources_corpus.spiders.yba_llgc_org_uk module
strephit.web_sources_corpus.archive_org module
strephit.web_sources_corpus.archive_org.parse_and_save(text, separator, out_file, url)
strephit.web_sources_corpus.britishmuseum_org module
strephit.web_sources_corpus.britishmuseum_org.serialize_person(person)
strephit.web_sources_corpus.items module
class strephit.web_sources_corpus.items.WebSourcesCorpusItem(*args, **kwargs)
- Bases: "scrapy.item.Item"
fields = {'bio': {}, 'death': {}, 'name': {}, 'url': {}, 'other': {}, 'birth': {}}
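For reference, a hedged example of how a spider might fill in this item; the values are invented, and only the fields listed above are accepted keys, since scrapy.Item rejects undeclared fields.

    from strephit.web_sources_corpus.items import WebSourcesCorpusItem

    item = WebSourcesCorpusItem(
        name='Ada Lovelace',                                       # made-up example values
        url='https://en.wikisource.org/wiki/Example_Biography',    # hypothetical source page
        bio='Ada Lovelace was an English mathematician and writer.',
        other={'profession': 'mathematician'},                     # free-form extras go under 'other'
    )
    item['birth'] = '1815'   # fields can also be set after construction
    # item['foo'] = 'bar'    # would raise KeyError: 'foo' is not a declared field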
strephit.web_sources_corpus.pipelines module
class strephit.web_sources_corpus.pipelines.WebSourcesCorpusPipeline
- Bases: "object"
process_item(item, spider)
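The pipeline's behaviour is not documented here, so the snippet below is only a hedged sketch of a typical process_item implementation for this item type: it drops items without a biography and normalizes whitespace. NormalizeBioPipeline is a hypothetical name, not the actual WebSourcesCorpusPipeline.

    from scrapy.exceptions import DropItem

    class NormalizeBioPipeline(object):
        # Hedged sketch in the same spirit as the pipeline above; the real
        # WebSourcesCorpusPipeline may behave differently.
        def process_item(self, item, spider):
            bio = item.get('bio')
            if not bio:
                raise DropItem('missing biography in %s' % item.get('url'))
            if isinstance(bio, list):
                bio = ' '.join(bio)
            item['bio'] = ' '.join(bio.split())  # collapse runs of whitespace
            return item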