StrepHit
StrepHit is an intelligent reading agent that understands text and translates it into Wikidata statements.
More specifically, it is a Natural Language Processing pipeline that extracts facts from text and produces Wikidata statements with references. Its final objective is to enhance the data quality of Wikidata by suggesting references to validate statements.
StrepHit was born in January 2016 and is funded by a Wikimedia Foundation Individual Engagement Grant (IEG).
This page contains the technical documentation.
Source Code
The whole codebase can be found on GitHub: https://github.com/Wikidata/StrepHit
Features
- Web spiders to collect a biographical corpus from a list of reliable sources
- Corpus analysis to understand the most meaningful verbs
- Extraction of sentences and semi-structured data from a corpus
- Training of an automatic classifier through crowdsourcing
- Extraction of facts from text in two ways: with the supervised classifier trained on the crowdsourced annotations, or with the rule-based classifier
- Several utilities, ranging from NLP tasks like tokenization and part-of-speech tagging, to facilities for parallel processing, caching and logging
Pipeline
- Corpus Harvesting
- Corpus Analysis
- Sentence Extraction
- N-ary Relation Extraction
- Dataset Serialization
strephit.annotation package
strephit.annotation.create_crowdflower_input module
strephit.annotation.create_crowdflower_input.prepare_crowdflower_input(sentences, frame_data, filter_places)
strephit.annotation.create_crowdflower_input.write_input_spreadsheet(data_units, outfile)
strephit.annotation.generate_cml module
strephit.annotation.generate_cml.generate_crowdflower_interface_template(input_csv, output_html)
- Generate the CrowdFlower interface template based on the input data spreadsheet
- Parameters:
- input_csv (file) -- CSV file with the input data
- output_html (file) -- File in which to write the output
- Returns:
- 0 on success
strephit.annotation.parse_results module
strephit.annotation.parse_results.process_unit(unit_id, sentences)
strephit.annotation.post_job module
strephit.annotation.post_job.activate_gold(job_id)
- Activate gold units in the given job.
- Corresponds to the 'Convert Uploaded Test Questions' UI button.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- Returns:
- True on success
- Return type:
- boolean
strephit.annotation.post_job.config_job(job_id)
- Set up a given CrowdFlower job with default settings.
- See JOB_SETTINGS
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- Returns:
- the uploaded job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message
- Return type:
- dict
strephit.annotation.post_job.create_job(title, instructions, cml, custom_js)
- Create an empty CrowdFlower job with the specified title and instructions.
- Raise any HTTP error that may occur.
- Parameters:
- title (str) -- plain text title
- instructions (str) -- instructions, can contain HTML
- cml (str) -- worker interface CML template. See https://success.crowdflower.com/hc/en-us/articles/202817989-CML-CrowdFlower-Markup-Language-Overview
- custom_js (str) -- JavaScript code to be injected into the job
- Returns:
- the created job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message
- Return type:
- dict
strephit.annotation.post_job.tag_job(job_id, tags)
- Tag a given job.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- tags (list) -- list of tags
- Returns:
- True on success
- Return type:
- boolean
strephit.annotation.post_job.upload_units(job_id, csv_data)
- Upload the job data units to the given job.
- Raises any HTTP error that may occur.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
- csv_data (file) -- file handle pointing to the data units CSV
- Returns:
- the uploaded job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message
- Return type:
- dict
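- Example workflow (an illustrative sketch, not part of the codebase): the functions above can be chained to post a complete annotation job. The file names are placeholders, and the assumption that the new job ID is exposed under the 'id' key of the response object should be checked against the CrowdFlower API documentation linked above.
from strephit.annotation import post_job

job = post_job.create_job('StrepHit frame annotation',
                          '<p>Read the sentence and answer the questions</p>',
                          open('interface.cml').read(),   # CML worker interface
                          open('interface.js').read())    # custom JavaScript
job_id = job['id']  # assumption: the response object exposes the new job ID under 'id'

post_job.config_job(job_id)                       # apply the default JOB_SETTINGS
with open('crowdflower_input.csv') as csv_data:   # spreadsheet built by create_crowdflower_input
    post_job.upload_units(job_id, csv_data)       # upload the data units
post_job.activate_gold(job_id)                    # 'Convert Uploaded Test Questions'
post_job.tag_job(job_id, ['strephit', 'frame-annotation'])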
strephit.annotation.pull_results module
strephit.annotation.pull_results.download_full_report(job_id)
- Download the full CSV report of the given job.
- See https://success.crowdflower.com/hc/en-us/articles/202703075-Guide-to-Reports-Page-and-Settings-Page#full_report
- Raises any HTTP error that may occur.
- Parameters:
- job_id (str) -- job ID registered in CrowdFlower
strephit.annotation.pull_results.get_latest_job_id()
- Get the ID of the most recent job.
- Returns:
- the latest job ID
- Return type:
- str
strephit.classification package
strephit.classification.classify module
class strephit.classification.classify.SentenceClassifier(model, extractor, language, gazetteer)
- Supervised Sentence classifier
classify_sentences(sentences)
- Classify the given sentences
- Parameters:
- sentences (list) -- sentences to be classified. Each one should be a dict with a *text*, a source *url* and some *linked_entities*
- Returns:
- Classified sentences with the recognized *fes*
- Return type:
- generator of dicts
strephit.classification.feature_extractors module
class strephit.classification.feature_extractors.BaseFeatureExtractor
- Feature extractor template. Processes sentences one by one, accumulating their features, and finalizes them into the final training set.
- It should be used to extract features prior to classification, in which case the *fes* argument can be used to group tokens of the same entity into a single chunk while ignoring the actual frame element name, e.g. *fes = dict(enumerate(entities))*
get_features()
- Returns the final training set
- Returns:
- A matrix whose rows are samples and columns are features and a column vector with the sample label (i.e. the correct answer for the classifier)
- Return type:
- tuple
process_sentence(sentence, fes, add_unknown, gazetteer)
- Extracts and accumulates features for the given sentence
- Parameters:
- sentence (unicode) -- Text of the sentence
- fes (dict) -- Dictionary with FEs and corresponding chunks
- add_unknown (bool) -- Whether unknown tokens should be added to the index or treated as a special, unknown token. Set to True when building the training set and to False when building the features used to classify new sentences
- gazetteer (dict) -- Additional features to add when a given chunk is found in the sentence. Keys should be chunks and values should be list of features
- Returns:
- Nothing
start()
- Clears the features accumulated so far and starts over.
class strephit.classification.feature_extractors.FactExtractorFeatureExtractor(language, window_width=2)
- Bases: "strephit.classification.feature_extractors.BaseFeatureExtractor"
- Feature extractor inspired by the fact-extractor
extract_features(sentence, fes, add_unknown, gazetteer)
- Extracts the features for each token of the sentence
- Parameters:
- sentence (unicode) -- Text of the sentence
- fes (dict) -- mapping FE -> chunk
- gazetteer (dict) -- mapping chunk -> additional features
- Returns:
- List of features, each one as a sparse row (i.e. with the indexes of the relevant columns)
feature_for(term, type_, position, add_unknown)
- Returns the feature for the given token, i.e. the column of the feature in a sparse matrix
- Parameters:
- term (str) -- Actual term
- type (str) -- Type of the term, for example token, pos or lemma
- position (int) -- Relative position (used for context windows)
- add_unknown (bool) -- Whether to add previously unseen terms to the dictionary or use the UNK token instead
- Returns:
- Column of the corresponding feature
get_features()
process_sentence(sentence, fes, add_unknown, gazetteer)
sentence_to_tokens(sentence, fes)
- Transforms a sentence into a list of tokens
- Parameters:
- sentence (unicode) -- Text of the sentence
- fes (dict) -- mapping FE -> chunk
- Returns:
- List of tokens
start()
token_to_features(tokens, position, add_unknown, gazetteer)
- Extracts the features for the token in the given position
- Parameters:
- tokens (list) -- POS-tagged tokens of the sentence
- position (int) -- position of the token for which features are requested
- gazetteer (dict) -- mapping chunk -> additional features
- Returns:
- sparse set of features (i.e. numbers are indexes in a row of a sparse matrix)
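- Sample usage (a minimal sketch, not taken from the codebase): the extractor accumulates sentences one by one and is then finalized into a training set. The sentence, frame elements and empty gazetteer below are invented placeholders.
from strephit.classification.feature_extractors import FactExtractorFeatureExtractor

extractor = FactExtractorFeatureExtractor('en')
gazetteer = {}  # no extra chunk-level features

# accumulate features; fes maps each frame element to the chunk realizing it
extractor.process_sentence(u'Johann Sebastian Bach was born in Eisenach',
                           {'Child': u'Johann Sebastian Bach', 'Place': u'Eisenach'},
                           add_unknown=True, gazetteer=gazetteer)

# finalize into a sample/feature matrix and the corresponding label vector
features, labels = extractor.get_features()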
class strephit.classification.feature_extractors.SortedSet
- Very simple sorted unique collection which remembers the order of insertion of its items
index(item)
put(item)
reverse_map()
strephit.classification.train module
strephit.commons package
strephit.commons.cache module
strephit.commons.cache.cached(function)
- Decorator to cache function results based on its arguments
- Sample usage:
>>> from strephit.commons import cache
>>> @cache.cached
... def f(x):
...     print 'inside f'
...     return 2 * x
...
>>> f(10)
inside f
20
>>> f(10)
20
strephit.commons.cache.get(key, default=None)
- Retrieves an item from the cache
- Parameters:
- key -- Key of the item
- default -- Default value to return if the key is not in the cache
- Returns:
- The item associated with the given key or the default value
- Sample usage:
>>> from strephit.commons import cache
>>> cache.get('kk', 13)
13
>>> cache.get('kk', 0)
0
>>> cache.set('kk', 15)
>>> cache.get('kk', 0)
15
strephit.commons.cache.set(key, value, overwrite=True)
- Stores an item in the cache under the given key
- Parameters:
- key -- Unique key used to identify the item.
- value -- Value to store in the cache. Must be JSON-dumpable
- overwrite -- Whether to overwrite the previous value associated with the key (if any)
- Returns:
- Nothing
- Sample usage:
>>> from strephit.commons import cache
>>> cache.get('kk', 13)
13
>>> cache.get('kk', 0)
0
>>> cache.set('kk', 15)
>>> cache.get('kk', 0)
15
strephit.commons.classification module
strephit.commons.classification.apply_custom_classification_rules(classified, language, overwrite=False)
- Implements simple custom, classifier-agnostic rules for recognizing some frame elements
- Parameters:
- classified (dict) -- an item produced by the classifier
- language (str) -- Language of the sentence
- overwrite (bool) -- Whether the rules take priority when they assign a role to a chunk already recognized by the classifier
- Returns:
- The same item with augmented FEs
strephit.commons.classification.reverse_gazetteer(gazetteer)
- Reverses the gazetteer from feature -> chunks to chunk -> features
- Parameters:
- gazetteer (dict) -- Gazetteer associating features to chunks
- Returns:
- An equivalent gazetteer associating chunks to features
- Return type:
- dict
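- Sample usage (illustrative only; the feature and chunk names are invented):
from strephit.commons.classification import reverse_gazetteer

gazetteer = {'honorific': ['sir', 'dr.'], 'birth-marker': ['born in']}
reversed_gazetteer = reverse_gazetteer(gazetteer)
# expected shape: {'sir': ['honorific'], 'dr.': ['honorific'], 'born in': ['birth-marker']}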
strephit.commons.date_normalizer module
class strephit.commons.date_normalizer.DateNormalizer(language=None, specs=None)
- Bases: "object"
- Finds matches in text strings using regular expressions and transforms them according to a pattern transformation expression evaluated on the match.
- The specifications are given in YAML format and allow defining meta functions and meta variables as well as the pattern and transformation rules themselves.
- Meta variables are placed inside the patterns that use them, to make writing patterns easier; they are also available from inside the meta functions as a dictionary named meta_vars.
- A pattern transformation expression is an expression that is evaluated if the corresponding regular expression matches. It has access to all the defined meta functions and meta variables, and to a variable named 'match' containing the regex match found.
normalize_many(expression)
- Find all the matching entities in the given expression
- Parameters:
- expression (str) -- The expression in which to look for
- Returns:
- Generator of tuples (start, end), category, result
- Sample usage:
>>> from pprint import pprint
>>> from strephit.commons.date_normalizer import DateNormalizer
>>> pprint(list(DateNormalizer('en').normalize_many('I was born on April 18th, '
...                                                 'and today is April 18th, 2016!')))
[((14, 24), 'Time', {'day': 18, 'month': 4}),
 ((39, 55), 'Time', {'day': 18, 'month': 4, 'year': 2016})]
normalize_one(expression, conflict='longest')
- Find the matching part in the given expression
- Parameters:
- expression (str) -- The expression in which to search the match
- conflict (str) -- Whether to return the first match found, or to scan through all the provided regular expressions and return the longest or shortest part of the string matched. Note that the match will always be the first one found in the string; this parameter tells how to resolve conflicts when more than one regular expression returns a match. When several matches have the same length, the first one found counts. Allowed values are *first*, *longest* and *shortest*
- Returns:
- Tuple with (start, end), category, result
- Return type:
- tuple
- Sample usage:
>>> from strephit.commons.date_normalizer import DateNormalizer
>>> DateNormalizer('en').normalize_one('Today is the 1st of June, 2016')
((13, 30), 'Time', {'month': 6, 'day': 1, 'year': 2016})
strephit.commons.date_normalizer.normalize_numerical_fes(language, text)
- Normalize numerical FEs in a sentence
strephit.commons.datetime module
strephit.commons.datetime.parse(string)
- Try to parse a date expressed in natural language.
- Parameters:
- string (str) -- Date in natural language
- Returns:
- dictionary with year, month, day
- Type:
- dict
strephit.commons.entity_linking module
strephit.commons.entity_linking.extract_entities(response_json)
- Extract the list of entities from the Dandelion Entity Extraction API JSON response.
- Parameters:
- response_json (dict) -- JSON response returned by Dandelion
- Returns:
- The extracted entities, with the surface form, start and end indices, URI, and ontology types
- Return type:
- list
strephit.commons.io module
strephit.commons.io.dump_corpus(corpus, dump_file_handle)
- Dump a loaded corpus to a file with one JSON object per line.
strephit.commons.io.get_and_cache(url, use_cache=True, **kwargs)
- Perform an HTTP GET request to the given URL and optionally cache the result in the file system. The cached content will be used for subsequent requests.
- Raises all HTTP errors
- Parameters:
- url -- URL of the page to retrieve
- use_cache -- Whether to use the cache
- **kwargs -- keyword arguments to pass to *requests.get*
- Returns:
- The content of the page at the given URL, as unicode
strephit.commons.io.load_corpus(location, document_key, text_only=False)
- Load an input corpus from a directory with scraped items, in a memory-efficient way.
- Each input file must contain one JSON object per line.
- Parameters:
- document_key (str) -- a scraped item dictionary key holding textual documents
strephit.commons.io.load_dumped_corpus(dump_file_handle, document_key, text_only=False)
- Load a previously dumped corpus file, in a memory-efficient way.
strephit.commons.io.load_scraped_items(location)
- Loads all the items from a directory or file.
- Parameters:
- location -- Where the corpus is. If it is a directory, all files with extension jsonlines will be loaded. If it is a file, it can be either a jsonlines or a tar-compressed file.
strephit.commons.logging module
strephit.commons.logging.log_request_data(http_response, logger)
- Send a debug log message with basic information of the HTTP request that was sent for the given HTTP response.
- Parameters:
- http_response (requests.models.Response) -- HTTP response object
strephit.commons.logging.setLogLevel(module, level)
- Sets the log level used to log messages from the given module
strephit.commons.logging.setup()
strephit.commons.parallel module
strephit.commons.parallel.execute(processes=0, *specs)
- Execute the given functions in parallel
- Parameters:
- processes -- Number of functions to execute at the same time
- specs -- a sequence of functions, each followed by its arguments (arguments as a tuple or list)
- Returns:
- the results that the functions returned, in the same order as they were specified
- Return type:
- list
- Sample usage:
>>> from strephit.commons import parallel
>>> list(parallel.execute(4,
...     lambda x, y: x + y, (5, -5),
...     lambda *x: sum(x), range(5)
... ))
[0, 10]
strephit.commons.parallel.make_batches(iterable, size)
strephit.commons.parallel.map(function, iterable, processes=0, flatten=False, raise_exc=True, batch_size=0)
- Applies the given function to each element of the iterable in parallel.
- *None* values are not allowed in the iterable nor as return values; they will simply be discarded. Can be "safely" stopped with a keyboard interrupt.
- Parameters:
- function -- the function used to transform the elements of the iterable
- processes -- how many items to process in parallel. Use zero or a negative number to use all the available processors. No additional processes will be used if the value is 1.
- flatten -- If the mapping function returns an iterable, flatten the resulting iterables into a single one.
- raise_exc -- Only when *processes* equals 1, controls whether to propagate the exceptions raised by the mapping function to the caller or simply to log them and carry on with the computation. When *processes* is different from 1 this parameter is not used.
- batch_size -- If larger than 0, the input iterable will be grouped in groups of this size and the resulting list passed as argument to the worker.
- Returns:
- iterable with the results. Order is not guaranteed to be preserved
- Sample usage:
>>> from strephit.commons import parallel
>>> list(parallel.map(lambda x: 2*x, range(10)))
[0, 8, 10, 12, 14, 16, 18, 2, 4, 6]
strephit.commons.pos_tag module
class strephit.commons.pos_tag.NLTKPosTagger(language)
- Bases: "object"
- part-of-speech tagger implemented using the NLTK library
tag_many(documents, tagset=None, **kwargs)
- POS-Tag many documents.
tag_one(text, tagset, **kwargs)
- POS-Tags the given text
class strephit.commons.pos_tag.TTPosTagger(language, tt_home=None, **kwargs)
- Bases: "object"
- part-of-speech tagger implemented using TreeTagger and treetaggerwrapper
tag_many(items, document_key, pos_tag_key, batch_size=10000, **kwargs)
- POS-Tags many text documents of the given items. Use this for massive text tagging
- Parameters:
- items -- Iterable of items to tag. Generator preferred
- document_key -- Where to find the text to tag inside each item. Text must be unicode
- pos_tag_key -- Where to put pos tagged text
- Sample usage:
>>> from strephit.commons.pos_tag import TTPosTagger
>>> from pprint import pprint
>>> pprint(list(TTPosTagger('en').tag_many(
...     [{'text': 'Item one is in first position'}, {'text': 'In the second position is item two'}],
...     'text', 'tagged'
... )))
[{'tagged': [Tag(word='Item', pos='NN', lemma='item'),
             Tag(word='one', pos='CD', lemma='one'),
             Tag(word='is', pos='VBZ', lemma='be'),
             Tag(word='in', pos='IN', lemma='in'),
             Tag(word='first', pos='JJ', lemma='first'),
             Tag(word='position', pos='NN', lemma='position')],
  'text': 'Item one is in first position'},
 {'tagged': [Tag(word='In', pos='IN', lemma='in'),
             Tag(word='the', pos='DT', lemma='the'),
             Tag(word='second', pos='JJ', lemma='second'),
             Tag(word='position', pos='NN', lemma='position'),
             Tag(word='is', pos='VBZ', lemma='be'),
             Tag(word='item', pos='RB', lemma='item'),
             Tag(word='two', pos='CD', lemma='two')],
  'text': 'In the second position is item two'}]
tag_one(text, skip_unknown=True, **kwargs)
- POS-Tags the given text, optionally skipping unknown lemmas
- Parameters:
- text (unicode) -- Text to be tagged
- skip_unknown (bool) -- Automatically remove unrecognized tags from the result
- Sample usage:
>>> from strephit.commons.pos_tag import TTPosTagger
>>> from pprint import pprint
>>> pprint(TTPosTagger('en').tag_one('sample sentence to be tagged fycgvkuhbj'))
[Tag(word='sample', pos='NN', lemma='sample'),
 Tag(word='sentence', pos='NN', lemma='sentence'),
 Tag(word='to', pos='TO', lemma='to'),
 Tag(word='be', pos='VB', lemma='be'),
 Tag(word='tagged', pos='VVN', lemma='tag')]
tokenize(text)
- Splits a text into tokens
strephit.commons.pos_tag.get_pos_tagger(language, **kwargs)
- Returns an initialized instance of the preferred POS tagger for the given language
strephit.commons.scoring module
strephit.commons.scoring.compute_score(sentence, score, core_fes_weight)
- Computes the confidence score for a sentence based on FE scores
- Parameters:
- sentence (dict) -- Data of the sentence, containing FEs
- score (str) -- Type of score: arithmetic-mean, weighted-mean, f-score
- core_fes_weight (float) -- Weight of core FEs with respect to extra FEs
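- To illustrate the weighted-mean variant, the sketch below averages FE scores while giving core FEs the weight *core_fes_weight* and extra FEs a weight of 1. This is a simplified illustration, not the actual implementation; the FE scores and core FE set are invented.
def weighted_mean_confidence(fe_scores, core_fes, core_fes_weight):
    # fe_scores: mapping FE name -> score; core_fes: set of core FE names (placeholders)
    total, weights = 0.0, 0.0
    for fe, score in fe_scores.items():
        weight = core_fes_weight if fe in core_fes else 1.0
        total += weight * score
        weights += weight
    return total / weights if weights else 0.0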
strephit.commons.serialize module
class strephit.commons.serialize.ClassificationSerializer(language, frame_data, url_to_wid=None)
get_subjects(data)
- Finds all subjects of the frame assigned to the sentence
- Parameters:
- data (dict) -- classification results
- Returns:
- all subjects as tuples (chunk, wikidata id)
- Return type:
- generator of tuples
static map_fe_to_wid(frame_data)
serialize_numerical(subj, fe, url)
- Serializes a numerical FE found by the normalizer
to_statements(data, input_encoded=True)
- Converts the classification results into quick statements
- Parameters:
- data -- Data from the classifier. Can be either str or dict
- input_encoded (bool) -- Whether data is a str or a dict
- Returns:
- Tuples <success, item> where item is a statement if success is true, otherwise a named entity which could not be resolved
- Type:
- generator
strephit.commons.serialize.map_url_to_wid(semistructured)
- Read the quick statements generated from the semi-structured data and build a map associating URL to Wikidata ID
strephit.commons.split_sentences module
class strephit.commons.split_sentences.PunktSentenceSplitter(language)
- Bases: "object"
- Sentence splitting splits a natural language text into sentences
model_path = 'tokenizers/punkt/%s.pickle'
split(text)
- Split the given text into sentences.
- Leading and trailing spaces are stripped.
- Newline characters are first interpreted as sentence boundaries.
- Then, the sentence splitter is run.
- Parameters:
- text (str) -- Text to be split
- Returns:
- the sentences in the text
- Return type:
- generator
- Sample usage:
>>> from strephit.commons.split_sentences import PunktSentenceSplitter
>>> list(PunktSentenceSplitter('en').split(
...     "This is the first sentence. Mr. period doesn't always delimit sentences"
... ))
['This is the first sentence.', "Mr. period doesn't always delimit sentences"]
split_tokens(tokens)
- Splits the given text into sentences.
- Parameters:
- tokens (list) -- the tokens of the text
- Returns:
- the sentences in the text
- Return type:
- generator
- Sample usage:
>>> from strephit.commons.split_sentences import PunktSentenceSplitter
>>> list(PunktSentenceSplitter('en').split_tokens(
...     "This is the first sentence. Mr. period doesn't always delimit sentences".split()
... ))
[['This', 'is', 'the', 'first', 'sentence.'], ['Mr.', 'period', "doesn't", 'always', 'delimit', 'sentences']]
supported_models = {'el': 'tokenizers/punkt/greek.pickle', 'fr': 'tokenizers/punkt/french.pickle', 'en': 'tokenizers/punkt/english.pickle', 'nl': 'tokenizers/punkt/dutch.pickle', 'pt': 'tokenizers/punkt/portuguese.pickle', 'no': 'tokenizers/punkt/norwegian.pickle', 'sv': 'tokenizers/punkt/swedish.pickle', 'de': 'tokenizers/punkt/german.pickle', 'tr': 'tokenizers/punkt/turkish.pickle', 'it': 'tokenizers/punkt/italian.pickle', 'da': 'tokenizers/punkt/danish.pickle', 'cz': 'tokenizers/punkt/czech.pickle', 'es': 'tokenizers/punkt/spanish.pickle', 'fi': 'tokenizers/punkt/finnish.pickle', 'et': 'tokenizers/punkt/estonian.pickle', 'sl': 'tokenizers/punkt/slovene.pickle', 'pl': 'tokenizers/punkt/polish.pickle'}
strephit.commons.stopwords module
class strephit.commons.stopwords.StopWords
- Bases: "object"
- This module retrieves stop words for a given language
classmethod words(language)
- Returns a list of stop words for a specified language
- Parameters:
- language (str) -- the language whose stop words are required
- Returns:
- Stop words if the language is supported, else an empty list
- Return type:
- list
strephit.commons.text module
strephit.commons.text.clean(s, unicode=True)
strephit.commons.text.clean_extract(sel, path, path_type='xpath', limit_from=None, limit_to=None, sep='\n', unicode=True)
strephit.commons.text.extract_dict(response, keys_selector, values_selector, keys_extractor='.//text()', values_extractor='.//text()', **kwargs)
- Extracts a dictionary given the selectors for the keys and the values.
- The selectors should point to the elements containing the text and not the text itself.
- Parameters:
- response -- The response object. Its xpath or css methods are used
- keys_selector -- Selector pointing to the elements containing the keys, starting with the type *xpath:* or *css:* followed by the selector itself
- values_selector -- Selector pointing to the elements containing the values, starting with the type *xpath:* or *css:* followed by the selector itself
- keys_extractor -- Selector used to actually extract the value of the key from each key element. xpath only
- values_extractor -- Selector used to extract the actual value from each value element. xpath only
- **kwargs -- Other parameters to pass to *clean_extract*. Nothing good will come by passing *path_type='css'*, you have been warned.
strephit.commons.text.fix_name(name)
- Tries to normalize a name so that it can be searched with the Wikidata APIs
- Parameters:
- name -- The name to normalize
- Returns:
- a tuple with the normalized name and a list of honorifics
strephit.commons.text.parse_birth_death(string)
- Parses birth and death dates from a string.
- Parameters:
- string -- String with the dates. Can be 'd. <year>' to indicate the year of death, 'b. <year>' to indicate the year of birth, <year>-<year> to indicate both birth and death year. Can optionally include 'c.' or 'ca.' before years to indicate approximation (ignored by the return value). If only the century is specified, birth is the first year of the century and death is the last one, e.g. '19th century' will be parsed as *('1801', '1900')*
- Returns:
- tuple *(birth_year, death_year)*, both strings as appearing in the original string. If the string cannot be parsed *(None, None)* is returned.
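- Sample usage (a sketch: the first call is documented above, while the expected outputs of the others are reasonable assumptions, not guaranteed):
from strephit.commons.text import parse_birth_death

parse_birth_death('19th century')   # ('1801', '1900'), as documented above
parse_birth_death('1859-1930')      # expected: ('1859', '1930')
parse_birth_death('b. ca. 1820')    # expected: ('1820', None); the approximation marker is ignored
parse_birth_death('no dates here')  # expected: (None, None)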
strephit.commons.text.split_at(content, delimiters)
- Splits content using given delimiters following their order, for example
>>> [x for x in split_at(range(11), range(3,10,3))]
[(None, [1, 2]), (3, [4, 5]), (6, [7, 8]), (None, [9, 10])]
strephit.commons.text.strip_honorifics(name)
- Removes honorifics from the name
- Parameters:
- name -- The name
- Returns:
- a tuple with the name without honorifics and a list of honorifics
strephit.commons.tokenize module
class strephit.commons.tokenize.Tokenizer(language)
- Tokenization splits a natural language utterance into words (tokens)
tokenization_regexps = {'en': '[^\\p{L}\\p{N}]+', 'it': '[^\\p{L}\\p{N}]+'}
tokenize(sentence)
- Tokenize the given sentence.
- You can also pass a generic text, but you will lose the sentence segmentation.
- Parameters:
- sentence (str) -- a natural language sentence or text to be tokenized
- Returns:
- the list of tokens
- Return type:
- list
strephit.commons.wikidata module
strephit.commons.wikidata.call_api(action, cache=True, **kwargs)
- Invoke the given method of wikidata APIs with the given parameters
strephit.commons.wikidata.finalize_statement(subject, property, value, language, url=None, resolve_property=True, resolve_value=True, **kwargs)
- Given the components of a statement, convert it into a quick statement.
- Parameters:
- subject -- Subject of the statement (its Wikidata ID)
- property -- Property of the statement
- value -- Value of the statement (to be resolved)
- language -- Language used to resolve the value
- url -- Source of the statement (corresponds to S854)
- resolve_property -- Whether *property* is already a Wikidata ID or needs to be resolved
- resolve_value -- Whether *value* can be inserted into the statement as-is or needs to be resolved
- kwargs -- additional information used to resolve *value*
strephit.commons.wikidata.format_date(year=None, month=None, day=None)
- Formats a date according to Wikidata syntax. Assumes that the date is mostly correct. The allowed values of the parameters are shown in the following truth table:
  year  month  day  ok
   1      1     1    1
   1      1     0    1
   1      0     1    0
   1      0     0    1
   0      1     1    1
   0      1     0    0
   0      0     1    0
   0      0     0    0
- Parameters:
- year -- year of the date
- month -- month of the date. Only positive values allowed
- day -- day of the date. Only positive values allowed
strephit.commons.wikidata.get_entities(ids, batch)
- Retrieve Wikidata entities metadata.
- Parameters:
- ids (list) -- list of Wikidata entity IDs
- batch (int) -- number of IDs per call, to serve as paging for the API.
- Returns:
- dict of Wikidata entities with metadata
- Return type:
- dict
strephit.commons.wikidata.get_labels_and_aliases(entities, language_code)
- Extract language-specific label and aliases from a list of Wikidata entities metadata.
- Parameters:
- entities (list) -- list of Wikidata entities with metadata.
- language_code (str) -- 2-letter language code, e.g., *en* for English
- Returns:
- dict of entities, with label and aliases only
- Return type:
- dict
strephit.commons.wikidata.get_property_ids(batch)
- Get the full list of Wikidata property IDs (pids).
- Parameters:
- batch (int) -- number of pids per call, to serve as paging for the API.
- Returns:
- list of all pids
- Return type:
- list
strephit.commons.wikidata.honorifics_resolver(property, value, language, **kwargs)
- Resolves honorifics such as "mr.", "dr." etc
strephit.commons.wikidata.identity_resolver(property, value, language, **kwargs)
- Default resolver, converts to unicode and surrounds with double quotes
strephit.commons.wikidata.parse_date(date, precision=None)
- Tries to parse a date serialized according to the Wikidata format into its components year, month and day
- Returns:
- dict (year, month, day)
strephit.commons.wikidata.resolve(property, value, language, **kwargs)
- Tries to resolve the Wikidata ID of an object given its string representation
- Parameters:
- property -- Wikidata ID of the property to resolve
- value -- String value
- language -- Search only this language
- kwargs -- Additional info that might be useful to help the resolver
strephit.commons.wikidata.resolver(*properties)
- Decorator to register a function as resolver for the given properties.
strephit.commons.wikidata.resolver_with_hints(property, value, language, **kwargs)
- Resolves people names. Works better if generic biographic information, such as birth/death dates, is provided.
- Parameters:
- kwargs -- dictionary of wikidata property -> list of values
strephit.commons.wikidata.search(term, language, type_=None, label_exact=True, limit='15')
- Uses the Wikidata APIs to search for a term. Can optionally specify a type (corresponding to the 'instance of' P31 Wikidata property). If no type is specified, simply returns all the items containing *term* in *label*
- Parameters:
- term (str) -- The term to look for
- language (str) -- Search in this language
- type (iterable) -- Type of the entity to look for, as a Wikidata numeric id (i.e. without the starting Q). Can be an int or anything iterable
- label_exact (bool) -- Filter entities whose label matches exactly the search term
- limit (str) -- How many results to return at most
- Returns:
- List of dicts with details (which details depend on *type_*)
- Return type:
- list of dicts
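- Sample usage of the two lookup helpers above (an illustrative sketch: the property, values and language are examples, and the returned payloads depend on the live Wikidata APIs):
from strephit.commons import wikidata

# full-text search restricted to humans: 'instance of' (P31) value 5, i.e. Q5, passed as a numeric id
candidates = wikidata.search('Douglas Adams', 'en', type_=5)

# resolve the string value of a property to a Wikidata ID, e.g. place of birth (P19)
place_id = wikidata.resolve('P19', 'Cambridge', 'en')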
strephit.corpus_analysis package
strephit.corpus_analysis.compute_lu_distribution module
strephit.corpus_analysis.compute_lu_distribution.worker_with_sentences(bio)
- Produces a histogram counting the number of verbs for each sentence appearing in the biography
- Parameters:
- bio (str) -- The biography to analyze
- Returns:
- histogram of frequencies
- Type:
- dict
strephit.corpus_analysis.compute_lu_distribution.worker_with_sub_sentences(bio)
- Produces a histogram counting the number of verbs for each phrase appearing in the biography
- Parameters:
- bio (str) -- The biography to analyze
- Returns:
- histogram of frequencies
- Type:
- dict
strephit.corpus_analysis.extract_framenet_frames module
strephit.corpus_analysis.extract_framenet_frames.extract_top_corpus_tokens(enriched_lemmas, all_lemma_tokens)
- Extract the subset of corpus lemmas with tokens given the set of top lemmas
- Parameters:
- enriched_lemmas (dict) -- Dict returned by "intersect_lemmas_with_framenet()"
- all_lemma_tokens (dict) -- Dict of all corpus lemmas with tokens
- Returns:
- the top lemmas with tokens dict
- Return type:
- dict
strephit.corpus_analysis.extract_framenet_frames.get_top_n_lus(ranked_lus, n)
- Extract the top N Lexical Units (LUs) from a ranking.
- Parameters:
- ranked_lus (dict) -- LUs ranking, as returned by "compute_ranking()"
- n (int) -- Number of top LUs to return
- Returns:
- the top N LUs with their ranking scores
- Return type:
- dict
strephit.corpus_analysis.extract_framenet_frames.intersect_lemmas_with_framenet(corpus_lemmas, wikidata_properties)
- Intersect verb lemmas extracted from the input corpus with FrameNet Lexical Units (LUs).
- Parameters:
- corpus_lemmas (dict) -- dict of verb lemmas with their ranking scores
- wikidata_properties (dict) -- dict with all Wikidata properties
- Returns:
- a dictionary of corpus lemmas enriched with FrameNet LUs data (dicts)
- Return type:
- dict
strephit.corpus_analysis.rank_verbs module
class strephit.corpus_analysis.rank_verbs.PopularityRanking(corpus_path, pos_tag_key)
- Ranking based on the popularity of each verb. Simply counts the frequency of each lemma over the whole corpus
find_ranking(processes=0, bulk_size=10000, normalize=True)
static score_from_tokens(tokens)
class strephit.corpus_analysis.rank_verbs.TFIDFRanking(vectorizer, verbs, tfidf_matrix)
- Computes TF-IDF based rankings.
- The first ranking is based on the average TF-IDF score of each lemma over the whole corpus; the second ranking is based on the average standard deviation of TF-IDF scores of each lemma over the whole corpus
find_ranking(processes=0)
- Ranks the verbs
- Parameters:
- processes (int) -- How many processes to use for parallel ranking
- Returns:
- tuple with average tf-idf and average standard deviation ordered rankings
- Return type:
- tuple of (OrderedDict, OrderedDict)
score_lemma(lemma)
- Computes the TF-IDF based score of a single lemma
- Parameters:
- lemma (str) -- The lemma to score
- Returns:
- tuple with lemma, average tf-idf, average of tf-idf standard deviations
- Return type:
- tuple of (str, float, float)
strephit.corpus_analysis.rank_verbs.compute_tf_idf_matrix(corpus_path, document_key)
- Computes the TF-IDF matrix of the corpus
- Parameters:
- corpus_path (str) -- path of the corpus
- document_key (str) -- where the textual content is in the corpus
- Returns:
- a vectorizer and the computed matrix
- Return type:
- tuple
strephit.corpus_analysis.rank_verbs.get_similarity_scores(verb_token, vectorizer, tf_idf_matrix)
- Compute the cosine similarity score of a given verb token against the input corpus TF/IDF matrix.
- Parameters:
- verb_token (str) -- Surface form of a verb, e.g., born
- vectorizer (sklearn.feature_extraction.text.TfidfVectorizer) -- Vectorizer used to transform verbs into vectors
- Returns:
- cosine similarity score
- Return type:
- ndarray
strephit.corpus_analysis.rank_verbs.harmonic_ranking(*rankings)
- Combines individual rankings with a harmonic mean to obtain a final ranking
- Parameters:
- rankings -- dictionary of individual rankings
- Returns:
- the new, combined ranking
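- The combination is based on a per-lemma harmonic mean; a simplified sketch of the idea (not the actual implementation, with invented ranking values):
def harmonic_mean(scores):
    # harmonic mean of a list of positive scores
    return len(scores) / sum(1.0 / s for s in scores)

tf_idf_ranking = {'bear': 0.9, 'marry': 0.5}
popularity_ranking = {'bear': 0.7, 'marry': 0.8}
combined = {lemma: harmonic_mean([tf_idf_ranking[lemma], popularity_ranking[lemma]])
            for lemma in tf_idf_ranking}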
strephit.corpus_analysis.rank_verbs.produce_lemma_tokens(pos_tagged_path, pos_tag_key, language)
- Extracts a map from lemma to all its tokens
- Parameters:
- pos_tagged_path (str) -- path of the pos-tagged corpus
- pos_tag_key (str) -- where the pos tag data is in each item
- language -- language of the corpus
- Returns:
- mapping from lemma to tokens
- Return type:
- dict
strephit.corpus_analysis.test_pos_taggers module
strephit.corpus_analysis.test_pos_taggers.tag(text, tt_home)
strephit.extraction package
strephit.extraction.balanced_extract module
strephit.extraction.balanced_extract.extract_sentences(sentences, probabilities, processes=0, input_encoded=False, output_encoded=False)
- Extracts some sentences from the corpus following the given probabilities
- Parameters:
- sentences (iterable) -- Extracted sentences
- probabilities (dict) -- Conditional probabilities of extracting a sentence containing a specific LU given the source of the sentence. It is therefore a mapping source -> probabilities, where probabilities is itself a mapping LU -> probability
- processes (int) -- how many processes to use for parallel execution
- input_encoded (bool) -- whether the corpus is an iterable of dictionaries or an iterable of JSON-encoded documents. JSON-encoded documents are preferable over large size dictionaries for performance reasons
- output_encoded (bool) -- whether to return a generator of dictionaries or a generator of JSON-encoded documents. Prefer encoded output for performance reasons
- Returns:
- Generator of sentences
strephit.extraction.balanced_extract.lu_count(sentences, processes=0, input_encoded=False)
- Count how many sentences per LU there are for each source
- Parameters:
- sentences (iterable) -- Corpus with the POS-tagged sentences
- processes (int) -- how many processes to use for parallel execution
- input_encoded (bool) -- whether the corpus is an iterable of dictionaries or an iterable of JSON-encoded documents. JSON-encoded documents are preferable over large size dictionaries for performance reasons
- Returns:
- A dictionary source -> frequencies, where frequencies is another dictionary lemma -> count
- Type:
- dict
strephit.extraction.extract_sentences module
class strephit.extraction.extract_sentences.GrammarExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- Grammar-based extraction strategy: pick sentences that comply with a pre-defined grammar.
extract_from_item(item)
grammars = {'en': '\n NOPH: {<PDT>?<DT|PP.*|>?<CD>?<JJ.*|VVN>*<N.+|FW>+<CC>?}\n CHUNK: {<NOPH>+<MD>?<V.+>+<IN|TO>?<NOPH>+}\n ', 'it': '\n SN: {<PRO.*|DET.*|>?<ADJ>*<NUM>?<NOM|NPR>+<NUM>?<ADJ|VER:pper>*}\n CHUNK: {<SN><VER.*>+<SN>}\n '}
parser = None
setup_extractor()
splitter = None
class strephit.extraction.extract_sentences.ManyToManyExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- n2n extraction strategy: many sentences per many LUs
- N.B.: the same sentence is likely to appear multiple times
extract_from_item(item)
setup_extractor()
splitter = None
class strephit.extraction.extract_sentences.OneToOneExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- 121 extraction strategy: 1 sentence per 1 LU
- N.B.: the same sentence will appear only once; the sentence is assigned to a RANDOM LU
all_verb_tokens = None
extract_from_item(item)
setup_extractor()
splitter = None
token_to_lemma = None
class strephit.extraction.extract_sentences.SentenceExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Base class for sentence extractors.
extract(processes=0)
- Processes the corpus extracting sentences from each item and storing them in the item itself.
- Parameters:
- processes (int) -- how many processes to use for parallel tagging
- Returns:
- the extracted sentences
- Type:
- generator of dicts
extract_from_item(item)
- Extract sentences from an item. Relies on *setup_extractor* having been called
- Parameters:
- item (dict) -- Item from which to extract sentences
- Returns:
- The original item and list of extracted sentences
- Return type:
- tuple of dict, list
setup_extractor()
- Optional setup code, run before starting the extraction
teardown_extractor()
- Optional teardown code, run after the extraction
class strephit.extraction.extract_sentences.SyntacticExtractor(corpus, document_key, sentences_key, language, lemma_to_token, match_base_form)
- Bases: "strephit.extraction.extract_sentences.SentenceExtractor"
- Tries to split sentences into sub-sentences so that each of them contains only one LU
all_verbs = None
extract_from_item(item)
find_sub_sentences(tree)
find_terminals(tree, label=None)
parser = None
setup_extractor()
splitter = None
token_to_lemma = None
strephit.extraction.extract_sentences.extract_sentences(corpus, sentences_key, document_key, language, lemma_to_tokens, strategy, match_base_form, processes=0)
- Extract sentences from the given corpus by matching tokens against a given set.
- Parameters:
- corpus -- Corpus as an iterable of documents
- sentences_key (str) -- dict key where to put extracted sentences
- document_key (str) -- dict key where the textual document is
- language (str) -- ISO 639-1 language code used for tokenization and sentence splitting
- lemma_to_tokens (dict) -- Dict with corpus lemmas as keys and tokens to be matched as values
- strategy (str) -- One of the 4 extraction strategies ['121', 'n2n', 'grammar', 'syntactic']
- match_base_form (bool) -- whether to match verbs base form
- processes (int) -- How many concurrent processes to use
- Returns:
- the corpus, updated with the extracted sentences and the number of extracted sentences
- Return type:
- generator of tuples
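- Sample usage (a sketch based on the parameter descriptions above; the corpus, keys and lemma mapping are invented, and the exact structure of the yielded tuples should be checked against the code):
from strephit.extraction.extract_sentences import extract_sentences

corpus = [{'url': 'http://example.org/bach', 'bio': 'Johann Sebastian Bach was born in Eisenach.'}]
lemma_to_tokens = {'bear': ['born', 'bear', 'bears']}

results = list(extract_sentences(corpus, 'sentences', 'bio', 'en',
                                 lemma_to_tokens, 'n2n',
                                 match_base_form=True, processes=0))
# each element pairs an updated document with the number of sentences extracted from it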
strephit.extraction.process_semistructured module
class strephit.extraction.process_semistructured.SemistructuredSerializer(language, sourced_only)
process_corpus(items, output_file, dump_unresolved_file=None, genealogics=None, processes=0)
resolve_genealogics_family(input_file, url_to_id)
- Performs a second pass on genealogics to resolve additional family members
serialize_item(item)
- Converts an item to quick statements.
- Parameters:
- item -- Scraped item, either str (json) or dict
- Returns:
- tuples <success, item> where item is an entity which could not be resolved if success is false, otherwise it is a <subject, property, object, source> tuple
- Return type:
- generator
strephit.extraction.source_id_mappings module
strephit.rule_based.resources package
strephit.rule_based.resources.frame_repo module
strephit.rule_based package
Subpackages
- strephit.rule_based.resources package
- Submodules
- strephit.rule_based.resources.frame_repo module
strephit.rule_based.classify module
class strephit.rule_based.classify.RuleBasedClassifier(frame_data, language)
- A simple rule-based classifier
- The frame is recognized solely based on the lexical unit, and frame elements are assigned to linked entities with a suitable type
assign_frame_elements(linked, frame)
- Try to assign a frame element to each of the linked entities based on their ontology type(s)
- Parameters:
- linked -- Entities found in the sentence
- frame -- Frame data
- Returns:
- List of assigned frames
label_sentence(sentence, normalize_numerical, score_type, core_weight)
- Labels a single sentence
- Parameters:
- sentence -- Sentence data to label
- normalize_numerical -- Automatically normalize numerical FEs
- score_type -- Which type of score (if any) to use to compute the classification confidence
- core_weight -- Weight of the core FEs (used in the scoring)
- Returns:
- Labeled data
label_sentences(sentences, normalize_numerical, score_type, core_weight, processes=0, input_encoded=False, output_encoded=False)
- Process all the given sentences with the rule-based classifier, optionally giving a confidence score
- Parameters:
- sentences -- List of sentence data
- normalize_numerical -- Whether to automatically normalize numerical expressions
- score_type -- Which type of score (if any) to use to compute the classification confidence
- core_weight -- Weight of the core FEs (used in the scoring)
- processes -- how many processes to use to concurrently label sentences
- input_encoded -- whether the corpus is an iterable of dictionaries or an iterable of JSON-encoded documents. JSON-encoded documents are preferable over large size dictionaries for performance reasons
- output_encoded -- whether to return a generator of dictionaries or a generator of JSON-encoded documents. Prefer encoded output for performance reasons
- Returns:
- Generator of labeled sentences
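- Sample usage (a hedged sketch: *frame_data.json*, the sentence-loading helper and the weights are placeholders; the score type value comes from strephit.commons.scoring above):
import json
from strephit.rule_based.classify import RuleBasedClassifier

with open('frame_data.json') as f:       # placeholder: lexical database with frames and FEs
    frame_data = json.load(f)

classifier = RuleBasedClassifier(frame_data, 'en')
sentences = load_extracted_sentences()   # hypothetical helper returning sentence data dicts

labeled = list(classifier.label_sentences(sentences,
                                          normalize_numerical=True,
                                          score_type='weighted-mean',
                                          core_weight=2.0))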
strephit.rule_based.cli module
strephit.side_projects package
strephit.side_projects.wlm module
strephit.side_projects.wlm.process_row(data)
strephit.side_projects.wlm.wlmid_resolver(property, value, language, **kwargs)
strephit.sphinx_wikisyntax package
sphinx_wikisyntax
Sphinx extension to generate documentation in wikisyntax format
strephit.sphinx_wikisyntax.setup(app)
strephit.sphinx_wikisyntax.builder module
sphinx_wikisyntax
Wikisyntax Sphinx builder.
class strephit.sphinx_wikisyntax.builder.WikisyntaxBuilder(app)
- Bases: "sphinx.builders.text.TextBuilder"
allow_parallel = True
format = 'wikisyntax'
name = 'wikisyntax'
out_suffix = '.wiki'
prepare_writing(docnames)
strephit.sphinx_wikisyntax.writer module
sphinx_wikisyntax
Custom docutils writer for wikisyntax
class strephit.sphinx_wikisyntax.writer.WikisyntaxTranslator(document, builder)
- Bases: "sphinx.writers.text.TextTranslator"
MAXWIDTH = 20000000000
STDINDENT = 1
depart_block_quote(node)
depart_centered(node)
depart_doctest_block(node)
depart_document(node)
depart_emphasis(node)
depart_list_item(node)
depart_literal_emphasis(node)
depart_literal_strong(node)
depart_strong(node)
depart_subscript(node)
depart_superscript(node)
depart_table(node)
depart_target(node)
depart_title(node)
- Called when the end of a section's title is encountered
end_state(wrap=False, end=[''], first=None)
visit_block_quote(node)
visit_centered(node)
visit_desc_parameterlist(node)
- Called when the parameter list of a function is encountered
visit_desc_signature(node)
- Called when the full name (incl. module) of a function is encountered
visit_doctest_block(node)
visit_emphasis(node)
visit_literal_emphasis(node)
visit_literal_strong(node)
visit_strong(node)
visit_subscript(node)
visit_superscript(node)
visit_target(node)
visit_transition(node)
class strephit.sphinx_wikisyntax.writer.WikisyntaxWriter(builder)
- Bases: "docutils.writers.Writer"
output = None
settings_defaults = {}
settings_spec = ('No options here.', '', ())
supported = ('text',)
translate()
strephit.web_sources_corpus.spiders package
strephit.web_sources_corpus.spiders.BaseSpider module
class strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- Generic base spider, to abstract most of the work.
- Specify the selectors to suit the website to scrape. The spider first uses a list of selectors to reach a page containing the list of items to scrape. Another selector is used to extract URLs pointing to detail pages, containing the details of the items to scrape. Finally, a third selector is used to extract the URL pointing to the next "list" page.
- *list_page_selectors* is a list of selectors used to reach the page containing the items to scrape. Each selector is applied to the page(s) fetched by extracting the url from the previous page using the preceding selector.
- *detail_page_selectors* extracts the URLs pointing to the detail pages. Can be a single selector or a list.
- *next_page_selectors* extracts the URL pointing to the next page.
- Selectors starting with *css:* are CSS selectors, those starting with *xpath:* are XPath selectors; all others should follow the syntax *method:selector*, where *method* is the name of a method of the spider and *selector* is another selector specified in the same way as above. The method is used to transform the result obtained by extracting the item pointed to by the selector, and should accept the response as first parameter and the result of extracting the data pointed to by the selector (only if specified).
- The spider provides a simple method to parse items. The item class is specified in *item_class* (must inherit from *scrapy.Item*) and item fields are specified in the dict *item_fields*, whose keys are field names and values are selectors following the syntax described above. They can also be lists or dicts, arbitrarily nested, eventually containing selectors.
- Each item can be processed and refined by the method *refine_item* (see the example spider sketched below).
clean(response, strings, unicode=True)
- Utility function to clean strings. Can be used within your selectors
detail_page_selectors = None
get_elements_from_selector(response, selector)
item_class = None
item_fields = {}
list_page_selectors = None
make_url_absolute(page_url, url)
next_page_selectors = None
parse(response)
- First stage of the spider with the goal of reaching the list page.
parse_detail(response)
- Third stage of the spider, parses the detail page to produce an item
parse_list(response)
- Second stage of the spider implementing pagination
refine_item(response, item)
- Applies any custom post-processing to the item, override if needed.
- Return None to discard the item
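- Example subclass (a sketch wired up with the selectors described above and modeled on the real spiders documented below; the website, selectors and the import path of WebSourcesCorpusItem are assumptions for illustration):
from strephit.web_sources_corpus.spiders.BaseSpider import BaseSpider
# assumed location of the shared item class; the real import path may differ
from strephit.web_sources_corpus.items import WebSourcesCorpusItem


class ExampleBioSpider(BaseSpider):
    name = 'example_bio'
    allowed_domains = ['biographies.example.org']
    start_urls = ('http://biographies.example.org/index/a.html',)

    list_page_selectors = None
    detail_page_selectors = 'xpath:.//ul[@class="people"]/li/a/@href'
    next_page_selectors = 'xpath:.//a[@class="next"]/@href'

    item_class = WebSourcesCorpusItem
    item_fields = {
        'name': 'clean:xpath:.//h1/text()',
        'bio': 'clean:xpath:.//div[@class="bio"]//p//text()',
    }

    def refine_item(self, response, item):
        # discard items without a biography
        if not item.get('bio'):
            return None
        return item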
strephit.web_sources_corpus.spiders.academia_net module
class strephit.web_sources_corpus.spiders.academia_net.AcademiaNetSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.academia-net.org']
detail_page_selectors = 'xpath:.//li[@class="profil"]/div[1]/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean:xpath:.//h1[contains(@class, "profilname")]/text()'}
list_page_selectors = None
name = 'academia_net'
next_page_selectors = 'xpath:.//div[@class="jumplist"]/a[last()]/@href'
refine_item(response, item)
strephit.web_sources_corpus.spiders.american_bio module
class strephit.web_sources_corpus.spiders.american_bio.AmericanBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[1]//tr[3]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[2]//ul[1]/li/a/@href'
name = 'american_bio'
next_page_selectors = None
strephit.web_sources_corpus.spiders.australasian_bio module
class strephit.web_sources_corpus.spiders.australasian_bio.AustralasianBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//tr[2]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'australasian_bio'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.australian_dictionary_of_biography module
class strephit.web_sources_corpus.spiders.australian_dictionary_of_biography.AustralianDictionaryOfBiographySpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- A spider for the Australian Dictionary of Biography website
allowed_domains = ['adb.anu.edu.au']
name = 'australian_dictionary_of_biography'
parse(response)
parse_person(response)
start_urls = ['http://adb.anu.edu.au/biographies/name/']
strephit.web_sources_corpus.spiders.bbc_co_uk module
class strephit.web_sources_corpus.spiders.bbc_co_uk.BbcCoUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.bbc.co.uk']
detail_page_selectors = 'xpath:.//a[@class="artist"]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="info"]/div[@id="bio"]//text()', 'other': {'read-more': 'clean:xpath:.//div[@id="info"]//div[@id="read-more"]//text()', 'short-desc': 'xpath:.//div[@id="info"]/ul[@id="short-desc"]/li//text()', 'oup': 'clean:xpath:.//div[@id="info"]/div[@id="oup"]/p[1]/text()', 'how-to-cite': 'clean:xpath:.//div[@id="how-to-cite"]//text()'}, 'name': 'clean:xpath:.//div[@id="info"]/h1/text()'}
list_page_selectors = None
name = 'bbc_co_uk'
next_page_selectors = 'xpath:.//div[@class="topPagination"]//li[@class="next"]//a/@href'
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.bio_english_lit module
class strephit.web_sources_corpus.spiders.bio_english_lit.BioEnglishLitSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li/a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'bio_english_lit'
next_page_selectors = None
strephit.web_sources_corpus.spiders.bishops module
class strephit.web_sources_corpus.spiders.bishops.BishopsSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.catholic-hierarchy.org']
clean_name(response, name)
detail_page_selectors = 'xpath:/html/body/ul/li/a[1]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean_name:clean:xpath:.//h1[@align="center"]//text()'}
list_page_selectors = 'xpath:.//a[starts-with(@href, "la")]/@href'
name = 'bishops'
next_page_selectors = None
parse_bio(response)
parse_microdata(response)
parse_other(response)
refine_item(response, item)
start_urls = ('http://www.catholic-hierarchy.org/bishop/la.html',)
strephit.web_sources_corpus.spiders.brown_edu module
class strephit.web_sources_corpus.spiders.brown_edu.BrownEduSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.brown.edu']
custom_settings = {'DOWNLOAD_DELAY': 0.5, 'RETRY_HTTP_CODES': ['403']}
detail_page_selectors = 'xpath:.//div[@class="index"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="index"]//text()', 'other': {'credit': 'clean:xpath:.//div[@class="credit"]//text()'}, 'name': 'clean:xpath:.//p[@class="head"]/following-sibling::p[1]/strong/text()'}
list_page_selectors = None
name = 'brown_edu'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.catholic_encyclopedia module
class strephit.web_sources_corpus.spiders.catholic_encyclopedia.CatholicEncyclopediaSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[1]//tr[4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[1]//a/@href'
name = 'catholic_encyclopedia'
next_page_selectors = None
start_urls = ('https://en.wikisource.org/wiki/Catholic_Encyclopedia_%281913%29',)
strephit.web_sources_corpus.spiders.cesar_org_uk module
class strephit.web_sources_corpus.spiders.cesar_org_uk.CesarOrgUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['cesar.org.uk']
detail_page_selectors = 'xpath:.//td[@id="keywordColumn"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
list_page_selector = None
name = 'cesar_org_uk'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.chinese_bio module
class strephit.web_sources_corpus.spiders.chinese_bio.ChineseBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@class="poem"]//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p//text()', 'name': 'clean:xpath://div[@id="headerContainer"]/following-sibling::div[1]//p/b[1]/text()'}
list_page_selectors = None
name = 'chinese_bio'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/A_Chinese_Biographical_Dictionary',)
strephit.web_sources_corpus.spiders.christian_bio module
class strephit.web_sources_corpus.spiders.christian_bio.ChristianBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'christian_bio'
next_page_selectors = None
start_requests()
strephit.web_sources_corpus.spiders.cooperhewitt_org module
class strephit.web_sources_corpus.spiders.cooperhewitt_org.CooperhewittOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['collection.cooperhewitt.org']
detail_page_selectors = 'get_detail_page:xpath:.//div[@class="row"]/div[2]/ul[@class="list-o-things"]//h1/a/@href'
get_detail_page(response, urls)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[contains(@class, "person-bio")]/p//text()', 'name': 'clean:xpath:.//div[@class="page-header"]/h1/a/text()'}
list_page_selectors = None
name = 'cooperhewitt_org'
next_page_selectors = 'xpath:.//ul[@class="pagination"]/li[last()]/a/@href'
refine_item(response, item)
start_urls = ('http://collection.cooperhewitt.org/people/page1',)
strephit.web_sources_corpus.spiders.design_and_art_australia_online module
class strephit.web_sources_corpus.spiders.design_and_art_australia_online.DesignAndArtAustraliaOnlineSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- A spider for the Design & Art Australia Online website
allowed_domains = ['www.daao.org.au']
name = 'design_and_art_australia_online'
parse(response)
parse_bio(response)
parse_person(response)
strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org module
class strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org.DictionaryofarthistoriansOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['dictionaryofarthistorians.org']
detail_page_selectors = 'xpath:.//div[@class="navigation-by-letter"]/following-sibling::p/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="arthist-publish-profile__body"]/p//text()', 'death': 'clean:xpath:.//div[@class="arthist-publish-profile__deathdate"]/p//text()', 'name': 'clean:xpath:.//h1[@class="arthist-publish-profile__name"]//text()', 'birth': 'clean:xpath:.//div[@class="arthist-publish-profile__birthdate"]/p//text()'}
list_page_selectors = None
name = 'dictionaryofarthistorians_org'
next_page_selectors = None
start_requests()
strephit.web_sources_corpus.spiders.dnb module
class strephit.web_sources_corpus.spiders.dnb.DictionaryOfNationalBiographySpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
- A spider for the Dictionary of National Biography on Wikisource
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//table//li/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div//p//text()'}
list_page_selectors = 'xpath:.//dd/a/@href'
name = 'dnb'
next_page_selectors = 'xpath:.//span[@id="headernext"]/a/@href'
refine_item(response, item)
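The DNB spider uses all three navigation attributes: list_page_selectors reaches the per-letter index pages, detail_page_selectors reaches individual biography pages, and next_page_selectors handles pagination. The sketch below illustrates the traversal these selectors imply; it is not the actual BaseSpider code, and parse_detail is a hypothetical callback name.

    import scrapy

    def parse(self, response):
        # Hedged sketch of the list -> detail -> next-page traversal implied by
        # the selectors above; the real BaseSpider implementation may differ.
        for href in response.xpath('.//dd/a/@href').extract():           # list_page_selectors
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
        for href in response.xpath('.//table//li/a/@href').extract():    # detail_page_selectors
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)
        next_page = response.xpath('.//span[@id="headernext"]/a/@href').extract_first()
        if next_page:                                                     # next_page_selectors
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)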
strephit.web_sources_corpus.spiders.dsi module
class strephit.web_sources_corpus.spiders.dsi.DsiSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.uni-stuttgart.de']
detail_page_selectors = 'xpath:.//a[contains(., "Detail page of this illustrator")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
list_page_selectors = None
name = 'dsi'
next_page_selectors = 'xpath:.//a[contains(., ">")]/@href'
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.english_artists module
class strephit.web_sources_corpus.spiders.english_artists.EnglishArtistsSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
finalize(item)
name = 'english_artists'
parse(response)
parse_detail(response)
text_from_node(node)
strephit.web_sources_corpus.spiders.freethinkers module
class strephit.web_sources_corpus.spiders.freethinkers.FreethinkersSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
name = 'freethinkers'
parse(response)
strephit.web_sources_corpus.spiders.gameo_org module
class strephit.web_sources_corpus.spiders.gameo_org.GameoOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['gameo.org']
detail_page_selectors = 'xpath:.//table[@class="mw-allpages-table-chunk"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]/h1[1]/preceding-sibling::*//text()'}
list_page_selectors = None
name = 'gameo_org'
next_page_selectors = 'xpath:.//td[@class="mw-allpages-nav"]/a[3]/@href'
parse_title(title)
refine_item(response, item)
strephit.web_sources_corpus.spiders.genealogics module
class strephit.web_sources_corpus.spiders.genealogics.GenealogicsSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
- A spider for Leo's Genealogics website
allowed_domains = ['www.genealogics.org']
name = 'genealogics'
parse(response)
parse_person(response)
start_urls = ['http://www.genealogics.org/search.php?mybool=AND&nr=200']
strephit.web_sources_corpus.spiders.greek_roman_bio_myth module
class strephit.web_sources_corpus.spiders.greek_roman_bio_myth.GreekRomanBioMythSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li/a[not(@class="new")]/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]/p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]/text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li[position()>2]/a/@href'
name = 'greek_roman_bio_myth'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.indian_bio module
class strephit.web_sources_corpus.spiders.indian_bio.IndianBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[position()>4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'indian_bio'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.irish_officers module
class strephit.web_sources_corpus.spiders.irish_officers.IrishOfficersSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
name = 'irish_officers'
parse(response)
parse_detail(response)
refine_item(response, item)
strephit.web_sources_corpus.spiders.medical_bio module
class strephit.web_sources_corpus.spiders.medical_bio.MedicalBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[position()>1]//text()', 'other': {'born_died': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[1]/text()'}, 'name': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[1]/b/text()'}
list_page_selectors = 'xpath:(.//div[@id="mw-content-text"]//ol)[2]//a/@href'
name = 'medical_bio'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/American_Medical_Biographies',)
strephit.web_sources_corpus.spiders.men_at_the_bar module
class strephit.web_sources_corpus.spiders.men_at_the_bar.MenAtTheBarSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'men_at_the_bar'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.men_of_time module
class strephit.web_sources_corpus.spiders.men_of_time.MenOfTimeSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//text()', 'name': 'clean:xpath:.//span[@id="header_section_text"]//text()'}
list_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//ul//a[not(@class="new")]/@href'
name = 'men_of_time'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/Men_of_the_Time,_eleventh_edition',)
strephit.web_sources_corpus.spiders.metal_archives_com module
[edit]
class strephit.web_sources_corpus.spiders.metal_archives_com.MetalArchivesComSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['www.metal-archives.com']
name = 'metal_archives_com'
parse(response)
parse_detail(response)
parse_extern(response)
strephit.web_sources_corpus.spiders.modern_english_bio module
class strephit.web_sources_corpus.spiders.modern_english_bio.ModernEnglishBioSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['en.wikisource.org']
name = 'modern_english_bio'
parse(response)
parse_detail(response)
start_urls = ('https://en.wikisource.org/wiki/Modern_English_Biography',)
strephit.web_sources_corpus.spiders.munksroll module
class strephit.web_sources_corpus.spiders.munksroll.MunksrollSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['munksroll.rcplondon.ac.uk']
detail_page_selectors = 'xpath:.//div[@id="maincontent"]/table//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="prose"]//text()', 'name': 'clean:xpath:.//h2[@class="PageTitle"]/text()'}
list_page_selectors = None
name = 'munksroll'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.museothyssen_org module
class strephit.web_sources_corpus.spiders.museothyssen_org.MuseothyssenOrgSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.museothyssen.org']
detail_page_selectors = 'xpath:.//ul[@id="autoresAZ"]/li/ul/li/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//span[@id="contReader1"]//text()', 'other': {'born': 'clean:xpath:.//dl[@class="datosAutor"]/dt[contains(., "Born/Dead:")]/following-sibling::dd[1]//text()'}, 'name': 'clean:xpath:.//dl[@class="datosAutor"]/dt[contains(., "Author:")]/following-sibling::dd[1]//text()'}
list_page_selectors = None
name = 'museothyssen_org'
next_page_selectors = None
refine_item(response, item)
start_urls = ('http://www.museothyssen.org/en/thyssen/artistas',)
strephit.web_sources_corpus.spiders.musicians module
class strephit.web_sources_corpus.spiders.musicians.MusiciansSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//table[@id="multicol"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
list_page_selectors = ['xpath:.//span[@class="mw-headline"]/parent::h2/following-sibling::ul//a/@href', 'xpath:.//span[.="Articles"]/parent::h2/following-sibling::ul//a/@href']
name = 'musicians'
next_page_selectors = None
refine_item(response, item)
start_urls = ('https://en.wikisource.org/wiki/A_Dictionary_of_Music_and_Musicians',)
strephit.web_sources_corpus.spiders.national_bio module
class strephit.web_sources_corpus.spiders.national_bio.NationalBioSpider(year)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//table[@class="prettytable"]//tr[4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]/text()'}
list_page_selectors = None
name = 'national_bio'
next_page_selectors = None
strephit.web_sources_corpus.spiders.naval_bio module
class strephit.web_sources_corpus.spiders.naval_bio.NavalBioSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[position()>4]//a/@href'
get_name_from_title(response, title)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p[position()>1]//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}
list_page_selectors = None
name = 'naval_bio'
next_page_selectors = None
start_urls = ('https://en.wikisource.org/wiki/A_Naval_Biographical_Dictionary',)
strephit.web_sources_corpus.spiders.newulsterbiography_co_uk module
class strephit.web_sources_corpus.spiders.newulsterbiography_co_uk.NewulsterbiographyCoUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.newulsterbiography.co.uk']
detail_page_selectors = 'xpath:.//div[@id="search_results"]/p/a/@href'
get_bio(response, values)
get_name(response, values)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'other': {'profession': 'xpath:.//span[@class="person_heading_profession"]//text()'}, 'bio': 'get_bio:xpath:.//div[@id="person_details"]/div/br[1]/preceding-sibling::*//text()', 'death': 'clean:xpath:.//div[@id="person_details"]/div/table[2]//tr[2]/td[2]/text()', 'name': 'get_name:xpath:.//h1[@class="person_heading"]/br/preceding-sibling::text()', 'birth': 'clean:xpath:.//div[@id="person_details"]/div/table[2]//tr[1]/td[2]/text()'}
list_page_selectors = None
name = 'newulsterbiography_co_uk'
next_page_selectors = None
start_urls = ('http://www.newulsterbiography.co.uk/index.php/home/browse/all',)
strephit.web_sources_corpus.spiders.nndb_com module
class strephit.web_sources_corpus.spiders.nndb_com.NndbComSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.nndb.com']
detail_page_selectors = 'xpath:.//a[contains(@href, "http://www.nndb.com/people/")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean:xpath:.//td/font/b/text()'}
list_page_selectors = 'xpath:.//a[@class="newslink"]/@href'
name = 'nndb_com'
refine_item(response, item)
start_urls = ('http://www.nndb.com/',)
strephit.web_sources_corpus.spiders.parliament_uk module
class strephit.web_sources_corpus.spiders.parliament_uk.ParliamentUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.parliament.uk']
clean_name(response, name)
detail_page_selectors = 'xpath:.//table//tr/td/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'name': 'clean_name:clean:xpath:.//div[@id="commons-biography-header"]/h1//text()'}
list_page_selectors = None
name = 'parliament_uk'
next_page_selectors = None
refine_item(response, item)
start_urls = ('http://www.parliament.uk/mps-lords-and-offices/mps/',)
strephit.web_sources_corpus.spiders.portraits_and_sketches module
class strephit.web_sources_corpus.spiders.portraits_and_sketches.PortraitsAndSketchesSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//text()', 'name': 'clean:xpath:(.//div[@class="tiInherit"]/p/span)[1]//text()'}
list_page_selectors = None
name = 'portraits_and_sketches'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.rkd_nl module
class strephit.web_sources_corpus.spiders.rkd_nl.RKDArtistsSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
- A spider for the RKD Netherlands Institute for Art History website
allowed_domains = ['rkd.nl']
detail_page_selectors = 'xpath:.//div[@class="header"]/a/@href'
extract_dl_key_value(dl_pairs, item)
- Feed the item with key-value pairs extracted from <dl> tags (a sketch follows this listing)
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'url': 'make_url:xpath:.//div[@class="record-id"]//text()', 'name': 'clean:xpath:.//h2/text()'}
list_page_selectors = None
make_url(response, artist_id)
name = 'rkd_nl'
next_page_selectors = 'xpath:.//a[@title="Next page"]/@href'
refine_item(response, item)
start_urls = ['https://rkd.nl/en/explore/artists']
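extract_dl_key_value is described above as feeding the item with key-value pairs taken from <dl> tags. A possible shape for such a method is sketched below; the XPath details and the use of the 'other' field are assumptions, not the actual implementation.

    def extract_dl_key_value(self, dl_pairs, item):
        # Hedged sketch: walk the <dt>/<dd> pairs of each definition list and
        # copy them into the item's 'other' dict; the real method may differ.
        other = item.setdefault('other', {})
        for dl in dl_pairs:
            keys = dl.xpath('./dt//text()').extract()
            values = dl.xpath('./dd//text()').extract()
            for key, value in zip(keys, values):
                other[key.strip().rstrip(':')] = value.strip()
        return item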
strephit.web_sources_corpus.spiders.royalsociety_org module
class strephit.web_sources_corpus.spiders.royalsociety_org.RoyalsocietyOrgSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['royalsociety.org']
name = 'royalsociety_org'
parse(response)
parse_fellow(response)
start_requests()
start_urls = ('http://www.royalsociety.org/',)
strephit.web_sources_corpus.spiders.sculpture_uk module
class strephit.web_sources_corpus.spiders.sculpture_uk.SculptureUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['sculpture.gla.ac.uk']
detail_page_selectors = 'xpath:.//div[@class="featured"]/table//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="featured"]/p[child::b][last()]/following-sibling::p//text()', 'death': 'clean:xpath:.//b[.="Died"]/following-sibling::text()[1]', 'name': 'clean:xpath:.//div[@class="featured"]/h1//text()', 'birth': 'clean:xpath:.//b[.="Born"]/following-sibling::text()[1]'}
list_page_selectors = 'xpath:.//div[@class="featuredpeople"]//a/@href'
name = 'sculpture_uk'
next_page_selectors = None
refine_item(response, item)
start_urls = ('http://sculpture.gla.ac.uk/browse/index.php',)
strephit.web_sources_corpus.spiders.structurae_net module
class strephit.web_sources_corpus.spiders.structurae_net.StructuraeNetSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['structurae.net']
detail_page_selectors = 'xpath:.//ol[@class="searchlist"]//a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'other': {'bibliography': 'xpath:.//div[@id="person-bibliography"]//li/a/@href', 'publications': 'xpath:.//div[@id="person-literature"]//li//a/@href', 'websites': 'xpath:.//div[@id="person-websites"]//li/a/@href', 'participated_in': 'xpath:.//div[@id="person-references"]//a/@href'}, 'name': 'clean:xpath:.//h1/span[@itemprop="name"]//text()'}
list_page_selectors = 'xpath:.//ol[@class="commalist"]//a/@href'
name = 'structurae_net'
next_page_selectors = 'xpath:(.//div[@class="nextPageNav"])[1]//a[1]/@href'
refine_item(response, item)
start_urls = ('http://structurae.net/persons/',)
strephit.web_sources_corpus.spiders.vocab_getty_edu module
class strephit.web_sources_corpus.spiders.vocab_getty_edu.VocabGettyEduSpider(name=None, **kwargs)
- Bases: "scrapy.spiders.Spider"
allowed_domains = ['vocab.getty.edu']
completed_queries = set([])
db_connection = <sqlite3.Connection object>
finalize_data(table)
- This method is called after *table* has been populated. Once all tables have been populated, it joins them and yields the polished items (a sketch of this approach follows the listing).
load_into_db(table)
name = 'vocab_getty_edu'
queries = [('name', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fname%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++gvp%3AprefLabelGVP+%3Flabel.%0D%0A%3Flabel+gvp%3Aterm+%3Fname%0D%0A%7D&_implicit=false&_equivalent=false&_form=%2Fsparql'), ('bio', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbio2%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++skos%3AscopeNote+%3Fnote.%0D%0A+%3Fnote+rdf%3Avalue+%3Fbio2.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('bio2', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FshortBio%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Adescription+%3FshortBio.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('nationality', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fnationality%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AnationalityPreferred+%3Fny.%0D%0A+%3Fny+gvp%3AprefLabelGVP+%3FlblNationality.%0D%0A+%3FlblNationality+gvp%3Aterm+%3Fnationality.+%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('birth_year', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbirth%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestStart+%3Fbirth.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('birth_place', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FdeathPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AdeathPlace+%3Fdpf.%0D%0A+%3Fdp+foaf%3Afocus+%3Fdpf%3B%0D%0A++++++gvp%3AparentString+%3FdeathPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'), ('death_year', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fdeath%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestEnd+%3Fdeath%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('death_place', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FbirthPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AbirthPlace+%3Fbpf.%0D%0A+%3Fbp+foaf%3Afocus+%3Fbpf%3B%0D%0A++++++gvp%3AparentString+%3FbirthPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'), ('gender', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fgender%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Agender+%3Fgender%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql')]
row_to_item(row)
- Converts a single row, the result of the join between all tables, into a finished item
start_requests()
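This spider downloads the CSV results of the SPARQL queries listed above, loads each result set into its own SQLite table (load_into_db), and, once every query has completed, joins the tables on the person URI and turns the rows into items (finalize_data, row_to_item). Below is a hedged sketch of that approach with simplified table and column names; the real schema, join logic, and helper names are assumptions.

    import csv
    import io
    import sqlite3

    def load_into_db(connection, table, csv_text):
        # Hedged sketch: one two-column table per SPARQL query, keyed by the
        # person URI; the real schema and column names may differ.
        cursor = connection.cursor()
        cursor.execute('CREATE TABLE IF NOT EXISTS %s (person TEXT, value TEXT)' % table)
        reader = csv.reader(io.StringIO(csv_text))
        next(reader, None)  # skip the CSV header row
        for row in reader:
            cursor.execute('INSERT INTO %s VALUES (?, ?)' % table, row[:2])
        connection.commit()

    def join_tables(connection):
        # join the mandatory 'name' table with the optional 'bio' table
        query = ('SELECT name.person, name.value, bio.value FROM name '
                 'LEFT JOIN bio ON name.person = bio.person')
        for person, name, bio in connection.execute(query):
            yield {'url': person, 'name': name, 'bio': bio}

    connection = sqlite3.connect(':memory:')  # the spider keeps a similar handle in db_connection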
strephit.web_sources_corpus.spiders.wga_hu module
class strephit.web_sources_corpus.spiders.wga_hu.WgaHuSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['www.wga.hu']
detail_page_selectors = ['xpath:.//table//td[@class="ARTISTLIST"]//a/@href', 'xpath:.//a[starts-with(@href, "/bio/")]/@href']
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//h3[.="Biography"]/following-sibling::p/text()', 'other': {'born-died': 'clean:xpath:.//div[@class="INDEX3"]//text()'}, 'name': 'clean:xpath:.//div[@class="INDEX2"]/text()'}
list_page_selectors = None
name = 'wga_hu'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus.spiders.who_is_who_america module
class strephit.web_sources_corpus.spiders.who_is_who_america.WhoIsWhoAmericaSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div//p[2]//text()', 'name': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div//p/b/a/text()'}
list_page_selectors = 'xpath:.//table[@class="headertemplate"]//tr[3]//a[not(@class="new")]/@href'
name = 'who_is_who_america'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.who_is_who_in_china module
class strephit.web_sources_corpus.spiders.who_is_who_in_china.WhoIsWhoInChinaSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['en.wikisource.org']
detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//a[not(@class="new")]/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean:xpath:.//div[@class="tiInherit"]/following-sibling::p//text()', 'name': 'clean:xpath:(.//p/b)[2]/text()'}
list_page_selectors = None
name = 'who_is_who_in_china'
next_page_selectors = None
refine_item(response, item)
strephit.web_sources_corpus.spiders.yba_llgc_org_uk module
class strephit.web_sources_corpus.spiders.yba_llgc_org_uk.YbaLlgcOrgUkSpider(name=None, **kwargs)
- Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"
allowed_domains = ['yba.llgc.org.uk']
clean_nu(response, strings)
detail_page_selectors = 'xpath:.//div[@id="text"]/p/a/@href'
item_class
- alias of "WebSourcesCorpusItem"
item_fields = {'bio': 'clean_nu:xpath:.//div[@id="text"]//text()', 'other': {'sources': 'clean_nu:xpath:.//div[@id="text"]/div[@class="biog"]/ul/li[@class="bib_item"]//text()', 'contributer': 'clean_nu:xpath:.//div[@id="text"]/p[@class="contributer"]//text()', 'surname': 'clean_nu:xpath:.//div[@id="text"]/span[@class="article_header"]/b/span[@class="surname"]/text()', 'forename': 'clean_nu:xpath:.//div[@id="text"]/span[@class="article_header"]/b/span[@class="forename"]/text()'}}
list_page_selectors = None
name = 'yba_llgc_org_uk'
next_page_selectors = None
refine_item(response, item)
start_requests()
strephit.web_sources_corpus package
Subpackages
- strephit.web_sources_corpus.spiders package
- Submodules
- strephit.web_sources_corpus.spiders.BaseSpider module
- strephit.web_sources_corpus.spiders.academia_net module
- strephit.web_sources_corpus.spiders.american_bio module
- strephit.web_sources_corpus.spiders.australasian_bio module
- strephit.web_sources_corpus.spiders.australian_dictionary_of_biography module
- strephit.web_sources_corpus.spiders.bbc_co_uk module
- strephit.web_sources_corpus.spiders.bio_english_lit module
- strephit.web_sources_corpus.spiders.bishops module
- strephit.web_sources_corpus.spiders.brown_edu module
- strephit.web_sources_corpus.spiders.catholic_encyclopedia module
- strephit.web_sources_corpus.spiders.cesar_org_uk module
- strephit.web_sources_corpus.spiders.chinese_bio module
- strephit.web_sources_corpus.spiders.christian_bio module
- strephit.web_sources_corpus.spiders.cooperhewitt_org module
- strephit.web_sources_corpus.spiders.design_and_art_australia_online module
- strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org module
- strephit.web_sources_corpus.spiders.dnb module
- strephit.web_sources_corpus.spiders.dsi module
- strephit.web_sources_corpus.spiders.english_artists module
- strephit.web_sources_corpus.spiders.freethinkers module
- strephit.web_sources_corpus.spiders.gameo_org module
- strephit.web_sources_corpus.spiders.genealogics module
- strephit.web_sources_corpus.spiders.greek_roman_bio_myth module
- strephit.web_sources_corpus.spiders.indian_bio module
- strephit.web_sources_corpus.spiders.irish_officers module
- strephit.web_sources_corpus.spiders.medical_bio module
- strephit.web_sources_corpus.spiders.men_at_the_bar module
- strephit.web_sources_corpus.spiders.men_of_time module
- strephit.web_sources_corpus.spiders.metal_archives_com module
- strephit.web_sources_corpus.spiders.modern_english_bio module
- strephit.web_sources_corpus.spiders.munksroll module
- strephit.web_sources_corpus.spiders.museothyssen_org module
- strephit.web_sources_corpus.spiders.musicians module
- strephit.web_sources_corpus.spiders.national_bio module
- strephit.web_sources_corpus.spiders.naval_bio module
- strephit.web_sources_corpus.spiders.newulsterbiography_co_uk module
- strephit.web_sources_corpus.spiders.nndb_com module
- strephit.web_sources_corpus.spiders.parliament_uk module
- strephit.web_sources_corpus.spiders.portraits_and_sketches module
- strephit.web_sources_corpus.spiders.rkd_nl module
- strephit.web_sources_corpus.spiders.royalsociety_org module
- strephit.web_sources_corpus.spiders.sculpture_uk module
- strephit.web_sources_corpus.spiders.structurae_net module
- strephit.web_sources_corpus.spiders.vocab_getty_edu module
- strephit.web_sources_corpus.spiders.wga_hu module
- strephit.web_sources_corpus.spiders.who_is_who_america module
- strephit.web_sources_corpus.spiders.who_is_who_in_china module
- strephit.web_sources_corpus.spiders.yba_llgc_org_uk module
strephit.web_sources_corpus.archive_org module
strephit.web_sources_corpus.archive_org.parse_and_save(text, separator, out_file, url)
strephit.web_sources_corpus.britishmuseum_org module
strephit.web_sources_corpus.britishmuseum_org.serialize_person(person)
strephit.web_sources_corpus.items module
class strephit.web_sources_corpus.items.WebSourcesCorpusItem(*args, **kwargs)
- Bases: "scrapy.item.Item"
fields = {'bio': {}, 'death': {}, 'name': {}, 'url': {}, 'other': {}, 'birth': {}}
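For reference, a hedged example of how a spider might fill in this item; the values are invented, and only the fields listed above are accepted keys, since scrapy.Item rejects undeclared fields.

    from strephit.web_sources_corpus.items import WebSourcesCorpusItem

    item = WebSourcesCorpusItem(
        name='Ada Lovelace',                                       # made-up example values
        url='https://en.wikisource.org/wiki/Example_Biography',    # hypothetical source page
        bio='Ada Lovelace was an English mathematician and writer.',
        other={'profession': 'mathematician'},                     # free-form extras go under 'other'
    )
    item['birth'] = '1815'   # fields can also be set after construction
    # item['foo'] = 'bar'    # would raise KeyError: 'foo' is not a declared field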
strephit.web_sources_corpus.pipelines module
class strephit.web_sources_corpus.pipelines.WebSourcesCorpusPipeline
- Bases: "object"
process_item(item, spider)
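The pipeline's behaviour is not documented here, so the snippet below is only a hedged sketch of a typical process_item implementation for this item type: it drops items without a biography and normalizes whitespace. NormalizeBioPipeline is a hypothetical name, not the actual WebSourcesCorpusPipeline.

    from scrapy.exceptions import DropItem

    class NormalizeBioPipeline(object):
        # Hedged sketch in the same spirit as the pipeline above; the real
        # WebSourcesCorpusPipeline may behave differently.
        def process_item(self, item, spider):
            bio = item.get('bio')
            if not bio:
                raise DropItem('missing biography in %s' % item.get('url'))
            if isinstance(bio, list):
                bio = ' '.join(bio)
            item['bio'] = ' '.join(bio.split())  # collapse runs of whitespace
            return item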