public final class RDFProcessors extends Object
Utility methods for creating and composing RDFProcessors.

Field Summary

Modifier and Type | Field and Description
---|---
static RDFProcessor | IDENTITY: The identity RDFProcessor that returns the input RDF stream unchanged.
static RDFProcessor | NIL: The null RDFProcessor that always produces an empty RDF stream.
Method Summary

Modifier and Type | Method and Description
---|---
static RDFProcessor | download(boolean parallelize, boolean preserveBNodes, String endpointURL, String query): Creates an RDFProcessor that retrieves data from a SPARQL endpoint and injects it into the RDF stream at each pass.
static RDFProcessor | inject(RDFSource source): Creates an RDFProcessor that injects into the RDF stream the data loaded from the specified RDFSource.
static RDFProcessor | mapReduce(Mapper mapper, Reducer reducer, boolean deduplicate): Creates an RDFProcessor that processes the RDF stream in a MapReduce fashion.
static RDFProcessor | parallel(SetOperator operator, RDFProcessor... processors): Returns an RDFProcessor performing the parallel composition of the specified processors, using the given SetOperator to merge their results.
static RDFProcessor | parse(boolean tokenize, String... args): Creates an RDFProcessor by parsing the supplied specification string(s).
static RDFProcessor | prefix(Map<String,String> nsToPrefixMap): Creates an RDFProcessor that augments the RDF stream with prefix-to-namespace bindings from the supplied map or from prefix.cc.
static RDFProcessor | rdfs(RDFSource tbox, org.openrdf.model.Resource tboxContext, boolean decomposeOWLAxioms, boolean dropBNodeTypes, String... excludedRules): Creates an RDFProcessor that computes the RDFS closure of the RDF stream, based on the separately supplied TBox.
static RDFProcessor | read(boolean parallelize, boolean preserveBNodes, String baseURI, org.openrdf.rio.ParserConfig config, String... locations): Creates an RDFProcessor that reads data from the specified files and injects it into the RDF stream at each pass.
static RDFProcessor | rules(Ruleset ruleset, Mapper mapper, boolean dropBNodeTypes, boolean deduplicate): Returns an RDFProcessor that applies the specified ruleset to input statements, either as a whole or partitioned based on an optional Mapper.
static RDFProcessor | rules(Ruleset ruleset, Mapper mapper, boolean dropBNodeTypes, boolean deduplicate, RDFSource tboxData, boolean emitTBox, org.openrdf.model.URI tboxContext): Returns an RDFProcessor that expands the ruleset based on the supplied TBox and applies the resulting ruleset to input statements, either as a whole or partitioned based on an optional Mapper.
static RDFProcessor | sequence(RDFProcessor... processors): Returns an RDFProcessor performing the sequence composition of the supplied RDFProcessors.
static RDFProcessor | smush(String... rankedNamespaces): Creates an RDFProcessor performing owl:sameAs smushing.
static RDFProcessor | stats(String outputNamespace, org.openrdf.model.URI sourceProperty, org.openrdf.model.URI sourceContext, Long threshold, boolean processCooccurrences): Creates an RDFProcessor extracting VOID structural statistics from the RDF stream.
static RDFProcessor | tbox(): Returns an RDFProcessor that extracts the TBox of the data in the RDF stream.
static RDFProcessor | tee(org.openrdf.rio.RDFHandler... handlers): Creates an RDFProcessor that duplicates the data of the RDF stream to the specified RDFHandlers.
static RDFProcessor | track(Tracker tracker): Returns an RDFProcessor that tracks the number of statements flowing through it using the supplied Tracker object.
static RDFProcessor | transform(Transformer transformer): Returns an RDFProcessor that applies the supplied Transformer to each input triple, producing the transformed triples in output.
static RDFProcessor | unique(boolean mergeContexts): Creates an RDFProcessor that removes duplicates from the RDF stream, optionally merging similar statements with different contexts into a unique statement.
static RDFProcessor | upload(String endpointURL): Creates an RDFProcessor that uploads the data of the RDF stream to the specified SPARQL endpoint, using SPARQL Update INSERT DATA calls.
static RDFProcessor | write(org.openrdf.rio.WriterConfig config, int chunkSize, String... locations): Creates an RDFProcessor that writes the data of the RDF stream to the specified files.
Field Detail

public static final RDFProcessor NIL

The null RDFProcessor that always produces an empty RDF stream.

public static final RDFProcessor IDENTITY

The identity RDFProcessor that returns the input RDF stream unchanged.

Method Detail

public static RDFProcessor parse(boolean tokenize, String... args)

Creates an RDFProcessor by parsing the supplied specification string(s). The specification can be already tokenized, or the method can be asked to tokenize it itself (set tokenize = true).

Parameters:
tokenize - true if the input string(s) should be tokenized (again)
args - the input string(s)
Returns:
the created RDFProcessor
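For example, a processor can be built from an untokenized specification string. A minimal sketch, assuming the @-command syntax of the rdfpro command-line language ("@unique" below is illustrative); this snippet, like the following ones, assumes the eu.fbk.rdfpro classes are imported:

```java
// Build a processor from a specification string, asking the
// method to tokenize it itself.
RDFProcessor p = RDFProcessors.parse(true, "@unique");
```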
public static RDFProcessor parallel(SetOperator operator, RDFProcessor... processors)

Returns an RDFProcessor performing the parallel composition of the specified processors, using the given SetOperator to merge their results.

Parameters:
operator - the SetOperator to use for merging the results of the composed processors, not null
processors - the processors to compose in parallel
Returns:
the created RDFProcessor
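A minimal sketch, assuming SetOperator exposes a union constant (written here as SetOperator.UNION; check the actual enum values):

```java
// Extract the TBox and VOID statistics from the same input stream,
// merging the two outputs (SetOperator.UNION is an assumption).
RDFProcessor p = RDFProcessors.parallel(SetOperator.UNION,
        RDFProcessors.tbox(),
        RDFProcessors.stats(null, null, null, null, false));
```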
public static RDFProcessor sequence(RDFProcessor... processors)

Returns an RDFProcessor performing the sequence composition of the supplied RDFProcessors. In a sequence composition, the first processor is applied to the stream first, with its output fed to the next processor, and so on.

Parameters:
processors - the processors to compose in a sequence
Returns:
the created RDFProcessor
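For example, deduplication can be chained before prefix augmentation:

```java
// First remove exact duplicates, then add prefix-to-namespace
// bindings taken from the builtin prefix.cc map.
RDFProcessor pipeline = RDFProcessors.sequence(
        RDFProcessors.unique(false),
        RDFProcessors.prefix(null));
```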
public static RDFProcessor mapReduce(Mapper mapper, Reducer reducer, boolean deduplicate)

Creates an RDFProcessor that processes the RDF stream in a MapReduce fashion. The method is parameterized by a Mapper and a Reducer object, which perform the actual computation, and by a deduplicate flag that controls whether duplicate statements mapped to the same key by the mapper should be merged. MapReduce is performed relying on external sorting: input statements are mapped to a Value key, based on which they are sorted (externally); each key partition is then fed to the reducer and the reducer output is emitted. Hadoop is not involved :-). On the one hand, this scheme is limited to a single-machine environment; on the other hand, it exploits this limitation by using available memory to encode sorted data, thus limiting its volume and speeding up the operation.

Parameters:
mapper - the mapper, not null
reducer - the reducer, not null
deduplicate - true if duplicate statements mapped to the same key should be merged
Returns:
the created RDFProcessor
public static RDFProcessor prefix(@Nullable Map<String,String> nsToPrefixMap)

Creates an RDFProcessor that augments the RDF stream with prefix-to-namespace bindings from the supplied map or from prefix.cc. NOTE: if a map is supplied, it is important that it is not changed externally while the produced RDFProcessor is in use, as this would alter the RDF stream produced at each pass and may cause race conditions.

Parameters:
nsToPrefixMap - the prefix-to-namespace map to use; if null, a builtin map derived from data of prefix.cc will be used
Returns:
the created RDFProcessor
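A minimal sketch, assuming map keys are namespace URIs and values are prefixes, as the parameter name nsToPrefixMap suggests:

```java
// Custom bindings (requires java.util.HashMap / java.util.Map).
Map<String, String> nsToPrefix = new HashMap<>();
nsToPrefix.put("http://xmlns.com/foaf/0.1/", "foaf");
nsToPrefix.put("http://www.w3.org/2000/01/rdf-schema#", "rdfs");
RDFProcessor custom = RDFProcessors.prefix(nsToPrefix);

// Builtin bindings derived from prefix.cc:
RDFProcessor builtin = RDFProcessors.prefix(null);
```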
public static RDFProcessor rdfs(RDFSource tbox, @Nullable org.openrdf.model.Resource tboxContext, boolean decomposeOWLAxioms, boolean dropBNodeTypes, String... excludedRules)

Creates an RDFProcessor that computes the RDFS closure of the RDF stream, based on the separately supplied TBox.

Parameters:
tbox - an RDFSource providing access to TBox data, not null
tboxContext - the context where to emit TBox data; if null, the TBox is not emitted (use SESAME.NIL for emitting data in the default context)
decomposeOWLAxioms - true if simple OWL axioms mappable to RDFS (e.g. owl:equivalentClass) should be decomposed to the corresponding RDFS axioms (OWL axioms are otherwise ignored when computing the closure)
dropBNodeTypes - true if <x rdf:type _:b> statements should not be emitted (as uninformative); note that this option does not prevent these statements from being used for inference (even if dropped), possibly leading to inferred statements that are not dropped
excludedRules - a vararg array with the names of the RDFS rules to exclude; if empty, all the RDFS rules will be used
Returns:
the created RDFProcessor
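For example, the TBox can be loaded with RDFSources.read (the same signature referenced by read(...) below); the file name is illustrative:

```java
// Load TBox data from a local file (sequential parsing, BNodes kept).
RDFSource tbox = RDFSources.read(false, true, null, null, "tbox.ttl");

// Compute the RDFS closure, decomposing simple OWL axioms and
// keeping rdf:type _:b statements, with all RDFS rules enabled;
// the TBox is emitted in the default context
// (SESAME.NIL from org.openrdf.model.vocabulary).
RDFProcessor closure = RDFProcessors.rdfs(tbox, SESAME.NIL, true, false);
```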
public static RDFProcessor smush(String... rankedNamespaces)

Creates an RDFProcessor performing owl:sameAs smushing. A ranked list of namespaces controls the selection of the canonical URI for each coreferring URI cluster. owl:sameAs statements are emitted in output, linking the selected canonical URI to the other entity aliases.

Parameters:
rankedNamespaces - the ranked list of namespaces used to select canonical URIs
Returns:
the created RDFProcessor
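For example (the ranking below is illustrative):

```java
// Prefer DBpedia URIs as canonical identifiers, then Freebase URIs.
RDFProcessor smusher = RDFProcessors.smush(
        "http://dbpedia.org/resource/",
        "http://rdf.freebase.com/ns/");
```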
public static RDFProcessor stats(@Nullable String outputNamespace, @Nullable org.openrdf.model.URI sourceProperty, @Nullable org.openrdf.model.URI sourceContext, @Nullable Long threshold, boolean processCooccurrences)

Creates an RDFProcessor extracting VOID structural statistics from the RDF stream. A VOID dataset is associated to the whole input and to each set of graphs associated to the same 'source' URI through a configurable property, specified by sourceProperty; if parameter sourceContext is not null, these association statements are searched for only in the graph with the specified URI. Class and property partitions are then generated for each of these datasets, assigning them URIs in the namespace given by outputNamespace (if null, a default namespace is used). In addition to standard VOID terms, the processor emits additional statements based on the VOIDX extension vocabulary to express the number of TBox, ABox, rdf:type and owl:sameAs statements, the average number of properties per entity, and informative labels and examples for each TBox term, which are then viewable in tools such as Protégé. Internally, the processor makes use of external sorting to (conceptually) sort the RDF stream twice: first based on the subject, to group statements about the same entity and compute entity-based and distinct-subjects statistics; then based on the object, to compute distinct-objects statistics. Therefore, computing VOID statistics is quite a slow operation.

Parameters:
outputNamespace - the namespace for generated URIs (if null, a default is used)
sourceProperty - the URI of the property linking graphs to sources (if null, sources will not be considered)
sourceContext - the graph where to look for graph-to-source links (if null, they will be searched for in the whole RDF stream)
threshold - the minimum number of statements or entities that a VOID partition must have in order to be emitted; this parameter allows dropping VOID partitions for infrequent concepts, considerably reducing the output size
processCooccurrences - true to enable the analysis of co-occurrences for computing void:classes and void:properties statements
Returns:
the created RDFProcessor
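For example (threshold value illustrative):

```java
// Extract VOID statistics with the default namespace and no source
// handling, keeping only partitions with at least 1000
// statements/entities, and analyzing co-occurrences.
RDFProcessor stats = RDFProcessors.stats(null, null, null, 1000L, true);
```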
public static RDFProcessor tbox()

Returns an RDFProcessor that extracts the TBox of the data in the RDF stream.

Returns:
the created RDFProcessor

public static RDFProcessor transform(Transformer transformer)

Returns an RDFProcessor that applies the supplied Transformer to each input triple, producing the transformed triples in output.

Parameters:
transformer - the transformer, not null
Returns:
the created RDFProcessor
public static RDFProcessor unique(boolean mergeContexts)

Creates an RDFProcessor that removes duplicates from the RDF stream, optionally merging similar statements with different contexts into a unique statement.

Parameters:
mergeContexts - true if statements with the same subject, predicate and object but different contexts should be merged into a single statement, whose context is a combination of the source contexts
Returns:
the created RDFProcessor
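For example:

```java
// Drop exact duplicates, keeping per-context copies:
RDFProcessor perContext = RDFProcessors.unique(false);

// Merge statements differing only in context into one statement:
RDFProcessor merged = RDFProcessors.unique(true);
```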
public static RDFProcessor inject(RDFSource source)

Creates an RDFProcessor that injects into the RDF stream the data loaded from the specified RDFSource. Data is read and injected at every pass on the RDF stream.

Parameters:
source - the RDFSource, not null
Returns:
the created RDFProcessor
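For example, injecting statements loaded from a file at each pass (file name illustrative):

```java
// Build a file-backed RDFSource and inject its data at every pass.
RDFSource extra = RDFSources.read(false, true, null, null, "extra.ttl");
RDFProcessor injector = RDFProcessors.inject(extra);
```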
public static RDFProcessor read(boolean parallelize, boolean preserveBNodes, @Nullable String baseURI, @Nullable org.openrdf.rio.ParserConfig config, String... locations)

Creates an RDFProcessor that reads data from the specified files and injects it into the RDF stream at each pass. This is a utility method that relies on inject(RDFSource), on RDFSources.read(boolean, boolean, String, ParserConfig, String...) and on track(Tracker) for providing progress information on loaded statements.

Parameters:
parallelize - false if files should be parsed sequentially using only one thread
preserveBNodes - true if BNodes in parsed files should be preserved, false if they should be rewritten on a per-file basis to avoid possible clashes
baseURI - the base URI to be used for resolving relative URIs, possibly null
config - the optional ParserConfig for the fine-tuning of the RDF parser used; if null, a default, maximally permissive configuration will be used
locations - the locations of the RDF files to be read
Returns:
the created RDFProcessor
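For example (file names illustrative):

```java
// Parse two files in parallel with the default, maximally
// permissive parser configuration; BNodes are rewritten per file.
RDFProcessor reader = RDFProcessors.read(true, false, null, null,
        "data1.ttl", "data2.nt.gz");
```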
public static RDFProcessor download(boolean parallelize, boolean preserveBNodes, String endpointURL, String query)

Creates an RDFProcessor that retrieves data from a SPARQL endpoint and injects it into the RDF stream at each pass. This is a utility method that relies on inject(RDFSource), on RDFSources.query(boolean, boolean, String, String) and on track(Tracker) for providing progress information on fetched statements. NOTE: as SPARQL does not provide any guarantee on the identifiers of returned BNodes, it may happen that different BNodes are returned in different passes, causing the RDF stream produced by this RDFProcessor to change from one pass to another.

Parameters:
parallelize - true if query results should be handled by multiple threads in parallel
preserveBNodes - true if BNodes in the query result should be preserved, false if they should be rewritten on a per-endpoint basis to avoid possible clashes
endpointURL - the URL of the SPARQL endpoint, not null
query - the SPARQL query (CONSTRUCT or SELECT form) to submit to the endpoint
Returns:
the created RDFProcessor
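For example (endpoint URL and query illustrative):

```java
// Fetch all triples from a SPARQL endpoint at each pass, handling
// results in parallel and preserving BNodes.
RDFProcessor dl = RDFProcessors.download(true, true,
        "http://localhost:9999/sparql",
        "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }");
```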
public static RDFProcessor tee(org.openrdf.rio.RDFHandler... handlers)

Creates an RDFProcessor that duplicates the data of the RDF stream to the specified RDFHandlers. The produced processor can be used to 'peek' into the RDF stream, possibly allowing to fork the stream. Note that RDF data is emitted to the supplied handlers at each pass; if this is not the desired behavior, please wrap the handlers using RDFHandlers.ignorePasses(RDFHandler, int).

Parameters:
handlers - the handlers to duplicate RDF data to
Returns:
the created RDFProcessor
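For example, peeking into the stream by writing a copy to a file (file name illustrative; see RDFHandlers.ignorePasses for the exact meaning of its int argument):

```java
// A file-writing handler (org.openrdf.rio.RDFHandler), wrapped so
// that it does not receive the stream at every pass.
RDFHandler fileSink = RDFHandlers.write(null, 1, "snapshot.ttl");
RDFProcessor peek = RDFProcessors.tee(
        RDFHandlers.ignorePasses(fileSink, 1));
```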
public static RDFProcessor write(@Nullable org.openrdf.rio.WriterConfig config, int chunkSize, String... locations)

Creates an RDFProcessor that writes the data of the RDF stream to the specified files. This is a utility method that relies on tee(RDFHandler...), on RDFHandlers.write(WriterConfig, int, String...) and on track(Tracker) for reporting progress information about written statements. Note that data is written only at the first pass.

Parameters:
config - the optional WriterConfig for fine-tuning the writing process; if null, a default configuration enabling pretty printing will be used
chunkSize - the number of consecutive statements to be written as a single chunk to a single location (increase it to preserve locality)
locations - the locations of the files to write
Returns:
the created RDFProcessor
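For example (file names and chunk size illustrative):

```java
// Write the stream to two files in chunks of 1000 consecutive
// statements, using the default pretty-printing configuration.
RDFProcessor writer = RDFProcessors.write(null, 1000,
        "out1.tql.gz", "out2.tql.gz");
```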
public static RDFProcessor upload(String endpointURL)

Creates an RDFProcessor that uploads the data of the RDF stream to the specified SPARQL endpoint, using SPARQL Update INSERT DATA calls. This is a utility method that relies on tee(RDFHandler...), on RDFHandlers.update(String) and on track(Tracker) for reporting progress information about uploaded statements. Note that data is uploaded only at the first pass.

Parameters:
endpointURL - the URL of the SPARQL Update endpoint, not null
Returns:
the created RDFProcessor
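For example (endpoint URL illustrative):

```java
// Upload the stream via SPARQL Update INSERT DATA calls.
RDFProcessor uploader = RDFProcessors.upload(
        "http://localhost:9999/sparql-update");
```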
public static RDFProcessor track(Tracker tracker)

Returns an RDFProcessor that tracks the number of statements flowing through it using the supplied Tracker object.

Parameters:
tracker - the tracker object
Returns:
an RDFProcessor that tracks the number of RDF statements passing through it

public static RDFProcessor rules(Ruleset ruleset, @Nullable Mapper mapper, boolean dropBNodeTypes, boolean deduplicate)

Returns an RDFProcessor that applies the specified ruleset to input statements, either as a whole or partitioned based on an optional Mapper.

Parameters:
ruleset - the ruleset to apply
mapper - the optional mapper for partitioning input statements, possibly null
dropBNodeTypes - true to drop output rdf:type statements with a BNode object
deduplicate - true to enforce that output statements do not contain duplicates (if false, duplicates might be returned if this enables the rule engine to operate faster)
Returns:
the created RDFProcessor
public static RDFProcessor rules(Ruleset ruleset, @Nullable Mapper mapper, boolean dropBNodeTypes, boolean deduplicate, @Nullable RDFSource tboxData, boolean emitTBox, @Nullable org.openrdf.model.URI tboxContext)

Returns an RDFProcessor that expands the ruleset based on the supplied TBox and applies the resulting ruleset to input statements, either as a whole or partitioned based on an optional Mapper.

Parameters:
ruleset - the ruleset to apply
mapper - the optional mapper for partitioning input statements, possibly null
dropBNodeTypes - true to drop output rdf:type statements with a BNode object
deduplicate - true to enforce that output statements do not contain duplicates (if false, duplicates might be returned if this enables the rule engine to operate faster)
tboxData - the RDFSource of TBox data; null to disable TBox expansion
emitTBox - true to emit TBox data (closed based on the rules in the supplied Ruleset)
tboxContext - the context where to emit closed TBox data; null to emit TBox statements with their original contexts (use SESAME.NIL for emitting TBox data in the default context)
Returns:
the created RDFProcessor
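Putting several of the factories above together, a typical single-machine pipeline might look like the following sketch (all file names illustrative); the composed processor is then executed through the RDFProcessor interface:

```java
// Read input files, merge duplicates across contexts, smush
// owl:sameAs aliases, and write the result to a file.
RDFProcessor pipeline = RDFProcessors.sequence(
        RDFProcessors.read(true, false, null, null, "input.tql.gz"),
        RDFProcessors.unique(true),
        RDFProcessors.smush("http://example.org/"),
        RDFProcessors.write(null, 1000, "output.tql.gz"));
```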
Copyright © 2015–2016 FBK-irst. All rights reserved.