public final class RDFProcessors extends Object
Utility methods for creating and composing RDFProcessors.

Field Summary

Modifier and Type | Field and Description
---|---
static RDFProcessor | IDENTITY: The identity RDFProcessor that returns the input RDF stream unchanged.
static RDFProcessor | NIL: The null RDFProcessor that always produces an empty RDF stream.
Method Summary

Modifier and Type | Method and Description
---|---
static RDFProcessor | download(boolean parallelize, boolean preserveBNodes, String endpointURL, String query): Creates an RDFProcessor that retrieves data from a SPARQL endpoint and injects it into the RDF stream at each pass.
static RDFProcessor | inject(RDFSource source): Creates an RDFProcessor that injects into the RDF stream the data loaded from the specified RDFSource.
static RDFProcessor | mapReduce(Mapper mapper, Reducer reducer, boolean deduplicate): Creates an RDFProcessor that processes the RDF stream in a MapReduce fashion.
static RDFProcessor | parallel(SetOperator operator, RDFProcessor... processors): Returns an RDFProcessor performing the parallel composition of the specified processors, using the given SetOperator to merge their results.
static RDFProcessor | parse(boolean tokenize, String... args): Creates an RDFProcessor by parsing the supplied specification string(s).
static RDFProcessor | prefix(Map<String,String> nsToPrefixMap): Creates an RDFProcessor that augments the RDF stream with prefix-to-namespace bindings from the supplied map or from prefix.cc.
static RDFProcessor | rdfs(RDFSource tbox, org.openrdf.model.Resource tboxContext, boolean decomposeOWLAxioms, boolean dropBNodeTypes, String... excludedRules): Creates an RDFProcessor that computes the RDFS closure of the RDF stream, based on the separately supplied TBox.
static RDFProcessor | read(boolean parallelize, boolean preserveBNodes, String baseURI, org.openrdf.rio.ParserConfig config, String... locations): Creates an RDFProcessor that reads data from the specified files and injects it into the RDF stream at each pass.
static RDFProcessor | rules(Ruleset ruleset, Mapper mapper, boolean dropBNodeTypes, boolean deduplicate): Returns an RDFProcessor that applies the specified ruleset to input statements, either as a whole or partitioned based on an optional Mapper.
static RDFProcessor | rules(Ruleset ruleset, Mapper mapper, boolean dropBNodeTypes, boolean deduplicate, RDFSource tboxData, boolean emitTBox, org.openrdf.model.URI tboxContext): Returns an RDFProcessor that expands the ruleset based on the supplied TBox and applies the resulting ruleset to input statements, either as a whole or partitioned based on an optional Mapper.
static RDFProcessor | sequence(RDFProcessor... processors): Returns an RDFProcessor performing the sequence composition of the supplied RDFProcessors.
static RDFProcessor | smush(String... rankedNamespaces): Creates an RDFProcessor performing owl:sameAs smushing.
static RDFProcessor | stats(String outputNamespace, org.openrdf.model.URI sourceProperty, org.openrdf.model.URI sourceContext, Long threshold, boolean processCooccurrences): Creates an RDFProcessor extracting VOID structural statistics from the RDF stream.
static RDFProcessor | tbox(): Returns an RDFProcessor that extracts the TBox of the data in the RDF stream.
static RDFProcessor | tee(org.openrdf.rio.RDFHandler... handlers): Creates an RDFProcessor that duplicates the data of the RDF stream to the specified RDFHandlers.
static RDFProcessor | track(Tracker tracker): Returns an RDFProcessor that tracks the number of statements flowing through it using the supplied Tracker object.
static RDFProcessor | transform(Transformer transformer): Returns an RDFProcessor that applies the supplied Transformer to each input triple, producing the transformed triples in output.
static RDFProcessor | unique(boolean mergeContexts): Creates an RDFProcessor that removes duplicates from the RDF stream, optionally merging similar statements with different contexts into a unique statement.
static RDFProcessor | upload(String endpointURL): Creates an RDFProcessor that uploads the data of the RDF stream to the specified SPARQL endpoint, using SPARQL Update INSERT DATA calls.
static RDFProcessor | write(org.openrdf.rio.WriterConfig config, int chunkSize, String... locations): Creates an RDFProcessor that writes the data of the RDF stream to the specified files.
Field Detail

public static final RDFProcessor NIL

The null RDFProcessor that always produces an empty RDF stream.

public static final RDFProcessor IDENTITY

The identity RDFProcessor that returns the input RDF stream unchanged.

Method Detail

public static RDFProcessor parse(boolean tokenize, String... args)

Creates an RDFProcessor by parsing the supplied specification string(s). The specification can be already tokenized, or the method can be asked to tokenize it itself (set tokenize = true).

Parameters:
tokenize - true if the input string(s) should be tokenized (again)
args - the input string(s)
Returns:
the created RDFProcessor
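For example, a processor can be built from an untokenized specification string. A minimal sketch, assuming the @-command syntax of the rdfpro command-line language ("@unique" below is illustrative); this snippet, like the following ones, assumes the eu.fbk.rdfpro classes are imported:

```java
// Build a processor from a specification string, asking the
// method to tokenize it itself.
RDFProcessor p = RDFProcessors.parse(true, "@unique");
```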
public static RDFProcessor parallel(SetOperator operator, RDFProcessor... processors)

Returns an RDFProcessor performing the parallel composition of the specified processors, using the given SetOperator to merge their results.

Parameters:
operator - the SetOperator to use for merging the results of the composed processors, not null
processors - the processors to compose in parallel
Returns:
the created RDFProcessor
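A minimal sketch, assuming SetOperator exposes a union constant (written here as SetOperator.UNION; check the actual enum values):

```java
// Extract the TBox and VOID statistics from the same input stream,
// merging the two outputs (SetOperator.UNION is an assumption).
RDFProcessor p = RDFProcessors.parallel(SetOperator.UNION,
        RDFProcessors.tbox(),
        RDFProcessors.stats(null, null, null, null, false));
```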
public static RDFProcessor sequence(RDFProcessor... processors)

Returns an RDFProcessor performing the sequence composition of the supplied RDFProcessors. In a sequence composition, the first processor is applied to the stream first, with its output fed to the next processor, and so on.

Parameters:
processors - the processors to compose in a sequence
Returns:
the created RDFProcessor
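For example, deduplication can be chained before prefix augmentation:

```java
// First remove exact duplicates, then add prefix-to-namespace
// bindings taken from the builtin prefix.cc map.
RDFProcessor pipeline = RDFProcessors.sequence(
        RDFProcessors.unique(false),
        RDFProcessors.prefix(null));
```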
public static RDFProcessor mapReduce(Mapper mapper, Reducer reducer, boolean deduplicate)

Creates an RDFProcessor that processes the RDF stream in a MapReduce fashion. The method is parameterized by a Mapper and a Reducer object, which perform the actual computation, and by a deduplicate flag that controls whether duplicate statements mapped to the same key by the mapper should be merged. MapReduce is performed relying on external sorting: input statements are mapped to a Value key, based on which they are sorted (externally); each key partition is then fed to the reducer and the reducer output is emitted. Hadoop is not involved :-). On the one hand, this scheme is limited to a single-machine environment; on the other hand, it exploits this limitation by using available memory to encode sorted data, thus limiting its volume and speeding up the operation.

Parameters:
mapper - the mapper, not null
reducer - the reducer, not null
deduplicate - true if duplicate statements mapped to the same key should be merged
Returns:
the created RDFProcessor
public static RDFProcessor prefix(@Nullable Map<String,String> nsToPrefixMap)

Creates an RDFProcessor that augments the RDF stream with prefix-to-namespace bindings from the supplied map or from prefix.cc. NOTE: if a map is supplied, it is important that it is not changed externally while the produced RDFProcessor is in use, as this would alter the RDF stream produced at each pass and may cause race conditions.

Parameters:
nsToPrefixMap - the prefix-to-namespace map to use; if null, a builtin map derived from data of prefix.cc will be used
Returns:
the created RDFProcessor
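A minimal sketch, assuming map keys are namespace URIs and values are prefixes, as the parameter name nsToPrefixMap suggests:

```java
// Custom bindings (requires java.util.HashMap / java.util.Map).
Map<String, String> nsToPrefix = new HashMap<>();
nsToPrefix.put("http://xmlns.com/foaf/0.1/", "foaf");
nsToPrefix.put("http://www.w3.org/2000/01/rdf-schema#", "rdfs");
RDFProcessor custom = RDFProcessors.prefix(nsToPrefix);

// Builtin bindings derived from prefix.cc:
RDFProcessor builtin = RDFProcessors.prefix(null);
```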
public static RDFProcessor rdfs(RDFSource tbox, @Nullable org.openrdf.model.Resource tboxContext, boolean decomposeOWLAxioms, boolean dropBNodeTypes, String... excludedRules)

Creates an RDFProcessor that computes the RDFS closure of the RDF stream, based on the separately supplied TBox.

Parameters:
tbox - an RDFSource providing access to TBox data, not null
tboxContext - the context where to emit TBox data; if null, the TBox is not emitted (use SESAME.NIL for emitting data in the default context)
decomposeOWLAxioms - true if simple OWL axioms mappable to RDFS (e.g. owl:equivalentClass) should be decomposed to the corresponding RDFS axioms (OWL axioms are otherwise ignored when computing the closure)
dropBNodeTypes - true if <x rdf:type _:b> statements should not be emitted (as uninformative); note that this option does not prevent these statements from being used for inference (even if dropped), possibly leading to inferred statements that are not dropped
excludedRules - a vararg array with the names of the RDFS rules to exclude; if empty, all the RDFS rules will be used
Returns:
the created RDFProcessor
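For example, the TBox can be loaded with RDFSources.read (the same signature referenced by read(...) below); the file name is illustrative:

```java
// Load TBox data from a local file (sequential parsing, BNodes kept).
RDFSource tbox = RDFSources.read(false, true, null, null, "tbox.ttl");

// Compute the RDFS closure, decomposing simple OWL axioms and
// keeping rdf:type _:b statements, with all RDFS rules enabled;
// the TBox is emitted in the default context
// (SESAME.NIL from org.openrdf.model.vocabulary).
RDFProcessor closure = RDFProcessors.rdfs(tbox, SESAME.NIL, true, false);
```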
public static RDFProcessor smush(String... rankedNamespaces)

Creates an RDFProcessor performing owl:sameAs smushing. A ranked list of namespaces controls the selection of the canonical URI for each coreferring URI cluster. owl:sameAs statements are emitted in output, linking the selected canonical URI to the other entity aliases.

Parameters:
rankedNamespaces - the ranked list of namespaces used to select canonical URIs
Returns:
the created RDFProcessor
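For example (the ranking below is illustrative):

```java
// Prefer DBpedia URIs as canonical identifiers, then Freebase URIs.
RDFProcessor smusher = RDFProcessors.smush(
        "http://dbpedia.org/resource/",
        "http://rdf.freebase.com/ns/");
```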
public static RDFProcessor stats(@Nullable String outputNamespace, @Nullable org.openrdf.model.URI sourceProperty, @Nullable org.openrdf.model.URI sourceContext, @Nullable Long threshold, boolean processCooccurrences)

Creates an RDFProcessor extracting VOID structural statistics from the RDF stream. A VOID dataset is associated to the whole input and to each set of graphs associated to the same 'source' URI through a configurable property, specified by sourceProperty; if parameter sourceContext is not null, these association statements are searched for only in the graph with the specified URI. Class and property partitions are then generated for each of these datasets, assigning them URIs in the namespace given by outputNamespace (if null, a default namespace is used). In addition to standard VOID terms, the processor emits additional statements based on the VOIDX extension vocabulary to express the number of TBox, ABox, rdf:type and owl:sameAs statements, the average number of properties per entity, and informative labels and examples for each TBox term, which are then viewable in tools such as Protégé. Internally, the processor makes use of external sorting to (conceptually) sort the RDF stream twice: first based on the subject, to group statements about the same entity and compute entity-based and distinct-subjects statistics; then based on the object, to compute distinct-objects statistics. Therefore, computing VOID statistics is quite a slow operation.

Parameters:
outputNamespace - the namespace for generated URIs (if null, a default is used)
sourceProperty - the URI of the property linking graphs to sources (if null, sources will not be considered)
sourceContext - the graph where to look for graph-to-source links (if null, they will be searched for in the whole RDF stream)
threshold - the minimum number of statements or entities that a VOID partition must have in order to be emitted; this parameter allows dropping VOID partitions for infrequent concepts, considerably reducing the output size
processCooccurrences - true to enable the analysis of co-occurrences for computing void:classes and void:properties statements
Returns:
the created RDFProcessor
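For example (threshold value illustrative):

```java
// Extract VOID statistics with the default namespace and no source
// handling, keeping only partitions with at least 1000
// statements/entities, and analyzing co-occurrences.
RDFProcessor stats = RDFProcessors.stats(null, null, null, 1000L, true);
```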
public static RDFProcessor tbox()

Returns an RDFProcessor that extracts the TBox of the data in the RDF stream.

Returns:
the created RDFProcessor

public static RDFProcessor transform(Transformer transformer)

Returns an RDFProcessor that applies the supplied Transformer to each input triple, producing the transformed triples in output.

Parameters:
transformer - the transformer, not null
Returns:
the created RDFProcessor
public static RDFProcessor unique(boolean mergeContexts)

Creates an RDFProcessor that removes duplicates from the RDF stream, optionally merging similar statements with different contexts into a unique statement.

Parameters:
mergeContexts - true if statements with the same subject, predicate and object but different contexts should be merged into a single statement, whose context is a combination of the source contexts
Returns:
the created RDFProcessor
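For example:

```java
// Drop exact duplicates, keeping per-context copies:
RDFProcessor perContext = RDFProcessors.unique(false);

// Merge statements differing only in context into one statement:
RDFProcessor merged = RDFProcessors.unique(true);
```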
public static RDFProcessor inject(RDFSource source)

Creates an RDFProcessor that injects into the RDF stream the data loaded from the specified RDFSource. Data is read and injected at every pass on the RDF stream.

Parameters:
source - the RDFSource, not null
Returns:
the created RDFProcessor
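For example, injecting statements loaded from a file at each pass (file name illustrative):

```java
// Build a file-backed RDFSource and inject its data at every pass.
RDFSource extra = RDFSources.read(false, true, null, null, "extra.ttl");
RDFProcessor injector = RDFProcessors.inject(extra);
```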
public static RDFProcessor read(boolean parallelize, boolean preserveBNodes, @Nullable String baseURI, @Nullable org.openrdf.rio.ParserConfig config, String... locations)

Creates an RDFProcessor that reads data from the specified files and injects it into the RDF stream at each pass. This is a utility method that relies on inject(RDFSource), on RDFSources.read(boolean, boolean, String, ParserConfig, String...) and on track(Tracker) for providing progress information on loaded statements.

Parameters:
parallelize - false if files should be parsed sequentially using only one thread
preserveBNodes - true if BNodes in parsed files should be preserved, false if they should be rewritten on a per-file basis to avoid possible clashes
baseURI - the base URI to be used for resolving relative URIs, possibly null
config - the optional ParserConfig for the fine-tuning of the RDF parser used; if null, a default, maximally permissive configuration will be used
locations - the locations of the RDF files to be read
Returns:
the created RDFProcessor
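For example (file names illustrative):

```java
// Parse two files in parallel with the default, maximally
// permissive parser configuration; BNodes are rewritten per file.
RDFProcessor reader = RDFProcessors.read(true, false, null, null,
        "data1.ttl", "data2.nt.gz");
```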
public static RDFProcessor download(boolean parallelize, boolean preserveBNodes, String endpointURL, String query)

Creates an RDFProcessor that retrieves data from a SPARQL endpoint and injects it into the RDF stream at each pass. This is a utility method that relies on inject(RDFSource), on RDFSources.query(boolean, boolean, String, String) and on track(Tracker) for providing progress information on fetched statements. NOTE: as SPARQL does not provide any guarantee on the identifiers of returned BNodes, it may happen that different BNodes are returned in different passes, causing the RDF stream produced by this RDFProcessor to change from one pass to another.

Parameters:
parallelize - true if query results should be handled by multiple threads in parallel
preserveBNodes - true if BNodes in the query result should be preserved, false if they should be rewritten on a per-endpoint basis to avoid possible clashes
endpointURL - the URL of the SPARQL endpoint, not null
query - the SPARQL query (CONSTRUCT or SELECT form) to submit to the endpoint
Returns:
the created RDFProcessor
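For example (endpoint URL and query illustrative):

```java
// Fetch all triples from a SPARQL endpoint at each pass, handling
// results in parallel and preserving BNodes.
RDFProcessor dl = RDFProcessors.download(true, true,
        "http://localhost:9999/sparql",
        "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }");
```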
public static RDFProcessor tee(org.openrdf.rio.RDFHandler... handlers)

Creates an RDFProcessor that duplicates the data of the RDF stream to the specified RDFHandlers. The produced processor can be used to 'peek' into the RDF stream, possibly allowing to fork the stream. Note that RDF data is emitted to the supplied handlers at each pass; if this is not the desired behavior, please wrap the handlers using RDFHandlers.ignorePasses(RDFHandler, int).

Parameters:
handlers - the handlers to duplicate RDF data to
Returns:
the created RDFProcessor
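For example, peeking into the stream by writing a copy to a file (file name illustrative; see RDFHandlers.ignorePasses for the exact meaning of its int argument):

```java
// A file-writing handler (org.openrdf.rio.RDFHandler), wrapped so
// that it does not receive the stream at every pass.
RDFHandler fileSink = RDFHandlers.write(null, 1, "snapshot.ttl");
RDFProcessor peek = RDFProcessors.tee(
        RDFHandlers.ignorePasses(fileSink, 1));
```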
public static RDFProcessor write(@Nullable org.openrdf.rio.WriterConfig config, int chunkSize, String... locations)

Creates an RDFProcessor that writes the data of the RDF stream to the specified files. This is a utility method that relies on tee(RDFHandler...), on RDFHandlers.write(WriterConfig, int, String...) and on track(Tracker) for reporting progress information about written statements. Note that data is written only at the first pass.

Parameters:
config - the optional WriterConfig for fine-tuning the writing process; if null, a default configuration enabling pretty printing will be used
chunkSize - the number of consecutive statements to be written as a single chunk to a single location (increase it to preserve locality)
locations - the locations of the files to write
Returns:
the created RDFProcessor
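For example (file names and chunk size illustrative):

```java
// Write the stream to two files in chunks of 1000 consecutive
// statements, using the default pretty-printing configuration.
RDFProcessor writer = RDFProcessors.write(null, 1000,
        "out1.tql.gz", "out2.tql.gz");
```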
public static RDFProcessor upload(String endpointURL)

Creates an RDFProcessor that uploads the data of the RDF stream to the specified SPARQL endpoint, using SPARQL Update INSERT DATA calls. This is a utility method that relies on tee(RDFHandler...), on RDFHandlers.update(String) and on track(Tracker) for reporting progress information about uploaded statements. Note that data is uploaded only at the first pass.

Parameters:
endpointURL - the URL of the SPARQL Update endpoint, not null
Returns:
the created RDFProcessor
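For example (endpoint URL illustrative):

```java
// Upload the stream via SPARQL Update INSERT DATA calls.
RDFProcessor uploader = RDFProcessors.upload(
        "http://localhost:9999/sparql-update");
```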
public static RDFProcessor track(Tracker tracker)

Returns an RDFProcessor that tracks the number of statements flowing through it using the supplied Tracker object.

Parameters:
tracker - the tracker object
Returns:
an RDFProcessor that tracks the number of RDF statements passing through it

public static RDFProcessor rules(Ruleset ruleset, @Nullable Mapper mapper, boolean dropBNodeTypes, boolean deduplicate)

Returns an RDFProcessor that applies the specified ruleset to input statements, either as a whole or partitioned based on an optional Mapper.

Parameters:
ruleset - the ruleset to apply
mapper - the optional mapper for partitioning input statements, possibly null
dropBNodeTypes - true to drop output rdf:type statements with a BNode object
deduplicate - true to enforce that output statements do not contain duplicates (if false, duplicates might be returned if this enables the rule engine to operate faster)
Returns:
the created RDFProcessor
public static RDFProcessor rules(Ruleset ruleset, @Nullable Mapper mapper, boolean dropBNodeTypes, boolean deduplicate, @Nullable RDFSource tboxData, boolean emitTBox, @Nullable org.openrdf.model.URI tboxContext)

Returns an RDFProcessor that expands the ruleset based on the supplied TBox and applies the resulting ruleset to input statements, either as a whole or partitioned based on an optional Mapper.

Parameters:
ruleset - the ruleset to apply
mapper - the optional mapper for partitioning input statements, possibly null
dropBNodeTypes - true to drop output rdf:type statements with a BNode object
deduplicate - true to enforce that output statements do not contain duplicates (if false, duplicates might be returned if this enables the rule engine to operate faster)
tboxData - the RDFSource of TBox data; null to disable TBox expansion
emitTBox - true to emit TBox data (closed based on the rules in the supplied Ruleset)
tboxContext - the context where to emit closed TBox data; null to emit TBox statements with their original contexts (use SESAME.NIL for emitting TBox data in the default context)
Returns:
the created RDFProcessor
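Putting several of the factories above together, a typical single-machine pipeline might look like the following sketch (all file names illustrative); the composed processor is then executed through the RDFProcessor interface:

```java
// Read input files, merge duplicates across contexts, smush
// owl:sameAs aliases, and write the result to a file.
RDFProcessor pipeline = RDFProcessors.sequence(
        RDFProcessors.read(true, false, null, null, "input.tql.gz"),
        RDFProcessors.unique(true),
        RDFProcessors.smush("http://example.org/"),
        RDFProcessors.write(null, 1000, "output.tql.gz"));
```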
Copyright © 2015–2016 FBK-irst. All rights reserved.