Semantic Data Modeling For Fun and Profit

Data modeling is usually one of those subjects that make people's eyes glaze over. It's not really programming, though understanding programming concepts such as objects, inheritance, polymorphism and similar multisyllabic words is usually helpful to do modeling. It's not a business analyst function, though most BAs end up participating in the modeling process. Perhaps the best way of thinking about modeling is to see it as a way to describe a business in clearly defined pieces.

From a business perspective, data modeling done right can prove quite valuable. Done badly, it can prove worse than useless, since a big part of the value of a model lies in using it to make predictions. A bad model will give you bad predictions, and this in turn can mean poor investments, bad hires, missed opportunities and expensive code rework.

Most people create bad models. A big reason for that is that they focus on characteristics, properties and derived content first. When you ask "How much profit did we make in the fourth quarter?" your focus is on a calculated product, not the specific agents that made that revenue possible in the first place. A better question to ask is "What are the things in my business that are ultimately responsible for determining my profit?"

For instance, consider this question from the perspective of a software consulting company. Such a company has account managers who attempt to interest customers (potential clients) in using the services of its consultants in order to solve problems. The consultants can bill a certain rate, but also need to be paid a different rate. If the consultants can, over the course of the project, bill more money than they are paid, then the company makes money. However, the consulting company also has support personnel who have to be paid as well.

Finally, you have contracts, which stipulate either a time and materials (T&M) bid or a fixed cost accounting bid. Consultants love the former, clients love the latter. In reality, no project is wholly T&M - if a project goes on for too long without producing results, it will be cancelled, often at a significant penalty.

Now this is a very simple model, but it is just complex enough to showcase a few basic features of effective modeling. In a conceptual, or semantic, framework, you're looking for classes of things. Thus, we have an AccountMgr class, a Consultant class, and a Support class, along with prospective and secured clients and two kinds of contracts - TandM and FixedBid.

A lot of people will stop modeling there, but there are actually a few simplifications that could be made. For starters, the questions that may be asked include "Who are our most productive sales reps and consultants?" At a minimum we need enough information to identify these people - what are their names, where do they live, how do I contact them? This implies that we need to add more information to the model.

However, it's also worth noting that sales reps and consultants will generally share this kind of information - name, address, contact info, demographic data. In other words, both sales reps and consultants are people (as are support personnel and the business contacts within a client organization). Sales rep and consultant are two different "roles" that a given person can handle, and if we can create a generic (or abstract) person class, then we can extend this class for each particular job type. Similarly, one can create roles for organizations (prospect vs. active) and for generalized contracts.

There's actually one more major thing to model - periods of time. For instance, within a given quarter, a sales rep may deal with multiple prospects, may convert some of them to clients, and may have clients move back into prospects upon completion of work. Similarly, a consultant may consult with multiple companies, usually one at a time, but for relatively simple questions such as revenue profit or loss, it's worth thinking about the consultant working for multiple clients for specific billable hours in that quarter.

This means that a quarter of a year becomes a class of object - it has a specific identifier (F3Q2015, for the 3rd quarter in fiscal year 2015). It also suggests that for a client, both billable rate and billable hours become important properties that we have to track, but for this example, file that away for later consideration.

Note throughout this example the use of simplifying assumptions. Building a good model ultimately requires not only identifying those things in your business that may affect a desired set of information, but also pruning and simplifying those things that will not apply in 80% of the use cases. You can always find exceptions to the rules, but a good model applies triage - identify those situations where the patient can wait or where the patient is too far gone to be saved and push those aside, handle the cases that require immediate care from the remaining pool, then tackle the critical cases before treating those with lighter injuries. In a modeling context, tackle the most commonly occurring use cases first, then handle those that require serious engineering, then finally take care of the edge cases that may help refine the model but won't necessarily cause it to change significantly.

This is where it's useful to start pulling out a bit of RDF for modeling. Turtle is a very terse notation for expressing triple assertions, and it's frequently a useful way to test a model before getting deep into the logical guts of that model. For instance, I will typically identify a generic Entity class (one that contains useful descriptor properties), then build up from there.

    class:Entity        rdfs:subClassOf owl:Class.
    class:Person        rdfs:subClassOf class:Entity.
    class:AccountMgr    rdfs:subClassOf class:Person.
    class:Consultant    rdfs:subClassOf class:Person.
    class:Support       rdfs:subClassOf class:Person.
    class:Org           rdfs:subClassOf class:Entity.
                        rdfs:subClassOf class:Org.
    class:Contract      rdfs:subClassOf class:Entity.
    class:TnMContract   rdfs:subClassOf class:Contract.
                        rdfs:subClassOf class:Contract.
    class:Period        rdfs:subClassOf class:Entity.
                        rdfs:subClassOf class:Period.
                        rdfs:subClassOf class:Period.


Figure 1. Inheritance Relationships.

Subclass relationships are very useful. If you have a personal name defined for the Person class, then anything that inherits from Person (such as Consultant) will automatically support a prop:personName property whose value is an instance of a PersonalName class.

    class:PersonalName rdfs:subClassOf owl:Class.
    prop:personName rdfs:subClassOf owl:Property;
        rdfs:domain class:Person;
        rdfs:range class:PersonalName.

The domain and range assertions say that the predicate (or relationship) prop:personName may be used only when the subject, such as the consultant "Jane Doe", is a person and the object on the right-hand side is an instance of the PersonalName class. For example,

    consultant:JaneDoe
        a   class:Consultant;
        prop:employeeID "JANDOE1";
        prop:personName [
            a  class:PersonalName;
            prop:givenName  "Jane";
            prop:surname "Doe"
        ].

The first line after the class declaration is an employee ID, which will typically be a string. URI identifiers are universal, but they do not necessarily line up with the IDs used in existing databases, which is why the employee ID is kept as a separate property.
An anonymous instance (the thing within the square brackets) doesn't have an explicit identifier. A person may have multiple names, but if the person is removed from the database, the names should be removed as well. This is an example of composition in modeling. Personal names generally do not exist outside of the thing that they are naming, and semantically are considered a label - something that is useful to a human, but that internally isn't that important (the relationship of having a label is important, on the other hand, especially for user interfaces).

The two fields prop:givenName and prop:surname would be defined in a similar manner to how the prop:personName property was created:

    prop:givenName rdfs:subClassOf owl:Property;
        rdfs:domain class:PersonalName;
        rdfs:range xs:string.

    prop:surname rdfs:subClassOf owl:Property;
        rdfs:domain class:PersonalName;
        rdfs:range xs:string.

Something worth noting about the advantages of having a semantic model - in the statement:

    prop:personName [
        a  class:PersonalName;
        prop:givenName  "Jane";
        prop:surname "Doe"
    ]

it's really not necessary to indicate the class, e.g.,

    prop:personName [
        prop:givenName  "Jane";
        prop:surname "Doe"
    ]

Why isn't it? It turns out that we can infer this based upon the data model we're constructing. In English, this would be reasoned out as follows:

Because consultant:JaneDoe is a class:Consultant, she is also (by the way subclassing works) a class:Person. The property prop:personName has a domain of class:Person (or its descendants) and a range of class:PersonalName. Since class:PersonalName has no subclasses (yet), the anonymous instance must be of class:PersonalName.
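The same chain of reasoning can be sketched outside a triple store. What follows is a minimal illustration in plain Python, not how a semantic database actually performs inference. The dictionaries and the function names superclasses and infer_object_type are invented for this sketch; only the class and property names come from the example above.

```python
# Hypothetical in-memory stand-ins for the schema assertions in the article.
SUBCLASS_OF = {
    "class:Consultant": "class:Person",
    "class:Person": "class:Entity",
}
DOMAIN = {"prop:personName": "class:Person"}
RANGE = {"prop:personName": "class:PersonalName"}


def superclasses(cls):
    """Return cls followed by each of its ancestors, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def infer_object_type(subject_type, prop):
    """Return the range class of prop when subject_type (or one of its
    ancestors) matches the property's domain, otherwise None."""
    if DOMAIN[prop] in superclasses(subject_type):
        return RANGE[prop]
    return None


# Jane Doe is declared only as a Consultant; the type of her name is inferred.
inferred = infer_object_type("class:Consultant", "prop:personName")
print(inferred)
```

The point of the sketch is that nothing about Jane's name had to be typed explicitly: one domain and one range assertion on the property, plus the subclass chain, are enough.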

You could also discover this via a SPARQL query

select ?classB1 where {
    $a $prop ?b.
    $a   rdf:type   ?classA.
    ?b   rdf:type  ?classB.
    ?classA rdfs:subClassOf* ?classA1.
    ?classB rdfs:subClassOf* ?classB1.
    $prop rdfs:domain ?classA1.
    $prop rdfs:range ?classB1.
}
{$a = consultant:JaneDoe, $prop = prop:personName}

In English, for the statement
consultant:JaneDoe prop:personName ?b.
where ?b is some object (I've replaced $a with consultant:JaneDoe and $prop with prop:personName here, as per the last statement), find the types of $a and ?b (the latter isn't known yet, but we'll put it in ?classB). Find each of these classes together with all of its superclasses (?classA1 and ?classB1). If the property has a domain of ?classA or one of its superclasses and a range of ?classB or one of its superclasses, then return that range class (?classB1).

This is actually very potent reasoning, based upon just a handful of assertions. If I declare a relationship once among any superclass, then it holds for all subclasses on each side of this relationship. You cannot do this with most UML tools. You can't do it very well even in XSD (it's possible, but truly ugly). However, with just a few lines of RDF, we've created a powerful relationship that, within the context of a semantic database, can radically simplify the amount of both modeling and coding you have to do.

We can also add a few more properties into the model for people. For instance, everyone has a base salary which we'll calculate in quarterly chunks. This is a fixed cost, though it varies based upon role. Again this can be attached at the person level.

    prop:quarterlySalary rdfs:subClassOf owl:Property;
        rdfs:domain class:Person;
        rdfs:range xs:double.

The next stage is determining the differentiation between the subclasses. For instance, the account manager gets a bonus based upon how many new clients they bring on board plus a percentage of the money earned from their existing clients in each quarter (the assumption is that a client that is onboarded in one quarter will start generating revenue the next). We'll not create a bonus field, but will create a bonusPercentage - this will be used to calculate the total.

                 rdfs:subClassOf owl:Property;
                 rdfs:domain   class:AccountMgr;
                  rdfs:range       xs:float.

                 rdfs:subClassOf owl:Property;
                 rdfs:domain   class:AccountMgr;
                  rdfs:range       xs:double.

Consultants get a utilization bonus for every week above six, on the assumption that a consultant who has a rounded set of skills will be more utilized. The magnitude of that bonus is a fixed amount per week, though the exact amount may change from person to person. Support personnel don't get a bonus in the current model. So this lays out one more property, utilizationBonus:

    prop:utilizationBonus rdfs:subClassOf owl:Property;
        rdfs:domain class:Consultant;
        rdfs:range xs:double.
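With both the person-level and the role-level properties described, it may help to see how the set of properties available to a given role falls out of the hierarchy. The sketch below is a plain-Python illustration under stated assumptions: the two dictionaries are hypothetical stand-ins for the schema, the property names are taken from the prose (prop:bonusPercentage is assumed to be the identifier for the bonusPercentage field), and properties_for is a made-up helper.

```python
# Hypothetical stand-ins for the subclass and domain assertions in the article.
SUBCLASS_OF = {
    "class:Person": "class:Entity",
    "class:AccountMgr": "class:Person",
    "class:Consultant": "class:Person",
    "class:Support": "class:Person",
}
PROPERTY_DOMAIN = {
    "prop:personName": "class:Person",
    "prop:quarterlySalary": "class:Person",
    "prop:bonusPercentage": "class:AccountMgr",   # assumed identifier
    "prop:utilizationBonus": "class:Consultant",
}


def properties_for(cls):
    """Collect the properties declared on cls, then on each ancestor in turn."""
    found = []
    current = cls
    while current is not None:
        found += sorted(p for p, d in PROPERTY_DOMAIN.items() if d == current)
        current = SUBCLASS_OF.get(current)
    return found


consultant_props = properties_for("class:Consultant")
support_props = properties_for("class:Support")
print(consultant_props)
print(support_props)
```

Support personnel pick up only what is declared on Person, while consultants get the person-level properties plus their own bonus property, without either list being written out by hand.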

Within an organization, the only property that's important is the name of the client:

                 rdfs:subClassOf owl:Property;
                 rdfs:domain   class:Org;
                  rdfs:range       xs:string.

There is a temptation to want to model a client as a new, existing, or former client, but the business itself does not in fact change significantly from one state to the next. What does change, from quarter to quarter, is the state of the contract between the client and the account manager (acting on behalf of the consulting company). This illustrates a second important point: in general, try not to model state transitions as classes, and avoid making the model any more complicated than it has to be.

It turns out that contracts are often especially important in models, because a contract relates two or more things together. In this particular case, a contract may look something like this:

        rdf:type class:TnMContract;
        prop:quarter quarter:F3Q2015;
        prop:accountManager accountMgr:JohnDee;
        prop:previousQuarterContract contract:19492;
        prop:contractState contractState:Continuing;
        prop:company company:BigDataCorp;
        prop:allocation [
            prop:consultant consultant:JaneDoe;
            prop:weeks "5"^^xs:integer;
            prop:rate "120.00"^^xs:float
        ], [
            prop:consultant consultant:JanetDonne;
            prop:weeks "6"^^xs:integer;
            prop:rate "100.00"^^xs:float
        ].

This is easily the most complex object in the system (contracts tend to be), because it serves to bind together a number of things - an account manager, zero or more consultants, a quarter, a company, and even a previous contract.

On the previous contract, this recursive pattern actually occurs quite often in modeling, and is a good indicator that the object type in question is actually a "pivot" object, around which everything else rotates. In this case, there is a separate contract signed at the end of each quarter that acts as an extension of a previous contract, so there's a chain of such contracts all the way back to the first. If a contract does not have a preceding contract, then it can be inferred to be a new contract, otherwise, it's a continuing contract.
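The new-versus-continuing inference and the audit chain can be sketched in a few lines of Python. This is only an illustration: apart from contract:19492, which appears in the example above, every contract identifier here is hypothetical, as are the PREVIOUS dictionary and the two helper functions.

```python
# Maps each contract to the contract it extends (None marks the first one).
# Only contract:19492 comes from the article; the other ids are hypothetical.
PREVIOUS = {
    "contract:20011": "contract:19492",
    "contract:19492": "contract:18870",
    "contract:18870": None,
}


def contract_state(contract):
    """A contract with no predecessor is new; otherwise it is continuing."""
    return "Continuing" if PREVIOUS.get(contract) else "New"


def audit_chain(contract):
    """Walk back through the predecessors, newest first, stopping at the
    original contract (or at a repeat, to guard against bad data)."""
    chain, seen = [], set()
    while contract is not None and contract not in seen:
        chain.append(contract)
        seen.add(contract)
        contract = PREVIOUS.get(contract)
    return chain


print(contract_state("contract:20011"))
print(audit_chain("contract:20011"))
```

Because the state is derived from the chain, a stored state flag becomes a convenience that can always be checked against the chain itself.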

The reason for this has to do with auditing. Once you create a new record, you should not destroy old records, but rather should deprecate them. You can set up a separate flag (prop:contractState here) that provides an easy reference to indicate the state of the process, but it's actually the audit path (prop:previousQuarterContract) that provides an indication as to the real state of the contract. (I won't model the contract here, though it is straightforward to do so.)

One final note on contracts (and similar modeling constructs). I call these backbone patterns. They may be the most complex objects (or at least the most connected), but what makes them special is that they tend to be the pivot around which everything else revolves. Contracts in particular are usually very central to determining revenue, because a continuously amended contract contains not only costs and stipulations on initial revenues, but also indicates when new programming resources or products are added or removed. It's one reason why, whenever I'm modeling a business's metadata patterns, I ALWAYS look for the contracts first, as this will be what everything else is hung on.

At this point, you have everything (except more contracts, businesses and other core objects) necessary to answer the question of how much profit was made in a given quarter. For instance, you can determine the revenue that Jane Doe generated easily enough with the following SPARQL query:

select ?firstName ?lastName ?totalWeeks ?totalBillable ?employeeRevenue where {
    ?consultant prop:employeeID $id.
    ?quarter rdfs:label $quarterLabel.
    ?consultant prop:personName ?personName.
    ?personName prop:givenName ?firstName.
    ?personName prop:surname ?lastName.
    ?consultant prop:utilizationBonus ?utilizationBonus.
    ?consultant prop:quarterlySalary ?quarterlySalary.
    {
        select ?consultant ?quarter (sum(?weeks) as ?totalWeeks)
               (sum(?weeks * ?rate) as ?totalBillable) where {
            ?contract prop:quarter ?quarter.
            ?contract prop:allocation ?allocation.
            ?allocation prop:consultant ?consultant.
            ?allocation prop:weeks ?weeks.
            ?allocation prop:rate ?rate.
        } group by ?consultant ?quarter
    }
    bind (?totalBillable - if(?totalWeeks - 6 > 0, ?totalWeeks - 6, 0) *
          ?utilizationBonus - ?quarterlySalary as ?employeeRevenue)
}
{id: "JANDOE1", quarterLabel: "F3Q2015"}

This query takes two parameters: the ID of the consultant in question ("JANDOE1") and the label of the quarter ("F3Q2015"). These are used to retrieve relevant information about the consultant, including looking up the internal identifier for that consultant and getting the associated name, base salary and utilization bonus. A subquery then retrieves all of the contracts that Jane worked on and determines the number of weeks and the rate per week for each contract.

These are summed using SPARQL aggregate functions to determine the total number of weeks and the raw revenue. The final bind statement then determines whether the total number of weeks was over 6 and subtracts the base salary and utilization bonus (if any) from the revenue that the consultant generated. Finally this information is returned as a row in a table. If the employee ID had not been constrained, the query would generate a table showing all of the consultants and their net revenue (and related information).
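To make the arithmetic concrete, here is the same calculation as a small Python sketch. It is a worked example under stated assumptions rather than a replacement for the query: the Allocation type and the employee_revenue function are invented for illustration, the salary and bonus figures are hypothetical, and so is Jane's second allocation (added so that the bonus threshold is actually crossed). Only her 5 weeks at 120.00 and Janet's 6 weeks at 100.00 come from the contract shown earlier.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    consultant: str
    weeks: int
    rate: float  # billing rate per week, as the query treats it


def employee_revenue(allocations, consultant, quarterly_salary,
                     utilization_bonus, threshold_weeks=6):
    """Return (total weeks, total billable, net revenue) for one consultant.

    Net revenue = billable - bonus for each week above the threshold - salary.
    """
    mine = [a for a in allocations if a.consultant == consultant]
    total_weeks = sum(a.weeks for a in mine)
    total_billable = sum(a.weeks * a.rate for a in mine)
    bonus_weeks = max(total_weeks - threshold_weeks, 0)
    net = total_billable - bonus_weeks * utilization_bonus - quarterly_salary
    return total_weeks, total_billable, net


allocations = [
    Allocation("consultant:JaneDoe", 5, 120.00),     # from the article
    Allocation("consultant:JaneDoe", 3, 100.00),     # hypothetical second contract
    Allocation("consultant:JanetDonne", 6, 100.00),  # from the article
]

# Hypothetical salary and bonus figures, chosen only to keep the sums simple.
result = employee_revenue(allocations, "consultant:JaneDoe",
                          quarterly_salary=500.0, utilization_bonus=50.0)
print(result)
```

With these made-up figures Jane works 8 weeks, bills 900.00, earns a bonus on the 2 weeks above six, and nets 300.00 for the quarter. Changing the bonus or salary arguments is the kind of parameterized what-if that the model makes cheap.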

A similar mechanism could be used with the same query as a subquery for determining the account manager's net revenues or the whole company's net revenues. 

What is most significant here is that this whole model was fairly easy to put together, largely by concentrating on the types of objects within the system. The model could be extended, and with selective parameterization it can even be used for modeling different scenarios (what if I increase or decrease utilization bonuses, for instance). In this regard the model becomes much more like a spreadsheet than a "program", but with considerably more flexibility, because there's a clear underlying logical relational model driving it.

 Kurt Cagle is the Founder and CEO of Semantical LLC. He will be speaking at the Smart Data Conference in San Jose on August 19, 2015.
