Friday, 16 May 2014

Azure API Management - Almost there

Concept Strings is just completing a SaaS offering in which access to our REST APIs is one of the key ingredients.
We downed tools when we saw what had been Apiphany's offering relaunched as part of Azure, wondering whether we should ditch our infrastructure and use it.

The answer is no, but maybe in future. Here's why, for the benefit of any Microsoft marketing people:

Documentation

Documenting your API is vital. Our web site is an MVC 5.1 site using Web API. This comes with an excellent help generator that creates documentation from the controller code.
This documents not only the calls but, more importantly since ours are quite complex, the JSON/XML structures returned and expected.
The API manager in Azure imports WADL or Swagger. There's recentish code on NuGet for Swagger with no instructions for use, and old code for WADL.
Neither seems a viable option. Doing it by hand seems tedious, and there is no way to document the structures. So if we were to use this, we'd provide considerably worse documentation.

Billing integration

There isn't any. So we'd have to respond to registration requests by email and send customers to a second site to sign up. The site permits quick initial trial sign up, but no more.

Marketing

Again, there isn't any. It's unclear whether, if you sign up for this service, you'd get everybody and his dog viewing your APIs because it's part of a popular site, or nobody.

The structure

There are APIs, products and applications. APIs seem limited to one address each. Products are really service levels, and applications are you or other vendors who have created an app that makes use of your API.
Our idea of a product, like Concept Forms or Concept Strings, is something composed of several APIs.
You sign up to a product, start at a free level of service and move up.
What we need to do therefore does not seem to line up with the API manager.

The look of the website and extensibility

Having spent months making a website that is (hopefully) consistent and has necessary features like the ability to download tooling, we find the built-in site very constricted. I understand that Apiphany felt they needed this, but as part of the Azure ecosystem, with loads of different web CMS platforms available, the built-in system looks like a stopgap and doesn't fit our needs.

The good bit is the intermediation, though even there no advice is given on how to prevent users just bypassing the manager.

We'd be happy to use this if the above issues were addressed, as I'm sure they will be in the next couple of months. Right now you'd have to be a very particular kind of vendor to want it.
Well done to Apiphany though, I hope they are sitting back counting the money...

Wednesday, 31 July 2013

How to monitor threats and abuse on the Internet with minimal effect to civil liberties

 
Dr Andy Edmonds, Concept Strings, andy@conceptstrings.com, Http://www.conceptstrings.com

Overview

Although many in government long to detect and prevent abusive and threatening texts and images on the internet, current technology is not up to the job. Inaccurate net filters are a threat to the liberty of all those who would be wrongfully accused or blocked, and represent a real headache for the various service providers and search engines who are currently receiving so much ire. An objective and accurate system that is also traceable, i.e. can explain why a piece of text was deemed abusive, is needed. We at Concept Strings believe we have such a system.


In the UK and other countries, politicians are making some loud noises about the internet and some of its more negative aspects.
The two key areas, detecting and blocking child porn and the detection of abusive and threatening tweets, blogs etc. present huge technical challenges.
Politicians have been seen to be impatient with various organizations, like Google and Twitter, but being politicians, rather than technologists, they don’t realise that what they ask is in some ways beyond the capability of current technology.
If you use a piece of software to categorize text or images, there will always be so-called “false positives”, i.e. harmless images or texts that are deemed to be harmful, and “false negatives”, harmful images or texts that are deemed harmless.
Clearly if you’ve posted a harmless tweet and the police turn up outside your door, your civil liberties are about to be seriously compromised.
False positives in this area can cause real problems for those who are wrongly accused, and you can understand Google or Twitter not wanting to get involved in this. Every false positive is a potential lawsuit.
Ignoring images, where the techniques are very different, the problem with conventional text mining software that might be used to detect abusive text is that it’s not very accurate. If you consider Sentiment Mining, which looks at tweets or blogs that mention a product or brand and tries to infer positive or negative sentiment, accuracy is normally only around 75%. This doesn’t matter much for sentiment mining, where it’s the trend that users are interested in.
However, using such techniques on Twitter to identify abusive or threatening tweets would cause chaos.
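To put rough numbers on the scale of the problem, here is a minimal sketch. The tweet volume and the share of abusive tweets are made-up illustrative figures, and the 75% accuracy quoted above is assumed to apply equally to harmless and abusive texts.

```python
def expected_flags(total, abusive_rate, accuracy):
    """Return (true_positives, false_positives) for a classifier that is
    right with probability `accuracy` on both abusive and harmless texts."""
    abusive = total * abusive_rate
    harmless = total - abusive
    true_positives = abusive * accuracy
    false_positives = harmless * (1 - accuracy)
    return true_positives, false_positives

# Hypothetical figures: one million tweets, 1 in 1,000 genuinely abusive.
tp, fp = expected_flags(1_000_000, 0.001, 0.75)
precision = tp / (tp + fp)
print(f"correctly flagged: {tp:.0f}, wrongly flagged: {fp:.0f}")
print(f"share of flags that are correct: {precision:.2%}")
```

Under these assumptions about 250,000 harmless tweets would be flagged in order to catch 750 abusive ones, so well under 1% of the flags would be justified.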
Part of the problem is that almost all text mining techniques rely on word frequencies and opaque models derived from them. Not only are they not particularly accurate, but it’s almost impossible for the layman to work out how they arrived at their classification.
This is a nightmare for an organization that might be asked to defend in court why it classified a piece of text one way or the other.
It’s worth mentioning that some organizations provide services using “dictionaries of abuse”. These are hard to maintain, also inaccurate (it’s quite possible to be abusive or threatening without using abusive words), and easy to circumvent.
An ideal system would be easy to set up and change, have few false negatives or positives, and have traceability, i.e. it should be immediately obvious why a particular piece of text was categorized. Ideally false positives and negatives should be fed back into the system to improve results.
As you’ll have guessed, Concept Strings has been developing just such a technology. It represents a complete break from conventional word frequency techniques, instead using concepts rather than words to recognize ideas being expressed in text. It’s a Natural Language Processing technique, but it makes use of ideas from machine learning and DNA sequencing to recognize sequences of concepts.
To use it you create templates of the kind of text you are looking for. The system then recognises the sequence of concepts implied in these templates, gives you the chance to edit them, and then can search incoming text highly efficiently for sequences of concepts that match.
The great power of this approach is that a handful of templates can match thousands of ways of saying the same thing. Our system uses internationally recognized thesauri not only to recognize words that might mean the same thing, but also words that are a kind of the concepts in the template. Thus a template containing “horse riding” would match “pony riding”, “palomino riding” and many others.
The traceability is inherent in the use of templates. Any match can be defended easily in the boardroom or the court, and any problems in the templates can be easily corrected by any intelligent native speaker of the language employed.
Concept Strings would love to talk to anybody who might be interested in this technology, which is available in SDK form.
Please send any expressions of interest to sales@conceptstrings.com



May be reproduced freely, wholly or in part, so long as the attribution to the author and company is included.

Thursday, 26 January 2012

Introduction to Concept Strings

Scientio has been working for several years on using the space of concepts rather than words to perform various text mining applications.  See, for instance, this paper.

Using the tools we’ve created you can search for phrases in large volumes of text based on meaning, sentiment mine, text mine and categorize, all using the concepts implied in text rather than unwieldy word frequencies. This technique combines the best bits of “bag of words” text mining and Natural Language Processing, and opens new fields of research.

A Concept is a somewhat nebulous idea. What we mean by it is a common meaning that is language independent, normally, and often common to several words.  It is the meaning intended for a word in a piece of text, though that meaning may be obscured by ambiguity.

To give you an example, the noun “post” can be a piece of wood or metal, concept 1, or the mail, concept 2, or a record in a log, concept 3. If we consider its use as a verb, to post, there are even more meanings.

Various attempts have been made to classify all words in a given language into a set of concepts. The one that we make use of is WordNet, created by Princeton University. There are now WordNets for almost all the world’s languages. A WordNet is a giant thesaurus and dictionary, and one can look up the concepts associated with any word, along with other important information.

Scientio has concentrated on a particular property of concepts that others have not made much use of. They tend to form into trees.

There are several relationships that WordNet tracks, that have long grammatical names.  The important ones to us are the “is a kind of” relationship, known as hypernymy, the “is a part of” relationship, known as meronymy, and the “is opposite to” relationship, known as antonymy.

Almost every noun concept is involved in a hypernymy relationship, and they form massive trees, with a small number of root nodes representing concepts that cannot be further simplified or made more abstract or general. In these trees of noun concepts the children are more specific examples of the parent.

To give you an example of one path through a tree from root to tip, consider the following:

  • A palomino is a kind of pony.
  • A pony is a kind of horse.
  • A horse is a kind of ungulate.
  • An ungulate is a kind of animal.
  • An animal is a kind of entity.

The same kinds of structures apply to adjectives and verbs too.
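The path above can be sketched as a simple parent lookup. This is only an illustration: the dictionary is a hand-copied toy containing just the five links listed, not WordNet itself, and the function names are invented here.

```python
# Toy "is a kind of" links copied from the path above (child -> parent).
# A real WordNet holds many thousands of such links.
HYPERNYM = {
    "palomino": "pony",
    "pony": "horse",
    "horse": "ungulate",
    "ungulate": "animal",
    "animal": "entity",
}

def path_to_root(concept):
    """Walk parent links from a concept up to the root of its tree."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def is_a_kind_of(child, ancestor):
    """True if `ancestor` is `child` itself or lies on its path to the root."""
    return ancestor in path_to_root(child)

print(path_to_root("palomino"))
print(is_a_kind_of("palomino", "horse"))  # True
print(is_a_kind_of("horse", "pony"))      # False
```

The relationship only runs one way: every palomino is a horse, but a horse is not necessarily a pony.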

So, what’s the use of this? Well, words are unordered, other than alphabetically, and it is this unordered nature that makes text mining difficult and computationally expensive. Text mining, search, etc. are concerned with the frequencies of large numbers of different words. The space of concepts has structure, because of these trees, and so we can find ways to compare and order concepts that are much more compact compared to using words.

The drawback, as you’ll have guessed, is that which concept is meant for a given word in a given sentence is often ambiguous.

So we can convert a sentence to a string of concepts just by looking them up in WordNet, but there will be uncertainty in two areas: (1) the part of speech (POS) associated with each word, and (2) the concept intended for each word.

Concept Strings

Scientio’s approach is to invent a new data structure, the Concept String, that holds all the ambiguity associated with a piece of text. In creating Concept Strings, Scientio’s software does its best to reduce any ambiguity, for instance by using word order to infer POS, but it holds all the concepts for each word that might reasonably be intended, and thus all the possible alternate readings for a piece of text.

image

The above illustrates the structure of a concept string, where the red arrows indicate one particular reading.

To make life easier a long piece of text is usually broken into sentences or phrases, and these are processed into individual Concept Strings.

This gives us something very powerful, the ability to look at two pieces of text and to determine if they might, in one of their interpretations, mean the same thing.

Comparing Concept Strings

image

Comparison between two concept strings is much more complicated than comparing normal strings. First we look to see if the parts of speech agree, then whether there is a common concept in each word's list of possible concepts. Much more subtly, using the trees we discussed above, we also look for matches further up the tree.

In this case “I’m moving to the bus” would match with “I’m running to the bus”, “I’m jogging to the bus”, “I’m walking to the bus”, as well, of course, as “I’m running to the coach”.

This is because running, walking and jogging are all kinds of moving.
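The comparison just described can be sketched in a few lines. This is a simplified illustration, not Scientio's implementation: the tiny lexicon, the concept names and the tree are all invented, each word position is checked independently, and both strings are assumed to have the same word order.

```python
# Hypothetical lexicon: each word maps to every (part of speech, concept)
# reading it might have. "coach" and "running" are deliberately ambiguous.
LEXICON = {
    "i'm":      {("pron", "speaker")},
    "to":       {("prep", "toward")},
    "the":      {("det", "definite")},
    "moving":   {("verb", "move")},
    "running":  {("verb", "run"), ("verb", "operate")},
    "jogging":  {("verb", "jog")},
    "walking":  {("verb", "walk")},
    "pointing": {("verb", "indicate")},
    "bus":      {("noun", "bus")},
    "coach":    {("noun", "bus"), ("noun", "trainer")},
}

# Hypothetical fragment of a concept tree (child -> parent).
PARENT = {"run": "move", "jog": "move", "walk": "move"}

def ancestors(concept):
    """The concept itself plus everything above it in the toy tree."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return set(chain)

def concept_string(text):
    """One set of readings per word, keeping all the ambiguity."""
    return [LEXICON[word] for word in text.lower().split()]

def matches(template, text):
    """True if, at every position, some reading of the text has the same
    part of speech as a template reading and a concept at or below it."""
    t, s = concept_string(template), concept_string(text)
    if len(t) != len(s):
        return False
    for t_readings, s_readings in zip(t, s):
        if not any(t_pos == s_pos and t_con in ancestors(s_con)
                   for t_pos, t_con in t_readings
                   for s_pos, s_con in s_readings):
            return False
    return True

template = "I'm moving to the bus"
print(matches(template, "I'm jogging to the bus"))    # True
print(matches(template, "I'm running to the coach"))  # True
print(matches(template, "I'm pointing to the bus"))   # False
```

The match is deliberately one-directional: a template built on the general concept matches more specific text, but "I'm running to the bus" used as a template would not match "I'm moving to the bus".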

Now, again, as you’ll have guessed, the comparison above relies on a particular ordering of parts of speech. It’s possible to say the same thing with lots of different orderings of these, but at least we have simplified things dramatically. It is now possible to search large amounts of text for important statements, such as “the bomb is on the plane” using just a couple of templates, whereas to do the same thing in the space of words would require the specification of a large number of alternatives.

In my next blog I’ll look at structures we’ve found for efficiently indexing concept strings and applications.

Sunday, 15 January 2012

Azure HPC Scheduler–integrating into an existing website

Scientio is a creator of text mining, data mining, rule based and time series analysis software. Although our products are designed to be as quick as possible, they are still potentially large scale consumers of processing power, especially if applied to large data sets. We’ve been looking for several years at offering access to these products as a service, but the costs have always been prohibitive or the available technology too slow. Finally it looks like technology has caught up in the shape of Microsoft Azure HPC Scheduler, which offers the opportunity to run large computing clusters in the cloud. (Get the SDK here.) We’re just at the start, but we hope to let registered users upload data to our blob storage and then run potentially large and lengthy tasks with our products on the HPC cluster, through either the existing HPC web based interface or the REST API.

At the time of writing the Azure HPC Scheduler software is very new and the documentation is skimpy. Microsoft have provided a sample service that runs Linq-HPC, MPI and SOA examples, but not much in the way of documentation apart from that. The following are a few notes on integrating HPC into your own Azure hosted site. You should try running the sample service first: it will make the following a bit clearer, and it creates the database you need.

I’ll look initially at just getting the composite site going – in later blogs I’ll look at issues like controlling customer access, provisioning customers, logging, billing etc.

Configuration

There’s lots to configure with the HPC scheduler. The approach taken with the sample service was to create a WPF application that collected information from the user about accounts etc. and then dynamically created the Azure configuration files and uploaded the whole thing to Azure. This won’t do if, like Scientio, you have an existing site that you are integrating HPC into.

Also, since this service is experimental, we wanted to cut down on our Azure bill by not having a separate instance running as a head node.  I should explain: HPC requires three types of instances, the web front end, the head node (responsible for scheduling jobs) and worker nodes (which do the work). It’s possible to configure HPC to combine the front end and the head node. It’s not clear yet at what point you have to have an independent head node as you increase the number of workers. Anyway, we wanted to start without, and the configuration app in the sample service doesn’t do this.

Finally, the HPC front end requires secure sockets access, and we’ve already got an SSL certificate for our domain, so we want to make the HPC front end use that, accessible as <domain name>/Portal/

So how to achieve all this? There are several stages:

1) configure the existing site to accept the HPC front end

2) write an application that fills in the azure configuration

3) write HPC-friendly wrappers for each Scientio product.

 

Modifying the site

The first thing is to stop the master site's web.config settings being inherited by the scheduler front end:

<location path="."  inheritInChildApplications="false">
…..
</location>


Place the above round the system.web element in the web.config, and separately around the system.webServer element.



This specifically prevents any DLL clashes.



Next you need to edit the ServiceDefinition.csdef file in the azure project. Here’s an example:



<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="<your service name>" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="<web site name>" vmsize="Medium">
    <Sites>
      <Site name="Web">
        <VirtualApplication name="Portal" physicalDirectory="C:\Program Files\Windows Azure HPC Scheduler SDK\v1.6\hpcportal" />
        <Bindings>
          <Binding name="HttpIn" endpointName="HttpIn" />
          <Binding name="HPCWebServiceHttps" endpointName="Microsoft.Hpc.Azure.Endpoint.HPCWebServiceHttps" />
        </Bindings>
      </Site>
    </Sites>
    <ConfigurationSettings>
      <Setting name="DiagnosticsConnectionString" />
      <Setting name="DataConnectionString" />
    </ConfigurationSettings>
    <Certificates>
      <Certificate name="<your https certificate name>" storeLocation="LocalMachine" storeName="My" />
    </Certificates>
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
    <Imports>
      <Import moduleName="Diagnostics" />
      <Import moduleName="HpcWebFrontEndHeadNode" />
      <Import moduleName="RemoteAccess" />
      <Import moduleName="RemoteForwarder" />
    </Imports>
  </WebRole>
  <WorkerRole name="ComputeNode" vmsize="Small">
    <Imports>
      <Import moduleName="Diagnostics" />
      <Import moduleName="HpcComputeNode" />
      <Import moduleName="RemoteAccess" />
    </Imports>
  </WorkerRole>
</ServiceDefinition>



There are several things to note. First, the web role instance is size “Medium”. Anything less is unreliable at start up. This seems to be due to memory limitations.



Secondly, we have no head node, unlike the example, but import “HpcWebFrontEndHeadNode” which combines front end and head node.



Filling in the configuration



The Azure HPC SDK supplies a class, ClusterConfig, that you can use to fill in the configuration fields.



I’ve created a command line application that calls this and modifies the configuration directly. It reads the definition file above to work out what to modify.



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hpc.Azure.ClusterConfig;
using System.Security.Cryptography.X509Certificates;

namespace UpdateAzureHPCConfig
{
    class Program
    {
        static void Main(string[] args)
        {
            ClusterConfig config = new ClusterConfig();

            // Read the service definition to work out what needs configuring
            config.SetCsdefFile(@"C:\<path to the service definition>\ServiceDefinition.csdef");

            config.EnableSOA();

            // Fill in the Azure account info
            config.SubScriptionId = new Guid("{<your azure account subscription id>}");
            config.ServiceName = "<the service name>";
            config.SetStorage("<storage account name>", "<storage account key>");

            // Fill in the SQL Azure account info
            config.DBServer = "<sql server name>";
            config.DBUser = "<sql user name>";
            config.DBPassword = "<sql user password>";
            config.DBName = "<database name>";

            // Fill in the certificate thumbprints
            X509Certificate2 sslcert = CertificateHelper.LoadCert(@"<ssl certificate>.pfx", "<cert password>");

            // The same certificate is used for SSL and for password encryption
            config.AddCertificate("Microsoft.Hpc.Azure.Certificate.SSLCert", sslcert);
            config.AddCertificate("Microsoft.Hpc.Azure.Certificate.PasswordEncryptionCert", sslcert);

            // You can override some preconfigured settings
            config.ClusterName = "scientio";

            // Setup the built-in cluster admin account
            config.ClusterAdmin = "<cluster admin name>";
            config.ClusterAdminEncryptedPassword = CertificateHelper.EncryptWithCertificate("<password>", sslcert);

            // Generate the service configuration (.cscfg)
            config.Generate(@"C:\<path to configuration>\ServiceConfiguration.cscfg", @"C:\<path to configuration>\ServiceConfiguration.cscfg");
        }
    }
}


There is a bug in the ClusterConfig class: it renames the ServiceConfiguration's ServiceName. Just rename it back.



You’ll note there’s a SQL server database involved. I created one of these using the sample service and then re-used it.



There are various Azure storage Blob containers and tables used – these should be generated automatically.



You will need to create a ComputeNode project. This is just an empty worker; all the clever stuff is done with the imports.



Wrappers for the products



The standard form of application you can run on HPC is a command line app. The great big gotcha at the moment is that these must be compiled with .NET 3.5. When asked, Microsoft wouldn’t say when .NET 4.0 would be available.



Rather than accessing the local file system, these can be configured to access Azure blob storage. If you look at the Linq-HPC example in the sample service you can see how to do this.



As has been publicised, Microsoft has decided not to continue with Linq-HPC and the underlying Dryad distributed storage. Instead Microsoft is going with Hadoop on Azure. There’s obviously a good fit between our products and a Hadoop cluster, especially with our text and concept mining products, so we’ll be investigating this soon.



I hope this helped you to get underway with creating your own Azure HPC clusters.

Thursday, 22 December 2011

Tolkien on Engineering and Invention

In The Silmarillion, Tolkien provides a backstory to The Lord of the Rings and The Hobbit, and talks about the demigods that form the Valar, the controllers of the world.

Here is what he has to say about the smith god Aulë:

“but the delight and pride of Aulë is in the deed of making and in the thing made, and neither in possession nor in his own mastery; wherefore he gives and hoards not, and is free from care, passing ever on to some new work.”

Doesn’t that sum up our profession, or at least how it ought to be?

Friday, 2 December 2011

Automated Medical Diagnosis and XmlMiner

Scientio is getting towards the end of a successful collaboration with a medical devices start up. Basically the product works, and barring some tinkering and approvals the initial development phase is over. I’m not going to talk about this company; there’ll be a separate splash when they are ready to publicise things, but, obviously, having built up this expertise we’d like to reuse it.

My feeling is that there are other enterprises like this, who may not be aware of what we do, or that what we do can be done.  I don’t intend to break any confidences so I’m going to talk about our experiences in general terms in this post.

This company had a unique way of interpreting  and conditioning a kind of sensor that is frequently used. They also had a set of tests built around this and other sensors, and an expert who could detect a range of conditions using this set up. Obviously with only one expert and only so many hours in a day the earning potential of this idea was limited, so how could they automate and reproduce this idea, so it would be available across America?

Scientio’s interest in this was the automation of the expert’s diagnostic knowledge, and the provision of this as a central cloud based diagnosis engine.  The result is that this diagnostic method is now leveraged so that thousands of tests can be handled in the time required for one manual test. This previous post talks about the architecture we used.

We’ve discovered through this process that Scientio’s engine is ideal for such tasks.

First of all, in an environment where approvals and compliance are important, the rules, though stored as XML, are easily displayed as English language if…then text, so the function of the system can be easily verified.

The rules are testable, either as a complete functional block or individually, and we supply software in our Lacuna product that can find any unintentional gaps in the rule sets, i.e. combinations of inputs that ought to produce a valid result but don’t.
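To illustrate what a gap in a rule set looks like, here is a minimal sketch. It is not how Lacuna works internally: the inputs, the rules and the function names are invented for the example, and it uses crisp values and brute-force enumeration, which only suits small rule sets.

```python
from itertools import product

# Hypothetical inputs and the values each can take.
INPUTS = {
    "temperature": ["low", "normal", "high"],
    "pulse": ["slow", "normal", "fast"],
}

# Hypothetical rules: (conditions that must all hold, conclusion).
RULES = [
    ({"temperature": "high"}, "refer"),
    ({"temperature": "normal", "pulse": "normal"}, "healthy"),
    ({"pulse": "fast"}, "refer"),
    ({"temperature": "low", "pulse": "slow"}, "refer"),
]

def fires(conditions, case):
    """A rule fires when every one of its conditions holds for the case."""
    return all(case[name] == value for name, value in conditions.items())

def find_gaps(inputs, rules):
    """List every combination of input values for which no rule fires."""
    names = list(inputs)
    gaps = []
    for values in product(*(inputs[name] for name in names)):
        case = dict(zip(names, values))
        if not any(fires(conditions, case) for conditions, _ in rules):
            gaps.append(case)
    return gaps

for gap in find_gaps(INPUTS, RULES):
    print("no rule covers:", gap)
```

With these made-up rules, two of the nine combinations are uncovered: low temperature with normal pulse, and normal temperature with slow pulse.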

When you add a new fact to a conventional expert system you have no idea how long it will process before stable results are generated.  
XmlMiner uses pre-processing to format the rules for straight through processing. The run time is defined and exceedingly speedy.

The power of fuzzy logic also makes it easier to transfer the expert’s knowledge to a set of rules. Scientio’s fuzzy logic inference engine is entirely capable of handling competing solutions and handling them in a rational way. Fuzzy logic makes for very expressive rules: we were amazed how small the set of rules used in the final product was.
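For readers unfamiliar with fuzzy rules, here is a generic sketch of two competing rules being blended. It is a textbook-style illustration, not XmlMiner's engine: the sensor ranges, the risk scores and the weighted-average blending are all assumptions made for the example.

```python
def triangular(x, left, peak, right):
    """Membership that rises from 0 at `left` to 1 at `peak`, then falls to 0 at `right`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def infer(reading):
    """Blend two competing rules according to how strongly each one fires."""
    # Degrees to which the reading counts as "normal" or "elevated" (made-up ranges).
    normal = triangular(reading, 50, 70, 90)
    elevated = triangular(reading, 70, 100, 130)
    # Rule 1: IF reading is normal THEN risk is 0.1
    # Rule 2: IF reading is elevated THEN risk is 0.8
    total = normal + elevated
    if total == 0:
        return None  # neither rule applies: outside what the rules were written for
    return (normal * 0.1 + elevated * 0.8) / total

print(infer(70))   # only the "normal" rule fires
print(infer(80))   # both rules fire and are blended
print(infer(200))  # None: flag for human attention
```

A reading of 80 is partly normal and partly elevated, so both rules contribute in proportion. Two short rules cover a whole range of inputs that crisp thresholds would need many more rules to handle smoothly.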

Smaller rule sets mean lower maintenance costs and easier approvals.

XmlMiner can tell you when a set of input data is outside of the circumstances the rule set was created to handle. This means it’s easy to flag exceptional circumstances for human supervision or monitoring.

So, if you are trying to make that jump from a human expert based process to an automatic, semi-automatic or human supervised process, contact Scientio. We’d be happy to hear from you.

Sunday, 21 August 2011

Human rights legislation is flawed

This is a bit of a diversion from my normal topics. However, when you come to a conclusion on a subject and find, no matter what you read, that no one else has quite the same take, you are either suffering from some kind of psychosis or ought to explain yourself. Let’s assume it’s the latter.

The Human Rights Act is a strange beast. It’s based, as I remember, on the work of some committee run by Valéry Giscard d'Estaing. It seeks to be a form of constitution. Constitutions are logical and almost metaphysical beasts, and they fall into one of my areas of interest.

It’s said that the philosopher Jeremy Bentham spent a lot of his time devising constitutions for a variety of new countries created by the turbulence of the early 19th century and sending them off, where they were universally ignored. I shouldn’t be ashamed to be in such company.

The fault with the Human Rights Act is where it starts. If you look at the first constitution history knows well, that of ancient Athens, the very first consideration in deciding what rights applied to an individual was determining what kind of individual you were dealing with. In ancient Athens this was all about what deme you belonged to, whether both your parents were Athenians, whether you were male, and whether you had lost rights through criminality or indigence.

There were multiple classes of people, from full male Athenian citizens meeting all the property limits to metics (foreign traders) to slaves. Each had different rights. Logically, the first thing you did when pleading in an Athenian court was to explain which group you belonged to.

It’s deeply unpopular at the moment to consider kinds of people. It smacks of racism, sexism or whatever, but all the perceived flaws in the Human Rights Act are concerned with this very issue. Prisoners get the vote or access to pornography, asylum seekers get to stay because they had children, all because the Human Rights Act starts out by delineating rights before delineating who is to receive them.

Obviously I’m not proposing legislating for the return of slavery, or that women should have unequal rights to men, but our existing body of law does treat some people differently. The insane don’t get freedom if they are deemed to require treatment, prisoners have different rights from the rest of us, as do members of the armed services and children.

If Cameron ever gets to fight off the Lib Dems and create the British bill of rights he ought to consider that it should be a grid, not a list.

Clearly, given our current set of laws, identifiable categories of people are: citizens of the UK, subdivided into children, prisoners, the insane, members of the armed forces, undischarged bankrupts and none of the above; citizens of the EU, with similar subdivisions; and others.

If you want to create a bill of rights that the citizens of the UK consider just, you must start with which of the groups the subject belongs to, and the rights attached must reward citizens over all others.

On another point, bills of rights are somewhat alien to the British world view, because the inference is that some body such as the state dispenses rights, and that such things are theirs to give away.

We don’t believe that this side of the channel, though it’s long been considered normal on the other side.

Perhaps what we need is a “bill of wrongs”, i.e. a document clearly written to say a citizen has the right to do absolutely anything apart from a small list of exclusions. This list would vary to cover the various categories I’ve mentioned above.