Scanner snares scammers
Beverley Head, Information Age
11/02/2005 13:24:41
By applying computational techniques to determine meaning in natural language, researchers working with the Australian Securities and Investments Commission (ASIC) have developed a Web spider that can identify online investment scams. The system, called Scamseek, is already having a substantial impact on ASIC's ability to track investment prospectuses posted on the Internet, and its first outing led to a court case.
Scamseek is just one, albeit high-profile, example of what is possible using computational linguistics, according to Professor Jon Patrick. He is head of the Sydney Language Technology Research Group at Sydney University and a member of the Capital Markets Co-operative Research Centre, the group that worked with ASIC to deliver Scamseek.
One of the more prosaic applications of computational linguistics is the spell and grammar check on the standard PC. But Patrick believes that applications will become increasingly sophisticated and range from management of corporate reputation and competitive analysis, by using Web spiders to monitor the world's news pages to see what is written about your company, through to forensic analysis of texts.
However, he acknowledges that forensic searching of large databases requires a very large computer system to churn through the text.
Patrick arrived in this field by a labyrinthine route. Something of a Renaissance man, he was once an archaeo-astronomist, monitoring movements in the galaxy by comparing the alignment of ancient monuments with the modern heavens. He moved into information technology after a brief sojourn in psychotherapy and hypnosis.
His fascination with language, and the way in which computational techniques can be applied to language, stems from a stint spent living in the Basque area of Spain and France. Basque is a "language-isolate"; that is, it has no direct connection to other living languages. Such was Patrick's fascination with the language that he went on to write a Basque grammar.
Back in Australia, his affair with language and search for meaning continued, only this time twinned with technology. Traditionally, texts have been classified using what Patrick describes as a Reuters indexing technique. Patrick argues that such text classification takes little account of the semantics of language. What he and his team have been working on is a linguistically principled analysis of texts, that is, one that gets closer to the real meaning.
The Scamseek project has been the first test of the theory and so far it has performed well.
Commissioned via the CRC for Capital Markets, Patrick's team began by taking 8000 historical investment documents as a base repository of information. Two linguists, one computational linguist and three software engineers then read the 8000 documents, all the while attempting to identify rogues. What they found was that 1.5 per cent of the 8000 documents (about 120) were scams, and that there were 19 recognisable types of scam.
The next challenge was to create a computer system able to reproduce that human performance.
"The project deliverable had to be a working system for ASIC. The project is as much a creative software engineering project as a language breakthrough," explains Patrick. He believes the whole exercise has primed his team for future projects: "Now I have my PhD students with an industrial strength environment to support their research."
By applying computational theories to what the linguists had identified when they read the 8000 investment documents, the Scamseek team developed ontologies describing scams, along with scam registers that were able to distinguish actual scams from the nearest non-scams.
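As a purely illustrative sketch, and not a description of how Scamseek works, the idea of a register that separates scams from near non-scams can be pictured as a set of weighted cue phrases. Every phrase, weight and threshold below is invented for this example, and the approach is far cruder than the linguistically principled analysis Patrick describes.

```python
# Toy "register" of cue phrases. All phrases, weights and the threshold
# are hypothetical; none of them come from the Scamseek project.
SCAM_CUES = {
    "guaranteed return": 3.0,
    "risk free": 2.5,
    "act now": 1.5,
}
LEGITIMATE_CUES = {
    "product disclosure statement": -2.0,
    "past performance is not": -1.5,
}


def score(text: str) -> float:
    """Sum the weights of every cue phrase occurrence in the text."""
    lowered = text.lower()
    cues = {**SCAM_CUES, **LEGITIMATE_CUES}
    return sum(weight * lowered.count(phrase) for phrase, weight in cues.items())


def flag_for_review(text: str, threshold: float = 3.0) -> bool:
    """Return True if the document should be referred to a human reviewer."""
    return score(text) >= threshold


if __name__ == "__main__":
    print(flag_for_review("Guaranteed return of 40 per cent! Act now."))
    print(flag_for_review("Read the product disclosure statement first."))
```

A document is only flagged here, not judged; as in the article, a person would still check each referral.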
The system was developed using open source tools to run on Linux PCs. The Postgres database was selected and Altova's XMLSpy was used to manage the XML ontologies.
Patrick confirms that on the system's first live outing in mid-2004 it identified a group in Adelaide posting scam investment documents. It was vindication for ASIC's $2.2 million investment in a technology which was pretty much untested.
Keith Inman, director of enforcement at ASIC, says that when the system first went into production he was pretty confident it would perform well, having benchmarked it against historical documents. "We ran the system then audited the referrals and actually found a number of scams and ended up taking action on one example of acute misbehaviour." In that first case ASIC secured orders from the Federal Court against the Biri companies, ClubInvest and Gramax, and against five individuals associated with the companies, essentially closing them down.
Inman explains that Scamseek is being used to back up ASIC's preferred method of enforcement, which is to run campaigns focused on a particular form of non-compliance. The benefit of the computer-based tool, he says, is that it will vastly increase the chance of finding the needle in the haystack or, in this case, the scam on the Internet.
"To identify contraventions we have to look at several thousand Web sites and it is very hard to distinguish the scam from the legitimate. We had a less than 0.01 per cent hit rate," Inman says.
Scamseek, meanwhile, identifies possible scams, which are then checked by human operators. "We've always anticipated that the success rate in the test sample would be higher than the live data," says Inman, "but the experience is that we are still getting (hit rates of) one in four or one in five which is a lot better than one in 1000.
"It's not only an efficiency improvement for us - but because we were scanning several thousand sites to find contravening sites it was a needle in a haystack and easy to miss. Now we are confident that we're detecting more sites for less effort than before," he adds.
Professor Patrick says that advances in computational linguistics now mean that his team is "getting to the point where we can make a reasonable understanding of the texts."
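Inman's hit-rate figures can be turned into a rough measure of reviewer workload. The sketch below assumes his "one in 1000" figure as the manual baseline and "one in four" and "one in five" as the Scamseek range; the function name is made up for illustration. (The "less than 0.01 per cent" figure he also quotes would imply an even larger gap.)

```python
from fractions import Fraction


def sites_reviewed_per_scam(hit_rate: Fraction) -> Fraction:
    """Average number of sites a reviewer checks to find one scam."""
    return 1 / hit_rate


# Hit rates as quoted by Inman in the article.
manual = Fraction(1, 1000)
scamseek_best = Fraction(1, 4)
scamseek_worst = Fraction(1, 5)

manual_load = sites_reviewed_per_scam(manual)
best_load = sites_reviewed_per_scam(scamseek_best)
worst_load = sites_reviewed_per_scam(scamseek_worst)

if __name__ == "__main__":
    print(f"Manual: {manual_load} sites per scam found")
    print(f"Scamseek: {best_load} to {worst_load} sites per scam found")
    # Ratio of manual workload to Scamseek workload.
    print(f"Reduction: {manual_load / worst_load}x to {manual_load / best_load}x")
```

On those figures, a reviewer goes from checking about 1000 sites per scam found to checking four or five, a 200-fold to 250-fold reduction.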
He is now keen to pursue other applications for the technology and has been talking to the ATO, ASIO, ACCC, APRA and the High Technology Crime Centre. However, he sees one of the biggest potential applications for the technology coming out of the European Union, where understanding meaning is important given the wide variety of languages used. In conjunction with speech synthesis and analysis, the computational linguistics techniques developed for Scamseek could deliver very powerful, meaningful translation aids, he believes.