Audio transcript of the March 5th ABCD-GIS Presentation. NOTE: TRANSCRIPTION ERRORS MAY BE PRESENT. CONTENT FROM 00:00 - 04:33 HAS BEEN EDITED OUT.

04:33 SPEAKER: JEFF BLOSSOM And so without further ado, I'll introduce our speaker. We have with us Miranda Lupion. She's a graduate student 04:41 here at Harvard in the Regional Studies program, the Regional Studies in Russia, Eastern Europe, and Central Asia master's program. 04:48 And she's an innovation fellow with the Davis Center and the Imperiia Project. Her research interests include internet technology regulation and use in Russia, 04:58 Russian foreign policy, GIS, digital humanities, optical character recognition, and automation. So please join me in giving her a warm welcome.

05:13 SPEAKER: MIRANDA LUPION Hi everyone, thank you so much for joining us. 05:22 I'm talking today about semi-automated toponym extraction from historical maps using the 05:30 Google Cloud Vision API. So just before I dive in: the goal of this research was 05:39 really to come up with a semi-automated extraction process that's accessible and also effective, 05:47 which sounds really simple, but those of you that are familiar with the problem know that this is really the task.

05:54 But before I dive into the research, I want to talk a little bit about who I am. 06:01 I'm an innovation fellow with the Imperiia Project, which is the geospatial history project, run by Dr. Kelly O'Neill, that uses GIS to map the Russian Empire. I'm also pretty comfortable with maps, 06:12 and obviously I'm a huge fan of them, I promise. But I'm not a machine learning or computer vision 06:23 expert, and that's important for two reasons. On the technical level, if you have questions about the model's underlying 06:32 layers, just shoot me a question in an email. I'm in contact with the Google team that's developing this product, so I'll get you an answer, but I might not know that answer off the top of my head. 06:42 The second and perhaps more salient reason why this is important kind of links back to our goal of developing a process that really doesn't require 06:52 highly specialized knowledge to be able to leverage technology and extract toponyms. So if I'm able to do it, you know, you can probably 07:00 do it too.

And before I really get into the nitty gritty of the presentation, I want to define two terms. 07:07 The first is maps. When I say "maps," I typically mean historical documents that were manually produced on paper prior to World War Two. 07:16 Within the last 10 to 20 years, many of these documents have been digitized, meaning they've been scanned in an effort to preserve them and also to make them more accessible, since researchers 07:27 may want to look at them remotely. And we believe that there are probably upwards of half a million of these digitally scanned historical map documents sitting on library and archive servers. 07:40 And again, this is a really conservative estimate. These documents are exciting because many of them hold valuable and untapped geographic information, for instance settlements that have 07:53 yet to be recorded in a gazetteer, or a definitive list of the feature type for a given place at a given time.

08:00 But the barrier to recording those settlements comes in the form of feature extraction. I'm going to give a really brief definition of what that is. Feature extraction is the process by which we identify the features, in our case toponyms, or place names. 08:17 We extract the features, so we turn them into sort of a usable digital entity, a piece of data. 08:23 And then we also geocode those features, so we figure out where a historical point is located. For instance, here we have Warsaw: 08:31 where is Warsaw from 1897 located on a contemporary map? Luckily for us, it hasn't 08:37 moved much.
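To make "a usable digital entity" concrete, here is a sketch of what one extracted, geocoded toponym pair might look like as a simple record. The field names are hypothetical and the coordinates approximate; this is not the project's actual schema.

    # One extracted, geocoded toponym pair as a simple Python record.
    # Illustrative only: field names are invented, coordinates approximate.
    warsaw = {
        "name_russian": "Варшава",           # toponym as written in Russian
        "name_second_language": "Warszawa",  # the paired Polish toponym
        "feature_type": "settlement",
        "lat": 52.23,  # geocoded against a contemporary map
        "lon": 21.01,
    }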
But this feature extraction process is often either really costly in terms of labor or time, or can require highly specialized technical knowledge. 08:48 So in that way, feature extraction can be both a prerequisite to and a barrier to meaningful analysis 08:55 of these documents. Existing extraction techniques generally fall into two categories: manual extraction and existing semi-automated processes. I want to review both of those categories really quickly to show you where our potential third option might fit in.

09:12 So the first extraction method, 09:15 and I'm sure you guys are familiar with it, is manual extraction. Oftentimes this is a researcher sitting down with that digital map 09:21 in GIS software and manually extracting the features. So that's manually sitting there, typing out the toponyms, and then, you know, geocoding them on a map. 09:33 One study shows that extracting all of the features from a typical map sheet would take you roughly six hours. If you're paying a research assistant twenty bucks an hour, 09:42 that one single map sheet is going to cost you $120. Let's say you have 100 map sheets that you're interested in extracting data from: you're really looking at $12,000, which is, you know, 09:53 not a small sum of money, to put it lightly. So this manual extraction process can be really time consuming and can also be really costly.

10:03 The second sort of existing feature extraction method that we talked about is semi-automated, and there are a few 10:12 different lines of inquiry that we've seen in existing research. The first is trying to leverage traditional optical character recognition software to semi-automate the process of extracting toponyms. 10:26 And this can be quite challenging, because traditional OCR software was built to deal with black-and-white text documents. It was built to deal primarily with documents 10:36 that maybe have text in one language, with all of that text going in straight lines across the page. You know, some OCR software can deal with curved text, but it certainly cannot deal with the complexities that we see, for instance, 10:48 in this excerpt of one of our maps on the screen. So you have different languages, you have different fonts, different sizes of text, and the text is going in all different directions. If you look at that white box 11:01 in the bottom left-hand corner of that image, you see you have the name of a river, and that's almost 11:06 curvilinear. You have different colors and variations of illumination, and symbols. 11:11 Not to mention that when we're scanning these documents to digitize them, the scanning process itself can sometimes create noise, right, false coloration 11:21 and blurring, and that noise can be exacerbated by compression of these files. So traditional OCR software really can't deal with the complexities of these documents.
11:33 So the second sort of approach under this umbrella is for researchers to develop their own extraction models 11:40 and tailor those models to their map sets, and that obviously requires a high degree of specialized knowledge, right, tackling complex questions in computer vision and OCR. 11:52 And again, because maps are such specific and different documents, these processes are typically tailored to one map set. You can't simply take an algorithm that one research team developed 12:02 and apply it to your own map set. And again, developing these custom processes can be really hard. 12:10 In a survey of digital humanities practitioners, participants frequently cited programming, coding, and software development 12:17 as skills that they wrestled with acquiring. And that's not to say that there aren't really, really qualified and really knowledgeable digital humanities and digital history teams out there that can develop these models; it's just not the case at every institution. 12:30 So developing these custom processes can be really hard. And I'd say a final sort of potential pitfall of this custom semi-automated approach is that it helps to perpetuate 12:43 the siloing of digital humanities teams or digital history teams across institutions, because what you're doing is reinventing the wheel, or a slightly different wheel for a slightly different map set, a number of times.

12:56 And so rather than trying to develop our own custom models to extract the features from our map sets, we focused on thinking about a potential method that other teams would be able 13:07 to easily apply to their own map sets. And I should also note that, at this point in time, the noise in digital map documents, and just the variation across digital map documents generally, makes it impossible to fully 13:21 automate that processing in a sort of complete and universal way. So when I talk about semi-automated feature extraction, I mean 13:29 the extraction of these toponyms by a process that might then require a researcher to go back and correct some of the data that we receive.

13:37 But, again, turning back to our potential third method: we thought that a pre-packaged commercial algorithm that had both image processing capabilities and optical character recognition 13:50 capabilities could significantly lower the expertise and resource barriers required to automate extraction. And so 14:02 we decided to experiment with Google's Cloud Vision API. This would be our 14:08 pre-packaged product, and we chose this particular product, as opposed to other commercially available options, for three reasons. First, 14:17 it's fairly easy to use. In the version of the product that we used, we didn't have to train any models, and we didn't have to have a lot of training data, which can 14:25 be a barrier to getting these models up and running. 14:29 It was also fairly easy to integrate. Now, there is no graphical user interface for this particular product, but Google provides you with the code samples that you need to implement it. So even if your programming level is kind of like "print hello world," 14:43 I still think that you could probably successfully implement this API programmatically, or you can run it from the command line. And the other perk of using a commercial product is that there is a dedicated team of people who are paid to help you implement 14:58 this product, which is not always the case, you know, for free software. And they (I've been in contact with them) are very responsive. 15:05 And the second reason that we selected this particular product is that it's effective.
15:11 So in a study of eleven leading commercial platforms that included peer products, like Amazon's version of this product and Microsoft's version of this product, Google's was the only product to rank first in all categories, 15:24 including ease of use and including performance of the model. It also offers extensive language support, which is really important for historical maps: it currently offers upwards of 50 languages, and more 15:37 languages are in beta testing. 15:40 And what's really important for us is that it supports Russian in the old orthography. What this means is that prior to the 1917 Revolution in Russia, there were a few extra characters in the Russian alphabet that aren't used today 15:53 but obviously are present in our map sets, which are from the 19th century. So it was really important that this product offer support for those characters, 16:03 and the Cloud Vision API was one of the few that really did.

16:07 And the final reason that we selected this particular product is that it's really cheap, and in many cases it's actually free. Teams get 1,000 free images, or calls to the API so to speak, a month. So that can be 16:20 1,000 map sheets per month. The only way you're going to incur costs is if, on average, you're processing more than 12,000 map sheets a year. If you're doing that, that's exciting; you're probably superhuman. 16:33 Each additional thousand calls per month after you exceed those free 1,000 calls is only $1.50. So it really is a cost-effective solution, even for institutions that don't have a large budget to purchase subscriptions to these types of technology products.

16:51 And I want to talk now a little bit about the data that we used to test the Google 16:56 Cloud Vision product on. We selected four Atlas sheets from the 1827 Geographical Atlas of the Russian Empire; each of the sheets maps a province or a kingdom within the larger Russian Empire. 17:09 And we selected these sheets because we wanted to test a variety of language combinations, font combinations, and geospatial patterns with respect to the settlements. 17:19 Each of the sheets has roughly 280 toponym pairs, or settlements reported in two languages. So each settlement on the sheet has been written twice, once in Russian 17:30 and once in a second language, and I'll go into a little bit more detail on that in a second. So across all four sheets we had a sample of 2,262 toponyms.

17:41 I'm going to go through the sheets really quickly and just talk about some of the interesting features. So this is the Kazan province, and the toponyms are written in Russian and in Latin transliteration. The font, as you can see in this sort of excerpt here, is mostly cursive. 17:57 And 17:59 the settlements are sort of unevenly distributed: 18:02 there are significantly more settlements in the southern portion of the province than there are in the northern portion, and you also have these really prominent, really dark intervening features in the form of rivers or boundaries.

18:18 The next sheet we looked at was for the Minsk province, and here the toponyms are written in Russian and Polish, and there's a good mix of block and cursive font. The settlements are more evenly distributed throughout the province than they are in the Kazan sheet. 18:34 The third Atlas sheet we looked at was the Moscow province; again, the toponyms are written in Russian and Latin transliteration, and the font is mostly cursive.
18:43 But what was really interesting to us about this sheet was the way that the settlements are clustered. The 18:52 settlements on this particular sheet aren't all that densely packed, but they're all pretty regularly spaced in these rings that sort of center on the provincial capital of Moscow, so that pattern was particularly interesting to us. 19:06 And the final Atlas sheet we looked at was the Kingdom of Poland. The toponyms are written in Russian and Polish, and they're mostly in print font, which is a departure from the other three sheets. 19:17 And they tend to be more dispersed, although there is some clustering around the kingdom's capital, Warsaw, in the center there. There are also significantly fewer intervening features.

19:28 So in terms of our testing process: we used Python to make our calls to the API, and we wrote a simple script that returned us two documents. First, it returned us a copy of the image 19:40 that was annotated with bounding boxes around the text that the algorithm identified, and we had the program label each of those bounding boxes with a number. 19:50 We also had the program produce a CSV, which had the text that was transcribed and also those numbers, so that you can visually connect what was in the CSV to what was on the map.
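The script itself isn't shown in the talk, but a minimal sketch of what a script like the one described might look like, using the google-cloud-vision and Pillow packages, follows. The file names are placeholders, and the language hint mentioned in a comment is optional (the speaker returns to language hints near the end of the talk).

    # Sketch: send a map sheet to the Cloud Vision API, draw numbered bounding
    # boxes on a copy of the image, and write numbered transcriptions to a CSV.
    import csv

    from google.cloud import vision
    from PIL import Image, ImageDraw

    client = vision.ImageAnnotatorClient()  # credentials come from the environment

    with open("atlas_sheet.png", "rb") as f:
        content = f.read()
    # An image_context={"language_hints": ["ru"]} argument could be added here to
    # force a single transcription model, as discussed later in the talk.
    response = client.text_detection(image=vision.Image(content=content))

    sheet = Image.open("atlas_sheet.png").convert("RGB")
    draw = ImageDraw.Draw(sheet)

    with open("atlas_sheet.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["number", "transcription"])
        # Annotation 0 is the full-page text block; the rest are single instances.
        for number, ann in enumerate(response.text_annotations[1:], start=1):
            box = [(v.x, v.y) for v in ann.bounding_poly.vertices]
            draw.polygon(box, outline=(255, 0, 0))
            draw.text(box[0], str(number), fill=(255, 0, 0))  # links image to CSV
            writer.writerow([number, ann.description])

    sheet.save("atlas_sheet_annotated.png")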
20:02 And we assessed two metrics: detection performance and transcription performance. Detection performance is pretty simple: how good was the model at detecting toponyms? You know, how many times did it fail to report even a single character? 20:17 Transcription performance is a little more involved. 20:21 There we were asking how well the API transcribed the text that it did identify, and we graded each text instance as 20:29 either correct (there were no errors), partially correct (there were one, two, or three errors), or incorrect, meaning that either there were more than three errors or the toponym was reported in the wrong alphabet. 20:41 That partially correct category is a little 20:44 lenient, but we included it because we did feel like, when there are only one or two errors, it's fairly easy to correct that toponym name, and it takes less time than it would to extract that toponym by hand. 20:55 For the transcription performance, we also recorded a number of different error types that describe the kind of error that we saw: for instance, the toponym was reported in the wrong alphabet, or there was a missing character, or the toponym itself was clipped. 21:08 We also noted the feature type (we are primarily interested in settlements, so not, for instance, the names of rivers), the language, and the font that the text was written in, because, again, we think these probably could have bearing on the transcription performance.

21:26 So I want to first talk about the detection results. This is a chart that shows you the results for each of the Atlas sheets, and we have, again, the number of toponym 21:35 pairs. Remember that each settlement is reported in two different languages, so those are two toponyms for a single toponym pair. And we were 21:42 interested in how many times the algorithm missed only one toponym in a pair, how many times it missed both toponyms in the pair, and then also just the total number of 21:53 misses. And we were surprisingly pleased with the results: you can see that, on average, 21:59 the model missed only roughly 6% of all toponyms on a sheet, which, given the linguistic and font and, you know, spatial complexities 22:09 of these map sheets, is pretty remarkable. The sheet that had the worst performance was the Kazan sheet, and I think that's partially because 22:18 of the prevalence of those intervening features, which can disrupt or overlap with text and prevent accurate detection. 22:25 The Kingdom of Poland sheet had the lowest percentage of toponyms missed, and recall that that's the sheet that had the fewest intervening features and sort of the most white space.

22:36 But there were some results that we found a little bit difficult to understand. I'm going to give you three quick examples, and I'd love it if you guys have any feedback on why this may be occurring. So for instance, you can see here, in this excerpt from our Minsk sheet, that the algorithm failed to detect the top toponym pair 22:54 but accurately detected the bottom pair. And it's interesting, because both of those toponym pairs are written in the same languages, 23:04 they're written in the same block font, they're about the same size, and they have pretty similar background colors. So this was a little bit of a puzzle to us. 23:13 Similarly, this is an excerpt from the Moscow Atlas sheet: the algorithm correctly detected the toponym pair on the top there that's intersected by that nice dark black line 23:24 but fails, for some reason, to detect the bottom pair, despite the fact that it looks like there are significantly fewer intervening features around that bottom toponym pair, 23:34 although there is this sort of small river that runs through it, and that settlement symbol to the right there, that might have played a role. 23:43 And then one more example of this. This is an excerpt from the Poland Atlas sheet, and the algorithm detected only one toponym in the 23:51 pair: it detected the toponym written in Polish 23:55 but not the version of that toponym written in Russian. And again, there are significantly more intervening features around that top toponym 24:02 of the pair, but still, we weren't entirely sure (it wasn't entirely clear to us, rather) why the model sometimes failed to detect these instances of text that to the human eye are pretty clear.

24:18 I also want to talk about the transcription results. Now, I know some of these numbers 24:23 look a little low, but don't be dismayed, because when we break this out by language and font group, you're going to see that some of these language groups perform quite well while others significantly drag down 24:34 the average in terms of performance. What I would note here is that the Kazan Atlas sheet had the worst performance, 24:42 with roughly 17% of all toponyms correctly transcribed, and we think this has to do in part with the prevalence of Russian cursive toponyms, which we believe were probably not prevalent in the training data, so in the data that 25:01 was used to train the model. We know that much of the data that was used to train 25:07 the model that Google uses comes from the Google Books project, which is a lot of print material that's written, you know, in block font, not necessarily in cursive. 25:16 We think that that's part of the issue, and I'm going to talk a lot more about that in a slide or two. 25:21 The Atlas sheet that performed the best was the Kingdom of Poland: we saw roughly 55% of all toponyms transcribed correctly, only 15% transcribed incorrectly, and 30% with three or fewer 25:34 errors in transcription. And again, remember that that's the Atlas sheet with the most white space and the fewest intervening features.
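The talk doesn't specify how this grading was mechanized. If ground-truth transcriptions are available, the rubric described above could be applied automatically along the following lines; using Levenshtein edit distance to count "errors" is an assumption of this sketch.

    # Sketch: the correct / partially correct / incorrect rubric, with edit
    # distance standing in for the talk's unspecified error count.
    def edit_distance(a: str, b: str) -> int:
        # Standard dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    # Includes the pre-1917 characters mentioned earlier (yat, i, fita, izhitsa).
    CYRILLIC = set("абвгдежзийклмнопрстуфхцчшщъыьэюяѣіѳѵ")

    def grade(ocr: str, truth: str) -> str:
        def is_cyrillic(s: str) -> bool:
            return any(c in CYRILLIC for c in s.lower())
        # A toponym reported in the wrong alphabet counts as incorrect outright.
        if is_cyrillic(truth) and not is_cyrillic(ocr):
            return "incorrect"
        errors = edit_distance(ocr.lower(), truth.lower())
        if errors == 0:
            return "correct"
        return "partially correct" if errors <= 3 else "incorrect"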
25:43 So this is a chart that breaks down our transcription results by font and then, within font, by language. 25:49 The first distinction we have is the top part of the chart versus the bottom part of the chart, so cursive 25:55 versus block, and then within the font groups we also break the results out by language. So we have Latin transliteration, Polish, 26:06 and Russian. And the first thing to note, looking at this chart, is that the sample is roughly two-thirds cursive and one-third block. So in the future we would like to be working with a few more 26:17 sheets where the toponyms tend to be written in block font, just to try to balance that out a little bit.

26:24 The second thing that you might notice is that, compared to toponyms written in cursive font, 26:29 toponyms that are written in block font performed better across the board, regardless of language. So our best performing font-and-language group 26:38 was block Polish; that's the row that's second from the bottom, and you can see there, in that second column, that 65% of the toponyms that were written in block font and in Polish were transcribed correctly. 26:53 Only 2% were transcribed incorrectly, and 33% had three or fewer errors. So this is really exciting 27:01 to us; that's a really strong performance for this particular language-font group. The worst performing language-font group is in the top half of that chart: it's Russian cursive, with only 3% 27:14 of toponyms accurately transcribed. So let's talk about why, and to do that I want to talk about the two most common types of errors that we saw. 27:24 The first is a script classification error, which is simply that the algorithm decided to transcribe Cyrillic text in the Latin alphabet. 27:32 And the second error that we saw quite frequently was simply the transcription of the wrong character, albeit in the correct script.

27:45 So I'll start by talking about script classification in a little bit of detail. You can kind of see why this might occur: the orange letters indicate letters that have look-alikes in the other alphabet. So the A in the Latin alphabet sure looks a lot like the А in the Cyrillic alphabet. 28:01 And remember that these are only capital letters in a single font. So when you introduce lowercase letters, and you introduce other fonts, and maybe you introduce some of the characters that are specific to the Polish alphabet, you can see how this further complicates script classification.
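To a computer, these look-alikes are entirely different characters even when the printed shapes match: Latin "A" is code point U+0041, while Cyrillic "А" is U+0410. That is also what makes possible the simple substitution fix the speaker proposes near the end of the talk. A sketch of that kind of fix follows; the mapping covers only some common look-alike pairs and is illustrative, not exhaustive.

    # Map Latin look-alikes back to Cyrillic for tokens that have been flagged
    # with a script classification error. Not an exhaustive mapping.
    LATIN_TO_CYRILLIC = str.maketrans(
        "ABCEHKMOPTXaceopxy",   # Latin characters...
        "АВСЕНКМОРТХасеорху",   # ...and their Cyrillic look-alikes
    )

    def fix_script(token: str) -> str:
        return token.translate(LATIN_TO_CYRILLIC)

    # A Latin-alphabet reading of the Russian word for Moscow comes back right:
    print(fix_script("MOCKBA"))  # -> МОСКВА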
28:21 And these errors are partially a result of the way that the model classifies the language, or alphabet, of the text and then transcribes the text. 28:30 The Cloud Vision API first identifies the script of the text, for instance Cyrillic, 28:35 Latin, Greek, or Mandarin, and then, based on that classification, it sends that text instance to an OCR model that is specific to that alphabet. So here we can see we have this lovely Russian toponym. 28:50 It was sent to the script classification model, and within that model 28:55 the algorithm sort of analyzed this piece of text and concluded: okay, it looks like there's a K, and it looks like there's probably an H and an M; 29:04 that's probably written in the Latin alphabet. And so it was then sent, with that Latin script classification, to the model that 29:13 deals with transcribing text in the Latin alphabet. So once you get past that script classification model and you have the wrong classification, that's it: you're past the point of no return, which is in part how we ended up with 29:27 a Latin-alphabet transcription rather than the Cyrillic toponym that's actually on the map. And Google has actually documented issues stemming from the high prevalence of look-alike characters in the Latin and Cyrillic alphabets. 29:37 In a 2017 paper comparing different approaches for that script classification model, 29:45 Google found that the best performing model for script identification of Cyrillic text in photos still had a nearly 30% error rate, while the worst performing model had a 50% error rate. 29:57 So our findings aren't surprising, but they also suggest that cursive text may amplify existing script identification issues caused by look-alike characters.

30:08 So these are two graphs that break out the toponyms by the type of font that they're printed in; on the left here we have cursive, on the right here we have block, and that gray bar indicates script classification errors per toponym. 30:23 For the cursive sample, that rate was 0.42 script classification errors for every toponym in that sample; 30:30 on the right side, you have 0.11 script classification errors for every toponym in the block text sample. You can see that the rate is much higher for cursive, where, you know, the letters tend to be more 30:47 connected, there's less white space between them, and they can be more difficult to differentiate or to identify.

The second major 30:54 type of error that we saw was a character identification error. And again, I have these two graphs: on the left is our cursive-font 31:01 toponyms, and on the right is our toponyms in block font, and that orange bar represents character identification issues. And we were surprised that these 31:10 rates were pretty similar, 0.34 for cursive and, I think, 0.32 for block; we expected that the algorithm would struggle to correctly identify cursive's connected letters. 31:22 But we didn't count script classification errors as character identification errors, even though the selection of the wrong script sort of stems from a character recognition issue. 31:33 If we had considered all script identification errors as character identification errors, the cursive sample would actually have a much higher rate of character identification 31:44 issues. But what this shows, perhaps, is that once you get the correct script classification, 31:50 error rates don't necessarily differ substantially between fonts, which was kind of surprising 31:57 for us.

I want to show you a few examples of some of these character identification issues. So here we have a toponym written in Latin transliteration, and you can see that that final character 32:08 was mis-transcribed by the algorithm. "OCR" shows you what the algorithm returned to us, and "MAP" is what's actually on the map. It's kind of easy to understand why that happened: 32:21 when you look at this example, you have that line running through it; that acts as noise and probably is part of the reason the algorithm mis-transcribed the last character. 32:33 Another example: this is a Russian toponym, and the character that the algorithm confused was the tse. Now, the tse, for those of you that don't read or speak Russian, 32:45 is this character that kind of looks like three-fourths of a box with a little tail, and the algorithm confused it with a pe. And if you look at these two characters in that top font, it's kind of difficult to understand why they might have been confused; you can 33:02 easily distinguish between the two of them.
But when you add serifs, 33:05 those characters look a little bit more similar. And then, when those serifs at the top of the vertical lines on the tse touch, it looks like there's now a top bar. 33:16 And that tail? Maybe that tail is actually noise, right? And so this example shows you how font considerations can lead to incorrect character identification. 33:30 This is another sort of example of that. Again, we have the letter tse, 33:34 and this time the algorithm confused the letter tse with a de, which is a "d" in Russian. And again, I think that font plays a role here, right: you have those serifs, but you also have this intervening feature on the right, which is 33:49 a fairly thin line but creates the appearance of another tail on the character. And you can see that on that de letter there, there are two tails. So here it's a combination of font and intervening features that causes the character identification issue.

34:06 Here's another example. What's written on the map is one toponym, 34:11 and what the OCR came up with is quite another, very different sounding. 34:18 And part of the issue here was that we had a false positive: the algorithm thought that there was a character at the end 34:26 that really doesn't exist. And again, it's kind of easy to see why this occurs when you zoom in on that particular toponym. So you have this intervening feature, this line, 34:36 that sort of forms the left bar of an И-looking character, and then to the right of that you have this cross, which is the symbol of a settlement, and 34:45 when you look at them in conjunction (you can kind of see, I outlined it in red there) they look like what appears to us to be an И in Russian. So here's an example of a false positive, where symbols in combination with lines create important additional noise.

35:03 I also just want to note, really quickly, that there are significant differences within font groups 35:09 if you break it down by language. So here's a graph showing error types and frequencies in the cursive font sample, 35:18 and you can see that that gray bar, which represents script classification errors, is significantly higher for Russian cursive text than it is for Polish cursive or Latin-transliteration cursive: that bar comes in at roughly 0.8 35:32 script identification errors for each toponym that's written in Russian cursive. So this one particular language-font combination drives up the error rate for the whole font group. 35:44 So this is really sort of a significant distinction that we can make between languages, and it also just shows you how much the algorithm struggled with Russian cursive font.

35:57 I want to say one more thing about transcription before I move on, and that's that geography matters. So what we're looking at right here are toponyms from one of our 36:06 samples on your left and the Moscow sample on your right, and both of these sets of toponyms are written in Russian, and they're both written in block font. 36:16 And yet, for the toponyms on the left, the model only got 14% of them correct, but for the toponyms on the right, the model got 41%, which is pretty decent. So what's the difference? It's geography, the presence of intervening features. You can see that the toponyms on the left 36:35 are bisected and intersected by significantly more intervening features than those toponyms on the right.
The exception to that, on the right, is text instance number 16, which says "Москва," which is simply the Russian word for Moscow. 36:50 And we think that the algorithm is able to correctly transcribe Moskva, despite the fact that it has this giant 36:57 black line running through it, because Moskva is probably present in the training data: Moscow, it's a fairly frequent word, and the spelling of it has not changed since the 19th century. This also testifies to the importance of what goes into the training data used to train the model.

So, would we recommend this approach? 37:18 We would say that it works pretty well on our Atlas sheets, and that's not just because I put more than 100 hours of my life into it, 37:27 you know. There's definitely a place for this type of approach, especially when you compare it to the other existing approaches, right? 37:34 So I have the manual approach up there, that sort of semi-automated custom approach below it, and then our approach. Compared to the manual approach, ours requires significantly less effort to extract the toponyms, 37:44 although it does require more effort to correct the toponyms after the fact, but not necessarily more effort than correction for some custom models requires, 37:54 at least based on what's in the literature. It also requires significantly less labor for system development, or sort of developing the process by which you extract those toponyms, right? 38:05 And correspondingly, it also requires less specialized knowledge, significantly less specialized knowledge than developing a custom model well requires. 38:13 But there are trade-offs, right? We saw, you know, moderate accuracy, in some cases better, in some cases worse, depending on the language 38:20 group. But again, based on what we read in the literature, sometimes the accuracy of the pre-packaged method is not, you know, that much worse than what we saw for some of these custom models on their map sheets.

38:34 We also think it can work even better on map sets with 38:37 certain characteristics, right? So, first, when the toponyms are all written in a single script, because then the user can specify 38:44 the language in the language hints of the API call, which is going to force the program to rely only on the transcription model associated with the language that's provided. You're going to ideally see 38:56 zero script classification errors if you have all toponyms written in a single alphabet and you're able to specify that alphabet to the API. 39:06 Second, we saw significantly better performance for toponyms in block text, probably linked to the training data that was used to train the model. 39:14 And third, you know, we think that performance will be better if toponyms are printed in, roughly speaking, a more common language. 39:22 Widely and currently used languages tend to generate more written material that can serve as model training data, and Google, understandably, is prioritizing training the model with data in languages which draw a larger commercial customer base and a broader demographic.

39:40 We also think that pre- and post-processing steps may help, and this is kind of where we're at in the research, so we'd love to get your feedback on this. 39:48 Many of these pre- and post-processing steps can be implemented either using GIMP, which is photo-editing software with a graphical user interface, or programmatically, for instance in Python using scikit-image or the Python Imaging Library. 40:01 The first potential pre-processing step we've discussed is a noise removal filter. 40:06 We expect that Google's algorithm is using some sort of automated process to try to remove noise from your image, 40:12 but at the end of the day, you are still more familiar with your maps than the algorithm is: you better understand the type of noise that's present in your maps, and where that noise came from, than the algorithm does. So you might want to send 40:25 the API an image that you've 40:28 already put a noise removal filter on. And I would just caution against using, for instance, the popular Gaussian filter, which is going to reduce detail in the image 40:38 by blurring it, and that, at the end of the day, is going to lower your performance. We want to maintain sharp character borders and a high foreground-background contrast, 40:45 so a bilateral filter or a median filter, which is going to preserve our edges, is probably going to be a better 40:52 option. The second thing that we've discussed at length is custom thresholding, and I won't go into detail here, but if you have questions or want to talk about that in Q&A, that would be awesome. 41:04 Suffice to say that if your toponyms are written in a distinctive shade of gray or a distinctive color, you might be able to select a threshold value that's really going to accentuate those toponyms, 41:13 separate them from the background, and make OCR a little bit easier.
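A minimal sketch of these two pre-processing steps in Python with scikit-image follows. The median filter and the automatic Otsu threshold are illustrative starting points, not the project's tested settings, and the file names are placeholders.

    # Sketch: edge-preserving denoising followed by global thresholding.
    # skimage.restoration.denoise_bilateral is an edge-preserving alternative.
    from skimage import color, filters, io, util

    sheet = util.img_as_ubyte(color.rgb2gray(io.imread("atlas_sheet.png")))

    # Median filter: removes speckle noise while keeping character borders
    # sharp, unlike a Gaussian filter, which blurs them.
    denoised = filters.median(sheet)

    # Global threshold: Otsu picks a value automatically, but a hand-picked
    # value may better isolate toponyms printed in a distinctive gray or color.
    thresh = filters.threshold_otsu(denoised)
    binary = denoised > thresh  # True on the lighter background pixels

    # The saved image keeps dark text on a white background for the OCR call.
    io.imsave("atlas_sheet_clean.png", util.img_as_ubyte(binary))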
41:19 And so, the post-processing steps we've discussed: the first would be filtering the output against a gazetteer, if one exists. So here's an example of what that might look like. 41:28 You have your text instances that were returned by the model, and you have your authoritative list. Where there's overlap, you can 41:34 cross those toponyms off, so you're cutting down on the time it takes to go through and correct or check the text instances that were returned by the algorithm. 41:41 Where a text instance does not appear in the authoritative list, either it's just not yet recorded, like the place name in this example, or maybe it was spelled incorrectly, as in this example. The second post-processing step we've discussed is writing a simple program to correct for 42:00 toponyms that are transcribed in the Latin alphabet when they're actually Cyrillic, 42:05 because we know which characters are most frequently mixed up. So we can say: this toponym has a script classification error, let's send it through this program. And that 42:15 program, for instance, replaces every Latin H with the Russian Н, replaces every Latin O with the Russian О, and so on and so forth. That'll help to cut down the time that it takes to 42:26 correct for script identification, or script classification, excuse me, errors. So again, these are just some steps that we're discussing, and we'd love to get more feedback on them. And, I told you I would come back to this: 42:41 in conclusion, we believe that as machine learning and image processing technology only 42:48 improves, these types of pre-packaged products are also only going to get better. And so there's a lot more research that needs to be done to figure out how we can leverage these products for semi-automated 43:00 extraction, and we believe that truly semi-automated toponym extraction has never been, one, closer, and, two, more accessible for everyone. And the payoff remains really high: unlocking probably tens of thousands of points that have yet to be recorded. 43:25 So, I'll open it up for questions. APPLAUSE