
计算机外文翻译---基于网络爬虫的有效URL缓存


外文资料原文

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder

IBM TJ Watson Research Center

19 Skyline Dr

Hawthorne, NY 10532

abroder@https://www.doczj.com/doc/ad3276495.html,

Marc Najork

Microsoft Research

1065 La Avenida

Mountain View, CA 94043

najork@https://www.doczj.com/doc/ad3276495.html,

Janet L. Wiener

Hewlett Packard Labs

1501 Page Mill Road

Palo Alto, CA 94304

janet.wiener@https://www.doczj.com/doc/ad3276495.html,

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is as follows (a minimal code sketch is given after the list):

(a) Fetch a page

(b) Parse it to extract all linked URLs

(c) For all the URLs not seen before, repeat (a)–(c)
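As a concrete illustration of steps (a)–(c), here is a minimal single-threaded sketch in Python; fetch_page and extract_urls are hypothetical helpers standing in for the HTTP and parsing machinery, and the in-memory seen set is exactly the membership structure that the rest of the paper is concerned with.

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_urls):
    """Steps (a)-(c): crawl from seeds, tracking every URL seen so far."""
    seen = set(seed_urls)        # the membership set tested in step (c)
    frontier = deque(seed_urls)  # URLs waiting to be downloaded
    while frontier:
        url = frontier.popleft()
        page = fetch_page(url)              # (a) fetch a page
        for link in extract_urls(page):     # (b) parse and extract linked URLs
            if link not in seen:            # (c) membership test
                seen.add(link)
                frontier.append(link)
```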

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34].)

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
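The quoted rates follow from simple arithmetic; the figure of roughly ten distinct outgoing links per page used below is only an order-of-magnitude assumption for the estimate, not a number taken from this paper.

```latex
\frac{10^{8}\ \text{pages/day}}{86{,}400\ \text{s/day}} \approx 1{,}157\ \text{fetches/s},
\qquad
1{,}157\ \text{fetches/s} \times \sim\!10\ \text{links/page} > 10^{4}\ \text{membership tests/s}.
```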

A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.
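The host-based assignment of URLs to crawler processes can be sketched as follows; the hash function and the modulo step are illustrative choices of ours, not a description of Mercator's actual code.

```python
import hashlib
from urllib.parse import urlparse

def responsible_process(url: str, num_processes: int) -> int:
    """Map a URL to the crawler process that owns its host."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes

# A process keeps a URL whose responsible_process() equals its own id and
# batches the rest for transmission to the owning peer over TCP.
```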

Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers an URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and caching popular URLs in memory is a win: caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Molina's crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.

3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentioned in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.

Caching terminology is as follows: the cache is memory used to store equal-sized atomic items. A cache has size k if it can store at most k items. At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.
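The hit/miss accounting described above can be captured by a small driver like the one below; the policy interface (touch, add, choose_victim) is our own sketch, not the paper's simulator.

```python
def simulate(trace, policy, k):
    """Replay a request trace against a cache of size k and count hits/misses."""
    cache, hits, misses = set(), 0, 0
    for item in trace:
        if item in cache:
            hits += 1
            policy.touch(item)          # e.g. update recency or set a mark bit
        else:
            misses += 1
            if len(cache) >= k:         # cache full: the policy picks a victim
                cache.discard(policy.choose_victim())
            cache.add(item)
            policy.add(item)
    return hits, misses
```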

Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:

• Non-uniformity of requests. Some requests are much more popular than others.

In our context, for instance, a link to www.yahoo.com is a much more common occurrence than a link to the authors' home pages.

• Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2, and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, notorious papers, etc.

Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways.

We now describe some standard caching algorithms, whose performance we evaluate in Section 5.

3.1 Infinite cache (INFINITE)

This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests.

3.2 Clairvoyant caching (MIN)

More than 35 years ago, László Belády [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.

3.3 Least recently used (LRU)

The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady's algorithm.
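A compact LRU sketch; this version keeps recency order in an OrderedDict rather than the priority queue mentioned in the next paragraph, which is one of several equivalent implementations.

```python
from collections import OrderedDict

class LRUCache:
    """LRU: evict the item that has gone unrequested for the longest time."""
    def __init__(self, k: int):
        self.k = k
        self.items = OrderedDict()          # ordered from least to most recent

    def request(self, item) -> bool:
        """Return True on a hit, False on a miss (evicting if the cache is full)."""
        if item in self.items:
            self.items.move_to_end(item)    # mark as most recently used
            return True
        if len(self.items) >= self.k:
            self.items.popitem(last=False)  # evict the least recently used item
        self.items[item] = None
        return False
```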

Despite the admonition that "past performance is no guarantee of future results", sadly verified by the current state of the stock markets, in practice, LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large.

3.4 CLOCK

CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0, M1, ..., Mk corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X.
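A sketch of CLOCK as described above; whether a newly inserted item's mark bit starts set is a minor variant, and it is left cleared here.

```python
class ClockCache:
    """CLOCK: a circular array of slots with one mark bit per slot."""
    def __init__(self, k: int):
        self.slots = [None] * k     # cached items (None = empty slot)
        self.marks = [False] * k    # mark bits, one per slot
        self.index = {}             # item -> slot position
        self.hand = 0               # the clock handle

    def request(self, item) -> bool:
        """Return True on a hit, False on a miss."""
        pos = self.index.get(item)
        if pos is not None:
            self.marks[pos] = True  # hit: turn the mark bit on
            return True
        # Miss: advance the handle, clearing mark bits, until an unmarked slot.
        while self.marks[self.hand]:
            self.marks[self.hand] = False
            self.hand = (self.hand + 1) % len(self.slots)
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]  # evict the item at the unmarked location
        self.slots[self.hand] = item
        self.index[item] = self.hand
        self.hand = (self.hand + 1) % len(self.slots)
        return False
```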

3.5 Random replacement (RANDOM)

Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced.

In most practical situations, random replacement performs worse than CLOCK but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6.

3.6 Static caching (STATIC)

If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.

There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.

In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, which is what we did in our experiments.
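For the simulations, the STATIC preloading step amounts to a frequency count over the trace; a minimal sketch (using a plain counter rather than the streaming techniques of [18, 1, 11]) is:

```python
from collections import Counter

def static_cache_misses(trace, k):
    """STATIC: preload the k most frequent items from a first pass over the trace,
    then replay the trace; the initial loading is not charged as misses."""
    top_k = {item for item, _ in Counter(trace).most_common(k)}
    misses = sum(1 for item in trace if item not in top_k)
    return misses
```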

The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated.

4. EXPERIMENTAL SETUP

We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.

4.1 Mercator crawler architecture

A Mercator crawling system consists of a number of crawling processes, usually running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.

Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a disk-based data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not to be crawled.

The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL's host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes. Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download. In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today's web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact, they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE implementation that buffers incoming URLs in memory, but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk. Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency) and newly discovered URLs are passed on to the Frontier.
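The buffer-and-merge behaviour of this disk-based DUE can be sketched as follows; the sorted Python list standing in for the on-disk checksum file, the callback, and the class layout are simplifications of ours, not Mercator's implementation.

```python
class DiskBackedDUE:
    """Buffer incoming URL checksums in memory; merge into a sorted 'disk' set
    when the buffer fills, emitting URLs that turn out to be genuinely new."""
    def __init__(self, buffer_limit: int, emit_new):
        self.buffer_limit = buffer_limit
        self.buffer = {}            # checksum -> URL, awaiting the next merge
        self.disk = []              # stand-in for the sorted on-disk checksum file
        self.emit_new = emit_new    # callback: forward new URLs to the Frontier

    def add(self, url: str, checksum: int):
        if checksum in self.buffer:
            return                  # duplicate within the current buffer
        self.buffer[checksum] = url
        if len(self.buffer) >= self.buffer_limit:
            self.merge()            # the expensive, disk-latency-bound step

    def merge(self):
        on_disk = set(self.disk)
        for checksum, url in self.buffer.items():
            if checksum not in on_disk:     # first sighting of this URL
                on_disk.add(checksum)
                self.emit_new(url)
        self.disk = sorted(on_disk)         # keep the disk copy in sorted order
        self.buffer.clear()
```

A front-end URL cache of the kind evaluated in this paper would sit in front of add(), discarding checksums that hit in the cache before they ever reach the buffer.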

Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers "close to it".

Mercator performs an approximation of a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator's politeness policy, which limits the load placed by the crawler on any particular web server. Mercator's politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10×) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler's operator becomes inundated with complaints.

4.2 Our web crawl

Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet's backbone.

The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links. Earlier studies reported only an average of 8 links [9] or 17 links per page [33]. We offer three explanations as to why we found more links per page. First, we configured Mercator to not limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page. The numbers above include duplicate copies of a link on a page. If we only consider unique links per page, then the average number of links is 42.74 and the median is 17.

The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator's Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.

After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a very-large-memory machine. This data reduction step left us with four condensed host splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.
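The condensation step hashes each URL to 64 bits; the paper uses fingerprints in the sense of [32, 8], while the sketch below just truncates a generic cryptographic hash to illustrate the idea.

```python
import hashlib

def fingerprint64(url: str) -> int:
    """Condense a URL to a 64-bit value (a stand-in, not the fingerprints of [32, 8])."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```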

In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.

5. SIMULATION RESULTS

We studied the effects of caching with respect to two streams of URLs:

1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.

2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.

The reason for exploring both these choices is that, depending on other architectural decisions, it might make sense to cache only the URLs to be sent to other machines or to use a separate cache just for this purpose.

We fed each trace into implementations of each of the caching algorithms described above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results.

5.1 Algorithm implementations

The implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain "mark" bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated since for each item in the cache, MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail.

Let A be the trace or sequence of requests, that is, At is the item requested at time t. We create a second sequence Nt containing the time when At next appears in A. If there is no further request for At after time t, we set Nt = ∞. Formally, Nt = min{ t' > t : At' = At }, with Nt = ∞ if no such t' exists.

To generate the sequence Nt, we read the trace A backwards, that is, from tmax down to 0, and use a hash table with key At and value t. For each item At, we probe the hash table. If it is not found, we set Nt = ∞ and store (At, t) in the table. If it is found, we retrieve (At, t'), set Nt = t', and replace (At, t') by (At, t) in the hash table. Given Nt, implementing MIN is easy: we read At and Nt in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary, evict the item with the highest value for its next request, using a heap to identify it quickly.
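A sketch of this two-pass MIN simulation; the lazy handling of stale heap entries is an implementation convenience of ours, not necessarily how the authors coded it.

```python
import heapq

def next_request_times(trace):
    """Backward pass: N[t] is the time of the next occurrence of A[t], or infinity."""
    INF = float("inf")
    nxt, next_seen = [INF] * len(trace), {}
    for t in range(len(trace) - 1, -1, -1):
        nxt[t] = next_seen.get(trace[t], INF)
        next_seen[trace[t]] = t
    return nxt

def min_misses(trace, k):
    """Belady's MIN: on eviction, drop the item whose next request is farthest away."""
    nxt = next_request_times(trace)
    cache, heap, misses = {}, [], 0          # cache: item -> its next request time
    for t, item in enumerate(trace):
        if item in cache:
            cache[item] = nxt[t]
            heapq.heappush(heap, (-nxt[t], item))
            continue
        misses += 1
        if len(cache) >= k:
            while True:                      # skip stale heap entries lazily
                neg_time, victim = heapq.heappop(heap)
                if victim in cache and cache[victim] == -neg_time:
                    del cache[victim]        # victim's next request is farthest
                    break
        cache[item] = nxt[t]
        heapq.heappush(heap, (-nxt[t], item))
    return misses
```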

5.2 Results

We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the cache. We look at cache sizes from k = 2^0 to k = 2^25. In Figure 3 we present the same data relative to the miss rate of MIN, the optimum off-line algorithm. The same simulations for the cross-trace are depicted in Figures 4 and 5.

For both traces, LRU and CLOCK perform almost identically and only slightly worse than the ideal MIN, except in the critical region discussed below. RANDOM is only slightly inferior to CLOCK and LRU, while STATIC is generally much worse. Therefore, we conclude that there is considerable locality of reference in the trace, as explained in Section 3.6. For very large caches, STATIC appears to do better than MIN. However, this is just an artifact of our accounting scheme: we only charge for misses and STATIC is not charged for the initial loading of the cache. If STATIC were instead charged k misses for the initial loading of its cache, then its miss rate would of course be worse than MIN's.

6. CONCLUSIONS AND FUTURE DIRECTIONS

After running about 1,800 simulations over a trace containing 26.86 billion URLs, our main conclusion is that URL caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this size is a critical point, that is, a substantially smaller cache is ineffectual while a substantially larger cache brings little additional benefit. For practical purposes our investigation is complete: in view of our discussion in Section 5.2, we recommend a cache size of between 100 and 500 entries per crawling thread. All caching strategies perform roughly the same; we recommend using either CLOCK or RANDOM, implemented using a scatter table with circular chains. Thus, for 500 crawling threads, this cache will be about 2 MB – completely insignificant compared to other data structures needed in a crawler. If the intent is only to reduce cross machine traffic in a distributed crawler, then a slightly smaller cache could be used. In either case, the goal should be to have a miss rate lower than 20%.
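The 2 MB figure is consistent with simple arithmetic, under the assumption that each cache entry holds an 8-byte URL checksum of the kind Mercator's DUE already uses:

```latex
500\ \text{threads} \times 500\ \text{entries/thread} \times 8\ \text{bytes/entry}
= 2{,}000{,}000\ \text{bytes} \approx 2\ \text{MB}.
```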

However, there are some open questions, worthy of further research. The first open problem is to what extent the crawl order strategy (graph traversal method) affects the caching performance. Various strategies have been proposed [14], but there are indications [30] that after a short period from the beginning of the crawl the general strategy does not matter much. Hence, we believe that caching performance will be very similar for any alternative crawling strategy. We can try to implement other strategies ourselves, but ideally we would use independent crawls. Unfortunately, crawling on web scale is not a simple endeavor, and it is unlikely that we can obtain crawl logs from commercial search engines.

In view of the observed fact that the size of the cache needed to achieve top performance depends on the number of threads, the second question is whether having a per-thread cache makes sense. In general, but not always, a global cache performs better than a collection of separate caches, because common items need to be stored only once. However, this assertion needs to be verified in the URL caching context.

The third open question concerns the explanation we propose in Section 5 regarding the scope of the links encountered on a given host. If our model is correct then it has certain implications regarding the appropriate model for the web graph, a topic of considerable interest among a wide variety of scientists: mathematicians, physicists, and computer scientists. We hope that our paper will stimulate research to estimate the cache performance under various models. Models where caching performs well due to correlation of links on a given host are probably closer to reality. We are making our URL traces available for this research by donating them to the Internet Archive.

译文

基于网络爬虫的有效URL缓存

概要:

要在网络上爬行非常简单：基本的算法是：(a)取得一个网页(b)解析它提取所有的链接URLs(c)对于所有没有见过的URLs重复执行(a)-(c)。但是，网络的大小(估计有超过40亿的网页)和它变化的频率(估计每周有7%的变化)使这个计划由一个微不足道的编程练习变成一个非常严峻的算法和系统设计挑战。实际上，光是这两个要素就意味着，如果要对网络进行及时地、完全地爬行，步骤(a)必须每秒钟执行大约1000次，因此，成员检测(c)必须每秒钟执行超过10000次，而要检测的集合又太大，无法全部存放在主内存中。这就要求采用分布式体系结构，使得成员检测更加复杂。

一个加速这个检测的非常重要的方法就是用cache(高速缓存)，也就是把"见过的"URLs的一个(动态)子集存放在主内存中。这篇论文的主要目标就是仔细研究几种用于网络爬虫的URL缓存技术。我们既考虑实际的算法：随机置换、静态cache、LRU和CLOCK，也考虑理论极限：透视cache(clairvoyant caching)和无限大cache。我们用这些算法和不同的cache大小执行了大约1800次模拟，使用的是从一次大规模的、历时33天、发出了超过10亿次HTTP请求的网络爬行中提取出来的真实日志数据。

我们的主要结论是cache非常有效——在我们的机制里，一个大约有50000个条目的cache可以达到接近80%的命中率。有趣的是，这个cache大小正好处在一个临界点上：明显更小的cache效果会差很多，而明显更大的cache带来的额外好处却很有限。我们推测这类临界点是这个问题本身所固有的，并尝试对这一现象给出解释。

1.介绍

皮尤基金会最新的研究[31]指出："搜索引擎已经成为互联网用户不可或缺的工具"，并估计到2002年中期，略超过一半的美国人使用过网络搜索来获取信息。因此，支撑网络搜索的技术具有巨大的实际价值。在这篇论文中，我们集中讨论搜索技术的一个方面，也就是搜集网页的过程，这些网页最终构成搜索引擎的语料库。

搜索引擎通过很多途径搜集网页，其中包括直接提交URL、付费收录，以及从非web来源中提取URL，但是语料库的主体是通过递归地探索网络获得的，这个过程称为crawling或者SPIDERing。基本的算法是：

(a) Fetch a page

(b) Parse it to extract all linked URLs

(c) For all the URLs not seen before, repeat (a)-(c)

网络爬虫一般从一组种子URLs开始，这些种子由上述其他途径获得的URLs和/或以前爬行时收集的URLs组成。有些时候爬行开始于一个连接良好的页面，或者一个目录，例如yahoo.com，但是在这种情况下，网络中相当大的一部分(估计超过20%)永远无法被访问到。关于导致这一现象的网络图结构的讨论见[9]。

如果把网页看作图中的节点，把超链接看作这些节点之间的有向边，那么爬行就变成了一个在数学上被称为图的遍历的过程。不同的遍历策略的区别在于：在尚未访问的节点中，选择下一个访问哪个节点。两种标准的遍历策略是深度优先搜索(DFS)和广度优先搜索(BFS)——它们容易实现，所以在很多入门的算法课中都有讲授。(例如参见[34]。)

但是，在网络上爬行并不是一个微不足道的编程练习，而是一个非常严峻的算法和系统设计挑战，原因有以下两点：

1. 网络非常庞大。现在，Google[20]声称已经索引了超过30亿的网页。很多研究[3, 27, 28]都指出，在历史上，网络每9-12个月都会增长一倍。

2. 网络的页面改变很频繁。如果"改变"指的是"任何改变"，那么大约有40%的网页每周会改变[12]。即使我们只考虑变化幅度达到三分之一或更多的页面，也有大约7%的页面每周会变[17]。

这两个要素意味着，要获得及时的、完全的网页快照，一个搜索引擎每天必须爬行至少1亿个网页。因此，步骤(a)必须大约每秒执行1000次，步骤(c)中的成员检测必须每秒执行超过10000次，而要检测的URL集合又太大，无法全部存放在主内存中。另外，网络爬虫一般使用分布式的体系结构来并行地爬行更多的网页，这使成员检测更为复杂：成员检测的问题可能只能由某个对等节点回答，而无法在本地完成。

一个加速成员检测的非常重要的方法就是把"见过的"URLs的一个(动态)子集缓存在主内存中。这篇论文的主要目标就是深入研究几种用于网络爬虫的URL缓存技术。我们考察了四种实际的技术：随机置换、静态cache、LRU和CLOCK，并在一个发出了超过10亿次HTTP请求的网络爬行踪迹上，把它们与两种理论极限——透视cache和无限大cache——进行了比较。我们发现，即使在诸如50000个条目这样相对较小的cache规模下，简单的缓存技术也极为有效，并且我们还展示了如何非常高效地实现这些cache。

这篇论文是这样组织的：第2部分讨论文献中提出的各种爬行解决方案，以及缓存如何融入它们的模型。第3部分介绍缓存技术，并描述几种理论的和实际的缓存算法。第4部分描述我们实现这些算法所用的实验环境。第5部分给出并讨论模拟的结果。第6部分给出我们对URL缓存的实用算法和数据结构的建议。第7部分是结论和进一步研究的方向。

2.CRAWLING

网络爬虫的出现几乎和网络同期，而且已有很多文献描述了网络爬虫。在这个部分，我们按历史顺序对这些爬虫做一个简要的综述，并讨论为什么大多数的网络爬虫会受益于URL cache。

Internet Archive使用的爬虫[10]采用多个爬行进程，每个进程一次对64个主机进行穷尽式的爬行。爬行进程把非本地的URLs存到磁盘上；在爬行结束时，一个批处理作业把这些URLs加入到下一次爬行的各主机种子集合中。

最初的Google爬虫(见[7])把不同的爬虫组件实现为不同的进程。一个单独的URL服务器进程维护需要下载的URL的集合；爬行进程负责获取网页；索引进程提取关键字和超链接；URL解析进程将相对URL转换为绝对URL，再送回URL服务器。这些不同的进程通过文件系统通信。

这篇论文中的实验使用的是Mercator网络爬虫[22, 29]。Mercator由一组相互独立、彼此通信的爬虫进程组成。每个爬虫进程负责所有web服务器的一个子集；URLs到爬虫进程的分配基于URL主机名的哈希值。发现了不属于自己负责的URL的爬虫进程，会通过TCP把这个URL传送给负责它的爬虫进程，并把URLs成批发送以减少TCP开销。我们在第4部分更详细地描述Mercator。

任何网络爬虫都必须维护一个集合，存放那些需要被下载的URLs。此外，因为重复下载同一个URL是不可接受的，所以必须要有方法避免把同一个URL加入集合超过一次。通常的做法是维护一个已发现URLs的集合，它既包括边界(frontier)中的URLs，也包括已经下载过的URLs。如果这个集合太大而无法放入内存(考虑到有效URL数以十亿计，这种情况很常见)，就把它存到磁盘上，并在内存中缓存常用的URLs：这样缓存可以让爬虫在不查询磁盘集合的情况下丢弃很大一部分URLs。

3.CACHING

在大多数的计算机系统里面，内存是分等级的，意思是，存在2级或更多级的内存，表现出空间和速度之间不同的折中。举个例子，在一个典型的工作站里，有一个非常小但是非常快的片上内存，一个较大但是比较慢的RAM内存，和一个非常大但是很慢的磁盘存储。在一个网络环境中，这个层次结构还会继续延伸。Caching就是把经常用到的项目从慢速存储放到快速存储中的思想。

Caching的术语如下：cache是用来储存大小相同的基本项目的内存。一个cache的大小为k，表示它最多可以储存k个项目。在每个时间单位，cache接收到对一个项目的请求。如果这个被请求的项目在cache中，这种情况称为命中(hit)，不需要进一步的动作；否则，这种情况叫做缺失(miss)或者失败(fault)。如果cache中的项目少于k个，缺失的项目被加入cache；否则，算法必须选择从cache中驱逐一个项目来为缺失的项目腾出空间，或者选择不加入这个缺失的项目。Caching策略或caching算法决定驱逐哪个项目。Caching算法的目标是使缺失的次数最小化。

很清楚，cache越大，越容易避免缺失。因此，一个caching算法的性能用给定cache大小下的缺失率来衡量。

一般的,caching成功有2个原因:

请求的非均匀性。一些请求比另一些请求流行得多。

时间相关性，或称访问的局部性。当前的请求更可能与最近发生过的请求重复，而不是与很久以前的请求重复。

3.1 无限cache(INFINITE)

这是一个理论上的算法，它假设cache的大小大于不同请求的总数。

3.2 透视cache(MIN)

超过35年以前，László Belády[2]证明：如果能提前知道完整的请求序列(换句话说，算法是有先见之明的)，那么最佳策略是剔除下一次请求距离现在最远的项目。这个理论算法被记作MIN，因为它在任何序列上都能达到最少的缺失次数，因此它为性能给出了一个紧的界。

3.3 最近最少使用(LRU)

LRU算法剔除最长时间没用被用到的项目。LRU的直觉是一个项目如果很久都没被用过,那么在将来它也会在很长时间里不被用到。

尽管有"过去的表现不能保证未来的结果"这样的告诫，实际上，LRU一般是非常有效的。但是，它需要维护一个关于请求的优先级队列。这个队列会带来处理时间和内存上的开销，后者在缓存的项目较大时通常可以忽略。

3.4 CLOCK

CLOCK是LRU的一种流行的近似算法，发明于20世纪60年代末[15]。一个标记位数组M0, M1, …, Mk对应于大小为k的cache中当前存放的项目。这个数组可以看作一个环，即第一个位置紧跟在最后一个位置之后。一个时钟指针指向cache中的某个项目。当一个请求X到达时，如果项目X在cache中，则把它的标记位打开；否则，指针沿数组依次前进并把经过的标记位关闭，直到找到一个未被标记的位置，把该位置对应的cache项目剔除并用X置换。

3.5 随机置换(RANDOM)

随机置换完全不考虑历史。如果请求的项目不在cache中，就从cache中随机剔除一个项目并用请求的项目置换它。

在大多数实际的情况下,随机替换比CLOCK要差,但并不是差很多。

3.6 静态caching(STATIC)

如果我们假设每个项目被请求的概率是固定的，并且与之前的请求历史无关，那么在任何时刻，当大小为k的cache中存放的是被请求概率最高的k个项目时，cache的命中概率最大。

这种做法有2个问题：第一，一般来说这些概率不能被提前知道；第二，请求相互独立的假设虽然在数学上很有吸引力，却与大多数实际场景中存在的访问局部性相矛盾。

在我们的情况中，第一个问题可以绕开：我们可以假设上次爬行中发现的最流行的k个URLs大致就是本次爬行中最流行的k个URLs。(也有高效的技术可以从数据流中发现最常用的项目[18, 1, 11]，因此在线的方法也可能行得通。)当然，为了模拟的目的，我们可以先对输入做一遍扫描，确定最流行的k个URLs，然后把这些URLs预装到我们实验用的cache中，这正是我们在实验中所做的。

第二个问题正是我们决定测试STATIC的主要原因：如果STATIC表现很好，那么结论就是访问局部性很弱；如果STATIC表现相对较差，那么我们可以推断我们的数据表现出明显的访问局部性，也就是说，连续的请求是高度相关的。

4 实验机制

4.1 Mercator爬虫体系

一个Mercator爬虫系统由一组爬行进程组成，一般在不同的机器上运行。每个爬行进程负责所有web服务器的一个子集，并由若干工作线程(一般有500个)组成，它们负责从这些服务器下载和处理网页。

例如，在一个由4个爬行进程组成的系统中，每个工作线程的处理流程如下。

每个工作线程重复地执行以下操作：它从URL边界(URL Frontier)中取出一个URL，URL边界是一个维护待下载URL集合的磁盘数据结构；用HTTP协议把对应的网页下载到一个缓冲区中(称为RewindInputStream，简称RIS)；如果这个网页是HTML页面，就提取其中所有的超链接。提取出来的链接流被转换为绝对URL，然后经过URL过滤器，后者根据一些语法属性丢弃部分URL。比如，它会丢弃所有属于那些联系过我们、要求不要被爬行的web服务器的URL。

URL流接着被送进主机分配器(Host Splitter)，它通过URL主机名的哈希值把URL分配给各个爬虫进程。由于大多数链接是相对链接，大部分的URL(在我们的实验中是81.5%)将被分配给本地爬虫进程；剩下的则成批地通过TCP发送给相应的对等爬虫进程。本地URLs流和从对等爬虫收到的URLs流都要送到重复URL消除器(DUE)中。DUE会除去那些先前已经被发现过的URLs。新的URLs会被送到URL边界中去以供以后下载。
