网络爬虫外文翻译

外文资料

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is

(a) Fetch a page

(b) Parse it to extract all linked URLs

(c) For all the URLs not seen before, repeat (a)–(c)

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as https://www.doczj.com/doc/193557716.html,, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34]).

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
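As a quick sanity check of the arithmetic above, the sketch below recomputes the required request rates; the links-per-page figure used for step (c) is our own illustrative assumption, not a number taken from this paper.

```python
# Back-of-the-envelope check of the crawl rates quoted above.
pages_per_day = 100_000_000            # "at least 100 million pages per day"
seconds_per_day = 24 * 60 * 60

fetch_rate = pages_per_day / seconds_per_day
print(f"step (a): about {fetch_rate:.0f} fetches per second")   # ~1157, i.e. "about 1,000"

# Every fetched page yields many extracted URLs, each needing the membership
# test of step (c); ten links per page is a deliberately conservative assumption.
links_per_page = 10
print(f"step (c): about {fetch_rate * links_per_page:.0f} membership tests per second")
```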

A crucial way to speed up the membership test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.

Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers an URL for which it is not responsible, sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and caching popular URLs in memory is a win: Caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler[4], and Cho and Molina’s crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.

3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentioned in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.

Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items. At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.
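To make the terminology concrete, here is a minimal simulation harness of the kind the experiments later in the paper rely on; the `Cache` interface with `contains`/`admit` methods is our own abstraction, not the authors' code.

```python
from abc import ABC, abstractmethod
from typing import Hashable, Iterable

class Cache(ABC):
    """A fixed-capacity store of equal-sized atomic items plus an eviction policy."""
    def __init__(self, k: int):
        self.k = k

    @abstractmethod
    def contains(self, item: Hashable) -> bool: ...

    @abstractmethod
    def admit(self, item: Hashable) -> None:
        """Called on a miss; may evict an item to make room, or decline to add."""

def simulate(trace: Iterable[Hashable], cache: Cache) -> float:
    """Replay a request trace against a cache and return its miss ratio."""
    requests = misses = 0
    for item in trace:
        requests += 1
        if cache.contains(item):
            continue              # hit: no further action is needed
        misses += 1
        cache.admit(item)         # miss: the caching policy decides what to evict
    return misses / requests if requests else 0.0
```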

Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:

_ Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to https://www.doczj.com/doc/193557716.html, is a much more common occurrence than a link to the authors’ home pages.

_ Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2, and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, notorious papers, etc.

Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways.

We now describe some standard caching algorithms, whose performance we evaluate in Section 5.

3.1 Infinite cache (INFINITE)

This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests.

3.2 Clairvoyant caching (MIN)

More than 35 years ago, László Belady [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.

3.3 Least recently used (LRU)

The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady’s algorithm.

Despite the admonition that “past performance is no guarantee of future results”, sadly verified by the current state of the stock markets, in practice, LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large.
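As an illustration only (the paper does not include its implementation), LRU can be sketched with an ordered dictionary standing in for the priority queue ordered by last access time; it follows the `contains`/`admit` interface assumed earlier.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the item that has gone unrequested for the longest time."""
    def __init__(self, k: int):
        self.k = k
        self._items = OrderedDict()          # iteration order doubles as recency order

    def contains(self, item) -> bool:
        if item in self._items:
            self._items.move_to_end(item)    # a hit makes the item most recently used
            return True
        return False

    def admit(self, item) -> None:
        if len(self._items) >= self.k:
            self._items.popitem(last=False)  # evict the least recently used item
        self._items[item] = None
```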

3.4 CLOCK

CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0, M1, …, Mk corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X.
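A possible rendering of CLOCK as just described; the slot array, mark bits and rotating handle mirror the text above, but the code itself is a sketch rather than the authors' implementation.

```python
class ClockCache:
    """Approximates LRU using one mark bit per slot and a rotating handle."""
    def __init__(self, k: int):
        self.k = k
        self.slots = [None] * k        # cached items, viewed as a circle
        self.marks = [False] * k       # mark bits, one per slot
        self.index = {}                # item -> slot position
        self.hand = 0

    def contains(self, item) -> bool:
        pos = self.index.get(item)
        if pos is None:
            return False
        self.marks[pos] = True         # a hit turns the mark bit on
        return True

    def admit(self, item) -> None:
        # Advance the handle, clearing mark bits, until an unmarked slot is found.
        while self.marks[self.hand]:
            self.marks[self.hand] = False
            self.hand = (self.hand + 1) % self.k
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]     # the unmarked item is evicted...
        self.slots[self.hand] = item   # ...and replaced by the requested item
        self.index[item] = self.hand
        self.hand = (self.hand + 1) % self.k
```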

3.5 Random replacement (RANDOM)

Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced.

In most practical situations, random replacement performs worse than CLOCK but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6.
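A minimal sketch of RANDOM under the same assumed interface; the auxiliary list here is only a convenience for uniform sampling, and as noted above (see Section 6) the extra space can be avoided entirely in a production implementation.

```python
import random

class RandomCache:
    """On a miss with a full cache, evicts a uniformly random item."""
    def __init__(self, k: int):
        self.k = k
        self.items = []                      # cached items; positions matter only for sampling
        self.index = {}                      # item -> its position in self.items

    def contains(self, item) -> bool:
        return item in self.index

    def admit(self, item) -> None:
        if len(self.items) >= self.k:
            pos = random.randrange(len(self.items))
            evicted = self.items[pos]
            del self.index[evicted]
            last = self.items.pop()          # shrink the list by one
            if pos < len(self.items):        # move the former last item into the hole
                self.items[pos] = last
                self.index[last] = pos
        self.index[item] = len(self.items)
        self.items.append(item)
```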

3.6 Static caching (STATIC)

If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.

There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.

In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, which is what we did in our experiments.
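Under that reading, STATIC for simulation purposes reduces to a first pass that counts frequencies and preloads the k most frequent URLs. The sketch below uses a plain counter over an in-memory trace rather than the streaming techniques of [18, 1, 11], and its constructor takes the trace as an extra argument; it is an illustration, not the authors' code.

```python
from collections import Counter

class StaticCache:
    """Preloaded with the k most frequent items of the trace; never evicts."""
    def __init__(self, k: int, trace):
        # First pass over the input (which must therefore be re-iterable, e.g. a list):
        counts = Counter(trace)
        self._items = {item for item, _ in counts.most_common(k)}

    def contains(self, item) -> bool:
        return item in self._items

    def admit(self, item) -> None:
        pass                     # static cache: missed items are never added
```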

The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated.

4. EXPERIMENTAL SETUP

We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.

4.1 Mercator crawler architecture

A Mercator crawling system consists of a number of crawling processes, usually running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.

Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a disk-based data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not to be crawled.
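In outline, one iteration of a worker thread might look like the sketch below. The frontier, URL filter and Host Splitter objects are assumed interfaces rather than Mercator's actual APIs, and the standard-library HTML parsing is only a stand-in for Mercator's link extractor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href/src attribute values from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def worker_step(frontier, url_filter, host_splitter):
    """One iteration of a worker thread: fetch, parse, filter, route."""
    url = frontier.get()                                   # obtain a URL from the URL Frontier
    with urlopen(url, timeout=10) as resp:                 # download the page (the RIS buffer in Mercator)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)                                      # extract all links from the HTML page
    for link in parser.links:
        absolute = urljoin(url, link)                      # convert relative links to absolute URLs
        if url_filter(absolute):                           # drop URLs with unwanted syntactic properties
            host_splitter.route(absolute)                  # hand the surviving URL to the Host Splitter
```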

The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL’s host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes.

Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download. In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today’s web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact, they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE implementation that buffers incoming URLs in memory, but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk. Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency) and newly discovered URLs are passed on to the Frontier.

Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers “close to it”.
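A cache-fronted Host Splitter along these lines might be sketched as follows; the class and method names are illustrative assumptions (not Mercator's interfaces), the built-in hash stands in for a stable hash of the host component, and the batching of outgoing URLs is omitted.

```python
from urllib.parse import urlparse

class CachedHostSplitter:
    """Routes each URL by host hash, dropping URLs that hit the in-memory cache."""
    def __init__(self, num_processes: int, self_id: int, cache, local_due, peers):
        self.num_processes = num_processes
        self.self_id = self_id
        self.cache = cache          # any cache exposing contains()/admit(), e.g. a CLOCK cache
        self.local_due = local_due  # local disk-based Duplicate URL Eliminator (assumed interface)
        self.peers = peers          # peer crawler connections indexed by process id (assumed interface)

    def route(self, url: str) -> None:
        if self.cache.contains(url):
            return                  # duplicate: dropped without touching the buffer or the network
        self.cache.admit(url)
        host = urlparse(url).hostname or ""
        owner = hash(host) % self.num_processes   # a real system would use a stable hash here
        if owner == self.self_id:
            self.local_due.add(url)               # local URL: goes to the disk-based DUE
        else:
            self.peers[owner].send(url)           # remote URL: sent (batched over TCP) to its owner
```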

Mercator performs an approximation of a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator’s politeness policy, which limits the load placed by the crawler on any particular web server. Mercator’s politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10×) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler’s operator becomes inundated with complaints.

4.2 Our web crawl

Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet’s backbone.

The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links. Earlier studies reported only an average of 8 links [9] or 17 links per page [33]. We offer three explanations as to why we found more links per page. First, we configured Mercator to not limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page. The numbers above include duplicate copies of a link on a page. If we only consider unique links per page, then the average number of links is 42.74 and the median is 17.

The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator’s Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.

After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a very-large-memory machine. This data reduction step left us with four condensed host splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.
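The condensing step can be approximated as below. The paper uses Rabin fingerprints [32, 8]; the truncated MD5 hash here is only a stand-in chosen to show the reduction to 8 bytes per URL.

```python
import hashlib
import struct

def fingerprint64(url: str) -> int:
    """Map a URL to a 64-bit integer (a stand-in for the Rabin fingerprints used in the paper)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return struct.unpack("<Q", digest[:8])[0]      # first 8 bytes as an unsigned 64-bit value

def condense(url_log_lines, out):
    """Write one 8-byte fingerprint per logged URL, as in the condensed Host Splitter logs."""
    for line in url_log_lines:
        out.write(struct.pack("<Q", fingerprint64(line.strip())))
```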

In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.

5. SIMULATION RESULTS

We studied the effects of caching with respect to two streams of URLs:

1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.

2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.

The reason for exploring both these choices is that, depending on other architectural decisions, it might make sense to cache only the URLs to be sent to other machines or to use a separate cache just for this purpose.

We fed each trace into implementations of each of the caching algorithms described above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results.

5.1 Algorithm implementations

The implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain “mark” bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated since for each item in the cache, MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail. Let A be the trace or sequence of requests, that is, At is the item requested at time t. We create a second sequence Nt containing the time when At next appears in A. If there is no further request for At after time t, we set Nt = ∞. Formally, Nt = min{t′ > t : At′ = At} if such a t′ exists, and Nt = ∞ otherwise. To generate the sequence Nt, we read the trace A backwards, that is, from tmax down to 0, and use a hash table with key At and value t. For each item At, we probe the hash table. If it is not found, we set Nt = ∞ and store (At, t) in the table. If it is found, we retrieve (At, t′), set Nt = t′, and replace (At, t′) by (At, t) in the hash table. Given Nt, implementing MIN is easy: we read At and Nt in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary, evict the item with the highest value for its next request, using a heap to identify it quickly.
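Putting that description into code, the sketch below (ours, not the authors' implementation) builds the sequence Nt in one backward pass and realizes MIN with a max-heap of next-use times; heap entries made stale by re-tagging are simply skipped at eviction time.

```python
import heapq
import math

def next_use_times(trace):
    """Backward pass: n[t] is the next time trace[t] is requested again, or infinity if never."""
    n = [math.inf] * len(trace)
    last_seen = {}                                # item -> earliest later time it is requested
    for t in range(len(trace) - 1, -1, -1):
        item = trace[t]
        n[t] = last_seen.get(item, math.inf)
        last_seen[item] = t
    return n

def min_miss_ratio(trace, k):
    """Belady's MIN: on a miss with a full cache, evict the item requested farthest in the future."""
    n = next_use_times(trace)
    tag = {}                                      # item in cache -> its current next-use time
    heap = []                                     # max-heap of (-next_use, item); may hold stale entries
    misses = 0
    for t, item in enumerate(trace):
        if item in tag:                           # hit: re-tag the item with its new next-use time
            tag[item] = n[t]
            heapq.heappush(heap, (-n[t], item))
            continue
        misses += 1                               # miss
        if len(tag) >= k:
            while True:                           # pop until a non-stale entry is found
                neg_next, victim = heapq.heappop(heap)
                if tag.get(victim) == -neg_next:
                    del tag[victim]               # evict the item with the farthest next request
                    break
        tag[item] = n[t]
        heapq.heappush(heap, (-n[t], item))
    return misses / len(trace) if len(trace) else 0.0
```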

5.2 Results

We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the cache. We look at cache sizes from k = 2^0 to k = 2^25. In Figure 3 we present the same data relative to the miss-rate of MIN, the optimum off-line algorithm. The same simulations for the cross-trace are depicted in Figures 4 and 5.

For both traces, LRU and CLOCK perform almost identically and only slightly worse than the ideal MIN, except in the critical region discussed below. RANDOM is only slightly inferior to CLOCK and LRU, while STATIC is generally much worse. Therefore, we conclude that there is considerable locality of reference in the trace, as explained in Section 3.6. For very large caches, STATIC appears to do better than MIN. However, this is just an artifact of our accounting scheme: we only charge for misses and STATIC is not charged for the initial loading of the cache. If STATIC were instead charged k misses for the initial loading of its cache, then its miss rate would be of course worse than MIN’s.

6. CONCLUSIONS AND FUTURE DIRECTIONS

After running about 1,800 simulations over a trace containing 26.86 billion URLs, our main conclusion is that URL caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this size is a critical point, that is, a substantially smaller cache is ineffectual while a substantially larger cache brings little additional benefit. For practical purposes our investigation is complete: In view of our discussion in Section 5.2, we recommend a cache size of between 100 and 500 entries per crawling thread. All caching strategies perform roughly the same; we recommend using either CLOCK or RANDOM, implemented using a scatter table with circular chains. Thus, for 500 crawling threads, this cache will be about 2 MB – completely insignificant compared to other data structures needed in a crawler. If the intent is only to reduce cross machine traffic in a distributed crawler, then a slightly smaller cache could be used. In either case, the goal should be to have a miss rate lower than 20%.
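As a rough check of the 2 MB figure, under our assumption (not stated explicitly above) that each cache entry holds an 8-byte URL checksum:

```python
threads = 500              # crawling threads per process
entries_per_thread = 500   # upper end of the recommended 100-500 range
bytes_per_entry = 8        # assumption: one 8-byte checksum per cached URL

total_bytes = threads * entries_per_thread * bytes_per_entry
print(f"{total_bytes / 1_000_000:.1f} MB")   # 2.0 MB, in line with the estimate above
```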

However, there are some open questions, worthy of further research. The first open problem is to what extent the crawl order strategy (graph traversal method) affects the caching performance. Various strategies have been proposed [14], but there are indications [30] that after a short period from the beginning of the crawl the general strategy does not matter much. Hence, we believe that caching performance will be very similar for any alternative crawling strategy. We can try to implement other strategies ourselves, but ideally we would use independent crawls. Unfortunately, crawling on web scale is not a simple endeavor, and it is unlikely that we can obtain crawl logs from commercial search engines.

In view of the observed fact that the size of the cache needed to achieve top performance depends on the number of threads, the second question is whether having a per-thread cache makes sense. In general, but not always, a global cache performs better than a collection of separate caches, because common items need to be stored only once. However, this assertion needs to be verified in the URL caching context.

The third open question concerns the explanation we propose in Section 5 regarding the scope of the links encountered on a given host. If our model is correct then it has certain implications regarding the appropriate model for the web graph, a topic of considerable interest among a wide variety of scientists: mathematicians, physicists, and computer scientists. We hope that our paper will stimulate research to estimate the cache performance under various models. Models where caching performs well due to correlation of links on a given host are probably closer to reality. We are making our URL traces available for this research by donating them to the Internet Archive.

中文译文

在网络上爬行看似非常简单:基本的算法是 (a)取得一个网页 (b)解析它以提取所有链接的URLs (c)对于所有没有见过的URLs,重复执行(a)-(c)。但是,网络的规模(估计超过40亿个网页)及其变化的频率(估计每周有7%的网页发生变化)使这项工作从一个微不足道的编程练习变成一个严峻的算法和系统设计挑战。实际上,仅这两个因素就意味着,要对网络进行较为及时、完整的爬行,步骤(a)必须每秒钟执行大约1000次,因此成员检测(c)必须每秒钟执行超过10000次,而被检测的集合太大,无法全部存储在主内存中。这就需要分布式架构,而分布式架构又使成员检测变得更加复杂。

加速这一检测的一个关键方法是使用cache(高速缓存),即把已经见过的URLs的一个(动态)子集存放在主内存中。本论文的主要目标就是仔细研究几种用于网络爬虫的URL缓存技术。我们既考虑实用的算法:随机置换、静态cache、LRU和CLOCK,也考虑理论上的极限:透视(clairvoyant)cache和无限大的cache。我们使用这些算法在不同的cache大小下执行了大约1800次模拟,所用的是真实的日志数据,提取自一次长达33天、发出了超过10亿次HTTP请求的大规模网络爬行。

我们的主要结论是:cache非常有效——在我们的设置中,一个大约有50000个条目的cache可以达到接近80%的命中率。有趣的是,这个cache大小正好处于一个临界点:明显更小的cache效果会差很多,而明显更大的cache带来的额外好处很少。我们推测这样的临界点是这个问题本身固有的,并尝试对这一现象给出解释。

1.介绍

皮尤基金会(Pew Foundation)最近的一项研究[31]指出:“搜索引擎已经成为互联网用户不可或缺的工具”,并估计截至2002年中期,略微超过一半的美国人曾使用网络搜索来获取信息。因此,支撑网络搜索的技术具有巨大的实际价值。在本论文中,我们专注于搜索技术的一个方面,即搜集网页的过程,这些网页最终构成搜索引擎的语料库。

搜索引擎通过很多途径搜集网页,其中包括直接提交URL、付费收录,以及从非web来源中提取URL,但是语料库的主体是通过递归地探索网络获得的,这个过程称为crawling或者SPIDERing。基本的算法是:

(a) 取得一个网页

(b) 解析它以提取所有链接的URLs

(c) 对于所有没有见过的URLs,重复执行(a)-(c)

爬行一般从一组种子URLs开始,这些种子由上文所述的其他途径获得的URLs和/或以前爬行时收集的URLs组成。有时爬行也从单个连接良好的页面或一个目录(例如 https://www.doczj.com/doc/193557716.html,)开始,但在这种情况下,网络中相当大的一部分(估计超过20%)永远不会被访问到。关于导致这一现象的网络图结构,可参见[9]的讨论。

如果把网页看作图中的节点,把超链接看作这些节点之间的有向边,那么爬行就变成了数学上所说的图的遍历过程。各种图遍历策略的区别在于:在尚未访问的节点中,选择下一个访问哪个节点。两种标准的遍历策略是深度优先搜索(DFS)和广度优先搜索(BFS)——它们实现简单,在很多算法入门课程中都有讲授(例如参见[34])。

但是,在网络上爬行并不是一个微不足道的编程练习,而是一个非常严峻的算法和系统设计挑战,原因有以下两点:

1. 网络非常庞大。现在,Google[20]声称已经索引了超过30亿个网页。多项研究[3, 27, 28]都指出,从历史上看,网络每9-12个月就会增长一倍。

2. 网页变化很频繁。如果“变化”指的是“任何变化”,那么大约有40%的网页每周都会变化[12]。即使我们只考虑变化了三分之一或更多内容的页面,也有大约7%的网页每周会变化[17]。

这两个因素意味着,要获得较为及时、完整的网页快照,一个搜索引擎每天必须爬取至少1亿个网页。因此,步骤(a)必须大约每秒执行1000次,步骤(c)中的成员检测必须每秒执行超过10000次,而被检测的URL集合太大,无法全部存储在主内存中。另外,爬虫一般使用分布式架构来并行地爬取更多的网页,这使成员检测更为复杂:成员检测问题可能只能由对等节点回答,而不能在本地完成。

加速成员检测的一个关键方法是把已经见过的URLs的一个(动态)子集缓存在主内存中。本论文的主要目标是深入研究几种用于网络爬虫的URL缓存技术。我们考察了四种实用技术:随机置换、静态cache、LRU和CLOCK,并在一个发出了超过10亿次HTTP请求的网络爬行踪迹上,把它们与两种理论极限——透视cache和无限cache——进行了比较。我们发现,即使cache相对较小(例如50000个条目),简单的缓存技术也极为有效,并且我们展示了如何高效地实现这些cache。

本论文的组织如下:第2部分讨论文献中提出的各种爬行方案,以及缓存如何融入它们的模型。第3部分介绍缓存技术,并描述几种理论和实用的缓存算法。第4部分描述我们实现这些算法所用的实验环境。第5部分展示并讨论模拟的结果。第6部分给出我们对URL缓存的实用算法和数据结构的建议。第7部分是结论和进一步研究的方向。

2.CRAWLING

网络爬虫几乎和网络本身一样古老,文献中已经描述了许多爬行系统。在这一部分,我们按历史顺序对这些爬虫做一个简要综述,然后讨论为什么这些爬虫中的大多数可以从URL缓存中受益。

Internet Archive[10]使用的爬虫采用多个爬行进程,每个进程一次对64个主机进行彻底的爬行。爬行进程把非本地的URLs保存到磁盘;在一次爬行结束时,由一个批处理作业把这些URLs加入到下一次爬行中各主机的种子集合里。

最初的Google爬虫(见[7])把不同的爬虫组件实现为不同的进程:一个单独的URL服务器进程维护待下载的URL集合;爬行进程抓取网页;索引进程提取单词和链接;URL解析进程把相对URL转换为绝对URL,再送回URL服务器。这些进程之间通过文件系统进行通信。

本论文中的实验使用的是Mercator网络爬虫[22, 29]。Mercator由一组相互通信的独立爬虫进程组成。每个爬虫进程负责所有web服务器的一个子集;URL被分配给哪个爬虫进程,取决于URL主机名的哈希值。当一个爬虫进程发现一个不属于自己负责的URL时,它会通过TCP把这个URL发送给负责它的爬虫进程,并把多个URL打包成批发送以减少TCP开销。我们将在第4部分更详细地描述Mercator。

任何网络爬虫都必须维护一个待下载URL的集合。此外,由于重复下载同一个URL是不可接受的,它必须有办法避免把同一个URL多次加入集合。通常的做法是维护一个已发现URL的集合,它既包括待下载队列中的URL,也包括已经下载过的URL。如果这个集合太大而无法放入内存(考虑到有效的URL有数十亿个,这种情况很常见),就把它存放在磁盘上,这时在内存中缓存热门的URL就很划算:缓存使爬虫可以在不查询磁盘集合的情况下丢弃大部分URL。

3.CACHING

在大多数的计算机系统里,内存是分层次的,也就是说,存在两级或更多级的内存,它们在容量和速度之间做出不同的权衡。举个例子,在一个典型的工作站里,有一个非常小但是非常快的片上内存,一个较大但比较慢的RAM内存,还有一个非常大但是慢得多的磁盘存储。在网络环境中,这个层次结构还可以延伸到网络存储等等。Caching就是把较慢内存中经常用到的项目存放到较快内存中的思想,在合适的情况下,它可以大大提高整个系统的性能。在我们的爬虫场景中,由于已访问URL的数量太大而无法全部放入主内存,我们把已访问URL的集合存放在磁盘上,并在主内存中缓存其中的一小部分。

Caching的术语如下:cache是用来存放大小相同的原子项目的内存。一个cache的大小为k,表示它最多可以存放k个项目。在每个时间单位,cache会收到对某个项目的请求。如果被请求的项目在cache中,这种情况称为命中(hit),不需要进一步的动作;否则,这种情况称为未命中(miss)或缺失(fault)。如果cache中的项目少于k个,未命中的项目会被加入cache;否则,算法必须选择从cache中驱逐一个项目来为未命中的项目腾出空间,或者不加入这个未命中的项目。缓存策略或缓存算法决定驱逐哪个项目。缓存算法的目标是使未命中的次数最少。

显然,cache越大,越容易避免未命中。因此,一个缓存算法的性能用给定cache大小下的未命中率来衡量。

一般的,caching成功有2个原因:

请求的非均匀性。一些请求比其他请求频繁得多。

时间相关性或引用的局部性。当前的请求更有可能重复最近发生过的请求,而不是很久以前的请求。

3.1 无限cache(INFINITE)

这是一个理论上的算法,它假设cache的大小大于不同请求的总数。

3.2 透视cache(MIN)

三十五年多以前,László Belady[2]证明:如果整个请求序列可以提前知道(换句话说,算法是有预见性的),那么最好的策略就是剔除下一次被请求的时间离现在最远的项目。这个理论算法被记为MIN,因为它在任何请求序列上都能达到最少的未命中次数,从而为性能提供了一个紧的界。

3.3 最近最少使用(LRU)

LRU算法剔除cache中最长时间没有被请求的项目。其直觉是:一个很长时间没有被用到的项目,在将来很长一段时间内也可能不会被用到,因此按照Belady算法的思路,未命中的次数会趋于最少。

尽管有“过去的表现并不能保证未来的结果”这样的告诫(当前股票市场的状况很遗憾地印证了这一点),在实践中LRU通常是非常有效的。但是,它需要维护一个关于请求的优先队列。维护这个队列会带来处理时间和内存上的开销;在缓存项较大的场合,后者通常可以忽略。

3.4 CLOCK

CLOCK是LRU的一种流行的近似,发明于20世纪60年代末[15]。一个标记位数组M0, M1, …, Mk对应于大小为k的cache中的各个项目。这个数组被看作一个环,即第一个位置跟在最后一个位置之后。一个时钟指针指向cache中的某个项目。当一个请求X到达时,如果项目X在cache中,则把它的标记位置为开;否则,指针沿数组依次前移并把经过的标记位清零,直到找到一个未标记的位置,该位置对应的缓存项被剔除并替换为X。

3.5 随机置换(RANDOM)

随机置换完全忽略历史。如果被请求的项目不在cache中,就从cache中随机剔除一个项目,并用被请求的项目替换它。

在大多数实际的情况下,随机替换比CLOCK要差,但并不是差很多。

3.6 静态caching(STATIC)

如果我们假设每个项目有一个固定的被请求概率,并且与之前的请求历史无关,那么在任何时刻,当cache中包含被请求概率最高的k个项目时,大小为k的cache的命中概率最大。

这个方法有两个问题:第一,一般来说这些概率无法提前知道;第二,请求相互独立这一假设虽然在数学上很有吸引力,但与大多数实际场景中存在的引用局部性是相悖的。

在我们的情形中,第一个问题可以绕过:我们可以假设上一次爬行中发现的最热门的k个URLs,与本次爬行中最热门的k个URLs大致相同。(也有一些高效的技术可以在数据流中发现最热门的项目[18, 1, 11],因此在线的方法也可能行得通。)当然,出于模拟的目的,我们可以先对输入做一遍预处理,确定最热门的k个URLs,然后把这些URLs预先装入cache,这正是我们在实验中所做的。

第二个问题正是我们决定测试STATIC的重要原因:如果STATIC表现得很好,结论就是数据中几乎不存在引用局部性;如果STATIC表现相对较差,那么我们可以推断我们的数据表现出明显的引用局部性,也就是说,相继的请求是高度相关的。

4 实验设置

4.1 Mercator爬虫体系结构

一个Mercator爬行系统由若干个爬行进程组成,它们通常运行在不同的机器上。每个爬行进程负责所有web服务器的一个子集,并由若干个工作线程(通常是500个)组成,这些线程负责从这些服务器下载和处理页面。

例如,一个系统里可以有4个爬行进程,每个进程中的工作线程按下述方式运行。

每个工作线程重复地执行以下操作:它从URL Frontier(一个维护待下载URL集合的磁盘数据结构)中取出一个URL;用HTTP把对应的网页下载到一个缓冲区中(称为RewindInputStream,简称RIS);如果该网页是HTML页面,则从中提取所有的链接。提取出来的链接流被转换为绝对URLs,然后经过URL过滤器,后者根据句法特性丢弃一些URLs。例如,它会丢弃那些属于曾经联系我们并要求不要被爬取的web服务器的所有URLs。

URL流接着进入Host Splitter,它根据URL主机名的哈希值把URLs分配给各个爬行进程。由于大多数链接是相对链接,大部分URLs(在我们的实验中是81.5%)会被分配给本地爬行进程;其余的则分批通过TCP发送给相应的对等爬行进程。

本地URLs流和从对等爬虫收到的URLs流都会进入重复URL消除器(DUE)。DUE丢弃那些以前已经发现过的URLs,新的URLs则被送往URL Frontier以供将来下载。

为了消除重复的URLs,DUE必须维护迄今为止发现的所有URLs的集合。考虑到今天的网络包含数十亿个有效URLs,维护这样一个集合所需的内存非常可观。Mercator可以被配置为用分布式的内存哈希表来维护这个集合(每个爬行进程维护分配给它的那部分URLs);但是这种DUE实现(它把URL压缩成8字节的校验和,并用校验和的前3个字节作为哈希表的索引)每个URL大约需要5.2字节,也就是说,每台爬行机器要维护10亿个URLs的集合,就需要超过5GB的RAM。这样的内存需求在很多场合下都过于苛刻,事实上,它也超出了我们本次实验可用的硬件条件。因此,我们使用了另一种DUE实现:它把新到达的URLs先缓冲在内存中,而把大部分URLs(准确地说,是它们的8字节校验和)按排序顺序保存在磁盘上。每当内存缓冲区装满时,就把它合并到磁盘文件中(由于磁盘延迟,这是一个非常昂贵的操作),并把新发现的URLs传给Frontier。

基于磁盘的DUE和Host Splitter都可以从URL缓存中受益。给基于磁盘的DUE加上一个cache之后,可以直接丢弃那些在cache中命中(因而是重复)的URLs,而不必把它们加入内存缓冲区。这样,内存缓冲区填满的速度变慢,与磁盘文件合并的频率也随之降低,从而减小了磁盘延迟带来的代价。给Host Splitter加上一个cache之后,可以直接丢弃传入的重复URLs,而不是把它们发送给对等节点,从而减少网络流量。当各爬行机器不像我们实验中那样通过高速局域网相连,而是分布在全球各地时,这种流量的减少尤为重要。在那样的部署中,每个爬虫将负责“离它较近”的web服务器。

Mercator在web图上执行近似的广度优先遍历。每个进程中的(通常500个)线程并行工作,这给遍历带来了一定的不确定性。更重要的是,下载的调度还受到Mercator的politeness policy(礼貌策略)的制约,它限制爬虫施加在任何单个web服务器上的负载。Mercator的礼貌策略保证任何服务器都不会同时收到来自Mercator的多个请求;此外,它还保证对同一服务器的下一个请求,只有在经过上一个请求响应时间的若干倍(通常是10倍)之后才会发出。对任何大规模网络爬虫来说,这样的礼貌策略都是必不可少的;否则爬虫的运营者将被投诉所淹没。

4.2 我们的网络爬行

我们的爬行硬件由4台Compaq XP1000工作站组成,每台配备一个667MHz的Alpha处理器、1.5GB的RAM、144GB的磁盘和一条100Mbit/sec的以太网连接。这些机器位于Palo Alto Internet Exchange,十分接近Internet的骨干网。

这次爬行从2002年7月12日持续到9月3日,但实际处于活跃爬行状态的只有33天:停机是由各种硬件和网络故障造成的。爬行期间,这4台机器共进行了10.4亿次下载尝试,其中7.84亿次成功下载,成功下载的文档中有4.29亿个是HTML页面。这些页面总共包含大约268.3亿个链接,相当于平均每页62.55个链接;但是每页链接数的中位数只有23,说明平均值被一些链接数非常多的页面抬高了。早期的研究报告的平均值只有每页8个链接[9]或17个链接[33]。对于为什么我们发现每页的链接更多,我们给出三点解释。首先,我们把Mercator配置为不仅从anchor标签中提取URLs,而是从所有可能包含URL的标签(例如image标签)中提取,这种配置同时提高了每页链接数的均值和中位数。第二,我们把它配置为下载最大16MB的页面(这一设置明显高于通常值),使它有可能遇到含有上万个链接的页面。第三,大多数研究报告的是每页不重复的链接数,而上面的数字包含了同一页面上重复出现的链接。如果只考虑每页不重复的链接,那么平均每页有42.74个链接,中位数是17。

从这些HTML页面中提取出来的链接,加上爬行过程中遇到的大约3800万次HTTP重定向,一起流入Host Splitter。为了测试各种缓存算法的效果,我们对Mercator的Host Splitter组件做了插桩,把所有传入的URLs记录到磁盘日志中。4台爬行机器上的Host Splitter总共接收并记录了268.6亿个URLs。

爬行结束后,我们对Host Splitter日志进行了压缩。我们把每个URL哈希成一个64位的指纹[32, 8]。指纹是一种概率性技术,两个不同的URL有很小的概率得到相同的指纹。为了确保不存在这种意外的碰撞,我们对原始URL日志排序并统计了不重复URL的数目,然后把它与不重复指纹的数目进行比较,后者是在一台大内存机器上用内存哈希表统计得到的。这一数据压缩步骤最终给我们留下了4份压缩后的Host Splitter日志(每台爬行机器一份),大小在51GB到57GB之间,各包含64亿到71亿个URLs。

为了考察缓存对分布式爬虫中进程间通信的作用,我们还从Host Splitter日志中提取了一个子踪迹,只包含那些被发送给对等爬虫的URLs。这个子踪迹包含49.2亿个URLs,约占全部URLs的19.5%。我们用同样的方式压缩了这些子踪迹日志,然后把压缩后的日志用于我们的模拟。
