[Python] Notes on falling back to scroll after from + size refused to fetch every document (it is capped at 10,000) - Solr, Python, MacBook Air in Shinagawa Seaside

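Apparently the sum of from + size is capped at 10,000.
Apparently you can raise the cap by changing a setting, but that doesn't fix the underlying problem, so I'll use scroll as the error message suggests.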

elasticsearch.exceptions.TransportError: TransportError(500, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
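
For reference, the setting named in the message can be changed from Python as well. The following is only a minimal sketch: `raise_result_window` is a hypothetical helper name, it assumes an older elasticsearch-py client that still accepts the `body=` argument (as the sample further down does), and the value 30000 is just the window size from the error above. A larger window costs more memory on the cluster, which is why the message points to scroll instead.

```python
def raise_result_window(es, index, limit):
    """Raise index.max_result_window on one index and return the response.

    `es` is assumed to be an elasticsearch.Elasticsearch client.
    """
    settings = {"index": {"max_result_window": limit}}
    return es.indices.put_settings(index=index, body=settings)


# Usage (needs a reachable cluster, so it is left commented out):
# from elasticsearch import Elasticsearch
# es = Elasticsearch(hosts=[{'host': 'hoge_host', 'port': 9200}])
# raise_result_window(es, "hoge_index", 30000)
```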

I tried scroll, googling as I went.

# Sample: fetch more than 10,000 documents from Elasticsearch
from elasticsearch import Elasticsearch

counter = 1
size = 10000

# Per-document processing: print a running counter and the "text" field
def get_doc(hits):
    global counter
    for hit in hits:
        text = hit['_source']['text']
        print(counter, text)
        counter += 1

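# Elasticsearch
es = Elasticsearch(hosts=[{'host': 'hoge_host', 'port': 9200}])
# Note: this indexes a sample document and is not needed for the search below;
# its return value is overwritten by es.search().
response = es.index(index="hoge_index", doc_type="hoge_type", body={"key": "value"})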

# Search query: one day of documents, sorted by timestamp
response = es.search(
    scroll='2m', index="hoge_index", size=size,
    body={
        "query": {"range": {"timestamp_jpn": {"from": "2018-10-01 00:00:00", "to": "2018-10-02 00:00:00"}}},
        "sort": [{"timestamp_jpn": {"order": "asc"}}],
    },
)

sid = response['_scroll_id']
print('sid', sid)
print( 'total', response['hits']['total'] )

scroll_size = len( response['hits']['hits'] )
print('scroll_size', scroll_size)

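while True:
    # Stop when the scroll returns no hits
    if scroll_size <= 0:
        break

    # Process the search results
    get_doc(response['hits']['hits'])

    # Fetch the next batch of results from the scroll
    response = es.scroll(scroll_id=sid, scroll='10m')
    # The scroll id can change between calls, so always use the latest one
    sid = response['_scroll_id']
    scroll_size = len(response['hits']['hits'])
    print('scroll_size', scroll_size)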
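
The same loop can also be wrapped in a generator so the caller just iterates over hits. This is only a sketch under a few assumptions: `scroll_hits` is a hypothetical name, `es` is an elasticsearch-py client of the same older generation as above (it still takes `body=`), and the default `size` and `scroll` values are arbitrary. It also calls `clear_scroll` at the end so the search context is released without waiting for the timeout. The client library ships `elasticsearch.helpers.scan` for the same purpose, which is probably the better choice in real code.

```python
def scroll_hits(es, index, body, size=1000, scroll="2m"):
    """Yield every hit of a search, one scroll page at a time."""
    response = es.search(index=index, body=body, size=size, scroll=scroll)
    sid = response["_scroll_id"]
    try:
        # An empty page means the scroll is exhausted.
        while response["hits"]["hits"]:
            for hit in response["hits"]["hits"]:
                yield hit
            response = es.scroll(scroll_id=sid, scroll=scroll)
            sid = response["_scroll_id"]  # the id may change between pages
    finally:
        # Free the server-side search context as soon as we are done.
        es.clear_scroll(scroll_id=sid)


# Usage (needs a reachable cluster, so it is left commented out):
# for hit in scroll_hits(es, "hoge_index", {"query": {"match_all": {}}}):
#     print(hit["_source"]["text"])
```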

Reference: https://gist.github.com/hmldd/44d12d3a61a8d8077a3091c4ff7b9307

If all you need is to pull the data out, elasticdump would probably be faster...