Apparently from + size is capped at 10,000 in total.
The limit can be raised by changing an index setting, but that is not a fundamental fix, so I'll follow the error message and try the scroll API instead.
elasticsearch.exceptions.TransportError: TransportError(500, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
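For reference, raising the cap mentioned in the error is just an index-settings update. A minimal sketch using elasticsearch-py's body-style API (the `raise_result_window` helper name and the 30000 value are made up for illustration; `hoge_host`/`hoge_index` are this post's placeholders), written as a function so it can be exercised without a live cluster:

```python
# A workaround, not a fix: index.max_result_window is a dynamic,
# index-level setting, so it can be raised in place. Deep from+size
# searches cost memory proportional to the window, hence the default cap.
def raise_result_window(es, index, new_max):
    body = {"index": {"max_result_window": new_max}}
    return es.indices.put_settings(index=index, body=body)

# Against a real cluster it would be called like this:
# from elasticsearch import Elasticsearch
# es = Elasticsearch(hosts=[{'host': 'hoge_host', 'port': 9200}])
# raise_result_window(es, "hoge_index", 30000)
```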
Googling as I went, here is my attempt with scroll:
# Sample: fetch more than 10,000 documents from Elasticsearch via the scroll API
from elasticsearch import Elasticsearch

counter = 1
size = 10000

# Per-document processing
def get_doc(hits):
    global counter
    for hit in hits:
        text = hit['_source']['text']
        print(counter, text)
        counter += 1

# Elasticsearch client
es = Elasticsearch(hosts=[{'host': 'hoge_host', 'port': 9200}])

# Initial search: open a scroll context that stays alive for 2 minutes
response = es.search(
    scroll='2m', index="hoge_index", size=size,
    body={
        "query": {
            "range": {
                "timestamp_jpn": {
                    "from": "2018-10-01 00:00:00",
                    "to": "2018-10-02 00:00:00"
                }
            }
        },
        "sort": [{"timestamp_jpn": {"order": "asc"}}]
    })

sid = response['_scroll_id']
print('sid', sid)
print('total', response['hits']['total'])
scroll_size = len(response['hits']['hits'])
print('scroll_size', scroll_size)

while True:
    # An empty page means the scroll is exhausted
    if scroll_size <= 0:
        break
    # Process the current page of hits
    get_doc(response['hits']['hits'])
    # Fetch the next page; the scroll id can change, so re-read it
    response = es.scroll(scroll_id=sid, scroll='10m')
    sid = response['_scroll_id']
    scroll_size = len(response['hits']['hits'])
    print('scroll_size', scroll_size)
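The while loop above can also be factored into a generator so callers simply iterate over hits. A sketch that works against any client exposing `search()`/`scroll()` with the response shape used in this post (the `scroll_docs` name is mine, not part of elasticsearch-py):

```python
def scroll_docs(es, index, body, size=10000, scroll='2m'):
    """Yield every matching hit, paging through a scroll context."""
    response = es.search(scroll=scroll, index=index, size=size, body=body)
    while True:
        hits = response['hits']['hits']
        if not hits:  # an empty page means the scroll is exhausted
            break
        for hit in hits:
            yield hit
        # the scroll id can change between pages, so re-read it each time
        response = es.scroll(scroll_id=response['_scroll_id'], scroll=scroll)
```

Note that elasticsearch-py itself ships `helpers.scan`, which wraps the same search/scroll cycle in a generator, so in practice that is the shorter option.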
Reference: https://gist.github.com/hmldd/44d12d3a61a8d8077a3091c4ff7b9307
If all you need is to dump the data out, elasticdump looks like it would be faster...